Web Reflection: RegExp

My JavaScript book is out! Don't miss the opportunity to upgrade your beginner or average dev skills.

Showing posts with label RegExp. Show all posts

Thursday, February 16, 2012

Berlin JS - RegExp Slides

Here they are!
(published live)

If you want to test examples remember to replace weird keynote double-quotes with normal one :)

Enjoy JavaScript Regular Expressions

Thursday, November 12, 2009

Most of the tools we use on daily basis to develop applications with any kind of programming language need to understand our code in order to help us while we are writing, reading, or creating code.
We open an editor, we read highlighted code, we like suggestions when we write a dot and we use compilers, minifiers, compressors, obfuscators, converters, documentation generators ... but have we ever though about the program behind our same program? Have we never thought that highlights, suggestions, minifications, and syntax analyzer do not come for free and these are primordial programs we've ever used often without even realizing we are using it?
This post is about general suggestions, techniques, and practices, used to map code. In this specific case, we will analyze step by step the logic behind a code mapper using the most used programming language in the world: JavaScript.

Map VS Tokens

First of all we need to understand what we need. A Map could be considered a generic list of coordinates able to tell us "what is where" while a tokenizer is usually an extremely detailed map with extra info, where everything is deeply analyzed and hopefully never casual, as a map could be.
Every programming language has an interpreter, and usually an interpreter is able to analyze the syntax and create tokens to use runtime or to compile the language into another one (byte/machine code). As metaphoric example, the world is an open source application and we, nature and animals included, are tokens, then everything is part of a global application.
So far, services such Google Map, are applications able to map the world. Google Map does not (yet) care about our role in the system, it simply tells us what/who is where. Silly metaphors a part, being global tokens analysis much harder to perform, this post will talk about the simple way: the generic Map, but if interested, and again via JavaScript, we could have a look into Narcissus, a good old Mozilla Project able to analyze JavaScript via JavaScript itself.

Time To Start Thinking About The Problem

OK, everything seems so easy for human eyes ... we take a string, we recognize what is where, and that's it! Easy! Specially with an highlighter able to make code lecture a pleasure, isn't it?
Well, we'll see that things are not that simple as we think, and that is why historical projects such Scintilla are still under development.

What We Are Interested About

As I have said, JavaScript will be the reference language but problems and techniques are similar for every other.
The first requirement to understand is what as what are we looking for. In JavaScript case, we can split the language in these main cases:

strings, we cannot touch them!
literal regular expressions, untouchable as well
comments, we may consider to get rid of them
E4X, damn cool XML as is, again untouchable
everything else, considered code

For above categories, we could slit them in sub categories:

single quoted strings
double quoted strings
single line comments
multiline comments

So far, we have just decided what will be our map about, and nothing else. Now it's time to think how to implement this map.

The Basic Problem: Who Comes First

Somebody could think:dude, what's the fuss man, just search via RegExp and that's it... and no it's not.
This is a classic case where everything could go wrong, included your favorite JavaScript editor, but still perfectly valid syntax:


// ... code

var theWeirdCase = /"['//*"]'/;

// other code ...

Uhm, check that out, a regular expression with a string "['//*" inside plus another intersected string '//*"]' plus a single line comment // plus the beginning of a multiline one /* ... well done, surrounded by code, that case, reproducible with every combination with a string and a regexp inside, a comment with a string inside, etc etc, could fuck up all our parsing plans, isn't it?
So rule number one: understand what part of the map is more relevant, where in this case, relevance is defined by who comes first.
If we have whatever inside a string, whatever IS inside a string.
If we have whatever inside a literal RegExp, whatever IS inside a regexp.
Same is for E4X or comments, single or multi line, whatever will be part of that comment. Does anybody agree?

Char By Char VS Regular Expressions

OK, here we are, let's define the best strategy to analyze code, right? Char By Char could be the easy way to analyze code: it's simple to implement, we feel like we have control over every single char, we can spot who comes first without problems or errors, isn't it?
Regular Expressions are simply abstracts and clever char by char parsers. We delegate tedious code to the RegExp engine which aim is to understand the expression, and gives us the match, if any. Regular Expressions are somehow similar, conceptual speaking, to SQL: we type Maya and Egyptians characters and magically we obtain the result, without caring at all about lower level layers.
I can still spot the dude ... guy voice saying:
mate, I am not a noob, if RegExp are char by char parsers, my char by char parser will be faster for sure.
This time the dude guy is not completely wrong but as usual, it depends.
If we are programming with C, C++, or heyholetsGo!, without considering better suited programing languages for this purpose such Caml or OCaml, Regular Expressions will require a library with such overload for generic purpose that maybe we can do better and faster for the specific case.
On the other hand, all we know about programming is that less is better and if we can trust extremely tested and famous regular expression engines (PCRE) why on earth we should waste time writing whatever specific case parser which aim will be the same of a RegExp?
And why on earth we think our check/test/verify implementation will be more stable than a short, quick, and dirty RegExp?

JavaScript Is Not As Fast As C Is

Thanks to competitors and new strategies to manage JS dynamic code, performances are often faster even than PHP, Ruby, Python (not Iron or C-) but JS cannot compete with lower level languages compiled directly into machine code and without dynamic nature.
Rather than explain in 3 hours why char by char is not always worth it with JavaScript, here I am with a test anybody can try and with every browser.


// commented to avoid problems with highlight used
// n this blog, an old char by char parser ;-)

// PLEASE REMOVE NEXT COOMMENT CHARS "//" TO TEST

// var theWeirdCase = /"['//*"]'/;
var aLongString = new Array(150001).join(".") + theWeirdCase;
var i = 0;
var skip = false;
var time = [new Date];
var position = [(function(){
    var length = aLongString.length;
    var mlc = aLongString.indexOf("/*");
    var slc = aLongString.indexOf("//");
    var sqt = aLongString.indexOf("'");
    var dqt = aLongString.indexOf('"');
    var rex = aLongString.indexOf("/");
    if(mlc < 0)mlc = length;
    if(slc < 0)slc = length;
    if(sqt < 0)sqt = length;
    if(dqt < 0)dqt = length;
    if(rex < 0)rex = length;
    var position = Math.min(mlc, slc, sqt, dqt, rex);
    if(position === length)position = -1;
    return position;
})()];
time[i] = new Date - time[i];
time[++i] = new Date;
position[i] = (function(){
    for(var
        position = -1, c = 0, length = aLongString.length;
        c < length; ++c
    ){
        switch(aLongString.charAt(c)){
            case "'":position=c;c=length;break;
            case '"':position=c;c=length;break;
            case "/":
                if(c < length - 1){
                    switch(aLongString.charAt(c + 1)){
                        case "*":position=c;c=length;break;
                        case "/":position=c;c=length;break;
                        default:position=c;c=length;break;
                    }
                }
                break;
        };
    };
    return position;
})();
time[i] = new Date - time[i];
try{aLongString[0]}catch(e){skip=true};
if(!skip && aLongString[0] === "."){
    time[++i] = new Date;
    position[i] = (function(){
        for(var
            position = -1, c = 0, length = aLongString.length;
            c < length; ++c
        ){
            switch(aLongString[c]){
                case "'":position=c;c=length;break;
                case '"':position=c;c=length;break;
                case "/":
                    if(c < length - 1){
                        switch(aLongString.charAt(c + 1)){
                            case "*":position=c;c=length;break;
                            case "/":position=c;c=length;break;
                            default:position=c;c=length;break;
                        }
                    }
                    break;
            };
        };
        return position;
    })();
    time[i] = new Date - time[i];
};
time[++i] = new Date;
position[i] = aLongString.search(/\/*|\/\/|'|"|\//);
time[i] = new Date - time[i];

alert([
    time.join("\n"),
    position.join("\n")
].join("\n"));

Above benchmark try to replicate common techniques to find a single piece of the full map, in this case over about 150Kb of fake JavaScript code.

How To Read The Benchmark

dude ... the benh confirm that ... shut up! The bench shows that Regular Expressions are the fastest way to parse code via JavaScript. It does not matter if the score could not be the fastest in above case, what matters is that with a single Regular Expression we can grab 0, 1, or every match in the code.
If we think that in 150Kb of code to highlight showed time will increase for each mapped part in the same code, while the regular expression could be just one, we can easily see that:

the RegExp is able to find and validate at the same time what we are looking for, above benchmark misses all manual validations we need to do to understand if the case, first found char a part, is exactly the one we where looking for
with a single regular expression we save a large number of characters and the code is easier to maintain
for each manual parsed char we need to perform a manual check for the exact case, the total amount of manual checks increase with number of chars and code complexity
with a single regexp we don't care that much about code size because the engine, hopefully written in C, should be fast enough to be able to parse a large amount of data

We Still Need Good Regular Expressions

This is the most critical part, if we use bad regular expressions we could have a lot of false positive and we could mess up the map. The reason Regular Expressions are not always well considered, is that these could be hard to read, write, or understand, and people could put "chars around" without improving anyhow the regexp, adding more false positives instead. I do not pretend to give you best regular expressions ever for all cases, but it's since 2000 I am using RegExps and hopefully I know a chicky bit about them.


JSMap.parser = [
    // WebReflection Suggestion
    {
        test:/\/\/[^\1]*?(\r\n|\r|\n)/g,
        type:JSMap.COMMENT_SL,
        place:function(value, a){
            return value.charAt(2) === "@" ? value : a;
        }
    },{
        test:/\/\*[^\1]*?(\*\/)/g,
        type:JSMap.COMMENT_ML,
        place:function(value){
            return value.charAt(value.length - 3) === "@" ? value : JSMap.parser[0].place(value, "");
        }
    },{
        test:/(["'])(?:(?=(\\?))\2.)*?\1/g,
        type:JSMap.STRING
    },{
        test:/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
        type:JSMap.REGEXP
    },{
        // Note: experimental, not fully tested/supported
        test:/<>[^\1]*?(<\/>)|<(\w+|\{\w+\})(?:\s*\/|[^\>]*?>.*?<\/\2\s*)>/g,
        type:JSMap.E4X
    }
];

Above collection contains all kind of things we would like to Map for JavaScript.
Some object contains a place method which aim is to avoid, if necessary, the match replacement. In this case I have considered Internet Explorer conditional comments and nothing else, but there are other cases where a comment should not be removed (e.g. /*! my license */ )
Each object contains a map type, 'cause we would like to know what we have found there, don't we?


JSMap.CODE      = 1;
JSMap.COMMENT_SL= 2;
JSMap.COMMENT_ML= 4;
JSMap.STRING    = 8;
JSMap.REGEXP    = 16;
JSMap.E4X       = 32;
JSMap.ALL       = 63;
JSMap.DEFAULT   = ~JSMap.E4X;

Since we are creating a customizable map, we would like to choose what we want to find or not. We can decide using some bit operation.
As we can see, the default one exclude the E4X case, it's not common, since it has not been implemented yet in every browser, plus for this post and example is not perfect.
To exclude something all we need to do is to use ~ char, while if we want to decide just few thing we can always use the or | :


JSMap.COMMENT_SL | JSMap.COMMENT_ML

Above example will look for single and multi line comments. Maybe we want just understand if there is some conditional comment?
In cany case, bear in mind that who comes first is still the main problem, so if we restrict the searh we could find false positives (e.g. comments inside strings or regexps)

How To Logically Proceed With The Mapper

OK, we have reduced code size already 1/3rd thanks to god regular expressions. Now what's next?
The problem is still the same: Who Comes First!
For each performed Regular Expression we should store results somewhere. In this case we will have 4 searches performed via RegExp for the entire code, rather than a check for each possible matched character, but false positives could always be there.


function JSMap(CODE, type){
    if(!type)
        type = JSMap.DEFAULT
    ;
    for(var
        Map = [],
        length = JSMap.parser.length, i = 0,
        a, b, exec, parser, value;
        i < length; ++i
    ){
        parser = JSMap.parser[i];
        if(parser.type & type){
            while(exec = parser.test.exec(CODE)){
                value = exec[0];
                Map.push({start:exec.index, end:exec.index + value.length, value:value, type:parser.type});
            };
        };
    };
    // ... the rest of the code

OK, looping over the list of RegExp we have collected every match.
Each match has been stored as an object where properties are:

where the match start, we can reuse start and end info regardless for other purposes
where the match end
the match itself, alone could be reused or simply replaced in the current CODE
the match type, what we have found

Start and end are valid for every match and compatible with substring.
If for some reason we will change a single value and its length will change, we can easily synchronize the Map adding or removing the difference between old length and the new one for every other mapped objects after the current one.
these properties could be superfluous but we want control for any kind of occasion.

An Ordered Map Without False Positives Collisions

The simplest way to reorder the map is a native Array.sort operation, perfect to understand who comes first!


Map.sort(function(a,b){
    return b.start < a.start ? 1 : -1
});

The start point is the sort key and thanks to Regular Expressions we will rarely have same start point. If we have one, we need to rethink the RegExp because it is not good enough since, for example, a comment cannot be a regexp and viceversa.
The native sort operation is hopefully fast enough to guarantee still better performances. All we need to do now is to add, if necessary, the code.

Cleaning Adding Surrounding Code

Once we have mapped all we were looking for, we can consider CODE everything before, in between, or after our matches.


// type could NOT include CODE
// and if code starts with a comment, there
// is nothing to do ...
if(type && 0 < (i = Map[0].start))
    Map.unshift({start:0, end:i, value:CODE.substring(0, i), type:JSMap.CODE})
;
// every other Mapped case will be ordered by start
// Accordingly, the first start we will encounter
// will be the valid one ... the first match
// will be considered valid by dafault (var a)
for(length = Map.length, i = 1, a = Map[0]; i < length; ++i){
    b = Map[i];
    switch(true){
        // if there is a gap between the
        // precedent a match and b one
        // there MUST be code between
        case a.end < b.start:
            // let's add it if we need
            // continue otherwise
            if(type){
                Map.splice(i, 0, {start:a.end, end:b.start, value:CODE.substring(a.end, b.start), type:JSMap.CODE});
                ++length;
                ++i;
            };
        // if there is no gap or code
        // has been inserted
        // we can go on with the for loop
        // considering the current b
        // valid, assigned for this reason to a
        case a.end === b.start:
            a = b;
            break;
        // if there is no gap
        // which means next match starts
        // before a.end and no after
        // we have a false positive
        default:
            // remove this match in any case
            Map.splice(i, 1);
            --length;
            --i;
            break;
    };
};
// if the last match ends before the string.length
// there must be another piece of code to add
if(type && (i = Map[i - 1].end) < CODE.length)
    Map.push({start:i, end:CODE.length, value:CODE.substring(i, CODE.length), type:JSMap.CODE})
;

Finally, if there is no match but JSMap.CODE is part of the search, we can consider the entire string a piece of code:


Map.push({start:0, end:CODE.length, value:CODE, type:JSMap.CODE})

How To Use A Map

Almost the end of this tedious post. Once we have created a map of our code, we can reuse matches in any way we prefer.
This is a "paste and go" example that will let us test whatever code we want:


/**
 * JSMap test case
 */
onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    var time    = new Date,
        Map     = JSMap(this.value/*, JSMap.CODE|JSMap.COMMENT_ML|JSMap.COMMENT_SL|JSMap.REGEXP*/)
    ;
    time = new Date - time;
    if(!Map.map)Map.map=function(fn){for(var i = 0, length = this.length; i < length; ++i)this[i]=fn(this[i]);return this};
    document.body.innerHTML = 'Mapped in ' + (time/1000) + ' milliseconds.<pre>' + Map.map(function(o){
        var value = o.value.replace(/</g, "&lt;").replace(/>/g, "&gt;");
        switch(o.type){
            case JSMap.STRING:
                value = '<span style="color:#66F">' + value + '</span>';
                break;
            case JSMap.COMMENT_SL:
                value = '<span style="color:#999">' + value + '</span>';
                break;
            case JSMap.COMMENT_ML:
                value = '<span style="color:#F66">' + value + '</span>';
                break;
            case JSMap.REGEXP:
                value = '<span style="color:#F00">' + value + '</span>';
                break;
            case JSMap.E4X:
                value = '<span style="color:#00F">' + value + '</span>';
                break;
            default:
                // eventually we can parse
                // the code with numbers
                // spaces, keywords, methods
                // and whatever we need
                // that would be another Map
                // specific for the language
                // or simply a replacer via RegExp
                break;
        };
        return value;
    }).join("") + '</pre>';
  };
};

Save in an html page, paste some valid code into the textarea, disabling spell check if the source is massive, and click somewhere outside the area.

Conclusion

Even if the case is JavaScript and a JavaScript map, with this post we can hopefully better understand problems behind generic code parsing.
What is missing in this post is a code parser able to highlight or understand every piece of code present in the map.
If we loop over the returned map it is possible to understand which one is code and what is next. If a variable is going to be assigned to a regexp or string, as example, the value will be found in the next object present in the map.
I did not consider numbers 'cause these are simple to parse in a code portion, while these could slow down every other operation since these could be easily spotted inside strings, comments, regexps, or E4X syntax.
Considerations, techniques, and benchmarks, are relative for this case but generally valid for any kind of purpose. CSS selectors, HTML, and other cases, need to consider when Regular Expressions are worthy (e.g. not that fast char by char parser due to used programming language) and when a simple indexOf could be the best solution ever. I hope you enjoyed this post, I surely did writing it since the argument is not that frequent and techniques extremely different (and trust me, I have showed maximum a 30% of what we could fnd out there).
Oooops, I almost forgot the JSMap source code!

Wednesday, November 11, 2009

Literal Regular Expression Safe Regular Expression

... sorry for the redundant title but that's exactly what is this post about ... after yesterday explanation about problem, logic, and solution, to grab valid strings inside JS code, here I am with the literal RegExp able to grab literals RegExps in a generic JavaScript code.

Why Do Not Add Just A "/" Into Other Strings RegExp

One comment gave me the hint to write this second post about RegExps. While time is a bit over during days, this answer is simple, but not obvious!
Differences between strings and literal regular expressions are basically these:

there must be at least one char, or the parser will consider the literal RegExp an inline comment //
the slash does NOT necessary need to be escaped. If we have a slash inside a range [a/b] the latter one won't break the RegExp and the slash will be considered just one valid char in that range
there could be one or more chars after, where i(ignore case), g(match all), and m(multi line) can be present one or more times

Latter point is not truly a problem since this syntax will break the code in any case:


function igm();
var a = "string"igm();

But still, we need to understand first couple of points.

The RegExp Safe Regular Expression


// WebReflection Solution
/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g

Since yesterday after 10 seconds somebody pointed me another solution, I bet this will happen again but so far I have tested above little monster enough to say that should work without problems but obviously only if the code is valid, otherwise we don't even need to waste our time trying to parse it.
As example, yesterday somebody told me:look, it does not work with this


a = \"string"

Well, now consider that an escaped char could be everywhere in the code but again, these regular expressions are not code sanitizer, n any case improbable since:


// tell me what do you expect and WHY!
a = "string\"
b = \"other"
\"
" c = what?!"

So any kind of weird combination wont work but if the regular expression is valid, escaped or not escaped, the precedent solution should work like a charm.

Explanation

I won't go step by step for the entire RegExp this time, things are the same described in my precedent post so please read there if you want to know more. The emulated look-behind pattern has been included in this regexp to skip groupd of possible ranges present in the regexp. When a range is encountered, starting with char "[", it is skipped till the end. If there is no end theoretically the literal RegExp is broken and the code won't execute. Same strategy is used for the other case, where no [ is encountered, if there is a char followed by a slash, we go on as described in the other post. In this way we should be sure that whatever will be, we'll find the end of the RegExp included chars. I did not spend too much time ensuring consistency for these flags since "/whatever/ii" will be part of inconsistent code which is a syntax parser problem, and not mine.

Test Cases


//comment <-- should not be matched at all
var a = /a/;
var b = /\//i;
var c = /[/]/;

I bet there are hundreds of RegExp or minifier out there able to fail with the latest one, since even different Editors have problems trying to understand what is going on.

The Test Case

Same code I have posted yesterday, except the alert will be for all arguments. I know I have used an empty replace, which is a bad practice, but that was good enough for test purpose:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
      function(){
        alert([].slice.call(arguments).join("\n"));
      }
    );
  };
};

Please let me know if you find a better solution or whatever gotcha via the test case, considering that arguments[0] should be exactly the matched RegExp, thanks.

P.S. about the inline comment, it's not worth it to avoid that case for two reasons: we can always test that match.charAt(1) !== "/" plus the problem is still: who comes first? If we have a string inside a regexp or vice-versa there is no way to exclude these cases in a single, reasonable, RegExp. As I have said, as soon as I'll find some time, I will explain how to create a proper function able to manage all JavaScript cases, stay tuned!

Tuesday, November 10, 2009

String Escape Safe Regular Expression

I should have probably investigated more but apparently I did it ... the most problematic I've encountered so far with JavaScript RegExp seems to be solved!

Update

Indeed, I should have investigated ... I just like to find solutions by my own. I am not surprised somebody already investigated this classic parsing problem.
Steve talked about it a year ago, using the lookbehind missed feature I talked in this post.
Above post has much more details than mine (and much more Edits as well).
The good part I am happy about is that both me and Steve came out with basically the same solution, but His one is definitively more compact:


// Steves Levithan compact solution
/(["'])(?:(?=(\\?))\2.)*?\1/g

The assumption of above regexp is that if there is a char followed by an escape one, there must be another char that cannot be the initial single or double quote, being the latter one outside the second uncaptured part, and after a non greedy operation.
If the second condition, \2, does not exist, the dot "." will pass the current char, no escape found, performing the char by char parsing I have described in my solution.
The dot is my [^\\], the double escape is represented by "\2.", as is for the escape plus whatever else that is not the end of the string, equivalent of my [\\(?=\1)]\1
I don't want to edit lots ot times this post, and I'll leave it as is to let you understand the problem, the logic, and the solution.
The only thing I would like to check are performances, since my less compact solution should be theoretically faster for common strings where the escape char is not present while Steve one will try to look for the escape plus will assign the possible missed match plus will pass whatever else char after, if any, considering outside there is a "break", and all these operations for whatever length, and still a char by char operation.
Whatever will be, we know we have at least two alternatives, and both mine and Steves one should be cross browser.

A Bit Of History

In all these years of programming with different languages, I have created dunno how many code parsers. WebReflection itself is using one of these parsers to highlight my sources. My good old PHP Comments Remover (2005 though ...) used another code parser. MyMin project used another one as well ... in few words, in my programming history I don't know how many times I had to deal with sources. The strategy I have always adopted, specially for JavaScript, is the char by char parser. The reason is simple, I have never created or found a good regular expression able to threat this case:


var code1 = "this is some \"test\"\\";
var code2 = "and this is \"anot\\her\" one!";

Above code, managed as a string, will become a stringe like:
"var code1 = \"this is some \\\"test\\\"\\\\";
var code2 = \"and this is \\\"anot\\her\\\" one!\";"
And if you know Regular Expressions, you know why this case is not that simple to manage isn't it?
Well, right now I was forking a project with a massive usage of Regular Expressions for CSS selectors and I could not avoid to notice the classical wrong match to manage strings:


/['"]([^'"]*?)['"]/g

Above match is almost a non-sense. If we have a string such "told'ya!" that RegExp will match told', leaving "ya!" out of the game. To make it a bit better the classic procedure is this one:


/(['"])([^\1]*?)\1/g

Whit above RegExp we are looking for quote or double quote char and we are searching the next one being sure if the first match is a single quote, the string will finish with a single one, and viceversa. There is still the problem that if we have the first matched quote or double quote and an escaped one in the middle of the string, that regular expression will truncate again the latter one giving us a untrustable result.

Why It Is More Difficult Via JavaScript

Regular Expressions in JavaScript miss at least one of most common features in PCRE world: the look-behind assertion!
Fortunately, we have an helpful Backreferences able in some case to slow down the match, but often the only or best way we have to create more clever matches!

The String Escape Safe Regular Expression


// WebReflection Solution
/(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g

I am not sure above little monster is the best RegExp you can find for this problem, and JavaScript features, what I am sure about, is that I have done dozen of tests and results seems to be perfect: Hooray!!!
If you are not familiar with RegExp, please let me try to explain what's going on there:


/
  // look for a single or a double quote char
  // this will be referenced as \1 in the rest of the regexp
  // in order to completely ignore the other one
  (['"])

  // the second match is performed over the string
  // that could be empty, or it could contain
  // any character included the first match, if escaped
  (

    // the second match will be a char by char parser
    // the only character we are worried about
    // is the one able to escape the first match
    (
      ?: // we are not interested about next capture
      // since the only scary char is the escape
      // but it is not necessary present
      // (let's say is less present than any other)
      // speed up the RegExp validating every char
      // but the escape ... these are all good!
      [^\\]
      |
      // if we encounter an escape char and this
      // is escaping itself we can skip 2 chars
      \\{2}
      |
      // alternatively, we could have
      // an escaped match (current one: single or double)
      // in this case we want to be sure that the escape
      // is for the matched char and not just an escape
      [\\(?=\1)]\1
      |
      // we need to validate whatever else has been
      // escaped as well so if the escape char is
      // NOT followed by the initial match or
      // another escape char it's ok
      // and we go on with next char
      [\\(?!\1)]

    // precedent cases should be performed for each
    // encountered char but these cannot be greedy
    // otherwise we risk to wrap the full string
    // var a = "a", b = "b";
    // 'a", b = "b' <-- greedy!
    )*?
  )

  // to make precedent assumptions valid
  // we need to be sure the string terminates
  // with the initial matched char
  \1
/g

That's pretty much it, if we use match method, replace, or exec, the matched[1] or RegExp.$1 will be the char used to encapsulate the string, single or double quote, while matched[2] or RegExp.$2 will contain the string itself.

In Any Case It Is Still Not Perfect

If we consider JavaScript regular expressions, same stuff used to solve the problem, we'll have another one.


var re = /ooo"yeah/;
var s = "no way";

In above example there will be some problem since the double quote inside the regular expression will be matched like a charm with my suggestion.
This is the reason we still need char by char parsers but hey ... I was trying to parse some selector and the usage of @test="case" which is even apparently not standard, so bear in mind we cannot use this RegExp unless the code won't contain literal regexps.
What is the trap here? That char by char a part, it's quite impossible to decide who comes first, "the slash or the quote"?

Quick And Dirty Solution Tester

With this code it should be simple to copy and paste some valid source to read parse after parse what is OK and what is not:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g,
      function(){
        alert([arguments[1], arguments[2]].join("\n"));
      }
    );
  };
};

Please share whatever problem you'll find with such Regular Expression or suggest me a better faster approach to solve this problem with same test cases, thanks.