Web Reflection: String


Friday, July 29, 2011

About JavaScript apply arguments limit

Just a quick one from the ECMAScript mailing list ... it is true that browsers may limit the number of arguments a function can receive. This can be a real problem, especially when we use apply to invoke a generic function that accepts an arbitrary number of arguments.

String.fromCharCode

This is a classic example that could fail with a truly big collection of char codes, and here is my suggestion to avoid such a limit:


var fromCharCode = (function ($fromCharCode, MAX_LENGTH) {
    // (C) WebReflection - DO THE FUCK YOU WANT LICENSE
    return function fromCharCode(code) {
        typeof code == "number" && (code = [code]);
        for (var
            result = [],
            i = 0,
            length = code.length;
            i < length; i += MAX_LENGTH
        ) {
            result.push($fromCharCode.apply(null, code.slice(i, i + MAX_LENGTH)));
        }
        return result.join("");
    };
}(String.fromCharCode, 2048));

// example
alert(fromCharCode(80)); // P
alert(fromCharCode([80, 81, 82, 83, 84])); // PQRST

The revisited version directly accepts an array and performs the call ceil(code.length / MAX_LENGTH) times.
The performance impact will be negligible, while bigger Arrays will be parsed, hopefully, without problems.
If we still have problems, we should never forget that user agents may have a limited amount of available RAM, so ... split the task or the operation, or stream it.

In Summary

It's not that difficult to apply the same concept to whatever function may suffer from the apply and number of arguments limits: just define a maximum number of arguments per call, in this case 2048, so that the task is split into chunks that stay under the limit.
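A minimal sketch of the same idea for another function, here Math.max. The helper name chunkedApply, its maxLength parameter, and the combine callback are assumptions made for this example only; a function whose partial results cannot be merged would need a different strategy.

```javascript
// Hypothetical helper: applies fn to slices of args, then merges the partial results.
// maxLength and combine are assumptions of this sketch, not part of the original snippet.
function chunkedApply(fn, args, maxLength, combine) {
    var partial = [];
    for (var i = 0, length = args.length; i < length; i += maxLength) {
        // each call receives at most maxLength arguments
        partial.push(fn.apply(null, args.slice(i, i + maxLength)));
    }
    // one partial result per chunk is left to merge
    return combine(partial);
}

// build 10000 numbers: 0, 1, 2, ... 9999
var numbers = [];
for (var n = 0; n < 10000; n++) numbers.push(n);

var biggest = chunkedApply(Math.max, numbers, 2048, function (partial) {
    return Math.max.apply(null, partial);
});
```

Here Math.max is invoked five times on the chunks and once more on the five partial results, so no single call receives more than 2048 arguments.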

Tuesday, November 10, 2009

String Escape Safe Regular Expression

I should have probably investigated more but apparently I did it ... the most problematic case I've encountered so far with JavaScript RegExp seems to be solved!

Update

Indeed, I should have investigated ... I just like to find solutions on my own. I am not surprised somebody had already investigated this classic parsing problem.
Steve talked about it a year ago, dealing with the missing lookbehind feature I talk about in this post.
His post has many more details than mine (and many more edits as well).
The part I am happy about is that both Steve and I came out with basically the same solution, but his is definitely more compact:


// Steven Levithan's compact solution
/(["'])(?:(?=(\\?))\2.)*?\1/g

The assumption of the above regexp is that if there is an escape char, it must be followed by another char, and that char cannot act as the closing single or double quote, since the closing quote is matched outside the non-capturing group, after a non-greedy repetition.
If the second capture, \2, is empty, the dot "." will consume the current char, no escape found, performing the char by char parsing I have described in my solution.
The dot is my [^\\], while "\2." covers the double escape as well as the escape plus whatever else that is not the end of the string, the equivalent of my [\\(?=\1)]\1
I don't want to edit this post lots of times, so I'll leave it as is to let you understand the problem, the logic, and the solution.
The only thing I would like to check is performance: my less compact solution should theoretically be faster for common strings where the escape char is not present, while Steve's one will look for the escape, assign the possibly empty match, and then consume whatever char comes after, if any, considering that outside there is a "break"; all these operations happen for whatever length, and it is still a char by char operation.
Whatever the result, we know we have at least two alternatives, and both mine and Steve's should be cross browser.
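As a quick check of the compact pattern, here is a small sketch that runs it over a source containing escaped quotes and an escaped backslash. The sample source and the variable names are made up for the example.

```javascript
// the compact pattern quoted above
var compact = /(["'])(?:(?=(\\?))\2.)*?\1/g;

// sample text: var a = "he said \"hi\"", b = 'it\'s', c = "tail\\";
var source = 'var a = "he said \\"hi\\"", b = \'it\\\'s\', c = "tail\\\\";';

// with the g flag, match returns every full match, quotes included
var found = source.match(compact);
```

Each of the three string literals comes back as a single match, escaped quotes included.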

A Bit Of History

In all these years of programming with different languages, I have created dunno how many code parsers. WebReflection itself is using one of these parsers to highlight my sources. My good old PHP Comments Remover (2005 though ...) used another code parser. MyMin project used another one as well ... in a few words, in my programming history I don't know how many times I had to deal with sources. The strategy I have always adopted, especially for JavaScript, is the char by char parser. The reason is simple: I have never created or found a good regular expression able to treat this case:


var code1 = "this is some \"test\"\\";
var code2 = "and this is \"anot\\her\" one!";

The above code, managed as a string, will become a string like:
"var code1 = \"this is some \\\"test\\\"\\\\\";
var code2 = \"and this is \\\"anot\\\\her\\\" one!\";"
And if you know Regular Expressions, you know why this case is not that simple to manage, don't you?
Well, right now I was forking a project with a massive usage of Regular Expressions for CSS selectors and I could not avoid noticing the classic wrong match to manage strings:


/['"]([^'"]*?)['"]/g

The above match is almost nonsense. If we have a string such as "told'ya!", that RegExp will match "told', leaving ya!" out of the game. To make it a bit better, the classic procedure is this one:


/(['"])([^\1]*?)\1/g

With the above RegExp we look for a single or double quote char and then search for the next one, being sure that if the first match is a single quote the string will finish with a single one, and vice versa. There is still a problem: if there is an escaped quote of the same kind in the middle of the string, that regular expression will truncate the match there, giving us an untrustworthy result.
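Both failures are easy to reproduce. The sketch below uses made-up sample strings and variable names. One detail worth knowing: inside a character class \1 is not read as a backreference, so [^\1] accepts practically any char and it is the trailing \1 that does the real work.

```javascript
var naive = /['"]([^'"]*?)['"]/g;
var classic = /(['"])([^\1]*?)\1/g;

var mixed = '"told\'ya!"';           // the text "told'ya!"
var escaped = '"say \\"hi\\" now"';  // the text "say \"hi\" now"

var naiveMixed = mixed.match(naive);         // stops at the single quote
var classicMixed = mixed.match(classic);     // gets the whole string
var classicEscaped = escaped.match(classic); // stops at the first escaped quote
```

The classic pattern fixes the mixed quotes case, while the escaped quote still cuts the match short right after the backslash.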

Why It Is More Difficult Via JavaScript

Regular Expressions in JavaScript miss at least one of the most common features in the PCRE world: the look-behind assertion!
Fortunately, we have helpful backreferences, which in some cases slow down the match, but are often the only or best way we have to create more clever matches!

The String Escape Safe Regular Expression


// WebReflection Solution
/(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g

I am not sure the above little monster is the best RegExp you can find for this problem and JavaScript features; what I am sure about is that I have done dozens of tests and the results seem to be perfect: Hooray!!!
If you are not familiar with RegExp, please let me try to explain what's going on there:


/
  // look for a single or a double quote char
  // this will be referenced as \1 in the rest of the regexp
  // in order to completely ignore the other one
  (['"])

  // the second match is performed over the string
  // that could be empty, or it could contain
  // any character including the first match, if escaped
  (

    // the second match will be a char by char parser
    // the only character we are worried about
    // is the one able to escape the first match
    (
      ?: // we are not interested in capturing this group
      // since the only scary char is the escape
      // but it is not necessarily present
      // (let's say it is less present than any other)
      // speed up the RegExp validating every char
      // but the escape ... these are all good!
      [^\\]
      |
      // if we encounter an escape char and this
      // is escaping itself we can skip 2 chars
      \\{2}
      |
      // alternatively, we could have
      // an escaped match (current one: single or double)
      // in this case we want to be sure that the escape
      // is for the matched char and not just an escape
      [\\(?=\1)]\1
      |
      // we need to validate whatever else has been
      // escaped as well so if the escape char is
      // NOT followed by the initial match or
      // another escape char it's ok
      // and we go on with next char
      [\\(?!\1)]

    // precedent cases should be performed for each
    // encountered char but these cannot be greedy
    // otherwise we risk to wrap the full string
    // var a = "a", b = "b";
    // 'a", b = "b' <-- greedy!
    )*?
  )

  // to make precedent assumptions valid
  // we need to be sure the string terminates
  // with the initial matched char
  \1
/g

That's pretty much it: if we use exec, replace, or match without the g flag, matched[1] or RegExp.$1 will be the char used to encapsulate the string, single or double quote, while matched[2] or RegExp.$2 will contain the string itself (with the g flag, match returns only the full matches).
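A small sketch of how the two captures could be collected with exec. The sample code string and the names stringRe, strings, and matched are assumptions of this example.

```javascript
// the pattern shown above, stored once so exec can advance through lastIndex
var stringRe = /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g;

// sample text: var a = "he said \"hi\"", b = 'x';
var code = 'var a = "he said \\"hi\\"", b = \'x\';';

var strings = [], matched;
while ((matched = stringRe.exec(code))) {
    // matched[1] is the delimiter, matched[2] the content between delimiters
    strings.push({quote: matched[1], content: matched[2]});
}
```

After the loop, strings holds one entry per literal found, each with its delimiter and its raw, still escaped, content.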

In Any Case It Is Still Not Perfect

If we consider JavaScript regular expression literals, the same stuff used to solve the problem, we'll have another one.


var re = /ooo"yeah/;
var s = "no way";

In the above example there will be a problem, since the double quote inside the regular expression literal will happily be matched by my suggestion, as if it opened a string.
This is the reason we still need char by char parsers but hey ... I was trying to parse some selector and the usage of @test="case", which apparently is not even standard, so bear in mind we can use this RegExp only if the code contains no literal regexps.
What is the trap here? That, char by char parsing apart, it's quite impossible to decide who comes first: "the slash or the quote"?
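The trap can be reproduced with the two lines of the example treated as text. The names stringRe, tricky, and wrong are assumptions of this sketch.

```javascript
var stringRe = /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g;

// the two statements of the example, as source text
var tricky = 'var re = /ooo"yeah/;\nvar s = "no way";';

// the quote inside the regexp literal is taken as the start of a string
var wrong = tricky.match(stringRe);
```

The only match starts inside the regexp literal and ends at the opening quote of the real string, so "no way" is never reported.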

Quick And Dirty Solution Tester

With this code it should be simple to copy and paste some valid source and read, parse after parse, what is OK and what is not:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g,
      function(){
        alert([arguments[1], arguments[2]].join("\n"));
      }
    );
  };
};

Please share whatever problem you find with this Regular Expression or suggest a better, faster approach to solve this problem with the same test cases, thanks.

Friday, July 10, 2009

ECMAScript 5 Full Specs String trim, trimLeft, and trimRight

During the last few evenings I have updated my vice-versa project a little bit.
Since the aim of vice-versa is to bring to every browser what is possible to implement and, in most cases, already defined as a standard (from W3 or MSDN when it is worth it), I decided to get rid of the Ariel Flesler fast trim proposal and introduce my lightweight full specs String.prototype.trim, trimLeft, and trimRight.
By full specs I mean that the vice-versa String.prototype.trim replaces exactly the same characters replaced by the native Firefox 3.5 implementation, rather than only characters whose code is less than 33, as Ariel's proposal does.
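To give an idea of the difference, here is a rough sketch of a trim based on an explicit list of characters. It is not the vice-versa source: the function names, the WHITE constant, and the exact list (my reading of the ES5 WhiteSpace and LineTerminator characters) are all assumptions of this example.

```javascript
// Hypothetical stand-alone helpers, not the vice-versa implementation.
// WHITE lists the characters assumed to count as white space or line terminators.
var WHITE = "[\\t\\n\\v\\f\\r \\u00A0\\u1680\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000\\uFEFF]+";
var LEFT = new RegExp("^" + WHITE);
var RIGHT = new RegExp(WHITE + "$");

function trimLeft(string) {
    return String(string).replace(LEFT, "");
}
function trimRight(string) {
    return String(string).replace(RIGHT, "");
}
function trim(string) {
    return trimRight(trimLeft(string));
}

// \u00A0 (160) and \uFEFF are above 32, so a "less than 33" check would keep them
var sample = "\u00A0\t hello world \uFEFF\n";
var trimmed = trim(sample);
```

With this list the non-breaking space and the byte order mark are removed like regular spaces, which a check on char codes lower than 33 cannot do.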

The good part of vice-versa (to be honest I cannot find bad parts so far ;) ) is that every single file is stand-alone, so if you do not like the benefits the entire "lib" could bring, you can always adopt only one of its files, for example the String one, the Array one, or the latest full specs ECMAScript 5 Date constructor, compatible with ISO strings such as new Date("2009-07-10") and with a complete toISOString method for each created Date instance (of course, even if replaced, new Date will produce instances of Date and their constructor will be Date itself).

If you want to give vice-versa minified a try, a little monster whose size is about 5Kb gzipped, please do not hesitate to download it.

Have fun with future standards and a bit of MSDN standard coolness for every browser ;)

Friday, February 13, 2009

After the Array, subclassed String

As I wrote months ago in this documentation about JavaScript prototypal inheritance, subclassing a native data type is not that simple (almost not possible at all), but since JS is loaded with weird bits and bobs, I often try to break its documented limits.

It was the Array, some time ago, now it is about time for String.

The concept is similar: if you subclass a native data type, its methods will return an instance of that native data type, so concat, charAt, toString, etc. will return a String unless we override every inherited method (so performance won't be that good).

Subclassed String


(function(Function, slice, push){
// from WebReflection: Subclassed String
function String(String){
    if(arguments.length) // clever constructor, accepts more than a string as argument
        push.apply(this, slice.call(arguments).join("").split(""));
};
String.prototype = new Function;
try{
    (new String) + "";  // exception in FireFox
    var join = Array.prototype.join;
    String.prototype.toString = String.prototype.valueOf = function(){
        return join.call(this, "");
    };
}catch(e){              // no way to retrieve the length with FireFox
    String.prototype.toString = String.prototype.valueOf = function(){
        for(var Array = [], i = 0; Array[i] = this[i]; i++);
        return Array.join("");
    };
};
(window.$ || ($ = {})).String = String; // let's put this in a namespace
})(String, Array.prototype.slice, Array.prototype.push);

That's it, every String.prototype method should act as if it had been called via a native String in probably every browser.

Performances

concat, charAt, charCodeAt, toLowerCase, etc. will perform almost the same as with native strings, but the constructor will be a bit slower (instances instead of regular "strings").
On the other hand, in every browser the returned instance will be an Array-like String, so ...


new $.String("Here", " ", "we", " ", "are!")[0];

will be exactly the char "H" in every compatible browser.

Alternatives?

To create a valid alternative that will NOT be an instanceof String we could use the same Stack trick:


// every browser
(function(Function, join, slice, push){
function String(String){
    if(arguments.length)
        push.apply(this, slice.call(arguments).join("").split(""));
};
String.prototype.length = 0;
String.prototype.valueOf = String.prototype.toString = function(){
    return join.call(this, "");
};
// here we can add every String prototype to the $.String prototype
(window.$ || ($ = {})).String = String;
})(String, Array.prototype.join, Array.prototype.slice, Array.prototype.push);
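The comment about adding every String prototype method could be sketched as follows. StackString is a hypothetical name used to keep the example stand-alone (no window, no $ namespace), and only a handful of methods are borrowed; native String methods convert their this value to a string first, so they can be copied over as they are.

```javascript
// Hypothetical stand-alone variant of the array-like string above
var StackString = (function (join, slice, push) {
    function StackString() {
        if (arguments.length)
            push.apply(this, slice.call(arguments).join("").split(""));
    }
    StackString.prototype.length = 0;
    StackString.prototype.valueOf = StackString.prototype.toString = function () {
        return join.call(this, "");
    };
    // borrow a few native methods: each one reads the value through toString()
    var names = ["charAt", "charCodeAt", "concat", "indexOf", "slice", "toUpperCase"];
    for (var i = 0; i < names.length; i++) {
        StackString.prototype[names[i]] = String.prototype[names[i]];
    }
    return StackString;
}(Array.prototype.join, Array.prototype.slice, Array.prototype.push));

var s = new StackString("Here", " ", "we", " ", "are!");
var upper = s.toUpperCase(); // a native string, not a StackString
```

The instance stays array-like (s[0], s.length) and is not an instanceof String, while the borrowed methods still return native strings.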

Have fun ;-)

Thursday, October 23, 2008

Subclassing JavaScript Native String

Just a quick post about subclassing the native String constructor.
All we need is to redefine the valueOf and toString methods.


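function $String(__value__){
    this.__value__ = __value__ || "";
    // a plain assignment is ignored: the inherited String length is read-only
    Object.defineProperty(this, "length", {value: this.__value__.length});
};
$String.prototype = new String;
$String.prototype.toString = $String.prototype.valueOf = function(){
    return this.__value__;
};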

With the incoming V8, TraceMonkey, SquirrelFish, and other (if any ...) advanced engines that transform repeated code into machine code, performance won't be a problem anymore and everybody could create their own implementation of String.

Of course, these statements will be preserved:


var s = new $String("abc");
s instanceof String; // true
s.constructor === String; // true

while typeof will return "object", but we can easily use another method such as:


$String.prototype.type = function(){
 return "string";
};

alert(typeof s);      // object
alert(s && s.type()); // string

concat, like other native methods, will work, but the returned object, unfortunately, won't be a $String.
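If a $String result is needed, each method has to be wrapped by hand. A minimal sketch for concat follows; $String is declared again here only to keep the example self-contained, and the use of Object.defineProperty for length is an assumption of this sketch.

```javascript
// re-declared so the sketch stands alone
function $String(value) {
    this.__value__ = value || "";
    // own length property, since the one inherited from new String is read-only
    Object.defineProperty(this, "length", {value: this.__value__.length});
}
$String.prototype = new String;
$String.prototype.toString = $String.prototype.valueOf = function () {
    return this.__value__;
};

// wrap the native concat so the result is a $String again
$String.prototype.concat = function () {
    return new $String(String.prototype.concat.apply(this.__value__, arguments));
};

var joined = new $String("abc").concat("def", "ghi");
```

The same wrapping would be required for every other method expected to return a $String, which is where the performance cost mentioned above comes from.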