PCRE JIT
Zoltán Herczeg
University of Szeged
Department of Software Engineering
13 Dugonics Square Szeged, Hungary
[email protected]
ABSTRACT

High matching performance of regular expressions is a critical requirement for many widely used software tools today, including web servers, firewalls, and intrusion detection systems. Backtracking regular expression engines have been considerably improved in the last decade as a result of this requirement. Today, state of the art engines use just-in-time (JIT) compilation support to generate machine code from regular expressions, and they use new, innovative techniques to further improve the speed of the generated code.

In the present paper, we introduce a new technique called static backtracking, which allows simultaneous optimization of both matching and backtracking. Based on this technique, we developed a JIT compiler for the widely used PCRE regular expression library. Our compiler supports all valid PCRE patterns, which shows that static backtracking is a viable choice for Perl compatible engines.

We also show that our balanced, Abstract Syntax Tree based code generator efficiently improves the performance of long-running, backtracking heavy regular expressions. Compared to another JIT accelerated regular expression engine, PCRE-JIT was able to run these patterns 1.95 times faster. Since these long-running patterns dominate the total runtime, PCRE-JIT achieved 1.63 times faster matching speed overall. We also observed 6.36 times average speedup compared to the PCRE interpreter on 5 different CPU architectures.

Categories and Subject Descriptors

D.3.4 [Programming Languages]: Processors -- Compilers, Code generation, Optimization

General Terms

Algorithms, Performance

Keywords

Regular expressions, JIT compiling, Static backtracking

CGO '14, February 15-19, 2014, Orlando, FL, USA

1. INTRODUCTION

Regular expressions are among the most popular tools for advanced text processing today, providing a rich and flexible solution for custom pattern matching. Kleene defined regular sets [16] in the 1950s and Thompson implemented the first regular expression library [28] in the 1960s, since when regular expressions have evolved considerably to the point where they cannot be called regular anymore. So far regular expression engines have belonged to one of the following two major groups:

NFA based: a directed graph, a non-deterministic finite-state automaton, is constructed by these engines. Vertices of the graph represent states, and edges represent state transitions. Starting from the start state, the engine traverses these states through the available state transitions with a backtracking depth-first search algorithm until the first accept state is reached, which means a successful match. When all transition options are exhausted, the engine returns with a failed match. Unlike their definition in computation theory, state transitions of real world engines are not limited to single character matches. They can represent any actions such as matching a backreference or evaluating conditional expressions. The runtime of the depth-first search algorithm is exponential in the worst case: when /(x|x){n}y/ is matched to a string of n + 1 x literals, the engine performs 2^n backtracks before it returns with a failed match.

DFA based: a deterministic finite-state automaton is constructed by these engines. Similar to the directed graph of the NFA based engines, the state machine has states and state transitions. However, this state machine has exactly one state transition for each state and input character pair, which eliminates the need for backtracking. These engines are closer to the DFA engines described in computation theory, since their state transitions can only depend on fixed character sets. The runtime of a DFA based engine depends only on the input length. The trade-off is state explosion, e.g. /a[^b]{n}a/ has 3 * 2^n different states. Due to this exponential number of states, the runtime cost of this pattern is exponential for those engines that generate their state machines in advance, such as RE2 [8] from Google.

In the present paper, we introduce a new code generator technique called static backtracking, which is a radically new approach for backtracking engines. Unlike the prior art, it generates machine code from the Abstract Syntax Tree (AST) [2] representation which provides more information about the structure of a regular expression than NFA or DFA. Using this extra information, it can optimize both matching and backtracking simultaneously rather than focusing only on matching. Because our state machine is a bi-directional tree, which is traversed by a backtracking algorithm different from NFA, our algorithm belongs to a third group.

The first freely available NFA based engine was created by Henry Spencer in 1987. It was chosen as the regular expression library of the Perl language [29]. The engine inside Perl was extended with new, innovative features, which made it increasingly popular among developers. Eventually it became the standard of regular expressions, adopted by many languages including Python, JavaScript and the .NET Framework. The Perl Compatible Regular Expression library (PCRE) [14] made by Philip Hazel also uses the Perl syntax with minor differences.

[Figure 1: Overview of the PCRE-JIT engine -- the PCRE byte code is translated by the static backtracking based PCRE-JIT compiler, whose output is turned into machine code by the SLJIT compiler; the two compilers together form the PCRE-JIT engine.]
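The exponential worst case described in the introduction can be reproduced with any backtracking matcher. The sketch below uses Python's re module (itself a backtracking engine) purely for illustration; the timing helper and the chosen values of n are our own, not part of the paper's benchmark.

```python
import re
import time

def time_failed_match(n):
    """Match the pathological /(x|x){n}y/ pattern against a string of
    n + 1 'x' literals; a backtracking engine explores up to 2^n
    alternative paths before it can report the failure."""
    pattern = re.compile("(x|x){%d}y" % n)
    text = "x" * (n + 1)
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed

result_small, t_small = time_failed_match(5)    # 2^5 = 32 paths
result_large, t_large = time_failed_match(18)   # 2^18 = 262144 paths
assert result_small is None and result_large is None
# t_large is typically orders of magnitude larger than t_small
```

Growing n by one roughly doubles the matching time, while a DFA based engine such as RE2 would reject the same input in time linear in the input length.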
The aim of just-in-time compilation for PCRE (PCRE-JIT) is to speed up pattern matching of the PCRE library by transforming the PCRE byte code into the low-level intermediate representation (LIR) language of the Stackless Just-In-Time Compiler (SLJIT), which then translates it into machine executable code. Both PCRE-JIT and SLJIT projects were developed by us, and they have been part of the PCRE library since version 8.20. After the initial release, we have continued this work, and since version 8.32, the whole feature set of PCRE has been supported. At the time of writing this paper, PCRE-JIT supports more Perl compatible syntax rules than any other regular expression engine with JIT compiler support. To make the paper self-contained, we outline both the PCRE-JIT and SLJIT compilers. The rest of the paper is organized as follows:

In Section 2, we review related work. In Section 3, we outline the SLJIT compiler. In Section 4, we introduce the PCRE-JIT compiler, and describe its AST based code generator, including the new static backtracking method. In Section 5, we compare the performance of PCRE-JIT to other JIT compiler accelerated engines, and we compare the performance of PCRE-JIT and the PCRE interpreter on various CPU architectures. Finally, in Section 6, we summarize our paper and present some plans for future work.

2. RELATED WORK

Compiling regular expressions to machine code has a long history: the first engine [28] for IBM 7094 CPUs was introduced in 1968 by Ken Thompson. His technique has never become popular however, due to the complexity of on-the-fly machine code generation.

The machine code generators for regular expressions can be divided into three major groups:

The first group generates source code for a given language, mostly C/C++, but JAVA is getting more popular as well. To compile this source code, a fully operating compiler toolchain is required, which limits the use cases of these engines. Typical examples are lexical analyzers and/or scanners, such as lex [21], flex [20], JFlex [17], re2c [6] and the lexer part of the ANTLR [26] tool.

The second group generates byte code for a given virtual machine (VM), and the VM is responsible for dynamically loading and executing the compiled regular expression. These engines can depend on the availability of a VM, since the very same VM is required to run them. The JAVA based FIRE/J [15] and the C# based regular expression engine of the .NET [23] and Mono [9] Frameworks can be mentioned here. The LLVM-RE [22] branch of the Unladen-Swallow Python engine generates LLVM [19] byte code, but this project has not been finished and seems to be abandoned now.

The third group generates executable machine code on the fly. Yet Another Regex Runtime [4] (YARR) in the open source WebKit browser engine, Irregexp [7] in the V8 JavaScript engine and our PCRE-JIT belong to this group. Since YARR and Irregexp are the closest counterparts of our engine, we compare their backtracking performance in Section 5.

The lightweight JIT compilers can be grouped according to their low-level intermediate representation.

NanoJIT [1] and LibJIT [30] accept their LIR in Static Single Assignment (SSA) [24] form. They use lightweight register allocation mechanisms, e.g. linear scan [31].

Virtual machine accelerators compile the byte code representation of a VM to machine code. Many of them compile hotspots only. They usually target the Java VM, and some of them are lightweight enough for embedded systems such as Swift [32] and HotpathVM [12].

Dynamic assemblers, namely AsmJit [18], DynASM [25] in LuaJIT, and the macro assemblers in the Irregexp engine target a single architecture. They provide the fastest code generation speed due to their simplicity.

Dynamic assemblers can be extended to become platform independent by defining generic architectures, whose instructions are translated to the current CPU. Such projects are GNU lightning [5], VCODE [10], SICStus Abstract Machine (SAM) [13] and our SLJIT compiler.

3. OVERVIEW OF THE STACKLESS JUST-IN-TIME COMPILER (SLJIT)

The SLJIT compiler is designed to fit the commonly used code generation techniques in lightweight compilation environments. These techniques are different from static compiler optimizations, since they are balanced between compilation and execution speed. Because of their dynamic nature, some of them are unique to JIT compilers, such as inline caching. These dynamic code modifications allow for fine tuning the generated code after the compilation.

The primary difference of SLJIT from existing projects is the way that the instruction set was designed: instead of creating a simplified RISC like architecture, the common features of existing, widely used CPU architectures were merged together. The resulting instruction set can be efficiently translated to the supported CPUs, but this efficiency
may not be achieved on other platforms, such as Very Long Instruction Word architectures. This reduces the portability of the SLJIT compiler.

SLJIT is one of the two major components of the PCRE-JIT engine as shown in Figure 1. It translates the output of the PCRE-JIT compiler to executable code. We should note here that the term "stackless" in the name of the SLJIT compiler does not relate to static backtracking. SLJIT is an independent work, developed a few years before the PCRE-JIT compiler. Its name refers to its inability to manage local variables, which are usually temporarily stored on the stack. Instead, it provides direct access to real machine registers.

SLJIT provides a low-level platform independent assembly like language. The LIR instructions of this language are emitted through a simple API. This approach offers better performance than compiling source code, since there is no need for parsing the input. As we mentioned before, the language is not tied to any existing architecture, it is rather a combination of their features.

Function call and return are implemented as single instructions on all CPUs; however the Application Binary Interface (ABI) of these architectures specifies various stack and register handling rules, which can make these function calls computationally heavy, especially for small, utility functions. To reduce this overhead, SLJIT supports a fast calling mechanism across JIT generated functions, which does not save or restore any registers, or manipulate the stack. By using this technique, the branch predictor of most CPUs is capable of predicting the return address, so these lightweight functions are fast. The PCRE-JIT compiler uses these calls to decode UTF byte sequences and detect character types among other things.

    sub.e.s unused, s2, s3    % Set signed result (.s) and equal (.e)
                              % status flag bits after s3 is
                              % subtracted from s2.
    flags.mov s1, sig greater % Copy the value of a status flag bit.
    add.k s1, s1, s1          % Multiply s1 by 2, and keep current
                              % status flags (.k).
    flags.or s1, s1, equal    % Set the lowest bit of s1 if s2 and
                              % s3 were equal.
    sub s1, s1, #1            % Decrease s1 by 1. Status flags are
                              % undefined.

Figure 2: Compare s2 and s3, and set s1 to -1, if s2 is lower, 0, if they are equal, and 1 otherwise.

[Figure 3: AST representation of the /(?:aa|a)+ab/ pattern -- the root node concatenates a greedy plus (?:)+ group with the literals a and b; the group has two children, Alternative 1 (the literals a a) and Alternative 2 (the literal a).]
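The flag based sequence of Figure 2 can be mirrored in ordinary code. The Python sketch below is only an illustration of the idea: the booleans stand in for the CPU status flag bits, and the function name is ours, not part of SLJIT.

```python
def three_way_compare(s2, s3):
    # The two status flag bits set by the sub.e.s instruction:
    sig_greater = 1 if s2 > s3 else 0   # "sig greater" flag bit
    equal = 1 if s2 == s3 else 0        # "equal" flag bit
    s1 = sig_greater   # flags.mov: copy a status flag bit
    s1 = s1 + s1       # add.k: multiply s1 by 2
    s1 = s1 | equal    # flags.or: set the lowest bit on equality
    s1 = s1 - 1        # sub: decrease s1 by 1
    return s1          # -1 if s2 < s3, 0 if equal, 1 if s2 > s3

assert three_way_compare(3, 7) == -1
assert three_way_compare(5, 5) == 0
assert three_way_compare(9, 2) == 1
```

Just as in the LIR listing, the result is assembled from the flag bits without a single conditional branch.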
SLJIT covers most arithmetic, shift, and bitwise operators known from higher level languages, such as addition, arithmetic right shift, bitwise exclusive-or, etc. In addition to the common operators, we found some other instructions that are similar in these CPUs. Signed or unsigned long multiply, where the size of the result is twice the size of the source operands, count leading zero, swap endianness, prefetch and user breakpoint instructions are widely supported. CPU status flags are also supported by all CPUs except MIPS. These status flags can be set by certain arithmetic or logic instructions, and their values can be used later.

An SLJIT LIR dump that uses these status flags can be seen in Figure 2. This code fragment compares two machine registers, s2 and s3, and sets the value of s1 according to the result, without using conditional branches. Typically, compare operators perform such tasks. The first operand of most LIR instructions is the destination (unless it has no destination at all), followed by the source operands. The unused keyword can only be used as a destination operand, and it indicates that the result of the computation should be discarded. The s prefix in a register name means that it is a scratch register, so its value is not preserved across function calls. We should mention here that SLJIT emulates all missing instruction forms and features if they are not available on the current CPU, so the LIR code above works even on MIPS.

4. PCRE-JIT COMPILER OVERVIEW

In this section we outline the code generator of the PCRE-JIT compiler and the static backtracking algorithm. Before we go into details, we show the matching algorithm of other backtracking engines. This overview will allow us to compare our approach to the prior art.

NFA based engines construct a directed graph from a regular expression. The matching algorithm of these engines is quite similar to try-catch based exception handling: each state sets up a generic catch handler, and in the event of backtrack, control is transferred here by indirect jumps. The performance of a try-catch based approach heavily depends on the frequency of the exceptional event, so these engines expect that backtracking happens rarely. However, we show later that this assumption is not proved in practice.

To reduce the number of catches, the only states that are kept by these engines are those that provide at least one more state transition. For example, the state which is assigned to the non-capturing bracket of (?:a|b|cc) provides three alternatives. When the last (cc) alternative is selected, there are no more possible choices, so the catch handler of this state can be discarded. This optimization is called tail recursion. We can always apply this optimization to those states that only have one possible state transition, such as character sets or backreferences. The minimum case of iterators is another typical candidate, e.g. when x+ is matched to a single x.

[Figure 4: Matching process of the /AB/ pattern -- Start match leads to matching A, then B; a successful B means Match found. When B does not match, the engine backtracks to A, and when A provides no more matches, the result is Match failed.]

[Figure 5: Execution interface of a generic M AST node with one child node (N) -- each node has two entry points (Try match and Backtrack) and two exit points; the pre phase runs before the child node is executed and the post phase after it.]

The PCRE-JIT engine generates machine code from the AST representation, since the AST provides more information about the structure of a pattern than the NFA. For example, each node has exactly one parent, so the previous node, where the engine backtracks, is unambiguous and always known at compile time as shown in Figure 3. Therefore, the engine can generate context sensitive backtracking code paths, which can be connected by direct jumps. Unlike indirect jumps on many architectures, these jumps can be conditional, which simplifies the control transfer to the backtracking handler when a check fails. Furthermore, certain jump instructions can be totally eliminated by efficient ordering of code paths.

The code generator of PCRE-JIT is not NFA based. The state transitions of an NFA are uni-directional, and NFA based engines use recovery points to fully restore the previous state when the engine backtracks. In contrast, our state machine is a tree with bi-directional state transitions, which can be used for going back and forth. We should also note that the AST can always be converted to NFA, but the reverse of this statement is not true. However, all Perl compatible patterns have an AST representation, so PCRE-JIT does not need to support those extra NFA cases.

An important difference between NFA and AST based engines is their optimization strategies. NFA based engines focus on efficient tail recursion, while our approach focuses on efficient code path ordering. Another key difference is that NFA based engines prefer complex backtracking code paths, which perform as many tasks as possible to reduce the number of indirect jumps. Instead, we prefer simple, often empty code paths, where control can be transferred to the next code path without a jump instruction.

In the following, many regular expression examples represent a group of patterns instead of a single one. To define these groups, we introduce a simple notation. The general style of the patterns is that of Perl-style regular expressions [11]. All pattern groups are enclosed in slashes. All characters between these slashes represent themselves, except capital letters, which can represent any valid subpattern where the subpattern does not have any side-effect. E.g. /Ax+/ means the concatenation of any valid subpattern and x+ such as /a(b|c)x+/, but A cannot be a|b, since /a|bx+/ can match a plain a without an x at the end. This case can be made valid by enclosing the subpattern in a non-capturing group: /(?:a|b)x+/. If a pattern does not contain any capital letters, it represents a single pattern.

This section is divided into multiple subsections. First, in subsection 4.1 we overview the static backtracking algorithm. In subsection 4.2 we show an example of the generated code, and in subsection 4.3 we extend this example to cover an important corner case.

4.1 Overview of the Static Backtracking Code Generator Algorithm

In the following paragraph we outline the code generator of PCRE-JIT. Figure 4 shows the matching process of the /AB/ regular expression which represents the concatenation of two subpatterns. Perl compatible engines are free to use any matching algorithm as long as they produce the same result.

Existing JIT accelerated engines use two techniques for backtracking. The .NET Framework and Irregexp generate code from the NFA representation and they follow the traditional NFA backtracking mechanism described in the beginning of this section. YARR uses another approach. It allocates global variables to store the matching progress, so it supports only those constructs which can be represented by a fixed number of global variables. An example of such a construct is a character literal with a greedy plus quantifier, because the number of matched characters and the position of the last matched character are enough to determine the next possible state on backtrack. The simplicity of the generated code provides high performance, but many constructs such as /(?:ab)+/ are not supported by YARR, and it must fall back to interpreted mode.

PCRE-JIT uses a third approach. Instead of the NFA representation, we generate code from the AST. More precisely a code path pair is generated from each AST node. The code generator is recursive, code paths for child nodes are generated by their parent, so the machine code representation of a node includes all code paths of the subtree rooted from this particular node.

The backtracking, depth-first search algorithm of PCRE-JIT is based on a pre-defined traversing order of the syntax tree.
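The depth-first search over the alternatives of a group such as (?:a|b|cc), described at the beginning of this section, can be sketched in a few lines. This is a toy matcher over literal alternatives for illustration only, not PCRE's implementation.

```python
def match_alternatives(alternatives, text, pos):
    """Depth-first search over the alternatives of a group such as
    (?:a|b|cc): alternatives are tried in order, and the failure of
    one alternative backtracks to the next. Once the last alternative
    is selected there is no further choice left, so no handler needs
    to be kept for this state (the tail recursion optimization)."""
    for literal in alternatives:
        if text.startswith(literal, pos):
            return pos + len(literal)  # position after the match
    return None  # all state transitions exhausted: failed match

assert match_alternatives(["a", "b", "cc"], "ccx", 0) == 2
assert match_alternatives(["a", "b", "cc"], "bx", 0) == 1
assert match_alternatives(["a", "b", "cc"], "x", 0) is None
```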
Each node type defines the traversing order of its child nodes, and the traversing order of the syntax tree is the combination of them. The traversing path is bi-directional, both the successor and the predecessor nodes are known at compile time. The only exception is the execution of child nodes: a parent node can transfer control to any of its child nodes any number of times, and the order can be decided at runtime.

[Figure 6: Code path types and the role of their entry and exit points -- the matching path is entered to try a new match (the input position contains a valid starting position), and the backtracking path is entered to find another match (a valid context is provided on the top of the stack). The matching path exits when a match is found (a valid context is stored on the top of the stack and the input position contains the end of the match); the backtracking path exits when no match is found (it removes the topmost context if appropriate, and the input position is undefined).]

[Figure 7: Simplified control flow graph of the /AB/ pattern -- Start leads to the matching path of A, then the matching path of B, then Match; when B does not match, control falls to the backtracking path of B, then the backtracking path of A, then No match. The matching and backtracking paths of the concatenation connect these paths without extra instructions.]
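The fixed-variable backtracking used by YARR, described in subsection 4.1, can be illustrated for a pattern of the form /x+y/: the greedy repetition is fully described by a match count and the position after the last matched character, so giving back one character at a time needs no stack. The helper below is a hypothetical sketch, not YARR code.

```python
def match_greedy_plus_then(char, suffix, text):
    """Fixed-variable backtracking for a pattern like /x+y/: the
    greedy x+ is tracked by two variables only (match count and the
    position after the last matched character), so backtracking just
    decrements them instead of popping a stack."""
    count = 0
    pos = 0
    while pos < len(text) and text[pos] == char:  # greedy phase
        pos += 1
        count += 1
    # Backtrack: give back one character at a time until the rest
    # of the pattern (here a literal suffix) matches.
    while count >= 1:
        if text.startswith(suffix, pos):
            return pos + len(suffix)  # end of the whole match
        pos -= 1
        count -= 1
    return None  # x+ needs at least one x, or the suffix never fits

assert match_greedy_plus_then("x", "y", "xxxy") == 4
assert match_greedy_plus_then("x", "xy", "xxxy") == 4
assert match_greedy_plus_then("x", "y", "yyy") is None
```

A construct like /(?:ab)+/ cannot be described by such a fixed set of variables, which is why YARR falls back to interpreted mode there.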
The various code paths are connected together using the interface shown in Figure 5. The most important aspect of the Figure is that each M node has exactly two entry and two exit points, which are represented by the four outer arrows attached to the box of the M node. The reason for generating two code paths from each node is that a code path pair has two entry and exit points. Other than this aspect, these two code paths form a single function. To emphasize this connection, a single box represents both M and N nodes in the Figure.

The Figure also shows the conditional execution of a child (N) node. As we mentioned before, code paths for child nodes are generated by their parent node, and their execution is also controlled by their parent. For example the greedy plus (?:)+ node matches its child node as many times as possible. When a child node finishes its execution, it must transfer the control back to its parent. Due to the code layout, these transfers do not require any jump instructions. The pre-M and post-M phases represent the instructions executed before and after the child node is executed, respectively.

Normal functions use arguments and return values to pass data. In PCRE-JIT, the two entry and exit points of a single function can be used to represent passing and returning a boolean value without putting these booleans into machine registers. Assigning roles to these points makes a lot of argument and return value checks unnecessary, which can be used for optimizing the code. Figure 6 shows the names of these code paths (inside the boxes) and the roles of the entry and exit points. Both code paths are named after the role of their entry point.

In Figure 6, new terms, namely input position and context, are introduced as well. The input position is a global variable, and it contains the position of the next input character. The context is the atomic unit of stack management. Backtracking engines require a stack to store the status of a given node, since this data is required to determine the next possible alternative in the event of a backtrack. NFA based engines push the context and the entry address of the backtracking handler onto the stack, so the context is always processed by the appropriate handler. A static backtracking based compiler cannot rely on this mechanism. Instead, each parent node must know or record the calling order of its child nodes, so it can choose the appropriate backtracking handler. The recording is usually a cheap operation, because the number of sub-nodes is usually less than or equal to one and it can be combined with other tasks. We will see an example for such optimization in subsection 4.2.

To summarize this subsection, we show the simplified control flow graph of a frequently used AST node, namely the concatenation, which is shown in Figure 7. Both arrow types represent control transfers in the Figure, the difference is that the thick arrows do not require a jump instruction. The concatenation is a free operation in static backtracking, because the appropriate code path ordering is enough to satisfy all requirements, no extra instructions are needed for anything. These code layout optimizations are essential for an efficient depth-first search.

4.2 An Example

The PCRE-JIT compiler is based on templates. All 150 byte code types of PCRE have two templates, one for the matching and another for the backtracking path. Each template is a list of LIR instructions and references to other templates. Hence templates for smaller subtasks can be shared which improves maintainability.

Figure 8 shows the structure of the machine code generated from the /(?:A)+B/ pattern. To improve readability, certain templates are not expanded in the Figure. Instead, they are kept as pseudo function calls such as push() or backtrack().
    // Matching path of (?:A)+
    // NULL = Beginning of the context list mark
    push(NULL)
    L1: if (match(A) = FAILED)
          goto L6
    L2: push(input position)
        // Greedy match: try again
        goto L1
    // Matching path of B
    L3: if (match(B) = FAILED)
          goto L5
    L4: return SUCCESS
    // Backtracking path of B
    L5: if (backtrack(B) = MATCHED)
          goto L4
    // Backtracking path of (?:A)+
    L6: if (backtrack(A) = MATCHED)
          goto L2
    L7: input position ← pop()
        if (input position != NULL)
          goto L3
        return FAIL

Figure 8: Template structure of /(?:A)+B/ where A never matches an empty string

    // Matching path of (?:A)+
    // NULL = Beginning of the context list mark
    push(NULL)
    L1: push(private slot)
        private slot ← input position
        if (match(A) = FAILED)
          goto L6
    L2: push(input position)
        if (private slot != input position)
          goto L1
        // Discard input position
        pop()
    // Matching path of B
    L3: if (match(B) = FAILED)
          goto L5
    L4: return SUCCESS
    // Backtracking path of B
    L5: if (backtrack(B) = MATCHED)
          goto L4
    // Backtracking path of (?:A)+
    L6: if (backtrack(A) = MATCHED)
          goto L2
        private slot ← pop()
        input position ← pop()
        if (input position != NULL)
          goto L3
        return FAIL

Figure 9: Template structure of /(?:A)+B/ where A might match an empty string
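The control flow of Figure 8 can be transcribed into a runnable sketch for the concrete pattern /(?:a)+ab/ (A = a, B = ab). Since a literal A has no internal backtracking, the backtrack(A) call always fails and is omitted here; everything else -- the NULL marker, the pushed input positions, and the pop-and-retry loop -- follows the template. This is an illustration of the control flow, not the generated machine code.

```python
def match_template(text, start):
    """Sketch of the Figure 8 template for /(?:a)+ab/ with an explicit
    stack: one input position is pushed per repetition, and NULL marks
    the beginning of the context list."""
    NULL = None
    stack = [NULL]      # push(NULL)
    pos = start
    # Matching path of (?:a)+ -- greedy: match 'a' as often as possible.
    while pos < len(text) and text[pos] == "a":
        pos += 1
        stack.append(pos)        # L2: push(input position)
    # Backtracking path of (?:a)+ merged with the matching path of 'ab'.
    while True:
        pos = stack.pop()        # L7: input position <- pop()
        if pos is NULL:
            return None          # return FAIL
        if text.startswith("ab", pos):
            return pos + 2       # L4: return SUCCESS (end of match)

assert match_template("aaab", 0) == 4   # (?:a)+ gives back one 'a' for B
assert match_template("aab", 0) == 3
assert match_template("aaa", 0) is None  # B never matches
assert match_template("b", 0) is None    # A must match at least once
```

Note how the greedy repetition and the backtracking search share the same stack, exactly as in the template: no separate recovery points are recorded.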
The general stack layout of the /(?:A)+B/ pattern is the following after a successful match:

    NULL, [context A, input position]*, context A, context B

The star metacharacter means that the content inside the square brackets can be repeated from zero to any number of times. When a match fails, all context data is removed from the top of the stack, which satisfies the "remove the topmost context if appropriate" condition in Figure 6.

4.3 A Slightly Extended Example

The example in Figure 8 has a limitation: the A subpattern cannot match an empty string, because the loop would run forever in this case. Perl stops a repetition after an empty match, and PCRE-JIT must follow this behaviour to remain compatible. Since PCRE defines a different byte code for might be empty matches, this case can be easily recognized at compile time, and the code generator uses a different template shown in Figure 9.

An empty match can be detected by checking whether the input position has been changed after a successful match. If the input position keeps its value, the loop must be aborted. To do this comparison, we need to save the input position before the subpattern is matched. The stack cannot be used for saving this value however, since the subpattern can push any context data onto the stack, so the position of the saved value will be unknown later. Instead, PCRE-JIT uses global variables called slots. These slots are allocated separately for each byte code that requires them, and can only be modified by that particular byte code. The value of a slot cannot be changed by any child byte codes, since the matching process of a byte code cannot leave its own boundaries (there is no goto like operation in Perl). Therefore no byte code can restart the match of its parent byte code, and implicitly modify the slot. The only exception is recursions (a call like operation), which must preserve the appropriate slots.

There is another issue we need to solve: a single slot is not enough to store every input position for a repetition. Instead only the last input position is kept in the slot, and its previous values are preserved on the stack as seen in Figure 9.

The context data of a might be empty greedy plus repetition is quite similar to the non-empty repetition, except it contains the previous value of the private slot. Saving this slot before the first repetition might seem unnecessary. However, if the repetition itself is inside another repetition, e.g. /(a(aa)+a){3}/, we must preserve the value of the slot. The following line shows the stack layout of the extended example in the format we used for the original:

    NULL, [private slot, context A, input position]*, private slot, context A, context B
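The empty-match guard of this subsection -- save the input position in a private slot before each iteration, and stop the repetition when a successful match leaves the position unchanged -- can be sketched as follows. This is a toy model: match_subpattern stands for any subpattern matcher and returns the new input position, or None on failure.

```python
def greedy_plus(match_subpattern, text, pos):
    """Greedy + repetition with the empty-match guard: without the
    'new_pos == saved' check, a subpattern that can match an empty
    string (e.g. a?) would make the loop run forever."""
    end_positions = []
    while True:
        saved = pos                          # private slot <- input position
        new_pos = match_subpattern(text, pos)
        if new_pos is None:
            break                            # subpattern failed
        end_positions.append(new_pos)
        if new_pos == saved:
            break                            # empty match: stop repeating
        pos = new_pos
    return end_positions                     # candidate backtrack positions

# A = 'a?' can match an empty string, yet the loop terminates:
optional_a = lambda text, pos: pos + 1 if text.startswith("a", pos) else pos
assert greedy_plus(optional_a, "aab", 0) == [1, 2, 2]
```

The final empty match (position 2 repeated) mirrors Perl's behaviour of stopping a repetition after one empty match.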
5. TEST RESULTS

Our first measurement compares the number of backtracks performed by the PCRE interpreter and the JIT compiler. The interpreter is a traditional NFA based engine with aggressive tail recursion. We disabled certain optimizations in the JIT compiler, which would affect the following results, and are not supported by the interpreter. To perform this measurement under real-world conditions we decided to disable only the extra optimizations. A brief overview of these so called backtracking elimination techniques will be provided later in this section.

During the measurement, all 1020 HTTP content filtering patterns of the open source Snort [27] Intrusion Detection System (IDS) were matched against the HTTP stream of the top 10 web sites listed by Alexa [3]. Both the PCRE interpreter and JIT compiler searched for all occurrences of these Perl compatible regular expressions. The results are the following, where B means billion (10^9):

The interpreter attempted 30.57B byte code matches and performed 14.06B backtracks. The engine also set up 6.42B backtracking handler sites.

The code generated by PCRE-JIT attempted the same number of matches and performed 27.24B backtracks, of which 12.81B (44%) were empty. The execution entered the backtracking handler by a jump instruction 8.72B times (32%).

The characteristics of these engine types can be clearly seen: our static backtracking based engine performed far more (89%) backtracks compared to the NFA based one. However, if we exclude the empty handlers, which do nothing, the number is roughly the same (only 2.6% bigger). The key feature of static backtracking is also visible: even if the empty handlers are excluded, 40% of the backtracking handlers are still entered without a jump, and the remaining ones only need direct jumps.

We should also note that backtracking is not a rare event. On the contrary, the number of backtracks is close to half

We emphasize here that the purpose of the following measurement is comparing the raw backtracking performance. To exclude all other optimizations, we either need to heavily modify these engines or use custom tailored regular expressions. Since any modification affects the performance of an engine, we choose the latter, which allows precise measurement without any side effects. Finding the appropriate patterns, where all engines perform the same number of backtracks, is still a challenge as we see later.

Table 2 shows the runtimes and runtime ratios of our backtracking heavy, and elimination unfriendly patterns, where the ratio is calculated by dividing the actual runtime with the lowest runtime in each line. The ratio is always 1.00 for the fastest engine. These patterns are usually called pathological cases, because the engine is forced to do a lot of backtracks, and they are not part of the Snort benchmark. All measurements were performed on an Intel Xeon based 64 bit x86 system using CPU cycle counters. The PCRE interpreter is also included in this measurement to show the general performance progression of the JIT accelerated engines compared to the interpreted ones.

To improve readability of both pattern and input strings, a simple notation is introduced for their repeating parts: the strings are divided into fragments by numbers in superscript, and each number denotes the repetition count of the previous fragment. Delimiters are excluded. Thus, /(a)?²b+c³de/ represents the /(a)?(a)?b+cb+cb+cde/ pattern. In the following we explain the rationale behind each pattern:

Our first example in line 1 shows the effect of a backtracking elimination, which explains the lack of certain constructs in the following examples. Unlike Perl, JavaScript compatible engines must backtrack when a repetition matches an empty string. For example, when /(w?)+/ is matched to a single character long w string, the capturing group returns with a w in JavaScript and an empty string in Perl. This optimization substantially improves the matching performance of the pattern in line 1 for JavaScript, but also makes it unsuitable for our comparisons.

Patterns 2 and 3 focus on single character repetitions.
(46%) of the matching attempts, even for an NFA based
YARR is the fastest on these two patterns, closely followed
engine with aggressive tail recursion. Therefore the cost of
by PCRE-JIT. The matching algorithm of YARR makes it
backtracking is not negligible, which was the primary moti-
particularly efficient for single character repetitions as we
vation for inventing the static backtracking algorithm.
discussed in Section 4. The trade-off is lower efficiency in all
Before the next result is presented, we discuss the dif-
other cases.
ference between backtracking optimization and backtrack-
Patterns from 4 to 9 cover repetition inside repetition
ing elimination. Static backtracking is a backtracking op-
cases. YARR switches to interpreted execution for these
timization technique, because it accelerates the speed of
cases, which explains its decreased performance, so the fol-
backtracking. In contrast, backtracking elimination tries to
lowing discussion is focused on Irregexp and PCRE-JIT. In
reduce the number of backtracks in various ways. There
general, PCRE-JIT is the fastest on these patterns due to its
are dozens of such techniques, e.g. minimum match length
improved backtracking method, but some results show an in-
checks, expected character checks, or auto-possessive rewrit-
teresting behaviour. All participating engines optimize sin-
ing: /a+b/ is automatically replaced by /a++b/. Although
gle character repetitions, so pattern 4 is matched faster than
backtracking elimination can dramatically improve the per-
pattern 6. However, only Irregexp recognizes that pattern
formance, it is outside the scope of this paper, and we need
4 and 5 are essentially the same, because the inner, non-
to avoid its side effects during the following comparison.
capturing bracket can be ignored. We noticed that Irregexp
The purpose of the next measurement is comparing the
is particularly efficient in recognizing such cases. Patterns
raw backtracking performance of three JIT accelerated en-
7 and 9 show the matching overhead of capturing brackets.
gines. Since PCRE-JIT is not an improvement of an exist-
This overhead ratio is worse for JIT accelerated engines than
ing approach, and no other engine uses an AST based code
interpreted ones, so it is recommended to avoid unnecessary
generator at the moment, we choose Irregexp and YARR
capturing brackets when JIT acceleration is used.
for this comparison. All of these engines aim for very high
Patterns 10 to 12 show the matching performance of al-
performance, so they can represent the efficiency of their
ternatives. Again, Irregexp is the only engine that notices
corresponding matching algorithm described in Section 4.1.
Average < 1.5 x as fast 1.5 - 4.0 x as fast 4.0 - 8.0 x as fast 8.0 - ∞ x as fast
Target speedup % of % of total % of % of total % of % of total % of % of total
CPU (x as fast) patterns runtime patterns runtime patterns runtime patterns runtime
x86/32 6.84 24.53% 0.79% 63.49% 3.37% 7.46% 40.65% 4.51% 55.18%
x86/64 5.55 42.89% 1.36% 46.03% 5.67% 11.09% 92.97% 0.00% 0.00%
ARM-V7/32 6.76 65.85% 2.14% 21.00% 2.27% 6.87% 19.25% 6.28% 76.34%
ARM-THUMB2/32 7.24 65.85% 2.01% 22.87% 2.66% 4.42% 13.14% 6.87% 82.19%
PowerPC/32 5.48 72.82% 2.22% 17.08% 6.47% 10.11% 91.30% 0.00% 0.00%
PowerPC/64 5.55 73.01% 2.20% 16.49% 5.34% 10.50% 92.46% 0.00% 0.00%
SPARC/32 5.85 66.44% 2.04% 21.98% 3.17% 11.29% 92.08% 0.29% 2.71%
MIPS/32 7.59 65.95% 1.77% 18.55% 1.91% 8.44% 12.88% 7.07% 83.44%
Average 6.36 59.67% 1.82% 28.43% 3.86% 8.77% 56.84% 3.13% 37.48%
Std. dev. 0.79 15.92% 0.47% 15.95% 1.61% 2.26% 36.26% 3.14% 37.68%
Table 3: Matching performance improvement provided by the JIT compiler on all CPU architectures sup-
ported by SLJIT (runtime refers to interpreted runtime).
Table 4: Matching performance improvement provided by the JIT compiler on all CPU architectures sup-
ported by SLJIT (runtime refers to interpreted runtime).
that (?:a|a) is the same as a, which greatly improves its of the patterns are not accelerated at all; but these patterns
matching performance in case of pattern 10. However, if only take 2% of the total interpreted runtime! On the other
we tweak the pattern a little by adding another a to the hand, around 10% of the patterns take about 90% of the
first alternative, this optimization cannot be used anymore, total runtime, and the PCRE-JIT generated machine code
and PCRE-JIT becomes faster. YARR JIT supports these runs 4+ times faster for these patterns. In other words, the
patterns, and has similar speed to Irregexp. JIT compiler helps where help is most needed, and efficiently
The conclusion of this measurement is that each engine accelerates the matching performance of long running pat-
has its own strengths: YARR has better optimizations for terns.
character repetitions, Irregexp can generate an optimized The aim of our last measurement is to compare the match-
code path for several special cases, and PCRE-JIT has the ing performance of PCRE-JIT and Irregexp. Only those pat-
most efficient backtracking algorithm. Since these aproaches terns which produced the same matches are included in this
are independent, they can be used to further improve PCRE- measurement. The overall speedup was 1.63 times faster,
JIT in the future. but we observed large runtime differences again. Hence we
The following two measurements show the general perfor- created three groups: patterns with short (S) medium (M)
mance progression of PCRE-JIT on the Snort benchmark and long (L) runtime. A pattern belongs to a given group if
set, i.e. the results were affected by all optimizations, not both engines match it faster than the upper limits in the box
just static backtracking. The first measurement, which com- at the bottom of Figure 10. The pie chart in the left shows
pares the PCRE interpreter and PCRE-JIT on various CPU the proportion of each group. Not surprisingly, the group S
architectures, is shown in Table 4. To provide a fair com- is the largest, nearly 80% of the patterns belong here. How-
parison, the interpreter is optimized for speed by passing ever, the majority of the runtime is spent on matching the
-O3 to the GCC compiler. The second column contains the group L, as we can see in the middle chart. Similar to the
average speedup, which is about 5 to 7-fold. We noticed previous measurement, the strength of PCRE-JIT is match-
however that the gain varies greatly for different patterns, ing long running patterns: our engine was almost twice as
so we organized them into speedup classes. Each class con- fast as Irregexp in case of the group L. On the other hand,
tains those patterns whose speedup is between the minimum group S was matched 1.5 times slower by PCRE-JIT so there
and the maximum of the class. As we can see, nearly 60-70% is still room for improvement.
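The auto-possessive rewriting mentioned in this section (/a+b/ automatically replaced by /a++b/) is easiest to see in a toy matcher. The sketch below is illustrative only; the function name and backtrack accounting are our own, not PCRE's implementation. A greedy a+ gives back one consumed character per failed attempt at b, while a possessive a++ never gives anything back, so those backtracks are eliminated.

```python
def match_a_plus_b(subject, possessive=False):
    """Toy matcher for /a+b/ (or /a++b/) anchored at position 0.

    Returns (matched, backtracks). The greedy a+ first consumes as
    many 'a' characters as possible, then gives them back one by one
    until b matches; a possessive a++ never gives anything back.
    """
    i = 0
    while i < len(subject) and subject[i] == "a":
        i += 1                       # greedily consume the a's
    if i == 0:
        return (False, 0)            # a+ requires at least one 'a'
    backtracks = 0
    while True:
        if i < len(subject) and subject[i] == "b":
            return (True, backtracks)
        if possessive or i == 1:     # a++ keeps everything it consumed;
            return (False, backtracks)  # a+ cannot shrink below one 'a'
        i -= 1                       # give back one 'a' and retry b
        backtracks += 1
```

For a subject such as "aaaa" the greedy form performs three backtracks before failing, while the possessive form fails immediately; this is exactly the class of work that backtracking elimination removes and static backtracking accelerates.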
[Figure 10 charts, data recovered from the image: left pie chart, group proportions: S: 799 patterns (78.56%), M: 123 (12.09%), L: 95 (9.34%); middle chart, number of patterns and runtime share per group; right chart, relative speed of PCRE-JIT against Irregexp: 1.56x as slow for group S, 1.19x as fast for group M, 1.95x as fast for group L. Group limits: S: patterns with short runtime (0-20 ms), M: medium runtime (20-200 ms), L: long runtime (200-2000 ms).]

Figure 10: Comparison of PCRE-JIT and Irregexp on an x86-64 machine using the Snort pattern set.
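The pathological cases used in this section are pathological because the number of explored paths grows exponentially with the pattern and subject length. The toy counter below is a hypothetical sketch, not code from PCRE-JIT, Irregexp, or YARR: it matches n copies of (a|a) followed by b against the subject 'a' * n, which cannot match since the final b is missing, and counts the alternative branches a naive backtracking engine tries before giving up.

```python
def count_attempts(n):
    """Count alternative attempts of the toy pattern (a|a){n}b
    against 'a' * n with a naive backtracking matcher."""
    subject = "a" * n
    tries = 0

    def match(pos, groups_left):
        nonlocal tries
        if groups_left == 0:
            # all (a|a) groups matched; the pattern now requires b
            return pos < len(subject) and subject[pos] == "b"
        for alternative in ("a", "a"):   # two identical alternatives
            tries += 1
            if pos < len(subject) and subject[pos] == alternative:
                if match(pos + 1, groups_left - 1):
                    return True
        return False                     # backtrack to the caller

    match(0, n)
    return tries
```

Each additional (a|a) group doubles the work: count_attempts(n) returns 2**(n + 1) - 2, which is why even short pathological patterns can dominate a runtime comparison.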