
A Static Load-Balancing Scheme for Parallel XML Parsing on Multicore CPUs

Yinfei Pan (1), Wei Lu (2), Ying Zhang (1), Kenneth Chiu (1)


1. Department of Computer Science, State University of New York, Binghamton
2. Computer Science Department, Indiana University
[email protected], [email protected]

Abstract

A number of techniques to improve the parsing performance of XML have been developed. Generally, however, these techniques have limited impact on the construction of a DOM tree, which can be a significant bottleneck. Meanwhile, the trend in hardware technology is toward an increasing number of cores per CPU. As we have shown in previous work, these cores can be used to parse XML in parallel, resulting in significant speedups. In this paper, we introduce a new static partitioning and load-balancing mechanism. By using a static, global approach, we reduce synchronization and load-balancing overhead, thus improving performance over dynamic schemes for a large class of XML documents. Our approach leverages libxml2 without modification, which reduces development effort and shows that our approach is applicable to real-world, production parsers. Our scheme works well with Sun's Niagara class of CMT architectures, and shows that multiple hardware threads can be effectively used for XML parsing.

1. Introduction

By overcoming the problems of syntactic and lexical interoperability, the acceptance of XML as the lingua franca for information exchange has freed and energized researchers to focus on the more difficult (and fundamental) issues in large-scale systems, such as semantics, autonomic behaviors, service composition, and service orchestration [22]. The very characteristics of XML that have led to its success, however, such as its verbose and self-descriptive nature, can incur significant performance penalties [5, 10]. These penalties can prevent the acceptance of XML in use cases that may otherwise benefit.

A number of techniques have been developed to improve the performance of XML, ranging from binary XML [4, 25] to schema-specific parsing [6, 13, 21] to hardware acceleration [1]. Generally speaking, however, these approaches only speed up the actual parsing, and not the DOM construction. They are thus more applicable to SAX-style parsing.

In SAX-style parsing, the parsing results are communicated to the application through a series of callbacks. The callbacks essentially represent a depth-first traversal of the XML Infoset [26] represented by the document. SAX-style parsing has the benefit that the XML Infoset need never be fully represented in memory at any one time. Integrating SAX-style parsing into an application can be awkward, however, due to the need to maintain state between callbacks.

In DOM-style parsing, an in-memory tree data structure is constructed to represent the XML document. When fully constructed, the data structure is passed to the application, which can then traverse or store the tree. DOM-style parsing can be intuitive and convenient to integrate into applications, but can be very memory intensive, both in the amount of memory used and in the high overhead of memory management.

On the hardware front, manufacturers are increasingly utilizing the march of Moore's law to provide multiple cores on a single chip, rather than faster clock speeds. Tomorrow's computers will have more cores rather than exponentially faster clock speeds, and software will increasingly need to rely on parallelism to take advantage of this trend [20].

In this paper, we investigate parallel XML DOM parsing on a multicore computer, and present a static scheme for load-balancing the parsing load between cores. Our scheme is effective for up to six cores. In our previous work [14], we used a dynamic load-balancing scheme that assigned work to each core on-the-fly as the XML document was being parsed. While effective, we observe that most large XML documents (larger than a few hundred kilobytes) usually contain one or more large arrays, and their structures tend to be shallow (less than 10 deep at the array level). For these documents, static partitioning can be more efficient. Our targeted application area is scientific computing, but we believe our approach is broadly applicable, and works well for what we believe to be the most common subset of large XML documents. Even if the document does not contain a proper array, as long as the structure is shallow, our technique will work well.

Even though our technique will not currently scale to large numbers of cores, most machines currently have somewhere from 1-4 cores, and we can scale to that number. Furthermore, the alternative might be simply letting the extra cores go to waste while the application waits for the XML I/O to complete. We have also found that CMT can provide some amount of speedup.

Another advantage of our approach is that we focus on partitioning and integrating sequentially-parsed chunks, and thus can leverage off-the-shelf sequential parsers. We demonstrate this by using the production-quality libxml2 [24] parser without modification, which shows that our work applies to real-world parsers, not just research implementations.

Operating systems usually provide access to multiple cores via kernel threads (or LWPs). In this paper, we generally assume that threads are mapped to hardware threads to maximize throughput, using separate cores when possible. We consider further details of scheduling and affinity issues to be outside the scope of this paper.

    <root xmlns="www.indiana.edu">
      <foo id="0">hello</foo>
      <bar>
        <!-- comment -->
        <?name pidata ?>
        <a>world</a>
      </bar>
    </root>

[Figure 1 diagrams: both depict the tree rooted at root, with children xmlns, foo (attribute id, text "hello"), and bar (the comment, the pidata processing instruction, and element a with text "world").]

Figure 1: The top diagram shows the XML Infoset model of a simple XML document (the listing above). The bottom diagram shows the skeleton of the same document.
2. Overview

Concurrency could be used in a number of ways to improve XML parsing performance. One approach would be to pipeline the parsing process by dividing it into stages. Each stage would then be executed by a different thread. This approach may provide speedup, but software pipelining is often hard to implement well, due to synchronization and memory access bottlenecks, and to the difficulties of balancing the pipeline stages. More promising is a data-parallel approach. Here, the XML document would be divided into some number of chunks, and each thread would work on the chunks independently. As the chunks are parsed, the results are merged.

To divide the XML document into chunks, we could simply treat it as a sequence of characters, and then divide the document into equal-sized chunks, assigning one chunk to each thread. Any such structure-blind partitioning scheme, however, would require that each thread begin parsing from an arbitrary point in the XML document, which is problematic. Since an XML document is the serialization of a tree-structured data model (called XML Infoset [2, 26]) traversed in left-to-right, depth-first order, such a division will create chunks corresponding to arbitrary parts of the tree, and thus the parsing results will be difficult to merge back into a single tree. Correctly reconstructing namespace scopes and references will be especially challenging.

This thus leads us to the parallel XML parsing (PXP) approach [14]. We first use an initial pass, known as preparsing, to determine the logical tree structure of an XML document. This structure is then used to divide the XML document such that the divisions between the chunks occur at well-defined points in the XML grammar. This provides enough information so that each chunk can be parsed starting from an unambiguous state.

This seems counterproductive at first glance, since the primary purpose of XML parsing is to build a tree-structured data model (i.e., XML Infoset) from the XML document. However, the structural information needed to locate known grammatical points for chunk boundaries can be significantly smaller and simpler than that ultimately generated by a full XML parser, and does not need to include all the information in the XML Infoset data model. We call this simple tree structure, specifically designed for partitioning XML documents, the skeleton of the XML document, as shown in Figure 1. Once the preparsing is complete, we use the skeleton to divide the XML document into a set of well-defined chunks (with known grammatical boundaries), in a process we call task partitioning. The tasks are then divided into a collection of well-balanced sets, and multiple threads are launched to parse the collection, with one thread assigned to each task set.

Libxml2 provides a number of public functions to facilitate parsing XML fragments, and for gluing DOM trees together. We leverage these functions to do the actual full parsing and reintegration.

When the parsing is complete, we postprocess the DOM tree to remove any temporary nodes that were inserted to isolate concurrent operations from one another. These temporary nodes allow us to use libxml2 without modification. The entire process is shown in Figure 2. Further details on preparsing can be found in our previous work [14].
[Figure 2 diagram: XML document -> preparsing (skeleton, i.e., pre-parsing tree) -> task partitioning (XML chunks) -> parallel parsing (temporary DOM tree) -> post-parsing (final DOM tree).]

Figure 2: The PXP architecture first uses a preparser to generate a skeleton of the XML document. We next partition the document into chunks, and assign each task to a thread. (This stage is actually sequential, though the diagram has multiple arrows for it.) After the parallel parsing is complete, we postprocess the document to remove temporary nodes.

3. Task Partitioning

Once the skeleton has been computed, we partition the XML document into chunks, with each chunk representing one task. The goal of this stage is to find a set of tasks that can be partitioned into subsets such that each subset will require the same amount of time to parse. In other words, we want the partitioning to be well-balanced. Another goal is to maximize the size of each task, to reduce overhead costs. This problem is a variant of the bin packing problem. Note that these two goals are in opposition: smaller tasks make it easier to partition the work such that all subsets take the same amount of time.

Following the work in parallel discrete optimization algorithms [9, 8, 12] and parallel depth-first search [17], we consider two kinds of partitioning: static and dynamic. Static partitioning is performed before parallel processing begins, while dynamic partitioning is performed on-the-fly as the parallel processing proceeds, based on the run-time load status. Regardless of which scheme is used, the key goal of the task partitioning is to generate a set of optimally balanced workloads.

The conventional wisdom is that dynamic partitioning will lead to better load-balancing because the partitioning and load scheduling are based on run-time knowledge, which will be more accurate. Static partitioning uses a priori knowledge to estimate the time required to execute each task. Accurate estimates can be difficult to obtain in some cases, which will lead to poor balancing. Furthermore, static partitioning is usually a sequential preprocessing step, which will limit speedup according to Amdahl's law [3].

However, the potential benefits of dynamic partitioning are not free. The intensive communication and synchronization between the threads can incur significant performance overhead. In contrast, once static partitioning is completed, all threads can run independently. Hence, to achieve good performance, the choice of partitioning scheme should depend on the complexity of the problem. If the problem corresponds to a flat structure (e.g., an array), static partitioning is preferred over dynamic partitioning. However, if the problem is a highly irregular, unbalanced tree or graph structure, dynamic schemes become correspondingly more beneficial.

3.1. Subtree Tasks

A graph can be statically partitioned into a set of subgraphs using a number of different techniques [11]. A natural partitioning for XML is as a set of subtrees. We call the set of connected nodes above the subtrees the top tree, as shown in Figure 3(b).

Each subtree is considered a task. Our algorithm maintains a list of these tasks. At each iteration, the largest task is removed from the list, and the root of the corresponding subtree is moved to the top tree. Each child of the subtree then forms a new subtree, and is placed onto the list as a new task.

Using the skeleton, our static partitioning generates the set of subtrees starting from the root. We first parse the root node, and initialize the top tree with it. We then add all the immediate subtrees below the root to the task list.

We then proceed recursively in the following fashion. At every iteration, we remove the largest task from the list, parse the root of the corresponding subtree, and move it to the top tree. We then add the subtrees which were under this node back to the task list. A priority queue is used to efficiently maintain the largest task in the list. The effect is to grow the top tree down as we recursively create subtasks and add new DOM nodes to the bottom of the top tree. Note that the top tree consists of actual DOM nodes, while the skeleton is the abbreviated structure in a concise form. When a DOM node is added to the bottom of the top tree, its child nodes have not yet been created, but are merely "logical" nodes, which exist only in so far as they are in the skeleton. However, the DOM nodes in the top tree need to be created by parsing the corresponding XML with a full parser, which creates a problem. Because XML includes all descendants of a node within the lexical range of the node, as defined by the start- and end-tags, we cannot use a parser directly on the original text to generate a single, childless DOM node. Parsing the original text would also generate all the children nodes, leaving no work to be done during the actual parallel parsing stage.
[Figure 3 diagram: (a) the complete tree; (b) the same tree divided into a top tree and subtree tasks Task1-Task6; (c) the three levels: the top tree level, the placeholder level, and the task level.]

Figure 3: Logical structure of the top tree, the tasks' placeholders, and the tasks. In (a), we see the complete tree. In (b), we see the top tree and the subtree tasks. In (c), the placeholder level is also shown.

[Figure 4 diagram: an XML chunk (task) is parsed into a sub-DOM tree, which then replaces the corresponding placeholder node under the top tree.]

Figure 4: Attaching a subtree to the top tree is done by replacing the placeholder node in-place, after the subtree is parsed.

As a workaround to this lexical problem, we create a duplicate, but childless, element in a temporary string buffer, and parse this duplicate element to generate a childless DOM node. The duplicate element is created simply by concatenating the start- and end-tags of the original element, and omitting all of the content. For example, if the original element were <a><c1>...</c1><c2/></a>, then the temporary buffer would contain <a></a>. This procedure could be expensive if the top tree is large, but in practice we have found that most large documents are relatively shallow, and thus the top tree is small compared to the total size.
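As an illustration of this duplicate-element trick, the following sketch (ours, not the authors' code) builds the childless text for an element directly from the original document. The ElementRange fields are hypothetical byte offsets of the kind a preparser could record.

    // Build a childless duplicate of an element by concatenating its
    // start-tag and end-tag and dropping everything in between.
    #include <string>

    struct ElementRange {                        // assumed preparser output
        size_t start_tag_begin, start_tag_end;   // '<' .. '>' of the start-tag
        size_t end_tag_begin, end_tag_end;       // '<' .. '>' of the end-tag
    };

    std::string childless_copy(const std::string& xml, const ElementRange& e) {
        // e.g. "<a attr='v'>" + "</a>" for <a attr='v'>...</a>
        return xml.substr(e.start_tag_begin,
                          e.start_tag_end - e.start_tag_begin + 1)
             + xml.substr(e.end_tag_begin,
                          e.end_tag_end - e.end_tag_begin + 1);
    }

The returned string is then handed to an unmodified sequential parser, which yields exactly one DOM node with its attributes but no children.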
The "size" of a task should correspond to how much time it will take to complete. We currently use the number of characters in the corresponding XML chunk as an estimate of this. We have found so far that it works well, but a more sophisticated measure may be needed for some documents.

To prevent too many small subtree tasks from being generated, and thus increasing overhead, we terminate the recursion when the number of tasks is greater than a preset limit, currently chosen to be 100. When such a limit is reached, there will usually be enough tasks that they can be assigned to threads in a well-balanced manner.

As we proceed to recursively create new subtasks and grow the top tree down, we repeatedly add new DOM nodes to the top tree. Below any node in the top tree, the left-to-right ordering of its children will correspond to the order in which they were added to that node. In a straightforward, but incorrect, implementation, this order would correspond to the order in which the corresponding subtrees were removed from the task list, which would not correspond to the actual left-to-right ordering of the elements in the XML document.

Thus, to preserve the left-to-right ordering, we create placeholder children between the bottom of the top tree and the subtrees below it. These placeholders are added to the parent node immediately when it is added to the top tree, and thus the left-to-right ordering is known and can be preserved at that time. Each new subtree task is created with a pointer to its corresponding placeholder. These placeholders are removed as the top tree grows down, so that only the leaves of the top tree have placeholders. Even these are removed after the parallel parsing is complete, as described below. The entire process is diagrammed in Figure 5.

The placeholder nodes also serve to isolate concurrent operations from one another. Once subtree partitioning is complete, each subtree task will be executed by a separate thread. These threads will complete at different times, however, so if a thread directly adds the root of a subtree to the parent node, this will result in different left-to-right orderings depending on the completion order. The placeholder nodes avoid this problem because we can use libxml2 to directly replace the placeholder nodes with the actual subtree root nodes, in place, as shown in Figure 4. After we finalize the task partitioning, we see a logical three-level structure connecting the top tree, placeholders, and tasks, as Figure 3(c) shows. The nodes in the top tree are fully constructed, permanent DOM nodes; the nodes in the placeholder level are temporary DOM nodes used to connect the tasks to the proper left-to-right position within the permanent DOM nodes; and the nodes below this are logical nodes that currently exist only in the skeleton.
[Figure 5 diagram: successive snapshots of top tree generation, with placeholder nodes at the leaves being replaced in-place and new placeholder nodes created as task partitioning proceeds.]

Figure 5: As the parsing proceeds, the top tree is grown downwards. As each subtree is processed, its root node is parsed and moved to the top tree by an in-place replacement of the corresponding placeholder node, and additional placeholder nodes are created. At each iteration, the largest remaining subtree is chosen for processing.
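The subtree-task loop of Section 3.1 can be summarized as follows. This sketch is our own illustration (grow_top_tree and Task are hypothetical names), reusing the SkeletonNode type from the preparsing sketch above and assuming the chunk's character count serves as the size estimate, as described in the text.

    // Grow the top tree by repeatedly splitting the largest subtree task,
    // stopping at the preset task limit (100 in the paper).
    #include <queue>
    #include <vector>

    struct Task {
        const SkeletonNode* subtree;
        size_t size;   // characters in the chunk: the time estimate
        bool operator<(const Task& o) const { return size < o.size; } // max-heap
    };

    std::vector<Task> grow_top_tree(const SkeletonNode* root,
                                    size_t task_limit = 100) {
        std::priority_queue<Task> heap;
        std::vector<Task> done;                    // leaves that cannot split
        for (const auto& c : root->children)       // top tree starts at the root
            heap.push({c.get(), c->end - c->start});
        while (heap.size() + done.size() < task_limit && !heap.empty()) {
            Task biggest = heap.top();
            heap.pop();
            if (biggest.subtree->children.empty()) {  // childless: keep as-is
                done.push_back(biggest);
                continue;
            }
            // Here the real algorithm parses this subtree's root childless,
            // attaches it to the top tree, and creates its placeholders
            // (Section 3.1); each child then becomes a new task.
            for (const auto& c : biggest.subtree->children)
                heap.push({c.get(), c->end - c->start});
        }
        while (!heap.empty()) { done.push_back(heap.top()); heap.pop(); }
        return done;                               // the final task set
    }

The priority queue makes each extract-largest/split step logarithmic in the number of outstanding tasks, which matters little at a limit of 100 but keeps the stage cheap.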

3.2. Array Tasks

As mentioned earlier, we have found that most large XML documents consist of one or more long sequences of child elements. The sequences may have similar, or even identical, types of elements. We call such sequences arrays.

The subtree-based task partitioning scheme is not suitable for arrays, which may contain a large number of child nodes. Adding all of these to the task list would be prohibitively inefficient. Therefore, we use a simple heuristic to recognize arrays in the skeleton, and handle them differently, which also improves load balancing.

Our heuristic is to treat a node as an array if its number of children is greater than some limit. During the traversal, we check whether or not the number of children exceeds this limit. If so, we treat the children as an array. Our limit is based on the number of threads, and is currently set to 20 times the number of threads.

Once the current node is identified as an array, we divide its elements into equal-sized, contiguous ranges. The size is chosen such that the number of ranges in the array is equal to the number of threads. Each range is treated as a task. This differs from the subtree tasks, in that each task is now a forest rather than a subtree. These tasks are added to a FIFO queue as they are generated, which is used during task assignment as described below. A code sketch of this range split follows Figure 6 below.

For subtree tasks, we create one node as the placeholder for the entire subtree. When the task finishes, we simply replace the placeholder with the actual DOM node. For array tasks, we must create a separate placeholder for each range. This is because otherwise, each thread would attempt to add the child nodes in its range concurrently to the parent node, resulting in race conditions. Thus, a range placeholder is created for each array task, to isolate the concurrent operations of each thread from one another.

When a thread finishes a subtree task, it can immediately replace the placeholder node with the actual root node of the subtree. When a thread finishes an array task, however, the result is a forest of subtrees that must be added to the actual array element, which is in the position of the parent of the range placeholder (one level higher), rather than in the position of the range placeholder itself. This operation cannot occur in parallel, because otherwise multiple threads would attempt to add child nodes at the same time, as shown in Figure 6.

[Figure 6 diagram: two range placeholders, PlaceHolder(1) and PlaceHolder(2), holding Nodelist(1) and Nodelist(2); appending Nodelist(1) then Nodelist(2) gives the correct order, while Nodelist(2) then Nodelist(1) gives the wrong order.]

Figure 6: When an array task is completed by a thread, it cannot directly add its completed forest to the corresponding array node, since the tasks may complete in the wrong order.
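The array heuristic and range split described above can be sketched as follows (our illustration, not the authors' code; the helper names are hypothetical, while the factor of 20 and the one-range-per-thread rule come from the text).

    // Detect an array node and split its children into one equal-sized,
    // contiguous range per thread.
    #include <cstddef>
    #include <vector>

    struct Range { size_t first_child, last_child; };  // inclusive indices

    bool is_array(size_t num_children, size_t num_threads) {
        return num_children > 20 * num_threads;        // Section 3.2 heuristic
    }

    std::vector<Range> split_array(size_t num_children, size_t num_threads) {
        // Precondition: is_array() held, so every range is non-empty.
        std::vector<Range> ranges;
        size_t base = num_children / num_threads;
        size_t extra = num_children % num_threads;     // spread the remainder
        size_t next = 0;
        for (size_t t = 0; t < num_threads; ++t) {
            size_t len = base + (t < extra ? 1 : 0);
            ranges.push_back({next, next + len - 1});
            next += len;
        }
        return ranges;
    }

Because the heuristic guarantees at least 20 children per thread, every range is substantial, which is what makes arrays the easily balanced part of the workload.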
[Figure 7 diagram: (a) an XML document containing two arrays; (b) the top tree, with subtree placeholders (PH) and range placeholders RPH(1)-RPH(4) for Array 1 and RPH(5)-RPH(8) for Array 2; (c) the task partitioning into array tasks task1-task8 and subtree tasks task9-task11; (d) the table of array tasks mapping RPH(1)-RPH(8) to task1-task8.]

Figure 7: In (a), we see the original tree structure. In (b), range placeholders have been added. In (c), we see how each range corresponds to a task. The table in (d) is used to ensure that the ranges are added to the corresponding array element in the correct order.

Thus, we maintain an additional table for each array to record the proper ordering of the child range placeholders, and we traverse this table sequentially after the end of the parallel parsing stage. Figure 7(a) shows an XML document with two large arrays; its top tree with placeholders and its task partitioning are shown in (b) and (c). RPH means range placeholder; PH means a subtree task placeholder.

[Figure 8 diagram: four node lists, Nodelist1 through Nodelist4, are appended in order to the array node; the intervening placeholder links, marked with X, are deleted.]

Figure 8: Each array task generates a forest of subtrees, which then must be added back to the associated array node. Intervening links, marked with X, are deleted.

4. Task Assignment
Once we have partitioned the XML document into tasks, we must assign them to threads in a balanced manner, which is the goal of the task assignment stage. We have two sets of tasks: the subtree tasks and the array tasks.

We first assign the array tasks in the FIFO queue to threads in a round-robin fashion. This is because for most large XML documents the arrays are the largest part and also the most easily balanced part. As we assign the tasks, we keep track of the current workload of each thread. Because we have carefully chosen the range sizes, each thread is guaranteed to have exactly one array task per array. Also, the FIFO queue maintains the order of the ranges of each array, which will be used in the post-processing stage. After the array tasks have been assigned, we then assign the subtree tasks to the threads. Each subtree task is assigned to the thread that currently has the least workload.

Note that each thread's task set is maintained in a private queue. This eliminates contention during the actual parsing phase.
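A compact sketch of this greedy assignment might look as follows (ours, with hypothetical names; sorting the subtree tasks largest-first is our own refinement, the standard longest-processing-time heuristic, and is not specified by the paper).

    // Round-robin the array tasks, then give each subtree task to the
    // currently least-loaded thread.
    #include <algorithm>
    #include <deque>
    #include <vector>

    struct AnyTask { size_t size; /* chunk offsets, placeholder, ... */ };

    void assign_tasks(const std::deque<AnyTask>& array_tasks,   // FIFO order
                      std::vector<AnyTask> subtree_tasks,
                      std::vector<std::vector<AnyTask>>& per_thread) {
        size_t n = per_thread.size();
        std::vector<size_t> load(n, 0);
        size_t t = 0;
        for (const AnyTask& a : array_tasks) {       // round-robin
            per_thread[t].push_back(a);
            load[t] += a.size;
            t = (t + 1) % n;
        }
        std::sort(subtree_tasks.begin(), subtree_tasks.end(),  // LPT heuristic
                  [](const AnyTask& x, const AnyTask& y) { return x.size > y.size; });
        for (const AnyTask& s : subtree_tasks) {     // least-loaded thread
            size_t min_t =
                std::min_element(load.begin(), load.end()) - load.begin();
            per_thread[min_t].push_back(s);
            load[min_t] += s.size;
        }
    }

Since assignment happens entirely before the threads start, the per-thread queues need no locking at all during parsing.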
5. Parallel Parsing

After the task assignment is complete, each thread starts parsing the tasks in its private queue. Each task has a corresponding placeholder node that was created during the top tree generation process, so the tasks inherit the context (DTD, namespaces, etc.) of their parent nodes. Thus, each task is parsed within the context of its corresponding placeholder. In our implementation, we used the libxml2 function xmlParseInNodeContext(). This function can parse any "well-balanced chunk" of an XML document within the context of the given node. A well-balanced chunk is defined as any valid content allowed by the XML grammar. Since the XML fragments generated by our task partitioning are well-balanced, we can use this function to parse each task.

During the parallel parsing, after each task is completed, its corresponding subtree needs to be inserted into the right place under the top tree. To do that, for each subtree task, we can simply replace its corresponding placeholder with its root node using the libxml2 function xmlReplaceNode(). However, for array tasks, which produce a forest of subtrees under the range placeholder node, we clearly cannot simply replace the placeholder, and, as discussed earlier in Figure 6, race conditions may exist if we remove the placeholders and add back lists of nodes. Therefore, guided by the table of array tasks shown in Figure 7(d), we remove the placeholders of array tasks sequentially during the post-parsing stage, and use the libxml2 function xmlAddChildList() to add back the node lists in their correct order. This is shown in Figure 8.
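A sketch of the per-task parsing step using these libxml2 calls might look as follows. This is our illustration of the approach, not the authors' code: the ParseTask fields are hypothetical, error handling is omitted, and only the libxml2 functions named in the text (plus the standard xmlUnlinkNode/xmlFreeNode) are used.

    // Parse one chunk in the context of its placeholder, then splice the
    // result into the tree with libxml2's public functions.
    #include <libxml/parser.h>
    #include <libxml/tree.h>

    struct ParseTask {
        xmlNodePtr placeholder;   // temporary node created during partitioning
        const char* chunk;        // well-balanced XML fragment
        int chunk_len;
        bool is_subtree;          // subtree task vs. array (range) task
        xmlNodePtr result;        // forest produced by an array task
    };

    void parse_task(ParseTask& t) {
        xmlNodePtr list = NULL;
        // Parse the well-balanced chunk in the namespace/DTD context of the
        // placeholder's position in the tree.
        xmlParseInNodeContext(t.placeholder, t.chunk, t.chunk_len, 0, &list);
        if (t.is_subtree) {
            // One in-place replacement per private placeholder, so threads
            // never touch the same link.
            xmlReplaceNode(t.placeholder, list);
            xmlFreeNode(t.placeholder);
        } else {
            t.result = list;      // forests are spliced in sequentially later
        }
    }

    // Sequential post-parsing, one array at a time in table order:
    //   xmlUnlinkNode(placeholder); xmlFreeNode(placeholder);
    //   xmlAddChildList(array_node, task.result);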
    <MoleculeType>
      <Name>1kzk</Name>        <!-- attribute elements -->
      <Radius> </Radius>
      <atom_array_1>
        <atom> ... </atom>     <!-- the 1st data array -->
        ...
      </atom_array_1>
      <atom_array_2>
        <atom> ... </atom>     <!-- the 2nd data array -->
        ...
      </atom_array_2>
    </MoleculeType>

Figure 9: The structure of the test XML documents.

[Figure 10 diagram: stacked-percentage chart of running time (preparsing, task partitioning, parallel parsing, post-parsing) versus thread number.]

Figure 10: Performance breakdowns.

6. Performance Results
Our experiments were run on a Sun Fire T1000 machine, with 6 cores and 24 hardware threads (CMT). We observed that most large XML documents, particularly in scientific applications, are relatively broad rather than deep; they typically contain one or two large arrays, and their structure tends to be shallow. Hence we selected a large XML file containing molecular information, structured as shown in Figure 9, representing the typical structural shape of XML documents in scientific applications. This was based on XML documents obtained from the Protein Data Bank [19]. It consists of two large arrays representing the molecule data, as well as a couple of elements for the molecule attributes. To obtain the different sizes, the molecule data part of the documents was repeated.

Every test is run ten times to get the average time, and the measurement of the first run is discarded, so as to measure performance with the file data already cached, rather than being read from disk. The programs are compiled with Sun Workshop 5.2 CC with the option -O, and the libxml2 library we are using is version 2.6.16.
During our initial experiments, we noticed poor speedup during a number of tests that should have performed well. We attributed this to lock contention in malloc(). To avoid this, we wrote a simple, thread-optimized allocator around malloc(). This allocator maintains a separate pool of memory for each thread. Thus, as long as the allocation request can be satisfied from this pool, no locks need to be acquired. To fill the pool initially, we simply run the test once, then free all memory, returning it to each pool. Our tests with straight libxml2 use our allocator, since the results of straight libxml2 are better with our allocator than without it.

Our allocator is intended simply to avoid lock contention, since we wanted to focus on the parsing itself. There is significant work on multi-threaded memory allocation that can be used to provide a more sophisticated solution [15].
6.1. Performance Breakdown

Our implementation performs the stages as described in this paper, and thus we can measure the time of each stage. The sequential stages are preparsing, task partitioning, and post-parsing; the parallel stage is the parallel parsing itself, which is done by all the threads in parallel. Our performance breakdown experiment was done on our test XML document sized to 18M bytes. The result is shown in Figure 10, in which the different gray levels represent the percentage of the running time of each stage. We tested from 2 to 32 threads, but show only the even-numbered results in Figure 10 to make the graph clearer.

The most immediate feature to note is that the preparsing is by far the most time-consuming part of the sequential stages, and the static partitioning stage is not a significant limit to the parallelism. Thus, if we wish to address the effects of Amdahl's law, we will need to target the preparsing. As a general trend up to 24 threads, we observed that the percentage spent on preparsing grows from 13% to 53%, the percentage on task partitioning grows from 0.17% to 0.71%, and the percentage on post-processing grows from 0.16% to 0.73%. This means that as the number of threads increases, the sequential stages take an increasing percentage of the total time, and can obviously cause the performance to degrade. Meanwhile, the time cost of the parallel parsing stage drops from 87% to 46%. When the number of participating threads exceeds 24, we run out of hardware threads, requiring the OS to start context-switching threads. This causes the percentage of parallel parsing to increase suddenly from 46% at 24 threads to 52% at 26 threads, followed by a minor reduction at each increased measurement, down to 48% at 32 threads. As the number of threads increases, the post-processing stage grows from 0.16% at 2 threads to 0.74% at 32 threads.

[Figure 11 diagram: speedup versus number of threads for four measurements: the static approach with and without the sequential stages, and the dynamic approach with and without preparsing.]

Figure 11: Speedup graph from 2 to 32 threads.

[Figure 12 diagram: efficiency versus number of threads for the same four measurements.]

Figure 12: Efficiency graph from 2 to 32 threads.

6.2. Speedup and Efficiency Analysis

To show the benefit of the static approach on this kind of shallow-structured XML document, we compare the static approach with the dynamic approach introduced in our earlier work [14]. Referring to Figure 11, we performed four types of measurements covering both the static and the dynamic parallel parsing approaches. The tests were conducted on our test XML document sized to 18M bytes, and to better explore the cause of the performance limitations of each approach, we plot two lines per approach: one is the total speedup, and the other is the speedup not counting the sequential stages. Speedup in this case is computed against the time obtained by a pure libxml2 parse, without any preparsing or other steps needed only for PXP.

For the static approach, the sequential stages include preparsing, task partitioning and assignment, and post-processing; for the dynamic approach, the sequential stage is just preparsing. It appears that every six threads, the speedup has a distinct drop. This is due to the fact that there are only six cores on the machine: every six threads, the number of hardware threads per core must increase. To better see this speedup degradation pattern, we generated the efficiency graph shown in Figure 12. The efficiency is defined as the speedup divided by the number of threads. With the efficiency graph, we can clearly see the performance degradation.

Since the dynamic load-balancing is obtained by run-time work stealing, which incurs intensive communication and synchronization costs as the number of threads increases, it is not scalable to the number of cores, as Figure 11 shows. Without preparsing, it reaches its maximum speedup of 6.1 with 16 threads. After that point, it drops to a lowest value of 0.47 at 29 threads and remains below a speedup of one as the number of threads grows. Its efficiency likewise drops to 0.02 in the 32-thread case.

These two figures show that although the dynamic approach has proved to be more flexible and suitable for parallel parsing of complex XML documents, for large XML documents with a simple, array-like structure the static approach is more scalable than the dynamic approach. Although more sophisticated dynamic load-balancing technologies [12] exist to improve scalability, for shallow-structured documents the static scheme should be hard to challenge.

To better understand the pros and cons of the static approach versus the dynamic approach, we also ran tests on complex-structured XML. We generated a deep and irregularly structured XML document with a size of 19M bytes. The test results in Figure 13, however, show that in such a case the speedup of the static approach is much worse than that of the dynamic approach. The reason is that for complex XML documents it is harder to statically achieve a balanced load, and thus, as the number of threads increases, load imbalance becomes more serious.

In addition, we also note that the near-linear speedup of the parallel parsing when not including the sequential portion means that we can make significant improvements in speed by targeting preparsing, which we have shown by the performance breakdown to account for the vast majority of the sequential time.
[Figure 13 diagram: speedup versus number of threads (2 to 6) for the static and dynamic approaches on the deep and irregularly structured XML document.]

Figure 13: Speedup, not including sequential stages, on the complex XML document.
mization problems. Hence for XML documents our static
[Figure 14 diagram: speedup versus document size (KB, up to about 9000) for 2, 4, and 8 threads and the non-parallel baseline, on various sizes of simple.xml.]

Figure 14: Speedup with various sizes of simple.xml.
6.3. Scalability Analysis

To further study the scalability of parallel XML parsing with the static load-balancing scheme, we vary the size of a different test XML document to see how the speedup varies with document size. This test document contained a simple linear array structure. We see from Figure 14 that our implementation scales well across a range of document sizes. At the smaller sizes, however, our technique begins to show less value, with no appreciable benefit until the file size exceeds 256KB. A test file of 346,293 bytes shows speedups of 1.5, 2.5, and 3.7 for 2, 4, and 8 threads, respectively. Since our target so far has been larger XML documents, we found this to be acceptable, but may seek to address it in future work.

7. Related Work

The static partitioning approach introduced in this paper is similar to the "search-frontier splitting" policy used in [18, 7], which solves discrete optimization problems by parallel depth-first search. Under that policy, the algorithm first breadth-first searches the tree until reaching a cutoff depth, called the "frontier"; each subtree under the cutoff depth is then treated as a task for the threads. For XML documents, however, simply defining a cutoff frontier would be less effective, since XML documents tend to have a more regular shape than the search spaces of those discrete optimization problems. Moreover, through preparsing we are able to predict the size of each subtree, whereas this kind of prediction is very hard for discrete optimization problems. Hence, for XML documents, our static task partitioning algorithm is more effective at finding a load-balanced task set. The work by Reinefeld [18] also shows that a static task-partitioning scheme is more scalable than a dynamic scheme due to its lower communication and synchronization costs, which is consistent with our results.

There are a number of approaches that try to address the performance bottleneck of XML parsing. Typical software solutions include lazy parsing [16] and schema-specific parsing [6, 13, 21]. Schema-specific parsing leverages XML schema information, from which a specialized parser (automaton) is built to accelerate the XML parsing. For XML documents conforming to the schema, schema-specific parsing runs very quickly, whereas for other documents an extra penalty is paid. Most closely related to our work in this paper is lazy parsing, because it also needs a skeleton-like structure of the XML document for lazy evaluation. That is, a skeleton is first built from the XML document to indicate the basic tree structure; thereafter, based on the user's access requirements, the corresponding piece of the XML document is located by looking up the skeleton and is then fully parsed. However, the purposes of lazy parsing and parallel parsing are totally different, so the structure and the use of the skeleton in the two algorithms differ fundamentally from each other. Hardware-based solutions [1, 23] are also promising, particularly in the industrial arena, but to the best of our knowledge, there is no prior work leveraging the data-parallelism model as PXP does.

8. Conclusion and Future Work

The advent of multicore machines provides a unique opportunity to improve XML parsing performance. We have shown that a production-quality parser can easily be adapted to parse in parallel, using a fast preparsing stage to provide structural information, namely the skeleton of an XML document. Based on the skeleton, load-balancing then becomes the key to a scalable parallel algorithm.
As far as general parallel algorithms are concerned, static partitioning schemes and dynamic schemes have their own advantages and disadvantages. For parallel XML parsing, however, we show that the static partitioning scheme introduced in this paper is more scalable and efficient than the dynamic scheme on multicore machines, for what we believe to be common XML documents. This is due to the fact that most large XML documents usually contain array-like structures, and their document structures tend to be shallow. Our static load-balancing scheme leverages these characteristics and is designed to quickly locate and partition the array structures. Although it introduces some sequential processing, the highly independent parallelism that can then be obtained minimizes the synchronization cost during the parallel parsing. Furthermore, a significant pragmatic benefit of the static load-balancing scheme is that it can directly use off-the-shelf XML parsers without requiring any modification.

The limitation of the static load-balancing scheme is that it may not generalize to XML documents with arbitrary shapes, and it may even fail for some deeply-structured XML documents. Hence, a hybrid solution, which can provide both static and dynamic load-balancing control, will be interesting. Dynamic partitioning could also be improved by sophisticated work-stealing, scheduling policies, and lock-free structures. Also, our experiments show that with a greater number of cores, the sequential preparsing stage becomes the major limitation to scalability.
References

[1] Datapower. http://www.datapower.com/.
[2] N. Abu-Ghazaleh and M. J. Lewis. Differential deserialization for optimized SOAP performance. In SC '05: International Conference for High Performance Computing, Networking, and Storage, Seattle, WA, November 2005.
[3] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Joint Computer Conference, pages 483-485, 1967.
[4] K. Chiu, T. Devadithya, W. Lu, and A. Slominski. A binary XML for scientific applications. In Proceedings of e-Science 2005. IEEE, 2005.
[5] K. Chiu, M. Govindaraju, and R. Bramley. Investigating the limits of SOAP performance for scientific computing. In HPDC '02: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC-11 2002), page 246. IEEE Computer Society, 2002.
[6] K. Chiu and W. Lu. A compiler-based approach to schema-specific XML parsing. In The First International Workshop on High Performance XML Processing, 2004.
[7] M. Furuichi, K. Taki, and N. Ichiyoshi. A multi-level load balancing scheme for OR-parallel exhaustive search programs on the Multi-PSI. In PPOPP '90: Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, pages 50-59, New York, NY, USA, 1990. ACM Press.
[8] A. Grama and V. Kumar. Parallel processing of combinatorial optimization problems. ORSA Journal of Computing, 1995.
[9] A. Y. Grama and V. Kumar. State of the art in parallel search techniques for discrete optimization problems. IEEE Transactions on Knowledge and Data Engineering, 11, 1999.
[10] M. R. Head, M. Govindaraju, R. van Engelen, and W. Zhang. Grid scheduling and protocols: benchmarking XML processors for applications in grid web services. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 121, New York, NY, USA, 2006. ACM Press.
[11] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. In Supercomputing, 1996.
[12] V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60-79, 1994.
[13] W. M. Lowe, M. L. Noga, and T. S. Gaul. Foundations of fast communication via XML. Ann. Softw. Eng., 13(1-4), 2002.
[14] W. Lu, K. Chiu, and Y. Pan. A parallel approach to XML parsing. In The 7th IEEE/ACM International Conference on Grid Computing, Barcelona, September 2006.
[15] M. M. Michael. Scalable lock-free dynamic memory allocation. In PLDI '04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35-46, New York, NY, USA, 2004. ACM Press.
[16] M. L. Noga, S. Schott, and W. Lowe. Lazy XML processing. In DocEng '02: Proceedings of the 2002 ACM Symposium on Document Engineering, 2002.
[17] V. N. Rao and V. Kumar. Parallel depth first search, part I: implementation. Int. J. Parallel Program., 16(6):479-499, 1987.
[18] A. Reinefeld. Scalability of massively parallel depth-first search. In Parallel Processing of Discrete Optimization Problems, volume 22 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 305-322, 1995.
[19] J. L. Sussman, E. E. Abola, N. O. Manning, and J. Prilusky. The Protein Data Bank: current status and future challenges.
[20] H. Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30, 2005.
[21] R. van Engelen. Constructing finite state automata for high performance XML web services. In Proceedings of the International Symposium on Web Services (ISWS), 2004.
[22] R. van Engelen and K. Gallivan. The gSOAP toolkit for web services and peer-to-peer computing networks. In The 2nd IEEE International Symposium on Cluster Computing and the Grid, Berlin, Germany, May 2002.
[23] J. van Lunteren, J. Bostian, B. Carey, T. Engbersen, and C. Larsson. XML accelerator engine. In The First International Workshop on High Performance XML Processing, 2004.
[24] D. Veillard. Libxml2 project web page. http://xmlsoft.org/, 2004.
[25] W3C. XML binary characterization properties. http://www.w3.org/TR/xbc-properties/.
[26] W3C. XML information set (second edition). http://www.w3.org/TR/xml-infoset/, 2003.
