A Static Load-Balancing Scheme for Parallel XML Parsing on Multicore CPUs
Figure 2: The PXP architecture first uses a preparser to generate a skeleton of the XML document. We next partition the document into chunks (tasks) and assign each task to a thread. (This stage is actually sequential, though the diagram has multiple arrows for it.) After the parallel parsing is complete, we postprocess the document to remove temporary nodes.
Figure 3: Logical structure of the top tree, the task placeholders, and the tasks. In (a), we see the complete tree. In (b), we see the top tree and the subtree tasks. In (c), the placeholder level is also shown.
Figure 4: Attaching a subtree to the top tree is done by replacing the placeholder node in-place, after the subtree is parsed. (Figure labels: top tree, placeholder node, replacement, parsing, XML document (task), sub-DOM tree.)

DOM node. Parsing the original text would also generate all the children nodes, leaving no work to be done during the actual parallel parsing stage.

As a workaround to this lexical problem, we create a duplicate, but childless, element in a temporary string buffer, and parse this duplicate element to generate a childless DOM node. The duplicate element is created simply by concatenating the start- and end-tags of the original element and omitting all of the content. For example, if the original element were <a><c1>...</c1><c2/></a>, then the temporary buffer would contain <a></a>. This procedure could be expensive if the top tree is large, but in practice we have found that most large documents are relatively shallow, and thus the top tree is small compared to the total size.
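To make this concrete, the following minimal sketch (hypothetical code, not the paper's implementation) builds the childless duplicate from the start- and end-tag byte ranges, which we assume are available from the preparser's skeleton, parses it with libxml2's xmlReadMemory(), and copies the resulting element into the main document with xmlDocCopyNode():

```c
#include <stdlib.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* Build "<a ...></a>" from the original element's start- and end-tag
 * ranges, parse it, and return a childless copy owned by main_doc.
 * All parameter names are illustrative. */
xmlNodePtr parse_childless_copy(xmlDocPtr main_doc,
                                const char *start_tag, size_t start_len,
                                const char *end_tag, size_t end_len)
{
    char *buf = malloc(start_len + end_len + 1);
    if (buf == NULL)
        return NULL;
    memcpy(buf, start_tag, start_len);           /* e.g. "<a>"  */
    memcpy(buf + start_len, end_tag, end_len);   /* e.g. "</a>" */
    buf[start_len + end_len] = '\0';

    xmlDocPtr tmp = xmlReadMemory(buf, (int)(start_len + end_len),
                                  NULL, NULL, 0);
    free(buf);
    if (tmp == NULL)
        return NULL;

    /* Copy the childless root element into the main document, then
     * discard the temporary document. */
    xmlNodePtr copy = xmlDocCopyNode(xmlDocGetRootElement(tmp), main_doc, 1);
    xmlFreeDoc(tmp);
    return copy;
}
```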
The “size” of a task should correspond to how much time it will take to complete. We currently use the number of characters in the corresponding XML chunk as an estimate of this. We have found so far that this works well, but a more sophisticated measure may be needed for some documents. To prevent too many small subtree tasks from being generated, and thus increasing overhead, we terminate the recursion when the number of tasks exceeds a preset limit, currently chosen to be 100. When this limit is reached, there will usually be enough tasks that they can be assigned to threads in a well-balanced manner.
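The paper does not spell out the assignment step itself; as one plausible reading, the sketch below (illustrative names throughout) uses the character count as the task size and assigns tasks, largest first, to the currently least-loaded thread — the classic LPT greedy heuristic:

```c
#include <stddef.h>

/* Illustrative task record: a chunk of the document, whose "size" is
 * simply its character count. */
typedef struct {
    size_t offset;   /* where the chunk starts in the document  */
    size_t length;   /* character count; the task-size estimate */
} Task;

/* Assign tasks (assumed sorted by descending length) to threads by
 * always giving the next task to the least-loaded thread. */
void assign_tasks(const Task *tasks, size_t ntasks,
                  int *owner, size_t *load, int nthreads)
{
    for (int t = 0; t < nthreads; t++)
        load[t] = 0;
    for (size_t i = 0; i < ntasks; i++) {
        int min = 0;
        for (int t = 1; t < nthreads; t++)
            if (load[t] < load[min])
                min = t;
        owner[i] = min;                 /* thread that parses task i      */
        load[min] += tasks[i].length;   /* account for its estimated cost */
    }
}
```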
As we proceed to recursively create new subtasks and grow the top tree down, we repeatedly add new DOM nodes to the top tree. Below any node in the top tree, the left-to-right ordering of its children corresponds to the order in which they were added to that node. In a straightforward, but incorrect, implementation, this order would correspond to the order in which the corresponding subtrees were removed from the task list, which would not correspond to the actual left-to-right ordering of the elements in the XML document.

Thus, to preserve the left-to-right ordering, we create placeholder children between the bottom of the top tree and the subtrees below it. These placeholders are added to the parent node immediately when it is added to the top tree, and thus the left-to-right ordering is known and can be preserved at that time. Each new subtree task is created with a pointer to its corresponding placeholder. These placeholders are removed as the top tree grows down, so that only the leaves of the top tree have placeholders. Even these are removed after the parallel parsing is complete, as described below. The entire process is diagrammed in Figure 5.

The placeholder nodes also serve to isolate concurrent operations from one another. Once subtree partitioning is complete, each subtree task will be executed by a separate thread. These threads will complete at different times, however, so if a thread directly added the root of a subtree to the parent node, the result would be different left-to-right orderings depending on the completion order. The placeholder nodes avoid this problem because we can use libxml2 to directly replace the placeholder nodes with the actual subtree root nodes, in place, as shown in Figure 4. After we finalize the task partitioning, we see a logical three-level structure connecting the top tree, placeholders, and tasks together, as Figure 3(c) shows. The nodes in the top tree are fully constructed, permanent DOM nodes; the nodes in the placeholder level are temporary DOM nodes used to connect the tasks to the proper left-to-right position within the permanent DOM nodes; and the nodes below this are logical nodes that currently exist only in the skeleton.
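Since libxml2's xmlReplaceNode() performs exactly this kind of in-place substitution, the attachment step can be as small as the following sketch (the wrapper function and its parameter names are ours, not the paper's):

```c
#include <libxml/tree.h>

/* Splice a parsed subtree into the top tree: xmlReplaceNode() unlinks
 * the placeholder, puts subtree_root in its exact left-to-right
 * position, and returns the now-unlinked placeholder for disposal. */
void attach_subtree(xmlNodePtr placeholder, xmlNodePtr subtree_root)
{
    xmlNodePtr old = xmlReplaceNode(placeholder, subtree_root);
    xmlFreeNode(old);   /* the temporary placeholder is no longer needed */
}
```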
Figure 5: As the parsing proceeds, the top tree is grown downwards (panels: top tree generation; task partitioning). As each subtree is processed, its root node is parsed and moved to the top tree by an in-place replacement of the corresponding placeholder node, and additional placeholder nodes are created. At each iteration, the largest remaining subtree is chosen for processing.
Figure 7: In (a), we see the original tree structure. In (b), range placeholders have been added. In (c), we
see how each range corresponds to a task. The table in (d) is used to ensure that the ranges are added to the
corresponding array element in the correct order.
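The paper describes this mechanism only through the figure; as a sketch of one way to realize the ordering table in (d), each array element could record its range placeholders (RPH nodes) in document order, so that each task's range is attached at the correct slot (all names here are hypothetical):

```c
#include <libxml/tree.h>

/* Hypothetical ordering table for range placeholders (cf. Figure 7(d)):
 * for each array element split across tasks, keep the RPH nodes in
 * document order, so each task's range lands in the right position. */
typedef struct {
    xmlNodePtr array_node;   /* the array element in the top tree   */
    xmlNodePtr *rph;         /* range placeholders, left to right   */
    size_t     nranges;      /* one entry per task-sized range      */
} RangeTable;
```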
Figure 9 (fragment): the test document is a <MoleculeType> element containing large <atom_array_1> and <atom_array_2> arrays of <atom> elements (the 1st and 2nd data arrays).
6. Performance Results
Figure 10: Performance breakdowns (percentage of total running time per stage, for each thread count).
Our experiments were run on a Sun Fire T1000 machine, with 6 cores and 24 hardware threads (CMT). We observed that most large XML documents, particularly in scientific applications, are relatively broad rather than deep: they typically contain one or two large arrays, and the structure tends to be shallow. Hence we selected a large XML file containing molecular information, shown in Figure 9, that represents the typical structural shape of XML documents in scientific applications. It was based on XML documents obtained from the Protein Data Bank [19], and consists of two large arrays representing the molecule data, as well as a couple of elements for the molecule attributes. To obtain the different sizes, the molecule-data part of the document was repeated.

Every test is run ten times to obtain the average time, and the first measurement is discarded, so as to measure performance with the file data already cached rather than being read from disk. The programs are compiled by Sun Workshop 5.2 CC with the option -O, and the libxml2 library we use is version 2.6.16.

During our initial experiments, we noticed poor speedup during a number of tests that should have performed well. We attributed this to lock contention in malloc(). To avoid this, we wrote a simple, thread-optimized allocator around malloc(). This allocator maintains a separate pool of memory for each thread; as long as an allocation request can be satisfied from this pool, no locks need to be acquired. To fill the pools initially, we simply run the test once, then free all memory, returning it to each pool. Our tests with straight libxml2 also use our allocator, since the results of straight libxml2 are better with it than without it.

Our allocator is intended simply to avoid lock contention, since we wanted to focus on the parsing itself. There is significant work on multi-threaded memory allocation that could be used to provide a more sophisticated solution [15].
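The following is a minimal sketch of this idea (hypothetical code, not the paper's actual allocator): each thread keeps a private free list via pthread thread-specific data, so a pool hit requires no lock and only a miss falls through to malloc(). A real version would need multiple size classes and would seed the pools up front, as the paper does by running the test once.

```c
#include <stdlib.h>
#include <pthread.h>

/* One size class, for brevity; blocks are recycled through a
 * per-thread free list, so pool hits acquire no locks. */
enum { CHUNK_SIZE = 256 };

typedef struct block { struct block *next; } Block;

static pthread_key_t  pool_key;
static pthread_once_t pool_once = PTHREAD_ONCE_INIT;

static void pool_init(void) { pthread_key_create(&pool_key, NULL); }

void *pool_alloc(void)
{
    pthread_once(&pool_once, pool_init);
    Block *head = pthread_getspecific(pool_key);
    if (head != NULL) {                        /* pool hit: lock-free */
        pthread_setspecific(pool_key, head->next);
        return head;
    }
    return malloc(CHUNK_SIZE);                 /* pool miss: may lock */
}

void pool_free(void *p)
{
    pthread_once(&pool_once, pool_init);
    Block *b = p;                              /* recycle into this   */
    b->next = pthread_getspecific(pool_key);   /* thread's own pool   */
    pthread_setspecific(pool_key, b);
}
```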
6.1. Performance Breakdown

Our implementation performs the stages as described in the paper, and thus we can measure the time of each stage. The sequential stages include preparsing, task partitioning, and post-processing; the parallel stage is just the parallel parsing, which is done by all the threads in parallel. Our performance breakdown experiment is done on our test XML document, sized at 18M bytes. The result is shown in Figure 10, in which the different gray levels represent the percentage of the running time of each stage. We tested from 2 to 32 threads, but we show only the even-numbered results in Figure 10 to make the graph clearer.

The most immediate feature to note is that the preparsing is by far the most time-consuming part of the sequential stages, and the static partitioning stage is not a significant limit to the parallelism. Thus, if we wish to address the effects of Amdahl's law, we will need to target the preparsing. As a general trend up to 24 threads, we observed that the percentage of time spent in preparsing grows from 13% to 53%, the percentage in task partitioning grows from 0.17% to 0.71%, and the percentage in post-processing grows from 0.16% to 0.73%. This means that as the number of threads increases, the sequential stages take an increasing percentage of the total time, which can obviously cause the performance to degrade. Meanwhile, the share of the parallel parsing stage drops from 87% to 46%. When the number of participating threads exceeds 24, we run out of hardware threads, thus requiring the OS to start context-switching threads. This causes the percentage of parallel parsing to increase suddenly, from 46% at 24 threads to 52% at 26 threads, followed by a minor reduction at each subsequent measurement, down to 48% at 32 threads. Over the full range, the post-processing stage grows from 0.16% at 2 threads to 0.74% at 32 threads.

Figure 11: Speedup graph from 2 to 32 threads. (Series: static-approach speedup including and not including the sequential stages; dynamic-approach speedup including and not including preparsing.)

Figure 12: Efficiency graph from 2 to 32 threads. (Series: efficiency of the static approach including and not including the sequential stages; efficiency of the dynamic approach including and not including preparsing.)
6.2. Speedup and Efficiency Analysis

To show the benefit of the static approach on this kind of shallow-structured XML document, we compare the static approach with the dynamic approach introduced in our earlier work [14]. Referring to Figure 11, we made four types of measurements across the static and dynamic parallel parsing approaches. The tests were conducted on our test XML document, sized at 18M bytes, and to better explore the cause of the performance limitations of each approach, we plot two lines per approach: one is the total speedup, and the other is the speedup not counting the sequential stages. Speedup in this case is computed against the time obtained by a pure libxml2 parse, without any preparsing or other steps needed only for PXP. For the static approach, the sequential stages include preparsing, task partitioning and assignment, and post-processing; for the dynamic approach, the sequential stages include just preparsing.

It appears that every six threads, the speedup has a distinct drop. This is due to the fact that there are only six cores on the machine: every six threads, the number of hardware threads per core must increase. To better see the speedup degradation pattern, we generated the efficiency graph shown in Figure 12. The efficiency is defined as the speedup divided by the number of threads; for example, the dynamic approach's peak speedup of 6.1 at 16 threads corresponds to an efficiency of 6.1/16 ≈ 0.38. With the efficiency graph, we can clearly see the performance degradation.

Since the dynamic load-balancing is obtained by run-time work stealing, which incurs intensive communication and synchronization costs as the number of threads increases, it does not scale with the number of cores, as Figure 11 shows. Without preparsing, it reaches its maximum speedup of 6.1 with 16 threads. After that point, it drops to its lowest value of 0.47 at 29 threads and remains below a speedup of one as the number of threads increases further. As for efficiency, the dynamic approach drops to 0.02 in the 32-thread case.

These two figures show that, though the dynamic approach has proved to be more flexible and suitable for parallel XML parsing of complex XML documents, for large XML documents with a simple array-like structure the static approach is more scalable than the dynamic approach. Note that although more sophisticated dynamic load-balancing techniques [12] exist to improve scalability, for such shallow-structured documents the static scheme should be hard to beat.

To better understand the pros and cons of the static and dynamic approaches, we also ran tests on complex, structured XML. We generated a deeply and irregularly structured XML document with a size of 19M bytes. The test results, shown in Figure 13, however, show that in such cases the speedup of the static approach is much worse than that of the dynamic approach. The reason is that for complex XML documents it is harder to statically achieve a balanced load, and thus, as the number of threads increases, load imbalance becomes more serious.

In addition, we also note that the near-linear speedup of the parallel parsing when not including the sequential portion means that we can make significant improvements in speed by targeting preparsing, which we have shown by the
Figure 13 (legend): speedup of the dynamic and static approaches on a deep and irregularly structured XML document.

7. Related work

The static partition approach introduced in this paper is similar to the "search-frontier splitting" policy used