Incrementality in Deterministic Dependency Parsing
Joakim Nivre
School of Mathematics and Systems Engineering
Växjö University
SE-35195 Växjö
Sweden
[email protected]
triples ⟨S, I, A⟩, where S is the stack (represented as a list), I is the list of (remaining) input tokens, and A is the (current) arc relation for the dependency graph. (Since the nodes of the dependency graph are given by the input string, only the arc relation needs to be represented explicitly.) Given an input string W, the parser is initialized to ⟨nil, W, ∅⟩ and terminates when it reaches a configuration ⟨S, nil, A⟩ (for any list S and set of arcs A). The input string W is accepted if the dependency graph D = (W, A) given at termination is well-formed; otherwise W is rejected.

In order to understand the constraints on incrementality in dependency parsing, we will begin by considering the most straightforward parsing strategy, i.e. left-to-right bottom-up parsing, which in this case is essentially equivalent to shift-reduce parsing with a context-free grammar in Chomsky normal form. The parser is defined in the form of a transition system, represented in Figure 3 (where wi and wj are arbitrary word tokens):

1. The transition Left-Reduce combines the two topmost tokens on the stack, wi and wj, by a left-directed arc wj → wi and reduces them to the head wj.

2. The transition Right-Reduce combines the two topmost tokens on the stack, wi and wj, by a right-directed arc wi → wj and reduces them to the head wi.

3. The transition Shift pushes the next input token wi onto the stack.

[Figure 3: Transitions for the standard bottom-up parser; initialization: ⟨nil, W, ∅⟩]

The transitions Left-Reduce and Right-Reduce are subject to conditions that ensure that the Single head condition is satisfied. For Shift, the only condition is that the input list is non-empty.

As it stands, this transition system is nondeterministic, since several transitions can often be applied to the same configuration. Thus, in order to get a deterministic parser, we need to introduce a mechanism for resolving transition conflicts. Regardless of which mechanism is used, the parser is guaranteed to terminate after at most 2n transitions, given an input string of length n. Moreover, the parser is guaranteed to produce a dependency graph that is acyclic and projective (and satisfies the single-head constraint). This means that the dependency graph given at termination is well-formed if and only if it is connected.

We can now define what it means for the parsing to be incremental in this framework. Ideally, we would like to require that the graph (W − I, A) is connected at all times. However, given the definition of Left-Reduce and Right-Reduce, it is impossible to connect a new word without shifting it to the stack first, so it seems that a more reasonable condition is that the size of the stack should never exceed 2. In this way, we require every word to be attached somewhere in the dependency graph as soon as it has been shifted onto the stack.

We may now ask whether it is possible to achieve incrementality with a left-to-right bottom-up dependency parser, and the answer turns out to be no in the general case. This can be demonstrated by considering all the possible projective dependency graphs containing only three nodes and checking which of these can be parsed incrementally. Figure 4 shows the relevant structures, of which there are seven altogether.

[Figure 4: The seven projective dependency structures (1)–(7) over the three nodes a, b, c]

We begin by noting that trees (2–5) can all be constructed incrementally by shifting the first two tokens onto the stack, then reducing – with Right-Reduce in (2–3) and Left-Reduce in (4–5) – and then shifting and reducing again – with Right-Reduce in (2) and (4) and Left-Reduce in (3) and (5). By contrast, the three remaining trees all require that three tokens are shifted onto the stack before the first reduction. However, the reason why we cannot parse the structure incrementally is different in (1) compared to (6–7).

In (6–7) the problem is that the first two tokens are not connected by a single arc in the final dependency graph. In (6) they are sisters, both being dependents on the third token; in (7) the first is the grandparent of the second. And in pure dependency parsing without nonterminal symbols, every reduction requires that one of the tokens reduced is the head of the other(s). This holds necessarily, regardless of the algorithm used, and is the reason why it is impossible to achieve strict incrementality in dependency parsing as defined here. However, it is worth noting that (2–3), which are the mirror images of (6–7), can be parsed incrementally, even though they contain adjacent tokens that are not linked by a single arc. The reason is that in (2–3) the reduction of the first two tokens makes the third token adjacent to the first. Thus, the defining characteristic of the problematic structures is that precisely the leftmost tokens are not linked directly.

The case of (1) is different in that here the problem is caused by the strict bottom-up strategy, which requires each token to have found all its dependents before it is combined with its head. For left-dependents this is not a problem, as can be seen in (5), which can be processed by alternating Shift and Left-Reduce. But in (1) the sequence of reductions has to be performed from right to left as it were, which rules out strict incrementality. However, whereas the structures exemplified in (6–7) can never be processed incrementally within the present framework, the structure in (1) can be handled by modifying the parsing strategy, as we shall see in the next section.

It is instructive at this point to make a comparison with incremental parsing based on extended categorial grammar, where the structures in (6–7) would normally be handled by some kind of concatenation (or product), which does not correspond to any real semantic combination of the constituents (Steedman, 2000; Morrill, 2000). By contrast, the structure in (1) would typically be handled by function composition, which corresponds to a well-defined compositional semantic operation. Hence, it might be argued that the treatment of (6–7) is only pseudo-incremental even in other frameworks.

Before we leave the strict bottom-up approach, it can be noted that the algorithm described in this section is essentially the algorithm used by Yamada and Matsumoto (2003) in combination with support vector machines, except that they allow parsing to be performed in multiple passes, where the graph produced in one pass is given as input to the next pass.[1] The main motivation they give for parsing in multiple passes is precisely the fact that the bottom-up strategy requires each token to have found all its dependents before it is combined with its head, which is also what prevents the incremental parsing of structures like (1).

[1] A purely terminological, but potentially confusing, difference is that Yamada and Matsumoto (2003) use the term Right for what we call Left-Reduce and the term Left for Right-Reduce (thus focusing on the position of the head instead of the position of the dependent).

4 Arc-Eager Dependency Parsing

In order to increase the incrementality of deterministic dependency parsing, we need to combine bottom-up and top-down processing. More precisely, we need to process left-dependents bottom-up and right-dependents top-down. In this way, arcs will be added to the dependency graph as soon as the respective head and dependent are available, even if the dependent is not complete with respect to its own dependents. Following Abney and Johnson (1991), we will call this arc-eager parsing, to distinguish it from the standard bottom-up strategy discussed in the previous section.

Using the same representation of parser configurations as before, the arc-eager algorithm can be defined by the transitions given in Figure 5, where wi and wj are arbitrary word tokens (Nivre, 2003):

1. The transition Left-Arc adds an arc wj →r wi from the next input token wj to the token wi on top of the stack and pops the stack.

2. The transition Right-Arc adds an arc wi →r wj from the token wi on top of the stack to the next input token wj, and pushes wj onto the stack.

3. The transition Reduce pops the stack.

4. The transition Shift (SH) pushes the next input token wi onto the stack.

The transitions Left-Arc and Right-Arc, like their counterparts Left-Reduce and Right-Reduce, are subject to conditions that ensure that the Single head constraint is satisfied, while the Reduce transition can only be applied if the token on top of the stack already has a head. The Shift transition is the same as before and can be applied as long as the input list is non-empty.

Comparing the two algorithms, we see that the Left-Arc transition of the arc-eager algorithm corresponds directly to the Left-Reduce transition of the standard bottom-up algorithm. The only difference is that, for reasons of symmetry, the former applies to the token on top of the stack and the next input token instead of the two topmost tokens on the stack. If we compare Right-Arc to Right-Reduce, however, we see that the former performs no reduction but simply shifts the newly attached right-dependent onto the stack, thus making it possible for this dependent to have right-dependents of its own. But in order to allow multiple right-dependents, we must also have a mechanism for popping right-dependents off the stack, and this is the function of the Reduce transition. Thus, we can say that the action performed by the Right-Reduce transition in the standard bottom-up algorithm is performed by a Right-Arc transition in combination with a subsequent Reduce transition in the arc-eager algorithm. And since the Right-Arc and the Reduce can be separated by an arbitrary number of transitions, this permits the incremental parsing of arbitrarily long right-dependent chains.

Defining incrementality is less straightforward for the arc-eager algorithm than for the standard bottom-up algorithm. Simply considering the size of the stack will not do anymore, since the stack may now contain sequences of tokens that form connected components of the dependency graph. On the other hand, since it is no longer necessary to shift both tokens to be combined onto the stack, and since any tokens that are popped off the stack are connected to some token on the stack, we can require that the graph (S, AS) should be connected at all times, where AS is the restriction of A to S, i.e. AS = {(wi, wj) ∈ A | wi, wj ∈ S}.

Given this definition of incrementality, it is easy to show that structures (2–5) in Figure 4 can be parsed incrementally with the arc-eager algorithm as well as with the standard bottom-up algorithm. However, with the new algorithm we can also parse structure (1) incrementally, as
[Figure 5: Transitions for the arc-eager parser]
Initialization  ⟨nil, W, ∅⟩
Left-Arc        ⟨wi|S, wj|I, A⟩ → ⟨S, wj|I, A ∪ {(wj, wi)}⟩       if ¬∃wk: (wk, wi) ∈ A
Right-Arc       ⟨wi|S, wj|I, A⟩ → ⟨wj|wi|S, I, A ∪ {(wi, wj)}⟩    if ¬∃wk: (wk, wj) ∈ A

is shown by the following transition sequence:

⟨nil, abc, ∅⟩
↓ (Shift)
⟨a, bc, ∅⟩
↓ (Right-Arc)
⟨ba, c, {(a, b)}⟩
↓ (Right-Arc)
⟨cba, nil, {(a, b), (b, c)}⟩

We conclude that the arc-eager algorithm is optimal with respect to incrementality in dependency parsing, even though it still holds true that the structures (6–7) in Figure 4 cannot be parsed incrementally. This raises the question how frequently these structures are found in practical parsing, which is equivalent to asking how often the arc-eager algorithm deviates from strictly incremental processing. Although the answer obviously depends on which language and which theoretical framework we consider, we will attempt to give at least a partial answer to this question in the next section. Before that, however, we want to relate our results to some previous work on context-free parsing.

First of all, it should be observed that the terms top-down and bottom-up take on a slightly different meaning in the context of dependency parsing, as compared to their standard use in context-free parsing. Since there are no nonterminal nodes in a dependency graph, top-down construction means that a head is attached to a dependent before the dependent is attached to (some of) its dependents, whereas bottom-up construction means that a dependent is attached to its head before the head is attached to its head. However, top-down construction of dependency graphs does not involve the prediction of lower nodes from higher nodes, since all nodes are given by the input string. Hence, in terms of what drives the parsing process, all algorithms discussed here correspond to bottom-up algorithms in context-free parsing. It is interesting to note that if we recast the problem of dependency parsing as context-free parsing with a CNF grammar, then the problematic structures (1) and (6–7) in Figure 4 all correspond to right-branching structures, and it is well-known that bottom-up parsers may require an unbounded amount of memory in order to process right-branching structures (Miller and Chomsky, 1963; Abney and Johnson, 1991).

Moreover, if we analyze the two algorithms discussed here in the framework of Abney and Johnson (1991), they do not differ at all as to the order in which nodes are enumerated, but only with respect to the order in which arcs are enumerated; the first algorithm is arc-standard while the second is arc-eager. One of the observations made by Abney and Johnson (1991) is that arc-eager strategies for context-free parsing may sometimes require less space than arc-standard strategies, although they may lead to an increase in local ambiguities. It seems that the advantage of the arc-eager strategy for dependency parsing with respect to structure (1) in Figure 4 can be explained along the same lines, although the lack of nonterminal nodes in dependency graphs means that there is no corresponding increase in local ambiguities. Although a detailed discussion of the relation between context-free parsing and dependency parsing is beyond the scope of this paper, we conjecture that this may be a genuine advantage of dependency representations in parsing.
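To make the comparison between the two strategies concrete, both transition systems can be sketched in Python and applied to structure (1), the right-chain a → b → c. This is an illustrative sketch, not code from the paper: the `parse` driver, the string encoding of transitions, and the `components` helper (which counts connected components of (S, AS)) are our own scaffolding.

```python
# Sketch of the two transition systems discussed above, applied to
# structure (1): the right-chain a -> b -> c, i.e. arcs (a, b) and (b, c).
# Configurations are (stack, input, arcs); arcs are (head, dependent) pairs.

def components(stack, arcs):
    """Connected components of (S, A_S): stack tokens linked by arcs
    whose endpoints are both on the stack (undirected union-find)."""
    parent = {w: w for w in stack}

    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w

    for head, dep in arcs:
        if head in parent and dep in parent:
            parent[find(head)] = find(dep)
    return len({find(w) for w in stack})

def parse(transitions, tokens):
    """Apply a fixed transition sequence; return the arc set and the
    maximum number of stack components seen along the way."""
    stack, buf, arcs, worst = [], list(tokens), set(), 0
    for t in transitions:
        if t == "Shift":                  # push next input token
            stack.append(buf.pop(0))
        elif t == "Left-Reduce":          # bottom-up: wj -> wi, keep head wj
            arcs.add((stack[-1], stack[-2]))
            del stack[-2]
        elif t == "Right-Reduce":         # bottom-up: wi -> wj, keep head wi
            arcs.add((stack[-2], stack[-1]))
            stack.pop()
        elif t == "Left-Arc":             # arc-eager: wj -> wi, pop wi
            arcs.add((buf[0], stack.pop()))
        elif t == "Right-Arc":            # arc-eager: wi -> wj, push wj
            arcs.add((stack[-1], buf[0]))
            stack.append(buf.pop(0))
        elif t == "Reduce":               # arc-eager: pop an attached token
            stack.pop()
        worst = max(worst, components(stack, arcs))
    return arcs, worst

# Standard bottom-up: all three tokens sit unattached on the stack
# before the first reduction (three components at once).
arcs1, worst1 = parse(
    ["Shift", "Shift", "Shift", "Right-Reduce", "Right-Reduce"], "abc")

# Arc-eager: the transition sequence from the text; the stack graph
# (S, A_S) stays connected throughout.
arcs2, worst2 = parse(["Shift", "Right-Arc", "Right-Arc"], "abc")

assert arcs1 == arcs2 == {("a", "b"), ("b", "c")}
assert worst1 == 3 and worst2 == 1
```

Both parses yield the same dependency graph, but only the arc-eager sequence keeps the stack connected at every step, which is exactly the incrementality criterion defined above.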
Connected      Parser configurations
components     Number     Percent
0                1251         7.6
1               10148        61.3
2                2739        16.6
3                1471         8.9
4                 587         3.5
5                 222         1.3
6                  98         0.6
7                  26         0.2
8                   3         0.0
≤1              11399        68.9
≤3              15609        94.3
≤8              16545       100.0
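As a quick sanity check on the table (our own arithmetic, not part of the paper), the total, the cumulative rows, and the percentages all follow from the per-component counts:

```python
# Recompute the table's total, cumulative rows, and percentages from
# the per-component counts (16545 parser configurations in all).
counts = {0: 1251, 1: 10148, 2: 2739, 3: 1471, 4: 587,
          5: 222, 6: 98, 7: 26, 8: 3}
total = sum(counts.values())
assert total == 16545

def pct(n):
    # Percentage of all configurations, rounded to one decimal.
    return round(100 * n / total, 1)

assert pct(counts[1]) == 61.3                       # the dominant case
assert sum(counts[k] for k in (0, 1)) == 11399      # row "<= 1"
assert pct(11399) == 68.9
assert sum(counts[k] for k in range(4)) == 15609    # row "<= 3"
assert pct(15609) == 94.3
```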