Semi-supervised lexical acquisition for wide-coverage parsing
Date
2013-07-02
Authors
Thomforde, Emily Jane
Abstract
State-of-the-art parsers suffer from incomplete lexicons, as evidenced by the fact
that they all contain built-in methods for dealing with out-of-lexicon items at parse
time. Since new labelled data is expensive to produce and no amount of it will conquer
the long tail, we attempt to address this problem by leveraging the enormous amount of
raw text available for free, and expanding the lexicon offline, with a semi-supervised
word learner. We accomplish this with a method similar to self-training, where a fully
trained parser is used to generate new parses with which the next generation of parser
is trained.
This thesis introduces Chart Inference (CI), a two-phase word-learning method
with Combinatory Categorial Grammar (CCG), operating on the level of the partial
parse as produced by a trained parser. CI uses the parsing model and lexicon to identify
the CCG category type for one unknown word in a context of known words by inferring
the type of the sentence using a model of end punctuation, then traversing the chart
from the top down, filling in each empty cell as a function of its mother and its sister.
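The idea of filling a cell from its mother and sister can be illustrated with a minimal sketch. This is not the CI implementation from the thesis: it assumes only the two CCG application rules (no composition or type-raising), treats categories as plain strings, ignores features and the parsing model, and the names `strip_parens`, `split_category`, and `candidates` are hypothetical.

```python
from typing import List, Optional, Tuple


def strip_parens(cat: str) -> str:
    """Drop one pair of outer parentheses if they enclose the whole category."""
    if cat.startswith("(") and cat.endswith(")"):
        depth = 0
        for i, ch in enumerate(cat):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(cat) - 1:
                    return cat  # the first "(" closes early, e.g. "(A)/(B)"
        return cat[1:-1]
    return cat


def wrap(cat: str) -> str:
    """Parenthesise a complex category so it can be embedded in a larger one."""
    return f"({cat})" if "/" in cat or "\\" in cat else cat


def split_category(cat: str) -> Optional[Tuple[str, str, str]]:
    """Split at the rightmost top-level slash into (result, slash, argument)."""
    cat = strip_parens(cat)
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        ch = cat[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif ch in "/\\" and depth == 0:
            return strip_parens(cat[:i]), ch, strip_parens(cat[i + 1:])
    return None  # atomic category such as "NP"


def candidates(mother: str, sister: str, sister_on_left: bool) -> List[str]:
    """Categories for the empty cell that combine with `sister` to give `mother`.

    Assumption: only forward (X/Y Y => X) and backward (Y X\\Y => X) application.
    """
    mother = strip_parens(mother)
    sister = strip_parens(sister)
    found: List[str] = []
    parts = split_category(sister)
    if sister_on_left:
        # Sister is a forward functor mother/Y, so the empty cell is its argument Y.
        if parts and parts[1] == "/" and parts[0] == mother:
            found.append(parts[2])
        # Sister is the argument, so the empty cell is the functor mother\sister.
        found.append(f"{wrap(mother)}\\{wrap(sister)}")
    else:
        # Sister is a backward functor mother\Y, so the empty cell is its argument Y.
        if parts and parts[1] == "\\" and parts[0] == mother:
            found.append(parts[2])
        # Sister is the argument, so the empty cell is the functor mother/sister.
        found.append(f"{wrap(mother)}/{wrap(sister)}")
    return found
```

For example, under these assumptions `candidates("S\\NP", "NP", sister_on_left=False)` proposes a transitive-verb category for an unknown word followed by a known object noun phrase. A real system would also need to rank the competing candidates, which the abstract attributes to the parsing model and lexicon.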
We first specify the CI algorithm, and then compare it to two baseline word-learning
systems over a battery of learning tasks. CI is shown to outperform the
baselines in every task, and to function in a number of applications, including grammar
acquisition and domain adaptation. This method performs consistently better than
self-training, and improves upon the standard POS-backoff strategy employed by the
baseline StatCCG parser by adding new entries to the lexicon.
The first learning task establishes lexical convergence over a toy corpus, showing
that CI’s ability to accurately model a target lexicon is more robust to initial conditions
than either of the baseline methods. We then introduce a novel natural language corpus
based on children’s educational materials, which is fully annotated with CCG derivations.
We use this corpus as a testbed to establish that CI is capable in principle of
recovering the whole range of category types necessary for a wide-coverage lexicon.
We then increase the complexity of the learning task using the CCGbank corpus,
a CCG version of the Penn Treebank, and show that CI's performance improves as its
initial seed corpus grows. The next experiment uses CCGbank as the seed and attempts to recover
missing question-type categories in the TREC question answering corpus. The final
task extends the coverage of the CCGbank-trained parser by running CI over the raw
text of the Gigaword corpus. Where appropriate, a fine-grained error analysis is also
undertaken to supplement the quantitative evaluation of the parser performance with
deeper reasoning as to the linguistic points of the lexicon and parsing model.