Evolving English
Markus Gerstel
Lincoln College
DEPARTMENT OF STATISTICS
Thesis submitted in partial fulfilment of the requirements for the degree of MSc in Computer Science
September 2008
I would like to thank in particular my supervisors Stephen Clark and Jotun Hein as well as Rune Lyngsø
and Ferenc Huszár and everyone at our presentation sessions for many insightful and helpful comments.
This work was supported by stipend D/06/48365 from the German Academic Exchange Service, Bonn, Germany.
Table of Contents

1. Introduction
   1.1. Cross-Language Structure
   1.2. Models of Grammar Evolution - Report Structure
2. Background
   2.1. Linguistic Motivation
      2.1.1. Retracing Language Evolution
      2.1.2. Modelling English Grammar
   2.2. Existing Language Evolution Simulations
   2.3. Project Aims
3. Formalities
   3.1. Modelling a Probabilistic Context-Free Grammar
      3.1.1. Context-Free Grammars
      3.1.2. Probabilistic Context-Free Grammars
      3.1.3. Limitations of Natural Language CFGs/PCFGs
      3.1.4. Beyond Context-Free
      3.1.5. Inversion Transduction Grammars
   3.2. Treebanks as Grammar Sources
      3.2.1. Penn Treebank
      3.2.2. CCG Bank
      3.2.3. Penn Chinese Treebank
      3.2.4. German Treebank TIGER
      3.2.5. German Treebank TüBa-D/S
   3.3. Compatibility of Language Models
   3.4. Comparability of Language Models
   3.5. Introducing N-Gram Models
   3.6. Comparing N-Gram Models
   3.7. Distance Learning
   3.8. Language Evolution Model - An Overview
4. Details & Results
   4.1. Measuring Distances
   4.2.
      4.2.1.
      4.2.2.
      4.2.3.
   4.3. Directed Evolution
   4.4. Interpreting Results
5. Conclusion & Outlook
6. References
7. Appendix
   7.1. The Penn Treebank Tagset
      7.1.1. Phrase Level Annotation
      7.1.2. Clause Level Annotation
      7.1.3. Word Level Annotation
   7.2. Mapping Tagsets: Chinese/CTB to English/PTB
   7.3. Mapping Tagsets: German/STTS to English/PTB
   7.4. Code: [Link]
   7.5. Code: [Link]
   7.6. Code: [Link]
Chapter 1
Introduction
(...) the best grounding for education is the Latin grammar. I say this, not because Latin is traditional and medieval, but simply because even a rudimentary knowledge of Latin cuts down the labour and pains of learning almost any other subject by at least 50 per cent. It is the key to the vocabulary and structure of all the Romance languages and to the structure of all the Teutonic languages (...)
Dorothy L. Sayers, The Lost Tools of Learning (1947), Oxford
1.1. Cross-Language Structure
Anyone who ever considered learning a new language may have come across this or a similar quote.
It is quite common to find foreign words incorporated in a language. A lot of German words can be found in the English language (Lied, Bremsstrahlung, ...) and vice versa (label, team, ...). Some of these words are loanwords, borrowed via cultural exchange. Others have a common ancestry, for example hand/Hand. Some words take more complicated ways: shop in German is borrowed from the English shop, which in turn has a common ancestor with the German Schuppen. Tracing the history of these words is very easy if one has an etymological dictionary. Even more visible than words are borrowed letters or digits, like Umlauts or the Arabic numerals.
However, once it comes to grammatical constructions it gets more difficult. A bilingual speaker
will usually only notice the cases in which grammatical constructions differ between languages when
thinking up a sentence in his native tongue and translating this sentence word for word into a foreign
language, resulting in an awkward or ungrammatical sentence. And even when sentence constructions differ it is not always obvious how exactly the sentence structure has to be changed during
translation. On the other hand a fluent speaker might notice and appreciate sentence structures
shared between both languages.
The following examples illustrate such a structural relation:
Ich
habe
dieses Buch
gelesen.
(German)
Ik
heb
dit
gelezen.
(Dutch)
(*) I
have this
boek
book read.
(English)
All three sentences express the same fact. They are word for word translations of each other in
three West Germanic languages. The German and Dutch sentences are grammatically correct, and
if you ever spoke to someone from these countries you may have heard something close to the (ungrammatical) English sentence. Correcting that sentence is easy: Moving the word read in front of
this book solves the problem. But how would one know which word to pick and where to put it?
In linguistics an asterisk in front of a phrase indicates that the following phrase is ungrammatical.
Every sentence has internal syntactic structure, which can be described by a syntax tree. In a
syntax tree the words of the sentence can be found at the leaf nodes. The nodes directly connected
to leaf nodes are annotated with so-called part of speech tags (POS tags) that describe the function
the corresponding word takes in a sentence. Inner nodes, or branch nodes, indicate the function of
the phrase beneath. For example in the noun phrase (NP) this book the word this is a determiner
(DT) and book is a singular noun (NN). The root S of each tree describes the clause in its entirety as
a sentence. The required change between the German and the English sentence can now be pinpointed:
Fig. 2: Syntax trees for the correct German and the incorrect English sentence. German: (S (NP (PRP Ich)) (VP (VBP habe) (VP (NP (DT dieses) (NN Buch)) (VBN gelesen)))). English: (S (NP (PRP I)) (VP (VBP have) (VP (VBN read) (NP (DT this) (NN book))))).
Apart from the words both trees differ only in the ordering of the subtrees of the second, lower
verb phrase (VP) node. A full description of the entire syntax tree annotation can be found in the
appendix, section 7.1.
Syntax trees can also be understood as derivation trees: Beginning with a starting symbol, in
these cases S, the tree is expanded by derivation rules (e.g. S → NP VP) until at the leaves only
terminal symbols (words) remain. The tuple of a starting symbol and sets of non-terminals, terminals
and these derivation rules constitutes a Context-Free Grammar (CFG). Counting the occurrences of
each derivation rule and assigning it a probability results in a Probabilistic Context-Free Grammar.
1.2. Models of Grammar Evolution - Report Structure
This project aims to create a novel evolutionary model for Probabilistic Context-Free Grammars
(PCFG) of natural languages.
Chapter 2 will give a more detailed account of the linguistic and computational motivation behind
this project. The question of whether such a CFG model can give an accurate representation of the
English language, or in fact any other natural language, will be explored.
In chapter 3 the necessary formal techniques will be introduced. Beginning with the formal definition of CFG/PCFGs and an introduction to treebanks as sources for these grammars, this chapter
will also reveal weaknesses and limitations of these CFG models, especially in comparative situations.
A way around these limitations will be presented with the introduction of n-gram models. A possible
distance metric between two languages will be derived and finally an overview of the entire evolutionary model will be provided.
In chapter 4 practical results will be presented. The proposed distance metric will be tested on
four language models derived from English, German and Chinese treebanks. The English language
will then be subjected to the directed evolutionary model to produce Modified English, a different
language. The idea will be to change the English language model in such a way as to move it closer to
the German language. Finally the changes made to the English language model will be analysed and
the results interpreted.
Chapter 5 will summarize the main findings of this project. It will also give suggestions for continuation projects.
Chapter 2
Background
2.1. Linguistic Motivation
Germanic languages like English and German both evolved from a hypothetical common ancestor,
Proto-Germanic. In contrast French, Latin, Welsh and Farsi (Persian) are not Germanic languages. All
these languages however do have a common root in the (again hypothetical) Proto-Indo-European
language. According to linguistic theory this unified Indo-European language existed somewhere
between 10,000 BC and 3,000 BC. There is no direct evidence for this language, and not much is known about how it sounded or how it might have looked in writing.
[Figure: simplified Indo-European language family tree. Proto-Indo-European branches into Celtic (Brythonic: Welsh), Germanic (North Germanic: Swedish, Icelandic; West Germanic: Dutch, English, High German) and Indo-Iranian (Iranian: Farsi).]
2.1.1. Retracing Language Evolution
The two main methods by which knowledge about the Proto-Indo-European language was acquired
are called internal reconstruction and the comparative method.
The internal reconstruction method looks at the current state of variation within a language and postulates that exceptions to a general case are newer than the general case. For example in Latin the regular perfect forms of verbs are created from the verb's stem and the added suffix -si. But there are some exceptions to that rule, namely irregular verbs like scribere. The internal reconstruction method postulates the original verb forms (*)figsi and (*)scribsi.

carpo     carpere     carpsi
duco      ducere      duxi
figo      figere      fixi, not (*)figsi
scribo    scribere    scripsi, not (*)scribsi
The earliest evidence for something that could be considered writing emerged around 6,600 BC. This however was not writing with linguistic content and this happened in ancient China, not Europe. It may be possible that this Proto-Indo-European language existed but has never been used in writing. There have been two notable attempts to create a text in reconstructed Proto-Indo-European: Schleicher's fable, originally by A. Schleicher (1868), and The king and the god by S. K. Sen, E. P. Hamp et al. (1994).
The method then tries to reconstruct the exceptions from their original forms. In this case the postulated change is that devoicing occurs in front of voiceless consonants. This reconstruction has to satisfy two plausibility principles: Firstly, the changes have to be natural, in other words the required sound changes have to be regular, simple and language-universal. Secondly, the postulate has to satisfy Occam's Razor: there cannot be a different postulate that has the same outcome but makes fewer assumptions (i.e. is simpler). Hypothetical languages created by means of internal reconstruction carry the prefix pre- (e.g. Pre-Germanic).
The comparative method instead compares variation between two or more languages. If a systematic correspondence of some feature can be established then a hypothetical, common ancestral form can be postulated. The hypothetical common ancestor of the sound correspondence example in figure 5 could either be d, t or an unobserved element X. Again the plausibility principles are applied: A change d → t is quite uncommon, but the change t → d can be observed quite frequently in other languages. A third element X is not required and would violate Occam's Razor. Hypothetical languages created by means of the comparative method carry the prefix proto- (e.g. Proto-Indo-European).

Fig. 5: Sound correspondences, e.g. day/dies, divine/deus, devil/diabolus, ten/decem (not (*)tecem), two/duo (not (*)tuo).
However, since there is not much information or evidence (certainly not enough for any kind of statistical approach) for any of the Proto-languages, the goal of the project will not be to create an accurate model of historic language evolution. While the immediate focus of this thesis lies on grammar evolution, the generic evolution model presented in the following chapters can at a later stage be modified to accommodate accurate historic grammar evolution. A prerequisite for that is the availability of treebanks that contain information about historic grammar evolution, which is currently not the case.

Example and explanation from A. Lohöfer, UE History of English, Linguistic Reconstruction, Session 2, April 2007
2.1.2. Modelling English Grammar

(...) There are differences in the dialects of English spoken in Yorkshire and London. Maybe language should be defined, prescribed instead of described, by some authority that would itself rely only on highest quality works from selected writers. William Shakespeare's works are certainly considered well-formed English, but times have changed, and so has the language. The different settings of place, time and even social hierarchy result in different definitions of English.
The second problem is the inherent richness of language. This can be illustrated by the simple question: How many words are there in the English language? While it is a myth that the Inuit have 400 different words for snow, it is certainly possible to give 400 English compounds with snow (snowflake, snowmobile, snow-...). New words are also constantly created, for example words like blog, vlog, mlog, blogosphere and blawg. Even when one analyses a copy of the entire Wikipedia with over 2.5 million articles and keeps a list of seen words, that list will always continue to grow.

[Figure: the number of distinct words seen (y-axis, up to 40,000) keeps growing with the number of examined sentences (x-axis, up to 40,000).]
Noam Chomsky (1957) argued that English is so complex that it cannot be described by any
context-free grammar, probabilistic or not. While there is proof that this property holds for some
languages like Swiss German, the question of whether this is true for English is still open (e.g. M.
Mohri, R. Sproat, 2006).
Working with a statistical method on something as organic and infinitely complex as language clearly
requires more pragmatism than perfectionism. The grammar for the English language in this project
will be derived from the Penn Treebank, an annotated corpus built from articles in the Wall Street
Journal. It will neither be a 100% complete nor a 100% accurate Probabilistic Context-Free Grammar for the English language.
2.2. Existing Language Evolution Simulations
Computer simulation has been used for very early stages of language evolution: The emergence of
simple signalling systems allowing shared communication, the appearance of syntax and syntactic
universals. The question of a required minimum complexity for any natural language has been studied, as well as the question of how a language can be learned at all. Simulations of the origins of language
usually involves an agent-based view, in which independent entities form a society and interact with
and learn from each other. These can be represented by sets of rules or neural networks. Evolution
then can be introduced by e.g. genetic algorithms.
Most of these simulations however involve very abstract systems that bear no strong resemblance to any existing language (A. Cangelosi and D. Parisi, 2002; H. Turner, 2002). Others are much
closer to an existing language: (M. Hare and J. L. Elman, 1995) applied a machine learning technique
to the morphology of English verbs in an effort to simulate evolution on a word level.
It appears that so far there is no existing research on syntax level evolutionary simulation on one
or more natural languages.
2.3. Project Aims
A pure linguistic approach to estimate distance between languages usually relies on one or more of
these three methods:
The first method is word analysis. The Latin pater became the English father and German Vater.
This kind of analysis relies on known rules of how spelling and pronunciation changed over time.
Word analysis becomes more difficult the further apart two languages are: The English sister descends from the same root as the Hindi bahan, but this is not immediately obvious (via Sanskrit
svasar, Old English sweostor, cf. A. Lohöfer, 2007).
The second method is estimating the distance by looking at the language family. The language family tree shown in section 2.1. makes it quite clear that English is much more closely related to German
than to Welsh. Of course this estimation technique is limited. Nothing can be said about whether
Dutch or English is closer to Swedish. This approach also requires a direct connection between two
languages.
The final method is the comparison of parameters in the given languages. These parameters
can for example be the number of genders or the number of cases occurring in a language, as well
as more complex properties like word ordering. (Ryder, 2006) attempts to recreate language trees
by applying concepts from molecular biology to a set of 109 parameters per language. With these
approaches it is still quite difficult to decide whether English is closer to Welsh or to Farsi.
This project enables a novel statistical, semi-automatic approach that will work even between
languages that are not connected by the language tree (e.g. English to Chinese). It will also make it possible to measure an effective distance between two languages: If a language evolved independently from
English but had the same grammatical structure the reported difference between these languages
would be very small. Later analysis would then show how to manipulate one language to get closer
to grammatical structures of the other. A model to match grammars between languages may give
practical results that can then be reused in other fields working with language: The most obvious
application to benefit from a set of fixed rewriting rules would be statistical machine translation that
could incorporate these rules in its model.
Chapter 3
Formalities
3.1. Modelling a Probabilistic Context-Free Grammar

3.1.1. Context-Free Grammars
A context-free grammar (CFG) is usually defined as a 4-tuple $(N, T, \sigma, R)$ with $N$ being a set of nonterminal symbols, $T$ a set of terminal symbols, $\sigma$ a starting symbol that is also an element of $N$, and $R$ a (finite) set of derivation rules. These derivation rules have the form $X \rightarrow Y$, where $X$ is an element of $N$ and $Y$ is any string over $(N \cup T)$.

Fig. 7: CFG definition: $G_{CFG} := (N, T, \sigma, R)$ with $N \cap T = \emptyset$, $\sigma \in N$, and for all $(X \rightarrow Y) \in R$: $X \in N$, $Y \in (N \cup T)^{*}$.
The language corresponding to the grammar is then defined as the (possibly infinite) set of words
that can be generated by this grammar. A word is a string that contains only terminal symbols. In
the context of this project the set of terminal symbols will be the set of parts of speech (POS) tags.
In the resulting context-free language each word, a stream of terminal symbols, will therefore correspond to an entire phrase in the natural language.
The set of nonterminal symbols will be the set of all phrase level and sentence level annotations and a dedicated starting symbol $\sigma$. The use of a dedicated starting symbol together with rules like $\sigma \rightarrow S$ is required, since not every allowed phrase is a sentence. For example headlines or titles (Models of Grammar Evolution) are considered phrases, but not sentences.
3.1.2. Probabilistic Context-Free Grammars

A Probabilistic Context-Free Grammar (PCFG) additionally assigns a probability to each derivation rule. The probability of a derivation tree is the product of the probabilities of all $n$ involved derivation rules: $P(\text{DerivationTree}) = \prod_{i=1}^{n} P(Y_i|X_i)$. To account for ambiguous trees the probability for a specific phrase is then defined to be the sum of the probabilities for all distinct derivation trees leading to that phrase. Since the grammars in this project will be derived from treebanks the probabilities will be calculated by taking counts and using a simple maximum likelihood estimation: $P(Y|X) = \frac{\#(X \rightarrow Y)}{\#(X)}$.
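As a small illustration of this estimation, the following is a sketch only (not the code from the appendix); the rule counts at the end are toy values, and reading actual treebank trees is omitted:

#!/usr/bin/perl
# Maximum likelihood estimation of PCFG rule probabilities from rule counts.
use strict;
use warnings;

my (%rule_count, %lhs_count);

sub count_rule {                       # called once per rule occurrence seen in a syntax tree
    my ($lhs, $rhs) = @_;
    $rule_count{$lhs}{$rhs}++;
    $lhs_count{$lhs}++;
}

sub rule_probability {                 # P(Y|X) = #(X -> Y) / #(X)
    my ($lhs, $rhs) = @_;
    return 0 unless $lhs_count{$lhs};
    return ($rule_count{$lhs}{$rhs} // 0) / $lhs_count{$lhs};
}

count_rule('S',  'NP VP');             # toy counts
count_rule('VP', 'VBD NP');
count_rule('VP', 'VBD');
printf "P(VBD NP | VP) = %.2f\n", rule_probability('VP', 'VBD NP');   # prints 0.50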
Probabilistic Context-Free Grammars are also used in a range of fields outside of Computer Science and Linguistics, for example RNA structures are routinely described by PCFGs (cf. D. Searls,
1993).
(...) And finally, any statistical approach will require a lot more data, or data sparsity will become a problem. The grammars used in this project will not be augmented by any additional information beyond the POS tag itself.
3.1.5. Inversion Transduction Grammars
An Inversion Transduction Grammar generates a pair of sentences, one in each of two languages, simultaneously: $G_{ITG} := (N, T_1, T_2, \sigma, R)$ with $N \cap (T_1 \cup T_2) = \emptyset$, $\sigma \in N$, and every rule in $R$ of the form X → [Y] or X → <Y> with $X \in N$. The derivation rules therefore have either the form X → [Y] or X → <Y>. X is any nonterminal symbol in N while Y is a string that can contain nonterminal symbols and pairs of terminal symbols. These pairs consist of exactly one symbol from set T1 and one symbol from set T2, either of which may be empty. The most important part of the ITG are the bracket operators. The operator [] means the derivation is applied to both sentences normally. The inversion operator <> means the derivation is applied to the first sentence as usual, but for the second sentence the rule gets inverted.
(1) NP → [IN NP]
(2) NP → [DT NP]
(3) NP → <NNP NN>
(4) IN → [of/de]
(5) DT → [the/ ]
(6) NNP → [gallic/gallico]
(7) NN → [war/bello]

Applied in the order 1, 2, 5, 3, 4, 6, 7:

( NP , NP )
( IN NP , IN NP )
( IN DT NP , IN DT NP )
( IN the NP , IN NP )
( IN the NNP NN , IN NN NNP )
( of the NNP NN , de NN NNP )
( of the gallic NN , de NN gallico )
( of the gallic war , de bello gallico )

Fig. 10: Creating the sentence in both languages simultaneously by applying the ITG derivation rules

[Figure: the corresponding parse tree, with paired leaves of/de, the/ , gallic/gallico and war/bello.]

(...) extent.
3.2. Treebanks as Grammar Sources
Treebanks are collections of sentences, with a few exceptions monolingual, annotated with syntactic
information. They are a very useful tool in computational linguistics to train and test parsers, taggers and other machine learning software working on Natural Language Processing. Treebanks can
either be created completely manually or semi-automatically with a parser suggesting a syntactic
structure that then has to be checked by a linguist. Either way it is a very labour-intensive and time-consuming process.
3.2.1. Penn Treebank
(...) which then moves to the front of the sentence. The original position of the word is marked by the null element *T*. Both the null element and the corresponding wh-word receive the index number 1. Additionally Tim is identified as subject and marked (...)
[Figure: syntax tree of the example wh-question, with an SBARQ root dominating WHNP and SQ, the SQ containing VBZ, NP and a VP with VBG and NP.]

Fig. 14: Increase in the number of seen distinct derivation rules while analysing the Penn Treebank (y-axis up to 10,000 distinct rules, x-axis up to 40,000 examined sentences).
Depending on the number of nonterminals $|N|$ and terminals $|T|$ the upper bound can be formulated as $|N| \cdot \sum_{n=1}^{k} |N \cup T|^{n}$, with $k$ being the maximum length of the right-hand side of a derivation rule. The Penn Treebank annotation guide does not impose a maximum length. The longest seen derivation (...)
[Figure: number of distinct derivation rules (log scale, up to about 10,000) plotted against the number of non-terminals in the derivation rule (up to about 30).]
Fig. 16/17: Example of structural flatness in the Penn Treebank. On the left the syntax tree of the term New York Stock Exchange as it appears in the treebank (a single flat rule NP → NNP NNP NNP NNP). An alternative, nested syntax tree is shown on the right.
(...) instead of possibly up to three rules. Another example that does not involve a trademark is the rule VP → (VB spin) (PRT off) (NP its textiles operations) (PP to existing shareholders) (PP in a restructuring) (S to boost shareholder value) of length 6.
Long derivation rules will inevitably lead to a long tail in the rule distribution: A lot of the rules are
only observed a few times. Approximating the real distribution becomes difficult due to data sparsity. Having only short rules however is not necessarily better: A maximum rule length of 2 would
result in a maximum of 104,832 distinct rules in the Penn Treebank. Even though approximating the
real distribution becomes much easier due to the reduced number of possible rules, the amount of
information carried within a single rule diminishes.
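As a plausibility check of the bound given above (the grammar sizes used here are inferred, assuming roughly 26 nonterminal labels and 63 symbols in $N \cup T$; these exact sizes are an assumption, chosen because they reproduce both counts quoted in the text):

$$|N| \sum_{n=1}^{2} |N \cup T|^{n} = 26 \cdot (63 + 63^{2}) = 104{,}832$$

$$|N| \sum_{n=1}^{12} |N \cup T|^{n} = 26 \cdot \frac{63^{13} - 63}{62} \approx 1.03 \cdot 10^{23}$$

The second value is the one referred to again in section 3.4.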
3.2.2. CCG Bank
Combinatory Categorial Grammar (CCG) assigns each word a category and combines categories with a small set of function application rules, shown in figure 18.

Fig. 18: CCG combination rules, e.g. forward application X/Y Y ⇒ X, backward application Y X\Y ⇒ X, composition X/Y Y/Z ⇒ X/Z and type-raising X ⇒ Y/(Y\X).

The forward application (1) means that a function (X/Y) takes an argument (Y) to its right and returns type (X). By applying these rules the complete derivation tree can be built from the bottom up. X and Y themselves stand for any possible category. Derivation rules can be deduced by reading these rules from right to left: X → X/Y Y, etc.
Fig. 19/20: Example sentence with regular syntax tree on the left: (S (NP IBM) (VP (VBD bought) (NP Lotus))). To the right the same sentence annotated with CCG: IBM = NP, bought = (S\NP)/NP, Lotus = NP; bought Lotus combines to S\NP, which combines with IBM to S.
In the sentence IBM bought Lotus the transitive verb bought expects two arguments: Something that is buying and something that is being bought. Both subject and object are again of type NP. Bought expects one argument to its right and one to its left. As in a normal syntax tree bought Lotus is considered a verb phrase on its own. This closer connection dictates operator precedence: Lotus has to be applied first. The outcome of this application will result in a category that expects another NP to its left. This type is the same (S\NP) as that of the intransitive verb encountered previously. Now the category of bought can be defined as (S\NP)/NP.

CCG has the advantage that by following function application, semantic structure within the sentence can be extracted. The set of primitive categories is very small and there are only a few function application rules. By design CCG derivation trees can only have a branching factor of 1 or 2.
But there are two major drawbacks making CCG unsuitable for this project. Firstly the information is
stored at the word level. The derivation tree is defined by the categories, not the other way around.
This also causes categories to become increasingly complex in larger sentences. An example from
the CCG bank is the category (((S\NP)/((S\NP)/NP))/NP)\((((S\NP)/((S\NP)/NP))/NP)/NP) assigned
to the phrase BPC residents. The number of possible rules therefore is again unbounded, which
leads to data sparsity problems. Too much encoded information will also make changing the grammar in a consistent way across many sentences quite difficult.
[Figure: two CCG derivations of IBM bought Lotus, with bought assigned the category (S/NP)/NP in one and (S/NP)\NP in the other.]

Fig. 22: Semantic change modifying two categories
The second drawback is that modifying the language is much more complicated than with a
simple CFG. Since the derivation tree is completely dependent on the word level categories those
have to be modified. Furthermore in more complex sentences there are interdependencies between
words. If one category is changed others have to be changed as well so that a valid tree can still be
built. One small change can cause a ripple effect throughout the tree. This will make it very difficult
to change the grammar in a controlled way to e.g. get closer to another language model.
3.2.3. Penn Chinese Treebank

(...) magazine and Hong Kong News. The treebank uses 33 POS tags (...) and a bracketed annotation format similar to that of the Penn Treebank.
3.2.4. German Treebank TIGER
The treebank was semi-automatically POS-tagged and annotated with syntactic structure. Its
54 POS tags represent a slightly modified version of the STTS tagset, a standard tagset used by all
the major German treebanks. STTS was developed at the University of Stuttgart and the University
of Tübingen. Nearly all words in the TIGER treebank are additionally annotated with detailed information about their case, number, gender, person, degree, tense and mood.
<graph root="s4231_VROOT" discontinuous="true">
<terminals>
<t id="s4231_1" word="In" lemma="in" pos="APPR" morph="--"/>
<t id="s4231_2" word="Japan" lemma="Japan" pos="NE" morph="[Link]"/>
<t id="s4231_3" word="wird" lemma="werden" pos="VAFIN" morph="[Link]"/>
<t id="s4231_4" word="offenbar" lemma="offenbar" pos="ADJD" morph="Pos"/>
(..)
<nonterminals>
<nt id="s4231_500" cat="PP">
<edge label="AC" idref="s4231_1"/>
<edge label="NK" idref="s4231_2"/>
</nt>
Fig. 24 Example from the TIGER Treebank in XML format
3.3. Compatibility of Language Models
Working with data from more than one treebank inevitably leads to the problem that every treebank
likes to use its own format, annotation guidelines and tagset. While overcoming the format barrier is
merely a programming exercise, working with disparate tagsets can be tedious.
To be able to cross the tagset barrier a translation, a mapping from the tags in one language to
the tags in the other, is required. The way this mapping will be accomplished in this project is by
means of an intermediate (common) POS tagset and two surjective and total translation functions.
This common tagset will be kept as close to the English tagset as possible, the only difference to
the English tagset will be that elements that cannot be reached from the foreign language will be
removed, or merged with reachable elements, or as a last resort collected in a special catch-all tag
that will be used for all tags that have no correspondent.
The tagset used in the Chinese Treebank was clearly inspired by the Penn Treebank tagset, but it
contains several important differences. While the Penn Treebank offers 36 POS tags (ignoring punctuation tags) the Chinese Treebank has 33 POS tags. However out of these 33 POS tags only 31 tags
actually appear in the corpus, out of which 8 are particles and 6 are classified as other. The remaining 17 tags are sometimes quite different from the English set. The Chinese word (...) can, depending on context, be translated to destroy, destroys, destroyed, destroying or destruction. Consequently
the Chinese treebank only has 4 POS tags to indicate different verb forms, but these do not map
directly to any of the 7 POS tags of the English treebank. There is a suggested mapping from Chinese
tags to English in Xia (2000), but it is neither total nor surjective nor a function.
A small excerpt of the mapping procedure is presented in the figure below. The Chinese tags VA and VV can both mark verbs equivalent to the English VB. Ad(...)

[Figure: excerpt of the tag mapping. The Chinese (CTB) tags VA and VV are connected to the English (PTB) tags VB and MD and form one common (CTS) verb cluster; the Chinese noun tags NN and NT are connected to the English NN, and the isolated English NNS is attached to this noun cluster.]

(...) The Chinese noun tags lack number information, so they will be mapped to the English singular noun NN. The English plural noun NNS stands isolated. Depending on the view the mapping has to be surjective and total, so isolated tags cannot be allowed. NNS is closest in meaning to the Chinese NN, so it is added to the noun cluster. Once all isolated tags are assigned to their closest equivalent (or a catch-all common tag for tags without an appropriate equivalent) the entire connected component is assigned a tag in the new common tagset.
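A minimal sketch of this clustering step follows (not the thesis code; the tag links listed are only the excerpt discussed above, and the generated cluster names CTS1, CTS2, ... are invented):

#!/usr/bin/perl
# Group corresponding POS tags into connected components; each component
# becomes one tag of the common tagset.
use strict;
use warnings;

my @links = (                                   # correspondences, plus isolated tags attached to their closest equivalent
    ['CTB:VA', 'PTB:VB'], ['CTB:VV', 'PTB:VB'], ['CTB:VV', 'PTB:MD'],
    ['CTB:NN', 'PTB:NN'], ['CTB:NT', 'PTB:NN'], ['PTB:NNS', 'PTB:NN'],
);

my %adj;                                        # undirected adjacency list
for my $link (@links) {
    push @{ $adj{ $link->[0] } }, $link->[1];
    push @{ $adj{ $link->[1] } }, $link->[0];
}

my (%component, $id);
for my $tag (sort keys %adj) {
    next if exists $component{$tag};
    $id++;
    my @queue = ($tag);
    while (@queue) {                            # breadth-first search over one component
        my $t = shift @queue;
        next if exists $component{$t};
        $component{$t} = "CTS$id";
        push @queue, @{ $adj{$t} };
    }
}

print "$_ -> $component{$_}\n" for sort keys %adj;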
One possible mapping between English and Chinese tags created by this method is presented in
the appendix, chapter 7.2. In this case the common tagset consists of 18 tags. Generally a higher
number of common tags is expected to give more reliable results later on, as a higher number suggests a higher precision in what this cluster represents. A lower number of common tags represents
less information, and will make the languages appear closer together than they really are. If the common tagset were to contain only a single tag (i.e. WORD) then the involved languages would look
identical.
Trying to map CCG categories to POS tags in other languages poses huge difficulties. Due to the
possibly infinite number of distinct categories an automatic mapping process would be needed. Together with the already mentioned problems of modifying the grammar it was decided not to pursue
the use of CCG in this project any further.
The German STTS tagset is also quite different from the English tagset but it is arguably a lot
closer than the Chinese tagset. In fact the additional annotation in the TIGER treebank can be used
together with the German POS tag to find appropriate English tags to map to. Most of the time
a very exact matching, in some cases 1:1, is possible. The common tagset between English/Penn
Treebank and German/TIGER contains 32 tags.
Since TüBa-D/S is lacking the additional annotation of the TIGER treebank some information is lost. As with the Chinese treebank, singular and plural cannot be distinguished. The third person verb forms have to be unified with the regular ones, and differentiation between different degrees of adjectives is no longer possible. The resulting common tagset between English/PTB and German/TüBa-D/S contains only 26 tags.
3.4. Comparability of Language Models
With the introduction of a common tagset it should now be possible to compare two languages.
Comparing languages is required to do directed evolution. Ideally one has a function that returns
a number indicating whether two language models are equal, or how far apart they are. However, comparing two CFG language models is harder than it seems. Strictly speaking the equivalence problem for Context-Free Grammars is even undecidable.

[Figure: the sentence Sue saw Mary annotated with two different derivation trees taken from two language models SL1 and SL2; one of them uses an object phrase (OP) node, the other does not.]
With these two sentences from two language models it is not possible to deduce whether the
two models describe the same language. One way of comparing the two languages is to compare
the derivation trees of the sentences. The node annotation is a bit different and would have to be
translated to a common tagset as it was the case with the POS tags. It can be seen easily that a
noun phrase is a hyponym, a sub-type of an object phrase (OP). But apart from the derivation NP → NNP for Mary the derivation trees have nothing in common. Another way of comparing is by
looking at the output. The actual words are of course not included in either CFG, but they would be
of no help since the two CFGs are describing different real languages and hardly any words would
match. So the POS tag sequence NNP VBD NNP has to be used instead.
It is very simple to decide whether this sequence can occur in the other language. The word problem for Context-Free Grammars can be solved in $O(n^3)$ with the CYK algorithm. The algorithm would however only return true or false, which is not exactly helpful to determine the distance between two languages. A better variation is not to determine if a sequence can be generated, but to determine the likelihood that this sequence is generated. The probability of a sequence is the sum of all the probabilities of its possible parses; the probability of a specific parse is the product of the probabilities of all involved derivation rules.

A few problems remain with this approach: If there is no possible parse for the sequence then the sequence will have a probability of zero. Especially with flat grammar structures this could be caused by data sparsity. It was shown in chapter 3.2.1. that there are $1.03 \cdot 10^{23}$ possible rule combinations
for Penn Treebank derivation rules with a maximum length of 12. Data sparsity is expected to be a
big problem, especially because the two language models will not be based on translations of the
same sentences.
3.5. Introducing N-Gram Models
There are two ways to avoid or mitigate the data sparsity problem. One is to apply some kind of
smoothing to the PCFG model to assign unseen derivations a low but non-zero probability. Smoothing would likely be a quite complicated process, and it is uncertain how successful smoothing on the
PCFG can be.
There is another widely used model in natural language processing, called the n-gram model. The probability of a sentence S occurring in a language L can also be rewritten using the chain rule and conditional probabilities. These probabilities can then be approximated by making the independence assumption that the probability of any specific tag occurring only depends on the (n-1) preceding tags. Each of these sub-sequences containing (n-1) tags plus the following tag constitutes an n-gram. Probabilities are estimated by the usual maximum likelihood estimation.

$$P(S|L) = P(t_1 t_2 t_3 t_4 t_5 \ldots t_n | L) = P(t_1|L) \cdot P(t_2|L,t_1) \cdot P(t_3|L,t_1,t_2) \cdot P(t_4|L,t_1,t_2,t_3) \cdot P(t_5|L,t_1,t_2,t_3,t_4) \cdots$$

$$\text{3-gram:}\quad P(S|L) \approx P(t_1|L) \cdot P(t_2|L,t_1) \cdot P(t_3|L,t_1,t_2) \cdot P(t_4|L,t_2,t_3) \cdot P(t_5|L,t_3,t_4) \cdots$$

$$\text{2-gram:}\quad P(S|L) \approx P(t_1|L) \cdot P(t_2|L,t_1) \cdot P(t_3|L,t_2) \cdot P(t_4|L,t_3) \cdot P(t_5|L,t_4) \cdots P(t_n|L,t_{n-1})$$

Fig. 28: Probability of a sentence occurring in a language according to the chain rule, a trigram and a bigram model.
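For illustration, a minimal bigram sketch over POS tag sequences follows (unsmoothed maximum likelihood, so unseen bigrams get probability zero; the thesis uses Katz-smoothed 5-gram models instead, and this is not the appendix code):

#!/usr/bin/perl
# Unsmoothed bigram model over POS tags, trained by maximum likelihood estimation.
use strict;
use warnings;

my (%bigram, %unigram);

sub train {
    for my $sentence (@_) {
        my @tags = ('<s>', split(' ', $sentence));       # sentence-start marker
        for my $i (1 .. $#tags) {
            $bigram{"$tags[$i-1] $tags[$i]"}++;
            $unigram{ $tags[$i-1] }++;
        }
    }
}

sub sentence_probability {                               # P(S|L) ~ product of P(t_i | t_{i-1})
    my ($sentence) = @_;
    my @tags = ('<s>', split(' ', $sentence));
    my $p = 1;
    for my $i (1 .. $#tags) {
        my $count = $bigram{"$tags[$i-1] $tags[$i]"} // 0;
        return 0 if $count == 0;                         # unseen bigram, no smoothing here
        $p *= $count / $unigram{ $tags[$i-1] };
    }
    return $p;
}

train('NNP VBD NNP', 'NNP VBD DT NN', 'DT NN VBD');      # toy training data
printf "P = %.4f\n", sentence_probability('NNP VBD DT NN');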
The described view is that of an (n-1)-th order Markov model. As in any other Markov model the
past beyond a certain point does not influence the current state. Disregarding parts of the history
of a sentence of course means that in some cases long range dependencies cannot be taken into
account. Precision can be balanced against data sparsity issues by choosing an appropriate value
for n.
This project will use 5-gram models. Of course there is a very high number of 5-grams that will
only be seen a few times or not at all, so again smoothing is crucial. There are some established
smoothing methods for n-gram models. The simplest method is to treat any n-gram like it was seen
at least once. However this technique allocates a relatively high probability to the entire mass of
unseen n-grams, which is why it is usually not used in practice.
A far more sophisticated solution is the Katz back-off smoothing technique, presented in (S. M.
Katz, 1987). This technique leaves probability estimates of very common n-grams as they are, as
the estimation is deemed to be quite reliable due to a lot of evidence. Probability estimates of rarely
seen (less than k times) n-grams are reduced in favour of the probability of unseen n-grams. The
probability of unseen n-grams is estimated recursively by looking at the Katz smoothed (n-1)-gram
model of the data. For this project k was chosen to have a value of 9 in the English model, 5 in the
Chinese model and 10 in the German models.
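In its common textbook form (a simplified rendering, with notation not taken from the thesis; cf. Katz, 1987) the back-off scheme for an n-gram $t_{i-n+1} \ldots t_i$ reads:

$$P_{katz}(t_i \mid t_{i-n+1} \ldots t_{i-1}) = \begin{cases} \frac{C(t_{i-n+1} \ldots t_i)}{C(t_{i-n+1} \ldots t_{i-1})} & \text{if } C(t_{i-n+1} \ldots t_i) > k \\ d \cdot \frac{C(t_{i-n+1} \ldots t_i)}{C(t_{i-n+1} \ldots t_{i-1})} & \text{if } 0 < C(t_{i-n+1} \ldots t_i) \le k \\ \alpha(t_{i-n+1} \ldots t_{i-1}) \cdot P_{katz}(t_i \mid t_{i-n+2} \ldots t_{i-1}) & \text{otherwise} \end{cases}$$

where $C$ denotes n-gram counts, $d$ is a (Good-Turing based) discount factor and $\alpha$ is chosen so that the conditional probabilities still sum to one.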
With n-gram models the problems of directly comparing PCFGs can be avoided. Using n-gram
models instead of PCFG parsing probabilities has three more practical advantages: It is not necessary
to map phrase and clause level annotations between language models, it is not necessary to implement a parser to calculate the probabilities, and finally it is not necessary to create PCFG models for
the foreign languages as the n-gram models can be extracted directly from the corpora.
The practical viability of n-gram models was for example shown in the practical of the lecture Computational Linguistics. The assignment was to program a Parts of Speech tagger that was to be trained on the words and tags of
a part of the Penn Treebank, and then had to assign POS tags to untagged text. It was possible to correctly annotate
between 88% (on texts with previously unseen words) and 98% (on texts without unseen words) of the untagged
text with the help of a bigram model.
3.6. Comparing N-Gram Models
With this definition of a sentence probability the next logical step is to define a distance between two language models. In figure 29 the probabilities of three example sentences under the models derived from English and German texts are stated.

Fig. 29: Sentence probabilities in English (LE) and German (LG) language models
I (PRP) have (VBP) this (DT) book (NN) read (VBN).    P(S|LE) = 0.00001    P(S|LG) = 0.00095
I (PRP) have (VBP) read (VBN) this (DT) book (NN).    P(S|LE) = 0.00095    P(S|LG) = 0.00001
I (PRP) went (VBD) home (NN).                         P(S|LE) = 0.015      P(S|LG) = 0.015

Due to the used definition of sentence probability shorter sentences will usually have a much higher probability than longer sentences. So while it makes sense to compare the probabilities of sentences of the same length between languages, it does not make sense to compare probabilities between sentences of different lengths.
One way to measure how far apart two probability distributions over the sentence space are is the so-called relative entropy or Kullback-Leibler (KL) divergence. Intuitively the total difference between two entire models is the sum of all the differences on a sentence level. The KL divergence of two language models $L_1$ and $L_2$ is defined as $d(L_1 \| L_2) = \sum_i P(i|L_1) \log \frac{P(i|L_1)}{P(i|L_2)}$, which is just a weighted sum of the probability differences between all sentences. The weighting of the sum causes the KL divergence to become asymmetric, which is the reason why it is called KL divergence rather than KL distance.
This definition is of course rather problematic for any real world scenario. The real sentence probability function $P$ is unknown and approximated by the Katz smoothed n-gram model $K$. Hence the real KL divergence can only be approximated with $d(L_1 \| L_2) \approx \sum_i K_{L_1}(i) \log \frac{K_{L_1}(i)}{K_{L_2}(i)}$. The next problem is that the KL divergence is a sum over the entire sentence space. The weighted sum of course means that any sentences with a zero probability can be ignored, but the Katz smoothing of the n-gram model ensured that there are no sentences with a zero probability. Since calculating the probabilities of an infinite number of sentences is out of the question another simplification is needed.
The best way to approximate a weighted sum is to only consider the most important summands with very high weights. Sentences with a high probability $K_{L_1}$ can be found easily in the $L_1$ corpus.
While technically any sentences from the corpus could be used for the approximation of the KL
distance it seemed prudent to use a set of sentences that is separate from the set of sentences that
was used to create the PCFG or n-gram models. For the English Penn Treebank section 00 (containing 1,921 sentences) is used for this task. Similarly in the Chinese and German treebanks the first
1,921 sentences are put aside and are not used to create the language model.
The KL divergence approximation is now defined as $d(L_1 \| L_2) \approx \sum_{i=1}^{n} K_{L_1}(s_i^{L_1}) \log \frac{K_{L_1}(s_i^{L_1})}{K_{L_2}(s_i^{L_1})}$, with $s^{L_1}$ being the set of sentences put aside for the divergence calculation. After removing an infinite chunk of the original weighted sum it is a natural step to normalize the remaining weights so that the sum of the weights adds up to 1 again. But there is good reason to change the weighting entirely: As was mentioned before, using the sentence probability to compare between sentences of different lengths is not very meaningful. Since the sentences in the set $s$ are probably not of a uniform length, using the n-gram probability at this point will certainly bias the results toward short sentences.

One solution is to re-estimate the sentence probability. In the weighted sum only the probability relative to the other sentences in the set $s$ is required. A simple way to estimate this probability is by using a maximum likelihood estimation on the sentences in $s$. In other words each sentence in the set $s$ will be predicted as equally likely, which finally results in the KL divergence approximation $d(L_1 \| L_2) \approx \frac{1}{n} \sum_{i=1}^{n} \log \frac{K_{L_1}(s_i^{L_1})}{K_{L_2}(s_i^{L_1})}$. The original KL divergence was guaranteed to be non-negative. While this will usually still be the case it can, due to the approximations, no longer be guaranteed.

To get from the asymmetrical divergence to a real distance between two language models the divergence and the inverted divergence can be added up: $d(L_1, L_2) := d(L_1 \| L_2) + d(L_2 \| L_1)$.
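A minimal sketch of this computation follows (not the thesis code; the two models are passed in as functions returning a smoothed sentence probability, and the toy scoring functions at the end are invented stand-ins):

#!/usr/bin/perl
# Approximated KL divergence and symmetrised distance between two language models.
use strict;
use warnings;

sub kl_divergence {                   # (1/n) * sum over held-out L1 sentences of log(K_L1(s)/K_L2(s))
    my ($p1, $p2, $heldout) = @_;
    my $sum = 0;
    $sum += log( $p1->($_) / $p2->($_) ) for @$heldout;
    return $sum / scalar(@$heldout);
}

sub kl_distance {                     # d(L1,L2) = d(L1||L2) + d(L2||L1)
    my ($p1, $p2, $heldout1, $heldout2) = @_;
    return kl_divergence($p1, $p2, $heldout1) + kl_divergence($p2, $p1, $heldout2);
}

my $english = sub { 0.9 ** length($_[0]) };          # stand-ins for Katz-smoothed sentence models
my $german  = sub { 0.8 ** length($_[0]) };
printf "d = %.3f\n", kl_distance($english, $german, ['NNP VBD NNP'], ['NNP NNP VBD']);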
3.7. Distance Learning
A controlling instance can now close the loop. It modifies the original English Probabilistic Context
Free Grammar extracted from the Penn Treebank. The output of the PCFG, sentences in POS tag
form, is then fed into a new n-gram model. The new KL divergence is then calculated. Depending on
whether the modified English got closer to the foreign language the modification is kept or reverted.
The process then starts anew.
In a lot of cases more than one change to the original grammar is needed to get closer to a foreign
grammar. In the intermediate steps the KL divergence will rise. Hence the value of greedy algorithms
will be limited. This optimization problem can for example be tackled by a simulated annealing algorithm or by genetic algorithms.
3.8. Language Evolution Model - An Overview
The schema below provides an overview for one complete possible (directed) evolution model for
natural languages. Numbers next to the elements reference the relevant chapter.
[Schema: Sections 02-21 of the Penn Treebank (3.2.1.) provide the English syntax trees and the PCFG. The controller (3.7.) modifies this grammar, e.g. by reordering rules such as VP → VBD NP in the IBM bought Lotus example tree, producing Modified English, from which POS tag streams in the PTB tagset are generated. The German TIGER treebank provides POS tag streams in the STTS tagset. Both streams are mapped to the common tagset (CTS), turned into n-gram models, and compared via the KL divergence d(German||ModEnglish) (3.6.), which the controller tries to minimize.]
Chapter 4
Details & Results
Those who know nothing of foreign languages know nothing of their own.
Johann Wolfgang von Goethe
4.1. Measuring Distances
foreign language LF           d(LE||LF)   d(LF||LE)   d(LE,LF)   |CTS|
Penn Chinese Treebank           31.91       45.29       77.20      18
German Treebank TIGER           17.21       14.38       31.59      32
German Treebank TüBa-D/S        27.89       14.56       42.45      26

Fig. 31: KL divergences and distances of English to the three foreign language models
In a first step the existing divergence between the n-gram model of original English and the other
languages is determined. The first two columns show the KL divergence and the inverse KL divergence between the languages. The third column contains the KL distance, the sum of the KL
divergences, and the fourth column holds the size of the used common tagset. As was already mentioned in chapter 3.3. a smaller common tagset will cause the distance between languages to appear
smaller than it actually is. It is all the more surprising and reassuring that the calculated divergences
between the languages behave as expected.
The divergences between English and Chinese and vice versa are the highest across the board, although the associated common tagset is the smallest with only 18 tags. The model based on TIGER gets closest to English, with TüBa-D/S coming in second. This result was to be expected considering that TIGER is, like the used sections of the Penn Treebank, a collection of newspaper text. TüBa-D/S instead is a collection of spoken dialogue. Its model should inherently be a lot more chaotic as people tend to insert interjections, get interrupted mid-sentence, and generally not adhere to a strict grammar when speaking. A rather interesting result is that the divergence d(L (...)
4.2.
Within the language evolution model as shown in section 3.8. and explained in section 3.5. the
English language PCFG has to be translated into an n-gram model. For this model a number of POS
tag sequences is required. There are three ways of doing this: The mathematically correct way is
to start with the PCFG and generate all possible sentences. These can then be fed into the n-gram
model, and the distribution of POS tag sequences corresponds exactly to the language defined by
the PCFG.
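Enumerating all sentences is impossible in practice; one standard approximation is to sample POS tag sequences from the PCFG instead. A minimal sketch of such a sampler follows (hypothetical data layout and toy grammar, not the thesis implementation; SIGMA stands for the dedicated start symbol):

#!/usr/bin/perl
# Sample POS tag sequences from a PCFG by repeatedly expanding nonterminals
# according to the rule probabilities.
use strict;
use warnings;

my %grammar = (                        # toy grammar; symbols without an entry are terminals (POS tags)
    'SIGMA' => [ [1.0, ['S']] ],
    'S'     => [ [1.0, ['NP', 'VP']] ],
    'NP'    => [ [0.6, ['NNP']], [0.4, ['DT', 'NN']] ],
    'VP'    => [ [0.7, ['VBD', 'NP']], [0.3, ['VBD']] ],
);

sub expand {
    my ($symbol) = @_;
    return ($symbol) unless exists $grammar{$symbol};    # terminal: a POS tag
    my $r = rand();
    for my $rule (@{ $grammar{$symbol} }) {               # pick a rule with its MLE probability
        my ($p, $rhs) = @$rule;
        $r -= $p;
        return map { expand($_) } @$rhs if $r <= 0;
    }
    return map { expand($_) } @{ $grammar{$symbol}[-1][1] };   # numerical fall-through
}

print join(' ', expand('SIGMA')), "\n" for 1 .. 3;        # three sampled POS tag sequences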
(...) these have to be dealt with, either by deleting them too, or by re-attaching them to the tree in a different place. Moreover, the linguistic motivation behind these sentence modifications seems inconclusive. It is also unclear how these new rules and the places they are applied to are chosen.

In a pure PCFG model adding rules can be justified somewhat more easily because these rules are applied automatically while generating sentences. It is also possible to adjust the probabilities of existing rules, making a certain existing construct more probable or obsolete. How new rules are chosen is of course again undefined.
Fig. 32: Possible tree operations to modify a language model: promotion, demotion and sub-tree raising, illustrated on the SBARQ question tree from section 3.2.1.
4.3. Directed Evolution
The perl program presented in the appendix uses the following method to simulate directed evolution: The syntax trees of the English treebank are loaded. Then the program picks any binary rule
that appears at least 300 times in the entire corpus and then swaps all occurrences of this rule in all
the syntax trees. Then the n-gram models are recalculated and the change in the KL divergence is
observed. If the divergence improves, or at least does not grow beyond a certain threshold above
the best known divergence, then the rule change is kept and recorded. Otherwise the rule change
is reverted.
The constraint on which rules are chosen will be relaxed after some time. After 30 simulated
generations rules that occur at least 100 times will also be considered for manipulation. The threshold for keeping grammar modifications is reduced by a small amount in each generation to slowly
enforce reaching a possibly local KL divergence minimum.
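A minimal sketch of this loop follows (not the actual appendix code; the helper routines are stubs, and the threshold values are invented for illustration):

#!/usr/bin/perl
# Sketch of the directed evolution loop: swap a frequent binary rule everywhere,
# keep the change only if the KL divergence does not grow too far beyond the best value.
use strict;
use warnings;

sub pick_binary_rule     { my ($min_count) = @_; return 'NP -> CD NNP' }   # stub
sub swap_rule_everywhere { my ($rule) = @_; }        # stub: would reorder the rule in all syntax trees
sub current_divergence   { return 14.38 - rand(1) }  # stub: rebuild n-gram model, return d(German||ModEnglish)

my $best      = current_divergence();
my $threshold = 0.5;                                 # allowed temporary worsening (invented value)

for my $generation (1 .. 369) {
    my $min_count = $generation <= 30 ? 300 : 100;   # relax the rule constraint after 30 generations
    my $rule = pick_binary_rule($min_count);
    swap_rule_everywhere($rule);
    my $d = current_divergence();

    if ($d < $best + $threshold) {                   # keep the (possibly slightly worse) change
        $best = $d if $d < $best;
        print "generation $generation: kept swap $rule, divergence $d\n";
    } else {
        swap_rule_everywhere($rule);                 # revert by swapping back
    }
    $threshold *= 0.99;                              # reduce the threshold a little each generation
}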
Even with heavy use of efficient data structures, intermediate result caching and partially lazy
evaluation of the n-gram model, simulating language change is a very time-demanding and memory-intensive process. Currently between 20 and 30 seconds are required for each generation and
around 1 GB of RAM is needed for the language models. Most of the processing time is spent for
creating the modified Katz-smoothed n-gram models.
The longest running simulation so far was on English and German/TIGER. It spanned 369 generations and took slightly over three hours. In the end 25 swaps were deemed useful, or at least not too
detrimental, and were kept. The KL divergence between German and English was reduced to 11.00
from the original 14.38.
[Figure: KL divergence (y-axis, from about 16 down to 11) plotted against the simulation generation (x-axis, 0 to 400).]
By monitoring the change of the KL divergence during the simulation it becomes obvious that the restrictions on which rules are eligible for swaps play an important role. Most progress in reducing the KL divergence was made in the first 30 generations, with the minimum reached in the 27th generation. Afterwards the KL divergence appears to be just moving around randomly.
4.4. Interpreting Results
The simulation run to modify English grammar to get closer to the model of German obtained from
the TIGER treebank returned the following 25 rule sets in this order:
NP [CD NNP]
VP [PP VBG]
VP [SBAR VBD]
NP [CD NNP]
VP [S VBD]
VP [VBD VP]
VP [S VBN]
VP [VB VP]
VP [NP VBN]
NP [NNP NNP]
NP [NN PRP$]
VP [PP VBP]
VP [NP VBN]
VP [PP VBG]
NP [ADJP NNS]
NP [NNP NNP]
NP [NP NP]
ADVP [PP RB]
NP [NNP NNPS]
VP [S VBP]
ADJP [CD NN]
VP [MD VP]
VP [PP VBN]
NP [NP SBAR]
VP [ADVP VB]
The notation with brackets is meant to convey that each rule stands for a set of affected rules. So every occurrence of NP → CD NNP will be changed to NP → NNP CD and vice versa. Interestingly this swap improved the KL divergence. This swap is reversed a bit later and improved the result again, which shows the need for simulation methods like simulated annealing that can avoid getting stuck in local KL minima.
This set of rules can be reduced by removing duplicate rules as well as symmetric swaps like NP → NNP NNP. Eventually the set of swaps can be reduced further by randomly removing swaps while monitoring the effect on the KL divergence. With the remaining 11 rule changes the KL divergence is improved to 10.58:
ADVP [PP RB]
NP [NP SBAR]
VP [MD VP]
VP [S VBD]
VP [S VBN]
VP [S VBP]
VP [SBAR VBD]
VP [PP VBN]
VP [PP VBP]
VP [VB VP]
VP [VBD VP]
Fig. 38: A sample sentence in original English, in Modified English after applying the swaps, and in German:
English:          The (DT) new (JJ) rate (NN) will (MD) be (VB) payable (JJ) Feb. 15 (NNP CD)
Modified English: The (DT) new (JJ) rate (NN) be (VB) payable (JJ) Feb. 15 (NNP CD) will (MD)
German:           Die (DT) neue (JJ) Rate (NN) wird (VBP) am 15. Feb. (DT CD NNP) fällig (JJ) werden (VB)
Most of these rules are concerned with verbs. By applying these swaps to a sample text their effect can be observed and interpreted (Fig. 38). In this case the rule changes have the effect that a verb will move to the end of the sentence or after another constituent. Generally moving verbs to the end is a good start to get closer to the German language. In the shown example moving the modal verb to the end of the sentence is one of the required changes to achieve the same POS tag sequence. The other required changes are swapping NP → [CD NNP], which was actually found earlier, and ADJP → [JJ NP]. The remaining differences, the introduction of a second determiner and the replacement of the verb will (MD) with its German equivalent werden (VB), cannot be achieved by swapping.
It is obvious that there are limits to how close one can get to another language. With swaps alone
it is not possible to reach the POS tag distribution of the foreign language. So if one language uses
more verbs than the other, it will lead to a lower bound on the KL divergence. Then there are limits
that are caused by the PCFG model itself. German for example is a V2 language, which means that in
a declarative sentence the second constituent is always a verb. It is however not possible to express
such a constraint in a PCFG (without resorting to tricks that lead to a sharp increase in the number of
non-terminals and an equally sharp decrease in the usefulness of the grammar). One should therefore not expect to find a finite set of swapping rules that would lead to a KL divergence of zero.
Chapter 5
Conclusion & Outlook
There is nothing so trivial as a grammar, and scarce any thing so rare as a good
grammar.
Guy Miège
In this thesis a novel statistical method to compare language grammars was devised. This method
would work without any particular knowledge about the languages' heritage. This method was then
developed into a complete formal model for directed evolution of natural languages. The model was
implemented and put to the test with the help of annotated Chinese, English, and German treebank
data. Results show that it is possible to find meaningful grammatical differences that could be used
in other fields of natural language processing.
While working out the intricacies of the evolutionary model, several problems with the suitability of treebank data for cross-language research in general and comparative grammar research in
particular were identified. In the case of mapping different POS tagsets a complete workable solution was demonstrated, in other cases viable workarounds were presented. The advantages and
drawbacks of using probabilistic context free grammars to reproduce natural language change were
investigated.
The conclusion is that statistical approaches to model grammar change are workable. There is
however a distinct lack of historical data to create historically accurate evolutionary models. There is
also a distinct need for a common tagset or even a sufficiently general and simple common feature
annotation that is consistent and consistently used across languages. Existing tagsets are usually
created with one specific language in mind, and strongly reflect specific language patterns and may
to a certain point even convey the authors mentality. These properties in turn deter others from
reusing an existing tagset or annotation.
The project PROIEL at the University of Oslo aims to create a parallel corpus in a variety of ancient Indo-European languages (Greek, Latin, Old Armenian, Old Church Slavonic and Gothic). This
may become a very useful resource for cross-language research and it is certainly sensible to keep
an eye on it.
The specific implementation of a directed language evolver presented in this thesis has yet to be
brought to its full potential. There are still a lot of options to optimize for speed (e.g. do not recalculate the Katz n-gram model from scratch, instead change it according to the changes in the POS tag
sequences) to get a lower simulation time per generation. It will then become viable to do simulation
runs on a larger scale that may get a lot closer to the foreign language, and give a larger and more
detailed set of proposed changes.
The presented grammar evolver is a novel model. Further research should go into determining exactly how close this model can evolve one language towards another. The grammar evolver should also be extended to allow different kinds of changes to the syntax trees. To speed up the simulation, and the subsequent interpretation, rules could be changed according to some pattern: for example, in all derivation rules containing a verb POS tag at the front and a noun phrase, the verb could be swapped with the right-most instance of a noun phrase. It may also be worthwhile to investigate and implement promotion, demotion and sub-tree raising operations. Replacing the evolution controller with a more sophisticated implementation of a simulated annealing algorithm would also likely improve the overall results.
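As a sketch of what such a controller could look like, the acceptance rule below replaces the fixed evolution-freedom threshold of the implementation in Section 7.4 with a Metropolis-style criterion and a slowly cooling temperature; the function and parameter names are illustrative and not part of the existing code.

# Illustrative simulated-annealing controller (not the thesis implementation):
# every improving swap is kept, a worsening swap is kept with a probability
# that shrinks as the temperature cools, so early generations explore freely
# while late generations only fine-tune.
sub anneal_accept {
    my ($old_score, $new_score, $temperature) = @_;
    return 1 if ($new_score >= $old_score);    # always keep improvements
    return 0 if ($temperature <= 0);
    return rand() < exp(($new_score - $old_score) / $temperature);
}

my $temperature = 2.0;      # plays roughly the role of $evolutionfreedom
my $cooling     = 0.999;    # multiplicative cooling per generation
# Inside the generation loop, instead of the fixed threshold tests:
#   if (anneal_accept($kll, $kld2, $temperature)) { $kll = $kld2; }   # keep $mutation
#   else                                          { mutate_treebank $mutation; }   # swap back
#   $temperature *= $cooling;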
Chapter 6
References
There are worse crimes than burning books. One of them is not reading them.
Joseph Brodsky
Cangelosi, A., and Parisi, D., 2002, Computer simulation: A new scientific approach to the study of language evolution, in Simulating language evolution (A. Cangelosi and D. Parisi, Eds.), London: Springer-Verlag, pp. 3-28.
Dryer, M., Haspelmath, M., Gil, D., Comrie, B., 2005, World Atlas of Language Structures.
Hare, M., Elman, J. L., 1995, Learning and morphological change. Cognition, 56(1):61-98.
Hockenmaier, J., Steedman, M., 2002, Acquiring Compact Lexicalized Grammars from a Cleaner Treebank, in Proceedings of the Third LREC Conference, Las Palmas, Spain.
Katz, S. M., 1987, Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401.
Kayne, R. S., 2005, Some Notes on Comparative Syntax, in The Oxford Handbook of Comparative Syntax, pp. 3-69.
Lewis, P. M., Stearns, R. E., 1968, Syntax-Directed Transduction. J. ACM 15, 3 (Jul. 1968), 465-488.
Lohöfer, A., 2007, UE History of English, Philipps-Universität Marburg, 23/04/07, Session 2: Historical Linguistics and Linguistic Reconstruction, [Link]
Martín-Vide, C., 2003, Formal Grammars and Languages, in R. Mitkov, ed., Oxford Handbook of Computational Linguistics: 157-177. Oxford University Press, Oxford.
Mitchell, P. M., Santorini, B., Marcinkiewicz, M. A., 1993, Building a Large Annotated Corpus of English: The Penn Treebank, in Computational Linguistics, Volume 19, Number 2 (June 1993), pp. 313-330.
Mohri, M., Sproat, R., 2006, On a Common Fallacy in Computational Linguistics, in A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday.
Picone, J., Lecture 33: Smoothing N-Gram Language Models, ECE 8463: Fundamentals of Speech Recognition, Mississippi State University, [Link] courses/ece_8463/lectures/current/lecture_33/[Link]
Searls, D., 1993, The computational linguistics of biological sequences. In Artificial Intelligence and Molecular Biology (Hunter, L., ed.), pp. 47-120, AAAI Press.
Turner, H., 2002, An introduction to methods for simulating the evolution of language, in Simulating language evolution (A. Cangelosi and D. Parisi, Eds.), London: Springer-Verlag, pp. 29-50.
Vijay-Shanker, K., Weir, D., 1994, The equivalence of four extensions of context-free grammar. Mathematical Systems Theory, 27:511-546.
Wu, D., 1997, Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23:377-404.
Xia, F., 2000, The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0), October 17, 2000.
Xue, N., Chiou, F.-D., Palmer, M., 2002, Building a Large-Scale Annotated Chinese Corpus, Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan.
Chapter 7
Appendix
7.1. The Penn Treebank Tagset
7.1.1. Clause Level Annotation
S       Simple declarative clause
SBAR    Clause introduced by a (possibly empty) subordinating conjunction
SBARQ   Direct question introduced by a wh-word or wh-phrase
SINV    Inverted declarative sentence
SQ      Inverted yes/no question, or main clause of a wh-question
7.1.2. Phrase Level Annotation
ADJP    Adjective Phrase
ADVP    Adverb Phrase
CONJP   Conjunction Phrase
FRAG    Fragment
INTJ    Interjection
LST     List marker. Includes surrounding punctuation.
NAC     Not a Constituent (within an NP)
NP      Noun Phrase.
NX      Used within NPs to mark the head of the NP.
PP      Prepositional Phrase.
PRN     Parenthetical.
PRT     Particle. Category for words that should be tagged RP.
QP      Quantifier Phrase within NP.
RRC     Reduced Relative Clause.
UCP     Unlike Coordinated Phrase.
VP      Verb Phrase.
WHADJP  Wh-adjective Phrase.
WHADVP  Wh-adverb Phrase.
WHNP    Wh-noun Phrase.
WHPP    Wh-prepositional Phrase.
X       Unknown, uncertain, or unbracketable.
7.1.3. Word Level Annotation (Part-of-Speech Tags)
CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner
EX      Existential there
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol
TO      to
UH      Interjection
VB      Verb, base form
VBD     Verb, past tense
VBG     Verb, gerund or present participle
VBN     Verb, past participle
VBP     Verb, non-3rd person singular present
VBZ     Verb, 3rd person singular present
WDT     Wh-determiner
WP      Wh-pronoun
WP$     Possessive wh-pronoun
WRB     Wh-adverb
Additional parts of speech tags are used for currency signs, brackets and punctuation including quotation marks.
7.2. Mapping the Penn Chinese Treebank Tagset

CTB tag   mapped PTB tag
AD        RB
AS        X
BA        X
CC        CC
CD        CD
CS        IN
DEC       VBG
DEG       POS
DER       DER (rewritten to V in a post-processing step)
DEV       DEV (a preceding verb tag together with DEV is rewritten to RB)
DT        DT
ETC       CC
FW        FW
IJ        UH
JJ        JJ
LB        X
LC        IN
M         NN
MSP       TO
NN        NN (the variants NN-OBJ, NN-SBJ and NN-SHORT also map to NN)
NP        NN
NR        NNP (the variants NR-PN and NR-SHORT also map to NNP)
NT        NN (NT-SHORT also maps to NN)
OD        CD
P         IN
PN        PRP
PU        PU (rewritten to . or : in a post-processing step)
SB        X
SP        X
VA        VA (rewritten to JJ, or to RB together with a following DEV)
VC        V
VE        V
VP        V
VV        V
X         X
PTB tags of the English text have to be mapped as follows (second translation function):
EX → X, LS → X, RP → X, SYM → X (particles without counterpart)
JJR → JJ, JJS → JJ (comparative/superlative adjectives)
RBR → RB, RBS → RB, WRB → RB
NNS → NN, NNPS → NNP (plural forms not marked in CTB)
PDT → CD
WDT → DT
MD → V, VB → V, VBD → V, VBN → V, VBP → V, VBZ → V
WP → PRP, WP$ → PRP, PRP$ → PRP
$ → NN, # → NN (currency symbols)
All other tags remain unchanged.
7.3. Mapping the German STTS Tagset

STTS tag / usage / mapped PTB tag
ADJA      attributive adjective ([das] große [Haus]); maps to JJ (JJR/JJS for comparative/superlative forms)
ADJD      adverbial or predicative adjective; maps to JJ (JJR/JJS for comparative/superlative forms)
ADV       adverb; maps to RB
APPR      preposition; maps to IN
APPRART   preposition with article; maps to IN
APPO      postposition; maps to IN (has a similar function to IN)
APZR      right part of a circumposition; maps to IN (has a similar function to IN)
ART       article; maps to DT
CARD      cardinal number; maps to CD
FM        foreign language material; maps to FW
ITJ       interjection; maps to UH
KOUI      subordinating conjunction with infinitive; maps to IN
KOUS      subordinating conjunction with sentence; maps to IN
KON       coordinating conjunction; maps to CC
KOKOM     comparative conjunction; maps to IN (like)
NN        common noun (Tisch, Herr, [das] Reisen); maps to NN (NNS for plural forms)
NE        proper noun (Hans, Hamburg, HSV); maps to NNP (NNPS for plural forms)
NNE       combination of common and proper noun (Theodor-Heuss-Stiftung, Niger-Delta); maps to NNP
PDS       substituting demonstrative pronoun; maps to DT
PDAT      attributive demonstrative pronoun; maps to DT
PIS       substituting indefinite pronoun; maps to NN
PIAT      attributive indefinite pronoun; maps to DT (any)
PIDAT     attributive indefinite pronoun with determiner; maps to NN (a bit/NN of water). A mapping to DT would also be possible (both/DT brothers), but NN seems to be closer to the function of most of these words.
PPER      irreflexive personal pronoun; maps to PRP
PPOSS     substituting possessive pronoun; maps to PRP
PPOSAT    attributive possessive pronoun; maps to PRP$
PRELS     substituting relative pronoun; maps to WP. Could also be mapped to WDT.
PRELAT    attributive relative pronoun; maps to WP$ (whose)
PRF       reflexive personal pronoun; maps to PRP (myself)
PWS       substituting interrogative pronoun; maps to WP
PWAT      attributive interrogative pronoun; maps to WP$
PWAV      adverbial interrogative or relative pronoun; maps to WRB (why, when, where)
PAV / PROAV (TIGER)   pronominal adverb; maps to RB (hence, therefore)
PTKZU     zu before infinitive (zu [gehen]); maps to TO
PTKNEG    negating particle (nicht); maps to RB
PTKVZ     separated verb particle ([er kommt] an, [er fährt] rad); maps to RP. These constructions do appear in the English language: being better off, to break down. Particles are tagged RP.
PTKANT    answer particle (ja, nein, danke, bitte); maps to UH. These answer particles are not tagged consistently in the Penn Treebank: Yes/UH, Please/UH, Thanks/NNS, Thank/VB you, Please/VB, No/DT.
PTKA      particle with adjective or adverb; maps to RB (too/RB). Other valid mappings would be more/JJR and most/JJS.
TRUNC     truncated element (An- [und Abreise]); maps to (:). Similar constructions are annotated differently in the Penn Treebank, where the suspended hyphen is considered a separate sentence element: third -/: and fourth-quarter growth.
VVFIN     finite full verb; maps to VBZ, VBP or VBD according to number and tense information
VVIMP     imperative, full verb; maps to VB
VVINF     infinitive, full verb; maps to VB
VVIZU     infinitive with zu, full verb; maps to VB
VVPP      past participle, full verb; maps to VBN
VAFIN     finite auxiliary verb; maps to VBZ, VBP or VBD according to number and tense information
VAIMP     imperative, auxiliary verb; maps to VB
VAINF     infinitive, auxiliary verb; maps to VB
VAPP      past participle, auxiliary verb; maps to VBN
VMFIN     finite modal verb; maps to MD
VMINF     infinitive, modal verb; maps to MD
VMPP      past participle, modal verb; maps to MD
XY        non-word containing special characters; maps to SYM
$.        final punctuation (. ? ! ; :); maps to (.)
$,        comma; maps to (,)
$(        other punctuation marks (- [,]()); maps to (:)
PTB tags of the English text have to be mapped as follows (second translation function):
VBG → NN
EX → PRP (like PPER)
LS → SYM
PDT → DT (like PIAT)
POS → PRP$ (like PPOSAT)
WDT → WP (like PRELS)
RBR → RB, RBS → RB (comparative and superlative forms of adverbs)
$ → NN, # → NN (currency symbols)
All other tags remain unchanged.
7.4.
Code: [Link]
#!/usr/bin/perl
#
#
# "Perl terrifies me, it looks like an explosion in an ascii factory."
#                                        -- svunt(916464), Slashdot
use strict;
use ThsFunctions;
use ThsCorpora;
use constant OTHERLANGUAGE => 0;  # Which foreign language should be targeted?
                                  # 0 = German/TIGER
                                  # 1 = German/TueBadS
                                  # 2 = Chinese/CTB
use constant SAMPLELANGUAGE => 1; # From where should the 1921 sample sentences be taken?
                                  # 0 = English/PTB (section 00)
                                  # 1 = other selected language
my (@treebank, @samplesentences, %derivations, @derivationslist);
my $MEnglishModel; # Model of the modified English language, obtained from leaves of the trees
# A) Load sample sentences (eg sect 00 PTB), POS-Streams
if (SAMPLELANGUAGE == 0) {
my $wsj = read_sentences_from_file('[Link]-sect0');
@samplesentences = @{(parse_PTB_file($wsj))[0]};
print "Read ". (0+ @samplesentences). " English sampling sentences.\n";
}
# B) read GER-Treebank, extract POS-streams, create Katz-model
my $GermanTIGERModel;
if (OTHERLANGUAGE == 0) {
my @TIGERstreams = @{ get_TIGER_pos_streams() };
if (SAMPLELANGUAGE == 1) {
for (my $n = 0; $n <= 1920; $n++) {
$samplesentences[$n] = shift @TIGERstreams;
}
print "Set aside ". (0+@samplesentences) ." German/TIGER sentences for sampling, ". (0+@TIGERstreams)
. " remain.\n";
} else {
for (my $n = 0; $n < @samplesentences; $n++) {
$samplesentences[$n] = prepare_sentence_for_TIGER $samplesentences[$n];
}
print "Converted ". (0+@samplesentences) ." sample sentences to German/TIGER format.\n";
}
}
# C) ditto for German treebank 2 (TueBadS)
my $GermanTueBadSModel;
if (OTHERLANGUAGE == 1) {
my @TueBadSstreams = @{ get_TueBadS_pos_streams() };
if (SAMPLELANGUAGE == 1) {
for (my $n = 0; $n <= 1920; $n++) {
$samplesentences[$n] = shift @TueBadSstreams;
}
print "Set aside ". (0+@samplesentences) ." German/TueBadS sentences for sampling, ".
(0+@TueBadSstreams). " remain.\n";
} else {
for (my $n = 0; $n < @samplesentences; $n++) {
$samplesentences[$n] = prepare_sentence_for_TueBadS $samplesentences[$n];
}
print "Converted ". (0+@samplesentences) ." sample sentences to German/TueBadS format.\n";
}
}
# D) ditto for the Chinese treebank (CTB)
my $ChineseTBModel;
if (OTHERLANGUAGE == 2) {
my @Chinesestreams = @{ get_CTB_pos_streams() };
if (SAMPLELANGUAGE == 1) {
for (my $n = 0; $n <= 1920; $n++) {
$samplesentences[$n] = shift @Chinesestreams;
}
print "Set aside ". (0+@samplesentences) ." Chinese sentences for sampling, ". (0+@Chinesestreams)
. " remain.\n";
} else {
for (my $n = 0; $n < @samplesentences; $n++) {
$samplesentences[$n] = prepare_sentence_for_CTB $samplesentences[$n];
}
}
}
# E)
# 1. load WSJ
# 2. extract syntax trees
{
my $wsj = read_sentences_from_file('[Link]');
my (@wsjsentences, @wsjwords);
{
my @parseptb = parse_PTB_file($wsj);
@wsjsentences = @{$parseptb[1]};
@wsjwords = @{$parseptb[4]};
}
foreach (@wsjsentences) {
my (%sentencedescr, @sentencetree, $nodenumber);
do {
# Process innermost layer, replace brackets by their descriptors
s/\(([^()\s]+)\s([^()]+)\)/
my ($descr, $deriv) = ($1, $2);
while ($deriv =~ s|\s*-(NONE\|LRB\|RRB)-\s*| |g) {};
while ($deriv =~ s|(^\.\|\.$\|\s\.\s)| |g) {};
while ($deriv =~ s|(^\,\|\,$\|\s\,\s)| |g) {};
while ($deriv =~ s|(^``\|``$\|\s``\s)| |g) {};
while ($deriv =~ s|(^''\|''$\|\s''\s)| |g) {};
$deriv =~ s|(^\s+\|\s+$)||go;
$descr = substr($descr, 0, index($descr, '-')) if (index($descr, '-') > 0);
$descr = substr($descr, 0, index($descr, '=')) if (index($descr, '=') > 0);
if ($deriv ne '') {
$nodenumber++;
push @{$sentencetree[$nodenumber]}, $descr;
my $rule = '';
foreach (split ' ', $deriv) {
push @{$sentencetree[$nodenumber]}, $_;
$rule .= ' '. ($sentencetree[$_])->[0] if ($_ > 0);
$rule .= ' '. $_
unless ($_ > 0);
}
my $sortedderivation = $descr . ' -> [' . (join ' ', sort split ' ', $rule) . ']';
push @{$sentencedescr{$sortedderivation}}, $nodenumber;
$derivations{$sortedderivation}++ if (($sortedderivation =~ tr: : :) == 3);
# Only list binary derivations
$descr = $nodenumber;
}
($deriv eq '') ? '-NONE-' : $descr/geo;
# Process root nodes, brackets that contain only one descriptor
s/\s*\(([^()\s]+)\)\s*/$sentencetree[0] = $1; " "/geo;
} until /^\s?$/o;
$sentencedescr{'tree'} = \@sentencetree;
if (@wsjwords > 0) {
$sentencedescr{'orig'} = shift @wsjwords;
my @sentencewords = split ' ', $sentencedescr{'orig'};
my (@wordtree, @stack);
$wordtree[0] = $sentencetree[0];
for (my $i = 1; $i < @sentencetree; $i++) {
my @s = @{$sentencetree[$i]};
for (my $j = 1; $j < @s; $j++) {
$s[$j] = '.' if ($s[$j] <= 0);
}
$wordtree[$i] = \@s;
}
push @stack, $wordtree[0];
do {
my $node = pop @stack;
if ($node > 0) {
push @stack, ($_ > 0 ? $_ : -$node) foreach (reverse @{$sentencetree[$node]});
pop @stack; # last element is [0] = parent node
} else {
for (my $i = 0; $i < @{$wordtree[-$node]}; $i++) {
if ($wordtree[-$node]->[$i] eq '.') {
$wordtree[-$node]->[$i] = '-'. (shift @sentencewords);
$i = 1 + @{$wordtree[-$node]};
}
}
}
} while (@stack > 0);
}
$sentencedescr{'word'} = \@wordtree;
# 3. Generate POS-streams
# 4. Generate Katz-smoothed model
sub extract_leaves {
my (@stack, $leafstream);
my $tree = @_[0];
push @stack, $tree->[0];
print $stack[0];
do {
$_ = pop @stack;
if ($_ > 0) {
push @stack, $_ foreach (reverse @{$tree->[$_]});
pop @stack; # last element is [0] = parent node
} else {
$leafstream .= ' ' . $_;
}
} while (@stack > 0);
return $leafstream;
}
sub update_model {
print "Updating POS-model\n";
my @POSstreams;
for (my $n = 0; $n < @treebank; $n++) {
if ($treebank[$n]->{'pos'} eq '') {
my $POSstream = extract_leaves($treebank[$n]->{'tree'});
#
print "sentence $n stream : $POSstream\n";
$POSstream = 'START'.$POSstream.' STOP';
$POSstream = prepare_sentence_for_TIGER($POSstream)
if (OTHERLANGUAGE == 0);
$POSstream = prepare_sentence_for_TueBadS($POSstream) if (OTHERLANGUAGE == 1);
$POSstream = prepare_sentence_for_CTB($POSstream)
if (OTHERLANGUAGE == 2);
$treebank[$n]->{'pos'} = $POSstream;
if (defined $treebank[$n]->{'word'}) {
my $words = extract_leaves($treebank[$n]->{'word'});
while ($words =~ s/ -/ /go) {};
$treebank[$n]->{'words'} = $words;
}
}
push @POSstreams, $treebank[$n]->{'pos'};
$samplesentenceprobabilities[$n] = $lpL2;
#update_model;
#print_treebank \@treebank;
#
# F)
# 1. Select a binary derivation
sub pick_derivation {
my $minpick = shift;
print "Picking one random derivation out of ". (0+@derivationslist). ": ";
my $pick;
do {
$pick = $derivationslist[rand @derivationslist];
} while ($derivations{$pick} < $minpick);
print "$pick (occurs ". $derivations{$pick} ." times)\n";
return $pick;
}
# 2. Swap
sub mutate_treebank {
my $pick = shift;
for (my $n = 0; $n <= $#treebank; $n++) {
if (defined $treebank[$n]->{$pick}) {
foreach (@{$treebank[$n]->{$pick}}) {
($treebank[$n]->{'tree'}->[$_]->[1], $treebank[$n]->{'tree'}->[$_]->[2])
= ($treebank[$n]->{'tree'}->[$_]->[2], $treebank[$n]->{'tree'}->[$_]->[1]);
($treebank[$n]->{'word'}->[$_]->[1], $treebank[$n]->{'word'}->[$_]->[2])
= ($treebank[$n]->{'word'}->[$_]->[2], $treebank[$n]->{'word'}->[$_]->[1])
if (defined $treebank[$n]->{'word'});
}
$treebank[$n]->{'pos'} = '';
}
my $kld = $sumpos - $sumneg;
print "KL DIVERGENCE = $sumpos - $sumneg = $kld = avg. ". $kld / @samplesentences ."\n";
return $kld / @samplesentences;
my $kld; # = kldivergence;
my $kll = $kld;
my $generation = 0;
my $evolutionfreedom = 2;
my $evolutionpressure = 0.002;
while ($generation < 2000) {
++$generation;
my $mutation;
if ($generation < 30) {
$mutation = pick_derivation 300;
} elsif ($generation < 100) {
$mutation = pick_derivation 100;
} elsif ($generation < 600) {
$mutation = pick_derivation 10;
} else {
$mutation = pick_derivation 1;
}
mutate_treebank $mutation;
update_model;
my $kld2 = kldivergence;
if ($kld2 > $kld) {
$kld = $kll = $kld2;
print "G$generation - This is much better! keep $mutation, KLD=$kld\n";
} elsif ($kld2 > $kll) {
$kll = $kld2;
print "G$generation - This is better! keep $mutation, KLD=$kld, EF=$evolutionfreedom\n";
} elsif ($kld2 > $kld - $evolutionfreedom) {
$kll = $kld2;
print "G$generation - This is within reason. keep $mutation, KLD=$kld, EF=$evolutionfreedom\n";
} else {
print "G$generation - Doesn't help. Reverting...\n";
mutate_treebank $mutation;
}
$evolutionfreedom -= $evolutionpressure;
$evolutionfreedom = 0 if ($evolutionfreedom < 0);
}
# 6. If closer to foreign language then keep, otherwise revert (or if within threshold do random walk)
# 7. Contd. at F1
7.5.
Code: [Link]
#!/usr/bin/perl
use strict;
package ThsCorpora;
require Exporter;
our (@ISA, @EXPORT);
@ISA = qw(Exporter);
@EXPORT = qw(read_sentences_from_file parse_PTB_file
get_CTB_pos_streams prepare_sentence_for_CTB
get_TIGER_pos_streams prepare_sentence_for_TIGER
get_TueBadS_pos_streams prepare_sentence_for_TueBadS
);
use constant DEBUG => 0;
sub read_sentences_from_file {
my ($filename) = @_;
print "Reading $filename\n";
# Slurp input file
$_ = do { local(@ARGV, $/) = "<$filename"; <> };
print 'Input   : '. length() ." bytes\n";
# Cleaning up:
# - Remove redundant whitespace near brackets
# - Collapse consecutive whitespace including line endings
s/\s\s+/ /g;
s/\s+\)/\)/g;
s/\(\s+/\(/g;
tr/\n/ /;
print 'Cleaned : '. length() ." bytes\n";
return $_;
}
sub parse_PTB_file {
# Extract sentences:
# - Find all substrings with a matching number of opening
#   and closing parentheses
my @sentences;
{
my $wsj = @_[0];
my $bufright = 0;
do {
my $buffer;
my $openbrackets = 1;
my $bufleft = $bufright;
do {
$bufright = 1 + index $wsj, ')', $bufright for 1 .. $openbrackets;
$buffer = substr $wsj, $bufleft, $bufright - $bufleft;
$openbrackets = ($buffer =~ tr/(/(/) - ($buffer =~ tr/)/)/);
} while ($openbrackets > 0);
push @sentences, $buffer if ($bufright > 0);
#
}
print "Extracted: ". (0 + @sentences) ." sentences\n";
}
}
# get_CTB_pos_streams
#
# Returns a reference to an array containing all streams of POS tags
# from the Chinese (Penn) tree bank
sub get_CTB_pos_streams {
my $filename = '[Link]';
$_ = do { local(@ARGV, $/) = "<$filename"; <> };
tr/\n/ /; s/\s\s+/ /g;
my (@sentences, $numsent, $numtags);
s/<S [^>]*>(.*?)<\/S>/push @sentences, $1; ++$numsent;/geo;
$_ = <<'CTBMAP';
AD       RB
AS       X
BA       X
CC       CC
CD       CD
CS       IN
DEC      VBG
DEG      POS
DT       DT
ETC      CC
FW       FW
IJ       UH
JJ       JJ
LB       X
LC       IN
M        NN
MSP      TO
NN       NN
NN-OBJ   NN
NN-SBJ   NN
NN-SHORT NN
NP       NN
NR       NNP
NR-PN    NNP
NR-SHORT NNP
NT       NN
NT-SHORT NN
OD       CD
P        IN
PN       PRP
SB       X
SP       X
VC       V
VE       V
VP       V
VV       V
VA       VA
DER      DER
PU       PU
DEV      DEV
X        X
CTBMAP
tr/\n\t/ /; s/\s+/ /g; split ' ';
my %CTBmap;
$CTBmap{(pop)} = pop while (@_);
my @posstreams;
foreach (@sentences) {
my $posstream = '';
s/\(([^ ()]+) [^ ()]+\)/$posstream .= ' ' . $CTBmap{$1} if ($1 ne '-NONE-'); ++$numtags;/geo;
$posstream .= ' ';
while ($posstream =~ s/ V DEV / RB /go) {};
while ($posstream =~ s/ VA DEV / RB /go) {};
while ($posstream =~ s/ DER / V /go) {};
$posstream =~ s/ PU $/ . /go;
while ($posstream =~ s/ PU / : /go) {};
while ($posstream =~ s/ VA / JJ /go) {};
push @posstreams, 'START'. $posstream .'STOP';
}
print "ChineseTB: read $numsent POS streams with a total of $numtags tags\n";
return \@posstreams;
}
# prepare_sentence_for_CTB $sentence
#
# Replaces PTB tags that have no CTB tags mapped to them
sub prepare_sentence_for_CTB {
my ($sentence) = @_;
while ($sentence =~ s/ EX / X /go) {};
while ($sentence =~ s/ LS / X /go) {};
while ($sentence =~ s/ RP / X /go) {};
while ($sentence =~ s/ SYM / X /go) {};
while ($sentence =~ s/ MD / V /go) {};
while ($sentence =~ s/ VB / V /go) {};
while ($sentence =~ s/ VBD / V /go) {};
while ($sentence =~ s/ VBN / V /go) {};
while ($sentence =~ s/ VBP / V /go) {};
while ($sentence =~ s/ VBZ / V /go) {};
return $sentence;
}
# get_TIGER_pos_streams
#
# Returns a reference to an array containing all streams of POS tags
# from the TIGER tree bank
sub get_TIGER_pos_streams {
print "Loading TIGER...\n";
my $filename = "TIGER/corpus/tiger_release_aug07.xml";
$_ = do { local(@ARGV, $/) = "<$filename"; <> };
tr/\n/ /; s/\s\s+/ /g;
my ($numsent, $numtags) = (0, 0);
my @sentences;
s/<terminals>(.*?)<\/terminals>/push @sentences, $1; ++$numsent;/geo;
$_ = <<'STTSMAP';
ADJA-Pos     JJ
ADJA-Comp    JJR
ADJA-Sup     JJS
ADJA-*.*.*.* JJ
ADJD-Pos     JJ
ADJD-Comp    JJR
ADJD-Sup     JJS
ADV          RB
APPR         IN
APPRART      IN
APPO         IN
APZR         IN
ART          DT
CARD         CD
FM           FW
ITJ          UH
KOUI         IN
KOUS         IN
KON          CC
KOKOM        IN
NN-*         NN
NN-Sg        NN
NN-Pl        NNS
NE-*         NNP
NE-Sg        NNP
NE-Pl        NNPS
NNE          NNP
PDS          DT
PDAT         DT
PIS          NN
PIAT         DT
PIDAT        NN
PPER         PRP
PPOSS        PRP
PPOSAT       PRP$
PRELS        WP
PRELAT       WP$
PRF          PRP
PWS          WP
PWAT         WP$
PWAV         WRB
PAV          RB
PROAV        RB
PTKZU        TO
PTKNEG       RB
PTKVZ        RP
PTKANT       UH
PTKA         RB
TRUNC        :
[Link]       VBZ
[Link]       VBZ
VVFIN-Pres   VBP
VVFIN-Past   VBD
VVIMP        VB
VVINF        VB
VVIZU        VB
VVPP         VBN
[Link]       VBZ
[Link]       VBZ
VAFIN-Pres   VBP
VAFIN-Past   VBD
VAIMP        VB
VAINF        VB
VAPP         VBN
VMFIN        MD
VMINF        MD
VMPP         MD
XY           SYM
$,           ,
$.           .
$(           :
STTSMAP
tr/\n\t/ /; s/\s+/ /g; split ' ';
my %STTSmap;
$STTSmap{(pop)} = pop while (@_);
my $regex = "";
$regex .= ' '.$_.'="([^">]+)"([^>]*)' foreach ('pos', 'morph', 'number', 'degree', 'tense');
my @posstreams;
foreach (@sentences) {
my $posstream = '';
s/$regex/$posstream .= ' '; $posstream .= $STTSmap{$1}
|| $STTSmap{$1.'-'.$3}
|| $STTSmap{$1.'-'.$5}
|| $STTSmap{$1.'-'.$7}
|| $STTSmap{$1.'-'.$9}
|| "unknown:$1:$3:$5:$7:$9 "; ++$numtags;/geo;
$posstream =~ s/ [.,]//go; # filter . and ,
push @posstreams, 'START'. $posstream .' STOP';
}
print "TIGER: read $numsent POS streams with a total of $numtags tags\n";
return \@posstreams;
}
# prepare_sentence_for_TIGER $sentence
#
# Replaces PTB tags that have no TIGER tags mapped to them
# (the substitution targets follow the second translation function in Section 7.3)
sub prepare_sentence_for_TIGER {
my ($sentence) = @_;
while ($sentence =~ s/ VBG / NN /go) {};
while ($sentence =~ s/ EX / PRP /go) {};
while ($sentence =~ s/ LS / SYM /go) {};
while ($sentence =~ s/ PDT / DT /go) {};
while ($sentence =~ s/ POS / PRP\$ /go) {};
while ($sentence =~ s/ WDT / WP /go) {};
while ($sentence =~ s/ RBR / RB /go) {};
while ($sentence =~ s/ RBS / RB /go) {};
while ($sentence =~ s/ \$ / NN /go) {};
while ($sentence =~ s/ \# / NN /go) {};
return $sentence;
}
# get_TueBadS_pos_streams
#
# Returns a reference to an array containing all streams of POS tags
# from the TueBadS tree bank
sub get_TueBadS_pos_streams {
my $filename = "TueBadS/[Link]";
$_ = do { local(@ARGV, $/) = "<$filename"; <> };
tr/\n/ /; s/\s\s+/ /g;
my ($numsent, $numtags) = (0, 0);
my @sentences;
s/<sentence([^>]*)>(.*?)<\/sentence>/push @sentences, $2; ++$numsent;/geo;
$_ = <<'STTSMAP';
ADJA     JJ
ADJD     JJ
ADV      RB
APPR     IN
APPRART  IN
APPO     IN
APZR     IN
ART      DT
BS       SYM
CARD     CD
FM       FW
ITJ      UH
KOUI     IN
KOUS     IN
KON      CC
KOKOM    IN
NN       NN
NE       NNP
NNE      NNP
PDS      DT
PDAT     DT
PIS      NN
PIAT     DT
PIDAT    NN
PPER     PRP
PPOSS    PRP
PPOSAT   PRP$
PRELS    WP
PRELAT   WP$
PRF      PRP
PWS      WP
PWAT     WP$
PWAV     WRB
PAV      RB
PROAV    RB
PROP     RB
PTKZU    TO
PTKNEG   RB
PTKVZ    RP
PTKANT   UH
PTKA     RB
TRUNC    :
VVFIN    VERB
VVIMP    VB
VVINF    VB
VVIZU    VB
VVPP     VBN
VAFIN    VERB
VAIMP    VB
VAINF    VB
VAPP     VBN
VMFIN    MD
VMINF    MD
VMPP     MD
XY       SYM
$,       ,
$.       .
$(       :
STTSMAP
tr/\n\t/ /; s/\s+/ /g; split ' ';
my %STTSmap;
$STTSmap{(pop)} = pop while (@_);
my @posstreams;
foreach (@sentences) {
my $posstream = '';
s/<word ([^>]*)pos="([^">]*)"([^>]*)>/$posstream .= ' '; $posstream .= $STTSmap{$2}
|| "unknown:$2"; ++$numtags;/geo;
$posstream =~ s/ [.,]//go; # filter . and ,
push @posstreams, 'START'. $posstream .' STOP';
}
print "TueBadS: read $numsent POS streams with a total of $numtags tags\n";
return \@posstreams;
}
# prepare_sentence_for_TueBadS $sentence
#
# Replaces PTB tags that have no TueBadS tags mapped to them
# (these are the PTB tags that do not occur in the mapped TueBadS streams;
#  finite verb forms are collapsed to the generic VERB tag used there)
sub prepare_sentence_for_TueBadS {
my ($sentence) = @_;
while ($sentence =~ s/ VBG / NN /go) {};
while ($sentence =~ s/ EX / PRP /go) {};
while ($sentence =~ s/ LS / SYM /go) {};
while ($sentence =~ s/ PDT / DT /go) {};
while ($sentence =~ s/ POS / PRP\$ /go) {};
while ($sentence =~ s/ WDT / WP /go) {};
while ($sentence =~ s/ RBR / RB /go) {};
while ($sentence =~ s/ RBS / RB /go) {};
while ($sentence =~ s/ JJR / JJ /go) {};
while ($sentence =~ s/ JJS / JJ /go) {};
while ($sentence =~ s/ NNS / NN /go) {};
while ($sentence =~ s/ NNPS / NNP /go) {};
while ($sentence =~ s/ VBZ / VERB /go) {};
while ($sentence =~ s/ VBP / VERB /go) {};
while ($sentence =~ s/ VBD / VERB /go) {};
while ($sentence =~ s/ \$ / NN /go) {};
while ($sentence =~ s/ \# / NN /go) {};
return $sentence;
}
7.6.
Code: [Link]
#!/usr/bin/perl
use strict;
package ThsFunctions;
require Exporter;
our (@ISA, @EXPORT);
@ISA = qw(Exporter);
@EXPORT = qw(get_top_n_results print_array print_places get_random_hash_element create_katz_model
sentence_katz_probability
print_treebank
);
# @EXPORT_OK = qw(funcTwo $varTwo);
sub get_top_n_results {
my ($hash, $n) = @_;
my (%counts, @output);
while (my ($entity, $count) = each(%$hash) ) {
push(@{$counts{$count}}, $entity);
}
my @topcounts = reverse sort {$a <=> $b} keys %counts;
for (my ($place, $results); ($place <= $#topcounts) && (($n == 0) || ($results < $n)); $place++ ) {
foreach (@{$counts{$topcounts[$place]}}) {
if ((($n == 0) || (++$results <= $n)) && ($topcounts[$place] > 0)) {
push @output, ($topcounts[$place], $_);
}
}
}
return @output;
}
sub print_array {
while (@_) {
printf "%3dx %s\n", shift @_, shift @_;
}
}
sub print_places {
my $n = 0;
while (@_) {
printf " Place %3d: %3dx %s\n", ++$n, shift @_, shift @_;
}
}
sub get_random_hash_element {
my ($probabilitysum, $rndstruct) = @_;
my $rndnumber = rand($probabilitysum);
keys %$rndstruct; # reset iterator to beginning of hash
my $entity;
while (($entity, my $probability) = each(%$rndstruct) ) {
$rndnumber -= $probability;
if ($rndnumber <= 0) {
last;
}
}
keys %$rndstruct; # reset iterator to beginning of hash
return $entity;
}
# create_katz_model \@array   array of streams of POS tags
#                   $K        K constant in Katz-smoothing
#
# Creates a 1-, 2-, 3-, 4-, 5-gram model with Katz-smoothing
# Returns a reference to a hash containing the counts of all
# n-grams with 1 < n <= 5 as well as individual word counts
sub create_katz_model {
my ($TagStreams, $K) = @_;
my %gramcount;
my %wordcount;
my $wordtotal;
# Count occurences
foreach (@{$TagStreams}) {
split ' ';
++$wordcount{$_} foreach (@_);
$wordtotal += @_;
for (my $i = 1; $i < @_; $i++) {
for (my $n = 1; $n <= 4; $n++) {
my %inversecount;
foreach my $gc (keys %gramcount) {
my $len = 2 + $gc=~tr/ / /;
foreach (keys %{$gramcount{$gc}}) {
$inversecount{$len}->{$gramcount{$gc}->{$_}}++ if ($_ ne '');
}
}
foreach (keys %wordcount) {
$inversecount{1}->{$wordcount{$_}}++;
}
my %model;
$model{'gc'} = \%gramcount;
$model{'wc'} = \%wordcount;
$model{'ic'} = \%inversecount;
$model{'N'} = $wordtotal;
$model{'K'} = $K;
# $model{'p'} = probability cache
# $model{'lp'} = log probability cache
return \%model;
}
# katz_probability \%katz     reference to the katz model
#                  $input     input so far
#                  $output    next element in stream
#
# Returns the probability of '$output' occurring if '$input' was
# the last observation
sub katz_probability {
use constant KDBG => 0;
my ($katz, $input, $output) = @_;
# result cached?
return $katz->{'p'}->{$input}->{$output}
if defined($katz->{'p'}->{$input}->{$output});
print "ug,$output,wc=".$katz->{'wc'}->{$output}.',' if (KDBG);
# simple probability for unigrams
return $katz->{'p'}->{$input}->{$output} =
$katz->{'wc'}->{$output} / $katz->{'N'}
if ($input eq '');
my $r = $katz->{'gc'}->{$input}->{$output};
print "r=$r,gc=".$katz->{'gc'}->{$input}->{''}.',' if (KDBG);
my $K = $katz->{'K'};
# if r > k then return traditional estimate
return $katz->{'p'}->{$input}->{$output} =
$r / $katz->{'gc'}->{$input}->{''}
if ($r > $K);
my $n = 2 + $input =~ tr/ / /;
print "n=$n," if (KDBG);
if ($r > 0) {
my $p = $r / $katz->{'gc'}->{$input}->{''};
my $rstar = ($r + 1) * $katz->{'ic'}->{$n}->{$r + 1}
/ $katz->{'ic'}->{$n}->{$r};
print "observed $n-grams occuring $r+1 times: ".$katz->{'ic'}->{$n}->{$r + 1}."\n" if (KDBG);
print "observed $n-grams occuring $r
times: ".$katz->{'ic'}->{$n}->{$r}."\n" if (KDBG);
print "observed $n-grams occuring $K+1 times: ".$katz->{'ic'}->{$n}->{$K+1}."\n" if (KDBG);
print "observed $n-grams occuring 1 times: ".$katz->{'ic'}->{$n}->{1}."\n" if (KDBG);
my $d = ($rstar / $r);
$d -= ($K + 1) * $katz->{'ic'}->{$n}->{$K + 1}
/ $katz->{'ic'}->{$n}->{1};
$d /= 1 - (($K + 1) * $katz->{'ic'}->{$n}->{$K + 1}
/ $katz->{'ic'}->{$n}->{1});
print "\nD-P: $p R*: $rstar D-D: $d D: ". ($p * $d)."\n" if (KDBG);
return $katz->{'p'}->{$input}->{$output} = $p * $d;
}
my $backoff;
$backoff = '' if ($n <= 2);
$backoff = substr($input, index($input, ' ')+1) if ($n > 2);
my ($alphat, $alphab, @knownderivations);
print "Input: $input
Backf: $backoff
Output: $output\n" if (KDBG);
@knownderivations = keys %{$katz->{'gc'}->{$input}};
foreach (@knownderivations) {
print " knownderiv: $input -> $_\n" if (KDBG);
$alphat += katz_probability($katz, $input, $_) if ($_ ne '');
$alphab += katz_probability($katz, $backoff, $_) if ($_ ne '');
print " $input -> $_ ($katz->{'gc'}->{$input}->{$_}, $alphat)\n" if ($_ ne '') && KDBG;
print " $backoff => $_ ($katz->{'gc'}->{$backoff}->{$_}, $alphab)\n" if ($_ ne '') && KDBG;
}
my ($alpha, $ps);
$alpha = (1 - $alphat) / (1 - $alphab) if ($alphab != 1);
print "A = $alpha\n" if (KDBG);
$ps = katz_probability($katz, $backoff, $output) if ($alpha != 0);
print "PS ($backoff -> $output) = $ps\n" if (KDBG);
return $katz->{'p'}->{$input}->{$output} = $alpha * $ps;   # back off to the shorter history
}
# sentence_katz_probability \%hash       hash containing the katz model
#                           $sentence    a stream of POS tags
#                           $depth       max n in n-gram
#
# Returns the probability of the given sentence in the given model
sub sentence_katz_probability {
my ($katz, $sentence, $depth, $k) = @_;
my $probability;
split ' ', $sentence;
--$depth;
$depth = @_-1 if ($depth >= @_);
for (my $i = 1; $i <= $depth; $i++) {
my $input = join ' ', @_[0..$i-1];
my $output = @_[$i];
if (!defined $katz->{'lp'}->{$input}->{$output}) {
$katz->{'lp'}->{$input}->{$output} = katz_probability($katz, $input, $output);
#
#
}
#
$probability += $katz->{'lp'}->{$input}->{$output};
print " + PR [ $output | $input ] = $katz->{'lp'}->{$input}->{$output}\n";
#
#
$probability += $katz->{'lp'}->{$input}->{$output};
print " + PR [ $output | $input ] = $katz->{'lp'}->{$input}->{$output}\n";
return $probability;
sub print_treebank {
my @treebank = @{@_[0]};
for (my $n = 0; $n < @treebank; $n++) {
print "Sentence $n:\n";
foreach (keys %{$treebank[$n]}) {
print "$_ :: ";
if (($_ eq 'tree') || ($_ eq 'word')) {
print "[0] = $treebank[$n]->{$_}->[0]\n";
for (my $i = 1; $i < @{$treebank[$n]->{$_}}; $i++) {
print "
[$i] = ";
for (my $j = 0; $j < @{$treebank[$n]->{$_}->[$i]}; $j++) {
print '['.$treebank[$n]->{$_}->[$i]->[$j].'] '.($j == 0 ? '-> ' : '');
}
print "\n";
}
} elsif (($_ eq 'pos') || ($_ eq 'post') || ($_ eq 'words') || ($_ eq 'orig')) {
print $treebank[$n]->{$_}."\n";
} else {
print '[ ';
print "$_ " foreach (@{$treebank[$n]->{$_}});
print "]\n";
}
print "\n";
}