A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the School of Engineering
by
ANAND KUMAR M
April, 2013
AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641 112
BONAFIDE CERTIFICATE
Thesis Advisor
Dr. K.P.SOMAN
Professor and Head,
Center for Excellence in Computational Engineering and Networking.
AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE 641 112
DECLARATION
I, ANAND KUMAR M (Reg. No. CB.EN.D*CEN08002) hereby declare that this thesis
entitled “MORPHOLOGY BASED PROTOTYPE STATISTICAL
MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL
LANGUAGE” is the record of the original work done by me under the guidance of
Dr. K.P. SOMAN, Professor and Head, Center for Excellence in Computational
Engineering and Networking, Amrita School of Engineering, Coimbatore and to the best
of my knowledge this work has not formed the basis for the award of any
degree/diploma/associateship/fellowship or a similar award, to any candidate in any
University.
COUNTERSIGNED
Thesis Advisor
Dr. K.P.SOMAN
Professor and Head
Center for Excellence in Computational Engineering and Networking
TABLE OF CONTENTS
1 INTRODUCTION ....................................................................................................... 1
1.1 GENERAL ....................................................................................................... 1
1.2 OVERVIEW OF MACHINE TRANSLATION ............................................. 2
1.3 ROLE OF MACHINE TRANSLATION IN NLP .......................................... 3
1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM ... 4
1.5 MOTIVATION OF THE THESIS .................................................................. 6
1.6 OBJECTIVE OF THE THESIS....................................................................... 7
1.7 RESEARCH METHODOLOGY .................................................................... 9
1.7.1 Overall System Architecture .............................................................. 9
1.7.2 Details of Preprocessing English Language Sentence ..................... 10
1.7.2.1 Reordering English Language Sentence ........................... 10
1.7.2.2 Factorization of English Language Sentence .................... 11
1.7.2.3 Compounding of English Language Sentence .................. 11
1.7.3 Details of Preprocessing Tamil Language Sentence ........................ 12
1.7.3.1 Tamil Part-of-Speech Tagger ............................................ 13
1.7.3.2 Tamil Morphological Analyzer ......................................... 13
1.7.4 Factored SMT System for English to Tamil Language.................... 14
1.7.5 Postprocessing for English to Tamil SMT ....................................... 15
1.7.5.1 Tamil Morphological Generator........................................ 15
1.8 RESEARCH CONTRIBUTIONS ................................................................. 16
1.9 ORGANIZATION OF THE THESIS ............................................................ 17
2.1.1 Part Of Speech Tagger for Indian Languages .................................. 21
2.1.2 Part Of Speech Tagger for Tamil Language .................................... 23
2.2 MORPHOLOGICAL ANALYZER AND GENERATOR .......................... 25
2.2.1 Morphological Analyzer and Generator for Indian Languages ....... 26
2.2.2 Morphological Analyzer and Generator for Tamil Language .......... 26
2.3 MACHINE TRANSLATION SYSTEMS ..................................................... 30
2.3.1 Machine Translation Systems for Indian Languages ....................... 30
2.3.2 Machine Translation Systems for Tamil Language ......................... 35
2.4 ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM ................ 38
2.5 RELATED NLP WORKS IN TAMIL .......................................................... 43
2.6 SUMMARY ................................................................................................... 46
3.3.1 Machine Learning ............................................................................ 58
3.3.2 Support Vector Machines ................................................................. 59
3.3.3 Geometrical Interpretation of SVM ................................................. 61
3.3.4 SVM Formulation ............................................................................ 64
3.4 VARIOUS APPROACHES FOR POS TAGGING ...................................... 67
3.4.1 Supervised POS Tagging ................................................................. 67
3.4.2 Unsupervised POS Tagging ............................................................. 68
3.4.3 Rule based POS Tagging.................................................................. 68
3.4.4 Stochastic POS Tagging ................................................................... 69
3.4.5 Other Techniques ............................................................................. 69
3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER........ 70
3.5.1 Two level Morphological Analysis .................................................. 70
3.5.2 Unsupervised Morphological Analyser ............................................ 71
3.5.3 Memory based Morphological Analysis .......................................... 72
3.5.4 Stemmer based Approach................................................................. 72
3.5.5 Suffix Stripping based Approach ..................................................... 72
3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION ..................... 73
3.6.1 Linguistic or Rule based Approaches............................................... 73
3.6.1.1 Direct Approach ................................................................ 74
3.6.1.2 Interlingua Approach......................................................... 76
3.6.1.3 Transfer Approach............................................................. 77
3.6.2 Non Linguistic Approaches .............................................................. 79
3.6.2.1 Dictionary based Approach ............................................... 79
3.6.2.2 Empirical or Corpus based Approach ............................... 79
3.6.2.3 Example based Approach .................................................. 80
3.6.2.4 Statistical Approach .......................................................... 81
3.6.3 Hybrid Machine Translation System................................................ 82
3.7 EVALUATING STATISTICAL MACHINE TRANSLATION .................. 83
3.7.1 Human Evaluation Techniques ........................................................ 84
3.7.2 Automatic Evaluation Techniques ................................................... 85
3.7.2.1 BLEU Score ...................................................................... 85
3.7.2.2 NIST Metric ...................................................................... 86
3.7.2.3 Precision and Recall .......................................................... 87
3.7.2.4 Edit Distance Measures ..................................................... 88
3.8 SUMMARY ................................................................................................... 89
5.4.1 Untagged and Tagged Corpus ....................................................... 130
5.4.2 Available Corpus for Tamil............................................................ 131
5.4.3 POS Tagged Corpus Development ................................................ 131
5.4.4 Applications of Tagged Corpus...................................................... 134
5.4.5 Details of POS Tagged corpus developed ...................................... 134
5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL....................... 136
5.5.1 SVMTool ....................................................................................... 136
5.5.2 Features of SVMTool ..................................................................... 137
5.5.3 Components of SVMTool .............................................................. 138
5.5.3.1 SVMTlearn ...................................................................... 138
5.5.3.2 SVMTagger ..................................................................... 146
5.5.3.3 SVMTeval ....................................................................... 151
5.6 RESULTS AND COMPARISON WITH OTHER TOOLS........................ 160
5.7 ERROR ANALYSIS ................................................................................... 161
5.8 SUMMARY ................................................................................................. 162
6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer .... 179
6.4.2.1 Paradigm Classification................................................... 179
6.4.2.2 Word forms ..................................................................... 180
6.4.2.3 Morphemes ...................................................................... 183
6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer.. 186
6.4.2.5 Issues in Data Creation .................................................... 188
6.4.3 Morphological Tagging Framework using SVMTool ................... 189
6.4.3.1 Support Vector Machine (SVM) ..................................... 189
6.4.3.2 SVMTool ......................................................................... 189
6.4.3.3 Implementation of Morphological Analyzer System ...... 190
6.5 MORPH ANALYZER FOR PRONOUN USING PATTERNS ................. 192
6.6 MORPH ANALYZER FOR PROPER NOUN USING SUFFIXES........... 194
6.7 RESULTS AND EVALUATION ............................................................... 195
6.8 PREPROCESSED ENGLISH AND TAMIL SENTENCE......................... 198
6.9 SUMMARY ................................................................................................. 198
7.4.3 SRILM ............................................................................................ 214
7.5 DEVELOPMENT OF FACTORED CORPORA ........................................ 215
7.5.1 Parallel Corpora Collection ............................................................ 215
7.5.2 Monolingual Corpora Collection .................................................. 216
7.5.3 Automatic Creation of Factored Corpora ....................................... 216
7.6 FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE ................. 217
7.6.1 Building Language Model .............................................................. 218
7.6.2 Building Phrase based Translation Model...................................... 219
7.7 SUMMARY ................................................................................................. 221
10.2 SUMMARY OF WORK DONE ................................................................. 247
10.3 CONCLUSIONS ......................................................................................... 249
10.4 FUTURE DIRECTIONS ............................................................................. 250
ACKNOWLEDGEMENT
I would never have been able to finish my dissertation without the guidance, support
and encouragement of numerous people, including my mentors, friends, colleagues,
family and wife. At the end of my thesis I would like to thank all those people who
made this thesis possible and an unforgettable experience for me.
First and foremost, I feel deeply indebted to Her Holiness Most Revered Mata
Amritanandamayi Devi (Amma) for her inspiration and guidance throughout my
doctoral studies, both in seen and unseen ways.
I wish to thank Dr. S. Rajendran for his supervision, advice and guidance from
the very early stage of this research, as well as for giving me extraordinary experiences
throughout the work.
I also wish to thank my school teacher Mr. B. Vaithiyanathan M.Sc M.Ed for
supporting me since my school days. I would like to thank Mr. Arun Sankar K, who,
as a good friend since my graduate days, is always willing to help and give his best
suggestions.
I wish to express my warm and sincere thanks to Dr. Mrs. M.S Vijaya, HOD
(MCA), GRD Krishnamal College for Women, and Dr. M. Sabarimalai Manikandan,
SAMSUNG Electronics, for their kind support and direction, which have been of great
value in this study.
I wish to give special thanks to my friends Mrs. Rekha Kishore, Mr. C. Arun
Kumar, Mrs. Padmavathy and Mr. Tirumeni for supporting me in this research.
I want to thank my parents, Mr. N. Madasamy and Mrs. M. Manohari, for their
kind support and the confidence and love they have shown me. You have been my
greatest strength and I am blessed to be your son. I would also like to give special
thanks to my beloved brother, Mr. M. Vasanthkumar, for supporting me in all ways.
-ANAND KUMAR M
LIST OF FIGURES
Figure 1.1 Morphology based Factored SMT for English to Tamil Language ................... 10
Figure 1.3 Mapping English Word Factors to Tamil Word Factors .................................... 14
Figure 5.2 Example of Tagged Corpus .............................................................................. 130
Figure 6.3 General Framework for Morphological Analyzer System ............................... 176
Figure 7.1 The Noisy Channel Model to Machine Translation ......................................... 201
Figure 9.1 BLEU-1 Score for Various Models .................................................................. 243
LIST OF TABLES
Table 4.15 Compounded English Sentence ........................................................................ 113
Table 5.4 Example of Suitable POS Features for Model 0 .............................................. 141
Table 5.5 Example of Suitable POS Features for Model 1 .............................................. 141
Table 5.6 Example of Suitable POS Features for Model 2 .............................................. 142
Table 6.16 Number of Words and Characters and Level of Efficiencies .......................... 197
Table 6.17 Sentence Level Accuracies ............................................................................... 198
LIST OF ABBREVIATIONS
HMM Hidden Markov Model
IBM International Business Machine
IE Information Extraction
IIIT International Institute of Information Technology
IR Information Retrieval
KWIC KeyWord In Context
LDC Linguistic Data Consortium
LSV Letter Successor Varieties
ManTra MAchiNe assisted TRAnslation
MBMA Memory based Morphological Analysis
MEMM Maximum Entropy Markov Models
MG Morphological Generator
MIRA Margin Infused Relaxed Algorithm
ML Machine Learning
MLI Morpho-Lexical Information
MT Machine Translation
NIST National Institute of Standards and Technology
NLI Natural Language Interface
NLP Natural Language Processing
NLU Natural Language Understanding
PBSMT Phrase based Statistical Machine Translation
PCFG Probabilistic Context Free Grammar
PER Position Independent Word Error Rate
PLIL Pseudo Lingual for Indian Languages
PN Proper Noun
PNG Person-Number-Gender
POS Part-of-Speech
POST Part-of-Speech Tagging
QA Question Answering
RBMT Rule based Machine Translation
RCILTS Resource Centre for Indian Language Technology Solutions
SMR Statistical Machine Reordering
SMT Statistical Machine Translation
SOV Subject-Object-Verb
SRILM Stanford Research Institute Language Modeling
SVM Support Vector Machine
SVO Subject-Verb-Object
TBL Transformation based learning
TDIL Technology Development for Indian Languages
TER Translation Edit Rate
TnT Trigrams'n'Tags
UCSG Universal Clause Structure Grammar
UN United Nations
VG Verb Group
WER Word Error Rate
WFR Word Formation Rules
WSJ Wall Street Journal
WWW World Wide Web
ABSTRACT
Tamil morphology is highly productive: suffixes are concatenated to stems without
spaces, so every inflected form of a Tamil word is a separate surface word. This leads
to the problem of sparse data. It is very difficult to collect or create a parallel corpus
which contains all possible Tamil surface words, because a single Tamil root verb can
be inflected into more than ten thousand different forms. Moreover, selecting the
correct Tamil word or phrase during translation is a challenging task. The size and
quality of the corpus decide the accuracy of a Machine Translation system. The limited
availability of English-Tamil parallel corpora and the high inflectional variation of
Tamil increase the data sparseness problem for a baseline phrase-based SMT system.
While translating from English to Tamil, the baseline SMT system cannot generate
Tamil word forms that are not present in the training corpora.
CHAPTER 1
INTRODUCTION
1.1 GENERAL
Machine Translation is the automatic translation of text from one natural language to
another using a computer. Initial attempts at machine translation in the 1950s did not
meet with success. Today, internet users need fast automatic translation systems
between languages. Several approaches, such as linguistic rule based and Interlingua
based systems, have been used to develop machine translation systems, but currently
statistical methods dominate the field. The Statistical Machine Translation (SMT)
approach draws knowledge from automata theory, artificial intelligence, data structures
and statistics. An SMT system treats translation as a machine learning problem: a
learning algorithm is applied to a large amount of parallel corpora. Parallel corpora are
sentences in one language along with their translations. Learning algorithms create a
model from parallel sentences, and using this model, unseen sentences are translated.
If parallel corpora are available for a language pair, then it is easy to build a bilingual
SMT system. The accuracy of the system is highly dependent on the quality and
quantity of the parallel corpus and on the domain. Parallel corpora, the fundamental
resource for SMT systems, are constantly growing; they are available from
governments' bilingual text books, newspapers, websites and novels.
SMT models give good accuracy for language pairs that are structurally similar, that
are restricted to specific domains, or for which large bilingual corpora are available. If
the sentences of a language pair are not structurally similar, then the translation
patterns are difficult to learn, and huge amounts of parallel corpora are required for
learning them; therefore, statistical methods are difficult to use for "less resourced"
languages. To enhance the translation performance for dissimilar language pairs and
less resourced languages, external preprocessing is required. This preprocessing is
performed using linguistic tools.
In an SMT system, statistical methods are used for mapping source language phrases
into target language phrases. The statistical model parameters are estimated from
bilingual and monolingual corpora. There are two models in an SMT system: the
Translation model and the Language model. The translation model takes parallel
sentences and finds the translation hypotheses between phrases. The language model is
based on the statistical properties of n-grams and uses the monolingual corpora.
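This two-model decomposition follows the standard noisy channel formulation of SMT
(the notation below is the textbook form, stated here for reference): for a source
sentence $s$ and candidate target sentence $t$, the decoder searches for

    $$\hat{t} = \arg\max_{t} P(t \mid s) = \arg\max_{t} P(s \mid t)\, P(t),$$

where $P(s \mid t)$ is the translation model estimated from the parallel corpus and
$P(t)$ is the n-gram language model estimated from the monolingual corpus.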
Several translation models are available for SMT. Some important ones are the phrase
based model, the syntax based model and the factored model. Phrase Based
Statistical Machine Translation (PBSMT) is limited to the mapping of small text
chunks. The factored translation model is an extension of phrase based models that
integrates linguistic information at the word level. This thesis proposes a pre-processing
method that uses linguistic tools for the development of an English to Tamil machine
translation system. In this translation system, external linguistic tools are used to
augment the parallel corpora with linguistic information. The pre- and post-processing
methodology proposed in this thesis is applicable to other language pairs too.
Machine translation is one of the oldest and most active areas in natural language
processing. The word 'translation' refers to the transformation of text or speech from
one language into another. Machine translation can be defined as the application of
computers to the task of translating texts from one natural language to another. It is a
focused field of research drawing on the linguistic concepts of syntax, semantics,
pragmatics and discourse.
Today a number of systems are available for producing translations, though they
are not perfect. In the process of translation, whether carried out manually or
automated through machines, the context of the text in the source language must be
conveyed exactly in the target language. Translation is not just word level replacement.
A translator, whether machine or human, must interpret and analyse all the elements in
the text, must be familiar with all the issues that arise during the translation process
and must know how to handle them. This requires in-depth knowledge of grammar,
sentence structure and meaning, and also an understanding of each language's culture
in order to handle idioms and phrases originating from different cultures. Cross-cultural
understanding is an important issue that affects the accuracy of the translation.
Designing an automatic machine translation system is a great challenge. It is difficult
to translate sentences while taking into consideration all the required information; even
humans need several revisions to produce a perfect translation, and no two human
translators generate identical translations of the same text in the same language pair.
Hence it is an even greater challenge to design a fully automated machine translation
system that produces high quality translations.
Machine Translation is used for translating texts for assimilation purposes, which
aids bilingual or cross-lingual communication, and also for searching, accessing and
understanding foreign language information from databases and web pages [3]. In the
field of information retrieval, a lot of research is going on in Cross-Language
Information Retrieval (CLIR), i.e. information retrieval systems capable of searching
databases in many different languages [4].
With a good automatic translation system, students can improve their translation and
writing skills. Such a system can break down the language barriers of students and
language learners.
Traditionally, rule based approaches were used to develop machine translation systems.
A rule based approach feeds rules into the machine using appropriate representations,
but feeding all linguistic knowledge into a machine is very hard. In this context, the
statistical approach to machine translation has some attractive qualities that have made
it the preferred approach in machine translation research over the past two decades.
Statistical translation models learn translation patterns directly from data and
generalize them to translate new text. The SMT approach is largely language-
independent, i.e. the models can be applied to any language pair.
Systems based on statistical methods have several advantages over traditional rule-based
systems. In SMT, implementation and development times are much shorter, and the
system can be improved by coupling in new models for reordering and decoding. It
only needs parallel corpora to learn a translation system. In contrast, a rule based
system needs transfer rules which only linguistic experts can write. These rules are
entirely dependent on the language pair involved, and defining general "transfer rules"
is not an easy task, especially for languages with different structures [5].
An SMT system can be developed rapidly if an appropriate corpus is available. A Rule
Based Machine Translation (RBMT) system, by contrast, incurs considerable
development and customization costs before it reaches the desired quality threshold. In
packaged RBMT systems that have already been developed, it is extremely difficult to
reprogram the models and equivalences. Above all, RBMT has a much longer
development process involving more human resources, and an RBMT system is
retrained by adding new rules and vocabulary, among other things [5].
Building SMT systems has also become more practical nowadays thanks to the wider
availability of more powerful computers. RBMT requires a longer deployment and
compilation time by experts, so that, in principle, building costs are also higher. SMT
learns statistical patterns automatically, including a good treatment of exceptions to
rules. As regards the transfer rules of RBMT systems, they can certainly be seen as
special cases of statistical patterns; nevertheless, they generalize too much and cannot
handle exceptions. Finally, SMT systems can be upgraded with syntactic and even
semantic information, like RBMT. An SMT engine generates improved translations
when retrained or adapted, whereas an RBMT system generates very similar
translations after retraining [5].
SMT systems, in general, have trouble handling the morphology of the source or the
target side, especially for morphologically rich languages. Errors in morphology can
have severe consequences for the meaning of a sentence: they change the grammatical
function of words, or change the interpretation of the sentence through a wrong verb
tense. Factored translation models try to solve this issue by explicitly handling
morphology on the generation side.
1.5 MOTIVATION OF THE THESIS
Machine translation (MT) is the application of computers to the task of translating texts
from one natural language to another. Even though machine translation was envisioned
as a computer application in the 1950s, it is still considered an open problem [3].
In such a situation, there is a big market for translation between English and the
various Indian languages. Currently, this translation is done manually; the use of
automation is largely restricted to word processing. Two specific examples of high
volume manual translation are the translation of news from English into local
languages, and the translation of annual reports of government departments and public
sector units among English, Hindi and the local language. Many resources such as
news, weather reports and books in English are being manually translated into Indian
languages; news and weather reports from around the world are translated particularly
often. Since human translation is slower and more expensive than machine translation,
there is a large market for automatic machine translation from English into Indian
languages.
Tamil, a Dravidian language, is spoken by around 72 million people and has
official status in the state of Tamil Nadu and the Indian union territory of Puducherry.
Tamil is also an official language of Sri Lanka and Singapore, and is spoken by
significant minorities in Malaysia and Mauritius as well as by emigrant communities
around the world. It is one of the 22 scheduled languages of India and was declared a
classical language by the Government of India in 2004 [9].
• Develop a pre-processing module (Reordering, Compounding and
Factorization) for English language sentences, to transform their structure
into one more similar to that of Tamil.
The pre-processing module for the source language includes three stages: reordering,
factorization and compounding. In the reordering stage, the source language
sentence is syntactically reordered according to Tamil language syntax. After
reordering, the English words are factored into lemma and other morphological
features. This is followed by the compounding process, in which various function
words are removed from the reordered sentence and attached as morphological
factors to the corresponding content words.
A Tamil POS tagger is developed using a Support Vector Machine (SVM) based
machine learning tool. A POS annotated corpus is created for training the automatic
tagger.
• Develop a Tamil Morphological Generator system to generate Tamil surface
word forms.
Parallel corpora are used to train the statistical translation models. The parallel
corpora are created and converted into factored parallel corpora using preprocessing.
English sentences are factored using the Stanford Parser tool, and Tamil sentences are
factored using the Tamil POS Tagger and Morphological Analyzer. A monolingual
corpus is collected from various newspapers and factored using the Tamil linguistic
tools; this monolingual corpus is used in the language model. Finally, in
post-processing, the Tamil morphological generator is used for generating a surface
word from the output factors.
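To make the factored representation concrete, the following is a sketch of the line
format conventionally used by the Moses toolkit, in which the factors of each word are
joined with the '|' separator; the Tamil transliterations and factor labels here are
illustrative assumptions, not excerpts from the thesis corpora:

    English: vegetables|vegetable|N|NNS bought|buy|V|VBD
    Tamil:   kaaykaRikaLai|kaaykaRi|N|ACC vaangkinEn|vaangku|V|PAST+1S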
Figure 1.1 Morphology based Factored SMT for English to Tamil Language

A machine translation system for a language pair with disparate morphological
structures needs appropriate pre-processing or modeling before translation. The
preprocessing can be performed on the raw source language sentence to make it more
appropriate for translating into the target language. The pre-processing module for the
English language sentence consists of reordering, factorization and compounding.
Reordering rules are handcrafted using the syntactic word order differences between
English and Tamil; 180 reordering rules were created based on the sentence structures
of the two languages. Reordering significantly improves the performance of the
machine translation system. A lexicalized distortion reordering model is implemented
in the Moses toolkit [180], but this automatic reordering is good only for short range
movements; an external tool or component is therefore needed for dealing with long
distance reordering. Reordering is also one way of indirectly integrating syntactic
information into the source language. 80% of English sentences are reordered correctly
according to the rules which were developed. An example of English reordering is
given in Figure 1.2.
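As a toy illustration of how such a rule operates (this sketch is not one of the 180
handcrafted rules of this thesis), the following Python fragment moves the verb group
of a POS-tagged English clause to clause-final position, approximating Tamil SOV
order:

    # Toy SVO -> SOV reordering over (word, Penn-Treebank-tag) pairs.
    def reorder_svo_to_sov(tokens):
        """Move verb-group tokens to the end of the clause."""
        verbs = [t for t in tokens if t[1].startswith('VB')]
        rest = [t for t in tokens if not t[1].startswith('VB')]
        return rest + verbs

    tokens = [('I', 'PRP'), ('bought', 'VBD'), ('vegetables', 'NNS')]
    print(reorder_svo_to_sov(tokens))
    # [('I', 'PRP'), ('vegetables', 'NNS'), ('bought', 'VBD')]

The real rules operate on full parse trees and also move prepositions after their noun
phrases, which this sketch does not attempt.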
Factored models can be used for morphologically rich languages in order to reduce the
amount of bilingual data required. Factorization refers to splitting a word into linguistic
factors and integrating them as a vector. The Stanford Parser is used to parse the
English sentences. From the parse tree, linguistic information such as the lemma,
part-of-speech tag, syntactic information and dependency information is retrieved and
integrated as factors on the original word.
The compounding step makes the English sentence structure resemble the
morphological structure of the Tamil sentence. In the compounding phase, function
words are identified in the English factored corpora using dependency information.
These function words are then removed from the factored sentence and attached as
morphological factors to the corresponding content words. The compounding process
reduces the length of the English sentence. Like function words, auxiliary verbs and
modal verbs are also removed and attached as morphological factors of the source
language words. After this step, the morphological representation of the English
sentence is similar to that of the Tamil sentence; compounding thus indirectly
integrates dependency information into the source language factors. Table 1.1 and
Table 1.2 show the factored and compounded sentences respectively.
Table 1.1 Factored English Sentence
I | i | PN | prn
my | my | PN | PRP$
home | home | N | NN
to | to | TO | TO
vegetables | vegetable | N | NNS
bought | buy | V | VBD

Table 1.2 Compounded English Sentence
I | i | PN | prn_i
my | my | PN | PRP$
home | home | N | NN_to
vegetables | vegetable | N | NNS
bought | buy | V | VBD_1S
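The transformation from Table 1.1 to Table 1.2 can be pictured with the simplified
Python sketch below. It attaches a function word to the preceding content word as an
extra morphological factor; the adjacency heuristic used here is an assumption made
for brevity, whereas the actual system uses Stanford dependency information:

    # Attach function words (e.g. the postposed 'to') to the previous
    # content word as a morphological factor, then drop them.
    def compound_function_words(rows, function_tags=('TO',)):
        out = []
        for word, lemma, pos, tag in rows:
            if tag in function_tags and out:
                w, l, p, t = out[-1]
                out[-1] = (w, l, p, t + '_' + lemma)  # NN -> NN_to
            else:
                out.append((word, lemma, pos, tag))
        return out

    rows = [('home', 'home', 'N', 'NN'), ('to', 'to', 'TO', 'TO')]
    print(compound_function_words(rows))
    # [('home', 'home', 'N', 'NN_to')]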
Tamil language sentences are preprocessed using the Tamil POS tagger and
morphological analyzer. The morphological analyzer splits each word into its lemma
and morphological information. The parallel corpora as well as the monolingual
corpora are preprocessed in this stage.
POS tagging means labeling grammatical classes, i.e. assigning a part-of-speech tag to
each and every word of a given sentence. Tamil sentences are POS tagged using the
Tamil POS Tagger tool. This tagger was developed using a Support Vector Machine
(SVM) based machine learning tool, SVMTool [12], which makes the task simple and
efficient. In this method, a POS tagged corpus is created and used to generate a trained
model; the SVMTool creates models from tagged sentences, and untagged sentences
are then tagged using those models. 42k sentences (approximately 5 lakh words) were
tagged for this part-of-speech tagger with the help of an eminent Tamil linguist.
Experiments were conducted with our tagged corpus, and an overall accuracy of 94.6%
was obtained on a test set containing 6k sentences (approximately 35 thousand words).
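The thesis builds the tagger with SVMTool itself; purely as a schematic sketch of the
underlying idea (SVM classification over windowed contextual features), the following
uses scikit-learn, which is an assumption of this illustration and not the toolchain of
the thesis. The transliterated words and tags are invented toy data:

    # SVM POS tagging with window features (illustrative sketch).
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def features(words, i):
        # Word-window and suffix features, in the spirit of SVMTool models.
        return {'w0': words[i],
                'w-1': words[i - 1] if i > 0 else '<s>',
                'w+1': words[i + 1] if i < len(words) - 1 else '</s>',
                'suf3': words[i][-3:]}  # suffixes are strong cues in Tamil

    train = [(['naan', 'viiTTukku', 'cenREn'], ['PRP', 'NN', 'VF']),
             (['avan', 'palli', 'cenRaan'], ['PRP', 'NN', 'VF'])]
    X = [features(ws, i) for ws, _ in train for i in range(len(ws))]
    y = [t for _, ts in train for t in ts]

    tagger = make_pipeline(DictVectorizer(), LinearSVC())
    tagger.fit(X, y)
    print(tagger.predict([features(['naan', 'palli'], 1)]))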
After POS tagging, the sentences in the corpora are morphologically analyzed to
find the lemma and morphological information. A morphological analyzer is a software
tool used to segment a word into meaningful units. Morphological analysis of Tamil is
a complex process because of the language's morphologically rich nature. Generally,
rule based approaches are used to develop morphological analyzer systems, but for a
morphologically rich language like Tamil, the creation of rules is a challenging task.
Here a novel machine learning based approach is proposed and implemented for a
Tamil verb and noun morphological analyzer. Additionally, this approach is tested on
languages such as Malayalam, Telugu and Kannada.
In this approach, morphological analysis is modeled as a tagging task of assigning
grammatical classes to each morpheme. An SVM based tool was used for training on
the data; the resulting system segments each word into its lemma and morphological
information.
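To convey the flavour of suffix-driven segmentation, here is a deliberately simplified,
rule-style sketch; the actual analyzer learns segmentations with an SVM from lakhs of
hand-segmented forms, and the suffix table, lexicon and transliterations below are
assumptions made for illustration:

    # Longest-match suffix segmentation (illustrative only).
    SUFFIX_TAGS = {'kkaLai': 'N_PL_AC',  # plural + accusative, with sandhi 'kk'
                   'kkaL': 'N_PL',
                   'ai': 'N_AC'}
    LEXICON = {'pU', 'maram'}  # toy root lexicon

    def analyze(word):
        for suffix in sorted(SUFFIX_TAGS, key=len, reverse=True):
            stem = word[:-len(suffix)]
            if word.endswith(suffix) and stem in LEXICON:
                return stem, SUFFIX_TAGS[suffix]
        return word, 'UNKNOWN'

    print(analyze('pUkkaLai'))  # ('pU', 'N_PL_AC')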
The "Minimized-POS" and "Compound-Tag" factors of an English word are aligned to
the "Morphological information" factor of the corresponding Tamil word. The
important point here is that new Tamil surface words are not generated in the SMT
decoder: only factors are generated by the SMT system, and the surface word is
generated in the post processing stage. The Tamil morphological generator is used in
post processing to generate a Tamil surface word from the output factors. The system
is evaluated with different sentence patterns; for simple, continuous and modal
auxiliary sentences, 85% of the sentences are translated correctly, while for other
sentence types the performance is 60%. The prototype machine translation system
developed here properly handles noun-verb agreement, which is an essential
requirement for translating into morphologically rich languages like Tamil. The BLEU
and NIST evaluation scores clearly show that the factored model with integrated
linguistic knowledge gives better results for the English to Tamil Statistical Machine
Translation system.
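For reference, the BLEU metric referred to here combines modified n-gram precisions
$p_n$ (typically up to $N = 4$, with uniform weights $w_n = 1/N$) with a brevity
penalty:

    $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
      \qquad \mathrm{BP} = \min\big(1,\; e^{\,1 - r/c}\big),$$

where $c$ is the candidate translation length and $r$ is the reference length.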
The first module of the morphological generator takes a lemma and word-class as input
and gives the lemma's paradigm number and the word's stem as output. The paradigm
number is referred to as the column index; it provides information about all the
possible inflected words of a lemma in a particular word class. The second module
takes morpho-lexical information as input and gives its index number as output: from
the complete morpho-lexical information list, the index number of the corresponding
input morpho-lexical information is identified, and this is referred to as the row index.
In the third module, a two dimensional suffix table is used to find the suffix using the
row index and column index; finally the identified suffix is attached to the stem to
create the word form. For pronouns, a pattern matching approach is followed for
generating the pronoun word form.
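A minimal Python sketch of this three-module lookup is given below; the paradigm
numbers, index lists, suffix-table entries and transliterations are invented for
illustration and are not the thesis's actual tables:

    # Morphological generation as a row/column lookup (illustrative).
    PARADIGMS = {('vaa', 'verb'): (7, 'va')}     # (lemma, class) -> (column, stem)
    MORPH_INDEX = {'PAST+1S': 0, 'PAST+3SM': 1}  # morpho-lexical info -> row
    SUFFIX_TABLE = {(0, 7): 'ndhEn',             # (row, column) -> suffix
                    (1, 7): 'ndhAn'}

    def generate(lemma, word_class, morph_info):
        column, stem = PARADIGMS[(lemma, word_class)]  # module 1
        row = MORPH_INDEX[morph_info]                  # module 2
        return stem + SUFFIX_TABLE[(row, column)]      # module 3: attach suffix

    print(generate('vaa', 'verb', 'PAST+1S'))  # vandhEn ('I came')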
This thesis shows how preprocessing and post processing can be used to improve
statistical machine translation from English to Tamil. The main focus of this research
is translation from English into Tamil, but also the development of linguistic tools for
the Tamil language. The contributions are:
• Introduced a novel pre-processing method for English sentences based on
reordering and compounding: reordering rearranges the English sentence
structure according to the Tamil sentence structure, and compounding
removes the function words and auxiliaries and merges them into the
morphological factor of the corresponding content word.
• Created a Tamil POS tagger and a tagged corpus of 5 lakh words as part of
the preprocessing of Tamil language sentences.
• Introduced a novel method for developing a Tamil morphological analyzer
based on a machine learning approach. The corpora developed for this
approach contain 4 lakh morphologically segmented Tamil verbs and 2 lakh
Tamil nouns.
• Introduced a novel algorithm for developing a Tamil morphological generator
using paradigms and suffixes. Using this generator, it is possible to generate
10 thousand distinct word forms of a single Tamil verb.
• Successfully integrated these pre-processing and post-processing modules and
developed an English to Tamil factored SMT system.
1.9 ORGANIZATION OF THE THESIS
This thesis is divided into ten chapters. Figure 1.4 shows the Organization of the thesis.
Figure 1.4 Organization of the Thesis (Chapter 1: Introduction; Chapter 3:
Background; Chapter 4: Preprocessing English Language Sentences; Chapter 5: POS
Tagger for Tamil; Chapter 6: Morphological Analyzer for Tamil; Chapter 8:
Morphological Generator for Tamil; Chapter 9: Experiments and Results; Chapter 10:
Conclusion)
This thesis is organized as follows. A general introduction is presented in Chapter 1.
Chapter 2 presents the literature survey on linguistic tools and available machine
translation systems for Indian languages. In Chapter 3, the theoretical background and
language processing for Tamil are described. Chapter 4 covers the different stages of
preprocessing English language sentences: reordering, factorization and compounding.
Chapters 5 and 6 present the preprocessing of Tamil sentences using linguistic tools:
Chapter 5 explains the development of the Tamil POS tagger, and Chapter 6 illustrates
the morphological analyzer for the Tamil language, which is developed using the new
machine learning based approach; detailed descriptions of the method and data
resources are also given. Chapter 7 presents the factored SMT system for English to
Tamil and explains how the factored corpora are trained and decoded using the SMT
toolkit. Post-processing for Tamil is discussed in Chapter 8, where the morphological
generator is used as a post-processing tool; this chapter also gives a detailed
description of the new algorithm developed for the Tamil morphological generator.
Chapter 9 explains the experiments and results of the English to Tamil Statistical
Machine Translation system, including the training and testing details of the SMT
toolkit; the output of the developed system is evaluated using the BLEU and NIST
metrics. Finally, Chapter 10 concludes the thesis and discusses future directions for
this research.
CHAPTER 2
LITERATURE SURVEY
This chapter presents the state of the art in the field of Tamil linguistic tools and
machine translation systems. Tamil linguistic tools include the POS tagger,
morphological analyzer and morphological generator. The chapter reviews the
literature on linguistic tools and machine translation systems for Indian languages and
for Tamil in particular.
Different approaches have been used for Part-of-Speech (POS) tagging, where
the notable ones are rule-based, stochastic, or transformation-based learning
approaches. Rule-based taggers try to assign a tag to each word using a set of hand-
written rules. These rules could specify, for instance, that a word following a
determiner and an adjective must be a noun. This means that the set of rules must be
properly written and checked by human experts. The stochastic (probabilistic) approach
uses a training corpus to pick the most probable tag for a word [14-17]. All
probabilistic methods cited above are based on first order or second order Markov
Models.
There are a few other techniques which use probabilistic approach for POS
Tagging, such as the Tree Tagger [18]. Finally, the transformation-based approach
combines the rule-based approach and statistical approach. It picks the most likely tag
based on a training corpus and then applies a certain set of rules to see whether the tag
should be changed to anything else. It saves any new rules that it has learnt in the
process, for future use. One example of an effective tagger in this category is the Brill
tagger [19-22]. All of the approaches discussed above fall under the rubric of
supervised POS tagging, where a pre-tagged corpus is a prerequisite. On the other
hand, there is the unsupervised POS tagging technique [23-25], which does not
require any pre-tagged corpora. Koskenniemi (1985) [26] also used a rule-based
approach implemented with finite-state machines.
Greene and Rubin (1971) [27] have used a rule-based approach in the TAGGIT
program, which was an aid in tagging the Brown corpus [28]. TAGGIT disambiguated
77% of the corpus; the rest was done manually over a period of several years.
Derouault and Merialdo (1986) [29] have used a bootstrap method for training.
At first, a relatively small amount of text was manually tagged and used to train a
partially accurate model. The model was then used to tag more text, and the tags were
manually corrected and then used to retrain the model. Church (1988) [15] uses the
tagged Brown corpus for training. These models involve probabilities for each word in
the lexicon, and hence a large tagged corpus is required for reliable estimation. Jelinek
(1985) [30] used a Hidden Markov Model (HMM) for training a text tagger.
Parameter smoothing can be conveniently achieved using the method of 'deleted
interpolation', in which weighted estimates are taken from second- and first-order
models and a uniform probability distribution.
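Concretely, deleted interpolation smooths the trigram tag probability as a weighted
combination of relative-frequency estimates of decreasing order, for example

    $$P(t_3 \mid t_1, t_2) = \lambda_3 \hat{P}(t_3 \mid t_1, t_2)
      + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_1 \hat{P}(t_3)
      + \lambda_0 \frac{1}{|T|},$$

with $\sum_i \lambda_i = 1$ and the weights estimated on held-out data; $|T|$ is the
size of the tagset.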
Most of the previous taggers have been evaluated on the English WSJ corpus, using
the Penn Treebank set of POS categories and a lexicon constructed directly from the
annotated corpus. Although the evaluations were performed with slight variations,
there was a wide consensus in the late 1990s that the state-of-the-art accuracy for
English POS tagging was between 96.4% and 96.7%. In recent years, the most
successful and popular taggers in the NLP community have been the HMM-based TnT
tagger, the Transformation-based learning (TBL) tagger [35] and several variants of
the Maximum Entropy (ME) approach [34].
The SVMTool [38] is intended to comply with all the requirements of modern
NLP technology by combining simplicity, flexibility, robustness, portability and
efficiency with state-of-the-art accuracy. This is achieved by working in the Support
Vector Machines (SVM) learning framework and by offering NLP researchers a highly
customizable sequential tagger generator.
Smriti Singh et al. (2006) [40] proposed a tagger for Hindi that uses the affix
information stored in a word and assigns a POS tag using no contextual information.
By considering the previous and the next word in the Verb Group (VG), it correctly
identifies the main verb and the auxiliaries. Lexicon lookup is used for identifying
the other POS categories.
A Hidden Markov Model (HMM) based tagger for Hindi was proposed by Manish
Shrivastava and Pushpak Bhattacharyya (2008) [41]. The authors attempted to utilize
the morphological richness of the language without resorting to complex and
expensive analysis. The core idea of their approach was to explode the input in order
to increase its length and to reduce the number of unique types encountered during
learning. This in turn increases the probability score of the correct choice while
simultaneously decreasing the ambiguity of the choices at each stage.
Nidhi Mishra and Amit Mishra [44] proposed a part-of-speech tagging method
for a Hindi corpus in 2011. In the proposed method, the system scans the Hindi
corpus, extracts the sentences and words, searches the tag patterns in a database and
displays the tag of each Hindi word, such as noun, adjective, number or verb.
Based on lexical sequence constraints, a POS tagger algorithm for Hindi was
proposed by Pradipta Ranjan Ray (2003) [45]. The proposed algorithm acts as a first
level part-of-speech tagger, using constraint propagation based on ontological
information, morphological analysis information and lexical rules. Even though the
performance of the POS tagger has not been statistically tested due to the lack of
lexical resources, it covers a wide range of language phenomena and accurately
captures the four major local dependencies in Hindi.
Sivaji Bandyopadhyay et al. (2006) [46] came up with a rule based chunker for
Bengali which gave an accuracy of 81.64%. The chunker was developed using a
rule-based approach since adequate training data was not available. A list of 435
suffixes was prepared for handling unknown words; many of them usually appear at
the end of verbs, nouns and adjectives.
For Telugu, three POS taggers have been proposed using different POS tagging
approaches, viz. (1) a rule-based approach, (2) the transformation based learning
(TBL) approach of Eric Brill, and (3) a Maximum Entropy model, a machine learning
technique [47].
For Bengali, Sandipan et al. (2007) [48] developed a corpus based
semi-supervised learning algorithm for POS tagging based on HMMs. Their system
uses a small tagged corpus (500 sentences) and a large unannotated corpus along with
a Bengali morphological analyzer. When tested on a corpus of 100 sentences (1003
words), their system obtained an accuracy of 95%.
A stochastic Hidden Markov Model (HMM) based part-of-speech tagger has
been proposed for Malayalam. To build a part-of-speech tagger using the stochastic
approach, an annotated corpus is required; due to the non-availability of an annotated
corpus, a morphological analyzer was also developed to generate a tagged corpus from
the training set [50]. Antony P.J et al. (2010) [51] developed a tagset and a tagged
corpus of 180,000 words for the Malayalam language, which is used for training the
system. The SVM based tagger achieves 94% accuracy, an improved result over the
HMM based tagger.
A part-of-speech tagging scheme tags each word in a sentence with its part of
speech. It is done in three stages: pre-editing, automatic tag assignment and manual
post-editing. In pre-editing, the corpus is converted to a suitable format so that a
part-of-speech tag can be assigned to each word or word combination. Because of
orthographic similarity, one word may have several possible POS tags; after the initial
assignment of possible tags, words are manually corrected to disambiguate them in the
texts.
Vasu Ranganathan’s Tagtamil (2001)
Ganesan [56] has prepared a POS tagger for Tamil. His tagger works well on the
CIIL corpus; its efficiency on other corpora has yet to be tested. He has a rich tagset
for Tamil. He tagged a portion of the CIIL corpus using a dictionary as well as a
morphological analyzer, corrected it manually and trained on the rest of the corpus
with it. The tags are added morpheme by morpheme, for example:
pUkkaLai : pU_N_PL_AC
Kathambam of RCILTS-Tamil
Lakshmana Pandian S and Geetha T V (2009) [57] developed CRF models for
Tamil part-of-speech tagging and chunking. This method avoids a fundamental
limitation of maximum entropy Markov models (MEMMs) and other discriminative
Markov models. The language models are developed using CRFs and designed based
on the morphological information of Tamil.
Selvam and Natarajan (2009) [58] developed a rule based morphological analyzer
and POS tagger for Tamil and improved these systems using projection and induction
techniques. A rule based morphological analyzer and POS tagger can be built from
well defined morphological rules of Tamil. Projection and induction techniques are
used for POS tagging, base noun-phrase bracketing, named entity tagging and
morphological analysis from a resource rich language to a resource deficient language.
They applied alignment and projection techniques for projecting POS tags, and
alignment, lemmatization and morphological induction techniques for inducing root
words from English to Tamil. Categorical information and root words were obtained
from POS projection and morphological induction respectively, from English, via
alignment across sentence aligned corpora. They generated more than 600 POS tags
for rule based morphological analysis and POS tagging.
John Goldsmith (2001) [62] shows how stems and affixes can be inferred from
a large un-annotated corpus. A data-driven method for automatically analyzing the
morphology of ancient Greek used a nearest neighbor machine learning framework
[63]. A language modeling technique that selects the optimal segmentation, rather than
using heuristics, has been proposed for a Thai morphological analyzer [64].
For Tamil, the first step towards the preparation of a morphological analyzer was
initiated by the Anusaraka group. Ganesan (2007) [56] developed a morphological
analyzer for Tamil to analyze the CIIL corpus; phonological and morphophonemic
rules are taken into account in building it. The Resource Centre for Indian Language
Technological Solutions (RCILTS) - Tamil has prepared a morphological analyzer
(Atcharam) for Tamil, adopting a finite automata state-table approach [72].
Vijay Sundar Ram R et al. (2010) [75] designed a Tamil morphological analyzer
using a paradigm based approach and Finite State Automata (FSA), which work
efficiently on recursive tasks and consider only the current state when making a
transition. In this approach, complex affixations are easily handled by the FSA, and
the required orthographic changes are handled in every state. They built an FSA using
all possible suffixes, categorized the root word lexicon based on the paradigm
approach to optimize the number of orthographic rules, and used morpho-syntax rules
to obtain the correct analysis for a given word. The FSA analyzes the word suffix by
suffix; FSAs are a proven technology for efficient and speedy processing. There are
three major components in their morphological analyzer system: the first is the Finite
State Automaton, modeled using all possible suffixes (allomorphs); the next is the
lexicon, categorized based on the paradigm approach; and the final component is the
set of morpho-syntax rules for filtering the correct parse of the word.
Akshar Bharati et al. (2001) [76] developed an algorithm for unsupervised
learning of morphological analysis and generation for inflectionally rich languages.
This algorithm uses the frequencies of occurrence of word forms in a raw corpus. They
introduce the concept of an "observable paradigm" by forming equivalence classes of
feature-structures which are not obvious; the frequency of word forms for each
equivalence class is collected from such data for known paradigms. In this algorithm,
when the morphological analyzer cannot recognize an inflected form, the possible stem
and paradigm are guessed using the corpus frequencies. The method assumes that the
morphological package makes use of paradigms, and the package is able to guess the
stem-paradigm pair for an unknown word. The method depends only on the
frequencies of the word forms in raw corpora and does not require any linguistic rules
or tagger. The performance of the system depends on the size of the corpora.
Vasu Ranganathan (2001) [55] built a Tamil tagger by implementing the theory
of lexical phonology and morphology; the system was tested with an English-Tamil
machine translation system and a number of natural language processing tasks. The
tagger tool was written in Prolog and built with a knowledge base of morphological
rules of the Tamil language, applied in successive stages, and is capable of accounting
for all morphological information during the processes of recognition and generation.
In this method, three different coding procedures were adopted to recognize written
Tamil literary word forms. The output contains morphological information such as the
type of the word, the root form of the word and suitable morphological tags for the
affixes. The tagger is capable of recognizing and generating Tamil word forms,
including finite and non-finite verbs (aspect, modality and tense forms) as well as
noun forms such as participial nouns, verbal nouns and case forms. A dictionary built
as part of this system contains information about the root words and their grammatical
information. This tagger was tested and included in a machine translation system.
Dhanabalan et al. (2003) [13] developed a spell checker for the Tamil language.
Lexicons with morphological and syntactic information are used for developing this
spell checker, which can be integrated with word processors. Each word is compared
against a dictionary of correctly spelled words; the tool needs syntactic and semantic
knowledge for catching misspelled words. It also provides a facility to customize the
spell checker's dictionary so that technical words and proper nouns can be appended.
Initially the spell checker reads a word from the document. If the word is present in
the dictionary, it is interpreted as a valid word; otherwise the word is forwarded to the
error correcting process. The tool consists of three phases: text parsing, spelling
verification and generation. The spell checker uses a Tamil morphological analyzer
and morphological generator: the analyzer is used for analyzing the given word, and
the generator for generating different suggestions. The morphological analyzer first
tries to split the suffix off the word; if there is a spelling mistake, the word is passed
to the spelling verification and correction module. After finding the root word, the
system compares it with the dictionary entries; if the root word is not present in the
dictionary, the nearest root word is taken and given to the morphological generator
system.
M. Ganesan (2007, 2009) [56] described the analysis and generation of
Tamil corpora and developed various tools to analyze them, including a POS tagger,
morphological analyzer, frequency counter and KWIC (KeyWord In Context)
concordance. At the word level the tagset contains 22 tags, and at the morph level it
contains 82 tags. Using this POS tagger and morphological analyzer he has also
developed a syntactic tagger, for tagging Tamil sentences at the phrase and clause
level.
A successive split method is used for splitting the Tamil words. Initially all the
words are treated as stems. In the first pass these stems are split into new stems and
suffixes based on the similarity of the characters: two words are split at the position
where they differ, the right substring is stored in a suffix list, and the left substring is
kept as a stem. In the second pass, the same procedure is followed and the suffixes
are stored in a separate suffix list.
Uma Maheswar Rao (2004) [67] proposed a modular model based on a
hybrid approach which combines the two primary concepts of analyzing word
forms and the paradigm model. The morphological model underlying this description
is influenced by the need for flexibility in the design, permitting language specific
databases to be plugged in with minimum changes, and by simplicity of computation.
The architecture involves the identification of different layers among the affixes, which
enter into concatenation to generate word forms. Unlike in the traditional item and
arrangement model, allomorphic variants are not given any special status if they are
generable through simple morphophonemic rules. Automatic phonological rules are
also used to derive surface forms.
Menon et al. (2009) [131] developed a Finite State Transducer (FST) based
Tamil morphological analyzer and generator using the AT&T Finite State
Machine toolkit. An FST maps between two sets of symbols; it is used as a
transducer that accepts the input string if it is in the language and generates another
string on its output. The system is based on a lexicon and orthographic rules from a
two level morphological system. For the morphological generator, if the string
consisting of the root word and its morphemic information is accepted by the
automaton, then it generates the corresponding root word and morpheme units in the
first level.
Machine translation systems have been developed in India for translation from
English to Indian languages and between Indian languages. These systems are also
used for teaching machine translation to students and researchers. Most of these
systems are in the English to Hindi domain, with the exceptions of a Hindi to English
and an English to Kannada machine translation system. English is an SVO
language while Indian regional languages are SOV and relatively free in word order.
The translation domains are mostly government documents, health, tourism, news
reports and stories. The University of Hyderabad under K. Narayana Murthy has
worked on an English-Kannada MT system called "UCSG-based English-Kannada
MT", using the Universal Clause Structure Grammar (UCSG) formalism. A survey of
the machine translation systems developed in India for translation from English to
Indian languages and among Indian languages reveals that the machine translation
software is either in field testing or available as a web translation service. The Indian
machine translation systems [80] presented below are used to translate from English
to Hindi.
Anglabharti is a pattern-directed rule-based system with a context-free-grammar-like structure for English, the source language, which generates a 'pseudo-target' (PLIL) applicable to a group of Indian (target) languages. A set of rules obtained through corpus analysis is used to identify plausible constituents, with respect to which movement rules for the PLIL are constructed. The idea of using PLIL is primarily to exploit structural similarity and obtain advantages similar to those of the interlingua approach. It also uses an example base to identify noun and verb phrases and resolve their ambiguities [82].
Anusaaraka was developed with the aim of translation from one Indian language to another. It produces output which a reader can understand but which is not exactly grammatical. For example, a Bengali-to-Hindi Anusaaraka can take a Bengali text and produce output in Hindi which can be understood by the user but will not be grammatically perfect. Likewise, a person visiting a site in a language he does not know can run Anusaaraka and read and understand the text. Anusaarakas have been built from Telugu, Kannada, Bengali, and Marathi to Hindi [83].
Mantra (MAchiNe assisted TRAnslation tool) translates English text into Hindi in a specified domain of personnel administration, specifically gazette notifications, office orders, office memorandums and circulars. Initially, the Mantra system was started for the translation of administrative documents such as appointment letters, notifications, and circulars issued by the central government from English to Hindi. The system is ready for use in its domains.
It has a text categorization component at the front end, which determines the type of news story (political, terrorism, economic, etc.) before operating on the given story. Depending on the type of news, it uses an appropriate dictionary. It also requires considerable human assistance in analysing the input. Another novel component of the system is that, given a complex English sentence, it breaks it up into simpler sentences, which are then analysed and used to generate Hindi. The translation system is being used in a project on Cross Lingual Information Retrieval (CLIR) that enables a person to query the web in Hindi for documents related to health issues [85].
Two machine translation systems, Shiva and Shakti [86], for English to Hindi were developed jointly by Carnegie Mellon University, USA, the Indian Institute of Science, Bangalore, and the International Institute of Information Technology, Hyderabad. The Shakti machine translation system has been designed so that machine translation systems for new languages can be produced rapidly. Shakti combines a rule-based approach with a statistical approach, whereas Shiva is an example-based machine translation system. The rules are linguistic in nature, and the statistical approach tries to infer or use linguistic information. Some modules also use semantic information. Currently Shakti works for three target languages (Hindi, Marathi and Telugu).
Anubaad [87] is a hybrid machine translation system for translating English news headlines to Bengali, developed by Sivaji Bandyopadhyay at Jadavpur University, Kolkata. The current version of the system works at the sentence level.
R. Mahesh et al. [88] proposed Hinglish, a machine translation system for pure (standard) Hindi to pure English. It was implemented by adding an additional layer to the existing English-to-Hindi (AnglaBharti-II) and Hindi-to-English (AnuBharti-II) translation systems developed by Sinha. The system was claimed to produce satisfactory, acceptable results in more than 90% of the cases. Only in the case of polysemous verbs is the system unable to resolve meaning, owing to the very shallow grammatical analysis used in the process.
Gurpreet Singh Josan et al. [90] developed a Punjabi to Hindi machine translation system at Punjabi University, Patiala. The system is based on a direct word-to-word translation approach and consists of modules for pre-processing, word-to-word translation using a Punjabi-Hindi lexicon, morphological analysis, word sense disambiguation, transliteration and post-processing. The system has reported 92.8% accuracy.
Vishal Goyal et al. [91] developed a Hindi to Punjabi machine translation system at Punjabi University, Patiala. This system is likewise based on a direct word-to-word translation approach and consists of modules for pre-processing, word-to-word translation using a Hindi-Punjabi lexicon, morphological analysis, word sense disambiguation, transliteration and post-processing. The system has reported 95% accuracy.
Prashanth Balajapally et al. [92] developed example-based machine translation for the English to {Hindi, Kannada, Tamil} and Kannada to Tamil language pairs. It is based on a bilingual dictionary comprising a sentence dictionary, a phrase dictionary, a word dictionary and a phonetic dictionary, each containing parallel corpora of sentences, phrases and words, and phonetic mappings of words, in their respective files. The Example Based Machine Translation (EBMT) system has a set of 75,000 of the most commonly spoken sentences, originally available in English, which have been manually translated into three target Indian languages, namely Hindi, Kannada and Tamil.
Ruvan Weerasinghe (2004) [93] developed an SMT system for Sinhala to Tamil. Corpora were drawn from a Sri Lankan newspaper published in both languages, and additional corpora were collected from a website containing translations of English articles into Sinhala and Tamil. These resources formed a small trilingual parallel corpus for the research, consisting of news items and articles related to politics and culture in Sri Lanka. The fundamental task of sentence boundary detection was performed with a semi-automatic approach: a basic heuristic was first applied to identify sentence boundaries, and the exceptions to the heuristic were then identified. Sentences were aligned manually. After cleaning up the texts and performing the manual alignment, a total of 4064 Sinhala and Tamil sentences were used for SMT. All language processing used raw words and was based on statistical information. The CMU-Cambridge Statistical Language Modeling Toolkit (version 2) was used to build n-gram language models.
the Tamil inflection tables. The stemmer uses regular expression matching to cut off inflectional endings and to introduce extra tokens for negation and certain case markings (such as locative and genitive), which are all marked morphologically in Tamil. Finally, it is shown that the performance of the MT system increases when the stemmer is used.
is transformed to the grammar of an intermediate language after carrying out syntactic and morphological analysis and word-by-word translation.

Saravanan et al. (2010) [99] developed a rule-based machine translation system for English to Tamil. Using the statistical machine translation approach, Google developed a web-based machine translation engine for English to Tamil; this system also has the facility to identify the source language automatically.
Statistical translation models have evolved from the word-based models originally
proposed by Brown et al. [100] to syntax-based and phrase-based techniques. The
beginnings of phrase-based translation can be seen in the alignment template model
introduced by Och et al. [101]. A joint probability model for phrase translation was
proposed by Marcu et al. [102]. Koehn et al. [103] propose certain heuristics to extract
phrases that are consistent with bidirectional word-alignments generated by the IBM
models [100]. Phrases extracted using these heuristics are also shown to perform better
than syntactically motivated phrases, the joint model, and IBM model 4 [103].
The use of morphological information for SMT has been reported in [108] and [109]. The detailed experiments described by Nießen et al. [108] show that the use of morpho-syntactic information drastically reduces the need for bilingual training data.

Recent work by Koehn et al. [10] proposes factored translation models that combine feature functions to handle syntactic, morphological, and other linguistic information in a log-linear model. The following paragraphs address the various approaches to handling idioms and phrasal verbs, one of the most important problems in machine translation. Sahar Ahmadi et al. [111] analysed the translatability of colour idiomatic expressions in English-Persian and Persian-English texts, to explore the translation strategies applied to colour idiomatic expressions and to find cultural similarities and differences between such expressions in English and Persian.
Martine Smets et al. [112] developed their machine translation system in such a way that it handles verbal idioms. Verbal idioms constitute a challenge for machine translation systems: their meaning is not compositional, preventing a word-for-word translation, and they can be discontinuous, preventing a match during tokenization.
Elisabeth Breidt et al. [113] suggested describing the syntactic restrictions and idiosyncratic peculiarities of multi-word lexemes with local grammar rules, which at the same time permit the expression of regularities valid for a whole class of multi-word lexemes, such as word-order variation in German.

Digital Sonata provides NLP services and products. It has released a toolkit called the Caraboa Language Kit, in which idioms serve as the backbone of the architecture and which is mainly rule-based. Here, idioms are treated as sequences, and each sequence is a combination of one or more lexical units.
and target languages. The predicted morphological features are then used to generate the correct surface forms.

Sara Stymne (2009) [119] explores how compound processing can be used to improve phrase-based statistical machine translation (PBSMT) between English and German/Swedish. For translation into Swedish and German, the compound parts are merged after translation. The effect of different splitting algorithms for translation between English and German, and of different merging algorithms for German, is also investigated. For translation between English and German, different splitting algorithms work best for different translation directions. A novel merging algorithm based on part-of-speech matching is designed and evaluated.
information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation. This preprocessing reduces the gap in morphological complexity between Arabic and English. The second method addresses the issue of long-distance reordering in translation, to account for the difference in the syntax of the two languages. The third part shows how additional local context information on the source side can be incorporated; this helps to reduce lexical ambiguity. Two methods are also proposed for using binary decision trees to control the amount of context information introduced. Finally, the system combines the outputs of an SMT system and a rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based system.
Irimia Elena and Alexandru Ceauşu (2010) [122] present a method for extracting translation examples using the dependency linkage of both the source and target language sentences. They identified two types of dependency link structures - super-links and chains - and used these structures to set the translation example borders. They used a Romanian-English parallel corpus containing about 600,000 translation units. The GIZA++ tool is used to build translation models from the linguistically analyzed parallel corpora, and unidirectional translation models are also constructed. The performance of the dependency-based approach is measured with the BLEU-NIST score in comparison with a baseline system.
Sriram Venkatapathy et al. (2010) [123] propose a dependency-based statistical system that uses discriminative techniques to train its parameters. Experiments are conducted on English-Hindi parallel corpora. The use of syntax (the dependency tree) allows the large word reordering between English and Hindi to be addressed. Function words are grouped with their corresponding content words; these groups are called local word groups, and in them the function words are treated as factors of the content words. Three types of transformation features are explored: local features, syntactic features, and contextual features. The online large-margin algorithm MIRA is used to update the weights learned by the training algorithm.
characters and clusters of characters with other characters or clusters; remove rules simply remove the characters or clusters. This CWF algorithm is used for both English and Tamil names, but with different rule sets. The final CWF forms retain only the minimal consonant skeleton. In the second stage, Levenshtein's edit distance algorithm is modified to incorporate Tamil characteristics such as the long-short vowel distinction and ambiguities in consonants like 'n', 'r', 'l', etc. Finally, the CWF mapping transliteration algorithm takes an input source language named-entity string, converts it into CWF form, and then maps it to similar Tamil CWF words using the modified edit distance. This method produces a ranked list of transliterated names in the target language, Tamil, for an English source language name.
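As background for the modified edit distance just mentioned, the following is a standard Levenshtein distance in Python; the sub_cost hook only marks where Tamil-specific substitution costs could be plugged in, and is an assumption about how such a modification might be wired, not the published algorithm.

def edit_distance(s, t, sub_cost=1.0):
    # Classic dynamic-programming Levenshtein distance; a Tamil-aware
    # variant would lower sub_cost for confusable character pairs
    # (long/short vowels, the different 'n'/'r'/'l' letters).
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("kumar", "kumAr"))  # 1.0 with the default costs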
Noun phrase chunking deals with extracting the noun phrases from a sentence. While NP chunking is much simpler than parsing, it is still challenging to build an accurate and very efficient NP chunker. The importance of NP chunking derives from the fact that it is used in many applications. Noun phrase chunking can serve as a pre-processing step before parsing: because of the high ambiguity of natural language, exact parsing of a text may become very complex, and chunking can partially resolve these ambiguities beforehand. Noun phrases can also be used in information retrieval systems, where chunking allows data to be retrieved from documents on the basis of chunks rather than words; nouns and noun phrases in particular are useful for retrieval and extraction purposes. Most recent work on machine translation uses texts in two languages (parallel corpora) to derive useful transfer patterns, and noun phrases have applications in aligning text in parallel corpora: sentences can be aligned using chunk information by relating the chunks in the source and target languages, which is much easier than word alignment between the texts of the two languages. Further, chunked noun phrases can be used in other applications where in-depth parsing of the data is not necessary [129].
The approach is a rule-based one. Initially a corpus is taken and divided into two or more sets, one of which is used as the training data. The training set is manually chunked for noun phrases, and rules are thereby evolved that can be applied to separate the noun phrases in a sentence. These rules serve as the basis for chunking: the chunker program uses them to chunk the test data, and the coverage of the rules is tested on this test set. Precision and recall are calculated, and the results are analyzed to check whether more rules are needed to improve the coverage of the system. If so, additional rules are added and the same process is repeated to check for an increase in the precision and recall of the system. The system is then tested on various other applications [130].
Vaanavil of RCILTS-Tamil
handle free word order. It handles ambiguity using 15 heuristic rules and uses the morphological analyzer to obtain the root word [129].
2.6 SUMMARY
This chapter presented the literature survey for linguistic tools and available machine translation systems for Indian languages. The literature on linguistic tools such as POS taggers, morphological analyzers and morphological generators was reviewed. Most of these tools are developed with rule-based methods, and a few with data-driven methods. In India, machine translation systems have been developed using the direct machine translation approach for closely related language pairs, and some of these systems are very successful and still operational. Statistical machine translation methods are frequently applied to unrelated language pairs. It is therefore concluded that the statistical approach is the most appropriate for unrelated languages.
CHAPTER 3
THEORETICAL BACKGROUND
3.1 GENERAL
Tamil belongs to the southern branch of the Dravidian languages, a family of around twenty-six languages native to the Indian subcontinent. It flourished in India as a language with rich literature during the Sangam period (300 BCE to 300 CE). Tamil scholars categorize the history of the language into three periods: Old Tamil (300 BCE - 700 CE), Middle Tamil (700 - 1600) and Modern Tamil (1600-present). Epigraphic attestation of Tamil begins with rock inscriptions from the 3rd century BCE, written in Tamil-Brahmi, an adapted form of the Brahmi script. The earliest extant literary text is the தொல்காப்பியம் (tholkAppiyam), a work on grammar and poetics which describes the language of the classical period. The Sangam literature contains about 50,000 lines of poetry in 2381 poems attributed to 473 poets, including many women poets [9].
In the Modern Tamil period, i.e., the early 20th century, the chaste Tamil movement called for the removal of all Sanskrit and other foreign elements from Tamil. It received support from Dravidian parties and nationalists who supported Tamil independence, and it led to the replacement of a significant number of Sanskrit loan words by Tamil equivalents. An important factor specific to Tamil is the existence of two main varieties of the language, colloquial Tamil and formal Tamil செந்தமிழ் (sewthamiz), which are sufficiently divergent that the language is classed as diglossic. Colloquial Tamil is used for most spoken communication, while formal Tamil is used in writing and is spoken in a restricted number of formal contexts, such as lectures and news bulletins. The two varieties differ in their lexis, morphology, and segmental phonology.
Tamil is the official language of the Indian state of Tamil Nadu and one of the 22 languages under the Eighth Schedule of the Constitution of India. It is also one of the official languages of the Union Territories of Puducherry and the Andaman & Nicobar Islands, and of Sri Lanka and Singapore, and it is widely spoken in Malaysia. In 2004, Tamil became the first legally recognized classical language of India [9].
these parts. The தொல்காப்பியம் (tholkAppiyam) is the oldest work on the grammar of the Tamil language.
3.1.3 Tamil Characters
Tamil is written using a script called the vattEzuththu. The Tamil script has twelve vowels, uyirezuththu (உயிரெழுத்து) "soul-letters"; eighteen consonants, meyyezuththu (மெய்யெழுத்து) "body-letters"; and one character, the Aythaezuththu (ஆய்த எழுத்து), "the hermaphrodite letter", which is classified in Tamil grammar as being neither a consonant nor a vowel, though it is often considered part of the vowel set. The script, however, is syllabic and not alphabetic.
The complete script therefore consists of thirty-one letters in their independent form and an additional 216 compound letters, for a total of 247 combinations. The compound letters are formed by adding a vowel marker to a consonant. The details of the Tamil vowels are given in Table 3.2. Some vowels require the basic shape of the consonant to be altered in a way that is specific to that vowel. Others are written by adding a vowel-specific suffix to the consonant, others a prefix, and some vowels require adding both a prefix and a suffix. Table 3.3 lists the vowel letters across the top and the consonant letters along the side; their combinations give all the Tamil compound (uyirmei) letters. In every case, the vowel marker differs from the standalone character for the vowel. The Tamil script is written from left to right. Vowels are also called the 'life' (uyir) or 'soul' letters. Tamil vowels are divided into short and long (kuril and nedil, five of each type) and two diphthongs. There are 216 compound (uyirmei) letters in Tamil. The Tamil transliteration scheme is given in Appendix A.
Table 3.3 Tamil Compound Letters
Tamil is an agglutinative language. Tamil words consist of a lexical root to which one or more affixes are attached. Mostly, Tamil affixes are suffixes. Tamil suffixes can be derivational suffixes, which change the part of speech of the word or its meaning, or inflectional suffixes, which mark categories such as person, number, mood, tense, etc. There is no absolute limit on the length and extent of agglutination, which can lead to long words with a large number of suffixes that would require several words or a whole sentence in English.
Tamil is a morphologically rich language in which most of the morphemes attach to the root words as suffixes. Suffixes perform the functions of case, plural marking, euphonic increment and postpositions in the noun class. Tamil verbs are inflected for tense, person, number, gender, mood and voice. Other features of the Tamil language are the use of the plural for honorific nouns, frequent echo words, and the null-subject property, i.e. not every sentence has an explicit subject. Computationally, each root word can take more than ten thousand inflected word forms, of which only a few hundred will occur in a typical corpus [129]. Tamil is a consistently head-final language. The verb comes at the end of the clause, with a typical word order of Subject-Object-Verb (SOV). However, Tamil allows the word order to be changed, making it a relatively free word-order language. Subject-verb agreement is required for the grammaticality of a Tamil sentence.
There are many issues that make Tamil language processing difficult. These relate to the problems of representation and interpretation. Language computing requires a precise representation of context; since natural languages are highly ambiguous and vague, achieving such representations is very hard. The various sources of ambiguity in Tamil are described below.

Tamil morphemes are ambiguous in their grammatical category and in the position they take in word construction. A morpheme can have more than one grammatical category: for example, the morphemes athu, ana, thu can occur as nominalizing suffixes or as third-person neuter suffixes. The position at which a morpheme is suffixed also leads to ambiguity. Table 3.4 gives a few examples of morphemes and their possible grammatical features.
Table 3.4 Ambiguity in Morpheme’s Position
A word may be ambiguous in its part of speech or word class; that is, a word may have more than one interpretation. For example, the word படி (padi) can belong to the noun class or to the verb class. Such ambiguity has to be resolved by referring to the context.

A word may also be ambiguous in its sense. For instance, the Tamil word காட்டு (kAddu) has 11 senses in the noun class and 18 senses in the verb class [kiriyAvin tharkAla Tamil akarAthi, 2006] [133]. A sentence containing such a word can therefore have two different meanings; one example is glossed "He asked for the song".

A sentence may also be ambiguous even if its words are not: a sentence can have two interpretations although none of its individual words is ambiguous.
3.2 MORPHOLOGY
Morphology is the field within linguistics that studies the internal structure of words. While words are generally accepted as the smallest units of syntax, it is clear that in most (if not all) languages words can be related to other words by rules. Morphology studies patterns of word formation within and across languages, and attempts to formulate rules that model the knowledge of the speakers of those languages.
happi-ness. Derivation poses a problem for translation in that not all derived words have a straightforward compositional translation as derived words. In English, for example, the same meaning can be expressed by different affixes, and the same affix can have more than one meaning. This can be exemplified by the suffix -er: it can express the agent, as in player and singer, but it can also describe instruments, as in mixer and cooker. In this way an affix can have a range of equivalents in the target language, and any attempt to set up one-to-one correspondences for affixes would be greatly misguided.
3.2.2 Lexemes
A lexical database is organized around lexemes, which include all the morphemes of a language. A lexeme is conventionally listed in a dictionary as a separate entry. Generally, a lexeme corresponds to the set of forms taken by a single word: in English, for example, run, runs, ran and running are forms of the same lexeme "run".

A stem is the part of the word that never changes even when morphologically inflected, whilst a lemma is the base form of the word. For example, for the word "produced", the lemma is "produce", but the stem is "produc-". This is because there
are words such as production. In linguistic analysis, the stem is defined more generally as the analyzed base form from which all inflected forms can be formed. When phonology is taken into account, defining the stem as the unchangeable part of the word is not useful, as can be seen from the phonological forms of the words in the preceding example: "produced" vs. "production".

A morpheme is the minimal meaningful unit in a word. The concepts of word and morpheme are different: a morpheme may or may not stand alone, and one or more morphemes compose a word.
• Free morphemes, like town and dog, can appear with other lexemes (as in town
hall or dog house) or they can stand alone, i.e. "free".
• Bound morphemes like "un-" appear only together with other morphemes to
form a lexeme. Bound morphemes in general tend to be prefixes and suffixes.
• Derivational morphemes can be added to a word to create (derive) another
word: the addition of "-ness" to "happy," for example, to give "happiness." They
carry semantic information.
Agglutinative languages have words containing several morphemes that are always
clearly differentiable from one another in that each morpheme represents only one
grammatical meaning and the boundaries between those morphemes are easily
demarcated. The bound morphemes are affixes, and they may be individually
identified. Agglutinative languages tend to have a high number of morphemes per
word, and their morphology is highly regular [134].
3.2.6 Allomorphs
3.2.7 Morpho-Phonemics
When a word is monosyllabic, ends with a long vowel, and the following morpheme starts with a vallinam consonant, the consonant geminates. Sandhi changes can occur between two morphemes or words. Although sandhi rules are mostly dependent on the phonemic properties of the morphemes, they sometimes depend on the grammatical relations of the words on which they operate: gemination may be invalid when the words are in a subject-predicate relation, but valid if they are in a modifier-modified relation. Sandhi changes occur in four different ways: gemination, insertion, deletion and modification. Gemination is a case of insertion in which a vallinam consonant doubles itself. In general, insertion happens when new characters are inserted between words or morphemes. Deletion happens when existing characters at the end of the first word or the start of the second word are dropped. Modification happens when characters are replaced by other characters with close phonological properties.
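As a toy illustration of the gemination case, the following Python sketch joins two transliterated morphemes and doubles an initial vallinam consonant. The transliterated consonant set and the example words are assumptions made for display; real Tamil sandhi is conditioned on far more context than this.

VALLINAM = set("kcTtpR")  # assumed transliterations of the vallinam consonants

def geminate_join(word1, word2):
    # Gemination sandhi: insert a copy of word2's initial vallinam
    # consonant at the junction; otherwise plain concatenation.
    if word2 and word2[0] in VALLINAM:
        return word1 + word2[0] + word2
    return word1 + word2

print(geminate_join("mazhai", "kAlam"))  # mazhaikkAlam ("rainy season")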
3.2.8 Morphotactics

The morphemes of a word cannot occur in random order: in every language there are well-defined ways to sequence them. The morphemes can be divided into a number of classes, and the morpheme sequences are normally defined in terms of the sequence of classes. For instance, in Tamil the case morpheme follows the number morpheme in noun constructions, as in …க்களை (…_கள்_ஐ), where the plural கள் precedes the accusative ஐ; the reverse order, …_ஐ_கள், is invalid. The order in which morphemes follow each other is strictly governed by a set of rules called morphotactics. In Tamil these rules play a very important role in word construction and derivation, as the language is agglutinative and words are formed by long sequences of morphemes. Rules of morphotactics also serve to disambiguate morphemes that occur in more than one class of morphemes. The analyzer uses these rules to identify the structure of words.
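A minimal sketch of such an ordering constraint in Python follows; the transliterated morpheme inventory (plural kaL; case markers ai, il, ku) and the underscore segmentation are illustrative assumptions, not the analyzer described in this thesis.

import re

# legal order for a toy Tamil noun: root, optional plural, optional case
NOUN_PATTERN = re.compile(r"^(?P<root>[a-zA-Z]+)(?P<plural>_kaL)?(?P<case>_ai|_il|_ku)?$")

def analyze(segmented_word):
    # Accept an underscore-segmented word only if its morphemes follow
    # the morphotactic order root < plural < case.
    m = NOUN_PATTERN.match(segmented_word)
    return m.groupdict() if m else None

print(analyze("puththakam_kaL_ai"))   # valid: root + plural + accusative
print(analyze("puththakam_ai_kaL"))   # None: case before plural is illegal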
3.3 MACHINE LEARNING FOR NLP
Machine learning deals with techniques that allow computers to automatically learn and make accurate predictions based on past observations. The major focus of machine learning is to extract information from data automatically by using computational and statistical methods. Machine learning techniques are used to solve various tasks of natural language processing, including speech recognition, document categorization, document segmentation, part-of-speech tagging, word-sense disambiguation, named entity recognition, parsing, machine translation and transliteration.

There are two main tasks in machine learning: learning/training and prediction. The system is given a set of examples called the training data, and the primary goal is to automatically acquire an effective and accurate model from it. The training data provides the domain knowledge, i.e. the characteristics of the domain from which the examples are drawn. This is a typical task of inductive learning and is usually called concept learning or learning from examples. The larger the amount of training data, the better the model will usually be. The second phase of machine learning is prediction, wherein a set of inputs is mapped to the corresponding target values. The main challenge of machine learning is to create a model with good prediction performance on the test data, i.e. a model that generalizes well to unknown data.
acquiring some kind of knowledge. Depending on what the system learns, the learning is categorized as:
• Model learning: the system predicts values of an unknown function. This is called prediction and is a task well known in statistics. If the function is discrete, the task is called classification; for continuous-valued functions it is called regression.
However, translating the training set into a higher dimensional space incurs both
computational and learning-theoretic costs. Representing the feature vectors
corresponding to the training set can be extremely expensive in terms of memory and
time. Furthermore, artificially separating the data in this way exposes the learning
system to the risk of finding trivial solutions that overfit the data.
The maximum margin allows the SVM to select among multiple candidate hyperplanes. For many data sets, however, the SVM may not be able to find any separating hyperplane at all, either because the kernel function is inappropriate for the training data or because the data contains mislabeled examples. The latter problem can be addressed by using a soft margin that accepts some misclassifications of the training examples. A soft margin can be obtained in two different ways: the first is to add a constant factor to the kernel function output whenever the given input vectors are identical; the second is to define a priori an upper bound on the size of the training set weights. In either case, the magnitude of the constant factor added to the kernel, or the bound on the size of the weights, controls the number of training points that the system misclassifies. The setting of this parameter depends on the specific data at hand. Completely specifying a support vector machine therefore requires specifying two parameters: the kernel function and the magnitude of the penalty for violating the soft margin.
Thus, a support vector machine finds a nonlinear decision function in the input
space by mapping the data into a higher dimensional feature space and separating it there by
means of a maximum margin hyperplane. The computational complexity of the
classification operation does not depend on the dimensionality of the feature space,
which can even be infinite. Overfitting is avoided by controlling the margin. The
separating hyperplane is represented sparsely as a linear combination of points. The
system automatically identifies a subset of informative points and uses them to
represent the solution. Finally, the training algorithm solves a simple convex
optimization problem. All these features make SVMs an attractive classification
system.
Typically, the machine is presented with a set of training examples (xi, yi), where the xi are real-world data instances and the yi are labels indicating which class each instance belongs to. For the two-class pattern recognition problem, yi = +1 or yi = -1; a training example (xi, yi) is called positive if yi = +1 and negative otherwise. SVMs construct a hyperplane that separates the two classes (this can be extended to multi-class problems). While doing so, the SVM algorithm tries to achieve maximum separation between the classes.
Separating the classes with a large margin minimizes a bound on the expected generalization error [137]. 'Minimum generalization error' means that when new examples (data points with unknown class values) arrive for classification, the chance of making an error in predicting their class on the basis of the learned classifier (hyperplane) should be minimal. Intuitively, such a classifier is one which achieves maximum separation margin between the classes. Figure 3.1 illustrates the concept of 'maximum margin'. The two planes parallel to the classifier which pass through one or more points in the data set are called 'bounding planes'; the distance between these bounding planes is called the 'margin', and SVM 'learning' means finding a hyperplane that maximizes this margin. The points in the dataset falling on the bounding planes are called 'support vectors'. These points play a crucial role in the theory, hence the name support vector machines ('machine' here means algorithm). Vapnik (1998) has shown that if the training vectors are separated without error by an optimal hyperplane, the expected error rate on a test sample is bounded by the ratio of the expected number of support vectors to the number of training vectors. Since this ratio is independent of the dimension of the problem, if one can find a small set of support vectors, good generalization is guaranteed [136].
Figure 3.1 Maximum Margin and Support Vectors
In the case shown in Figure 3.2, one may simply minimize the number of misclassifications whilst maximizing the margin with respect to the correctly classified examples; in such a case the SVM training algorithm is said to allow a training error. In another situation the points may be clustered such that the two classes are not linearly separable, as shown in Figure 3.3: a linear classifier would have to tolerate a large training error. In such cases one prefers a non-linear mapping of the data into some higher dimensional space, called the 'feature space' F, where it is linearly separable. To distinguish between these two spaces, the original space of data points is called the 'input space'. The hyperplane in feature space corresponds to a highly non-linear separating surface in the original input space; hence the classifier is called a non-linear classifier.
Figure 3.3 Non-linear Classifier
The process of mapping the data into a higher dimensional space involves heavy computation, especially when the data itself is of high dimension. However, there is no need to perform any explicit mapping to the higher dimensional space to find the hyperplane classifier; all computations can be done in the input space itself [138].
Notation used:

x_i = [x_i1, x_i2, ..., x_in]^T : an n-dimensional vector representing a data point in the 'input space'.

d = [d_1, d_2, ..., d_m]^T : the vector of target values (class labels) of the m data points.

D = diag(d) : the m x m diagonal matrix with d_1, ..., d_m on its diagonal.

w = [w_1, w_2, ..., w_n]^T : the weight vector orthogonal to the hyperplane

w_1 x_1 + w_2 x_2 + ... + w_n x_n - γ = 0,

where γ is a scalar generally known as the bias term.

φ(·) : a nonlinear mapping function that maps an input vector x into a high-dimensional feature space.

K : the m x m kernel matrix whose (i, j)-th element is φ(x_i)^T φ(x_j).

Q : the m x m matrix whose (i, j)-th element is d_i d_j φ(x_i)^T φ(x_j), i.e. Q = K .* (d d^T), where .* represents element-wise multiplication.

From the geometric point of view, the support vector machine constructs an optimal hyperplane w^T x - γ = 0 between two classes of examples. The free parameters are the weight vector w, which is orthogonal to the hyperplane, and the threshold value γ. The aim is to find maximally separated bounding planes

w^T x - γ = 1 and w^T x - γ = -1,

such that data points with d = +1 satisfy w^T x - γ ≥ 1 and data points with d = -1 satisfy w^T x - γ ≤ -1. The perpendicular distance of the bounding plane w^T x - γ = 1 from the origin is |-γ + 1| / ||w||, and that of the bounding plane w^T x - γ = -1 is |-γ - 1| / ||w||. The margin between the optimal hyperplane and a bounding plane is 1 / ||w||, and so the distance between the two bounding hyperplanes is 2 / ||w||. The training problem is therefore

Minimize (1/2) ||w||^2
subject to D_ii (w^T x_i - γ) ≥ 1, i = 1, ..., l.

The 'training of the SVM' consists of finding w and γ, given the matrix of data points A and the corresponding class vector d. Once w and γ are obtained, the decision for a new point x is the sign of w^T x - γ, which is assigned as its class value. The resulting problem is a convex quadratic optimization problem.
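The quantities above can be exercised with any off-the-shelf SVM solver. The following minimal sketch uses scikit-learn (an assumption; the thesis does not prescribe a toolkit) and shows the two choices identified in the text, the kernel function and the soft-margin penalty C, on a toy data set.

from sklearn import svm

# toy 2-D two-class data (hypothetical points); labels d_i in {+1, -1}
X = [[0.0, 0.0], [0.2, 0.3], [1.0, 1.2], [0.9, 1.0]]
d = [-1, -1, +1, +1]

# the kernel function and the penalty C for violating the soft margin
clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, d)

print(clf.support_vectors_)       # the points lying on the bounding planes
print(clf.predict([[0.1, 0.1]]))  # sign of w^T x - gamma -> class -1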
3.4 VARIOUS APPROACHES FOR POS TAGGING
There are different approaches to POS tagging; Figure 3.4 shows the different POS tagging models. Most tagging algorithms fall into one of two classes: rule-based taggers and stochastic taggers.
3.4.1 Supervised POS Tagging

The supervised POS tagging models require pre-tagged corpora, which are used in training to learn rule sets, information about the tagset, word-tag frequencies, etc. The learning tool generates trained models along with the statistical information. The performance of the models generally increases with the size of the pre-tagged corpus.
Figure 3.4 POS Tagging Models (supervised and unsupervised approaches, each subdivided into rule-based, stochastic and neural methods, e.g. the Brill tagger and the Viterbi algorithm)
3.4.2 Unsupervised POS Tagging
Unlike the supervised models, the unsupervised POS tagging models do not require a pre-tagged corpus. Instead, they use advanced computational methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules, etc. Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.
3.4.3 Rule Based POS Tagging

The rule-based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to the words in a sentence. These rules are often known as context frame rules. For example, a context frame rule might say something like: "if an ambiguous word follows a determiner, tag it as a noun". The transformation-based approaches, on the other hand, use a pre-defined set of handcrafted rules as well as automatically induced rules generated during training. Some models also use information about capitalization and punctuation, the usefulness of which depends largely on the language being tagged. The earliest algorithms for automatically assigning parts of speech were based on a two-stage architecture: the first stage used a dictionary to assign each word a list of potential parts of speech, and the second stage used large lists of hand-written disambiguation rules to bring this list down to a single part of speech for each word [139].
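A minimal sketch of this two-stage architecture follows: a dictionary assigns candidate tags, and one hand-written context-frame rule disambiguates. The lexicon entries and the single rule are hypothetical examples, not a described system.

# stage 1: a dictionary of candidate tags per word (hypothetical entries)
LEXICON = {"can": ["MD", "NN", "VB"], "the": ["DT"], "fish": ["NN", "VB"]}

def tag(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        candidates = LEXICON.get(tok.lower(), ["NN"])
        # stage 2: context-frame rule "after a determiner, prefer a noun"
        if len(candidates) > 1 and i > 0 and tags[-1] == "DT":
            candidates = [t for t in candidates if t == "NN"] or candidates
        tags.append(candidates[0])
    return list(zip(tokens, tags))

print(tag(["the", "can"]))  # [('the', 'DT'), ('can', 'NN')]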
3.4.4 Stochastic POS Tagging
Apart from these, a few different approaches to tagging have been developed.

Support Vector Machines: a powerful machine learning method used for various applications in NLP and in other areas such as bioinformatics and data mining.

Neural Networks: potential candidates for the classification task, since they learn abstractions from examples [141].
Decision Trees: classification devices based on hierarchical clusters of questions. They have been used for natural language processing tasks such as POS tagging; "Weka" can be used for classifying ambiguous words [141].

The surface form is the actual spelling of the final valid word. For example, the English words eating and swimming are both surface representations.

The lexical form shows a simple concatenation of base forms and tags. Consider the following examples showing the lexical and surface forms of English words.
It may be noted that the lexical representation (or form) is often invariant or constant. In contrast, the affixes and bases of the surface form tend to have alternating shapes, as can be seen in the above examples: the same tag "+Verb +Prog" is used with both eat and swim, but swim is realized as swimm in the context of ing, while eat shows no alternation in the context of ing. The rule component consists of rules which map the two representations to each other; each rule is described by a Finite State Transducer (FST). Figure 3.5 schematically depicts two-level morphology.
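The following toy Python function plays the role of the generator direction of such a rule component for the two examples in the text. A real two-level system would compile the gemination rule into an FST rather than use string tests; the vowel pattern used here is a simplifying assumption.

import re

def generate_surface(lexical):
    # Map a lexical form like "swim +Verb +Prog" to its surface form.
    m = re.match(r"^(\w+) \+Verb \+Prog$", lexical)
    if not m:
        return None
    base = m.group(1)
    # gemination rule: a final consonant after a single vowel doubles
    # before -ing (swim -> swimm + ing); otherwise plain concatenation
    if re.search(r"[^aeiou][aeiou][^aeiou]$", base):
        base += base[-1]
    return base + "ing"

print(generate_surface("swim +Verb +Prog"))  # swimming
print(generate_surface("eat +Verb +Prog"))   # eating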
(there are various levels to be distinguished) of the language of the input text.”
3.5.3 Memory based Morphological Analysis
A stemmer uses a set of rules containing a list of stems and replacement rules for stripping off affixes. It is a program-oriented approach in which the developer has to specify all possible affixes together with their replacement rules. The Porter algorithm is one of the most widely used stemming algorithms and is freely available. The advantage of the stemmer approach is that it is well suited to highly agglutinative languages, like the Dravidian languages, for creating morphological analyzers and generators.
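A minimal sketch of such a suffix-stripping rule table in Python follows; the transliterated Tamil suffixes, the ordering, and the length guard are illustrative assumptions, not a published rule set.

# ordered (suffix, replacement) rules, longest suffixes first
RULES = [("kaLai", ""), ("kaL", ""), ("ai", ""), ("il", ""), ("ku", "")]

def strip_suffix(word):
    # One pass of a rule-based stemmer: apply the first matching rule,
    # keeping a minimum stem length so short roots are not destroyed.
    for suffix, repl in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print(strip_suffix("puththakangkaLai"))  # -> puththakang (toy output)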
3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION
Since the first idea of using machines for language translation, many different approaches to machine translation have been proposed, implemented and put into use. The main approaches to machine translation are:
The direct, interlingua and transfer approaches are linguistic approaches, which require some sort of linguistic knowledge to perform translations, whereas the dictionary-based, example-based and statistical approaches are non-linguistic approaches that do not require any linguistic knowledge to translate sentences. The hybrid approach is a combination of both linguistic and non-linguistic approaches.
Rule-based approaches require a great deal of linguistic knowledge during translation. They use grammar rules and computer programs that analyse the text to determine grammatical information and features for each word in the source language, and translate it by replacing each word with a lexicon entry or word that has the same context in the target language. The rule-based approach is the principal methodology that was first developed in machine translation. Linguistic knowledge is required in order to write the rules for this type of approach, and these rules play a vital role during the different levels of translation. This approach is also called theory-based machine translation.
The benefit of the rule-based machine translation method is that it can deeply examine a sentence at the syntactic and semantic levels. Its complications are the prerequisite of vast linguistic knowledge and the very large number of rules needed to cover all the features of a language. An advantage of the approach is that the developer has more control over the translations than is the case with corpus-based approaches. The three approaches that require linguistic knowledge are as follows.

3.6.1.1 Direct Approach

This approach performs only a simple and minimal syntactic and semantic analysis, which distinguishes it from the other rule-based translation approaches, the interlingua and transfer-based approaches. The direct approach is considered ad hoc and generally unsuitable for machine translation. Table 3.5 shows how the sentence "he came late to school yesterday" is translated from English to Tamil using the direct approach.
Figure 3.6 Block Diagram of Direct Approach to Machine Translation

Table 3.5 An Example to Illustrate the Direct Approach

Input sentence in English        He came late to school yesterday
Word reordering                  <He><yesterday><to school><late><come PAST>
Dictionary lookup                அவன் நேற்று பள்ளிக்கு நேரம் கழித்து வா PAST
Inflect (final translation)      அவன் நேற்று பள்ளிக்கு நேரம் கழித்து வந்தான்.
3.6.1.2 Interlingua Approach
The interlingua approach to machine translation aims at transforming texts in the source language into a common representation applicable to many languages. Using this representation, the translation into the target language is performed; with the right rules it should be possible to translate into every language from the same interlingua representation. The approach involves two stages:

1. Analysing and transforming the source language text into a common language-independent representation.

2. Generating the target language text from this common language-independent form.

The first stage is particular to the source language and requires no knowledge of the target language, whereas the second stage is particular to the target language and requires no knowledge of the source language. The main advantage of the interlingua approach is that it creates an economical multilingual environment: it requires 2n translation systems to translate among n languages, whereas the direct approach requires n(n-1) systems. For example, for five languages the interlingua approach needs 10 systems while the direct approach needs 20. Table 3.6 gives the interlingua representation of the sentence "he will reach the hospital in ambulance".
Table 3.6 Interlingua Representation

Predicate    reach
Tense        FUTURE
The concepts and relations used are the most important aspect of any interlingua-based system. The ontology should be powerful enough that all subtleties of meaning expressible in any language are representable in the interlingua. The interlingua approach is more economical when translation is carried out with three or more languages, but the complexity of the approach also increases dramatically. This is clearly evident from the Vauquois triangle shown in Figure 3.7.
3.6.1.3 Transfer Approach

The transfer model involves three stages: analysis, transfer, and generation. In the analysis stage, the source language sentence is parsed, and the sentence structure and the constituents of the sentence are identified. In the transfer stage, transformations are applied to the source language parse tree to convert the structure to that of the target language. The generation stage translates the words and expresses the tense, number, gender, etc. in the target language. Figure 3.8 shows the block diagram of the transfer approach.
Figure 3.8 Block Diagram for Transfer Approach

Consider the sentence "he will come to school in bus". Table 3.7 illustrates the three stages of the translation of this sentence using the transfer approach: the representation after the analysis stage, the representation after reordering according to the Tamil word order as the result of the transfer stage, and the final generation stage, which replaces the source language words with target language words.
78
Table 3.7 An Example for Transfer Approach
The non-linguistic approaches are those which do not explicitly require any linguistic knowledge to translate texts from the source language into the target language. The only resource required by these approaches is data: dictionaries for the dictionary-based approach, or bilingual and monolingual corpora for the empirical or corpus-based approaches.
The dictionary-based approach uses a dictionary for the language pair to translate texts from the source language to the target language; translation is done at the word level. The approach can be preceded by pre-processing stages that analyse the morphological information and lemmatize each word before it is looked up in the dictionary. This kind of approach can be used to translate the phrases in a sentence but is of little use for translating a full sentence. It is, however, very useful for accelerating human translation, by providing meaningful word translations and limiting the human work to correcting the syntax and grammar of the sentence.
The corpus-based approaches do not require explicit linguistic knowledge to translate a sentence, but a bilingual corpus of the language pair and a monolingual corpus of the target language are required to train the system. This approach has attracted a great deal of interest worldwide.
3.6.2.3 Example based Approach

This approach to machine translation is a technique that is mainly based on how human beings interpret and solve problems. That is, humans normally split a problem into sub-problems, solve each of the sub-problems using the idea of how they solved similar problems in the past, and integrate the solutions to solve the problem as a whole. This approach needs a huge bilingual corpus of the language pair between which translation is to be performed.

The EBMT system functions like a translation memory. A translation memory is a computer-aided translation tool that is able to reuse previous translations: if the sentence or a similar sentence has been translated previously, the previous translation is returned. In contrast, the EBMT system can translate novel sentences and not just reproduce previous sentence translations. EBMT translates in three steps: matching, alignment, and recombination [143]. 1) In matching, the system looks in its database of previous examples and finds the pieces of text that together give the best coverage of the input sentence. This matching is done using various heuristics, from exact character match to matches using higher linguistic knowledge to calculate the similarity of words or identify generalized templates. 2) The alignment step is then used to identify which target words these matching strings correspond to. This identification can be done using existing bilingual dictionaries or can be automatically deduced from the parallel data. 3) Finally, these correspondences are recombined, and the rejoined sentences are judged using either heuristic or statistical information. Figure 3.9 shows the block diagram of the example-based approach.
In order to get a clear idea of this approach, consider the English sentence "He bought a home" and the Tamil translations of the related sentences given in Table 3.8.

Table 3.8 Example of English and Tamil Sentences

English            Tamil
He bought a pen    அவன் ஒரு பேனா வாங்கினான்
He has a home      அவனுக்கு ஒரு வீடு இருக்கிறது
The statistical approach to machine translation generates translations using statistical methods, deriving the parameters for those methods by analysing bilingual corpora. This approach differs from the other approaches to machine translation in many aspects. Figure 3.10 shows a simple block diagram of a Statistical Machine Translation (SMT) system.

Figure 3.10 Block Diagram of SMT System
The advantages of the statistical approach over other machine translation approaches are as follows:

• Rule-based machine translation systems are generally expensive, as they employ manual creation of linguistic rules, and they cannot be generalised to other languages; statistical systems, in contrast, can be generalised to any pair of languages for which a bilingual corpus is available.
The hybrid machine translation approach makes use of the advantages of both the statistical and rule-based translation methodologies. Commercial translation providers such as Asia Online and Systran offer systems implemented with this approach. Hybrid machine translation approaches differ in a number of aspects:
Rule-based system with post-processing by the statistical approach: here the rule-based machine translation system produces translations for a given text from the source language into the target language, and the output of this rule-based system is post-processed by a statistical system to provide better translations. Figure 3.11 shows the block diagram of this system.

Figure 3.11 Rule based Translation System with Post-processing

3.7 EVALUATING STATISTICAL MACHINE TRANSLATION
A quick, cheap and consistent approach is required to judge MT systems. A precise automated evaluation technique would require linguistic understanding; methods for automatic evaluation therefore usually measure the similarity between the translation output and one or more reference translations.
Statistical machine translation outputs are very hard to evaluate. To judge the quality of a translation, one may ask human translators to score a machine translation output, or compare the system output with a gold standard output generated by human translators. In human evaluation, different translators translate the same sentence in different ways: there is no single correct answer for the translation task, because a sentence can be translated in many ways, varying in choice of words, word order, and the style of the translator. Machine translation quality is therefore very hard to judge.
Human evaluation provides the best insight into the performance of an MT system, but it comes with major drawbacks: it is an expensive and time-consuming evaluation method. To overcome some of these drawbacks, automatic evaluation metrics have been introduced. These are much faster and cheaper than human evaluation, and they are consistent, since they will always produce the same evaluation given the same data. The disadvantage of automatic evaluation metrics is that their judgments are often not as accurate as those provided by a human. The evaluation process, however, has the advantage that it is not tied to a realistic translation scenario: most often, evaluation is performed on sentences for which one or more gold standard reference translations already exist [143].
Table 3.9 Scales of Evaluation

Score    Adequacy    Fluency
5        All         Flawless
4        Most        Good
3        Much        Non-native
2        Little      Disfluent
1        None        Incomprehensible
The first and most widely-used automatic evaluation measure is BLEU (BiLingual Evaluation Understudy) [144]. It was introduced by IBM in Papineni et al. (2002). It finds the geometric mean of modified n-gram precisions. BLEU considers not only single word matches between the output and the reference sentence, but also n-gram matches, up to some maximum n. It is the ratio of correct n-grams of a certain order n in relation to the total number of generated n-grams of that order. The maximum order n for n-grams to be matched is typically set to four; the resulting mean is then called BLEU-4. Multiple references can also be used to compute BLEU. Evaluating a system translation against multiple reference translations provides a more robust assessment of the translation quality [144]. The BLEU metric then takes the geometric mean of the scores assigned to all n-gram lengths. Equation 3.1 shows the formula for BLEU, where N is
the order of n-grams that are used (usually 4), and $p_n$ is the modified n-gram precision, where each n-gram in the reference can be matched by at most one n-gram from the hypothesis. $BP$ is a brevity penalty, used to penalize translations that are too short. It is based on the length of the hypothesis $c$ and the reference length $r$. If several references are used, there are alternative ways of calculating the reference length, using the closest, average or shortest reference length. BLEU can only be used to give accurate system-wide scores, since the geometric mean formulation means it will be zero if there are no overlapping 4-grams, which is often the case for single sentences.
\[ \text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} \frac{1}{N} \log p_n \right) \qquad (3.1) \]

\[ BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases} \qquad (3.2) \]
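As an illustration of equations 3.1 and 3.2, the following Python sketch computes a simplified, single-reference BLEU score. It is a minimal illustration of the formulas only, not the official BLEU implementation used in the experiments.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    # Simplified single-reference BLEU (equations 3.1 and 3.2).
    hyp, ref = hypothesis.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        # Modified precision: each reference n-gram can be matched at most once.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        if matches == 0 or total == 0:
            return 0.0  # the geometric mean is zero if any order has no matches
        log_prec_sum += math.log(matches / total)
    c, r = len(hyp), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty (equation 3.2)
    return bp * math.exp(log_prec_sum / max_n)

print(bleu("the cat went to the room", "the cat went to the room"))  # 1.0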
The NIST metric (Doddington, 2002) is an extension of the BLEU metric [145]. This metric was introduced to address two characteristics of BLEU. First, the geometric average of BLEU makes the overall score more sensitive to the modified precision of the individual n's than an arithmetic average would. This may be a problem if few high-order n-gram matches exist. Second, all word forms are weighted equally in BLEU. Less frequent word forms may be of higher importance for the translation than, for example, highly frequent function words, which NIST tries to compensate for by introducing an information weight. Additionally, the BP is changed to have less impact for small variations in length. The information weight of an n-gram $w_1 \ldots w_n$ is calculated by the following equation:

\[ \mathrm{Info}(w_1 \ldots w_n) = \log_2\left( \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)} \right) \]
This information weight is used in equation (3.4) instead of the actual count of
matching n-grams. In addition, the arithmetic average is used instead of the geometric,
and the BP is calculated based on the average reference length instead of the closest
reference length. The lengths of these are summed for the entire corpus (r) and the same
for the translations (t).
\[ BP = \exp\left\{ \beta \, \log^2\!\left[ \min\left( \frac{t}{r},\, 1 \right) \right] \right\} \qquad (3.3) \]

\[ \text{NIST} = BP \cdot \sum_{n=1}^{N} \frac{\sum_{\text{co-occurring } w_1 \ldots w_n} \mathrm{Info}(w_1 \ldots w_n)}{\sum_{w_1 \ldots w_n \text{ in sys. output}} (1)} \qquad (3.4) \]
The NIST metric is very similar to the BLEU metric, and their correlations with human
evaluations are also close. Perhaps NIST correlates a bit better with adequacy, while
BLEU correlates a bit better with fluency (Doddington, 2002) [145].
Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved
P(relevant | retrieved)
Recall: the number of relevant documents retrieved by a search divided by the total
number of existing relevant documents (which should have been retrieved)
P(retrieved | relevant)
For example, suppose a search retrieves 6 documents, of which 3 are relevant, while 7 relevant documents exist in total:
Precision = 3 / 6 = 50%
Recall = 3 / 7 = 42.85%
The F measure (weighted harmonic mean) is a combined measure that assesses the precision/recall tradeoff:
F = 2(P × R) / (P + R)
For the example above, F ≈ 46%.
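The worked example above can be reproduced with a few lines of Python (the counts 3, 6 and 7 are taken from the example):

# Worked example: 6 documents retrieved, 3 of them relevant,
# 7 relevant documents exist in total.
retrieved, relevant_retrieved, relevant_total = 6, 3, 7

precision = relevant_retrieved / retrieved        # 3/6 = 50%
recall = relevant_retrieved / relevant_total      # 3/7 = 42.85%
f_measure = 2 * precision * recall / (precision + recall)

print(f"P = {precision:.2%}, R = {recall:.2%}, F = {f_measure:.2%}")
# P = 50.00%, R = 42.86%, F = 46.15%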
Edit distance measures provide an estimate of translation quality based on the number of changes that must be applied to the automatic translation to transform it into a reference translation.
• WER- Word Error Rate (Nießen et al., 2000) [147]. This measure is based on
the Levenshtein distance (Levenshtein, 1966) [146] —the minimum number of
substitutions, deletions and insertions that have to be performed to convert the
automatic translation into a reference translation.
• TER- Translation Edit Rate (Snover et al., 2006) [149]. TER measures the
amount of post-editing that a human would have to perform to change a system
output so it exactly matches a reference translation. Possible edits include
insertions, deletions, and substitutions of single words as well as shifts of word
sequences. All edits have equal cost.
TER = # of edits to closest reference / average # of reference words
The edits that TER considers are insertion, deletion and substitution of
individual words, as well as shifts of contiguous words. TER has also been
shown to correlate well with human judgment.
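A minimal Python sketch of WER, computed as the word-level Levenshtein distance normalized by the reference length, is given below. TER additionally allows shifts of contiguous word sequences, which this sketch does not implement.

def wer(hypothesis, reference):
    # Word Error Rate: word-level Levenshtein distance (substitutions,
    # deletions, insertions) divided by the reference length.
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on mat", "the cat sat on the mat"))  # 1/6 = 0.1667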
3.8 SUMMARY
CHAPTER 4
The grammar of a language is divided into syntax and morphology. Syntax is how words are combined to form a sentence, and morphology deals with the formation of words. Morphology is also defined as the study of how meaningful units can be combined to form words. One of the reasons to process morphology and syntax together in language processing is that a single word in one language may be equivalent to a combination of words in another. The term "morpho-syntax" is a hybrid word derived from morphology and syntax. It plays a major role in processing different types of languages, and it is closely related to machine translation because the fundamental units of machine translation are words and phrases. Retrieving syntactic information is a primary step in pre-processing English language sentences. The process of retrieving the syntactic structure of a given sentence is called parsing, and the tool used to retrieve morphological features from a word is called a morphological analyzer. Syntactic information includes dependency relations, syntactic structure and POS tags; morphological information consists of the lemma and morphological features.
Klein and Manning (2003) [150] from Stanford University proposed a statistical technique for retrieving the syntactic structure of English sentences. Based on this technique, the "Stanford Parser" tool was developed. This parser provides dependency relations as well as phrase structure trees for a given sentence. The Stanford parser package is a Java implementation of probabilistic natural language parsers, including a highly optimized PCFG parser, a lexicalized dependency parser, and a lexicalized PCFG parser. The parser has also been developed for other languages such as Chinese, Italian, Bulgarian, and Portuguese. It uses the knowledge gained from hand-parsed sentences to produce the most likely analysis of new sentences. In this preprocessing, the Stanford parser is used to retrieve the morpho-syntactic information of English sentences.
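For illustration, the parser can be queried through NLTK's CoreNLP wrappers. The sketch below assumes a Stanford CoreNLP server is already running locally on port 9000; the thesis itself uses the Stanford parser package directly.

# Minimal sketch: phrase structure and typed dependencies via NLTK's
# CoreNLP wrappers (assumes a CoreNLP server at localhost:9000).
from nltk.parse.corenlp import CoreNLPParser, CoreNLPDependencyParser

sentence = "The boy is going to the school"

parser = CoreNLPParser(url="http://localhost:9000")
tree = next(parser.raw_parse(sentence))
tree.pretty_print()  # phrase structure tree, e.g. (S (NP ...) (VP ...))

dep_parser = CoreNLPDependencyParser(url="http://localhost:9000")
graph = next(dep_parser.raw_parse(sentence))
for governor, relation, dependent in graph.triples():
    print(relation, governor, dependent)  # e.g. nsubj ('going', 'VBG') ('boy', 'NN')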
Part-of-Speech (POS) tagging is the task of labeling each word in a sentence with its appropriate part of speech, such as noun, verb, adjective, etc. This process takes an untagged sentence as input, assigns a POS tag to each word, and produces the tagged sentence as output. The most widely used part-of-speech tagset for English is the Penn Treebank tagset, which is given in Appendix A. In this thesis, English sentences are tagged using this tagset. POS tags and lemmas of word forms are shown in Table 4.1. The example shown below represents POS tagging for English sentences.
English Sentence
Part-of-Speech Tagging
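As a minimal stand-in illustration of Penn Treebank tagging (using NLTK's off-the-shelf tagger rather than the Stanford tools used in the thesis):

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
import nltk

tokens = nltk.word_tokenize("The boy is going to the school")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('boy', 'NN'), ('is', 'VBZ'), ('going', 'VBG'),
#  ('to', 'TO'), ('the', 'DT'), ('school', 'NN')]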
A morphological analyzer or lemmatizer is used to find the lemma of a word. Lemmas have special importance in highly inflected languages. A lemma is a dictionary word or root word. For example, the word "play" is available in the dictionary, but other word forms like playing, played, and plays are not. So the word "play" is called the lemma or dictionary word for the above mentioned word forms.
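This behaviour can be illustrated with NLTK's WordNet lemmatizer (requires the 'wordnet' data package):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
for form in ["playing", "played", "plays"]:
    print(form, "->", wnl.lemmatize(form, pos="v"))  # all map to 'play'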
English Sentence
Parsing information
(ROOT
(S
(NP (DT The) (NN boy))
(VP (VBZ is)
(VP (VBG going)
(PP (TO to)
(NP (DT the) (NN school)))))
(. .)))
Phrases
English Sentence
Typed dependencies
det(boy-2, The-1)
nsubj(going-4, boy-2)
aux(going-4, is-3)
root(ROOT-0, going-4)
prep(going-4, to-5)
det(school-7, the-6)
pobj(to-5, school-7)
Recently, SMT systems have been augmented with linguistic information in order to address the problems of word order and morphological variance between language pairs. This preprocessing of the source language is done consistently on the training and testing corpora. The source side pre-processing steps bring the source language sentence closer to the target language sentence.
The reordering stage uses syntactic trees for rearranging the phrases in the English sentence. Factorization takes the surface words in the sentence and factors them using a syntactic tool; this information is appended to the words in the sentence. Part-of-Speech tags are simplified and included as a factor in factorization. The factored sentence is then given to the compounding stage. Compounding is defined as adding additional morphological information to the morphological factor of source (English) language words. Additional morphological information includes function words, subject information, dependency relations, auxiliary verbs, and modal verbs. This information is based on the morphological structure of the target language. After adding this information, a few function words and auxiliary words are removed, and the reordered information is incorporated in the integration phase.
[Figure: preprocessing pipeline — English sentence → Stanford Parser Tool → Reordering, Factorization, Compounding → Integration]
95
4.2.1 Reordering English Sentences
Reordering transforms the source language sentence into a word order that is closer to that of the target language. In Machine Translation, the order of the words in the source language sentence is often different from that in the target language sentence. The word-order difference between source and target languages is one of the most significant sources of error in a Machine Translation system. Phrase based SMT systems are limited in handling long distance reordering. A set of syntactic reordering rules is developed and applied to the English language sentence to better align it with the Tamil sentence. The reordering rules capture the structural differences between English and Tamil sentences. These transformation rules are applied to the parse trees of the English source language, which are produced using the Stanford parser tool. The quality of the parse trees plays an important role in syntactic reordering. In this thesis, the source language is English; its parses are therefore comparatively accurate, and the reordering based on them matches the target language word order closely. Generally, English parsers perform better than parsers for other languages because they benefit from a longer tradition of development and more advanced statistical parsing techniques.
4.2.1.1 Syntactic Comparison between English and Tamil
This subsection takes a closer look at the notable differences between the syntax of English and Tamil. Syntax is a theory of sentence structure, and it guides reordering when translations between a language pair involve disparate sentence structures. English and Tamil are from different language families: English is an Indo-European language and Tamil is a Dravidian language. English has Subject-Verb-Object (SVO) word order and Tamil has Subject-Object-Verb (SOV) word order. For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between subject and object. English is a fixed word order language, whereas Tamil word order is flexible. Flexibility in word order means that the order may change freely without affecting the grammatical meaning of the sentence. While translating from English to Tamil, English verbs have to be moved from after the subject to the end of the sentence.
When a demonstrative such as (இது) ithu 'this' is followed by a noun (புத்தகம்) puththakam 'book', the sequence is translated into English as 'This is a book.' Tamil is a null subject language: not all Tamil sentences have subjects, verbs, and objects. It is possible to construct grammatically valid and meaningful sentences that omit one or more of these elements. For example, a sentence may have only a verb, such as ("completed"), or only a subject and object, without a verb, such as (அது என் வீடு) athu en vIdu ("That [is] my house").
Tamil does not have a copula verb (a linking verb equivalent to the word is). The word "is" is included in the translations only to convey the meaning more easily. Schiffman (1999) observed that Tamil syntax is the mirror image of the order of an English sentence, especially when there are relative clauses, quotations, adjectival and adverbial clauses, conjoined verbal constructions, and aspectual and modal auxiliaries, among others [156].
4.2.1.2 Reordering Methodology
Figure 4.3 shows the method of reordering English sentences. The English sentence is given to the Stanford parser for retrieving syntactic information. After retrieving the information, the English sentence is reordered using pre-defined syntactic rules. In order to obtain a word order similar to the target language, reordering is applied to the source sentence prior to translation. 180 rules were developed according to the syntactic differences between English and Tamil. These syntax-based rules are used to reorder the English sentence to better match the Tamil sentence, and they are applied to the English parse tree at the sentence level. The reordering rules are specific to this particular language pair. Reordering is carried out on phrases as well as words. All the created rules are compared with the production rules of an English sentence. If a match is found, the transformation is performed according to the target rule. Examples of English reordering rules are shown in Table 4.2. All the rules are included in Appendix B.
[Figure 4.3: reordering — English sentence → Stanford Parser Tool → syntactic information → syntactic reordering rules → reordered English sentence]
Table 4.2 Reordering Rules
Here, '#' separates the units of a reordering rule; the last unit gives the source-to-target index mapping. In the example above, "0:1, 1:0" indicates that the first child of the target rule comes from the second child of the source rule, and the second child of the target rule comes from the first child of the source rule.
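The following Python sketch shows how a rule in this format could be applied to an nltk parse tree. It is an illustrative reimplementation of the idea, not the thesis's actual code; the rule shown is the VP rule discussed below.

from nltk import Tree

def apply_rule(node, rules):
    # Recursively permute children of parse-tree nodes whose production
    # matches a reordering rule of the form 'LHS -> C0 C1 ... # i:j,...'.
    if not isinstance(node, Tree):
        return node
    children = [apply_rule(child, rules) for child in node]
    key = (node.label(), tuple(c.label() for c in children if isinstance(c, Tree)))
    if key in rules:
        mapping = rules[key]  # target position -> source child index
        children = [children[mapping[i]] for i in range(len(children))]
    return Tree(node.label(), children)

# Rule 'VP -> VBD NP PP # 0:2,1:1,2:0' (VBD NP PP becomes PP NP VBD):
rules = {("VP", ("VBD", "NP", "PP")): {0: 2, 1: 1, 2: 0}}
tree = Tree.fromstring(
    "(S (NP (NNP Sharmi)) (VP (VBD gave) (NP (PRP$ her) (NN book)) "
    "(PP (TO to) (NP (NNP Arthi)))))")
reordered = apply_rule(tree, rules)
print(" ".join(reordered.leaves()))  # Sharmi to Arthi her book gave

Note that only the single VP rule is applied here; in the full system, further rules (for example inside the PP) move the preposition after its noun, yielding the fully reordered sentences shown in Table 4.3.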
Source → 0(VBP) 1(PP)
The first production rule, (i) S → NP VP, matches the first reordering rule in Table 4.2. The target transformation is the same as the source pattern, so the first production rule is unchanged. The next production rule, (ii) VP → VBD NP PP, matches the eighth reordering rule in the table, and the transformation is 0:2 1:1 2:0, meaning that the source word order (0,1,2) is transformed into (2,1,0), where 0, 1, 2 are the indices of VBD, NP and PP. The transformed pattern is therefore PP NP VBD. This process is applied to each of the production rules in turn. The final transformed production rule is given below.
The English side of the parallel corpus used for training is reordered, and the testing sentences are also reordered. About 80% of English sentences are reordered correctly according to the developed rules. Original and reordered English sentences are shown in Table 4.3. After reordering, the English sentences are given to the compounding stage.
Table 4.3 Original and Reordered English Sentences
Sharmi gave her book to Arthi | Sharmi her book Arthi to gave
She went to shop for buying fruits | She fruits buying for shop to went
Current phrase-based models are limited to the mapping of small text chunks without the use of any explicit linguistic information, such as morphological and syntactic knowledge. Such information plays a significant role in morphologically rich languages. On the other hand, for many language pairs the availability of bilingual corpora is very limited, and SMT performance depends on the quality and quantity of the corpora. SMT therefore needs a method that uses linguistic information explicitly with smaller amounts of parallel data. Koehn and Hoang (2007) developed a factored translation framework for statistical translation models to tightly integrate linguistic information [10]. It is an extension of phrase-based Statistical Machine Translation that allows the integration of additional morphological and lexical information, such as lemma, word class, gender, number, etc., at the word level on both the source and the target languages. Factoring the English sentence is a basic step in a factored translation system. The factored translation model is one way of representing morphological knowledge explicitly in Statistical Machine Translation. The factors considered in pre-processing and their descriptions for the English language are shown in Table 4.4. Here, word refers to the surface word, lemma represents the dictionary or root word, word class represents the word-class category, and the morphology tag is a compound tag which contains morphological information and/or function words. In some cases the morphology tag also contains dependency relations and/or PNG information. For instance, the English sentence "I bought vegetables to my home" is factored into the linguistic factors shown in Table 4.5. The factored representation of the English sentence is shown in Table 4.6.
Table 4.5 Example of English Word Factors
to | to | TO | PRE | Prep
Table 4.6 Factored English Sentence (WORD | FACTORS)
I i|PRP|PRP_nsubj
bought buy|V|VBD
vegetables vegetable|N|NNS_dobj
to to|PRE|TO_prep
my my|PRP|PRP$_poss
home home|N|NN_pobj
Instead of mapping surface words in translation, factored models map the linguistic units (factors) of the language pair. The Stanford Parser is used for factorizing the English sentence. From the parser output, linguistic information such as the lemma, part-of-speech tags, syntactic information and dependency information is retrieved. This linguistic information is attached as factors to the surface word.
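A minimal sketch of producing such a factored representation with off-the-shelf NLTK components is shown below. The actual factors in the thesis come from the Stanford parser output and additionally include dependency information.

# Requires the nltk 'punkt', 'averaged_perceptron_tagger' and 'wordnet' data.
import nltk
from nltk.stem import WordNetLemmatizer

def wordnet_pos(tag):
    # Map Penn Treebank tags onto WordNet POS classes for lemmatization.
    return {"V": "v", "N": "n", "J": "a", "R": "r"}.get(tag[0], "n")

wnl = WordNetLemmatizer()
tokens = nltk.word_tokenize("I bought vegetables to my home")
factored = [
    f"{word}|{wnl.lemmatize(word.lower(), wordnet_pos(tag))}|{tag}"
    for word, tag in nltk.pos_tag(tokens)
]
print(" ".join(factored))
# e.g. I|i|PRP bought|buy|VBD vegetables|vegetable|NNS to|to|TO my|my|PRP$ home|home|NN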
A baseline Statistical Machine Translation (SMT) system only considers surface word forms and does not use linguistic information. Translating into a target surface word form depends not only on the source word form but also on additional morpho-syntactic information. While translating from a morphologically simpler language to a morphologically rich language, it is very hard to retrieve the required morphological information from the source language sentence alone. This morphological information is essential for producing a target language word form. The preprocessing phase called compounding is used to retrieve the required linguistic information from the source language sentence. Morphologically rich languages have a large number of surface forms in the lexicon to compensate for a free word order. This large number of Tamil word forms is very difficult to generate from English words. Compounding is defined as adding additional morphological information to the morphological factor of source (English) language words. Additional morphological information includes subject information, dependency relations, auxiliary verbs, modal verbs and a few function words. This information is based on the morphological structure of the Tamil language. In the compounding phase, dependency relations are used to identify the function words in the English factored corpora. During integration, a few function words are deleted from the factored sentence and attached as a morphological factor to the corresponding content word.
In Tamil, function words are not available as separate tokens; they are fused with the corresponding content words. So instead of making the sentences structurally similar in some other way, function words are removed from the English sentence. This process reduces the length of the English sentences. Like function words, auxiliary verbs and modal verbs are also identified and attached to the morphological factor of the head word of the source sentence. Now the morphological factor representation of the English sentence is similar to that of the Tamil sentence. This compounding step indirectly integrates dependency information into the source language factors.
English words are divided into two types: content words, which carry meaning by referring to objects and actions, and function words, which express the relationships between content words. Tables 4.8 and 4.9 show the content words and function words of English. The relationship between content words may also be encoded in morphology. Content words are also called open class words, and function words are called closed class words. The part-of-speech categories of content words are verbs, nouns, adjectives and adverbs; for function words, the categories are prepositions, conjunctions and determiners.
Generally, languages differ not only in word order but also in how they encode the relationships between words. English has a strictly fixed word order and makes heavy use of function words but little use of morphology. Tamil has a rich morphological structure and heavy usage of content words, but a free word order.
Because of the function words, the average number of words in an English sentence is higher than in an equivalent Tamil sentence. Some of the function words in English do not exist in Tamil because these words are fused with Tamil content words. English sentences contain more function words than content words, whereas Tamil sentences consist mostly of content words; the translations of English function words are fused into Tamil content words. While translating from English to Tamil, no separate equivalent translation is available for English function words, and this leads to an alignment problem. Table 4.10 shows the various word forms based on English tenses.
In Tamil, verbs are morphologically inflected with tense and PNG (Person-Number-Gender) markers, and nouns are inflected for number and case. Each Tamil verb root can be inflected into more than ten thousand surface word forms because of the agglutinative nature of the language. This morphological richness of Tamil leads to a sparse data problem in a Statistical Machine Translation system. Examples of Tamil word forms based on tenses are given in Table 4.11.
Verb | Tense | Word form
play | Past | played
play | Past perfect | had played
Table 4.11 Tamil Word Forms based on Tenses
The morphological difference between English and Tamil makes Statistical Machine Translation a complex task. English mostly conveys the relationships between words using function words or word positions, whereas Tamil expresses them through morphological variations of the word. Therefore Tamil has a larger vocabulary of surface forms, which leads to a sparse data problem in an English to Tamil SMT system. To solve this directly, a large amount of parallel training corpora would be required to cover all the Tamil surface forms. It is very difficult to create or collect parallel corpora containing all the Tamil surface forms because Tamil is a less resourced language. Instead of covering all surface forms, a new method is required to handle all word forms with a limited amount of data.
Consider an example English sentence (Figure 4.5) “the cat went to the room”.
From this sentence, the word “to” is a function word which does not have any separate
output translation unit in Tamil. Its translation is fused into the Tamil content word "அறை" (aRai). But a statistical machine translation system uses phrase-based models, so it will consider "to the room" as a single phrase aligned to a single Tamil word. A problem arises for a new sentence which contains a phrase like "to the X" (e.g. to the school), where X is any noun. Even if X is available in the bilingual corpora, the system cannot decode a correct translation for "to the X", because phrase-based SMT treats "to the X" as an unknown phrase even when X is aligned correctly. So the function words should be treated separately prior to the SMT system. Here, these words are taken care of by a preprocessing step called compounding. Compounding identifies some of the function words and attaches them to the morphological factor of the related content word in the factored model. It retrieves the morphological information of an English content word from dependency relations and function words.
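The following toy sketch illustrates only the compounding idea: a dependency relation licenses deleting a function word and appending its lemma to the morphological factor of the related word. The token layout, the rule set and the choice of attachment site are illustrative assumptions, not the thesis's actual rules (those are in Table 4.12).

def compound(factored_tokens, dependencies, rules):
    # factored_tokens: list of [word, lemma, pos, morph] (1-indexed positions);
    # dependencies: (relation, head_index, dependent_index) triples.
    to_delete = set()
    for relation, head, dep in dependencies:
        if relation in rules:
            # Attach the function word's lemma to the head's morph factor.
            factored_tokens[head - 1][3] += "_" + factored_tokens[dep - 1][1]
            to_delete.add(dep)
    kept = [t for i, t in enumerate(factored_tokens, 1) if i not in to_delete]
    return [" | ".join(t) for t in kept]

tokens = [["the", "the", "DT", "dt"], ["cat", "cat", "N", "nn"],
          ["went", "go", "V", "vb.past"], ["to", "to", "TO", "to"],
          ["the", "the", "DT", "dt"], ["room", "room", "N", "nn"]]
deps = [("prep", 3, 4)]  # 'to' attaches to the verb 'went'
print(compound(tokens, deps, {"prep"}))
# 'to' disappears as a token and surfaces in the verb's morph factor:
# ['the | the | DT | dt', 'cat | cat | N | nn', 'went | go | V | vb.past_to', ...]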
Compounding is carried out using pre-defined linguistic rules. These rules are developed based on the morphological differences between English and Tamil, and they identify the transformations from English morphological factors to Tamil morphological factors. Sample compounding rules are shown in Table 4.12. The rules are developed based on the dependency information.
[Figure: compounding — English sentence → Stanford Parser → dependency rules]
Another important advantage of compounding is that it also solves the difficulty of handling copula constructions in English sentences. The copula is a special type of verb in English, while in other languages other parts of speech can serve the role of the copula. A copula links the subject of a sentence with a predicate; it is also referred to as a linking verb because it does not describe an action. Examples of copula sentences are given below.
1. Sharmi is a doctor.
2. Mani and Arthi are lawyers.
dobj → +ACC (accusative case marker)
nsubj → +Subject (subject marker)
Table 4.13 Average Words per Sentence
Method | Sentences | Words | Average words per sentence
Table 4.14 Factored English Sentence Table 4.15 Compounded English Sentence
I | i | PN | prn I | i | PN | prn
to | to | TO | TO to | to | TO | TO
my | my | PN | PRP$ my | my | PN | PRP$
Integration is the final stage in source-side preprocessing. Here the preprocessed English sentence is obtained from the reordering and compounding stages. Reordering takes the raw sentence and reorders it according to the predefined rules. Compounding takes the factored sentence and alters the morphological factors of the content words using the compounding rules. Function words are identified in the compounding stage, and a few of them are removed during the integration process. Figure 4.7 shows the integration process of the preprocessing stages. Table 4.16 shows examples of preprocessed sentences.
Original English sentence:
I bought vegetables to my home.
0 1 2 3 4 5
Reordered English sentence:
I my home to vegetables bought.
0 4 5 3 2 1
Factored English sentence:
I | i | PN | prn bought | buy | V | VBD vegetables | vegetable | N | NNS
[Figure 4.7: integration — factored sentence and function word index combined in the integration step]
Table 4.16 Preprocessed English Sentences
come|come|V|vb_3SF_may_not
went|go|V|vb.past_1S
him|him|PN|prp_to gave|give|V|vb.past_1S
killing|kill|V|vb.prog_3SN_was
4.3 SUMMARY
This chapter presented linguistic preprocessing of English sentences for better matching with Tamil sentences. The preprocessing stages include reordering, factoring and compounding; a final integration process combines these stages. The chapter also discussed the effects of syntactic and morphological variance between English and Tamil. It was shown that reordering and compounding rules produce significant gains in a factored translation system. Reordering plays an especially important role for language pairs with disparate sentence structure: the difference in word order between two languages is one of the most significant sources of errors in Machine Translation. While phrase based MT systems do very well at reordering inside short windows of words, long-distance reordering remains a challenging task. Translation accuracy can be significantly improved if the reordering is done prior to translation. The reordering rules developed here are only valid for the English-Tamil language pair, though they could be used for other Dravidian languages with small modifications. In future, automatic rule creation for reordering using bilingual corpora should improve the accuracy and make the approach applicable to any language pair. Compounding and factoring are used in order to reduce the amount of English-Tamil bilingual data required. Preprocessing also reduces the number of words in an English sentence. The accuracy of preprocessing heavily depends on the quality of the parser. Several studies have shown that preprocessing is an effective method for obtaining word order and morphological information which match the target language. Moreover, this preprocessing approach is generally applicable to other language pairs which differ in word order and morphology. This research work has shown that adding linguistic knowledge in the preprocessing of training data can lead to remarkable improvements in translation performance.
CHAPTER 5
PART OF SPEECH TAGGER FOR TAMIL
5.1 GENERAL
Knowledge of the language pair has been shown to improve translation performance, and preprocessing in an SMT system makes the two languages of the pair more similar. Koehn and Hoang (2007) [10] developed a factored translation framework for statistical translation models to tightly integrate linguistic information. It is an extension of phrase-based statistical machine translation that allows the integration of additional morphological and lexical information, such as lemma, word class, gender, number, etc., at the word level on both the source and the target languages. Preprocessing methods are used to convert Tamil sentences into factored Tamil sentences. The preprocessing module for Tamil sentences includes two stages: POS tagging and morphological analysis. The first step in preprocessing a Tamil sentence is to retrieve the Part-of-Speech information of each word; this information is included in the factors of the surface word. This chapter describes the development of the Tamil POS tagger. In the next stage, Tamil morphological analysis is used to retrieve the lemma and morphological information, which are also included in the factors of the surface word. The next chapter (Chapter 6) explains the implementation details of the Tamil morphological analyzer.
POS tags help in building automatic word-sense disambiguation algorithms. Parts of Speech are very often used for shallow parsing of texts, or for finding noun phrases and other phrases for information extraction applications. Corpora that have been marked for Part-of-Speech are very useful for linguistic research, for example, to find instances or frequencies of particular words or sentence constructions in large corpora.
Apart from these, many Natural Language Processing (NLP) activities such as summarization, document classification, Natural Language Understanding (NLU) and Question Answering (QA) systems depend on Part-of-Speech tagging. Words are divided into different classes called Parts of Speech (POS), word classes, morphological classes, or lexical tags. In traditional grammar, there are only a few parts of speech (noun, verb, adjective, adverb, etc.). Many recent models have a much larger number of word classes (POS tags). Part-of-Speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both its definition and its context.
Closed classes are those that have a relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English; new prepositions are rarely coined. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages. There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs. It turns out that Tamil and English have all four of these, although not every other language does.
Parts of Speech (POS) tagging means assigning grammatical classes, i.e. suitable Parts of Speech tags, to each word in a natural language sentence. Assigning a POS tag to each word of an un-annotated text by hand is a laborious and time consuming process. This has led to the development of various approaches to automating the POS tagging task. An automatic POS tagger takes a sentence as input, assigns a POS tag to each word in the sentence, and produces the tagged text as output. Tags are also applied to punctuation markers; thus tagging for natural language is the same process as tokenization for computer languages. The input to a tagging algorithm is a string of words and a specified tagset; the output is a single best tag for each word.
Even in a simple English sentence, automatically assigning a tag to each word is not trivial. For example, the word book is ambiguous: it has more than one possible usage and part of speech. It can be a verb (as in book that bus or to book the suspect) or a noun (as in hand me that book, or a book of matches). Similarly, that can be a determiner (as in Does that flight serve dinner), or a complementizer (as in I thought that your flight was earlier).
Tamil Example:
Considering the syntax and context of the sentence, the word "adi" should be tagged as a noun (NN). The problem of automatic POS tagging is to resolve such ambiguities by choosing the proper tag for the context; Part-of-Speech tagging is thus a disambiguation task. Another important point, discussed and agreed upon, is that POS tagging is NOT a replacement for a morphological analyzer. A 'word' in a text carries the following linguistic knowledge:
the POS tag should be based on the 'category' of the word, and the finer features can be acquired from the morphological analyzer.
Words can be classified under various parts of speech based on the role they play in the sentence. Traditionally, the Tamil grammarian Tholkappiar classified Tamil words into four major classes. Modern descriptions commonly distinguish the following categories:
1. Nouns
2. Verbs
3. Adjectives
4. Adverbs
5. Determiners
6. Post Positions
7. Conjunctions
8. Quantifiers
Other POS categories for Tamil
Apart from nouns and verbs, the other POS categories that are "open class" are the adverbs and adjectives. Most adjectives and adverbs, in their root form, can be placed in the lexicon. But there are adjectives and adverbs that can be derived from noun and verb stems. The following are the morphotactics of adjectives derived from noun roots and verb stems.
Examples:
Noun_root + adjective_suffix
uyaram + Ana = uyaramAna <ADJ>
உயரம்+ ஆன=உயரமான
verb_stem + relative_participle
cey + tha = ceytha <VNAJ>
ெசய் +த = ெசய்த
Following is the Morphotactics of derived adverbs from noun roots and verb stem.
noun_root + adverb_suffix
uyaram +Aka = uyaramAka <ADV>
உயரம்+ ஆக =உயரமாக
verb_stem +adverbial participle
cey + thu = ceythu <VNAV>
ெசய் + = ெசய்
There are a number of non-finite verb forms in Tamil. Apart from participle forms, they are grammatically classified into structures such as infinitive, conditional, etc.
Examples:
verb_stem + Infinitive marker
paRa + kka=paRakka <VINT> (to fly)
பற + க்க =பறக்க
Examples:
verb_stem + Conditional suffix
vA + wth + Al = vawthAl <CVB> (if one comes)
வா + ந்த் + ஆல் = வந்தால்
As Tamil is an agglutinative language, nouns are inflected for number and case, and verbs are inflected for various categories including tense, person, number and gender suffixes. Verbs can be adjectivalized and adverbialized, and verbs and adjectives can be nominalized by means of certain nominalizers. Adjectives and adverbs themselves do not inflect. Many postpositions in Tamil [159] derive from nominal and verbal sources, so one must often rely on the syntactic function or context to decide whether a form is a noun, adjective, adverb or postposition. This adds to the complexity of POS tagging for Tamil.
Morphological level inflection
Nouns need to be annotated as common noun, compound noun, proper noun, compound proper noun, pronoun, cardinal and ordinal. Pronouns need to be further annotated for personal pronoun. Confusion can arise between common noun and compound noun, and also between proper noun and compound proper noun; a common noun can also occur as a compound noun.
The verbal forms are complex in Tamil. A finite verb shows the morphological structure verb stem + tense + PNG suffix, as in நடந்தேன் 'I walked'. A number of non-finite forms are possible: adverbial forms, adjectival forms, infinitive forms, and conditional forms.
Verb stem + Tense + Adjectivalizer
The um-suffixed adjectival form clashes with other homophonous forms, which leads to ambiguity.
A number of adjectival and adverbial forms of verbs are lexicalized as adjectives and adverbs respectively, and clash semantically with their corresponding sentential adjectival and adverbial forms, creating ambiguity in POS tagging. Adverbs too need to be distinguished based on their source category. Many adverbs are derived by suffixing aaka to nouns in Tamil, and a functional clash can be seen between noun and adverb in aaka-suffixed forms. This type of clash is seen among other Dravidian languages too.
avaL azakAka irukkiRAL
‘she is beautiful’
Postpositions in Tamil come from various categories, such as verbal, nominal and adverbial sources. Many a time the demarcation line between verb/noun/adverb and postposition is thin, leading to ambiguity. Some postpositions are simple and some are compound. Postpositions are conditioned by the case of the nouns they follow, so simply tagging a form as a postposition can be misleading. There are postpositions which occur after nouns and also after verbs, which makes the postposition ambiguous (spatial vs. temporal).
For developing a POS tagged corpus, it is necessary to define the tagset (POS tagset) used in that corpus. The collection of all possible tags is called a tagset. Tagsets differ from language to language. After referring to and considering the available tagsets for Tamil and other languages, a customized tagset named the AMRITA tagset was developed. The guidelines from AnnCorra, IIIT Hyderabad [160] and EAGLES (1996) were also considered while developing the AMRITA tagset. The guidelines followed while developing the AMRITA tagset are given below.
Another point considered while deciding the tags was whether to come up with a totally new tagset or to take an existing standard tagset as a reference and modify it according to the objectives of the new tagger. The latter option is often better because tag names assigned by an existing tagger may already be familiar to users and are thus easier to adopt for a new language than entirely new ones; this saves the time needed to become familiar with new tags before working with them.
IIIT Hyderabad Tagset
A POS tagset for Indian languages was developed by IIIT, Hyderabad. Its tags are decided on coarse linguistic information, with the idea of expanding to finer distinctions if required. The annotation standard for POS tagging for Indian languages includes 26 tags [163].
Ganesan's POS Tagset
Ganesan has prepared a POS tagger for Tamil. His tagger works well on the CIIL corpus; its efficiency on other corpora has yet to be tested. He has a rich tagset for Tamil. He tagged a portion of the CIIL corpus using a dictionary as well as a morphological analyzer, corrected it manually, and trained on it to tag the rest of the corpus. The tags are added morpheme by morpheme [165].
The main drawback of the majority of tagsets used for Tamil is that they take the verb and noun inflections into account for tagging. Hence, at tagging time, every inflected word in the corpus must be split into morphemes, which is a tough and time consuming process. At the POS level, one needs to determine only the word's grammatical category, which can be done using a limited number of tags; the inflectional forms can be taken care of by the morphological analyzer. So there is no need for a large number of tags. Moreover, a large number of tags leads to more complexity, which in turn reduces tagging accuracy. Considering the complexity of Tamil POS tagging and referring to various tagsets, a customized tagset has been developed (the AMRITA POS tagset). The customized POS tagset used for the present research work contains 32 tags, without considering the inflections. The 32 tags are listed in Table 5.1.
In the AMRITA POS tagset, compound tags for common nouns (NNC) and proper nouns (NNPC) are used. The tag VBG is used for verbal nouns and participle nouns. These 32 POS tags are used for the POS tagger and chunker. For the morphological analyzer, these 32 tags were further simplified and reduced to 10 tags.
Table 5.1 AMRITA POS Tagset
Corpus linguistics seeks to further the understanding of language through the analysis of large quantities of naturally occurring data. Text corpora are used in a number of different ways. Traditionally, corpora have been used for the study and analysis of language at different levels of linguistic description. Corpora have been constructed for the specific purpose of acquiring knowledge for information extraction systems, knowledge-based systems and e-business systems [167]. Corpora have been used for studying child language development. Speech corpora play a vital role in the specification, design and implementation of telephonic communication and for the broadcast media.
There is a long tradition of corpus linguistic studies in Europe. The needs that a corpus serves for a language are manifold: from the preparation of dictionaries and lexicons to machine translation, corpora have become an indispensable resource for the technological development of languages. A corpus is a large body of text incorporating various types of textual material, including newspapers, weeklies, fiction, scientific writings, literary writings, and so on. A corpus represents all the styles of a language. A corpus must be very large in size, as it is used for many language applications such as the preparation of lexicons of different sizes, purposes and types, NLP tools, machine translation programs and so on.
An untagged or un-annotated corpus provides limited information to its users. A corpus can be augmented with additional information by labeling the morphemes, words, phrases and sentences with their grammatical values. Such information helps the user to retrieve information selectively and easily. Figure 5.1 presents an example of an untagged corpus.
The frequency of a lemma is useful in the analysis of a corpus. When the frequency of a particular word is compared to other context words, it is useful for finding whether the word is common or rare. The frequencies are relatively reliable for the most common words in a corpus, but to analyze the senses and association patterns of words, a very large number of occurrences is needed, drawn from a very large corpus containing many different texts and a wide range of topics, so that the frequencies of words are less influenced by individual texts. Frequency lists based on an untagged corpus are of limited usefulness because they do not show which grammatical uses are common or rare. A tagged corpus is thus an important dataset for NLP applications. Figure 5.2 shows an example of a tagged corpus.
5.4.2 Available Corpora for Tamil
Corpora can be distinguished as tagged corpora, parallel corpora and aligned corpora. A tagged corpus is one that is tagged for Part-of-Speech, morphology, lemma, phrases, etc. A parallel corpus contains texts and their translations in each of the languages involved, allowing wider scope for double-checking translation equivalents. An aligned corpus is a kind of bilingual corpus where text samples in one language and their translations into another are aligned sentence by sentence, phrase by phrase, word by word, or even character by character.
As far as building corpora for the Indian languages is concerned, it was the Central Institute of Indian Languages (CIIL) which took the initiative and started preparing corpora for some of the Indian languages (Tamil, Telugu, Kannada, and Malayalam). The Department of Electronics (DOE) financed the corpus-building project. The target was to prepare a corpus of ten million words for each language but, due to financial crunch and time restrictions, it ended up with just three million words per language. The Tamil corpus, with three million words, was built by CIIL in this way; it is a partially tagged corpus.
AUKBC Research Centre, which has taken up NLP oriented work for Tamil, has improved upon the CIIL Tamil corpus and tagged it for its MT programs. It also developed an English-Tamil parallel corpus to support its goal of building an MT tool for English-Tamil translation. A parallel corpus is very useful for training and for building example-based machine translation systems.
A tagged corpus is the immediate requirement for different analyses in the field of Natural Language Processing. Most language processing work needs such a large database of texts, which provides real, natural, native language of varying types. Annotation of corpora can be done at various levels: Part of Speech, phrase/clause level, dependency level, etc. Part of Speech tagging forms the basic step towards building an annotated corpus. The POS tagged corpus here was developed in three stages:
1. Pre-editing
2. Manual Tagging
3. Bootstrapping
Pre-editing
Tamil text documents have been collected from the Dinamani website, Yahoo Tamil, Tamil short stories, etc. (for example, Figure 5.3). The corpus has been cleaned using a simple program that removes punctuation (except dots, commas and question marks). The corpus has been sententially aligned. The next step is to change the corpus into column format, because the SVMTool training data must be in column format, i.e. a token (word) per line, sentence by sentence. The column separator is the blank space.
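A small sketch of this pre-editing step might look as follows; the exact cleaning program used in the thesis is not shown, and this version simply keeps word characters plus dots, commas and question marks and emits one token per line:

import re

def to_column_format(raw_text):
    # Keep only word characters, whitespace, and the . , ? marks.
    cleaned = re.sub(r"[^\w\s.,?]", "", raw_text)
    # Split . , ? off as separate tokens so they can act as sentence separators.
    cleaned = re.sub(r"([.,?])", r" \1 ", cleaned)
    return "\n".join(cleaned.split())  # one token per line

print(to_column_format("அவள் பள்ளிக்கு சென்றாள். நீ வருவாயா?"))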
Manual tagging
After pre-editing, the untagged corpus is tokenized into column format (Figure 5.4). In the second stage, the untagged corpus is manually POS tagged using the AMRITA tagset. Initially, 10,000 words were manually tagged. During the manual POS tagging process, great difficulties were faced while assigning tags to the corpus.
Bootstrapping
After completing the manual tagging, the tagged corpus is given to the learning component of the training algorithm to generate a model. Using the generated model, the decoder of the training algorithm tags further untagged sentences. The output of this component is a tagged corpus with some errors; the tags are then corrected manually. After correction, the newly tagged corpus is added to the training corpus to increase its size.
5.4.4 Applications of POS Tagged Corpus
• Chunking
• Parsing
• Information extraction and retrieval
• Machine Translation
• Tree bank creation
• Document classification
• Question answering
• Automatic dialogue system
• Speech processing
• Summarization
• Statistical training of Language models
• Machine Translation using multilingual corpora
• Text checkers for evaluating spelling and grammar
• Computer Lexicography
• Educational application like Computer Assisted Language Learning
The POS tagged corpus details are given in the corpus statistics and tag count tables (Tables 5.2 and 5.3).
Table 5.2 Corpus Statistics
No. of sentences | 45,682
No. of words | 510,467
Table 5.3 Tag Count
5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL
5.5.1 SVMTool
This section presents the SVMTool, a simple, flexible, and effective generator of sequential taggers based on Support Vector Machines (SVM), and explains how it is applied to the problem of Part-of-Speech tagging. This SVM-based tagger is robust and flexible for feature modeling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger [39] under exactly the same conditions, and achieves a very competitive accuracy of 94.6% for Tamil.
Moreover, some languages, like Tamil, have a richer morphology than others. This leads the tagger to have a large set of feature patterns. Also, the tagset size and ambiguity rate may vary from language to language and from problem to problem. Besides, if little data is available for training, the proportion of unknown words may be huge. Sometimes morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a sequential tagger should be flexible with respect to the amount of information utilized and the context shape.
Flexibility: The size and shape of the feature context can be adjusted. Also, rich features can be defined, including word and POS (tag) n-grams as well as ambiguity classes and "may be's", apart from lexicalized features for unknown words and sentence-level information. The behavior at tagging time is also very flexible, allowing different strategies.
Efficiency: Performance at tagging time depends on the feature set size and the tagging scheme selected. For the default (one-pass, left-to-right, greedy) tagging scheme, the tool exhibits a tagging speed of 1,500 words/second, whereas the C++ version achieves over 10,000 words/second. This has been achieved by working in the primal formulation of the SVM. The use of linear kernels makes the tagger more efficient at both tagging and learning time, but forces the user to define a richer feature space.
The SVMTool [12] software package consists of three main components, namely the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval). Prior to tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component. Different models are learned for different strategies. Then, at tagging time, using the SVMTagger component, one may choose the tagging strategy that is most suitable for the purpose. Finally, given a correctly annotated corpus and the corresponding SVMTool predicted annotation, the SVMTeval component reports the tagging results.
5.5.3.1 SVMTlearn
Given a set of examples (either annotated or unannotated) for training, SVMTlearn trains a set of SVM classifiers. To do so, it makes use of SVMlight, an implementation of Vapnik's Support Vector Machines in C developed by Thorsten Joachims (2002); this software has been used to train the models.
Training data must be in column format, i.e. a token per line, sentence by sentence. The column separator is the blank space. The word is the first column of the line, and the tag to predict takes the second column; the rest of the line may contain additional information. An example is given below in Figure 5.5. No special '<EOS>' mark is employed for sentence separation; sentence punctuation is used instead, i.e. [.!?] symbols are taken as unambiguous sentence separators. In this system the symbols [.?] are used as sentence separators.
Figure 5.5 Training Data Format
Models
Five different kinds of models have been implemented in this tool. Models 0, 1 and 2 differ only in the features they consider. Models 3 and 4 are just like Model 0 with respect to feature extraction, but their examples are selected in a different manner. Model 3 is for unsupervised learning: given an unlabeled corpus and a dictionary, at learning time it can only count on knowing the ambiguity class, and the POS information only for unambiguous words. Model 4 achieves robustness by simulating unknown words in the learning context at training time.
Model 0: This is the default model, in which the unseen context remains ambiguous. It was designed with the one-pass on-line tagging scheme in mind, i.e. the tagger goes either left-to-right or right-to-left making decisions, so past decisions feed future ones in the form of POS features. At tagging time, only the parts of speech of already disambiguated tokens are considered; for the unseen context, ambiguity classes are considered instead (Table 5.4).
Model 1: This model considers the unseen context as already disambiguated in a previous step. It is thus intended for a second pass, revisiting and correcting already tagged text (Table 5.5).
Model 2: This model does not consider POS features at all for the unseen context. It is designed to work in a first pass, with Model 1 reviewing the tagging results in a second pass (Table 5.6).
Model 3: The training is based on the role of unambiguous words. Linear classifiers are trained with examples of unambiguous words extracted from an unannotated corpus, so less POS information is available. The only additional resource required is a morpho-syntactic dictionary.
Model 4: Errors caused by unknown words at tagging time punish the system severely. To reduce this problem, during learning some words are artificially marked as unknown in order to learn a more realistic model. The process is very simple: the corpus is divided into a number of folds, and before samples are extracted from each fold, a dictionary is generated from the remaining folds. Thus, the words appearing in one fold but not in the rest are unknown words to the learner.
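The fold-based unknown-word simulation of Model 4 can be sketched as follows; the fold contents here are toy examples.

def unknown_words_per_fold(folds):
    # folds: list of token lists. Returns, per fold, the set of words
    # that never occur in the remaining folds (Model 4's "unknown" words).
    result = []
    for i, fold in enumerate(folds):
        dictionary = {w for j, f in enumerate(folds) if j != i for w in f}
        result.append({w for w in fold if w not in dictionary})
    return result

folds = [["அவள்", "வந்தாள்"], ["அவன்", "வந்தான்"], ["அவள்", "சென்றாள்"]]
print(unknown_words_per_fold(folds))
# [{'வந்தாள்'}, {'அவன்', 'வந்தான்'}, {'சென்றாள்'}]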
Table 5.4 Example of Suitable POS Features for Model 0
Ambiguity classes: a0, a1, a2
May_be's: m0, m1, m2
POS features: p−3, p−2, p−1
POS bigrams: (p−2, p−1), (p−1, a+1), (a+1, a+2)
POS trigrams: (p−2, p−1, a0), (p−2, p−1, a+1), (p−1, a0, a+1), (p−1, a+1, a+2)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')
Table 5.5 Example of Suitable POS Features for Model 1
Ambiguity classes: a0, a1, a2
May_be's: m0, m1, m2
POS features: p−2, p−1, p+1, p+2
POS bigrams: (p−2, p−1), (p−1, p+1), (p+1, p+2)
POS trigrams: (p−2, p−1, a0), (p−2, p−1, p+1), (p−1, a0, p+1), (p−1, p+1, p+2)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')
Table 5.6 Example of Suitable POS Features for Model 2
Ambiguity classes: a0
May_be's: m0
POS features: p−2, p−1
POS bigrams: (p−2, p−1)
POS trigrams: (p−2, p−1, a0)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')
The input to this component is the POS tagged training corpus. The training corpus is given to the SVMTlearn component, which trains the model using the features given in the configuration file. The features are defined in the configuration file based on the Tamil language. The outputs of SVMTlearn are a dictionary file and merged files for unknown and known words (for all models). Each merged file contains all the features of known words and unknown words (Figure 5.6).
[Figure 5.6: SVMTlearn — tagged Tamil corpus → SVMTlearn → known and unknown word models]
C-PARAMETER TUNING by 10-fold CROSS-VALIDATION
on </media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN>
on <MODE 0> <DIRECTION LR> [KNOWN]
C-RANGE = [0.01..1] :: [log] :: #LEVELS = 3 :: SEGMENTATION RATIO = 10
LEVEL = 0 :: C-RANGE = [0.01..1] :: FACTOR = [* 10]
level - 0 :: ITERATION 0 - C = 0.01 - [M0 :: LR]
TEST ACCURACY: 90.8836%
level - 0 :: ITERATION 1 - C = 0.1 - [M0 :: LR]
TEST ACCURACY: 91.7702%
KNOWN [ 94.2402% ] AMBIG.KNOWN [ 87.5492% ] UNKNOWN [ 78.7175% ]
TEST ACCURACY: 91.8881%
KNOWN [ 94.4737% ] AMBIG.KNOWN [ 88.4324% ] UNKNOWN [ 77.9821% ]
TEST ACCURACY: 91.3219%
KNOWN [ 94.0596% ] AMBIG.KNOWN [ 88.0441% ] UNKNOWN [ 77.5928% ]
TEST ACCURACY: 91.0615%
KNOWN [ 93.6037% ] AMBIG.KNOWN [ 86.6795% ] UNKNOWN [ 77.9326% ]
TEST ACCURACY: 92.0852%
KNOWN [ 94.2575% ] AMBIG.KNOWN [ 88.3275% ] UNKNOWN [ 80.5811% ]
TEST ACCURACY: 91.3927%
KNOWN [ 94.1299% ] AMBIG.KNOWN [ 87.4226% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 91.9891%
KNOWN [ 94.2944% ] AMBIG.KNOWN [ 88.0182% ] UNKNOWN [ 79.4589% ]
TEST ACCURACY: 91.3063%
KNOWN [ 93.9605% ] AMBIG.KNOWN [ 87.1258% ] UNKNOWN [ 77.8502% ]
TEST ACCURACY: 91.3654%
KNOWN [ 93.8499% ] AMBIG.KNOWN [ 87.2127% ] UNKNOWN [ 78.2339% ]
TEST ACCURACY: 91.8693%
KNOWN [ 94.1% ] AMBIG.KNOWN [ 87.0546% ] UNKNOWN [ 79.9416% ]
OVERALL ACCURACY [Ck = 0.1 :: Cu = 0.07975] : 91.60497%
KNOWN [ 94.09694% ] AMBIG.KNOWN [ 87.58666% ] UNKNOWN [ 78.55767% ]
MAX ACCURACY -> 91.60497 :: C-value = 0.1 :: depth = 0 :: iter = 2
5.5.3.2 SVMTagger
Given a text corpus (one token per line) and the path to a previously learned SVM model (including the automatically generated dictionary), SVMTagger performs the POS tagging of a sequence of words. Tagging is performed on-line, based on a sliding window which gives a view of the feature context to be considered at every decision.
In any case, there are two important concepts to be considered:
• Example generation
• Feature extraction
Example generation: This step defines what an example is, according to the concept the machine has to learn. For instance, in POS tagging, the machine has to correctly classify words according to their POS. Thus, the POS tag of a word generates a positive example for its class and a negative example for each of the remaining classes; therefore, every sentence may generate a large number of examples.
Feature extraction: The set of features used by the algorithm has to be defined. For instance, POS tags should be guessed according to the preceding and following words. Thus, every example is represented by a set of active features, and these representations are the input to the SVM classifiers. To learn the working of SVMTool, it is necessary to run SVMTlearn (the Perl version). Setting the REMOVE_FILES option (in the configuration file) to 0 keeps the intermediate files; setting it to 1 removes all the intermediate files.
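To make these two steps concrete, the following is a minimal Python sketch of one-vs-rest example generation and sliding-window feature extraction of this kind. The feature names echo the tables above (p−1, prefixes, suffixes), but the function names and the exact feature encoding are illustrative assumptions, not SVMTool internals.

# A minimal sketch (not SVMTool itself) of generating one-vs-rest training
# examples and sliding-window features for POS tagging.

def window_features(words, tags, i):
    """Extract active features for position i from a +/-2 sliding window."""
    feats = []
    # POS context features: only tags already assigned (to the left) are
    # available at tagging time, so only those are used here
    if i >= 1:
        feats.append("p-1=" + tags[i - 1])
    if i >= 2:
        feats.append("p-2=" + tags[i - 2])
        feats.append("p-2,p-1=" + tags[i - 2] + "|" + tags[i - 1])
    # orthographic features of the focus word: prefixes and suffixes
    w = words[i]
    for n in (2, 3, 4):
        feats.append("prefix=" + w[:n])
        feats.append("suffix=" + w[-n:])
    return feats

def one_vs_rest_examples(words, gold_tags, all_tags):
    """Every token yields a positive example for its gold tag and a
    negative example for every other tag (one-vs-rest)."""
    examples = []
    for i, gold in enumerate(gold_tags):
        feats = window_features(words, gold_tags, i)
        for tag in all_tags:
            label = +1 if tag == gold else -1
            examples.append((tag, label, feats))
    return examples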
Taking this context into account, a number of features may be extracted. The feature set depends on how the tagger is going to proceed later, i.e. on the context and information that will be available at tagging time. Generally, all the words are known before tagging, but a POS tag is available only for some words (those already tagged).
In the tagging stage, if the input word is known and ambiguous, the word is tagged (i.e. classified), and the predicted tag feeds forward into the next decisions. This is done in the "sub classify_sample_merged()" subroutine in the SVMTAGGER file. In order to speed up SVM classification, the feature mapping and the SVM weights and biases are merged into a single file. Therefore, when a new example is to be tagged, the tagger just accesses the merged model and, for every active feature, retrieves the associated weight; then, for every possible tag, the bias is also retrieved. Finally, the SVM classification rule (i.e. scalar product + bias) is applied.
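The following small sketch illustrates this classification rule. The merged-model layout assumed here (a mapping from features to per-tag weights, plus per-tag biases) is a simplification for illustration, not the exact .MRG file format.

# A minimal sketch of the SVM classification rule (scalar product + bias).

def classify(active_features, weights, biases):
    """Return the tag with the highest SVM score w.x + b.
    weights: dict mapping feature -> {tag: weight}
    biases:  dict mapping tag -> bias
    """
    scores = dict(biases)  # start every tag's score from its bias
    for feat in active_features:
        for tag, w in weights.get(feat, {}).items():
            scores[tag] += w  # binary features, so just add the weight
    return max(scores, key=scores.get)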
While applying the SVM classification rule for a given POS tag, it is necessary to go to the merged model and retrieve the weights of the active features, and the bias (first line after the header, beginning with "BIASES"), corresponding to the given POS. For instance, suppose this ".MRG" file:
BIASES <ADJ>:0.37059487 <ADV>:-0.19514606 <CNJ>:0.43007979
<COM>:-0.037037037 <CRD>:0.55448766 <CVB>:-0.19911161 <DET>:-1.1815452
<EMP>:-0.86491783 <INT>:0.61775334 <NN>:-0.21980137 <NNC>:1.3656117
<NNP>:0.072242349 <NNPC>:0.7906585 <NNQ>:0.44012828 <ORD>:0.30304924
<PPO>:-0.2182171 <PRI>:0.89491131 <PRID>:-0.15550162 <PRIN>:0.56913633
<PRP>:0.35316978 <QW>:0.039121434 <RDW>:0.84771943 <VAX>:0.041690388
<VBG>:0.23199934 <VF>:0.33486366 <VINT>:0.0048185684 <VNAJ>:0.42063524
<VNAV>:0.18009116
Here the SVM score for <VF> is higher than that for <VNAJ>, so the tag VF is assigned to the word 'wiRaivERRappadum'.
Figure 5.7 Example Input
Figure 5.8 Example Output
Here the format of the backup lexicon file is the same as the dictionary format, so a Perl program can be used for converting a tagged corpus into the dictionary format. Tagging is complex for open tag categories, and the main difficulty in POS tagging is tagging proper nouns. For English, capitalization can be used for tagging proper noun words, but this is not possible in Tamil; therefore, a large backup lexicon with proper nouns is provided to the system. A large dataset of proper nouns (Indian place and person names) was collected and given as input to the morphological generator (using a Perl program); the morphological generator generates nearly twelve inflections for every proper noun. This new dataset was converted into the SVMTool dictionary format and given to SVMTagger as a backup lexicon. Figure 5.9 shows the steps in the implementation of SVMTagger for Tamil. The input to the system is an untagged, cleaned Tamil corpus and the output is a tagged (annotated) corpus. The supporting files are the training corpus, the dictionary file, the merged models for unknown and known words, and the backup lexicon.
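A hedged sketch of such a corpus-to-dictionary conversion is given below. The output layout (word, total count, number of tags, then tag/count pairs) follows the general shape of an SVMTool dictionary entry, but the exact .DICT format of the SVMTool release in use should be checked before relying on this.

# A hedged sketch: convert a tagged corpus (one "word tag" pair per line)
# into a dictionary-style backup lexicon with per-tag counts.

from collections import defaultdict

def corpus_to_dictionary(lines):
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if not line.strip():
            continue  # skip sentence-boundary blank lines
        word, tag = line.split()
        counts[word][tag] += 1
    for word in sorted(counts):
        tags = counts[word]
        total = sum(tags.values())
        fields = [word, str(total), str(len(tags))]
        for tag, n in sorted(tags.items()):
            fields += [tag, str(n)]  # tag followed by its frequency
        yield " ".join(fields)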
[Figure 5.9: SVMTagger for Tamil. An untagged Tamil corpus is tagged using the training data, features, strategies, merged model and backup lexicon, producing a tagged Tamil corpus.]
5.5.3.3 SVMTeval
[Figure: SVMTeval takes the correctly tagged file (gold standard), the SVM tagged output and the model name, and produces an evaluation report.]
SVMTeval report
Brief report
By default, a brief report mainly returning the overall accuracy is produced. It also provides information about the number of tokens processed, and how many were known/unknown and ambiguous/unambiguous according to the model dictionary.
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ------------------------------------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= OVERALL ACCURACY ============================================
HITS TRIALS ACCURACY MFT
* ------------------------------------------------------------------------------
1002 1063 94.2615% 71.2135%
*===============================================================================
Accuracy is returned for four different sets of words. The first set is that of all known tokens, i.e. tokens which were seen during training. The second and third sets contain, respectively, all ambiguous and all unambiguous tokens among these known tokens. Finally, there is the set of unknown tokens, which were not seen during training.
*========================= SVMTeval report =====================================
* model = [E:\SVMTool-1.3\bin\CORPUS]
* testset (gold) = [E:\SVMTool-1.3\bin\files\test.gold]
* testset (predicted) = [E:\SVMTool-1.3\bin\files\test.out]
*===============================================================================
EVALUATING <E:\SVMTool-1.3\bin\files\test1.out> vs. <E:\SVMTool-1.3\bin\files\test.gold> on model <E:\SVMTool-1.3\bin\CORPUS>...
*================= TAGGING SUMMARY =============================================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ------------------------------------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= KNOWN vs UNKNOWN TOKENS =====================================
HITS TRIALS ACCURACY
* ------------------------------------------------------------------------------
*======= known =================================================================
816 854 95.5504%
-------- known unambiguous tokens ---------------------------------------------
604 623 96.9502%
-------- known ambiguous tokens ------------------------------------------------
212 231 91.7749%
*======= unknown ===============================================================
186 209 88.9952%
*===============================================================================
*================= OVERALL ACCURACY ============================================
HITS TRIALS ACCURACY MFT
* ------------------------------------------------------------------------------
1002 1063 94.2615% 71.2135%
*===============================================================================
Level of ambiguity
This view of the results groups together all words having the same degree of
POS–ambiguity.
*========================= SVMTeval report =====================================
* model = [E:\SVMTool-1.3\bin\CORPUS]
* testset (gold) = [E:\SVMTool-1.3\bin\files\test.gold]
* testset (predicted) = [E:\SVMTool-1.3\bin\files\test.out]
*===============================================================================
EVALUATING <E:\SVMTool-1.3\bin\files\test1.out> vs. <E:\SVMTool-1.3\bin\files\test.gold> on model <E:\SVMTool-1.3\bin\CORPUS>...
*================= TAGGING SUMMARY =============================================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ------------------------------------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER LEVEL OF AMBIGUITY =============================
#CLASSES = 5
*===============================================================================
LEVEL HITS TRIALS ACCURACY MFT
* ------------------------------------------------------------------------------
1 605 624 96.9551% 96.6346%
2 204 220 92.7273% 66.8182%
3 7 9 77.7778% 66.6667%
4 2 3 66.6667% 33.3333%
28 184 207 88.8889% 0.0000%
*================= OVERALL ACCURACY ============================================
HITS TRIALS ACCURACY MFT
* ------------------------------------------------------------------------------
1002 1063 94.2615% 71.2135%
*===============================================================================
Kind of ambiguity
<ADJ>_<ADV>_<CNJ>_<COM>_<CRD>_<CVB>_<DET>_<ECH>_<INT>_<NN>_<NNC>_<NNP>_<NNPC>_<NNQ>_<ORD>_<PPO>_<PRID>_<PRIN>_<PRP>_<QTF>_<QW>_<RDW>_<VAX>_<VBG>_<VF>_<VINT>_<VNAJ>_<VNAV> 184 207 88.8889% 0.0000%
<PRIN> 2 2 100.0000% 100.0000%
<PRP> 33 33 100.0000% 100.0000%
<QM> 4 4 100.0000% 100.0000%
<QTF> 5 5 100.0000% 100.0000%
<QW> 4 4 100.0000% 100.0000%
<RDW> 1 1 100.0000% 100.0000%
<VAX> 7 7 100.0000% 100.0000%
<VAX>_<VF> 8 8 100.0000% 100.0000%
<VBG> 11 11 100.0000% 100.0000%
<VBG>_<VF> 2 2 100.0000% 100.0000%
<VF> 39 40 97.5000% 97.5000%
<VF>_<VNAJ> 12 12 100.0000% 91.6667%
<VINT> 15 16 93.7500% 93.7500%
<VNAJ> 20 20 100.0000% 100.0000%
<VNAV> 17 18 94.4444% 94.4444%
*================= OVERALL ACCURACY ============================================
HITS TRIALS ACCURACY MFT
* ------------------------------------------------------------------------------
1002 1063 94.2615% 71.2135%
*===============================================================================
Class
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ------------------------------------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER PART-OF-SPEECH =================================
POS HITS TRIALS ACCURACY MFT
* ------------------------------------------------------------------------------
<ADJ> 30 31 96.7742% 90.3226%
<ADV> 47 48 97.9167% 70.8333%
<CNJ> 21 21 100.0000% 95.2381%
<COM> 17 17 100.0000% 100.0000%
<COMM> 49 49 100.0000% 100.0000%
<CRD> 26 26 100.0000% 84.6154%
<CVB> 7 8 87.5000% 75.0000%
<DET> 36 36 100.0000% 100.0000%
<DOT> 77 77 100.0000% 100.0000%
<EMP> 1 1 100.0000% 100.0000%
<INT> 6 7 85.7143% 85.7143%
<NN> 243 259 93.8224% 57.9151%
<NNC> 145 162 89.5062% 46.2963%
<NNP> 43 44 97.7273% 86.3636%
<NNPC> 0 16 0.0000% 0.0000%
<NNQ> 4 4 100.0000% 100.0000%
<ORD> 2 2 100.0000% 100.0000%
<PPO> 9 9 100.0000% 100.0000%
<PRID> 2 3 66.6667% 66.6667%
<PRIN> 2 2 100.0000% 100.0000%
<PRP> 34 34 100.0000% 97.0588%
<QM> 4 4 100.0000% 100.0000%
<QTF> 5 5 100.0000% 100.0000%
<QW> 6 6 100.0000% 66.6667%
<RDW> 1 1 100.0000% 100.0000%
<VAX> 18 18 100.0000% 77.7778%
<VBG> 20 22 90.9091% 54.5455%
<VF> 68 68 100.0000% 66.1765%
<VINT> 16 18 88.8889% 83.3333%
<VNAJ> 41 42 97.6190% 69.0476%
<VNAV> 22 23 95.6522% 73.9130%
*================= OVERALL ACCURACY ============================================
HITS TRIALS ACCURACY MFT
* ------------------------------------------------------------------------------
1002 1063 94.2615% 71.2135%
*===============================================================================
Apart from SVMTool, three other taggers, namely TnT [39], MBT [60] and WEKA [168], were trained with the same corpus. The accuracy results of SVMTool are compared with those of the above tools on the same testing corpus. A brief description of the above-mentioned taggers follows.
WEKA is a collection of machine learning algorithms for solving real-world data mining problems; the J48 classifier was used for the implementation of Tamil POS tagging. All three tools were trained using the same corpus as used for SVMTool, and the same data format was followed in all cases [168].
The experiments were conducted with our tagged corpus, which was divided into a training set and a test set. SVMTool obtained 94.6% overall accuracy, which is much higher than that of the other taggers. POS tagging using MBT gave a very low accuracy (65.65%) for unknown words, since the algorithm is based on direct reuse of stored experiences. Ambiguous words were handled poorly by TnT, whereas WEKA gave a high accuracy of 90.11% for ambiguous words (Table 5.7). Though SVMTool gave very high accuracy in all cases, its training time was significantly higher than that of the other tools. The unknown word accuracy of SVMTool is 86.25%, and the accuracy goes down for some specific tags. The accuracy results of SVMTool compared with the various tools on the same corpus are given in Table 5.7.
A detailed error analysis was conducted to identify mistagged tokens. About 1200 untagged sentences (10k words) were taken for testing the system. For the error analysis, the 8 tags on which errors occurred most frequently were considered; the tags with their trials and errors are shown in Table 5.8. For instance, the error counts show that the tagger failed to identify the CRD tag in 30 occurrences.
Table 5.9 shows the confusion matrix for the 8 POS tags. This matrix shows the performance of the tagger.
Table 5.8 Trials and Errors
5.8 SUMMARY
This chapter gave the details of the development of the POS tagger and the tagged corpora. Part-of-speech tagging plays an important role in various speech and language processing applications, and many statistical tools are currently available for it. The SVMTool has already been successfully applied to English and Spanish POS tagging, exhibiting state-of-the-art performance (97.16% and 96.89%, respectively); in both cases the results clearly outperform the HMM-based TnT part-of-speech tagger. For Tamil, an accuracy of 94.6% has been obtained. Any language can be trained easily using the existing statistical tagger tools, so this POS tagging work can be extended to other languages. The main obstacle to POS tagging for Indian languages is the lack of an annotated (tagged) corpus; here, a POS annotated corpus of 45k sentences (5 lakh words) was developed for training the POS tagger.
CHAPTER 6
MORPHOLOGICAL ANALYZER FOR TAMIL
6.1 GENERAL
The grammar of any language can be broadly divided into morphology and syntax. The term morphology was coined by August Schleicher in 1859. Morphology deals with words and their construction; syntax deals with how to put words together in some order to make meaningful sentences. Morphology is the field within linguistics that studies the internal structure of words. While words are generally accepted as being the smallest units of syntax, it is clear that in most languages words can be related to other words by rules. Morphology attempts to formulate rules that model the knowledge of the speakers of those languages. Morphemes are the smallest elements of which words are built. Two broad classes of morphemes are stems and affixes; affixes are morphemes that are added to a base to denote relations between words. Morphemes can either be free (they can stand alone, i.e. they can be words in their own right), e.g. dog, or bound (they must occur as part of a word), e.g. the plural suffix -s on dogs.
6.1.3 Morphological Analyzer
With the above definition, an analyzer of words in a sentence does not have to do much work to identify a word; it simply has to look for the delimiters. Having identified the word, it must determine whether it is a compound word or a simple word. If it is a compound word, it must first break it up into its constituent simple words before proceeding to analyze them. The former task is performed by a sandhi analyzer and the latter by a morphological analyzer, both of which are important parts of a word analyzer.
The detailed linguistic analysis of a word can be useful for NLP. However, most NLP researchers have concentrated on other aspects, like grammatical analysis, semantic interpretation, etc. As a result, NLP systems use rather simple morphological analyzers. A generator does the reverse of an analyzer: given a root and its features (or affixes), a morphological generator generates a word. Similarly, a sandhi generator can take the output of a morphological generator and group simple words into compound words, where possible.
Analysis:
books = book+Noun+Plural (or) book+Verb+Pres+3SG
stopping = stop+Verb+Cont
happiest = happy+Adj+Superlative
went = go+Verb+Past
Generation:
book+Noun+Plural = books
stop+Verb+Cont = stopping
happy+Adj+Superlative = happiest
go+Verb+Past = went
Morphological analyzers and generators are useful in many applications, including:
• Spell checkers
• Search engines
• Information extraction and retrieval
• Machine translation systems
• Grammar checkers
• Content analysis
• Question answering systems
• Automatic sentence analyzers
• Dialogue systems
• Knowledge representation in learning
• Language teaching
• Language based educational exercises
Tamil morphology is very rich. Tamil is an agglutinative language, like the other Dravidian languages. Tamil words are made up of lexical roots followed by one or more affixes. The lexical roots and the affixes are the smallest meaningful units and are called morphemes. Tamil words are therefore made up of morphemes concatenated to one another in a series. The first one in the construction is always a lexical morpheme (lexical root); this may or may not be followed by other functional or grammatical morphemes. For instance, the word புத்தகங்கள் 'puththakangkaL' in Tamil can be split into புத்தகம் 'puththakam', a lexical root referring to a real-world entity, and கள் 'kaL', the plural feature marker (suffix). கள் 'kaL' is a grammatical morpheme that is bound to the lexical root to add plurality to the lexical root. Unlike English, Tamil words can have a long sequence of morphemes.
Tamil nouns can take case suffixes after the plural marker, and they can also take postpositions after that. Tamil words consist of a lexical root to which one or more affixes are attached, and most Tamil affixes are suffixes. Tamil suffixes can be derivational suffixes, which change the part of speech of the word or its meaning, or inflectional suffixes, which mark categories such as person, number, mood, tense, etc. Words can thus be analyzed, as above, by identifying the constituent morphemes and their features.
Tamil is a consistently head-final language. The verb comes at the end of the clause, with the typical word order Subject Object Verb (SOV). Tamil is also a relatively free word-order language: the noun phrase arguments before a final verb can appear in any permutation, yet the sentence conveys the same sense. Tamil has postpositions rather than prepositions. Demonstratives and modifiers precede the noun within the noun phrase, and subordinate clauses precede the verb of the matrix clause.
Tamil is a null subject language: not all Tamil sentences have subjects, verbs and objects. It is possible to construct valid sentences that lack some of these elements, for example a sentence with only a subject and an object and no verb, such as அது என் வீடு 'athu en viitu' ("That [is] my house"). Tamil does not have a copula (a linking verb equivalent to the word "is"); the word "is" is included in the translations only to convey the meaning more easily.
6.2.3 Word Formation Rules (WFR) in Tamil
Any new word created by Word Formation Rules (WFR) must be a member of a major lexical category; the WFR determines the category of the output of the rule. In Tamil, the grammatical category may or may not change after the operation of a WFR. The following is the list of inputs and outputs of the different kinds of WFRs in the derivation of simple words in Tamil [170].
1. Noun → Noun
2. Verb → Noun
[ [ப ]V + ப் ]suf ]N ப ப்
[ [எ ]V + ]suf ] எ த்
3. Adjective → Noun
[ [ெபாிய ]adj + ]suf ]N ெபாிய
4. Noun → Verb
5. Adjective → Verb
6. Verb → Verb
[ [வி ]V + வி ]suf ]V வி வி
7. Noun → Adjective
[ [nErmai ]N + Ana ]suf ]adj 'honest'
8. Verb → Adverb
[ [ ப ]V + த் ]suf ]adv ப த்
Table 6.1 shows the possible combinations for compound word formation.
Examples:
{ [பணி ]N # [ ாி ]V # }V பணி ாி
Table 6.1 Compound Word-forms Formation
Tamil verbs are inflected by means of suffixes. Tamil verb forms can be finite or non-finite: finite verb forms occur in the main clause of the sentence, and non-finite forms occur as the predicate of subordinate or embedded clauses. Morphologically, finite verb forms are inflected for tense, mood, aspect, person, number and gender.
The simple finite verb forms are given in Table 6.2. The first column presents the PNG (Person-Number-Gender) tag and the subsequent columns present the present, past and future tenses respectively. For the word படி 'padi' (study), the various simple finite inflection forms with tense markers and PNG markers are given in Table 6.2.
Modal verbs can be defective, in that they cannot take any further inflectional suffixes, or they can be regular verbs that are inflected for tense and PNG suffixes.
Tamil nouns (and pronouns) are classified into two super-classes, "rational" and "irrational", which together include a total of five classes. Humans and deities are classified as "rational", and all other nouns (animals, objects, abstract nouns) are classified as irrational. The "rational" nouns and pronouns belong to one of three classes: masculine singular, feminine singular, and rational plural. The "irrational" nouns and pronouns belong to one of two classes: irrational singular and irrational plural. The plural form for rational nouns may be used as an honorific, gender-neutral, singular form [132].
Tamil nouns are inflected for eight cases: nominative, accusative, dative, sociative, genitive, instrumental, locative, and ablative. The various noun forms are given in Table 6.3; the table presents the singular and plural forms of the word எலி 'eli' (rat) with the case markers.
A noun form without any inflections is called a noun stem. Nouns in their stem forms are singular.
பேனாக்கள் (pEnAkkaL) = பேனா (pEnA) + கள் (kaL)
The example shown above is one instance of plural inflection. Creating the plural form of a noun is not simply a matter of concatenating 'kaL'. For example, in 'puththakangkaL', the stem (puththakam) is inflected to puththakangkaL ('am' in the stem is replaced by 'ang', followed by 'kaL'). These differences are due to the 'Sandhi' changes that take place when the noun stem is concatenated with the 'kaL' morpheme. Tamil uses case suffixes and postpositions for case marking instead of prepositions. Case markers indicate the relationship between the noun phrases and the verb phrase; they indicate the semantic role of the noun phrases in the sentence. The genitive case tells the relationship between noun phrases, and is expressed by the 'in' morpheme. Case suffixes are concatenated to nouns in their stem form, or after the plural morpheme for a plural noun.
Postpositions are of two kinds: bound and free. Bound postpositions occur with their respective governing case suffixes; in such a case the morphotactics is noun stem + case suffix + postposition. Sometimes the postposition follows the case suffix as a separate word after a blank space. Free postpositions follow noun stems without any case suffixes; they are written as separate words and do not concatenate with the noun. Basically, there are eight cases in Tamil. Verbs can take the form of nouns when followed by nominal suffixes; such nominalized verb forms are an example of derivational morphology in Tamil, and are formed by attaching a nominal suffix to a verb form.
6.2.6 Tamil Morphological Analyzer
Compared with verb morphological analysis, noun morphological analysis is relatively easy. A noun can occur separately or with plural, oblique, case, postposition and clitic suffixes. A corpus was developed with all the morphological feature information, so the machine by itself captures all the morphological rules, including the 'Sandhi' and morphotactic rules.
[Figure: The morphological analyzer takes a POS tagged sentence as input and produces a morphologically annotated sentence.]
The analyzer consists of five modules:
1. POS Tag Refinement
2. Noun/Verb Analyzer
3. Pronoun Analyzer
4. Proper Noun Analyzer
5. Other Word Class Analyzer
The input to the morphological system is a POS tagged sentence. In the first module, the POS tagged sentence is refined according to the simplified POS tagset given in Table 6.4, and the refined POS tagged sentence is split according to the simplified POS tags. The second module morphologically analyzes the noun (<N>) and verb (<V>) forms. The third and fourth modules morphologically analyze pronouns (P) and proper nouns (PN). Other word classes are analyzed in the fifth stage, which considers the POS tag as the morphological feature.
The morphological analyzer identifies the root and the suffixes of a word. Generally, rule based approaches are used for morphological analysis; these are based on a set of rules and a dictionary that contains root words and morphemes. In a rule based approach, a particular word is given as input to the morphological analyzer, and if the corresponding morphemes or root word is missing from the dictionary, the rule based system fails. Moreover, each rule depends on the previous rule, so if one rule fails, it affects the entire set of rules that follows.
Recently, machine learning approaches have come to dominate the Natural Language Processing field. Machine learning is a branch of Artificial Intelligence (AI) concerned with the design of algorithms that learn from examples. Machine learning algorithms can be supervised or unsupervised: in supervised learning, input and corresponding output data are used, while in unsupervised learning only input samples are used. The goal of a machine learning approach is to use the given examples to find generalization and classification rules automatically. All the rules, including complex spelling rules, can be handled by this method. A morphological analyzer based on machine learning approaches does not require any hand-coded morphological rules; it only needs morphologically segmented corpora. H. Poon et al. (2009) [189] reported the first log-linear model for unsupervised morphological segmentation; for Arabic and Hebrew, it outperforms the state-of-the-art systems by a large margin. Sequence labeling is a significant generalization of the supervised classification problem: one assigns a single label to each input element in a sequence. The elements to be assigned are typically parts of speech or syntactic chunk labels [171]. Many tasks in various fields, such as natural language processing and bioinformatics, are formalized as sequence labeling problems. There are two types of sequence labeling approaches [171]:
• Raw labeling
• Joint segmentation and labeling
In raw labeling, each element gets a single tag, whereas in joint segmentation and labeling, whole segments get a single label. In a morphological analyzer, the sequence is usually a word, and a character is an element. As mentioned earlier, the input to the morphological analyzer is a word and the output is a root and inflections. The input word is denoted by 'W', and the root word and inflections are denoted by 'R' and 'I' respectively. In turn, 'I' can be expressed as i1 + i2 + … + in, where 'n' refers to the number of inflections or morphemes. Further, 'W' is converted into a sequence of characters. The morphological analyzer accepts a sequence of characters as input and generates a sequence of characters as output. Let X be the finite set of input characters and Y be the finite set of output characters. If the input string is 'x', it is segmented as x1x2…xn, where each xi ∈ X. Similarly, if 'y' is an output string, it is segmented as y1y2…yn, where each yi ∈ Y and 'n' is the number of segments.
The main objective of the sequence labeling approach is to predict 'y' from the given 'x'. In the training data, the input sequence 'x' is mapped to the output sequence 'y'; the morphological analyzer problem is thereby transformed into a sequence labeling problem. The information about the training data is explained in the following sub-sections. Finally, morphological analysis is redefined as a classification task which is solved by using the sequence labeling methodology.
Data formulation plays the key role in supervised machine learning approaches. The first step in the corpora development for the morphological analyzer is classifying the paradigms for verbs and nouns; the classification of Tamil verbs and nouns is based on tense markers and case markers respectively, and each paradigm inflects with the same set of inflections. The second step is to collect the list of root words for all the paradigms.
A paradigm provides information about all the possible word forms of a root word in a particular word class. The Tamil noun and verb paradigm classification is done based on case and tense markers respectively, and the number of paradigms for each word class (noun/verb) is defined. For the sake of computational data modeling, Tamil verbs were classified into 32 paradigms [13], and nouns were classified into 25 paradigms to resolve the challenges in noun morphological analysis. Based on the paradigm, the root words are grouped into their paradigms. Table 6.5 shows the number of paradigms and inflections of the verbs and nouns which are handled in the system; the Total column gives the total number of inflections handled by this analyzer system. The noun and verb paradigm lists are shown in Tables 6.6 and 6.7.
Table 6.5 Number of Paradigms and Inflections

Word forms   Paradigms   Inflections   Auxiliaries   Postpositions   Total
Verb         32          164           67            --              10988
Noun         25          30            --            290             320
6.4.2.2 Word-forms
The morphological system for nouns handles more than three hundred word forms, including postpositions. Traditional grammarians group the various suffixes into 8 cases corresponding to the cases used in Sanskrit: nominative, accusative, dative, sociative, genitive, instrumental, locative, and ablative. The sample word forms used in this thesis are shown in Table 6.8; the remaining word forms are included in Appendix B.
Table 6.7 Verb Paradigms
Verb word forms
Verbs are also morphologically deficient, i.e. some verbs do not take all the suffixes meant for verbs. The verb is an obligatory part of a sentence except in copula sentences. Verbs can be classified into different types based on morphological, syntactic and semantic characteristics. Based on the tense suffixes, verbs can be classified into weak verbs, strong verbs and medium verbs. Based on form and function, verbs can be classified into finite verbs (e.g. va-ndt-aan 'come_PAST_he') and non-finite verbs (e.g. va-ndt-a 'come_PAST_RP' and va-ndt-u 'come_PAST_VPAR'). Depending on whether the non-finite form occurs before a noun or a verb, it can be classified as an adjectival or relative participle form (e.g. vandta paiyan 'the boy who came') or as an adverbial or verbal participle form (e.g. vandtu poonaan 'having come he went'). The morphological system for verbs handles more than ten thousand word forms, including auxiliaries and clitics. The sample verb forms used in this research are shown in Table 6.9; the remaining word forms are given in Appendix B.
6.4.2.3 Morphemes
Noun morphemes
The morphological analyzer system for nouns handles 92 morphemes in the morpho-lexical tagging (Phase II). The morphemes used in this thesis are given in Table 6.10.
Table 6.10 Noun Morphemes
ஐ ேமேல ெவளிேய
ஆக ேமல் ைவத்
ஆன ற்கு அ யில்
க் அண்ைட அப்பால்
ச் அ ேக அ கில்
த் உக்கு இைடயில்
ப் உள்ேள இ ந்
அ எதிேர எதிாில்
ஆல் ஒட் கிழக்கு
இன் கிட்ட ந வில்
இல் க்காக வைரயில்
உள் பதில் அப் றம்
ஓ பற்றி அல்லாமல்
கண் பிறகு இல்லாமல்
கள் தல் குறித்
ப லம் கு க்ேக
ேபால ஆட்டம் பார்த்
வைர ெகாண் பின்னால்
விட சுற்றி ன்னால்
அதன் தாண் ெவளியில்
இடம் ெதற்கு எதிர்க்கு
இனம் ேநாக்கி எதிர்க்ேக
உடன் பக்கம் தவிர்த்
உைடய பதிலாக வைரக்கும்
ஒழிய பிந்தி அ த்தாற்
கீேழ பின்ேன ேபால்
கீழ் பின் அ கி ந்
க்கு மாதிாி எதிர்த்தார்
தவிர ேமற்கு ேபால்
பின் வடக்கு எதிர்த்தாற்
ேபால் வழியாக ேபால்
ன் விட்
Verb morphemes
The morphological system for verbs handles 170 morphemes in the morpho-lexical tagging (Phase II). The morphemes used in this analyzer are shown in Table 6.11.
அ ய
ஆ வா யல்
இ ேவ ரல்
உ ைவ லல்
ஏ வ் ளல்
ஓ அ வன்
க அல் வர்
ட ஆத் வள்
ண ஆன் ஆமல்
ன ஆைம இயல்
ய ஆம் காத்
ர ஆய் காைம
ற ஆர் கிற்
ல ஆல் கிழி
ள ஆள் கும்
ைக இன் கூ
க் இ ெகா
சா ஈர் ெகாள்
ச் உம் சாகு
உள் ெசய்
ட் ஏன் ணாத்
ஓன் ணாைம
ஓம் ம்
த் கல் தான்
கள் திாி
ன் கிட தீர்
ேபா க்க ெதாைல
ப் டல் த்
யி ணல் த்த்
தல் ந்
ற் னர் ந்த்
னல் னாத்
ப னாைம
Ambiguous Morphemes of Noun and Verb
6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer
The data creation for the first phase of the Noun/Verb morphological analyzer system is done in the following stages:
• Preprocessing
• Mapping
• Bootstrapping
Preprocessing
Romanization
The input word forms are converted to Romanized forms using the Unicode to Roman mapping. Romanization is done for easy computational processing: in Tamil, a syllable (compound character) exists as a single character, in which one cannot separate the vowel and the consonant. For this separation, Tamil graphemes are converted into Roman forms. The Tamil-Roman mapping is given in Appendix A.
Segmentation
After Romanization, each and every word in the corpora is segmented based on the Tamil graphemes, and additionally each syllable in the corresponding word is further segmented into consonants and vowels. The segmented syllables are postfixed with "-C" for consonants and "-V" for vowels; this is named the C-V (Consonant-Vowel) representation. The C-V representation is given only for the input data. In the output data, morpheme boundaries are indicated by the "*" symbol.
Alignment
The segmented words are aligned vertically as segments using the gap between them.
Figure 6.4 Preprocessing Steps
Mapping and Bootstrapping
The aligned input segments are then mapped to output segments in the mapping stage. Bootstrapping is done to increase the training data size. A sample data format for the word படித்தான் 'padiththAn' is given in Table 6.13. The first column represents the input data and the second the output data; "*" indicates the morpheme boundaries.
I/P     O/P
p-C     p
a-V     a
d-C     d
i-V     i*
th-C    th
th-C    th*
A-V     A
n-C     n*
6.4.2.5 Issues in Data Creation
Mismatching is the main problem that occurs in mapping the input characters to the output characters. Mismatching occurs in two cases: either the input units are more numerous or less numerous than the output units. The mismatching problem is solved by inserting a null symbol "$" or by combining two units based on the morpho-phonemic rules, after which the input segments are mapped to the output segments. After mapping, a machine learning tool is used for training on the data.
In case 1, the input sequence has more segments (14) than the output sequence (13). The Tamil verb "padikkayiyalum" has 14 segments in the input sequence but only 13 segments in the output. The first occurrence of "y" (the 8th segment) in the input sequence becomes null due to a morpho-phonemic rule, so there is no output segment to map this "y" segment to. For this reason, in training, the input segment "y" is mapped to the "$" symbol ("$" indicates null) in the output sequence. The input and output segments are then matched equally.
Case 1:
Input Sequence:
p-C | a-V | d-C | i-V | k-C | k-C | a-V | y-C | i-V | y-C | a-V | l-C | u-V | m-C (14 segments)
Mismatched Output Sequence:
p | a | d | i* | k | k | a* | i | y | a | l* | u | m* (13 segments)
Corrected Output Sequence:
p | a | d | i* | k | k | a* | $ | i | y | a | l* | u | m* (14 segments)
In case 2, the input sequence has fewer segments than the output sequence. The Tamil verb OdinAn has 6 segments in the input sequence, but the output has 7 segments. Using a morpho-phonemic rule, the segment "d-C" (the 2nd segment) in the input sequence is mapped to two segments, "d" and "u*" (the 2nd and 3rd segments), in the output sequence. For this reason, in training, "d-C" is mapped to "du*". The input and output segments are thereby equalized, and the problem of sequence mismatching is solved.
188
Case 2:
Input Sequence:
O | d-C | i-V | n-C | A-V | n (6 segments)
Mismatched Output Sequence:
O | d | u* | i | n* | A | n* (7 segments)
Corrected Output Sequence:
O | du* | i | n* | A | n* (6 segments)
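The following minimal Python sketch illustrates the preprocessing and alignment just described: C-V marking of romanized segments and insertion of the "$" null symbol where a morpho-phonemic rule deletes a segment. The vowel inventory and helper names are simplified assumptions for illustration.

# A minimal sketch of the C-V representation and null-symbol alignment.

VOWELS = set("aeiouAEIOU")

def cv_mark(segments):
    """Suffix -C/-V to romanized segments, e.g. 'p' -> 'p-C', 'a' -> 'a-V'."""
    return [s + ("-V" if s in VOWELS else "-C") for s in segments]

def insert_null(output_segments, pos):
    """Insert the '$' null symbol at the position where a morpho-phonemic
    rule deleted a segment, so input and output align one-to-one (case 1)."""
    return output_segments[:pos] + ["$"] + output_segments[pos:]

# e.g. padiththAn: 8 input segments aligned with 8 output labels,
# where '*' marks morpheme boundaries (root padi + past thth + PNG An)
inp = cv_mark(["p", "a", "d", "i", "th", "th", "A", "n"])
out = ["p", "a", "d", "i*", "th", "th*", "A", "n*"]
for i, o in zip(inp, out):
    print(i, o)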
Support Vector Machine (SVM) approaches have been around since the mid 1990s, initially as a binary classification technique, with later extensions to regression and multi-class classification. Here, the morphological analyzer problem is converted into a classification problem, and this classification can be done through supervised machine learning algorithms [12]. The Support Vector Machine is a machine learning algorithm for binary classification which has been successfully applied to a number of practical problems. The training data consist of instance-label pairs (xi, yi), where each instance xi is a vector in R^N and yi ∈ {−1, +1} is the class label.
6.4.3.2 SVMTool
Different models are learned for the different strategies. Given a training set of annotated examples, SVMTlearn is responsible for the training of a set of SVM classifiers. To do this, it makes use of SVM-light, an implementation of Vapnik's SVMs in C developed by Thorsten Joachims (2002). Given a text corpus (one token per line) and the path to a previously learned SVM model (including the automatically generated dictionary), SVMTagger performs the tagging of a sequence of characters. Finally, given a correctly annotated corpus and the corresponding SVMTool predicted annotation, the SVMTeval component displays the tagging results; it evaluates the performance in terms of accuracy.
Existing Tamil morphological analyzers are explained in Chapter 2. Jan Hajic et al. (1998) [190] developed morphological tagging for inflectional languages using an exponential probabilistic model based on automatically selected features; the parameters of the model are computed using simple estimates.
Using the machine learning approach, the morphological analyzer for Tamil is developed. Separate engines are developed for nouns and verbs. The noun morphological analyzer can handle inflected noun forms and postpositionally inflected nouns; the verb analyzer handles all the verb forms, such as finite, non-finite and auxiliary forms. The morphological analyzer is redefined as a classification task, and the classification problem is solved by using the SVM. In this machine learning approach, two training models are developed for morphological analysis: Model-I (the segmentation model) and Model-II (the morpho-syntactic tagging model). The first model (Model-I) is trained using the sequence of input characters and their corresponding output labels; this trained Model-I is used for predicting the morpheme boundaries. The second model (Model-II) is trained using sequences of morphemes and their grammatical categories; this trained Model-II is used for assigning a grammatical class to each morpheme. Figure 6.5 illustrates the three phases involved in the process of morphological analysis:
• Pre-processing
• Morpheme segmentation
• Morpho-syntactic tagging
Preprocessing
The word that has to be morphologically analyzed is given as input to the pre-processing phase. The word first undergoes the Romanization process. The Romanized word is segmented based on Tamil graphemes; a Tamil grapheme is a vowel, a consonant or a syllable. The syllables are broken into vowel and consonant, and these consonants and vowels are suffixed with -C and -V respectively.
[Figure 6.5: The three phases of the morphological analyzer. The input word படித்தான் is preprocessed, segmented into morphemes by the trained Model-I (morpheme segmentation), tagged by the trained Model-II (morpho-syntactic tagging), and postprocessed to give the output படி <ROOT> த்த் <PAST> ஆன் <3SM>.]
6.5 MORPHOLOGICAL ANALYZER FOR PRONOUN USING PATTERN MATCHING
The morphological analyzer for Tamil pronouns is developed using a pattern matching approach. Personal pronouns play an important role in the Tamil language, and therefore they need special attention during generation as well as analysis. Figure 6.6 shows the implementation of the pronoun morphological analyzer. Morphological processing of Tamil pronoun word forms is handled independently by the pronoun morphological analysis system. Morphological analysis of a pronoun is based on pattern matching and the pronoun root word. The structure of the pronoun word form is used for creating a pattern file. The pronoun word structure is divided into four stages:
i. PRN – ROOT
ii. CASE – CLITIC
iii. PPO – CL
iv. CLITIC
[Figure 6.6: Structure of the pronoun word form, a pronoun root optionally followed by case, postposition and clitic elements.]
Example for the Structure of Pronoun
அவனுக்கருகில் (avan + ukku + arukil)
அவன் (avan), அவனுக்காக (avan + ukku + Aka)
அவனா (avan + A)
The pronoun word class is a closed word class, so it is easy to collect all the pronoun root words. In the pronoun morphological system, the word form is processed from left to right. Morphological analysis systems generally handle the word from right to left, but here the limited vocabulary of pronouns makes it possible to formulate a system that works from left to right. The pronoun word form is Romanized using the Unicode to Roman mapping, and this Romanized word is first compared with the pronoun stem file, which consists of all the stems and roots of pronoun words. If the Roman form matches an entry in the pronoun stem file, then the matched part of the Roman form is replaced with the value of the corresponding entry. After this, the remaining part is compared with three different suffix files, and the matched parts are replaced with the corresponding values of the suffix elements. Finally, the root word is converted back into Unicode form. Figure 6.7 shows the implementation steps of the pronoun morphological analyzer.
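A minimal sketch of this left-to-right matching is given below. The stem and suffix tables are tiny illustrative stand-ins for the real pronoun stem file and the three suffix files, and the entries are hypothetical.

# A minimal sketch of left-to-right pronoun analysis by pattern matching.

STEMS = {"avan": "avan <PRN_ROOT>"}        # pronoun stem/root file (toy)
SUFFIX_FILES = [                           # matched left to right, in order
    {"ukku": "<DAT>", "Al": "<INS>"},      # case (+clitic) file
    {"Aka": "<BEN>"},                      # postposition file
    {"um": "<CL>", "A": "<CL>"},           # clitic file
]

def analyze_pronoun(word):
    for stem, info in STEMS.items():
        if word.startswith(stem):          # match the stem first
            out = [info]
            rest = word[len(stem):]
            for table in SUFFIX_FILES:     # then the three suffix files
                for suf, tag in table.items():
                    if rest.startswith(suf):
                        out.append(suf + " " + tag)
                        rest = rest[len(suf):]
                        break
            return out
    return None

print(analyze_pronoun("avanukku"))  # -> ['avan <PRN_ROOT>', 'ukku <DAT>']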
6.6 MORPHOLOGICAL ANALYZER FOR PROPER NOUN USING SUFFIXES
The morphological analyzer for proper nouns is developed using suffixes. Figure 6.8 shows the implementation of the proper noun morphological analyzer. The proper noun word form is taken as the input for proper noun morphological analysis; it is taken from the minimized POS tagged sentence and is identified by the POS tag <PN>. Initially, the proper noun word form is converted into Roman form for easy computation. This Roman conversion is done using a simple key pair mapping of Roman and Tamil characters: the mapping program recognizes each Tamil character unit and replaces it with the corresponding Roman characters.
This Roman form is given to the proper noun analyzer system. The system compares the word with the predefined suffixes. First, it identifies the suffix, which is replaced with the corresponding information from the proper noun suffix data set. The suffix data set is created using various proper noun inflections and their end characters. For example, from Table 6.14, the word 'sithamparam' (சிதம்பரம்) ends with 'm' (ம்) and the word 'pANdisEri' (பாண்டிச்சேரி) ends with 'ri' (ரி); the possible inflections of both words are given in the table. The morphological changes differ for proper nouns based on their end characters, so the end characters are used in creating the rules. From the various inflections of the word form, the suffix is identified and the remaining part is the stem. The suffix is mapped to the original morphological information: the algorithm replaces the encountered suffix with the morphological information in the suffix table.
[Figure 6.8: Proper noun morphological analyzer. The input word is matched against the suffix table; if a suffix is found, the word is split into stem and suffix, the stem is converted into a lemma and the suffix into morpho-lexical information (MLI), giving the morphological output Lemma + MLI.]
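The following hedged sketch illustrates this suffix-table lookup. The table entries, including the lemma endings they restore, are illustrative stand-ins for the actual suffix data set of Table 6.14.

# A hedged sketch of proper noun analysis via longest-suffix matching.

SUFFIX_TABLE = {
    # suffix -> (ending that restores the lemma, morpho-lexical information)
    "aththil": ("am", "<LOC>"),
    "aththai": ("am", "<ACC>"),
    "ikku": ("i", "<DAT>"),
}

def analyze_proper_noun(word):
    # try the longest matching suffix first
    for suf in sorted(SUFFIX_TABLE, key=len, reverse=True):
        if word.endswith(suf):
            ending, info = SUFFIX_TABLE[suf]
            lemma = word[: -len(suf)] + ending  # undo the sandhi change
            return lemma, info
    return word, "<N_ROOT>"                     # no suffix: word is the stem

print(analyze_proper_noun("sithamparaththil"))  # -> ('sithamparam', '<LOC>')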
The efficiency of the system is examined in this sub-section. Various machine learning tools are also compared using the same morphologically annotated data. The system accuracy is estimated at various levels, which are briefly discussed below.
Training Data vs Accuracy
In Figure 6.9, the X axis represents the training data size and the Y axis represents accuracy. From the graph, it is found that the morphological analyzer accuracy increases with increase in the volume of training data. Accuracies are calculated for training corpus sizes from 10k to 300k.
[Figure 6.9: Training data size (10k, 25k, 40k, 50k, 75k, 100k, 125k, 150k, 175k, 200k, 300k) vs accuracy.]
In the sequence based morphological system, output is obtained in two different stages using the trained models. The first stage takes a sequence of characters as input and gives untagged morphemes as output using the trained Model-I; this stage represents morpheme identification. In the second stage, these morphemes are tagged using the trained Model-II. Accuracies of the untagged and tagged morphemes for verbs and nouns are shown in Table 6.15.
Table 6.15 Tagged Vs Untagged Accuracies
Word level and character level accuracies
Accuracies are compared at the word level as well as the character level. Two thousand three hundred verb tokens and one thousand seven hundred and fifty noun tokens were taken randomly from the POS tagged corpus for testing the system. Table 6.16 shows the number of words and characters in the whole testing data set, as well as the efficiency of prediction.
Category              VERB                    NOUN
                      Words     Characters    Words     Characters
Testing data          2300      20627         1750      10534
Predicted correctly   2071      19089         1639      9645
Efficiency            90.4%     92.5%         91.5%     93.6%
The word level and character level efficiencies (in percentage) are calculated by the following formulae.
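The formulae themselves did not survive the page layout here; consistent with the figures in Table 6.16, they presumably take the form:

Word level efficiency (%) = (number of words predicted correctly / total number of words in the test data) × 100
Character level efficiency (%) = (number of characters predicted correctly / total number of characters in the test data) × 100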
The POS tagged sentences are given to the morphological analyzer tool; therefore, the accuracy of POS tagging affects the performance of the analyzer. Here, 1200 POS tagged sentences consisting of 8358 words were taken for testing the morphological system. Table 6.17 shows the sentence level accuracy of the morphological analyzer system. For the other categories of simplified POS tags, the part-of-speech information is considered as the morphological information.
Table 6.17 Sentence Level Accuracies

Categories         Input (word count)   Correct Output   Percentage
N                  2821                 2642             93.65
V                  2794                 2543             91.00
P (Pronoun)        562                  543              96.61
PN                 279                  258              92.47
O (Others)         1902                 1817             95.53
Overall Accuracy                                         93.86
6.9 SUMMARY
This chapter explained the development of a morphological analyzer for the Tamil language using a machine learning approach. Capturing the agglutinative structure of Tamil words with an automatic system is a challenging job. Generally, rule based approaches are used for building morphological analyzers; here, the Tamil morphological analyzer for nouns and verbs is developed using a new, state of the art machine learning approach. The morphological analyzer problem is redefined as a classification problem. This approach is based on sequence labeling and training by kernel methods, which capture the non-linear relationships of the morphological features from the training data samples in a better and simpler way. An SVM based tool is used for training the system with 6 lakh morphologically tagged verbs and nouns. The same methodology can be implemented for other Dravidian languages like Malayalam, Telugu and Kannada. Tamil pronouns and proper nouns are handled using separate analyzer systems. Other word classes need not be analyzed further for morphological features, so their POS tag information is considered as the morphological information.
CHAPTER 7
FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL
This chapter describes the outline of the statistical machine translation system and its components. It also explains the integration of linguistic knowledge into factored translation models and the development of factored corpora.
7.1 STATISTICAL MACHINE TRANSLATION
The noisy channel model is used in speech recognition and machine translation. If one needs to translate a sentence f in the source language F to a sentence e in the target language E, the noisy channel model describes the situation as follows: the sentence f to be translated was initially conceived in language E as some sentence e, and during the process the sentence e was corrupted by the channel into the sentence f. Now an assumption is made that each sentence in E is a translation of the sentence f with some probability, and the sentence chosen as the translation, ê, is the one that has the highest probability, as given in equation (7.1). In mathematical terms [172],

ê = argmax_e P(e | f)                                    (7.1)

By Bayes' rule,

P(e | f) = P(e) P(f | e) / P(f)                          (7.2)

so that

ê = argmax_e [ P(e) P(f | e) / P(f) ]                    (7.3)
Here, P(e | f) is split into P(e) and P(f | e) according to Bayes' rule. This is done because practical translation models give high probabilities to P(f | e) or P(e | f) whenever the words in f are generally translations of the words in e. For instance, when the English sentence "He will come tomorrow" is translated to Tamil, both P(e | f) and P(f | e) give equal probabilities to the following sentences:

அவன் நாளை வருவான். (avan wALai varuvAn)
அவன் வருவான் நாளை. (avan varuvAn wALai)

This problem can be avoided when equation (7.4) is used instead of equation (7.1). Since P(f) does not depend on e, maximizing equation (7.3) is equivalent to

ê = argmax_e P(e) P(f | e)                               (7.4)

The second sentence will then be ruled out, as the first sentence has a much higher value of P(e), and the first sentence will be taken into consideration during the translation process.

Figure 7.1 The Noisy Channel Model of Machine Translation

7.2 COMPONENTS OF SMT

A statistical machine translation system has the following main components:
1. Translation model
2. Language model
3. Decoder
The translation system must be capable of producing words that retain the original meaning and of arranging those words in a sequence that forms fluent sentences in the target language. The role of the translation model is to find P(f | e), the probability of the source sentence f given the translated sentence e. Note that it is P(f | e) that is computed by the translation model, not P(e | f). The training corpus for the translation model is a sentence-aligned parallel corpus of the languages F and E.
A word alignment between sentences tells us exactly how each word in sentence f is translated in e. The problem is obtaining the word alignment probabilities given a training corpus that is only sentence aligned. This problem is solved by using the Expectation-Maximization (EM) algorithm.
The key intuition behind Expectation-Maximization is this: if the number of times a word aligns with another word in the corpus is known, then it is easy to calculate the word translation probabilities; conversely, if the word translation probabilities are known, then it is possible to find the probabilities of the various alignments. One can therefore start with uniform word translation probabilities, calculate alignment probabilities, and then use these alignment probabilities to obtain better translation probabilities. This iterative procedure, which is called the Expectation-Maximization algorithm, works because words that are actually translations of each other co-occur in the sentence-aligned corpus.
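As an illustration of this iterative procedure, the following is a compact Python sketch in the style of IBM Model 1, the simplest of the IBM models (the actual alignments in the systems described here are trained with GIZA++). The toy corpus and variable names are illustrative.

# A compact sketch of EM for word translation probabilities (IBM Model 1 style).

from collections import defaultdict

def ibm1_em(parallel_corpus, iterations=10):
    # uniform initialization of t(f|e)
    f_vocab = {f for fs, _ in parallel_corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts of (f, e) links
        total = defaultdict(float)   # expected counts of e
        for fs, es in parallel_corpus:
            for f in fs:
                # E-step: distribute one count for f over all e in proportion
                # to the current translation probabilities
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate t(f|e) from the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

corpus = [(["viitu"], ["house"]), (["en", "viitu"], ["my", "house"])]
t = ibm1_em(corpus)
print(t[("viitu", "house")])  # grows towards 1.0: viitu co-occurs with house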
Although this limitation does not account for many real-life alignment relationships, in principle the IBM models can address it by estimating the probability of generating the source empty word, which can translate into non-empty target words. However, many current statistical machine translation systems do not use the IBM model parameters in their training methods, but only the most probable alignment (obtained using a Viterbi search) given the estimated IBM models. Therefore, in order to obtain many-to-many word alignments, alignments from source-to-target and from target-to-source are usually performed, and symmetrization strategies have to be applied.
These alignments are then used to extract phrases or to induce syntactic rules, and the word alignment problem is still actively discussed in the community. Because of the importance of GIZA++, there are now several distributed implementations of GIZA++ available online.
In the phrase-based translation model [175], the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. The sequences of words are called blocks or phrases, but they are typically not linguistic phrases; rather, they are phrases found using statistical methods from corpora.
The job of the translation model, given a Tamil sentence T and an English sentence E, is to assign a probability that T generates E. While one could estimate these probabilities by considering how each individual word is translated, modern statistical machine translation is based on the intuition that a better way to compute these probabilities is by considering the behavior of phrases. The intuition of phrase-based statistical machine translation is to use phrases, i.e. sequences of words, as well as single words, as the fundamental units of translation.
The generative story of phrase-based translation has three steps. First, the source words are grouped into phrases E1, E2, …, El. Second, each Ei is translated into Ti. Finally, the phrases are reordered. Each translation is modelled by the phrase translation probability of generating the source phrase Ti from the target phrase Ei. The reordering of the source phrases is handled by a distortion probability d. The distortion probability in phrase-based translation means the probability of two consecutive Tamil phrases being separated in English by a span of English words of a particular length. The distortion is parameterized by d(ai − bi−1), where ai is the start position of the source English phrase generated by the i-th Tamil phrase and bi−1 is the end position of the source English phrase generated by the (i−1)-th Tamil phrase. One can use a very simple distortion probability which penalizes large distortions by giving lower and lower probabilities to larger distortions. The final translation model for phrase-based machine translation is based on equation (7.5).
Phrase-based models work successfully only if the source and the target language have almost the same word order; differences in word order are handled in phrase-based models by calculating distortion probabilities, and reordering is done by the phrase-based models themselves. It has been shown that restricting the phrases to linguistic phrases decreases the quality of translation. By the turn of the century it became clear that in many cases specifying translation models at the level of words was inappropriate, as much local context seemed to be lost during translation. Novel approaches needed to describe their models over longer units, typically sequences of consecutive words, or phrases.
1. The sentence is first split into phrases - arbitrary contiguous sequences of words.
2. Each phrase is translated.
3. The translated phrases are permuted into their final order. The permutation problem and its solutions are identical to those in word-based translation.

Consider the following particular set of phrases for our example sentences:
Position   1           2       3         4
Tamil      Netru       naAn    avaLai    pArththEn
English    yesterday   i       her       saw

Since the phrases do not all follow each other directly in order, the distortions are not all 1, and the probability P(E | T) can be computed as:

P(E | T) = P(yesterday | Netru) × d(1)
         × P(i | naAn) × d(1)
         × P(her | avaLai) × d(2)
         × P(saw | pArththEn) × d(2)
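The following toy Python sketch mirrors this computation; the phrase translation probabilities and the distortion function are made-up placeholders for illustration.

# A toy sketch of the phrase-based product of translation and distortion terms.

def distortion(step, alpha=0.5):
    """A simple distortion probability that penalizes larger jumps."""
    return alpha ** abs(step)

phrase_pairs = [
    # (English phrase, Tamil phrase, phi(E|T) placeholder, jump size)
    ("yesterday", "Netru", 0.7, 1),
    ("i", "naAn", 0.9, 1),
    ("her", "avaLai", 0.8, 2),
    ("saw", "pArththEn", 0.6, 2),
]

p = 1.0
for e, t, phi, jump in phrase_pairs:
    p *= phi * distortion(jump)
print(p)  # P(E|T) for this segmentation and ordering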
In general, the language model is used to estimate the fluency of the translated sentence. It plays an important role in the statistical approach, as it chooses the most fluent sentence, the one with the highest value of P(e), among all the possible translations generated by the translation model P(f | e). A language model can be defined as a model which estimates and assigns a probability P(e) to a sentence e: a high value is assigned to the most fluent sentence and a low value to the least fluent one. The language model can be estimated from a monolingual corpus of the target language of the translation process.
எழுதுகோல் மேல் மேஜையின் உள்ளது.
Even though the second and third translations look awkward to read, the probability assigned by the translation model to each of these sentences will be the same, as the translation model is mainly concerned with producing the best output words for each word in the source sentence. But when the fluency and accuracy of the translation come into the picture, only the first translation of the given English sentence is correct. This problem can be handled very well by the language model: the probability assigned by the language model to the first sentence will be greater than those assigned to the other two sentences. That is, the language model assigns a higher P(e) to the fluent word order எழுதுகோல் மேஜையின் மேல் உள்ளது than to the awkward orders above.
P(e) = ∏(i=1 to n) P(wi)                                 (7.6)

where

P(wi) = count(wi) / Σw count(w)                          (7.7)
Equation (7.6) will assign a zero probability to the sentence e if any word in the sentence has not occurred in the monolingual corpus, and this in turn will affect the accuracy of the translation process. In order to overcome this problem, the probabilities estimated from the corpus have to be approximated, and this can be done by using n-gram language models.
One of the methods that has been widely used for language modelling is the n-gram
language model. In general, n-gram language models are based on statistics of how
likely words are to follow each other. For the above example, when a corpus is analysed
to determine the probability of the sentence using an n-gram language model, the
probability of the word மேல் will be higher when it follows the word மேஜையின்.
Thus, this type of chain, in which only a limited history is considered, is called a
Markov chain, and the number of previous words considered is termed the order of
the model. This follows from the Markov assumption, which states that only a limited
number of previous words affect the probability of the next word. This assumption can
be contradicted by counterexamples in which a longer history is needed.
Typically the order of the language model is based on the amount of training data
available: limited data restricts the model to short histories, i.e., a small order for the
language model. Generally trigram language models are used, but language models of
smaller order, such as unigrams and bigrams, as well as models of higher orders
are also used. In most cases this depends mainly on the amount of data from which the
language model probabilities are estimated.
P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})   (7.10)
In this equation, each word is conditioned only on the previous N−1 words. Though the
N-gram approach is simple, it has been incorporated into many applications such as
speech recognition, spell checking, translation and many other tasks where language
modelling is required. In general, an n-gram model that generates the probability of a
word by taking account of the previous one word is termed a bigram model, and one that
considers the previous two words is a trigram model. Language model probabilities with
the n-gram approach can be calculated directly from a monolingual corpus. The equation
for calculating trigram probabilities is given by equation (7.11).
P(w_n \mid w_{n-2}, w_{n-1}) = \frac{count(w_{n-2}\, w_{n-1}\, w_n)}{\sum_{w} count(w_{n-2}\, w_{n-1}\, w)}   (7.11)
Here count(w_{n-2} w_{n-1} w_n) denotes the number of occurrences of the sequence
w_{n-2} w_{n-1} w_n in the corpus. The denominator on the right hand side sums, over all words w in
the corpus, the number of times w_{n-2} w_{n-1} occurs before any word. Since this is just
count(w_{n-2} w_{n-1}), the above equation (7.11) can be written as equation (7.12).

P(w_n \mid w_{n-2}, w_{n-1}) = \frac{count(w_{n-2}\, w_{n-1}\, w_n)}{count(w_{n-2}\, w_{n-1})}   (7.12)
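As a concrete illustration, the following minimal Java sketch estimates trigram probabilities by equation (7.12) from whitespace-tokenized sentences held in memory; the class and method names are illustrative only, and no smoothing is applied.

import java.util.HashMap;
import java.util.Map;

// Minimal maximum-likelihood trigram estimation (equation 7.12).
public class TrigramLM {
    private final Map<String, Integer> trigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // Count bigram and trigram occurrences over a tokenized sentence.
    public void train(String[] words) {
        for (int i = 2; i < words.length; i++) {
            String history = words[i - 2] + " " + words[i - 1];
            bigramCounts.merge(history, 1, Integer::sum);
            trigramCounts.merge(history + " " + words[i], 1, Integer::sum);
        }
    }

    // P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
    public double probability(String w1, String w2, String w3) {
        int history = bigramCounts.getOrDefault(w1 + " " + w2, 0);
        if (history == 0) return 0.0; // unseen history; smoothing would apply here
        return trigramCounts.getOrDefault(w1 + " " + w2 + " " + w3, 0) / (double) history;
    }
}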
7.2.3 The Statistical Machine Translation Decoder
The statistical machine translation decoder performs decoding, which is the process of
finding a target translated sentence for a source sentence using the translation model and
the language model. In general, decoding is a search problem that maximizes the
translation and language model probabilities. Statistical machine translation decoders use
best-first search based on heuristics. In other words, the decoder is responsible for the
search for the best translation in the space of possible translations. Given a translation
model and a language model, the decoder constructs the possible translations and looks
for the most probable one. There are numerous decoders for statistical machine
translation; two of them are greedy decoders and beam search decoders. In greedy
decoders, the initial hypothesis is a word-to-word translation which is refined
iteratively using hill climbing heuristics. Beam search decoders use a heuristic
search algorithm that explores a graph by expanding the most promising nodes in a
limited set.
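The following compact Java sketch illustrates the beam-search (stack decoding) idea for the monotone case, with hypotheses grouped by the number of source words covered and pruned to a fixed beam size. The phrase-table interface and all names are illustrative only; real decoders such as Moses additionally handle reordering and future-cost estimation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Minimal monotone beam-search decoder sketch over a toy phrase table.
public class BeamDecoder {
    record Option(String target, double logProb) {}
    record Hyp(String output, double score) {}

    public static String decode(String[] src, Map<String, List<Option>> phraseTable, int beamSize) {
        // stacks.get(k) holds hypotheses covering the first k source words
        List<List<Hyp>> stacks = new ArrayList<>();
        for (int i = 0; i <= src.length; i++) stacks.add(new ArrayList<>());
        stacks.get(0).add(new Hyp("", 0.0));

        for (int k = 0; k < src.length; k++) {
            List<Hyp> stack = stacks.get(k);
            // histogram pruning: keep only the best beamSize hypotheses per stack
            stack.sort(Comparator.comparingDouble((Hyp h) -> -h.score()));
            if (stack.size() > beamSize) stack = stack.subList(0, beamSize);
            for (Hyp h : stack) {
                // expand with source phrases of length 1..3 starting at position k
                for (int len = 1; len <= 3 && k + len <= src.length; len++) {
                    String phrase = String.join(" ", Arrays.copyOfRange(src, k, k + len));
                    for (Option o : phraseTable.getOrDefault(phrase, List.of())) {
                        stacks.get(k + len).add(
                            new Hyp((h.output() + " " + o.target()).trim(), h.score() + o.logProb()));
                    }
                }
            }
        }
        // the best hypothesis covering all source words is the translation
        return stacks.get(src.length).stream()
                .max(Comparator.comparingDouble(Hyp::score))
                .map(Hyp::output).orElse("");
    }
}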
Factored translation models differ from the standard phrase based models in
the following ways [10]:
• The parallel corpus must be annotated with factors such as lemma, part-of-
speech, morphology, etc., before training.
• Additional language models for every factor annotated can be used in training
the system.
• Translation steps will be similar to standard phrase based systems, but
generation steps imply training only on the target side of the corpus.

Therefore, a framework is developed for statistical translation models to
integrate additional information. This framework is an extension of the phrase-based
approach. It adds additional annotation at the word level.
In baseline SMT, the word house is completely independent of the word houses.
Any instance of house in the training data does not add any knowledge to the
translation of houses. In the extreme case, while the translation of house may be known
to the model, the word houses may be unknown and the system will not be able to
translate it. This problem does not show up as strongly in English, due to the
very limited morphological production in English, but it is a significant problem for
morphologically rich languages.
In this model the translation process is broken up into the following three
mapping steps:

1. Translation of input lemmas into output lemmas.
2. Translation of part-of-speech and morphological factors.
3. Generation of surface forms from the output lemmas and linguistic factors.

Syntax-based models, by contrast, perform translation operations at each node of a
parse tree. In general, such a model incorporates syntax into the source and/or target
languages.
Graehl et al. [177] and Melamed [178] propose methods based on tree-to-tree
mappings. Imamura et al. (2005) [179] present a similar method that achieves
significant improvements over a phrase based baseline model for Japanese-English
translation. Recently, various preprocessing approaches have been proposed for
handling syntax within statistical machine translation. These algorithms attempt to
reconcile the word order differences between the source and target language sentences
by reordering the source language data prior to the SMT training and decoding cycles.
7.4.1 MOSES
Moses is an open source toolkit for statistical machine translation [180]. Moses has an
efficient data structure that supports memory-intensive translation models and language
models, exploiting larger data resources with limited hardware. It implements an
efficient representation of the phrase translation table using a prefix tree structure,
which allows loading into memory only the fraction of the phrase table that is needed to
translate the test sentences. Moses uses the beam-search algorithm, which quickly finds
the highest probability translation among the exponential number of choices.
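For illustration, once a translation model and language model have been trained, a Moses decoding run can be invoked from the command line as shown below; the configuration file and input/output file names here are indicative only.

>moses -f model/moses.ini < input.en > output.ta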
7.4.2 GIZA++

GIZA++ is a statistical machine translation toolkit that is used to train IBM models 1-5
and an HMM word alignment model. It is an extension of GIZA, which was designed as
part of the SMT toolkit EGYPT and includes IBM models 3 and 4. It uses the mkcls
[181] tool for unsupervised classification to help the models. It also implements the
HMM alignment model and various smoothing techniques for fertility, distortion and
alignment parameters. A bilingual dictionary is built by this tool from the bilingual
corpus. More details about GIZA++ can be found in [182].
7.4.3 SRILM
SRILM is a toolkit for language modelling that can be used in speech recognition,
statistical tagging and segmentation, and statistical machine translation. It is a freely
available collection of C++ libraries, executable programs, and supporting scripts. It
can build and manage language models. SRILM implements various smoothing
algorithms such as Good-Turing, absolute discounting, Witten-Bell and modified
Kneser-Ney. Besides the standard word based n-gram back-off models, SRILM
implements several other language model types [182], such as word-class based n-gram
models, cache-based models, disfluency and hidden event language models, HMMs of
n-gram models and many more.
Corpus is the term used in linguistics for a (finite) collection of texts (in a specific
language). A collection of documents in more than one language is called a multilingual
corpus. A parallel corpus is a collection of texts in different languages where one of
them is the original text and the others are its translations. A bilingual corpus is a
collection of texts in two different languages where each one is a translation of the
other.
Parallel corpora are very important resources for tasks in the translation field
such as linguistic studies, the development of information retrieval systems and natural
language processing. In order to be useful, these resources must be available in
reasonable quantities, because most application methods are based on statistics. The
quality of the results depends greatly on the size of the corpora, which means robust
tools are needed to build and process them. Alignment at the sentence and word levels
makes parallel corpora both more interesting and more useful.
Aligned bilingual corpora have proved useful in many ways, including
machine translation, sense disambiguation and bilingual lexicography. Parallel
sentences for the English-Tamil language pair are available, but not abundantly.
In Europe, parallel data for many European language pairs are available from the
proceedings of the European Parliament, but in the case of Tamil no such parallel
data are readily available. Hence English sentences have to be collected and manually
translated to Tamil in order to create a bilingual corpus for the English-Tamil language
pair. Even when parallel data are available for the English-Tamil pair, there are chances
that they are not aligned properly, and paragraphs have to be separated into individual
sentences (for example, newspaper corpora). This employs a lot of human resources. It
is time-intensive work, and since the bilingual corpus is the main resource for the
statistical machine translation system, more time and importance have to be given to
developing a bilingual corpus for the English-Tamil language pair. During manual
translation of English sentences to Tamil, terminology data banks for the English-Tamil
language pair are found to be very useful for the human translators.
Table 7.1 Factored Parallel Sentences
An efficient search algorithm quickly finds the highest probability translation
among the exponential number of choices. Morphological, syntactic and semantic
information can be integrated as factors during training. Figure 7.5 explains the mapping
of English factors and Tamil factors in the factored SMT. Initially, the English factors
“Lemma” and “Minimized-POS” are mapped to the Tamil factors “Lemma” and “M-POS”;
then the “Minimized-POS” and “Compound-Tag” factors of the English language are
mapped to the “Morphological information” factor of the Tamil language.
Here, the important point is that Tamil surface word forms are not generated in the SMT
decoder. Only factors are generated by the SMT system, and the word is generated in the
post-processing stage. The Tamil morphological generator is used in post-processing to
generate a Tamil surface word from the output factors.
The SRILM tool ngram-count is used to build the language model:

>ngram-count -order n [options] -text CORPUS_FILE -lm LM_FILE

Where,

order - the order of the n-gram language model is specified with ‘-order n’, where ‘n’
denotes the order of the n-gram model.

text - the monolingual training corpus file from which the counts are collected.

lm - the file name of the language model file to be created by the tool.
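For example, assuming a Tamil monolingual corpus file tamil.corpus (the file names here are indicative only), a trigram language model can be built as:

>ngram-count -order 3 -text tamil.corpus -lm tamil.lm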
Training the translation model with Moses involves the following steps.

• Prepare the data: convert the parallel corpus into a format that is suitable for the
GIZA++ toolkit. Two vocabulary files are generated and the parallel corpus is
converted into a numbered format. The vocabulary files contain words, integer word
identifiers and word count information. GIZA++ also requires words to be placed into
word classes; this is done automatically by calling the mkcls program. Word classes are
only used for the IBM reordering model in GIZA++.
• Align words: the word alignments produced by GIZA++ in the two translation
directions are symmetrized using heuristics such as intersection, grow-diag-final,
union, srctotgt and tgttosrc. Alternative alignment methods can be specified with the
switch alignment.
• Get lexical translation table: given the word alignment, it is quite straight-
forward to estimate a maximum likelihood lexical translation table; w(e|f) and the
inverse w(f|e) are estimated from the word translation table.
• Extract phrases: in the phrase extraction step, all phrases are dumped into
one big file. Each line of this file contains: foreign phrase, English phrase, and
alignment points. Alignment points are pairs (English, Tamil). Also, an inverted
alignment file extract.inv is generated, and if the lexicalized reordering model is trained
(the default), a reordering file extract.o.

• Build reordering model: the lexicalized reordering model is specified by its
orientation type, one of msd-bidirectional-fe, msd-bidirectional-f, msd-fe, msd-f,
monotonicity-bidirectional-fe, monotonicity-bidirectional-f, monotonicity-fe and
monotonicity-f.
• Build generation model: the generation model is built from the target side
of the parallel corpus. By default, forward and backward probabilities are computed. If
the switch generation-type single is used, only the probabilities in the direction of the
step are computed.
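These steps are typically driven by the Moses training script in a single invocation; a representative command (with indicative corpus and file names) might look like:

>train-model.perl -root-dir train -corpus corpus -f en -e ta -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:tamil.lm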
7.7 SUMMARY

This chapter described the factored translation model, which extends the phrase based
model to incorporate linguistic information as additional factors in the representation of
words. The scarcity of training data increases the advantage of using word-level
information in more linguistically motivated models. Mapping translation factors
in the factored model aids in the disambiguation of source words and improves the
grammaticality of the target factors. The SMT generation model is not utilized in this
translation system; the model is tuned only for producing the lemma, POS tag and
morphological factors. It is shown that the developed system improves translation over a
baseline system and other factored systems.
CHAPTER 8
POSTPROCESSING FOR ENGLISH TO TAMIL SMT
8.1 GENERAL
For Indian languages, many attempts have been made to build morphological
generators. A Hindi morphological generator has been developed based on a data driven
approach [186]. Tel-More, a morphological generator for Telugu, is based on linguistic
rules and implemented as a Perl program [187]. A morphological generator has been
designed for the syntactic categories of Tamil using paradigms and sandhi rules [72].
Finite state machines have also been used for developing a morphological generator for
Tamil [131].
the required word form of Tamil verbs. To resolve this complexity, a classification of
Tamil verbs based on tense markers and inflections is made.
The input to the morphological generator system is a factored sentence from the SMT
output. The factored Tamil sentence is categorized according to the simplified POS tag.
The simplified POS tagset is shown in Table 8.2. Based on this simplified tag factor, the
morphological generator generates the word-form. The morphological generator for
nouns handles proper nouns and common nouns. The generation of Tamil verb forms
is taken care of by the morphological generator for verbs.
Figure 8.1 shows the categorization of the Tamil sentence generation system. This
system contains five different modules. The morphological generators for Tamil nouns
and verbs are developed using a suffix based approach. Tamil pronouns come under the
‘closed word-class’ category, so a pattern matching technique is followed for
generating pronominal word forms.
This section describes a new, simple morphological generator algorithm for Tamil.
Generally, a morphological generator tool is developed using a rule based approach,
where it requires a set of morpho-phonemic (spelling) rules and a morpheme dictionary.
The method proposed here can be applied to any morphologically rich language. In
this novel approach, morpheme dictionaries are not required. The algorithm only
needs the suffix table and the code for paradigm classification. If the lemma, POS
category and Morpho-Lexical Inflection (MLI) are given, the proposed algorithm will
generate the intended word form.
ஓடு + V + FT_3SM = ஓடுவான்
காடு + N + ACC = காட்டை
In the above example, “V” represents verb and “FT_3SM” represents future
tense with third person singular masculine. “N” symbolizes noun and “ACC” means
accusative case. FT_3SM and ACC are called Morpho-Lexical Information (MLI).
Three different modules are used to build the noun and verb generator system.
The first module takes the lemma and POS category as input and gives the lemma’s
paradigm number and the word’s stem as output. The second module takes the morpho-lexical
information as input and gives its index number as output. In the third module, a
suffix table is used to generate the word form from the information provided by the above two
modules.
This subsection illustrates the new algorithm developed for the morphological
generator system. The algorithm is implemented as a Java program and is shown in
Figure 8.2.
lemma,wc,morph =SPLIT(Input)
roman_lemma=ROMAN(lemma)
parnum=PARNUM(roman_lemma,wc)
col-index=parnum
row-index=INDEX(morph,wc)
suff=SUFFIX-TABLE[row-index][col-index]
stem=STEM(roman_lemma,wc,parnum)
word=JOIN(stem,suff)
output=UNICODE(word)
In this algorithm, lemma represents the root word of the word form, wc denotes
the word class and morph stands for the morpho-lexical information. The given input is
divided into lemma, word class and morpho-lexical information; this is done using the
SPLIT function. The lemma or root word, which is in Unicode form, is romanized using
the function ROMAN; roman_lemma represents the romanized lemma. parnum
represents the paradigm number of the lemma; the PARNUM function identifies the
paradigm number for the given lemma using its end suffixes. The romanized lemma and
the paradigm number are given as input to the STEM function along with the word
class. This function is used to find the stem of the root word. The given morpho-lexical
information is matched against the morpho-lexical information list, and the
corresponding index number is retrieved; this index number is referred to as row-index.
The paradigm number of the input lemma is used as col-index. Using the row and
column indices, the suffix part is retrieved from the suffix table. The stem and the
retrieved suffix are joined to generate the word form. This word form is then converted
to Tamil Unicode form.
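The following self-contained Java sketch mirrors this pipeline for a single toy paradigm and two morpho-lexical inflections; the hardcoded tables and the simple stemming rule are illustrative stand-ins for the real romanization, paradigm-classification and suffix-table resources described in this chapter.

import java.util.Map;

// Toy illustration of the suffix-table generation algorithm (Figure 8.2).
// Everything hardcoded here is illustrative; the real system works on
// romanized Tamil with full paradigm and suffix tables.
public class TamilMorphGeneratorSketch {
    // MLI list: morpho-lexical information -> row index of the suffix table
    static final Map<String, Integer> MLI_INDEX = Map.of("PT+3SM", 0, "FT+3SM", 1);
    // suffix table: rows = MLI index, columns = paradigm number (one paradigm here)
    static final String[][] SUFFIX_TABLE = { { "inAn" }, { "uvAn" } };

    static String generate(String romanLemma, String mli) {
        int col = 0;                                   // PARNUM: single toy paradigm
        int row = MLI_INDEX.get(mli);                  // INDEX: MLI -> row index
        String stem = romanLemma.replaceAll("u$", ""); // STEM: strip the end character
        return stem + SUFFIX_TABLE[row][col];          // JOIN: stem + retrieved suffix
    }

    public static void main(String[] args) {
        System.out.println(generate("pAdu", "PT+3SM")); // pAdinAn ("he sang")
        System.out.println(generate("pAdu", "FT+3SM")); // pAduvAn ("he will sing")
    }
}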
Figure 8.3 shows the architectural view of the Tamil morphological generator
system. The morphological generator system needs to process three major components:
first the lemma part, then the word class and finally the morpho-lexical
information. The way the generator is implemented makes it distinct from other
generator systems. The input, which is in Tamil Unicode form, is first romanized, and
then the paradigm number is identified using the end characters. The romanized form is
used for easy computation and efficient processing. The morpho-lexical
information of the required word form is given as input; from the morpho-lexical
information list, the index number of the corresponding input is identified and
referred to as the row-index. Based on the word class specified, the system uses the
corresponding suffix table. In the two-dimensional suffix table, rows are morpho-lexical
information indices and columns are paradigm numbers. For each paradigm, a complete
set of morphological inflections corresponding to the morpho-lexical information list is
created. Finally, using the column index and row index, the morphological suffix is
retrieved from the suffix table. This suffix is affixed to the stem to generate the word
form.
The Tamil verb morphological generator is capable of generating more than ten
thousand forms of a single Tamil verb. Some of the word-forms are shown in Table 8.3;
the remaining word forms are listed in Appendix-B. The noun morphological generator
system handles nearly three hundred word forms including postpositions. These noun
word-forms are also given in Appendix-B.
படி. நரம்பு.
படித்தான். நரம்பை.
படித்தாள். நரம்பினை.
படித்தார். நரம்பினத்தை.
படித்தார்கள். நரம்போடு.
படித்தது. நரம்பினோடு.
படித்தன. நரம்பினத்தோடு.
படித்தாய். நரம்பினால்.
படித்தீர். நரம்பால்.
படித்தேன். நரம்பிற்கு.
படித்தோம். நரம்பின்.
படிக்கிறான். நரம்பது.
படிக்கிறாள். நரம்பினது.
படிக்கிறார். நரம்பின்கண்.
படிக்கின்றன. நரம்பாலான.
படிக்கின்றீர்கள். நரம்பில்.
படிக்கின்றீர்கள். நரம்பினில்.
The Tamil morphological generator for nouns and verbs requires the following resources
for generating a word form.
PT+3SM PRT+1P
PT+3SF FT+3SM
PT+3SE FT+3SF
PT+3SE+PL FT+3SE
PT+3SN FT+3SE+PL
PT+NOM_athu FT+3SN
PT+RP+3SN FT+RP+3SN
PT+3PN FT+NOM_athu
PT+RP+3PN FT+3PN
PT+NOM_ana FT+RP+3PN
PT+2S FT+NOM_ana
PT+2EH FT+2S
PT+2EH+PL FT+2EH
PT+1S FT+2EH+PL
PT+1P FT+1S
PRT+3SM FT+1P
PRT+3SF FT_3SN
PRT+3SE RP_UM
PRT+3SE+PL PT+RP
PRT+3SN PRT+RP
PRT+RP+3SN NM+RP
PRT+NOM_athu PT+RP+3SM
PRT+3PN PT+RP+3SF
PRT+RP+3PN PT+RP+3SE
PRT+NOM_ana PRT+RP+3SM
PRT+2S PRT+RP+3SF
PRT+2EH PRT+RP+3SE
PRT+2EH+PL PRT+3SN
PRT+1S PRT+3SN
This section explains how the paradigm is classified based on end characters.
Generally, a paradigm is identified using a look-up table. Initially, the root word is
romanized using a Tamil Unicode to roman mapping file. This romanized form is
compared with the end suffixes in the paradigm classification file. If an end suffix
matches the end characters of the root word, then the paradigm number is identified. End
suffixes are created based on the paradigms and sorted according to their character
length. Figure 8.4 shows the algorithm for paradigm classification.
A look-up table is used only for paradigms 25 (படி, padi) and 26 (நட, wada),
as the end suffixes cannot be generalized for these two paradigms. Figure 8.4 shows the
pseudo code for paradigm classification. End suffixes with their corresponding paradigm
numbers are given in Table 8.7. For example, the verb பயில் (payil) matches
the end suffix ‘ல்’ (il); therefore, its paradigm is 11 and there is no possibility of a
second paradigm.

In some cases, a word may have two paradigms. For example, words like
(படி, padi) and (தீர், thIr) have two paradigms: because of the word’s senses, padi has two
paradigms, and the intransitive form makes the word thIr fall under two paradigms.
Examples of the various word forms are given below.

படிக்க, படிய,
படித்து, படிந்து.
தீர, தீர்க்க,
தீர்ந்து, தீர்த்து.
[Figure 8.4: Pseudo code for paradigm classification]
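A minimal Java sketch of this classification step, using a few representative rows of Table 8.5 (the full suffix list and the romanization step are omitted), is given below.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the paradigm classification in Figure 8.4: end suffixes, kept
// longest-first, are matched against the romanized root; the first match
// yields the paradigm number. The look-up-table exceptions (padi, wada)
// are handled before suffix matching.
public class ParadigmClassifier {
    // insertion order = longest suffix first, as the text requires
    static final Map<String, Integer> END_SUFFIXES = new LinkedHashMap<>();
    static {
        END_SUFFIXES.put("sAku", 3);
        END_SUFFIXES.put("puku", 32);
        END_SUFFIXES.put("kEL", 19);
        END_SUFFIXES.put("il", 11);
        END_SUFFIXES.put("u", 15);
    }

    static int paradigm(String romanRoot) {
        if (romanRoot.equals("padi")) return 25;  // look-up table exception
        if (romanRoot.equals("wada")) return 26;  // look-up table exception
        for (Map.Entry<String, Integer> e : END_SUFFIXES.entrySet())
            if (romanRoot.endsWith(e.getKey())) return e.getValue();
        return -1; // unknown paradigm
    }

    public static void main(String[] args) {
        System.out.println(paradigm("payil")); // 11: matches end suffix "il"
        System.out.println(paradigm("padi"));  // 25: handled by look-up table
    }
}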
Table 8.5 shows the look-up table of Tamil verb paradigms and end suffixes. For
instance, the first paradigm is செய் (sey); some other words that fall under this paradigm
are பெய் (pey), மெய் (mey) and உய் (uy), so the generalized end suffixes are எய் (ey),
ஏய் (Ey) and உய் (uy).
End suffix   Paradigm(s)        End suffix   Paradigm(s)
ai           8                  UN           20
akal         11                 uN           21
kal          18                 AN           22
al           11, 18             in           23
AL           12                 wil          24
aL           12                 il           11
uL           12                 en           27
kEL          19                 In           28
sol          16                 Aku          31
ol           13                 puku         32
el           13                 Nuku         15
oL           14                 uku          32
Ul           18                 iku          32
Ol           18                 u            15
vil          18                 r            6
El           19
UL           19
IL           19, 12
The suffix table is the most essential resource for this algorithm. It is a simple two-
dimensional (2D) table where each row corresponds to a morpho-lexical information
entry and each column corresponds to a paradigm number. Noun and verb each have
their own suffix tables. The noun suffix table contains 325 rows (word-forms) and 25
columns (paradigms). The verb generator has two suffix tables: the first table is made
for forms without auxiliaries and the second is designed for forms with auxiliaries. The
first table contains 164 rows and the second has 67 rows; both tables contain 32 columns
(paradigms). Table 8.6 shows the number of paradigms and inflections of verbs and
nouns which are handled. “Total” represents the total number of inflections handled by
the generator system.
        Paradigms   Inflections   Auxiliaries   Postpositions   Total word forms
Verb    32          164           67            --              10988
Noun    25          30            --            290             320
For each head word, all the inflected word forms are created. For example, for the verb
“pAdu”, inflected word forms such as (pAdinAn) and (pAdinAL) are created. Using
these word-forms, a 2D table is created with columns and rows; each column
corresponds to a paradigm and the rows represent morpho-lexical information.
The word-forms of every paradigm are romanized and put in a table. The stem of
each head word is identified and removed from its word-forms (i.e. the common term is
removed from all word forms). Now only the remaining portion is available in the table.
This remaining portion is called the suffix, and the table is called the suffix table. Table 8.7
illustrates a sample suffix-table for Tamil verbs. In this table, the rows (MLI-1, MLI-2, …)
specify the morpho-lexical inflections and the columns (P-1, P-2, …) indicate the paradigm
numbers.

8.3.3.4 Stemming Rules
The stemming rules specify, for each paradigm, the end characters that are removed
from a word to obtain its stem; for example, words whose end characters are “Zu”
fall in the second paradigm. If a paradigm has the end character “*”, then no
character should be removed from the word.
Paradigm Number   End Characters        Paradigm Number   End Characters
1                 *                     17                *
2                 Zu                    18                L
3                 sAku                  19                L
4                 Du                    20                *
5                 Ru                    21                *
6                 wOku                  22                kAN
7                 Zu                    23                *
8                 *                     24                L
9                 wOku                  25                *
10                A                     26                *
11                L                     27                *
12                L                     28                *
13                L                     29                U
14                L                     30                Ru
15                U                     31                U
16                L                     32                U
Tamil pronouns are very essential, and their structures are used in everyday conversation.
They include personal pronouns (referring to the persons speaking, the persons spoken to,
or the persons or things spoken about), indefinite pronouns, relative pronouns (connecting
parts of sentences) and reciprocal or reflexive pronouns (in which the object of a verb is
acted on by the verb's subject). Personal pronouns, indefinite pronouns, relative pronouns,
and reciprocal or reflexive pronouns play an important role in the Tamil language;
therefore they need very special attention in generation as well as in analysis. Tamil
pronouns come under the closed word class category. The morphological generator for
Tamil pronouns is developed separately using a pattern matching technique. The primary
advantage of using the pattern matching method is that it performs well for closed class
words. The reverse process of the pronoun morphological analyzer is used for generating
the pronoun surface word, so the same data used in the pronoun morphological
analyzer is used in generation too. The pronoun word form structure is shown in Figure 8.5.
An example of the pronoun structure is described below.

[Figure 8.5: Pronoun word form structure - a pronoun root followed by case,
postposition and clitic slots, with example word forms built from the root அவன் (avan)]
[Figure: The pronoun generation process - the root word is romanized and matched
against the suffix database, which contains four suffix lookup tables]
The input for this morphological generator is the pronoun root word and the
morphological information. The pronoun root word is first romanized and then verified
against the suffix database which is used in morphological analysis. Four types of suffix
lookup tables are used in this system; each table is created based on a level in the
pronoun structure.
8.5 SUMMARY

Post-processing generates a Tamil sentence from the factored output of the SMT system.
Each unit in the factored output contains the root word, word class and morphological
information. The morphological generator is used as a post-processing component for
word-form generation.
CHAPTER 9
EXPERIMENTS AND RESULTS
9.1 GENERAL
This section explains the experiments and results of the English to Tamil statistical
machine translation system. The experiments include the installation of the SMT toolkit
and the training and testing regulations. This machine translation system is an integration
of various modules, so its accuracy depends on each module in the system. Roughly, the
translation system is divided into three modules: preprocessing, translation and
post-processing. The results and the errors therefore depend on the preprocessing stages
of the English sentence, the factored SMT, and the postprocessing. In preprocessing,
English language sentences are transformed using the reordering and compounding
phases. Preprocessing uses rules for reordering and compounding, and an English parser;
its accuracy depends on the parser's output and the rules developed. The next stage is the
factored SMT system, whose output depends on the size and quality of the corpora. The
parameter tuning and decoding steps also play a major role in producing the output of
the SMT system. Finally, the output is given to the post-processing stage, where the
Tamil morphological generator is utilized; therefore the accuracy of the morphological
generator decides the precision of the postprocessing stage. Sample outputs of the three
modules are shown in the appendix. Finally, the output of this machine translation
system is evaluated using the BLEU and NIST metrics. Different models are also
developed to compare the results of the developed translation engine.
This section describes the experimental setup and the data used in the English to Tamil
statistical machine translation system. The training data consists of approximately 16K
English to Tamil parallel sentences. The health domain English-Tamil parallel corpora of
the EILMT project (English to Indian Language Machine Translation, a project funded
by DIT) are used in the experiments. The training set is built with 13,500 parallel
sentences and the test set is constructed with 1532 sentences; 1000 parallel sentences are
used for tuning the system. For the language model, a corpus of 90k Tamil sentences is
used.
The average word lengths of sentences in the baseline and factored parallel corpora used
in these experiments are shown in Tables 9.1 and 9.2.
Nine different types of models are trained, tuned and tested with the help of the parallel
corpora. The general categories of the models are the Baseline and Factored systems.
The detailed models are:
1. Baseline (BL)
2. Baseline with Automatic Reordering (BL+AR)
3. Baseline with Rule based Reordering (BL+RR)
4. Factored system + Morph-Generator (Fact)
5. Factored system + Auto Reordering +Morph-Generator (Fact+AR)
6. Factored system +Rule based Reordering + Morph-Generator (Fact+RR)
7. Factored system + Compounding + Morph-Generator (Fact+Comp)
8. Factored system + Auto Reordering +Compounding +Morph-Generator
(Fact+AR+Comp)
9. Factored system +Rule based Reordering +Compounding+ Morph-
Generator (Fact+RR+Comp)
For the baseline system, a standard phrase based system is built using the surface forms
of the words, without any additional linguistic knowledge and with a 4-gram LM in the
decoder. A cleaned raw parallel corpus is used for training the system. The lexicalized
reordering model (msd-bidirectional-fe) is used in the baseline with automatic
reordering model, and another baseline system is built with the use of rule based
reordering. In all the developed factored models, the Tamil morphological generator is
used in the post-processing stage. Instead of using only the surface form of the word, the
root, part-of-speech and morphological information are included with the word as
additional factors. A factored parallel corpus is used for training the system. English
factorization is done using the Stanford Parser tool, and for Tamil the POS tagger and
morphological analyzer are used to factor the sentences. In this factored model system,
a token/word is represented with four factors as Surface|Root|Wordclass|Morphology,
where the Morphology factor contains morphological information and function words on
the English side, and morphological tags on the Tamil side. In the factored model with
rule based reordering and compounding (Fact+RR+Comp), English words are factored
and reordered; in addition, compounding is also performed on the English side.
All the developed models are evaluated with the same test-set, which contains 1532
English sentences. The well known machine translation metrics BLEU [144] and
NIST [145] are used to evaluate the developed models. In addition, the existing
“Google Translate” online English-Tamil machine translation system is also evaluated
for comparison with the developed models. The results, in terms of BLEU-1, BLEU-4
and NIST scores, are shown in Table 9.3. In Figures 9.1 and 9.2, the X axis represents
the various machine translation models and the Y axis denotes the BLEU-1 and BLEU-4
scores; Figure 9.3 shows the NIST scores of the developed models. From the graphs in
the figures, it is clearly seen that the proposed system (Fact+RR+Compounding)
improves the BLEU and NIST scores compared to the other developed models and the
“Google Translate” system. The Google translation system's output is shown in Figure
9.4. In this output, both sentences fail to produce noun-verb agreement, and case markers
are also not identified in the second sentence. A grammatically correct output is not
available among the alternate translations either. A detailed output comparison is shown
in Appendix-B. The F-SMT based system developed in this thesis also handles
noun-verb agreement, which is an important and challenging job when translating into
morphologically rich languages like Tamil.
Table 9.3 BLEU and NIST Scores

Model               BLEU-1   BLEU-4   NIST
BL+RR               0.2594   0.0258   2.4148
Fact+Mgen           0.6406   0.3722   3.9831
Fact+AR+Mgen        0.6405   0.3725   3.9876
Fact+RR+Mgen        0.6285   0.3653   3.3887
Fact+Comp           0.6239   0.3573   3.9626
Fact+AR+Comp+Mgen   0.6237   0.3577   3.9673
Fact+RR+Comp+Mgen   0.6753   0.3894   4.2667
Figure 9.1 BLEU-1 Scores for Various Models
Figure 9.2 BLEU-4 Scores for Various Models
Figure 9.3 NIST Scores for Various Models
Figure 9.4 Google Translation System¹
9.3 SUMMARY

This chapter described the results used to test the effectiveness of the English to Tamil
machine translation systems. The results of the evaluation clearly confirm that the new
techniques proposed in this thesis are significant. The pre-processing and
post-processing allow the developed system to achieve a relative improvement in the
BLEU and NIST scores. Furthermore, the Tamil linguistic tools which are modules of
the translation system are also implemented in this research. The preprocessing
techniques developed in this work help to increase the translation quality. The BLEU
and NIST evaluation scores clearly show that the factored model with an integration of
linguistic knowledge gives better results for the English to Tamil statistical machine
translation system.
¹ Google Translate output tested on 31-07-2012.
CHAPTER 10
CONCLUSION AND FUTURE WORK
In this chapter, the main contributions and most significant achievements of this
research work are summarized. The conclusion, which follows the summary,
attempts to highlight the research contributions in the field of Tamil language
processing. At the same time, the limitations and future scope of the developed systems
are also mentioned, so that researchers who are interested in extending any part of this
work can easily explore the possibilities.
10.1 SUMMARY
Due to the multilingual nature of the present information society, human language
processing and machine translation have become essential for languages. Several
linguistic features (both morphological and syntactical) make translation a truly
challenging task. The importance of machine translation will grow as the need for
translating resources of knowledge from one language to another increases.
This thesis presents novel methods for incorporating linguistic knowledge in
SMT to achieve an enhancement of the English to Tamil machine translation system.
Most of the techniques presented in this thesis can be applied directly to other language
pairs, especially for translating from a morphologically simple language to a
morphologically rich one.
The precision of the translation system depends on the performance of each and
every module and linguistic tool used in the system. The experimental results clearly
demonstrate that the new techniques proposed in this thesis are definitely significant.
Four different machine translation models are experimented with, and their BLEU and
NIST scores are compared. The developed model (factored SMT with pre- and
post-processing) reports a 4.2667 NIST score for English to Tamil translations. Adding
pre- and post-processing to the factored SMT provided about a 0.38 BLEU-1
improvement over a word based baseline system and a 0.03 improvement over a
factored baseline system. Finally, these scores are compared with the “Google
Translate” online machine translation system: the developed model reports a 1.3
improvement in NIST score and about a 0.36 improvement in BLEU-1 score. The
improvements in the BLEU and NIST evaluation scores show that the proposed
approach is appropriate for an English to Tamil machine translation system.
In this section, the research work done is summarized with reference to the objectives
of the proposed work. This thesis mainly highlights five different developments in
the field of language processing, which are listed below.
• A Tamil POS tagger is developed using an SVM based machine learning tool. The
major challenge in developing a statistical POS tagger for Indian languages
is the unavailability of an annotated (tagged) corpus. The developed Tamil POS
tagger is trained on 5 lakh POS-annotated words. This tagged corpus, also built as a
part of this research, is a major resource for Tamil language processing.
• The morphologically segmented and tagged words are also a most significant resource
for analyzing Tamil word forms. The same methodology is successfully
implemented for other Dravidian languages like Malayalam, Telugu, and
Kannada.
The major outcomes of the research are mapped to the publications which have
resulted from this work, as shown in Table 10.1.
10.3 CONCLUSION
The major achievement of this research has been the development of a factored
statistical machine translation system for the English to Tamil language pair by
integrating linguistic tools. Linguistic tools like the POS tagger and morphological
analyzer are also developed as part of this research work. Developing these linguistic
tools is a challenging and demanding task, especially for a highly agglutinative
language like Tamil.
The performance of statistical and machine learning methods mainly depends
on the size and correctness of the corpus. If the corpus consists of all types of surface
word forms, word categories and sentence structures, then it is possible for a learning
algorithm to extract all the required features. Preprocessing systems are automated for
creating the factored parallel and monolingual corpora. Factored corpora are an essential
resource for developing a factored machine translation system from English to Tamil.
The Tamil linguistic tools can be used in future to implement machine translation
systems between Tamil and other languages, especially Dravidian languages like
Telugu, Malayalam and Kannada. The developed linguistic tools and annotated corpus
can also be used in other language processing tasks such as information retrieval and
extraction, speech processing, question answering and word sense disambiguation.
This thesis addresses a technique to improve the quality of machine translation
by separating the root and morphological information of the surface word. The main
limitation of the approach presented here is that it is not directly applicable in the
reverse direction (Tamil to English). All the developed computational linguistic tools
and MT systems are domain specific and scalable, so that researchers who are
interested in extending any part of this work can easily explore the possibilities. There
are a number of possible directions for future work based on the findings in this thesis;
some of them are given below.
• Increasing the size of the parallel corpora always helps to improve the accuracy of
the system. Adding different sentence structures and handling idioms and
phrases externally would help to improve the system.
• The tools and methodologies developed here are used to perform translation from
English to Tamil. It would be interesting to apply similar methods for translating
English to other morphologically rich languages.
• The tools and methodologies which are developed can also be used to develop
translation systems that translate other languages into Tamil.
• The reordering rules and compounding rules suggested in this thesis are
relatively simple, and do not perform very well on long sentences. It would be
possible to replace the handcrafted rules with automatically learned rules.
• Applying advanced SMT models like the syntax based model, hierarchical models
and hybrid approaches would improve the system.
APPENDIX-A
ேஞா njO தா thA pU
ெஞௗ njau தி thi ெப pe
ட da தீ thI ேப pE
ட் d thu ைப pai
டா dA thU ெபா po
di ெத the ேபா pO
டீ dI ேத thE ெபௗ pau
du ைத thai ம ma
dU ெதா tho ம் m
ெட de ேதா thO மா mA
ேட dE ெதௗ thau மி mi
ைட dai ந wa மீ mI
ெடா do ந் w mu
ேடா dO நா wA mU
ெடௗ dau நி wi ெம me
ண Na நீ wI ேம mE
ண் N wu ைம mai
ணா NA wU ெமா mo
ணி Ni ெந we ேமா mO
ணீ NI ேந wE ெமௗ mau
Nu ைந wai ய ya
NU ெநா wo ய் y
ெண Ne ேநா wO யா yA
ேண NE ெநௗ wau யி yi
ைண Nai ப pa யீ yI
ெணா No ப் p yu
ேணா NO பா pA yU
ெணௗ Nau பி pi ெய ye
த tha பீ pI ேய yE
த் th pu ைய yai
ெயா yo வ் v Lu
ேயா yO வா vA LU
ெயௗ yau வி vi ெள Le
ர ra vI ேள LE
ர் r vu ைள Lai
ரா rA vU ெளா Lo
ாி ri ெவ ve ேளா LO
ாீ rI ேவ vE ெளௗ Lau
ru ைவ vai ற Ra
rU ெவா vo ற் R
ெர re ேவா vO றா RA
ேர rE ெவௗ vau றி Ri
ைர rai ழ za றீ RI
ெரா ro ழ் z Ru
ேரா rO ழா zA RU
ெரௗ rau ழி zi ெற Re
ல la ழீ zI ேற RE
ல் l zu ைற Rai
லா lA zU ெறா Ro
li ெழ ze ேறா RO
லீ lI ேழ zE ெறௗ Rau
lu ைழ zai ன na
lU ெழா zo ன் n
ெல le ேழா zO னா nA
ேல lE ெழௗ zau னி ni
ைல lai ள La னீ nI
ெலா lo ள் L nu
ேலா lO ளா LA nU
ெலௗ lau ளி Li ென ne
வ va ளீ LI ேன nE
ைன nai ெஸௗ Sau
ெனா no ஷ sha
ேனா nO ஷ் sh
ெனௗ nau ஷா shA
ஜ ja ஷி shi
ஜ் j ஷீ shI
ஜா jA ஷு shu
ஜி ji ஷூ shU
ஜீ jI ெஷ she
ஜு ju ேஷ shE
ஜூ jU ைஷ shai
ெஜ je ெஷா sho
ேஜ jE ேஷா shO
ைஜ jai ெஷௗ shau
ெஜா jo ஹ ha
ேஜா jO ஹ் h
ெஜௗ jau ஹா hA
ஸ Sa ஹி hi
ஸ் S ஹீ hI
sri ஹு hu
ஸா SA ஹூ hU
Si ெஹ he
ஸீ SI ேஹ hE
ஸு Su ைஹ hai
ஸூ SU ெஹா ho
ெஸ Se ேஹா hO
ேஸ SE ெஹௗ hau
ைஸ Sai
ெஸா So
ேஸா SO
A.2 DETAILS OF AMRITA POS TAGS
The major POS tags are noun, verb, adjective and adverb. The noun tag is further
classified into 10 tag categories and the verb tag into 7 tag categories; the rest are the
adjective, adverb and other tags.
Noun Tags

Nouns are words which denote a person, place, thing, time, etc. In the Tamil language,
nouns are inflected for number and case at the morphological level. However, at the
phonological level, four types of suffixes can occur with a noun stem.
‘by- bird’
Example for Compound Nouns (NNC)
UrAdsi <NNC> thalaivar <NNC>
‘Township leader’
vanap <NNC> pakuthi <NNC>
‘forest area’
Proper nouns are words which denote a particular person, place, or thing. Indian
languages, unlike English, do not have any specific marker for proper nouns in their
orthographic conventions. English proper nouns begin with a capital letter, which
distinguishes them from common nouns. Most of the words which occur as proper nouns
in Indian languages can also occur as common nouns. For example, in English, John,
Harry and Mary occur only as proper nouns, whereas in Tamil, thAmarai, maNi, pissai,
arasi, etc. are used as proper nouns as well as common nouns. Given below is a list of
Tamil words with their grammatical category and English glosses; these words can
occur in text as both common and proper nouns.
thAmarai noun lotus
maNi noun bell
pissai noun beg
arasi noun queen
Two tags have been used for proper nouns:
• Proper noun <NNP>
• Compound proper noun <NNPC>
Cardinal Tag
Any word denoting a cardinal number is tagged as <CRD>.
Ordinal Tag

Ordinals are an extension of the natural numbers, different from integers and from
cardinals. In Tamil, ordinals are formed by adding the suffixes Am and Avathu.
Expressions denoting ordinals are marked as <ORD>.
Adjective Tag

Adjectives are noun modifiers. In modern Tamil, simple and derived adjectives are
present. The derived adjectives in Tamil are formed by adding the suffix (Ana) to the
noun root. The <ADJ> tag is used for representing adjectives.
Example for Adjectives <ADJ>
iwtha walla <ADJ> paiyan
“this nice boy”
oru azakAna <ADJ> peN
“A beautiful girl”
Adverb Tag

Adverbs are words which tell more about verbs. In modern Tamil, simple and derived
adverbs are present. The derived adverbs in Tamil are formed by adding the suffixes
(Aka and Ay) to the verb root. Temporal and spatial entities are also tagged as adverbs.
The <ADV> tag is used for representing adverbs.

Verb Tags

Verbs are defined as action words, which can take tense suffixes, person, number and
gender suffixes, and a few other verbal suffixes. Tamil verb forms can be distinguished
into finite and non-finite verbs.

Tamil distinguishes between four types of non-finite verb forms. They are the verbal
participle <VNAV>, the adjectival participle <VNAJ>, the infinitive <VINT> and the
conditional <CVB>.
Example for verbal participle <VNAV>
wowthu<VNAV> poo
“become vexed”
Other Tags

Postposition tag

Tamil has a few free forms which are referred to as postpositions. They are added
after the case marker.

Example

kovil kiri vIddukku pakkaththil <PPO> uLLathu

In the above example, the postposition pakkaththil 'near' occurs after the dative noun
phrase vIddukku. Here the form pakkaththil is not considered a suffix; it is a free form,
and because of its place of occurrence it is termed a postposition. Postpositions are
historically derived from verbs; Schiffman (1999) describes various postpositions.
Postpositions are conditioned by the nouns, inflected for case, that they follow. In Tamil,
some postpositions are simple and some are compound. The <PPO> tag is used for
representing postposition words.
Conjunction tag

Co-ordination or conjunction in Tamil is mainly realized by certain noun forms, verb
forms and clitics. The <CNJ> tag is used for representing conjunctions.

Determiner tag

A determiner is a noun-modifier that expresses the reference of a noun or noun-phrase in
a context, rather than attributes expressed by adjectives. This function is usually
performed by articles, demonstratives or possessive determiners. The <DET> tag is used
for annotating determiners.
Complementizer

A complementizer is a syntactic category roughly equivalent to the term subordinating
conjunction in traditional grammar. For example, the word “that” is a complementizer in
the following English sentence: “Mary believes that it is raining”. The <COM> tag is
used for tagging complementizers.
Emphasis tag

Force or intensity of expression that gives impressiveness or importance to the
word category is termed emphasis. The <EMP> tag is used for annotating emphasis.
Symbol tags

Only two symbols are considered in the corpus: the dot <DOT> tag and the comma
<COMM> tag. The <DOT> tag is used to mark sentence separation. The <COMM> tag
is used in between multiple nouns and proper nouns.
APPENDIX-B
B.2 DEPENDENCY TAGS

Dependency Label   Meaning
dep                dependent
aux                auxiliary
auxpass            passive auxiliary
cop                copula
conj               conjunct
cc                 coordination
arg                argument
subj               subject
nsubj              nominal subject
nsubjpass          passive nominal subject
csubj              clausal subject
comp               complement
obj                object
dobj               direct object
iobj               indirect object
pobj               object of preposition
attr               attributive
ccomp              clausal complement with internal subject
xcomp              clausal complement with external subject
compl              complementizer
mark               marker (introducing an advcl)
rel                relative (introducing a rcmod)
acomp              adjectival complement
agent              agent
xsubj              controlling subject
ref                referent
expl               expletive (expletive there)
mod                modifier
advcl              adverbial clause modifier
purpcl             purpose clause modifier
tmod               temporal modifier
rcmod              relative clause modifier
amod               adjectival modifier
infmod             infinitival modifier
partmod            participial modifier
num                numeric modifier
number             element of compound number
appos              appositional modifier
nn                 noun compound modifier
abbrev             abbreviation modifier
advmod             adverbial modifier
neg                negation modifier
poss               possession modifier
possessive         possessive modifier ('s)
prt                phrasal verb particle
det                determiner
prep               prepositional modifier
sdep               semantic dependent
B.3 TAMIL VERB MLI FILE
Null PT+RP
PT+3SM PRT+RP
PT+3SF NM+RP
PT+3SE PT+RP+3SM
PT+3SE+PL PT+RP+3SF
PT+3SN PT+RP+3SE
PT+NOM_athu PRT+RP+3SM
PT+RP+3SN PRT+RP+3SF
PT+3PN PRT+RP+3SE
PT+RP+3PN PRT+3SN
PT+NOM_ana PRT+3SN
PT+2S PRT+NOM_athu
PT+2EH FT+RP+3SM
PT+2EH+PL FT+RP+3SF
PT+1S FT+RP+3SE
PT+1P NM+RP+3SM
PRT+3SM NM+RP+3SF
PRT+3SF NM+RP+3SE
PRT+3SE NM+RP+NOM_athu
PRT+3SE+PL PT+RP+3PN_vai
PRT+3SN NM+NOM_ana
PRT+RP+3SN VP
PRT+NOM_athu INF
PRT+3PN NM+3SN
PRT+RP+3PN NM+VP
PRT+NOM_ana NM+VP_aamal
PRT+2S NOM_thal
PRT+2EH NOM_al
PRT+2EH+PL NOM_kai
PRT+1S NM+NOM_mai
PRT+1P NOM_kkal+MOOD_aam
FT+3SM NOM_kkal+MOOD_aakum
FT+3SF NOM_kkal+NM+3SN
FT+3SE VP+PAAR_AUX+PT+3SM
FT+3SE+PL VP+PAAR_AUX+PT+3SF
FT+3SN VP+PAAR_AUX+PT+3SE
FT+RP+3SN VP+PAAR_AUX+PT+2S
FT+NOM_athu VP+PAAR_AUX+PT+1P
FT+3PN VP+PAAR_AUX+PT+1S
FT+RP+3PN VP+PAAR_AUX+PRT+3SE
FT+NOM_ana VP+PAAR_AUX+PRT+3SE+PL
FT+2S VP+PAAR_AUX+PRT+3SF
FT+2EH VP+PAAR_AUX+PRT+3SM
FT+2EH+PL VP+PAAR_AUX+PRT+1P
FT+1S VP+PAAR_AUX+PRT+2S
FT+1P VP+PAAR_AUX+PRT+1S
FT_3SN VP+PAAR_AUX+FT+3SE
RP_UM VP+PAAR_AUX+FT+3SF
VP+PAAR_AUX+FT+3SM VP+KODU_AUX+FT+2EH
VP+PAAR_AUX+FT+2S VP+KODU_AUX+FT+2EH+PL
VP+PAAR_AUX+FT+1S VP+KODU_AUX+FT+1S
VP+PAAR_AUX+FT+1P VP+KODU_AUX+FT+1P
VP+IRU_AUX+PT+3SE VP+POO_AUX+PT+3SM
VP+IRU_AUX+PT+3SF VP+POO_AUX+PT+3SF
VP+IRU_AUX+PT+3SM VP+POO_AUX+PT+3SE
VP+IRU_AUX+PT+2S VP+POO_AUX+PT+3SE+PL
VP+IRU_AUX+PT+1S VP+POO_AUX+PT+3SN
VP+IRU_AUX+PT+1P VP+POO_AUX+PT+3PN
VP+IRU_AUX+PRT+3SE VP+POO_AUX+PT+2S
VP+IRU_AUX+PRT+3SF VP+POO_AUX+PT+2EH
VP+IRU_AUX+PRT+3SM VP+POO_AUX+PT+2EH+PL
VP+IRU_AUX+PRT+1S VP+POO_AUX+PT+1S
VP+IRU_AUX+PRT+1P VP+POO_AUX+PT+1P
VP+IRU_AUX+FT+3SE VP+POO_AUX+PRT+3SM
VP+IRU_AUX+FT+3SF VP+POO_AUX+PRT+3SF
VP+IRU_AUX+FT+3SM VP+POO_AUX+PRT+3SE
VP+IRU_AUX+FT+2S VP+POO_AUX+PRT+3SE+PL
VP+IRU_AUX+FT+1S VP+POO_AUX+PRT+3SN
VP+IRU_AUX+FT+1P VP+POO_AUX+PRT+3PN
VP+KODU_AUX+PT+3SM VP+POO_AUX+PRT+2S
VP+KODU_AUX+PT+3SF VP+POO_AUX+PRT+2EH
VP+KODU_AUX+PT+3SE VP+POO_AUX+PRT+2EH+PL
VP+KODU_AUX+PT+3SE VP+POO_AUX+PRT+1S
VP+KODU_AUX+PT+3SN VP+POO_AUX+PRT+1P
VP+KODU_AUX+PT+3PN VP+POO_AUX+FT+3SM
VP+KODU_AUX+PT+2S VP+POO_AUX+FT+3SF
VP+KODU_AUX+PT+2SE VP+POO_AUX+FT+3SE
VP+KODU_AUX+PT+2SE+PL VP+POO_AUX+FT+3SE+PL
VP+KODU_AUX+PT+1S VP+POO_AUX+FT+3SN
VP+KODU_AUX+PT+1P VP+POO_AUX+FT+3PN
VP+KODU_AUX+PRT+3SM VP+POO_AUX+FT+2S
VP+KODU_AUX+PRT+3SF VP+POO_AUX+FT+2EH
VP+KODU_AUX+PRT+3SE VP+POO_AUX+FT+2EH+PL
VP+KODU_AUX+PRT+3SE+PL VP+POO_AUX+FT+1S
VP+KODU_AUX+PRT+3SN VP+POO_AUX+FT+1P
VP+KODU_AUX+PRT+3PN VP+VIDU_AUX+PT+3SE
VP+KODU_AUX+PRT+2S VP+VIDU_AUX+PT+3SF
VP+KODU_AUX+PRT+2EH VP+VIDU_AUX+PT+3SM
VP+KODU_AUX+PRT+2EH+PL VP+VIDU_AUX+PT+2S
VP+KODU_AUX+PRT+1S VP+VIDU_AUX+PT+1P
VP+KODU_AUX+PRT+1P VP+VIDU_AUX+PT+1S
VP+KODU_AUX+FT+3SM VP+VIDU_AUX+PRT+3SE
VP+KODU_AUX+FT+3SF VP+VIDU_AUX+PRT+3SM
VP+KODU_AUX+FT+3SE VP+VIDU_AUX+PRT+3SF
VP+KODU_AUX+FT+3SE+PL VP+VIDU_AUX+PRT+2S
VP+KODU_AUX+FT+3SN VP+VIDU_AUX+PRT+1P
VP+KODU_AUX+FT+3PN VP+VIDU_AUX+PRT+1S
VP+KODU_AUX+FT+2S VP+VIDU_AUX+FT+3SE
VP+VIDU_AUX+FT+3SF VP+THOLAI_AUX+PRT+3SM
VP+VIDU_AUX+FT+3SM VP+THOLAI_AUX+PRT+2S
VP+VIDU_AUX+FT+2S VP+THOLAI_AUX+PRT+1P
VP+VIDU_AUX+FT+1S VP+THOLAI_AUX+PRT+1S
VP+VIDU_AUX+FT+1P VP+THOLAI_AUX+FT+3SE
VP+KI_AUX+PT+3SE VP+THOLAI_AUX+FT+3SM
VP+KI_AUX+PT+3SM VP+THOLAI_AUX+FT+3SF
VP+KI_AUX+PT+3SF VP+THOLAI_AUX+FT+2S
VP+KI_AUX+PT+2S VP+THOLAI_AUX+FT+1S
VP+KI_AUX+PT+1P VP+THOLAI_AUX+FT+1P
VP+KI_AUX+PT+1S VP+THALLU_AUX+PT+3SE
VP+KI_AUX+PRT+3SE VP+THALLU_AUX+PT+3SF
VP+KI_AUX+PRT+3SF VP+THALLU_AUX+PT+3SM
VP+KI_AUX+PRT+3SM VP+THALLU_AUX+PT+2S
VP+KI_AUX+PRT+2S VP+THALLU_AUX+PT+1P
VP+KI_AUX+PRT+1P VP+THALLU_AUX+PT+1S
VP+KI_AUX+PRT+1S VP+THALLU_AUX+FT+3SE
VP+KI_AUX+FT+3SE VP+THALLU_AUX+FT+3SM
VP+KI_AUX+FT+3SM VP+THALLU_AUX+FT+3SF
VP+KI_AUX+FT+3SF VP+THALLU_AUX+FT+2S
VP+KI_AUX+FT+2S VP+THALLU_AUX+FT+1S
VP+KI_AUX+FT+1P VP+THALLU_AUX+FT+1P
VP+KI_AUX+FT+1S VP+KIZI_AUX+PT+3SE
VP+AUX_aayiRRu VP+KIZI_AUX+PT+3SM
VP+FT+POODU_AUX+PT+3SE VP+KIZI_AUX+PT+3SF
VP+FT+POODU_AUX+PT+3SM VP+KIZI_AUX+PT+2S
VP+FT+POODU_AUX+PT+3SF VP+KIZI_AUX+PT+1S
VP+FT+POODU_AUX+PT+2S VP+KIZI_AUX+PT+1P
VP+FT+POODU_AUX+PT+1S VP+KIZI_AUX+FT+3SE
VP+FT+POODU_AUX+PT+1P VP+KIZI_AUX+FT+3SM
VP+FT+POODU_AUX+PRT+3SE VP+KIZI_AUX+PRT+3SF
VP+FT+POODU_AUX+PRT+3SM VP+KIZI_AUX+FT+2S
VP+FT+POODU_AUX+PRT+3SF VP+KIZI_AUX+FT+1S
VP+FT+POODU_AUX+PRT+2S VP+KIZI_AUX+FT+1P
VP+FT+POODU_AUX+PRT+1S VP+KIZI_AUX+PRT+3SE
VP+FT+POODU_AUX+PRT+1P VP+KIZI_AUX+PRT+3SM
VP+FT+POODU_AUX+FT+3SE VP+KIZI_AUX+PRT+3SF
VP+FT+POODU_AUX+FT+3SM VP+KIZI_AUX+PRT+2S
VP+FT+POODU_AUX+FT+3SF VP+KIZI_AUX+PRT+1S
VP+FT+POODU_AUX+FT+2S VP+KIZI_AUX+PRT+1P
VP+FT+POODU_AUX+FT+1P VP+KIZI_AUX+FT+3SE
VP+FT+POODU_AUX+FT+1S VP+KIZI_AUX+FT+3SM
VP+THOLAI_AUX+PT+3SE VP+KIZI_AUX+FT+3SF
VP+THOLAI_AUX+PT+3SM VP+KIZI_AUX+FT+2S
VP+THOLAI_AUX+PT+3SF VP+KIZI_AUX+FT+1S
VP+THOLAI_AUX+PT+2S VP+KIZI_AUX+FT+1P
VP+THOLAI_AUX+PT+1P VP+KIDA_AUX+PT+3SE
VP+THOLAI_AUX+PT+3SM VP+KIDA_AUX+PT+3SM
VP+THOLAI_AUX+PRT+3SE VP+KIDA_AUX+PT+3SF
VP+THOLAI_AUX+PRT+3SF VP+KIDA_AUX+PT+2S
VP+KIDA_AUX+PT+1S INF+AUX_attum
VP+KIDA_AUX+PT+1P INF+MOD_vendum
VP+KIDA_AUX+PRT+3SE INF+MOD_vendam
VP+KIDA_AUX+PRT+3SM INF+MOD_koodum
VP+KIDA_AUX+PRT+3SF INF+MOD_koodathu
VP+KIDA_AUX+PRT+2S INF+MAATTU_AUX+3SM
VP+KIDA_AUX+PRT+1S INF+MAATTU_AUX+3SF
VP+KIDA_AUX+PRT+1P INF+MAATTU_AUX+3SE
VP+KIDA_AUX+FT+3SE INF+MAATTU_AUX+1S
VP+KIDA_AUX+FT+3SM INF+MAATTU_AUX+2S
VP+KIDA_AUX+FT+3SF INF+MAATTU_AUX+1P
VP+KIDA_AUX+FT+2S INF+MOD_illai
VP+KIDA_AUX+FT+1S INF+IYAL_AUX+RP_UM
VP+KIDA_AUX+FT+1P INF+IYAL_AUX+FT_3SN
VP+THEER_AUX+PT+3SE INF+IYAL_AUX+PT+3SN
VP+THEER_AUX+PT+3SM INF+IYAL_AUX+PRT+3SN
VP+THEER_AUX+PT+3SF INF+MUDI_AUX+PT+3SN
VP+THEER_AUX+PT+2S INF+MUDI_AUX+PRT+3SN
VP+THEER_AUX+PT+1P INF+MUDI_AUX+FT_3SN
VP+THEER_AUX+PT+1S INF+MUDI_AUX+RP_UM
VP+THEER_AUX+PRT+3SE INF+IRU_AUX+PT+3SM
VP+THEER_AUX+PRT+3SM INF+IRU_AUX+PT+3SF
VP+THEER_AUX+PRT+3SF INF+IRU_AUX+PT+3SE
VP+THEER_AUX+PRT+2S INF+POO_AUX+PT+3SM
VP+THEER_AUX+PRT+1P INF+POO_AUX+PT+3SF
VP+THEER_AUX+PRT+1S INF+POO_AUX+PT+3SE
VP+THEER_AUX+FT+3SE INF+VAA_AUX+PT+3SM
VP+THEER_AUX+FT+3SM INF+VAA_AUX+PT+3SF
VP+THEER_AUX+FT+3SF INF+VAA_AUX+PT+3SE
VP+THEER_AUX+FT+2S INF+PAAR_AUX+PT+3SM
VP+THEER_AUX+FT+1P INF+PAAR_AUX+PT+3SF
VP+THEER_AUX+FT+1S INF+PAAR_AUX+PT+3SE
VP+MUDI_AUX+PT+3SE INF+VAI_AUX+PT+3SE
VP+MUDI_AUX+PT+3SM INF+VAI_AUX+PT+3SM
VP+MUDI_AUX+PT+3SF INF+VAI_AUX+PT+3SF
VP+MUDI_AUX+PT+2S INF+PANNU_AUX+PT+3SM
VP+MUDI_AUX+PT+1P INF+PANNU_AUX+PT+3SF
VP+MUDI_AUX+PT+1S INF+PANNU_AUX+PT+3SE
VP+MUDI_AUX+PRT+3SE INF+SEY_AUX+PT+3SM
VP+MUDI_AUX+PRT+3SM INF+SEY_AUX+PT+3SF
VP+MUDI_AUX+PRT+3SF INF+SEY_AUX+PT+3SE
VP+MUDI_AUX+PRT+2S INF+PERU_AUX+PT+3SM
VP+MUDI_AUX+PRT+1P INF+PERU_AUX+PT+3SF
VP+MUDI_AUX+PRT+1S INF+PERU_AUX+PT+3SE
VP+MUDI_AUX+FT+3SE INF+PADU_AUX+PT+3SN
VP+MUDI_AUX+FT+3SM INF+PADU_AUX+PT+3SM
VP+MUDI_AUX+FT+3SF INF+PADU_AUX+PT+3SF
VP+MUDI_AUX+FT+2S INF+PADU_AUX+PT+3SE
VP+MUDI_AUX+FT+1S INF+PADU_AUX+PRT+3SN
VP+MUDI_AUX+FT+1P INF+PADU_AUX+PRT+3SM
INF+PADU_AUX+PRT+3SF VP+PAAR_AUX+INF+MOD_vendum
INF+PADU_AUX+PRT+3SE VP+IRU_AUX+INF+MOD_vendum
INF+PADU_AUX+FT_3SN VP+KODU_AUX+INF+MOD_vendu
INF+PADU_AUX+RP_UM VP+POO_AUX+INF+MOD_vendum
INF+VVA_AUX+PT+3SN VP+VIDU_AUX+INF+MOD_vendum
INF+VVA_AUX+PRT+3SN VP+KI_AUX+INF+MOD_vendum
INF+VVA_AUX+FT_3SN VP+POODU_AUX+INF+MOD_vendu
INF+VVA_AUX+RP_UM VP+THALLU_AUX+INF+MOD_vendum
INF+VI_AUX+PT+3SN VP+KIDA_AUX+INF+MOD_vendum
INF+VI_AUX+PT+3SN VP+THEER_AUX+INF+MOD_vendum
INF+VI_AUX+PRT+3SN VP+MUDI_AUX+INF+MOD_vendum
INF+VI_AUX+PRT+3SN VP+PAAR_AUX+INF+MOD_vendum
INF+VI_AUX+FT_3SN VP+SEY_AUX+INF+MOD_vendum
INF+VI_AUX+RP_UM VP+KAADDU_AUX+INF+MOD_venda
VP+KAADDU_AUX+PT+3SM VP+VAI_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PT+3SF VP+THOLAI_AUX+INF+MOD_venda
VP+KAADDU_AUX+PT+3SE VP+KIZI_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PT+2S VP+PAAR_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PT+1S VP+IRU_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PT+1P VP+KODU_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PRT+3SM VP+POO_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PRT+3SF VP+VIDU_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PRT+3SE VP+KI_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PRT+2S VP+POODU_AUX+INF+MOD_vendam
VP+KAADDU_AUX+PRT+1P VP+THALLU_AUX+INF+MOD_vendam
VP+KAADDU_AUX+FT+3SE VP+KIDA_AUX+INF+MOD_vendam
VP+KAADDU_AUX+FT+3SM VP+THEER_AUX+INF+MOD_vendam
VP+KAADDU_AUX+FT+3SF VP+MUDI_AUX+INF+MOD_venda
VP+KAADDU_AUX+FT+2S VP+PAAR_AUX+INF+MOD_vendam
VP+KAADDU_AUX+FT+1P VP+SEY_AUX+INF+MOD_vendam
VP+VAI_AUX+PT+3SE VP+PAAR_AUX+INF+MOD_illai
VP+VAI_AUX+PT+3SM VP+IRU_AUX+INF+MOD_illai
VP+VAI_AUX+PT+3SF VP+KODU_AUX+INF+MOD_illai
VP+VAI_AUX+PT+2S VP+POO_AUX+INF+MOD_illai
VP+VAI_AUX+PT+1P VP+VIDU_AUX+INF+MOD_illai
VP+VAI_AUX+PT+1S VP+KI_AUX+INF+MOD_illai
VP+VAI_AUX+PRT+3SE VP+POODU_AUX+INF+MOD_illai
VP+VAI_AUX+PRT+2S VP+THOLAI_AUX+INF+MOD_illai
VP+VAI_AUX+PRT+3SM VP+THALLU_AUX+INF+MOD_illai
VP+VAI_AUX+PRT+3SF VP+KIZI_AUX+INF+MOD_illai
VP+VAI_AUX+PRT+1P VP+KIDA_AUX+INF+MOD_illai
VP+VAI_AUX+FT+3SE VP+THEER_AUX+INF+MOD_illai
VP+VAI_AUX+FT+3SM VP+MUDI_AUX+INF+MOD_illai
VP+VAI_AUX+FT+3SF VP+VAA_AUX+INF+MOD_illai
VP+VAI_AUX+FT+2S VP+VAI_AUX+INF+MOD_illai
VP+VAI_AUX+FT+1P VP+KAADDU_AUX+INF+MOD_illai
VP+KAADDU_AUX+INF+MOD_vend VP+KODU_AUX+INF+MOD_illai
VP+VAI_AUX+INF+MOD_vendum INF+VAA_AUX+INF+MOD_illai
VP+THOLAI_AUX+INF+MOD_vendu INF+VAI_AUX+INF+MOD_illai
VP+KIZI_AUX+INF+MOD_vendum INF+SEY_AUX+INF+MOD_illai
INF+PANNU_AUX+INF+MOD_illai INF+PADU_AUX+VP+VIDU_AUX+RP_UM
INF+PERU_AUX+INF+MOD_illai INF+PADU_AUX+VP+VIDU_AUX+PT+3SN
INF+PADU_AUX+INF+MOD_illai VP+KII_AUX+PRT+1S
INF+VENDU_AUX+VP+NOM_athu+ VP+KII_AUX+PT+1S
VP+KII_AUX+FT+1S
INF+VENDU_AUX+VP+3SN+MOD_illai VP+KII_AUX+PRT+1P
FT+NOM_athu+MOD_illai VP+KII_AUX+PT+1P
FT+3SN+MOD_illai VP+KII_AUX+FT+1P
FT+NOM_athu+CM_dat+MOD_illai VP+KII_AUX+PRT+2S
INF+UL_AUX+RP+3SM VP+KII_AUX+PT+2S
INF+UL_AUX+RP+3SF VP+KII_AUX+FT+1S
INF+UL_AUX+RP+3PE VP+KII_AUX+PRT+3SM
INF+UL_AUX+RP+3SN VP+KII_AUX+PT+3SM
Ungal VP+KII_AUX+FT+3SM
INF+CL_um VP+KII_AUX+PRT+3SF
INF+PADU_AUX+VP+IRU_AUX+PT+3S VP+KII_AUX+PT+3SF
INF+PADU_AUX+VP+IRU_AUX+PT+3 VP+KII_AUX+FT+3SF
VP+KII_AUX+PRT+3SN
INF+PADU_AUX+VP+IRU_AUX+PT+3 VP+KII_AUX+PT+3SN
VP+KII_AUX+FT+3SN
INF+PADU_AUX+VP+IRU_AUX+PT+2 VP+KII_AUX+PRT+3PE
VP+KII_AUX+PT+3PE
INF+PADU_AUX+VP+IRU_AUX+PT+1s VP+KII_AUX+FT+3PE
INF+PADU_AUX+VP+IRU_AUX+PT+1S VP+IRU_AUX+PRT+1P
INF+PADU_AUX+VP+IRU_AUX+PRT+3SE VP+IRU_AUX+PRT+2S
INF+PADU_AUX+VP+IRU_AUX+P3SE+PL VP+IRU_AUX+PT+2S
INF+PADU_AUX+VP+IRU_AUX+PRT+3SF VP+IRU_AUX+PRT+3SN
INF+PADU_AUX+VP+IRU_AUX+PRT+3S VP+IRU_AUX+PT+3SN
VP+IRU_AUX+FT+3SN
INF+PADU_AUX+VP+IRU_AUX+PRT+2S VP+KI_AUX+PRT+3SN
INF+PADU_AUX+VP+IRU_AUX+PRT+1P VP+KI_AUX+PT+3SN
INF+PADU_AUX+VP+IRU_AUX+PRT+1S VP+KI_AUX+FT+3SN
INF+PADU_AUX+VP+IRU_AUX+PRT+3S VP+IRU_AUX+PRT+3PE
VP+IRU_AUX+PT+3PE
INF+PADU_AUX+VP+IRU_AUX+FT+3SE VP+IRU_AUX+FT+3PE
INF+PADU_AUX+VP+IRU_AUX+FT+3SM VP+KI_AUX+PRT+3PE
INF+PADU_AUX+VP+IRU_AUX+FT+3SF VP+KI_AUX+PT+3PE
INF+PADU_AUX+VP+IRU_AUX+FT+2S VP+KI_AUX+FT+3PE
INF+PADU_AUX+VP+IRU_AUX+FT+1P VP+IRU_AUX+FT_3SN
INF+PADU_AUX+VP+IRU_AUX+FT+1S VP+IRU_AUX+RP_UM
INF+PADU_AUX+VP+IRU_AUX+PT+3SN VP+KI_AUX+FT_3SN
INF+PADU_AUX+VP+IRU_AUX+FT_3SN VP+KI_AUX+RP_UM
INF+PADU_AUX+VP+IRU_AUX+RP_UM VP+KII_AUX+FT_3SN
INF+PADU_AUX+VP+IRU_AUX+INF+MOD VP+KII_AUX+RP_UM
INF+PADU_AUX+VP+IRU_AUX+INF+MOD
INF+PADU_AUX+VP+UL_AUX+RP+3PE
INF+PADU_AUX+VP+VIDU_AUX+FT_3SN
B.4 TAMIL NOUN WORD FORMS
நரம்பிைனக்குறித் நரம்பிைனப்ேபால்
நரம்பினத்ைதக்குறித் நரம்பினத்ைதப்ேபால்
நரம்ைபப்பார்த் நரம் மாதிாி
நரம்பிைனப்பார்த் நரம்ைபமாதிாி
நரம்பினத்ைதப்பார்த் நரம்ைபவிட
நரம்ைபேநாக்கி நரம்பிைனவிட
நரம்பினத்ைதேநாக்கி நரம்பினத்ைதவிட
நரம்பிைனேநாக்கி நரம் க்குப்பதிலாக
நரம்ைபச்சுற்றி நரம் க்காக
நரம்பினத்ைதச்சுற்றி நரம்பி க்காக
நரம்பிைனச்சுற்றி நரம் க்குப்பிறகு
நரம்ைபத்தாண் நரம் க்கப் றம்
நரம்பிைனத்தாண் நரம் க்கப்பால்
நரம்பினத்ைதத்தாண் நரம் க்குேமல்
நரம்ைபத்தவிர்த் நரம் க்குேமேல
நரம்பிைனத்தவிர்த் நரம் க்குங்கீழ்
நரம்பினத்ைதத்தவிர்த் நரம் க்குங்கீேழ
நரம்ைபத்தவிர நரம் க்குள்
நரம்பிைனத்தவிர நரம் க்குள்ேள
நரம்பினத்ைதத்தவிர நரம் க்குெவளியில்
நரம்ெபாழிய நரம் க்குெவளிேய
நரம்ைபெயாழிய நரம் க்க யில்
நரம்பினத்ைதெயாழிய நரம் க்க கில்
நரம்ைபெயாட் நரம் ன்
நரம்பிைனெயாட் நரம் க்கு ன்
நரம்பினத்ைதெயாட் நரம் ன்னால்
நரம்ைபக்ெகாண் நரம் க்கு ன்னால்
நரம்பிைனக்ெகாண் நரம் பின்னால்
நரம்பினத்ைதக்ெகாண் நரம் க்குப்பின்னால்
நரம்ைபைவத் நரம் க்குப்பின்
நரம்பினத்ைதைவத் நரம் க்குப்பிந்தி
நரம்பிைனைவத் நரம் க்குகு க்ேக
நரம்ைபவிட் நரம் க்குள்
நரம்பிைனவிட் நரம்பி ள்
நரம்பினத்ைதவிட் நரம் க்குள்ேள
நரம்ைபப்ேபால நரம்பி க்குள்ேள
நரம்பிைனப்ேபால நரம்ெபதிேர
நரம்பினத்ைதப்ேபால நரம் க்ெகதிேர
நரம்ைபப்ேபால் நரம்ெபதிர்க்கு
நரம்ெபதிாில் நரம் கள்வைரக்கும்
நரம் க்ெகதிாில் நரம் கள்வைரயில்
நரம் க்ெகதிர்த்தாற்ேபால் நரம் களில்லாமல்
நரம் க்கிைடயில் நரம் களல்லாமல்
நரம் க்குந வில் நரம் களாட்டம்
நரம் க்க த்தாற்ேபால் நரம் கள் தல்
நரம்பி ந் நரம் களின்ப
நரம் டன் நரம் களின்வழியாக
நரம்பிடமி ந் நரம் களிடம்
நரம் லமி ந் நரம் களின் லம்
நரம் ப்பக்கமி ந் நரம் களின்பக்கம்
நரம்பண்ைடயி ந் நரம் களினண்ைட
நரம்ப ேகயி ந் நரம் கள ேக
நரம் க்க ேகயி ந் நரம் களின ேக
நரம்ப கி ந் நரம் களின கில்
நரம் க்க கி ந் நரம் கள்கிட்ட
நரம் கிட்டயி ந் நரம் களின்ேமல்
நரம் ேம ந் நரம் களின்ேமேல
நரம் க்குேம ந் நரம் களின்கீழ்
நரம் ேமேலயி ந் நரம் களின்கீேழ
நரம் க்குேமேலயி ந் நரம் கைளப்பற்றி
நரம் க்குங்கீழி ந் நரம் கைளக்குறித்
நரம்பின்கீழி ந் நரம் கைளப்பார்த்
நரம்பதின்கீழி ந் நரம் கைளேநாக்கி
நரம்பின்கீேழயி ந் நரம் கைளச்சுற்றி
நரம்பதன்கீேழயி ந் நரம் கைளத்தாண்
நரம் கள் நரம் கைளத்தவிர்த்
நரம் கைள நரம் கைளத்தவிர
நரம் களிைன நரம் கைளெயாழிய
நரம் களின் நரம் கைளெயாட்
நரம் க க்காக நரம் கைளக்ெகாண்
நரம் க க்கான நரம் கைளைவத்
நரம் க க்கு நரம் கைளவிட்
நரம் களில் நரம் கைளப்ேபால்
நரம் க டன் நரம் கைளப்ேபால
நரம் கேளா நரம் கைளமாதிாி
நரம் கள நரம் கைளவிட
நரம் களால் நரம் க க்குப்பதிலாக
நரம் க டன் நரம் க க்காக
B.5 TAMIL VERB WORD FORMS
ப ப க்கின்ற
ப த்தான் ப க்காத
ப த்தாள் ப த்தவன்
ப த்தார் ப த்தவள்
ப த்தார்கள் ப த்தவர்
ப த்த ப த்த
ப த்தன ப க்கின்றவன்
ப த்தாய் ப க்கின்றவள்
ப த்தீர் ப க்கின்றவர்
ப த்தீர்கள் ப க்கின்ற
ப த்ேதன் ப ப்பவன்
ப த்ேதாம் ப ப்பவள்
ப க்கிறான் ப ப்பவர்
ப க்கிறாள் ப ப்ப
ப க்கிறார் ப க்காதவன்
ப க்கிறார்கள் ப க்காதவள்
ப க்கின்ற ப க்காதவர்
ப க்கின்றன ப க்காத
ப க்கின்றாய் ப த்தைவ
ப க்கின்றீர் ப க்காதன
ப க்கின்றீர்கள் ப த்
ப க்கின்ேறன் ப க்க
ப க்கின்ேறாம் ப க்கா
ப ப்பான் ப க்காமல்
ப ப்பாள் ப த்தல்
ப ப்பார் ப க்காைம
ப ப்பார்கள் ப க்கலாம்
ப ப்ப ப க்கலாகும்
ப ப்பன ப க்கலாகா
ப ப்பாய் ப த் ப்பார்த்தான்
ப ப்பீர் ப த் ப்பார்த்தாள்
ப ப்பீர்கள் ப த் ப்பார்த்தார்
ப ப்ேபன் ப த் ப்பார்த்தாய்
ப ப்ேபாம் ப த் ப்பார்த்ேதாம்
ப க்கும் ப த் ப்பார்த்ேதன்
ப த்த ப த் ப்பார்க்கின்றார்
ப த் ப்பார்க்கின்றார்கள் ப த் க்ெகா த்ேதாம்
ப த் ப்பார்க்கின்றாள் ப த் க்ெகா க்கிறான்
ப த் ப்பார்க்கின்றான் ப த் க்ெகா க்கிறாள்
ப த் ப்பார்க்கின்ேறாம் ப த் க்ெகா க்கிறார்
ப த் ப்பார்க்கின்றாய் ப த் க்ெகா க்கிறார்கள்
ப த் ப்பார்க்கின்ேறன் ப த் க்ெகா க்கின்ற
ப த் ப்பார்ப்பார் ப த் க்ெகா க்கின்றன
ப த் ப்பார்ப்பாள் ப த் க்ெகா க்கின்றாய்
ப த் ப்பார்ப்பான் ப த் க்ெகா க்கின்றீர்
ப த் ப்பார்ப்பாய் ப த் க்ெகா க்கின்றீர்கள்
ப த் ப்பார்ப்ேபன் ப த் க்ெகா க்கின்ேறன்
ப த் ப்பார்ப்ேபாம் ப த் க்ெகா க்கின்ேறாம்
ப த்தி ந்தார் ப த் க்ெகா ப்பான்
ப த்தி ந்தாள் ப த் க்ெகா ப்பாள்
ப த்தி ந்தான் ப த் க்ெகா ப்பார்
ப த்தி ந்தா ப த் க்ெகா ப்பார்கள்
ப த்தி ந்ேதன் ப த் க்ெகா ப்ப
ப த்தி ந்ேதாம் ப த் க்ெகா ப்பன
ப த்தி க்கின்றார் ப த் க்ெகா ப்பாய்
ப த்தி க்கின்றாள் ப த் க்ெகா ப்பீர்
ப த்தி க்கின்றான் ப த் க்ெகா ப்பீர்கள்
ப த்தி க்கின்ேறன் ப த் க்ெகா ப்ேபன்
ப த்தி க்கின்ேறாம் ப த் க்ெகா ப்ேபாம்
ப த்தி ப்பார் ப த் ப்ேபானான்
ப த்தி ப்பாள் ப த் ப்ேபானாள்
ப த்தி ப்பான் ப த் ப்ேபானார்
ப த்தி ப்பாய் ப த் ப்ேபானார்கள்
ப த்தி ப்ேபன் ப த் ப்ேபான
ப த்தி ப்ேபாம் ப த் ப்ேபாயின
ப த் க்ெகா த்தான் ப த் ப்ேபானாய்
ப த் க்ெகா த்தாள் ப த் ப்ேபானீர்
ப த் க்ெகா த்தார் ப த் ப்ேபானீர்கள்
ப த் க்ெகா த்தார் ப த் ப்ேபாேனன்
ப த் க்ெகா த்த ப த் ப்ேபாேனாம்
ப த் க்ெகா த்தன ப த் ப்ேபாகிறான்
ப த் க்ெகா த்தாய் ப த் ப்ேபாகிறாள்
ப த் க்ெகா த்தீர் ப த் ப்ேபாகிறார்
ப த் க்ெகா த்தீர்கள் ப த் ப்ேபாகிறார்கள்
ப த் க்ெகா த்ேதன் ப த் ப்ேபாகின்ற
ப த் ப்ேபாகின்றன ப த் க்ெகாண் ந்ேதாம்
ப த் ப்ேபாகின்றாய் ப த் க்ெகாண் ந்ேதன்
ப த் ப்ேபாகின்றீர் ப த் க்ெகாண் க்கிறார்
ப த் ப்ேபாகின்றீர்கள் ப த் க்ெகாண் க்கிறாள்
ப த் ப்ேபாகின்ேறன் ப த் க்ெகாண் க்கிறான்
ப த் ப்ேபாகின்ேறாம் ப த் க்ெகாண் க்கிறாய்
ப த் ப்ேபாவான் ப த் க்ெகாண் க்கிேறாம்
ப த் ப்ேபாவாள் ப த் க்ெகாண் க்கிேறன்
ப த் ப்ேபாவார் ப த் க்ெகாண் ப்பார்
ப த் ப்ேபாவார்கள் ப த் க்ெகாண் ப்பான்
ப த் ப்ேபாவ ப த் க்ெகாண் ப்பாள்
ப த் ப்ேபாவன ப த் க்ெகாண் ப்பாய்
ப த் ப்ேபாவாய் ப த் க்ெகாண் ப்ேபாம்
ப த் ப்ேபா ர் ப த் க்ெகாண் ப்ேபன்
ப த் ப்ேபா ர்கள் ப த்தாயிற்
ப த் ப்ேபாேவன் ப த் ப்ேபாட்டார்
ப த் ப்ேபாேவாம் ப த் ப்ேபாட்டான்
ப த் விட்டார் ப த் ப்ேபாட்டாள்
ப த் விட்டாள் ப த் ப்ேபாட்டாய்
ப த் விட்டான் ப த் ப்ேபாட்ேடன்
ப த் விட்டாய் ப த் ப்ேபாட்ேடாம்
ப த் விட்ேடாம் ப த் ப்ேபா கிறார்
ப த் விட்ேடன் ப த் ப்ேபா கிறான்
ப த் வி கின்றார் ப த் ப்ேபா கிறாள்
ப த் வி கின்றான் ப த் ப்ேபா கிறாய்
ப த் வி கின்றாள் ப த் ப்ேபா கிேறன்
ப த் வி கின்றாய் ப த் ப்ேபா கிேறாம்
ப த் வி கின்ேறாம் ப த் ப்ேபா வார்
ப த் வி கிேறன் ப த் ப்ேபா வான்
ப த் வி வார் ப த் ப்ேபா வாள்
ப த் வி வாள் ப த் ப்ேபா வாய்
ப த் வி வான் ப த் ப்ேபா ேவாம்
ப த் வி வாய் ப த் ப்ேபா ேவன்
ப த் வி ேவன் ப த் த்ெதாைலத்தார்
ப த் வி ேவாம் ப த் த்ெதாைலத்தான்
ப த் க்ெகாண் ந்தார் ப த் த்ெதாைலத்தாள்
ப த் க்ெகாண் ந்தான் ப த் த்ெதாைலத்தாய்
ப த் க்ெகாண் ந்தாள் ப த் த்ெதாைலேதாம்
ப த் க்ெகாண் ந்தாய் ப த் த்ெதாைலேதான்
ப த் த்ெதாைலகிறார் ப த் க்கிழிக்கிறாய்
ப த் த்ெதாைலகிறாள் ப த் க்கிழிக்கிேறன்
ப த் த்ெதாைலகிறான் ப த் க்கிழிக்கிேறாம்
ப த் த்ெதாைலகிறாய் ப த் க்கிழிப்பார்
ப த் த்ெதாைலகிேறாம் ப த் க்கிழிப்பான்
ப த் த்ெதாைலகிேறன் ப த் க்கிழிப்பாள்
ப த் த்ெதாைலப்பார் ப த் க்கிழிப்பாய்
ப த் த்ெதாைலப்பான் ப த் க்கிழிப்ேபன்
ப த் த்ெதாைலப்பாள் ப த் க்கிழிப்ேபாம்
ப த் த்ெதாைலப்பாய் ப த் க்கிடந்தார்
ப த் த்ெதாைலப்ேபன் ப த் க்கிடந்தான்
ப த் த்ெதாைலப்ேபாம் ப த் க்கிடந்தாள்
ப த் த்தள்ளினார் ப த் க்கிடந்தாய்
ப த் த்தள்ளினாள் ப த் க்கிடந்ேதன்
ப த் த்தள்ளினான் ப த் க்கிடந்ேதாம்
ப த் த்தள்ளினாய் ப த் க்கிடக்கிறார்
ப த் த்தள்ளிேனாம் ப த் க்கிடக்கிறான்
ப த் த்தள்ளிேனன் ப த் க்கிடக்கிறாள்
ப த் த்தள் வார் ப த் க்கிடக்கிறாய்
ப த் த்தள் வான் ப த் க்கிடக்கிேறன்
ப த் த்தள் வாள் ப த் க்கிடக்கிேறாம்
ப த் த்தள் வாய் ப த் க்கிடப்பார்
ப த் த்தள் ேவன் ப த் க்கிடப்பான்
ப த் த்தள் ேவாம் ப த் க்கிடப்பாள்
ப த் க்கிழித்தார் ப த் க்கிடப்பாய்
ப த் க்கிழித்தான் ப த் க்கிடப்ேபன்
ப த் க்கிழித்தாள் ப த் க்கிடப்ேபாம்
ப த் க்கிழித்தாய் ப த் த்தீர்த்தார்
ப த் க்கிழித்ேதன் ப த் த்தீர்த்தான்
ப த் க்கிழித்ேதாம் ப த் த்தீர்த்தாள்
ப த் க்கிழிப்பார் ப த் த்தீர்த்தாய்
ப த் க்கிழிப்பான் ப த் த்தீர்த்ேதாம்
ப த் க்கிழிக்கிறாள் ப த் த்தீர்த்ேதன்
ப த் க்கிழிப்பாய் ப த் த்தீர்க்கிறார்
ப த் க்கிழிப்ேபன் ப த் த்தீர்க்கிறான்
ப த் க்கிழிப்ேபாம் ப த் த்தீர்க்கிறாள்
ப த் க்கிழிக்கிறார் ப த் த்தீர்க்கிறாய்
ப த் க்கிழிக்கிறான் ப த் த்தீர்க்கிேறாம்
ப த் க்கிழிக்கிறாள் ப த் த்தீர்க்கிேறன்
ப த் த்தீர்ப்பார் ப க்க ந்த
ப த் த்தீர்ப்பான் ப க்க கிற
ப த் த்தீர்ப்பாள் ப க்க ம்
ப த் த்தீர்ப்பாய் ப க்கயி ந்தான்
ப த் த்தீர்ப்ேபாம் ப க்கயி ந்தாள்
ப த் த்தீர்ப்ேபன்
ப த் த்தார்
ப த் த்தான்
ப த் த்தாள்
ப த் த்தாய்
ப த் த்ேதாம்
ப த் த்ேதன்
ப த் க்கிறார்
ப த் க்கிறான்
ப த் க்கிறாள்
ப த் க்கிறாய்
ப த் க்கிேறாம்
ப த் க்கிேறன்
ப த் ப்பார்
ப த் ப்பான்
ப த் ப்பாள்
ப த் ப்பாய்
ப த் ப்ேபன்
ப த் ப்ேபாம்
ப க்கட் ம்
ப க்கேவண் ம்
ப க்கேவண்டாம்
ப க்கக்கூ ம்
ப க்கக்கூடா
ப க்கமாட்டான்
ப க்கமாட்டாள்
ப க்கமாட்டார்
ப க்கமாட்ேடன்
ப க்கமாட்டாய்
ப க்கமாட்ேடாம்
ப க்கவில்ைல
ப க்கயிய ம்
ப க்கயியன்ற
ப க்கவிய கிற
B.6 MOSES INSTALLATION AND TRAINING
This subsection explains the installation of the Moses toolkit and the issues that arise while training the system. Remedies for these issues are also given in detail.
cd smt/moses/tools/gizapp
make
If make fails, install the missing dependencies:
1) yum install gcc
2) yum install gcc-c++
3) yum install glibc-static (for the "can't find -lm" error)
4) yum install libstdc++-static (for the "can't find -lstdc++" error)
cd ../
mkdir bin
cp giza-pp/GIZA++-v2/GIZA++ bin/
cp giza-pp/mkcls-v2/mkcls bin/
cp giza-pp/GIZA++-v2/snt2cooc.out bin/
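A simple way to verify the copy step is to list the bin directory; the expected listing follows directly from the three cp commands above:
ls bin/
# GIZA++  mkcls  snt2cooc.out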
cd srilm
make World
make all
export PATH=/home/anand/smt/moses/tools/srilm/bin/i686:/home/jschroe1/demo/tools/srilm/bin:$PATH
./regenerate-makefiles.sh
./configure --with-srilm=/home/anand/smt/moses/tools/srilm --with-irstlm=/home/anand/smt/moses/tools/irstlm
make -j 2
To confirm setup:
cd /home/anand/smt/moses
mkdir data
cp sample-models
cd sample-models/phrase-model/
../../../tools/moses/moses-cmd/src/moses -f moses.ini < in > out
cd ../../../tools/
mkdir moses-scripts
cd moses/scripts
In the Makefile here, change the TARGETDIR and BINDIR settings as follows:
< TARGETDIR?=/home/s0565741/terabyte/bin
< BINDIR?=/home/s0565741/terabyte/bin
---
> TARGETDIR?=/home/anand/smt/moses/tools/moses-scripts
> BINDIR?=/home/anand/smt/moses/tools/bin
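With these defaults changed, the scripts are installed by running make from the same directory; in the Moses releases of this period the usual target is release (an assumption here, since the original notes jump straight to the export):
make release
# installs the scripts into a time-stamped directory scripts-YYYYMMDD-HHMM under TARGETDIR
The export below points SCRIPTS_ROOTDIR at that time-stamped directory.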
export SCRIPTS_ROOTDIR=/home/anand/smt/moses/tools/moses-scripts/scripts-YYYYMMDD-HHMM
Additional Scripts
cd ../../
Extract the additional scripts, and also copy mteval-v11b.pl.
In /home/anand/smt/moses/tools/moses-scripts/scripts-20110204-0333/training/mert-moses.pl, insert "python " before $cmertdir in the line
$SCORENBESTCMD = "$cmertdir/score-nbest.py" if ! defined $SCORENBESTCMD;
so that score-nbest.py is invoked through the Python interpreter rather than executed directly.
Training
SMT-project/moses/bin/moses-scripts/scripts-20090302-0358/training/train-factored-phrase-model.perl \
  -scripts-root-dir SMT-project/moses/bin/moses-scripts/scripts-20090302-0358/ \
  -root-dir . -corpus SMT-project/smt-cmd/corpus/corpus \
  -f en -e ma -alignment grow-diag-final \
  -reordering msd-bidirectional-fe \
  -lm 0:4:SMT-project/smt-cmd/lm/monolingual.lm:0
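The language model passed via -lm above (monolingual.lm, an order-4 model over the surface factor) can be built with SRILM's ngram-count. A minimal sketch, assuming a tokenized monolingual target-language file mono.txt (an illustrative name):
ngram-count -order 4 -interpolate -kndiscount \
  -text mono.txt -lm monolingual.lm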
Testing
bin/moses-scripts/scripts-20100311-1743/training/train-factored-phrase-model.perl \
  -scripts-root-dir bin/moses-scripts/scripts-20100311-1743/ \
  -root-dir running_files -corpus pc/twofactfour -f eng -e tam \
  -lm 0:4:/root/Desktop/D5/srilm/bin/i686/300.lm:0 \
  -lm 2:5:/root/Desktop/D5/srilm/bin/i686/300.pos.lm:0 \
  -lm 3:5:/root/Desktop/D5/srilm/bin/i686/300.morph.lm:0 \
  --alignment-factors 0,1,2,3-0,1,2,3 \
  --translation-factors 0-0+1-1+2-2+3-3 \
  --reordering-factors 0-0+1-1+2-2+3-3 \
  --generation-factors 3-2+3,2-1+1,2,3-0 \
  --decoding-steps t3,t2,t1,g0,g1,g2
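Once training finishes, the configuration it writes can be used to decode held-out sentences with the moses binary, as in the setup check above. A minimal sketch, assuming a factored, tokenized input file test.en (an illustrative name) and the moses.ini generated under the -root-dir:
/home/anand/smt/moses/tools/moses/moses-cmd/src/moses \
  -f running_files/model/moses.ini < test.en > test.ta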
#########################
### MOSES CONFIG FILE ###
#########################
# input factors
[input-factors]
0
1
2
3
# mapping steps
[mapping]
0 T 0
0 T 1
# no generation models, no generation-file section
[distortion-limit]
6
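For completeness, a generated moses.ini also carries the model and weight sections. The sketch below shows their classic layout; the exact field order varies between Moses releases, and the paths and weight values are illustrative placeholders, not the thesis settings:
# translation table: implementation, source factors, target factors, number of scores, file
[ttable-file]
0 0 0 5 model/phrase-table.0-0.gz
# language model: type (0 = SRILM), factor, order, file
[lmodel-file]
0 0 4 lm/monolingual.lm
# translation, language model, distortion and word penalty weights
[weight-t]
0.2 0.2 0.2 0.2 0.2
[weight-l]
0.5
[weight-d]
0.3
[weight-w]
-1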
B.7 COMPARISON WITH GOOGLE OUTPUT
She did not come with me. அவள் என்னுடன் வரவில்லை.
B.8 GRAPHICAL USER INTERFACES
[Screenshots of the graphical user interfaces appear here in the original thesis.]
REFERENCES
[1] Lee, L. (2004). "I'm sorry Dave, I'm afraid I can't do that": Linguistics, statistics, and natural language processing circa 2001. In Committee on the Fundamentals of Computer Science: Challenges and Opportunities, Computer Science and Telecommunications Board, National Research Council (Eds.), Computer science: Reflections on the field, reflections from the field (pp. 111-118). Washington, DC: The National Academies Press.
[2] Jurafsky Daniel and Martin James H (2005), “An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition”,
Prentice Hall, ISBN: 0130950696, contributing writers: Andrew Kehler, Keith
Vander Linden, and Nigel Ward.
[3] Hutchins John 2001, Machine translation and human translation: in competition
or in complementation?, International Journal of Translation, 13, 1-2, p. 5-20
[5] http://www.pangea.com.mt/en/q2-why-statistical-mt/
[6] http://cordis.europa.eu/fp7/ict/language-technologies/project-
euromatrixplus_en.html
[7] Ma, Xiaoyi. 1999. Parallel text collections at the Linguistic Data Consortium. In
Machine Translation Summit VII, Singapore.
[9] http://en.wikipedia.org/wiki/Tamil_language
[10] Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proc.
EMNLP+CoNLL, pages 868–876, Prague
[11] http://en.wikipedia.org/wiki/Subject%E2%80%93verb%E2%80%93object
[12] Jesús Giménez and Lluís Màrquez (2006), SVMTool: Technical Manual v1.3, August 2006.
[15] Church, K. W. (1988). A stochastic parts program and noun phrase parser for
unrestricted text. In Proceedings of the Second Conference on Applied Natural
Language Processing, pages 136–143.
[19] Brill E (1992), "A simple rule based part of speech tagger", Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.
[20] Brill E (1993), “Automatic grammar induction and parsing free text: A
transformation based approach”, Proceedings of 31st Meeting of the
Association of Computational Linguistics, Columbus.
[21] Brill E (1993), “Transformation based error driven parsing”, Proceedings of the
Third International Workshop on Parsing Technologies, Tilburg, The
Netherlands.
[22] Brill E (1994), "Some advances in rule based part of speech tagging", Proceedings of The Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington.
[23] Prins R, and Van Noord G (2001), "Unsupervised POS-Tagging Improves Parsing Accuracy and Parsing Efficiency", Proceedings of the International Workshop on Parsing Technologies.
[28] Francis, W.N., & Kucera, H. (1982). Frequency analysis of English usage:
Lexicon and grammar. Boston: Houghton Mifflin.
[32] Yahya O and Mohamed Elhadj (2004), “Statistical Part-of-Speech Tagger for
Traditional Arabic Texts”, Journal of Computer Science 5 (11): 794-800, ISSN
1549-3636.
[36] Ray Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Adaptive Language Modeling Using The Maximum Entropy Principle. In Proceedings of the Human Language Technology Workshop, pages 108-113. ARPA.
[38] Jesús Giménez and Lluís Màrquez (2004), "SVMTool: A general POS tagger generator based on support vector machines", Proceedings of the 4th LREC Conference.
[40] Smriti Singh, Kuhoo Gupta, Manish Shrivastava and Pushpak Bhattacharyya
(2006), “Morphological richness offsets resource demand – experiences in
constructing a pos tagger for Hindi”, Proceedings of the COLING/ACL 2006,
Sydney, Australia Main Conference Poster Sessions, pp. 779–786.
[41] Manish Shrivastava and Pushpak Bhattacharyya, Hindi POS Tagger Using
Naive Stemming: Harnessing Morphological Information Without Extensive
Linguistic Knowledge, International Conference on NLP (ICON08), Pune,
India, December, 2008.
[42] Dalal Aniket, Kumar Nagaraj, Uma Sawant and Sandeep Shelke (2006), “Hindi
Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach”,
Proceedings of NLPAI-2006, Machine Learning Workshop on Part Of Speech
and Chunking for Indian Languages.
[44] Nidhi Mishra and Amit Mishra (2011), "Part of Speech Tagging for Hindi Corpus", International Conference on Communication Systems and Network Technologies.
[45] Pradipta Ranjan Ray, Harish V., Sudeshna Sarkar and Anupam Basu, (2003)
“Part of Speech Tagging and Local Word Grouping Techniques for Natural
Language Parsing in Hindi” , Indian Institute of Technology, Kharagpur,
INDIA 721302. www.mla.iitkgp.ernet.in/papers/hindipostagging.pdf.
[46] Sivaji Bandyopadhyay, Asif Ekbal and Debasish Halder (2006), “HMM based
POS Tagger and Rule-based Chunker for Bengali”, Proceedings of NLPAI
Machine Learning Workshop on Part Of Speech and Chunking for Indian
Languages.
[47] RamaSree, R.J and Kusuma Kumari, P (2007), “Combining Pos Taggers For
Improved Accuracy To Create Telugu Annotated Texts For Information
Retrieval”, Available at http://www.ulib.org/conference/2007/RamaSree.pdf.
[48] Sandipan Dandapat (2007), “Part Of Speech Tagging and Chunking with
Maximum Entropy Model”, Proceedings of IJCAI Workshop on Shallow
Parsing for South Asian Languages.
[49] Antony P.J and K.P. Soman. 2010. Kernel based part of speech tagger for Kannada. In Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, volume 4, pages 2139-2144, July.
[50] Manju K, Soumya S, and Sumam Mary Idicula (2009), “Development of a POS
Tagger for Malayalam - An Experience”, International Conference on Advances
in Recent Technologies in Communication and Computing, pp.709-713.
[51] Antony P J, Santhanu P Mohan and Soman K P (2010), "SVM Based Part of Speech Tagger for Malayalam", International Conference on Recent Trends in Information, Telecommunication and Computing (ITC 2010).
[53] Arulmozhi P and Sobha L (2006), “A Hybrid POS Tagger for a Relatively Free
Word Order Language”, Proceedings of MSPIL-2006, Indian Institute of
Technology, Bombay.
[56] Ganesan M (2007), “Morph and POS Tagger for Tamil” (Software), Annamalai
University, Annamalai Nagar.
[57] Lakshmana Pandian S and Geetha T V (2009), "CRF Models for Tamil Part of Speech Tagging and Chunking", Proceedings of the 22nd ICCPOL.
and Induction Techniques”, International Journal of Computers, Issue 4,
Volume 3, 2009.
[60] Daelemans Walter, Zavrel J, Van den Bosch A and Van der Sloot K (2003),
“MBT: Memory Based Tagger, version 2.0, reference guide”, Technical Report
ILK 03-13, ILK Research Group, Tilburg University.
[61] Alon Itai and Erel Segal (2003), “A Corpus Based Morphological Analyzer for
Unvocalized Modern Hebrew”, Department of Computer Science Technion—
Israel Institute of Technology, Haifa, Israel.
Network Security (IJCSNS) Vol 11 No. 1, Jan 2011, "Kannada Morphological
Analyser and Generator using Trie" pp 112-116
[70] Mohanty, S., Santi, P.K., Adhikary, K.P.D. 2004. Analysis and Design of Oriya
Morphological Analyser: Some Tests with OriNet. In Proceeding of symposium
on Indian Morphology, phonology and Language Engineering, IIT Kharagpur
[73] Viswanathan, S., Ramesh Kumar, S., Kumara Shanmugam, B., Arulmozi, S.
and Vijay Shanker, K. (2003). A Tamil Morphological Analyser, Proceedings of
the International Conference On Natural language processing ICON 2003,
Central Institute of Indian Languages, Mysore, India, pp. 31–39.
[75] Vijay Sundar Ram R, Menaka S and Sobha Lalitha Devi (2010), Tamil Morphological Analyser, In Mona Parakh (ed.) Morphological Analyser for Indian Languages, CIIL, Mysore, pp. 1-18.
[76] Akshar Bharat, Rajeev Sangal, S. M. Bendre, Pavan Kumar and Aishwarya,
“Unsupervised improvement of morphological analyzer for inflectionally rich
languages,” Proceedings of the NLPRS, pp. 685-692, 2001.
[78] Dalal Aniket, Kumar Nagaraj, Uma Sawant and Sandeep Shelke (2006), “Hindi
Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach”,
Proceedings of NLPAI-2006, Machine Learning Workshop on Part Of Speech
and Chunking for Indian Languages.
[81] R.M.K.Sinha, Jain R. and Jain A,“Translation from English to Indian languages,
ANGLABHARTI Approach,” In proceedings of Symposium on Translation
Support System STRANS2001, IIT Kanpur, India, Feb 15-17, 2001.
[84] Durgesh Rao,“Machine Translation in India: A Brief Survey,” In Proceedings of
the SCALLA 2001 Conference, Bangalore, India, 2001.
[90] G.S. Josan and G.S. Lehal, "Punjabi to Hindi machine translation system," In Proceedings of the 22nd International Conference on Computational Linguistics, MT-Archive, Manchester, UK, pp. 157-160, 2008.
[91] Vishal Goyal and Gurpreet Singh Lehal,“Hindi to Punjabi Machine Translation
System,”Springer Berlin Heidelberg, Information Systems for Indian
Languages, Communications in Computer and Information Science, Vol: 139,
pp: 236-241. 2011.
[93] Ruvan Weerasinghe. 2004. A statistical machine translation approach to Sinhala
Tamil language translation. In SCALLA 2004.
[97] Chellamuthu (2001), "Russian language to Tamil machine translation system", INFITT (TI2001).
[99] Saravanan S., Menon A. G., Soman K. P. (2010), English to Tamil Machine Translation System, INFITT 2010, Coimbatore.
[100] Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin, "A statistical approach to machine translation," In Journal of Computational Linguistics, 16(2):79-85, 1990.
[101] F.J.Och. “An Efficient method for determining bilingual word classes,” In
Proceedings of Ninth Conference of the European Chapter of the Association
for Computational Linguistics (EACL), 1999.
[102] Daniel Marcu and William Wong,“A Phrase-Based, Joint Probability Model for
Statistical Machine Translation,” In Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP-2002),
Philadelphia, PA, July 6-7, 2002.
[103] Philipp Koehn, Franz Josef Och, and Daniel Marcu, "Statistical Phrase-Based Translation," In Proceedings of HLT/NAACL, 2003.
[108] Sonja Nießen and Hermann Ney, "Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information," In Journal of Computational Linguistics, 30(2), pp. 181-204, 2004.
[109] Maja Popovic and Hermann Ney,“Statistical Machine Translation with a Small
Amount of Bilingual Training Data,”5th LREC SALTMIL Workshop on
Minority Languages, pp. 25–29, 2006.
[110] Michael Collins, Philipp Koehn, and Ivona Kucerova, "Clause Restructuring for Statistical Machine Translation," In Proceedings of ACL, pp. 531-540, 2005.
[111] Sahar Ahmadi and Saeed Ketabi. “Translation Procedures and problems of
Color Idiomatic Expressions in English and Persian,” In the Journal of
International Social Research, Volume: 4 Issue: 17, 2011.
[113] Breidt, E., Segond F and Valetto G.,“Local grammars for the description of
multi-word lexemes and their automatic recognition in texts,”In Proceedings of
COMPLEX96, Budapest, 1996.
[123] Sriram Venkatapathy, Rajeev Sangal, Aravind Joshi and Karthik Gali, A Discriminative Approach for Dependency Based Statistical Machine Translation (2010).
[125] Kumaran A and Tobias Kellner (2007), A Generic Framework for Machine Transliteration. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 721-722.
[128] Vijaya M. S., Shivapratap G., Soman K. P (2010), 'English to Tamil Transliteration using One Class Support Vector Machine', International Journal of Applied Engineering Research, Volume 5, Number 4, 641-652.
[130] Sobha, L., Vijay Sundar Ram. R, (2006) "Noun Phrase Chunker for Tamil", In
Proceedings of Symposium on Modeling and Shallow Parsing of Indian
Languages, Indian Institute of Technology, Mumbai, pp 194-198.
[132] http://en.wikipedia.org/wiki/Tamil_grammar.
[135] http://en.wikipedia.org/wiki/Machine_learning
[136] Vapnik, V. (1998). Statistical Learning Theory. Wiley & Sons, Inc., New York.
[137] K.P. Soman, Shyam Diwakar, V. Ajay, Insight into Data Mining, Theory and Practice. Prentice Hall of India, Pages 174-198, 2008.
[138] Dr. K.P. Soman, Ajay V, Loganathan R., "Machine Learning with SVM and other Kernel Methods", Prentice-Hall India, ISBN: 978-81-203-3435-9, 2009.
[139] Harris Z S (1962), String Analysis of Sentence Structure. Mouton, The Hague.
[144] Kishore Papineni, Salim Roukos, Todd Ward & Wei-Jing Zhu (2002). BLEU: a
method for automatic evaluation of machine translation. In Proceedings of the
40th Meeting of the Association for Computational Linguistics (ACL’02) (pp.
311–318). Philadelphia, PA.
[146] V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions
and reversals. Soviet Physics Doklady, 10(8), pp. 707–710, February.
[147] S. Nießen, F. J. Och, G. Leusch, and H. Ney. 2000. An evaluation tool for
machine translation: Fast evaluation for MT research. In Proc. Second Int. Conf.
on Language Resources and Evaluation, pp. 39–45, Athens, Greece, May.
[148] Christoph Tillmann, Stefan Vogel, Hermann Ney & Alex Zubiaga (1997). A
DP-based search using monotone alignments in statistical translation. In
Proceedings of the 35th Meeting of the Association for Computational
Linguistics and 8th Conference of the European Chapter of the Association for
Computational Linguistics (pp. 289–296). Somerset, New Jersey.
[149] Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul: 2006, "A Study of Translation Edit Rate with Targeted Human Annotation". In: Proceedings of Association for Machine Translation in the Americas. pp. 223-231.
[150] http://nlp.stanford.edu/software/lex-parser.shtml
[151] Fei Xia and Michael McCord. Improving a statistical MT system with
automatically learned rewrite patterns. In Proceedings of the 20th International
Conference on Computational Linguistics, COLING ’04, pages 508–514,
Geneva, Switzerland, August 2004. Association for Computational Linguistics.
[152] Michael Collins, Philipp Koehn, and Ivona Kučerová. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 531-540, Ann Arbor, Michigan, USA, June 2005. Association for Computational Linguistics.
[153] Marta Ruiz Costa-jussà (2006). On developing novel reordering algorithms for Statistical Machine Translation. Ph.D. thesis, Speech Processing Group, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya.
[155] Ananthakrishnan Ramanathan, Pushpak Bhattacharya, Jayprasad Hegde, Ritesh M. Shah, and Sasikumar M. 2008. Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation. In IJCNLP 2008, Hyderabad, India.
[157] Simon Zwarts and Mark Dras. 2007. Syntax-Based Word Reordering in Phrase-
Based Statistical Machine Translation: Why Does it Work? In Proceedings of
MT Summit XI, pages 559–566.
[160] Akshar Bharati, Rajeev Sangal, Dipti Misra Sharma and Lakshmi Bai (2006),
“AnnCorra: Annotating Corpora Guidelines for POS and Chunk Annotation for
Indian Languages”, Language Technologies Research Centre IIIT, Hyderabad.
[161] http://www.au-kbc.org/research_areas/nlp/projects/postagger.html.
[162] http://www.infitt.org/ti2001/papers/vasur.pdf.
[163] http://shiva.iiit.ac.in/SPSAL2007/SPSAL-Proceedings.pdf.
[164] http://www.ldcil.org/up/conferences/pos%20tag/presentation.html.
[165] http://www.infitt.org/ti2009/papers/ganesan_m_final.pdf
[166] http://tdil.mit.gov.in/Tamil-AnnaUniversity-ChennaiJuly03.pdf
[167] Rajan K (2002), "Corpus Analysis and Tagging for Tamil", Annamalai University, Annamalai Nagar.
[168] http://www.cs.waikato.ac.nz/~ml/weka/
[169] Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) (2004), Morphology: A Handbook on Inflection and Word Formation, Berlin and New York: Walter De Gruyter, 1893-1900.
[170] N. Ramaswami, 2001, Lexical Formatives and Word Formation Rules In Tamil.
Volume 1: 8 December 2001.
[172] Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and
P. Roossin, A statistical approach to machine translation, In Journal of
Computational Linguistics, 16(2):79-85, 1990.
[173] Jurafsky, Daniel, and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition. Prentice-Hall, 2009.
[174] Philipp Koehn and Kevin Knight, Knowledge Sources for Word-Level
Translation Models, In Proceedings of EMNLP, 2001.
[175] Philipp Koehn, Franz Josef Och, and Daniel Marcu, "Statistical Phrase-Based Translation", In Proceedings of HLT/NAACL, 2003.
[176] Kenji Yamada and Kevin Knight, A Syntax-based Statistical Translation Model,
In Proceedings of ACL 2001, pp.523-530, 2001.
[180] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, et al., Moses: Open source toolkit for statistical machine translation, In Proceedings of ACL, Demonstration Session, 2007.
[184] Garrido Alicia, Amaia Iturraspe, Sandra Montserrat, et al. (1999). "A compiler for morphological analysers and generators based on finite state transducers". Procesamiento del Lenguaje Natural, 25:93-98.
[185] Guido Minnen, John Carroll, and Darren Pearce. 2000. "Robust applied morphological generation." Proceedings of the First International Natural Language Generation Conference, pages 201-208, 12-16 June.
[190] Jan Hajič, Barbora Vidová-Hladká: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: Proceedings of the Conference COLING-ACL '98, 1998.
PUBLICATIONS
International Journals
8. Anand Kumar M, Dhanalakshmi V, Soman K.P, Factored Statistical Machine Translation System for English to Tamil using Tamil Linguistic Tools, Journal of Computer Science, Science Publications. [Indexed by IET, ISI Thomson Scientific Index, SCOPUS]. (Accepted for Publication).
International Conferences
7. Dhanalakshmi V, Padmavathy P, Anand Kumar M, Soman K P and Rajendran S (2009), "Chunker for Tamil using Machine Learning", 7th International Conference on Natural Language Processing 2009 (ICON 2009), IIIT Hyderabad.
13. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2011), "Morphology based factored Statistical Machine Translation system from English to Tamil", INFITT 2011, University of Pennsylvania, Philadelphia, USA, June 17-19, 2011.