
MORPHOLOGY BASED PROTOTYPE

STATISTICAL MACHINE TRANSLATION SYSTEM


FOR ENGLISH TO TAMIL LANGUAGE

A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the School of Engineering

by

ANAND KUMAR M

CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

AMRITA SCHOOL OF ENGINEERING


AMRITA VISHWA VIDYAPEETHAM
COIMBATORE-641 112, TAMILNADU, INDIA

April, 2013

 
 
AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641 112

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled “MORPHOLOGY BASED PROTOTYPE
STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO
TAMIL LANGUAGE” submitted by Mr. ANAND KUMAR M, Reg. No.
CB.EN.D*CEN08002 for the award of the Degree of Doctor of Philosophy in the
School of Engineering is a bonafide record of the work carried out by him under my
guidance and supervision at Amrita School of Engineering, Coimbatore.

Thesis Advisor

Dr. K.P.SOMAN
Professor and Head,
Center for Excellence in Computational Engineering and Networking.

 
 
AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE 641 112

CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

DECLARATION

I, ANAND KUMAR M (Reg. No. CB.EN.D*CEN08002) hereby declare that this thesis
entitled “MORPHOLOGY BASED PROTOTYPE STATISTICAL
MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL
LANGUAGE” is the record of the original work done by me under the guidance of
Dr. K.P. SOMAN, Professor and Head, Center for Excellence in Computational
Engineering and Networking, Amrita School of Engineering, Coimbatore and to the best
of my knowledge this work has not formed the basis for the award of any
degree/diploma/associateship/fellowship or a similar award, to any candidate in any
University.

Place: Coimbatore
Date:

Signature of the Student

COUNTERSIGNED

Thesis Advisor

Dr. K.P.SOMAN
Professor and Head
Center for Excellence in Computational Engineering and Networking  

 
 
TABLE OF CONTENTS

ACKNOWLEDGEMENT .............................................................................................. xii


LIST OF FIGURES ........................................................................................................ xv
LIST OF TABLES ....................................................................................................... xviii
ABBREVIATIONS ........................................................................................................ xxi
ABSTRACT .................................................................................................................. xxiv

1 INTRODUCTION ....................................................................................................... 1
1.1 GENERAL ....................................................................................................... 1
1.2 OVERVIEW OF MACHINE TRANSLATION ............................................. 2
1.3 ROLE OF MACHINE TRANSLATION IN NLP .......................................... 3
1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM ... 4
1.5 MOTIVATION OF THE THESIS .................................................................. 6
1.6 OBJECTIVE OF THE THESIS....................................................................... 7
1.7 RESEARCH METHODOLOGY .................................................................... 9
1.7.1 Overall System Architecture .............................................................. 9
1.7.2 Details of Preprocessing English Language Sentence ..................... 10
1.7.2.1 Reordering English Language Sentence ........................... 10
1.7.2.2 Factorization of English Language Sentence .................... 11
1.7.2.3 Compounding of English Language Sentence .................. 11
1.7.3 Details of Preprocessing Tamil Language Sentence ........................ 12
1.7.3.1 Tamil Part-of-Speech Tagger ............................................ 13
1.7.3.2 Tamil Morphological Analyzer ......................................... 13
1.7.4 Factored SMT System for English to Tamil Language.................... 14
1.7.5 Postprocessing for English to Tamil SMT ....................................... 15
1.7.5.1 Tamil Morphological Generator........................................ 15
1.8 RESEARCH CONTRIBUTIONS ................................................................. 16
1.9 ORGANISATION OF THE THESIS ............................................................ 17

2 LITERATURE SURVEY ......................................................................................... 19


2.1 PART OF SPEECH TAGGER ..................................................................... 19

2.1.1 Part Of Speech Tagger for Indian Languages .................................. 21
2.1.2 Part Of Speech Tagger for Tamil Language .................................... 23
2.2 MORPHOLOGICAL ANALYZER AND GENERATOR .......................... 25
2.2.1 Morphological Analyzer and Generator for Indian Languages ....... 26
2.2.2 Morphological Analyzer and Generator for Tamil Language .......... 26
2.3 MACHINE TRANSLATION SYSTEMS ..................................................... 30
2.3.1 Machine Translation Systems for Indian Languages ....................... 30
2.3.2 Machine Translation Systems for Tamil Language ......................... 35
2.4 ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM ................ 38
2.5 RELATED NLP WORKS IN TAMIL .......................................................... 43
2.6 SUMMARY ................................................................................................... 46

3 THEORETICAL BACKGROUND .......................................................................... 47


3.1 GENERAL .................................................................................................... 47
3.1.1 Tamil Language................................................................................ 47
3.1.2 Tamil Grammar ................................................................................ 48
3.1.3 Tamil Characters .............................................................................. 49
3.1.4 Morphological Richness of Tamil Language ................................... 50
3.1.5 Challenges in Tamil NLP ................................................................. 51
3.1.5.1 Ambiguity in Morpheme ................................................... 51
3.1.5.2 Ambiguity in Word Class .................................................. 52
3.1.5.3 Ambiguity in Word Sense ................................................. 52
3.1.5.4 Ambiguity in Sentence ...................................................... 53
3.2 MORPHOLOGY ........................................................................................... 53
3.2.1 Types of Morphology ....................................................................... 53
3.2.2 Lexemes ........................................................................................... 54
3.2.3 Lemma and Stems ............................................................................ 54
3.2.4 Inflections and Word forms .............................................................. 55
3.2.5 Morphemes and Types ..................................................................... 55
3.2.6 Allomorphs ....................................................................................... 56
3.2.7 Morpho-Phonemics .......................................................................... 56
3.2.8 Morphotactics ................................................................................... 57
3.3 MACHINE LEARNING FOR NLP .............................................................. 58

 
3.3.1 Machine Learning ............................................................................ 58
3.3.2 Support Vector Machines ................................................................. 59
3.3.3 Geometrical Interpretation of SVM ................................................. 61
3.3.4 SVM Formulation ............................................................................ 64
3.4 VARIOUS APPROACHES FOR POS TAGGING ...................................... 67
3.4.1 Supervised POS Tagging ................................................................. 67
3.4.2 Unsupervised POS Tagging ............................................................. 68
3.4.3 Rule based POS Tagging.................................................................. 68
3.4.4 Stochastic POS Tagging ................................................................... 69
3.4.5 Other Techniques ............................................................................. 69
3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER........ 70
3.5.1 Two level Morphological Analysis .................................................. 70
3.5.2 Unsupervised Morphological Analyser ............................................ 71
3.5.3 Memory based Morphological Analysis .......................................... 72
3.5.4 Stemmer based Approach................................................................. 72
3.5.5 Suffix Stripping based Approach ..................................................... 72
3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION ..................... 73
3.6.1 Linguistic or Rule based Approaches............................................... 73
3.6.1.1 Direct Approach ................................................................ 74
3.6.1.2 Interlingua Approach......................................................... 76
3.6.1.3 Transfer Approach............................................................. 77
3.6.2 Non Linguistic Approaches .............................................................. 79
3.6.2.1 Dictionary based Approach ............................................... 79
3.6.2.2 Empirical or Corpus based Approach ............................... 79
3.6.2.3 Example based Approach .................................................. 80
3.6.2.4 Statistical Approach .......................................................... 81
3.6.3 Hybrid Machine Translation System................................................ 82
3.7 EVALUATING STATISTICAL MACHINE TRANSLATION .................. 83
3.7.1 Human Evaluation Techniques ........................................................ 84
3.7.2 Automatic Evaluation Techniques ................................................... 85
3.7.2.1 BLEU Score ...................................................................... 85
3.7.2.2 NIST Metric ...................................................................... 86
3.7.2.3 Precision and Recall .......................................................... 87
3.7.2.4 Edit Distance Measures ..................................................... 88
3.8 SUMMARY ................................................................................................... 89

4 PREPROCESSING FOR ENGLISH LANGUAGE .............................................. 90


4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE .. 90
4.1.1 POS and Lemma and Information .................................................... 91
4.1.2 Syntactic Information ....................................................................... 92
4.1.3 Dependency Information .................................................................. 93
4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES ...................... 94
4.2.1 Reordering English Sentences .......................................................... 96
4.2.1.1 Syntactic Comparison between English and Tamil ......... 97
4.2.1.2 Reordering Methodology .................................................. 98
4.2.2 Factoring English Sentence ............................................................ 102
4.2.3 Compounding English Language Sentence.................................... 105
4.2.3.1 Morphological Comparison between English and Tamil ... 106
4.2.3.2 Compounding Methodology for English Sentence ......... 109
4.2.4 Integrating Reordering and Compounding ..................................... 113
4.3 SUMMARY ................................................................................................. 115

5 PART OF SPEECH TAGGER FOR TAMIL LANGUAGE .............................. 117


5.1 GENERAL ................................................................................................... 117
5.1.1 Part of Speech Tagging ................................................................. 117
5.1.2 Tamil POS Tagging ........................................................................ 120
5.2 COMPLEXITY IN TAMIL POS TAGGING ............................................. 122
5.2.1 Root Ambiguity .............................................................................. 122
5.2.2 Noun Complexity ........................................................................... 122
5.2.3 Verb Complexity ............................................................................ 123
5.2.4 Adverb Complexity ........................................................................ 125
5.2.5 Postposition Complexity ................................................................ 126
5.3 PART OF SPEECH TAGSET DEVELOPMENT ...................................... 126
5.3.1 Available POS Tagsets for Tamil................................................... 127
5.3.2 AMRITA POS Tagset .................................................................... 128
5.4 DEVELOPMENT OF TAMIL POS CORPORA FOR PREPROCESSING ... 129
5.4.1 Untagged and Tagged Corpus ....................................................... 130
5.4.2 Available Corpus for Tamil............................................................ 131
5.4.3 POS Tagged Corpus Development ................................................ 131
5.4.4 Applications of Tagged Corpus...................................................... 134
5.4.5 Details of POS Tagged corpus developed ...................................... 134
5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL....................... 136
5.5.1 SVMTool ....................................................................................... 136
5.5.2 Features of SVMTool ..................................................................... 137
5.5.3 Components of SVMTool .............................................................. 138
5.5.3.1 SVMTlearn ...................................................................... 138
5.5.3.2 SVMTagger ..................................................................... 146
5.5.3.3 SVMTeval ....................................................................... 151
5.6 RESULTS AND COMPARISON WITH OTHER TOOLS........................ 160
5.7 ERROR ANALYSIS ................................................................................... 161
5.8 SUMMARY ................................................................................................. 162

6 MORPHOLOGICAL ANALYZER FOR TAMIL .............................................. 163


6.1 GENERAL ................................................................................................... 163
6.1.1 Morphology in Language ............................................................... 163
6.1.2 Computational Morphology ........................................................... 163
6.1.3 Morphological Analyzer ................................................................ 164
6.1.4 Role of Morphological Analyzer in NLP ....................................... 165
6.2 TAMIL MORPHOLOGY............................................................................ 166
6.2.1 Tamil Morphology and Language .................................................. 166
6.2.2 Syntax of Tamil Morphology ......................................................... 167
6.2.3 Word Formation Rules(WFR) in Tamil ......................................... 168
6.2.4 Tamil Verb Morphology ................................................................ 171
6.2.5 Tamil Noun Morphology ............................................................... 172
6.2.6 Tamil Morphological Analyzer ...................................................... 175
6.2.7 Challenges in Tamil Morphological Analyzer ................................ 175
6.3 TAMIL MORPHOLOGICAL ANALYZER SYSTEM .............................. 176
6.4 TAMIL MORPHOLOGICAL ANALYZER FOR NOUNS AND VERBS 177
6.4.1 Morphological Analyzer using Machine Learning ........................ 177
6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer .... 179
6.4.2.1 Paradigm Classification................................................... 179
6.4.2.2 Word forms ..................................................................... 180
6.4.2.3 Morphemes ...................................................................... 183
6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer.. 186
6.4.2.5 Issues in Data Creation .................................................... 188
6.4.3 Morphological Tagging Framework using SVMTool ................... 189
6.4.3.1 Support Vector Machine (SVM) ..................................... 189
6.4.3.2 SVMTool ......................................................................... 189
6.4.3.3 Implementation of Morphological Analyzer System ...... 190
6.5 MORPH ANALYZER FOR PRONOUN USING PATTERNS ................. 192
6.6 MORPH ANALYZER FOR PROPER NOUN USING SUFFIXES........... 194
6.7 RESULTS AND EVALUATION ............................................................... 195
6.8 PREPROCESSED ENGLISH AND TAMIL SENTENCE......................... 198
6.9 SUMMARY ................................................................................................. 198

7 FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL ............................... 200


7.1 STATISTICAL MACHINE TRANSLATION ........................................... 200
7.2 COMPONENTS OF SMT ........................................................................... 201
7.2.1 Translation Model .......................................................................... 202
7.2.1.1 Expectation Maximization .............................................. 202
7.2.1.2 Word based Translation Model ....................................... 203
7.2.1.3 Phrase based Translation Model ..................................... 204
7.2.2 Language Model ............................................................................. 206
7.2.2.1 N-gram Language Models ............................................... 208
7.2.3 Statistical Machine Translation Decoder ....................................... 210
7.3 INTEGRATING LINGUISTIC INFORMATION IN SMT ...................... 210
7.3.1 Factored Translation Models .......................................................... 210
7.3.1.1 Decomposition of Factored Translation .......................... 212
7.3.2 Syntax based Translation Models .................................................. 212
7.4 TOOLS USED IN SMT SYSTEM .............................................................. 213
7.4.1 MOSES .......................................................................................... 213
7.4.2 GIZA++ & MKCLS ...................................................................... 214
7.4.3 SRILM ............................................................................................ 214
7.5 DEVELOPMENT OF FACTORED CORPORA ........................................ 215
7.5.1 Parallel Corpora Collection ............................................................ 215
7.5.2 Monolingual Corpora Collection .................................................. 216
7.5.3 Automatic Creation of Factored Corpora ....................................... 216
7.6 FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE ................. 217
7.6.1 Building Language Model .............................................................. 218
7.6.2 Building Phrase based Translation Model...................................... 219
7.7 SUMMARY ................................................................................................. 221

8 POSTPROCESSING FOR ENGLISH TO TAMIL SMT ................................... 222


8.1 GENERAL ................................................................................................... 222
8.2 MORPHOLOGICAL GENERATOR.......................................................... 223
8.2.1 Challenges in Tamil Morphological Generator .............................. 223
8.2.2 Simplified Part-of-Speech Categories ............................................ 225
8.3 MORPHOLOGICAL GENERATOR FOR TAMIL NOUN AND VERB . 226
8.3.1 Algorithm for Noun and Verb Morphological Generator .............. 227
8.3.2 Word-forms Handled in Morphological Generator ........................ 229
8.3.3 Data Required for the Algorithm ................................................... 230
8.3.3.1 Morpho Lexical Information File .................................... 230
8.3.3.2 Paradigm Classification Rules ........................................ 232
8.3.3.3 Suffix Table ..................................................................... 234
8.3.3.4 Stemming Rules .............................................................. 235
8.4 MORPHOLOGICAL GENERATOR FOR TAMIL PRONOUNS............. 236
8.5 SUMMARY ................................................................................................. 238

9 EXPERIMENTS AND RESULTS ......................................................................... 240


9.1 GENERAL ................................................................................................... 240
9.2 EXPERIMENTAL SETUP AND RESULTS.............................................. 240
9.3 SUMMARY ................................................................................................. 245

10 CONCLUSION AND FUTURE WORK .............................................................. 246


10.1 SUMMARY ................................................................................................. 247

 
10.2 SUMMARY OF WORK DONE ................................................................. 247
10.3 CONCLUSIONS ......................................................................................... 249
10.4 FUTURE DIRECTIONS ............................................................................. 250

APPENDIX-A ................................................................................................................ 252


A.1 TAMIL TRANSLITERATION ................................................................ 252
A.2 DETAILS OF AMRITA POS TAGS ....................................................... 256
APPENDIX-B ................................................................................................................ 264
B.1 PENN TREE BANK POS TAGS ............................................................. 264
B.2 DEPENDENCY TAGS .............................................................................. 265
B.3 TAMIL VERB MLI ................................................................................... 266
B.4 TAMIL NOUN WORD FORM ................................................................ 272
B.5 TAMIL VERB WORD FORM................................................................. 275
B.6 MOSES INSTALLATION AND TRAINING......................................... 280
B.7 COMPARISON WITH GOOGLE OUTPUT ........................................ 285
B.8 GRAPHICAL USER INTERFACES....................................................... 286
REFERENCES ............................................................................................................. 290
AUTHOR’S PUBLICATIONS ................................................................................... 310

ACKNOWLEDGEMENT

I would never have been able to finish my dissertation without the guidance, support
and encouragement of numerous people including my mentors, my friends, colleagues
and support from my family and wife. At the end of my thesis I would like to thank all
those people who made this thesis possible and an unforgettable experience for me.

First and foremost, I feel deeply indebted to Her Holiness Most Revered Mata
Amritanandamayi Devi (Amma) for her inspiration and guidance throughout my
doctoral studies, both in unseen and unconcealed ways.

Wholeheartedly, I thank our respected Pro Chancellor, Swami Abhayamrita
Chaitanya, for providing the necessary environment, infrastructure and encouragement
for my research at Amrita Vishwa Vidyapeetham University. I thank Dr. P. Venkat
Rangan, our respected Vice Chancellor, for his wholehearted encouragement and
support throughout my doctoral studies.

I would like to express my sincere gratitude to my supervisor, Dr. K.P. Soman,
Professor and Head, Centre for Excellence in Computational Engineering and
Networking (CEN), for his excellent guidance, patience, and for providing an excellent
atmosphere for doing research. His wide knowledge and logical way of thinking have
been a great source of inspiration for me. I am happy and proud to say that I am a
student of Dr. K.P. Soman. He has always extended a helping hand in solving
research problems. The in-depth discussions, scholarly supervision and constructive
suggestions received from him have broadened my knowledge. I strongly believe that
without his guidance, the present work could not have reached this stage.

I wish to thank my doctoral committee members, Dr. C.S. Shunmuga Velayutham and
Dr. V.P. Mohandass, for their encouraging words and support throughout this research.

I express my heartfelt gratitude to Dr. N.S. Pandian, Dean, PG Programmes,
Amrita Vishwa Vidyapeetham, Coimbatore, for his continuous support of my Ph.D.
study and research.

I wish to thank Dr. S. Rajendran for his supervision, advice, and guidance from
the very early stage of this research, as well as for giving me extraordinary experiences
throughout the work.

I express my deepest gratitude to Mrs. V. Dhanalakshmi, Head of the
Department of Tamil, SRM University, Chennai. Whatever knowledge I have gained in
linguistics is definitely because of her.

I also wish to thank my school teacher Mr. B. Vaithiyanathan M.Sc., M.Ed., for
supporting me since my school days. I would like to thank Mr. Arun Sankar K, who has
been a good friend since my graduate studies and is always willing to help and give his
best suggestions.

I express my sincere gratitude to my beloved Director, Dr. K.A. Chinnaraju,
and Principal, Dr. N. Nagarajan, CIET, for giving me all the moral support to complete
the thesis successfully. I would like to express my gratitude to my Head of the Department,
Dr. S. Gunasekaran, who has always inspired me to complete this thesis work. I would also
like to thank Mr. G. Ravi Kumar and Prof. Mrs. Janaki Kumar for their timely support and
suggestions. I would like to thank my colleagues in the Department of Computer Science and
Engineering, especially Mr. N. Ramkumar, Mr. N. Boopal, Mr. A. Suresh, Mr. M. Yogesh,
Mr. C. Prabu, and Mr. B. Saravanan, for sharing their enthusiasm and for supporting me from
the beginning of my career at CIET.

I wish to express my warm and sincere thanks to Dr. Mrs. M.S. Vijaya, HOD
(MCA), GRD Krishnamal College for Women, and Dr. M. Sabarimalai Manikandan,
SAMSUNG Electronics, for their kind support and direction, which have been of great
value in this study.

My sincere thanks also go to Mr. Sivaprathap, Mr. Rakesh Peter,
Mr. Loganathan, Mr. Antony P J, Mr. Ajit, Mr. Saravanan, Mr. Kathir, Mr. Senthil,
Mr. V. Anand Kumar, Mrs. Latha Menon, and Mr. Sampath Kumar of the CEN
department for supporting me in all ways. I also express my sense of gratitude to
my friends Ms. Resmi N.G. and Ms. Preeja for their encouragement and guidance. My
research would not have been possible without the help of my friends C. Murugesan,
S. Ramakrishnan, S. Mohanraj and A. Baladhandapani; I would like to thank them for
being with me in all circumstances.

I wish to give special thanks to my friends Mrs. Rekha Kishore, Mr. C. Arun
Kumar, Mrs. Padmavathy and Mr. Tirumeni for supporting me in this research.

I would like to thank my Grandpa Mr. M. Narayanasamy and Mr. A. Peter,
who left us too soon. I hope that this work will make them proud.

I would like to thank my uncle Mr. P.M. Palraj and aunt Mrs. P. Rajeswari for
their encouragement and motivation during the difficult moments of the long years
of my education. I would also like to express my deepest gratitude to my Grandma
Mrs. N. Valliyammal and my uncles Mr. N. Natesapandiyan and Mr. N. Pandiyan for
supporting me from my school days.

I want to thank my parents Mr. N. Madasamy and Mrs. M. Manohari for their
kind support, and for the confidence and love they have shown me. You have been my
greatest strength and I am blessed to be your son. I would also like to give special
thanks to my beloved brother Mr. M. Vasanthkumar for his support in all ways.

I wish to thank my sister Mrs. S. Arthi and her husband Mr. K. Suresh for
supporting me in all ways. I would like to thank my father-in-law Mr. P. Velusamy
and mother-in-law Mrs. V. Ponnuthai; without their encouragement and moral support
it would have been impossible for me to finish this work.

Finally, I would like to give special thanks to my wife Mrs. Sharmiladevi V.
She has always been there to cheer me up at difficult times with great patience. Without
her love and support it would have been impossible for me to finish this work.

-ANAND KUMAR M

LIST OF FIGURES

Figure 1.1 Morphology based Factored SMT for English to Tamil Language ................... 10

Figure 1.2 Reordering of English Language ........................................................................ 11

Figure 1.3 Mapping English Word Factors to Tamil Word Factors .................................... 14

Figure 1.4 Thesis Organization ............................................................................. 17

Figure 3.1 Maximum Margin and Support Vectors ............................................................. 62

Figure 3.2 Training Errors in Support Vector Machine....................................................... 63

Figure 3.3 Non-linear Classifier ........................................................................................... 64

Figure 3.4 Classification of POS Tagging Models .............................................................. 67

Figure 3.5 Two Level Morphology ...................................................................................... 71

Figure 3.6 Block Diagram of Direct Approach to Machine Translation ............................. 75

Figure 3.7 The Vauquois Triangle ....................................................................... 77

Figure 3.8 Block Diagram of Transfer Approach ................................................................ 78

Figure 3.9 Block Diagram of EBMT System ...................................................................... 80

Figure 3.10 Block Diagram of SMT System ......................................................................... 81

Figure 3.11 Rule based Translation System with Post-processing ........................................ 83

Figure 3.12 Statistical Machine Translation System with Pre-processing ............................ 83

Figure 4.1 Example of English Syntactic Tree .................................................................... 92

Figure 4.2 Preprocessing Stages of English Sentence ......................................................... 95

Figure 4.3 Process of Reordering ......................................................................................... 99

Figure 4.4 English Syntactic Tree ...................................................................................... 101

Figure 4.5 English to Tamil Alignment ............................................................................. 110

Figure 4.6 Block Diagram for Compounding ................................................................... 111

Figure 4.7 Integration Process ........................................................................................... 114

Figure 5.1 Example of Untagged Corpus........................................................................... 130

Figure 5.2 Example of Tagged Corpus .............................................................................. 130

Figure 5.3 Untagged Corpus before Pre-editing ................................................................ 132

Figure 5.4 Untagged Corpus after Pre-editing ................................................................... 133

Figure 5.5 Training Data Format........................................................................................ 139

Figure 5.6 Implementation of SVMTlearn ........................................................................ 143

Figure 5.7 Example Input ................................................................................................... 149

Figure 5.8 Example Output ................................................................................................ 149

Figure 5.9 Implementation of SVMTagger........................................................................ 150

Figure 5.10 Implementation of SVMTeval .......................................................................... 152

Figure 6.1 Role of Morphological Analyzer in NLP ......................................................... 166

Figure 6.3 General Framework for Morphological Analyzer System ............................... 176

Figure 6.4 Preprocessing Steps .......................................................................................... 187

Figure 6.5 Implementation of Noun/Verb Morph Analyzer .............................................. 191

Figure 6.6 Structure of Pronoun Word form ...................................................................... 192

Figure 6.7 Implementation of Pronoun Morph Analyzer ................................................. 193

Figure 6.8 Implementation of Proper Noun Morph Analyzer .......................................... 195

Figure 6.9 Training Data Vs Accuracy .............................................................................. 196

Figure 7.1 The Noisy Channel Model to Machine Translation ......................................... 201

Figure 7.2 Block Diagram for Factored Translation .......................................................... 211

Figure 7.3 Mapping English Factors to Tamil Factors ...................................................... 280

Figure 8.1 Tamil Sentence Generation............................................................................... 225

Figure 8.2 Algorithm for Morphological Generator .......................................................... 227

Figure 8.3 Architecture of Tamil Morphological Generator ............................................. 228

Figure 8.4 Pseudo Code for Paradigm Classification ........................................................ 233

Figure 8.5 Structure of Pronoun Word form ...................................................................... 237

Figure 8.6 Pronoun Morphological Generator ................................................................... 238

 
Figure 9.1 BLEU-1 Score for Various Models .................................................................. 243

Figure 9.2 BLEU-4 Score for Various Models .................................................................. 244

Figure 9.3 NIST Score for Various Models ....................................................................... 244

Figure 9.4 Google Translation System............................................................................... 245

LIST OF TABLES

Table 1.1 Factored English Sentences ................................................................................ 12

Table 1.2 Compounded English Sentences ........................................................................ 12

Table 3.1 Tamil Grammar................................................................................................... 48

Table 3.2 Tamil Vowels ...................................................................................................... 49

Table 3.3 Tamil Compound Letters .................................................................................... 50

Table 3.4 Ambiguity in Morpheme’s Position ................................................................... 52

Table 3.5 An Example to Illustrate the Direct Approach ................................................... 75

Table 3.6 An Example for Interlingua Representation ....................................................... 76

Table 3.7 An Example for Transfer Approach ................................................................... 79

Table 3.8 Example of English and Tamil Sentences ......................................................... 81

Table 3.9 Scales of Evaluation........................................................................................... 85

Table 4.1 POS and Lemma of Words ................................................................................. 91

Table 4.2 Reordering Rules .............................................................................................. 100

Table 4.3 Original and Reordered Sentences ................................................................... 102

Table 4.4 Description of Factors in English Word ........................................................... 103

Table 4.5 Example of English Word Factors.................................................................... 104

Table 4.6 Factored Representation of English Language Sentence ................................. 104

Table 4.7 Word forms of English ..................................................................................... 106

Table 4.8 Content Words of English ................................................................................ 107

Table 4.9 Function Words of English ............................................................................... 107

Table 4.10 English Word Forms based on Tenses ............................................................. 108

Table 4.11 Tamil Word Forms based on Tenses ............................................................... 109

Table 4.12 Compounding Rules for English Sentence ...................................................... 112

Table 4.13 Average Words per Sentence............................................................................ 113

Table 4.14 Factored English Sentence ................................................................................ 113


Table 4.15 Compounded English Sentence ........................................................................ 113

Table 4.16 Preprocessed English Sentences ....................................................................... 115

Table 5.1 AMRITA POS Tagset....................................................................................... 129

Table 5.2 Tag Count.......................................................................................................... 134

Table 5.3 Corpus Statistics................................................................................................ 135

Table 5.4 Example of Suitable POS Features for Model 0 .............................................. 141

Table 5.5 Example of Suitable POS Features for Model 1 .............................................. 141

Table 5.6 Example of Suitable POS Features for Model 2 .............................................. 142

Table 5.7 Comparison of Accuracies................................................................................ 161

Table 5.8 Trials and Error ................................................................................................ 162

Table 5.9 Confusion Matrix .............................................................................................. 162

Table 6.1 Compound Word-forms Formation .................................................................. 171

Table 6.2 Simple Verb Finite Forms ................................................................................ 172

Table 6.3 Noun Case Markers .......................................................................................... 173

Table 6.4 Minimized POS Tagset ..................................................................................... 177

Table 6.5 Number of Paradigms and Inflections ............................................................. 180

Table 6.6 Noun Paradigms ................................................................................................ 180

Table 6.7 Verb Paradigms................................................................................................. 181

Table 6.8 Noun Word Forms ............................................................................................ 181

Table 6.9 Verb Word Forms ............................................................................................. 182

Table 6.10 Noun Morphemes ............................................................................................. 183

Table 6.11 Verb Morphemes .............................................................................................. 184

Table 6.12 Verb/Noun Ambiguous Morphemes ............................................................... 185

Table 6.13 Sample Data Format ......................................................................................... 187

Table 6.14 Example of Proper Noun Inflections ................................................................ 195

Table 6.15 Tagged Vs Untagged Accuracies ..................................................................... 196

Table 6.16 Number of Words and Characters and Level of Efficiencies .......................... 197
Table 6.17 Sentence Level Accuracies ............................................................................... 198

Table 6.18 Preprocessed English and Tamil Sentence ....................................................... 198

Table 7.1 Factored Parallel Sentences .............................................................................. 217

Table 8.1 Morpho-phonemic Changes ............................................................................. 224

Table 8.2 Simplified POS Tagset ..................................................................................... 225

Table 8.3 Verb and Noun Word Forms ............................................................................ 229

Table 8.4 MLI for Tamil Verb .......................................................................................... 231

Table 8.5 Look up Table for Paradigm Classification...................................................... 233

Table 8.6 Paradigms and inflections ................................................................................. 234

Table 8.7 Suffix Table ...................................................................................................... 235

Table 8.8 Stemming End Characters ................................................................................ 236

Table 9.1 Details of Baseline Parallel Corpora ................................................................ 241

Table 9.2 Details of Factored Parallel Corpora ................................................................ 241

Table 9.3 BLEU and NIST Scores.................................................................................... 243

Table 10.1 Mapping of Major Research Outcome to Publications .................................... 248

LIST OF ABBREVIATIONS

ABBREVIATIONS FULL FORM


1PL First person Plural
1S First person Singular
2PE Second person Plural Epicene
2S Second person Singular
2SE Second person Singular Epicene
3PE Third person Plural Epicene
3PN Third person Plural Neutral
3SE Third person Singular Epicene
3SF Third person Singular Feminine
3SM Third person Singular Masculine
3SN Third person Singular Neutral
ACC Accusative
AI Artificial Intelligence
AU-KBC Anna University K B Chandrasekhar Research Centre
BL Baseline
BLEU Bi-Lingual Evaluation Understudy
CALTS Centre for Applied Linguistics and Translation Studies
CIIL Central Institute of Indian Languages
CLIR Cross Lingual Information Retrieval
CRF Conditional Random Fields
CWF Compressed Word Format
EBMT Example based Machine Translation
EM Expectation Maximization
EOS End of Sentences
FSA Finite State Automata
FSM Finite State Machine
FSMT Factored Statistical Machine Translation
FST Finite State Transducer
HMM Hidden Markov Model
IBM International Business Machine
IE Information Extraction
IIIT International Institute of Information Technology
IR Information Retrieval
KWIC Key Word In Context
LDC Linguistic Data Consortium
LSV Letter Successor Varieties
ManTra MAchiNe assisted TRAnslation
MBMA Memory based Morphological Analysis
MEMM Maximum Entropy Markov Models
MG Morphological Generator
MIRA Margin Infused Relaxed Algorithm
ML Machine Learning
MLI Morpho-Lexical Information
MT Machine Translation
NIST National Institute of Standards and Technology
NLI Natural Language Interface
NLP Natural Language Processing
NLU Natural Language Understanding
PBSMT Phrase based Statistical Machine Translation
PCFG Probabilistic Context Free Grammar
PER Position Independent Word Error Rate
PLIL Pseudo Lingual for Indian Languages
PN Proper Noun
PNG Person-Number-Gender
POS Part-of-Speech
POST Part-of-Speech Tagging
QA Question Answering
RBMT Rule based Machine Translation

RCILTS Resource Centre for Indian Language Technology Solutions
SMR Statistical Machine Reordering
SMT Statistical Machine Translation
SOV Subject-Object-Verb
SRILM Stanford Research Institute Language Modeling
SVM Support Vector Machine
SVO Subject-Verb-Object
TBL Transformation based learning
TDIL Technology Development for Indian Languages
TER Translation Edit Rate
TnT Trigrams'n'Tags
UCSG Universal Clause Structure Grammar
UN United Nations
VG Verb Group
WER Word Error Rate
WFR Word Formation Rules
WSJ Wall Street Journal
WWW World Wide Web

ABSTRACT

Machine translation is the automatic translation of text in one natural language
into another using a computer. In this thesis, a morphology-based Factored Statistical
Machine Translation (F-SMT) system is proposed for translating sentences from English
to Tamil. Tamil linguistic tools such as a Part-of-Speech Tagger, a Morphological
Analyzer and a Morphological Generator are also developed as part of this research
work. Conventionally, rule-based approaches have been employed for developing machine
translation systems; they use transfer rules between the source language and the target
language to produce grammatical translations. The major drawback of this approach is
that it always requires a good linguist for rule improvement. Recently, therefore,
data-driven approaches such as example-based and statistical systems have been getting
more attention from the research community. Currently, Statistical Machine Translation
(SMT) systems play a major role in developing translation between languages. The
main advantages of Statistical Machine Translation are that it is language
independent and that it disambiguates word senses automatically through the use of
large quantities of parallel corpora. An SMT system treats the translation problem as a
machine learning problem.

Statistical learning methods perform translation based on large amounts of
parallel training data. First, non-structural information and statistical parameters are
derived from the bilingual corpora. These statistical parameters are then used for
translation. A baseline Statistical Machine Translation system considers only surface
forms and does not use linguistic knowledge of the languages. Its performance is
therefore better for similar language pairs than for dissimilar ones. Translating English
into morphologically rich languages is a challenging task. Because of the highly rich
morphological nature of the Tamil language, a simple lexical mapping alone does not
suffice for retrieving and mapping all the morphological and syntactic information from
English sentences.

Tamil word formation is highly productive: morphemes are concatenated without
spaces, so each inflected form of a Tamil word is a separate surface word. This leads to
the problem of sparse data. It is very difficult to collect or create a parallel corpus which
contains all the possible Tamil surface words, because a single Tamil root verb can be
inflected into more than ten thousand different forms. Moreover, selecting the correct
Tamil word or phrase during translation is a challenging job. The size and quality of the
corpus decide the accuracy of a machine translation system. The limited availability of
parallel corpora for the English-Tamil language pair and the high inflectional variation
increase the data sparseness problem for a baseline phrase-based SMT system. While
translating from English to Tamil, the baseline SMT system cannot generate Tamil word
forms that are not present in the training corpora.

The proposed machine translation system is based on factored Statistical
Machine Translation models. Words are factored into lemmas and inflectional features
based on their part of speech. This factorization reduces data sparseness in decoding.
Factored translation models allow the integration of linguistic information into a
phrase-based translation model; these linguistic features are treated as separate tokens
during factored training. A baseline SMT system uses untagged corpora for training,
whereas factored SMT uses linguistically factored corpora. The preprocessing phase
indirectly incorporates language-specific knowledge into the parallel corpus: the
bilingual corpora are converted into factored bilingual corpora using linguistic tools
and reordering rules. Similarly, Tamil sentences are preprocessed using the proposed
linguistic tools, namely the POS tagger and the morphological analyzer. These factored
corpora are then given to the Statistical Machine Translation models for training.
Finally, the Tamil morphological generator is used for generating a surface word from
the output factors.
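
To make the factored representation concrete, the following minimal Python sketch builds the pipe-separated "surface|lemma|pos" token format used by factored SMT toolkits such as Moses. The word analyses shown are invented toy examples; in the proposed system these factors are produced by the POS tagger and the morphological analyzer.

# Minimal sketch of factoring a tokenized sentence, assuming the
# Moses-style pipe-separated factor format. ANALYSES is invented
# toy data; real factors come from a tagger and a morph analyzer.
ANALYSES = {
    "he":     ("he", "PRP"),
    "bought": ("buy", "VBD"),
    "books":  ("book", "NNS"),
}

def factorize(sentence):
    """Annotate each surface word with its lemma and POS factor."""
    tokens = []
    for word in sentence.split():
        lemma, pos = ANALYSES.get(word.lower(), (word.lower(), "UNK"))
        tokens.append(f"{word}|{lemma}|{pos}")
    return " ".join(tokens)

print(factorize("he bought books"))
# -> he|he|PRP bought|buy|VBD books|book|NNS

On the Tamil side the same format carries lemma and morphological-tag factors, and after decoding the morphological generator folds the output factors back into a single Tamil surface word.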

CHAPTER 1
INTRODUCTION
1.1 GENERAL
Machine translation is the automatic translation of text in one natural language into
another using a computer. Initial attempts at machine translation in the 1950s did not
meet with success. Today, internet users need fast automatic translation between
languages. Several approaches, such as linguistic (rule-based) and interlingua-based
systems, have been used to develop machine translation systems, but currently
statistical methods dominate the field. The Statistical Machine Translation (SMT)
approach draws knowledge from automata theory, artificial intelligence, data structures
and statistics. An SMT system treats translation as a machine learning problem: a
learning algorithm is applied to a large amount of parallel corpora. Parallel corpora are
sentences in one language along with their translations. Learning algorithms create a
model from parallel sentences, and using this model, unseen sentences are translated.
If parallel corpora are available for a language pair, it is easy to build a bilingual SMT
system. The accuracy of the system is highly dependent on the quality and quantity of
the parallel corpus and on the domain. Parallel corpora, the fundamental resource for
SMT, are constantly growing; they are available from government bilingual textbooks,
newspapers, websites and novels.

SMT models give good accuracy for language pairs, particularly for similar
languages in specific domains or for languages with a large availability of bilingual
corpora. If the sentences of a language pair are not structurally similar, the translation
patterns are difficult to learn. Huge amounts of parallel corpora are required for
learning such patterns; statistical methods are therefore difficult to use for “less
resourced” languages. To enhance the translation performance of dissimilar language
pairs and less resourced languages, external preprocessing is required. This
preprocessing is performed using linguistic tools.

In an SMT system, statistical methods are used for mapping source language
phrases to target language phrases. Statistical model parameters are estimated from
bilingual and monolingual corpora. There are two models in an SMT system: the
translation model and the language model. The translation model takes parallel
sentences and finds translation hypotheses between phrases. The language model is
based on the statistical properties of n-grams and is estimated from monolingual corpora.
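
In the classical noisy-channel formulation (see Figure 7.1), the decoder searches for the target sentence t that maximizes P(t) x P(s|t), where the translation model supplies P(s|t) and the language model supplies the fluency score P(t). As a rough illustration of how a language model is estimated from monolingual text, the short Python sketch below computes maximum-likelihood bigram probabilities over an invented toy corpus; production systems such as SRILM use higher-order n-grams with smoothing.

# Maximum-likelihood bigram language model, for illustration only;
# real SMT systems use toolkits such as SRILM with smoothing.
from collections import defaultdict

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over sentence-marked token streams."""
    unigram, bigram = defaultdict(int), defaultdict(int)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram[w1] += 1
            bigram[(w1, w2)] += 1
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

lm = train_bigram_lm(["the cat sat", "the cat ran"])  # invented toy corpus
print(lm("the", "cat"))  # 1.0
print(lm("cat", "sat"))  # 0.5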

Several translation models are available for SMT. Some important models
are the phrase-based model, the syntax-based model and the factored model. Phrase
Based Statistical Machine Translation (PBSMT) is limited to the mapping of small text
chunks. The factored translation model is an extension of phrase-based models that
integrates linguistic information at the word level. This thesis proposes a preprocessing
method that applies linguistic tools to the development of an English to Tamil machine
translation system. In this translation system, external linguistic tools are used to
augment the parallel corpora with linguistic information. The pre- and post-processing
methodology proposed in this thesis is applicable to other language pairs as well.
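
As a toy illustration of the word-level decomposition that factored models enable, the Python sketch below translates the lemma and the morphological tag of a single factored word separately and then generates the Tamil surface form. All table entries, including the composite source-side tag "VBD+3sm" and the rough transliterations, are invented illustrations rather than learned mappings; in the proposed system such mappings are learned from the factored corpora, and surface forms are produced by the Tamil morphological generator of Chapter 8.

# Toy decomposition of one factored translation step: translate the
# lemma, translate the morphological tag, then generate the surface
# form. All entries are invented, roughly transliterated examples.
LEMMA_TABLE = {"come": "vaa"}            # English lemma -> Tamil root
TAG_TABLE   = {"VBD+3sm": "past+3sm"}    # source-side tag -> Tamil morph tag
SURFACE     = {("vaa", "past+3sm"): "vandhaan"}  # generation as a lookup;
                                         # the real generator applies suffix
                                         # and morpho-phonemic (sandhi) rules

def translate_factored(en_lemma, en_tag):
    ta_lemma = LEMMA_TABLE[en_lemma]     # lemma translation step
    ta_tag = TAG_TABLE[en_tag]           # morphological tag translation step
    return SURFACE[(ta_lemma, ta_tag)]   # surface generation step

print(translate_factored("come", "VBD+3sm"))  # -> vandhaan ("he came")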

1.2 OVERVIEW OF MACHINE TRANSLATION

Machine translation is one of the oldest and most active areas of natural
language processing. The word ‘translation’ refers to the transformation of text or speech
from one language into another. Machine translation can be defined as the application of
computers to the task of translating texts from one natural language to another. It is a
field of research focused on the linguistic concepts of syntax, semantics, pragmatics and
discourse.

Today a number of systems are available for producing translations, though they
are not perfect. In the process of translation, whether carried out manually or
automated through machines, the context of the text in the source language must be
conveyed exactly in the target language. Translation is not just word-level replacement.
A translator, whether machine or human, must interpret and analyse all the elements in
the text, must be familiar with all the issues that arise during the translation process,
and must know how to handle them. This requires in-depth knowledge of grammar,
sentence structure and meaning, as well as an understanding of each language’s culture
in order to handle idioms and phrases originating from different cultures. Cross-cultural
understanding is an important issue that bears on the accuracy of the translation.

Designing an automatic machine translation system is a great challenge for
humans. It is difficult to translate sentences while taking into consideration all the
required information. Humans need several revisions to make a perfect translation, and
no two individual human translators generate identical translations of the same text in
the same language pair. Hence it is an even greater challenge to design a fully automated
machine translation system that produces high-quality translations.

1.3 ROLE OF MACHINE TRANSLATION IN NLP


Natural Language Processing (NLP) is the field of computer science devoted to the
development of models and technologies enabling computers to use human languages
both as input and output [1]. The ultimate goal of NLP is to build computational models that equal human performance in the tasks of reading, writing, learning, speaking and understanding. Computational models are useful for exploring the nature of linguistic communication as well as for enabling effective human-machine interaction. Jurafsky
and Martin (2005) [2] describe Natural Language Processing as “computational
techniques that process spoken and written human language as language”. According to
the Microsoft researchers, the goal of the Natural Language Processing (NLP) is “to
design and build software that will analyze, understand and generate languages that
humans use naturally, so that eventually one will be able to address their computer like
addressing another person”.

Machine Translation is used for translating texts for assimilation purposes, which aids bilingual or cross-lingual communication, and also for searching, accessing and understanding foreign language information from databases and web pages [3].
field of information retrieval a lot of research is going on in Cross-Language
Information Retrieval (CLIR), i.e. information retrieval systems capable of searching
databases in many different languages [4].

Construction of robust systems for speech-to-speech translation to facilitate "cross-lingual" oral communication has been the dream of speech and natural language researchers for decades, and machine translation is an important module in speech translation systems. Currently, computer-assisted learning plays a major role in the academic environment. The use of Machine Translation in language learning has not yet received enough attention because of the poor quality of automatic translation output. With a good automatic translation system, students can improve their translation and writing skills; such systems can break the language barriers of students and language learners.

1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM

Traditionally, rule based approaches were used to develop machine translation systems. A rule based approach feeds the rules into the machine using appropriate representations, but feeding all linguistic knowledge into a machine would be very hard. In this context, the statistical approach to Machine Translation has some attractive qualities that have made it the preferred approach in machine translation research over the past two decades. Statistical translation models learn translation patterns directly from data and generalize them to translate new text. The SMT approach is largely language-independent, i.e. the models can be applied to any language pair.

Systems based on statistical methods have several practical advantages over traditional rule-based systems. In SMT, implementation and development times are much shorter, and SMT can be improved by coupling in new models for reordering and decoding. It only needs parallel corpora to learn a translation system. In contrast, a rule based system needs transfer rules which only linguistic experts can create. These rules are entirely dependent on the language pair involved, and defining general "transfer rules" is not an easy task, especially for languages with different structures [5].

An SMT system can be developed rapidly if an appropriate corpus is available. A Rule Based Machine Translation (RBMT) system requires a lot of development and customization cost until it reaches the desired quality threshold. Packaged RBMT systems have already been developed, and it is extremely difficult to reprogram their models and equivalences. Above all, RBMT has a much longer development process involving more human resources, and an RBMT system is retrained by adding new rules and vocabulary, among other things [5].

Statistical Machine Translation works well for translations in a specific domain when the engine is trained with a bilingual corpus in that domain. An SMT system requires more computing resources in terms of hardware to train the models: billions of calculations need to take place during the training of the engine, and the computing knowledge required for it is highly specialized. However, training time can be reduced nowadays thanks to the wider availability of more powerful computers. RBMT requires a longer deployment and compilation time by experts, so that, in principle, building costs are also higher. SMT learns statistical patterns automatically, including a good treatment of exceptions to rules. The rules governing the transfer in RBMT systems can certainly be seen as special cases of statistical patterns; nevertheless, they generalize too much and cannot handle exceptions. Finally, SMT systems can be upgraded with syntactic and even semantic information, like RBMT systems. An SMT engine can generate improved translations when retrained or adapted; in contrast, an RBMT system generates very similar translations after retraining [5].

SMT systems, in general, have trouble handling the morphology on the source or the target side, especially for morphologically rich languages. Errors in morphology can have severe consequences for the meaning of a sentence: they change the grammatical function of words, or change the interpretation of the sentence through a wrong verb tense. Factored translation models try to solve this issue by explicitly handling morphology on the generation side.

Another advantage of a Statistical Machine Translation system is that it generates a more natural translation, closer to the literal translation of the input sentence. Symbolic approaches to machine translation take great human effort in language engineering. In knowledge based machine translation, for example, designers must first find out what kinds of linguistic, general common-sense and domain-specific knowledge are important for the task. Then they have to design an interlingua representation for this knowledge and write grammars to parse the input sentences; output sentences are generated from the interlingua representation. All of this requires expertise in language technologies, and it involves tedious and laborious work.

The major advantage of a Statistical Machine Translation system is its learnability. Once a model is set up, it can learn automatically with well-studied algorithms for parameter estimation; the parallel corpus thereby replaces human expertise for the task. The coverage of the grammar is also one of the serious problems in rule based systems, and a Statistical Machine Translation system is a good candidate to meet this criterion: it can learn to have good coverage as long as the training data is representative enough. It can also statistically model the noise in spoken language, so it does not have to make a binary keep/abandon decision and is therefore more robust to noisy data [5].

1.5 MOTIVATION OF THE THESIS
Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. Even though machine translation was envisioned as a computer application in the 1950s, it is still considered an open problem [3].

The demand for machine translation is growing rapidly. As multilingualism is considered to be a part of democracy, the European Union funds EuroMatrixPlus [6], a project to build machine translation systems for all European language pairs, so that documents can be translated automatically into its 23 official languages instead of manually. The United Nations (UN), which translates a large number of documents into several languages, has created bilingual corpora for language pairs such as Chinese-English and Arabic-English; these are among the largest bilingual corpora distributed through the Linguistic Data Consortium (LDC). On the World Wide Web, around 20% of web pages and other resources are available in national languages. Machine Translation can be used to translate these web pages and resources into a required language in order to understand their content, thereby reducing the effect of language as a barrier to communication [7].

In a linguistically diverse country like India, machine translation is an essential technology. Human translation has been widely prevalent in India since ancient times, as is evident from the various works of philosophy, arts, mythology, religion and science which have been translated among ancient and modern Indian languages. Numerous classic works of art, whether ancient, medieval or modern, have also been translated between European and Indian languages since the 18th century. As of now, human translation in India finds application mainly in administration, media and education, and to a lesser extent in business, arts, and science and technology [8].

India has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of India. English is the language most widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages.

In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is done manually; the use of automation is largely restricted to word processing. Two specific examples of high-volume manual translation are the translation of news from English into local languages, and the translation of annual reports of government departments and public sector units among English, Hindi and the local languages. Many resources such as news, weather reports and books in English are being manually translated into Indian languages, with news and weather reports from around the world translated most often. Human translation is slow, and it costs more than machine translation. It is clear from this that there is a large market for machine translation from English into Indian languages, since machine translation is faster and cheaper than human translation.

Tamil, a Dravidian language, is spoken by around 72 million people and has official status in the state of Tamilnadu and the Indian union territory of Puducherry. Tamil is also an official language of Sri Lanka and Singapore, and is spoken by significant minorities in Malaysia and Mauritius as well as by emigrant communities around the world. It is one of the 22 scheduled languages of India and was declared a classical language by the Government of India in 2004 [9].

In this thesis, a methodology for English to Tamil Statistical Machine Translation is proposed, along with a pre-processing technique. This pre-processing method is used to handle the morphological variance between English and Tamil. Linguistic tools are developed to generate linguistically motivated data for the English-Tamil factored translation model.

1.6 OBJECTIVE OF THE THESIS


The main aim of this research is to develop a morphology based prototype Statistical Machine Translation system for English to Tamil by integrating different linguistic tools. The research also addresses the issue of how a morphologically correct sentence is generated when translating from a morphologically simple language into a morphologically rich language. The objectives of the research are detailed as follows:

• Develop a pre-processing module (Reordering, Compounding and Factorization) for English language sentences to transform their structure to be more similar to that of Tamil.

The pre-processing module for the source language includes three stages: reordering, factorization and compounding. In the reordering stage, the source language sentence is syntactically reordered according to Tamil language syntax. After reordering, the English words are factored into lemma and other morphological features. This is followed by the compounding process, in which the various function words are removed from the reordered sentence and attached as a morphological factor to the corresponding content word.

• Develop a Tamil Part-of-Speech (POS) tagger to label the Tamil words in a sentence.

The Tamil POS tagger is to be developed using a Support Vector Machine (SVM) based machine learning tool. A POS-annotated corpus will be created for training the automatic tagger system.

• Develop a Morphological Analyser to segment Tamil surface words into linguistic factors.

The morphological analyzer system is to be developed using a machine learning approach. The POS tagger and morphological analyser tools are to be used for pre-processing the Tamil language sentences, and the linguistic information from these tools is to be incorporated into the surface words before SMT training.

• Build a Morphology based prototype Factored Statistical Machine Translation (F-SMT) system for English to Tamil.

After pre-processing, the bi-lingual sentences are to be created and transformed into factored bi-lingual sentences. Monolingual corpora for Tamil are collected and factored using the Tamil POS tagger and morphological analyser. These sentences will be used for training the factored Statistical Machine Translation model.

• Develop a Tamil Morphological Generator system to generate Tamil surface word forms.

The morphological generator transforms the translation output into a grammatically correct target language sentence. It is used in the post-processing module of the English to Tamil machine translation system.

1.7 RESEARCH METHODOLOGY

1.7.1 Overall System Architecture

Tamil is a morphologically rich language with a relatively free word order following the Subject-Object-Verb (SOV) pattern. English is morphologically simple with a fixed word order of Subject-Verb-Object (SVO). A baseline SMT system would not perform well for languages with different word orders and disparate morphological structures. To resolve this, factored models are introduced into the SMT system. The factored model, a subtype of the SMT system, allows multiple levels of representation of a word, from the most specific level to more general levels of analysis such as lemma, part-of-speech and morphological features [10]. Figure 1.1 shows the overall architecture of the proposed English to Tamil SMT system. The preprocessing module is externally attached to the factored SMT system. This module converts bilingual corpora into factored bilingual corpora using morphology based linguistic tools and reordering rules. After preprocessing, the representation of the source language sentence syntax closely follows the sentence structure of the target language. This transformation decreases the complexity of alignment, which is one of the key problems in a baseline SMT system.

Parallel corpora are used to train the statistical translation models. The parallel corpora are created and converted into factored parallel corpora by preprocessing: English sentences are factored using the Stanford Parser tool, and Tamil sentences are factored using the Tamil POS tagger and morphological analyzer. A monolingual corpus, used in the language model, is collected from various newspapers and factored using the Tamil linguistic tools. Finally, in post-processing, the Tamil morphological generator is used to generate surface words from the output factors.

Figure 1.1 Morphology based Factored SMT for English to Tamil language

1.7.2 Details of Pre-processing English Language Sentence

A Machine Translation system for a language pair with disparate morphological structures needs appropriate pre-processing or modeling before translation. The preprocessing can be performed on the raw source language sentence to make it more appropriate for translating into the target language sentence. The pre-processing module for the English language sentence consists of reordering, factorization and compounding.

1.7.2.1 Reordering English Language Sentence

Reordering means rearranging the word order of the source language sentence into a word order that is closer to that of the target language sentence. It is an important process for languages which differ in their syntactic structure. The English and Tamil language pair has disparate syntactic structures: English word order is Subject-Verb-Object (SVO) whereas Tamil word order is Subject-Object-Verb (SOV). For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object [11]. English syntactic relations are retrieved from the Stanford Parser tool, and based on the reordering rules the source language sentence is reordered.
Reordering rules are handcrafted using the syntactic word order differences between English and Tamil; 180 reordering rules were created based on the sentence structures of the two languages. Reordering significantly improves the performance of the Machine Translation system. A lexicalized distortion reordering model is implemented in the Moses toolkit [180], but this automatic reordering is good only for short-range movement; therefore an external tool or component is needed for handling long-distance reordering. Reordering is also one way of indirectly integrating syntactic information into the source language. 80% of English sentences are reordered correctly according to the developed rules. An example of English reordering is given in Figure 1.2.

English Sentence: I bought vegetables to my home.

Reordered English: I my to home vegetables bought

Tamil Sentence : நான் என் ைடய ட் ற்கு காய்கறிகள் வாங்கிேனன் .

Figure 1.2 Reordering of English language
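
As an informal illustration of how one such rule can operate, the following Python sketch applies a single hypothetical SVO-to-SOV rule over POS-tagged input; the real system uses 180 handcrafted rules over Stanford Parser output and handles far more structure:

    # Hypothetical reordering rule: move the main verb of a simple SVO
    # clause to clause-final position, approximating Tamil SOV order.
    def reorder_svo_to_sov(tagged):
        """tagged: list of (word, POS) pairs for one simple clause."""
        verbs = [i for i, (_, pos) in enumerate(tagged) if pos.startswith("VB")]
        if not verbs:
            return tagged
        v = verbs[0]                      # first (main) verb of the clause
        return tagged[:v] + tagged[v+1:] + [tagged[v]]

    sent = [("I", "PRP"), ("bought", "VBD"), ("vegetables", "NNS")]
    print(reorder_svo_to_sov(sent))
    # [('I', 'PRP'), ('vegetables', 'NNS'), ('bought', 'VBD')]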

1.7.2.2 Factorization of English Language Sentence

Factored models can be used for morphologically rich languages in order to reduce the amount of bi-lingual data required. Factorization refers to splitting a word into linguistic factors and integrating them as a vector. The Stanford Parser is used to parse the English sentences. From the parse tree, linguistic information such as the lemma, part-of-speech tags, syntactic information and dependency information is retrieved. This linguistic information is integrated as factors on the original word.
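
A minimal sketch of this factorization step is given below; the pipe-separated surface|lemma|coarse-POS|fine-POS format mirrors Table 1.1, and the lemma and tags are assumed to come from the upstream parse:

    # Rewrite each surface word as a vector of linguistic factors, the
    # representation used in the factored training corpora.
    def factorize(parsed_tokens):
        """parsed_tokens: list of (surface, lemma, coarse_pos, fine_pos)."""
        return " ".join("|".join(tok) for tok in parsed_tokens)

    tokens = [("bought", "buy", "V", "VBD"),
              ("vegetables", "vegetable", "N", "NNS")]
    print(factorize(tokens))   # bought|buy|V|VBD vegetables|vegetable|N|NNS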

1.7.2.3 Compounding for English Language Sentence

Compounding is defined as adding additional morphological information to the morphological factor of source (English) language words [188]. The additional morphological information includes function words, subject information, dependency relations, auxiliary verbs and modal verbs. This information is based on the morphological structure of the Tamil language sentence. In the compounding phase, the function words are identified from the English factored corpora using dependency information. After the function words are found, they are removed from the factored sentence and attached as a morphological factor to the corresponding content word. The compounding process thus reduces the length of the English sentence. Like function words, auxiliary verbs and modal verbs are also removed and attached as a morphological factor of the source language word. After this step, the morphological representation of the English sentence is similar to that of the Tamil sentence. Compounding also indirectly integrates dependency information into the source language factors. Table 1.1 and Table 1.2 show the factored and compounded sentences respectively.

Table 1.1 Factored English Sentences

I | i | PN | prn
my | my | PN | PRP$
home | home | N | NN
to | to | TO | TO
vegetables | vegetable | N | NNS
bought | buy | V | VBD .

Table 1.2 Compounded English Sentences

I | i | PN | prn_i
my | my | PN | PRP$
home | home | N |NN_to
vegetables | vegetable | N | NNS
bought | buy | V | VBD_1S.
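
The following sketch illustrates the compounding idea on a simplified, assumed input: a function word is deleted as a token and its lemma is appended to the morphological factor of its content-word head, mirroring the way Tamil realizes such words as suffixes (the dependency pairs and function-word tag set are invented for illustration):

    FUNCTION_POS = {"TO", "IN"}   # assumed tags treated as function words

    def compound(factored, head_of):
        """factored: list of [surface, lemma, pos, morph] vectors; head_of
        maps a function word's index to the index of its content-word head."""
        out = []
        for i, tok in enumerate(factored):
            if tok[2] in FUNCTION_POS and i in head_of:
                # drop the token, attach e.g. "_to" to the head's morph factor
                factored[head_of[i]][3] += "_" + tok[1]
            else:
                out.append(tok)
        return out

    sent = [["home", "home", "N", "NN"], ["to", "to", "TO", "TO"]]
    print(compound(sent, {1: 0}))   # [['home', 'home', 'N', 'NN_to']]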

1.7.3 Details of Pre-processing for Tamil Language Sentence

Like the English sentences, the Tamil sentences are pre-processed using linguistic tools, namely a Part-of-Speech (POS) tagger and a morphological analyzer. Tamil surface words are segmented into linguistic information, and this information is integrated as factors in the SMT training corpora. A Tamil sentence is first given to the Part-of-Speech tagger tool; from the resulting part-of-speech information, a simplified part-of-speech tag is identified. Based on this simplified tag, each word is given to the Tamil morphological analyzer, which splits the word into its lemma and morphological information. Both the parallel corpora and the monolingual corpora are preprocessed in this stage.
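
This Tamil-side pipeline can be sketched as follows; the tagger, tag-simplification and analyzer functions here are placeholders standing in for the tools described in the next two subsections:

    # Factor a Tamil sentence: POS-tag, simplify the tag, then analyze.
    def preprocess_tamil(sentence, pos_tag, simplify, analyze):
        factored = []
        for word, tag in pos_tag(sentence):
            m_pos = simplify(tag)                 # e.g. fine tag -> "V"
            lemma, morph = analyze(word, m_pos)   # segment the surface word
            factored.append("|".join([word, lemma, m_pos, morph]))
        return " ".join(factored)

    # Toy demo with stub tools standing in for the real tagger and analyzer.
    print(preprocess_tamil(
        "வாங்கினேன்",
        pos_tag=lambda s: [(s, "VERB")],
        simplify=lambda t: "V",
        analyze=lambda w, p: ("வாங்கு", "PAST_1S"),
    ))   # வாங்கினேன்|வாங்கு|V|PAST_1S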

1.7.3.1 Tamil Part-of-Speech Tagger

POS tagging means labeling grammatical classes, i.e. assigning a part-of-speech tag to each and every word of a given sentence. Tamil sentences are POS tagged using the Tamil POS tagger tool. This tagger was developed using a Support Vector Machine (SVM) based machine learning tool, SVMTool [12], which makes the task simple and efficient. In this method, a POS-tagged corpus is created and used to generate a trained model: the SVMTool builds models from tagged sentences, and untagged sentences are then tagged using those models. 42k sentences (approximately 5 lakh words) were tagged for this Part-of-Speech tagger with the help of an eminent Tamil linguist. In experiments conducted with this tagged corpus, an overall accuracy of 94.6% was obtained on a test set containing 6K sentences (approximately 35 thousand words).
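
The kind of windowed feature encoding that underlies an SVM-based tagger can be sketched as below; the feature set shown is illustrative rather than SVMTool's actual configuration:

    # Features for classifying the tag of words[i]: surrounding words, the
    # previously predicted tag, and a word suffix (informative for an
    # agglutinative language like Tamil).
    def window_features(words, prev_tags, i):
        return {
            "w0": words[i],
            "w-1": words[i-1] if i > 0 else "<s>",
            "w+1": words[i+1] if i < len(words) - 1 else "</s>",
            "t-1": prev_tags[i-1] if i > 0 else "<s>",
            "suffix3": words[i][-3:],
        }

    print(window_features(["நான்", "வாங்கினேன்"], ["PRP"], 1))

Broadly, one binary classifier per tag is trained over such features, and at tagging time the highest-scoring tag is chosen word by word.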

1.7.3.2 Tamil Morphological Analyzer

After POS tagging, the sentences in the corpora are morphologically analyzed to find the lemma and morphological information. A morphological analyzer is a software tool used to segment a word into meaningful units. Morphological analysis of Tamil is a complex process because of the language's morphologically rich nature. Generally, rule based approaches are used to develop morphological analyzer systems, but for a morphologically rich language like Tamil the creation of rules is a challenging task. Here, a novel machine learning based approach is proposed and implemented for a Tamil verb and noun morphological analyzer. Additionally, this approach is tested on languages such as Malayalam, Telugu and Kannada.

The approach is based on sequence labeling and training by kernel methods. It captures the non-linear relationships and various morphological features of natural language words in a better and simpler way. In this machine learning approach, two training models are created for the morphological analyzer. The first model is trained using the sequence of input characters and their corresponding output labels; this trained Model-I is used for finding the morpheme boundaries. The second model is trained using sequences of morphemes and their grammatical categories; this trained Model-II is used for assigning a grammatical class to each morpheme. An SVM based tool was used for training the data. The resulting tool segments each word into its lemma and morphological information.
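
A minimal sketch of the Model-I step is shown below: each character carries a boundary label (B for a character that begins a morpheme, I for one inside a morpheme), and the word is cut at every B position. The romanized word and labels are invented for illustration:

    def segment(word, char_labels):
        """Cut `word` wherever Model-I predicted a 'B' boundary label."""
        morphemes, current = [], ""
        for ch, lab in zip(word, char_labels):
            if lab == "B" and current:
                morphemes.append(current)
                current = ""
            current += ch
        morphemes.append(current)
        return morphemes

    print(segment("vanthaan", list("BIIIIBII")))   # ['vanth', 'aan']

Model-II would then label each recovered morpheme with its grammatical category, for example distinguishing the root from a person-number-gender marker.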

1.7.4 Factored SMT System for English to Tamil Language

Factored translation is an extension of Phrase based Statistical Machine Translation (PBSMT) that allows the integration of additional morphological and lexical information, such as lemma, word class, gender and number, at the word level on the source and target languages. In the SMT system, three different toolkits are used for translation modeling, language modeling and decoding: GIZA++, SRILM and Moses. GIZA++ is a Statistical Machine Translation toolkit used to train IBM Models 1-5 and an HMM word alignment model; it is an extension of GIZA, which was designed as part of the EGYPT SMT toolkit. SRILM is a toolkit for language modeling that can be used in speech recognition, statistical tagging and segmentation, and Statistical Machine Translation. Moses is an open source SMT toolkit that allows translation models to be trained automatically for any language pair; all that is needed is a collection of translated texts (a parallel corpus). An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices. Figure 1.3 explains the mapping between English factors and Tamil factors in the factored SMT system.

Figure 1.3 Mapping English Word Factors to Tamil Word Factors

Morphological, syntactic and semantic information can be integrated as factors in the factored translation model during training. Initially, the English factors "Lemma" and "Minimized-POS" are aligned to the Tamil factors "Lemma" and "M-POS"; then the "Minimized-POS" and "Compound-Tag" factors of the English word are aligned to the "Morphological information" factor of the Tamil word. Importantly, new Tamil surface words are not generated in the SMT decoder: only factors are produced by the SMT system, and the surface word is generated in the post-processing stage, where the Tamil morphological generator builds a Tamil surface word from the output factors. The system is evaluated with different sentence patterns: for simple, continuous and modal-auxiliary sentences, 85% of the sentences are translated correctly, while for other sentence types the performance is 60%. The developed prototype machine translation system properly handles noun-verb agreement, which is an essential requirement for translating into morphologically rich languages like Tamil. BLEU and NIST evaluation scores clearly show that the factored model with an integration of linguistic knowledge gives better results for the English to Tamil Statistical Machine Translation system.
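
To make the two mapping steps concrete, the toy sketch below walks a single word through both steps; the lookup tables and tag names are invented stand-ins for the phrase tables learned during training:

    english = {"lemma": "buy", "min_pos": "V", "compound_tag": "VBD_1S"}

    # Step 1: English (lemma, minimized-POS) -> Tamil (lemma, M-POS).
    lemma_table = {("buy", "V"): ("வாங்கு", "V")}
    ta_lemma, ta_pos = lemma_table[(english["lemma"], english["min_pos"])]

    # Step 2: English (minimized-POS, compound-tag) -> Tamil morphological
    # information; the surface word is left to the post-processing stage.
    morph_table = {("V", "VBD_1S"): "PAST_1S"}
    ta_morph = morph_table[(english["min_pos"], english["compound_tag"])]

    print(ta_lemma, ta_pos, ta_morph)   # வாங்கு V PAST_1S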

1.7.5 Post-processing for English to Tamil SMT

Post-processing is employed to generate a Tamil surface word from the output factors. In the factored SMT system, the aim is to translate factors only, not to generate a surface word. Due to the morphologically rich nature of the Tamil language, word generation is handled separately: the morphological generator is applied in the post-processing stage of the English to Tamil Machine Translation system. Post-processing thus transforms the translated factors into a grammatically correct target language sentence.

1.7.5.1 Tamil Morphological Generator

The Tamil morphological generator receives the factors in the form "lemma + word_class + morpho-lexical information", where the lemma specifies the lemma of the word form to be generated, the word_class denotes the grammatical category, and the morpho-lexical information states the type of inflection. These factors are the output of the proposed Machine Translation system. A novel suffix based approach is developed for the Tamil morphological generator. Tamil noun and verb paradigm classification is done based on case and tense markers respectively; in Tamil, verbs are classified into 32 paradigms and nouns into 25 [13]. The noun and verb paradigms are used for creating a suffix table. The morphological generator system is divided into three modules. The first module takes the lemma and word class as input and gives the lemma's paradigm number and the word's stem as output; this paradigm number is referred to as the column index, and it provides information about all the possible inflected forms of a lemma in a particular word class. The second module takes the morpho-lexical information as input and gives its index number as output: from the complete morpho-lexical information list, the index number of the corresponding input morpho-lexical information factor is identified, and this is referred to as the row index. In the third module, a two-dimensional suffix table is used to look up the suffix at the given row index and column index. Finally, the identified suffix is attached to the stem to create the word form. For pronouns, a pattern matching approach is followed to generate the pronoun word form.
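
The three-module lookup can be sketched with a toy suffix table as below; the paradigm number, row indices and suffix strings are invented for illustration, whereas the real tables cover the 32 verb and 25 noun paradigms:

    PARADIGM = {("வாங்கு", "V"): (3, "வாங்க")}   # lemma -> (column, stem)
    MORPH_INDEX = {"PAST_1S": 0, "PAST_3SM": 1}  # morpho-lexical info -> row
    SUFFIX_TABLE = [                             # rows x paradigm columns
        {3: "ினேன்"},
        {3: "ினான்"},
    ]

    def generate(lemma, word_class, morph_info):
        col, stem = PARADIGM[(lemma, word_class)]   # module 1: column index
        row = MORPH_INDEX[morph_info]               # module 2: row index
        return stem + SUFFIX_TABLE[row][col]        # module 3: table lookup

    print(generate("வாங்கு", "V", "PAST_1S"))      # வாங்கினேன் ("I bought")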

1.8 RESEARCH CONTRIBUTIONS

This thesis shows how pre-processing and post-processing can be used to improve Statistical Machine Translation from English to Tamil. The main focus of this research is on translation from English into Tamil, but also on the development of linguistic tools for the Tamil language. The contributions are:
• Introduced a novel pre-processing method for English sentences based on reordering and compounding. Reordering rearranges the English sentence structure according to the Tamil sentence structure; compounding removes the function words and auxiliaries and merges them into the morphological factor of the corresponding content word.
• Created a Tamil POS tagger and a tagged corpus of 5 lakh words as part of the pre-processing of Tamil language sentences.
• Introduced a novel method for developing a Tamil morphological analyser based on a machine learning approach. The corpora developed for this approach contain 4 lakh morphologically segmented Tamil verbs and 2 lakh Tamil nouns.
• Introduced a novel algorithm for developing a Tamil morphological generator using paradigms and suffixes. Using this generator, it is possible to generate 10 thousand distinct word forms from a single Tamil verb.
• Successfully integrated these pre-processing and post-processing modules and developed an English to Tamil factored SMT system.

1.9 ORGANIZATION OF THE THESIS

This thesis is divided into ten chapters. Figure 1.4 shows the organization of the thesis.

Figure 1.4 Thesis Organization
Chapter 1: Introduction
Chapter 2: Literature Survey
Chapter 3: Background
Chapter 4: Preprocessing the English Language
Chapter 5: POS Tagger for Tamil (preprocessing the Tamil language)
Chapter 6: Morphological Analyzer for Tamil (preprocessing the Tamil language)
Chapter 7: Factored SMT
Chapter 8: Morphological Generator for Tamil
Chapter 9: Experiments and Results
Chapter 10: Conclusion

This thesis is organized as follows. A general introduction is presented in Chapter 1. Chapter 2 presents the literature survey of linguistic tools and available Machine Translation systems for Indian languages. In Chapter 3, the theoretical background and language processing for Tamil are described. Chapter 4 covers the different stages of preprocessing English language sentences: reordering, factorization and compounding. Chapters 5 and 6 present the preprocessing of Tamil sentences using linguistic tools: Chapter 5 explains the development of the Tamil POS tagger, and Chapter 6 describes the morphological analyzer for Tamil, which is developed based on the new machine learning based approach, along with detailed descriptions of the method and data resources. Chapter 7 presents the factored SMT system for English to Tamil and explains how the factored corpora are trained and decoded using the SMT toolkit. Post-processing for the Tamil language is discussed in Chapter 8, where the morphological generator is used as the post-processing tool; this chapter also gives a detailed description of the new algorithm developed for the Tamil morphological generator. Chapter 9 explains the experiments and results of the English to Tamil Statistical Machine Translation system, describing the training and testing details of the SMT toolkit; the output of the developed system is evaluated using the BLEU and NIST metrics. Finally, Chapter 10 concludes the thesis and outlines future directions for this research.

CHAPTER 2

LITERATURE SURVEY
This chapter presents the state of the art in the field of Tamil linguistic tools and Machine Translation systems. The Tamil linguistic tools include the POS tagger, morphological analyzer and morphological generator. The chapter reviews the literature on linguistic tools and Machine Translation systems for Indian languages in general and the Tamil language in particular.

2.1 PART OF SPEECH TAGGER

Part-of-Speech (POS) tagging is the process of assigning a part-of-speech or other lexical class marker to each and every word in a sentence. It is similar to the process of tokenization for computer languages. POS tagging is considered an important process in speech recognition, natural language parsing, morphological parsing, information retrieval and machine translation.

Different approaches have been used for Part-of-Speech (POS) tagging, the notable ones being rule-based, stochastic, and transformation-based learning approaches. Rule-based taggers try to assign a tag to each word using a set of hand-written rules. These rules could specify, for instance, that a word following a determiner and an adjective must be a noun. This means that the set of rules must be properly written and checked by human experts. The stochastic (probabilistic) approach uses a training corpus to pick the most probable tag for a word [14-17]. All the probabilistic methods cited above are based on first-order or second-order Markov models.

There are a few other techniques which use the probabilistic approach for POS tagging, such as the Tree Tagger [18]. Finally, the transformation-based approach combines the rule-based and statistical approaches: it picks the most likely tag based on a training corpus and then applies a certain set of rules to see whether the tag should be changed to anything else, saving any new rules it has learnt in the process for future use. One example of an effective tagger in this category is the Brill tagger [19-22]. All of the approaches discussed above fall under the rubric of supervised POS tagging, where a pre-tagged corpus is a prerequisite. On the other hand, there is the unsupervised POS tagging technique [23] [24] [25], which does not require any pre-tagged corpora. Koskenniemi (1985) [26] also used a rule-based approach implemented with finite-state machines.

Greene and Rubin (1971) [27] used a rule-based approach in the TAGGIT program, which was an aid in tagging the Brown corpus [28]. TAGGIT disambiguated 77% of the corpus; the rest was done manually over a period of several years.

Derouault and Merialdo (1986) [29] used a bootstrap method for training. At first, a relatively small amount of text was manually tagged and used to train a partially accurate model. The model was then used to tag more text, and the tags were manually corrected and used to retrain the model. Church (1988) [15] uses the tagged Brown corpus for training. These models involve probabilities for each word in the lexicon, and hence a large tagged corpus is required for reliable estimation. Jelinek (1985) [30] used a Hidden Markov Model (HMM) for training a text tagger. Parameter smoothing can be conveniently achieved using the method of 'deleted interpolation', in which weighted estimates are taken from second- and first-order models and a uniform probability distribution.

Kupiec (1992) [31] used word equivalence classes (referred to here as ambiguity classes) based on parts of speech to pool data from individual words. The most common words are still represented individually, as sufficient data exist for robust estimation. Yahya O. Mohamed Elhadj (2004) [32] presents the development of an Arabic part-of-speech tagger that can be used for analyzing and annotating traditional Arabic texts, especially the text of the Quran. The developed tagger employs an approach that combines morphological analysis with Hidden Markov Models (HMMs) based on the Arabic sentence structure. The morphological analysis is used to reduce the size of the tag lexicon by segmenting Arabic words into their prefixes, stems and suffixes, owing to the fact that Arabic is a derivational language. The HMM, on the other hand, is used to represent the Arabic sentence structure in order to take the linguistic combinations into account.

In the recent literature, several approaches to POS tagging based on statistical and machine learning techniques have been applied, including Hidden Markov Models [33], Maximum Entropy taggers [34], Transformation-based learning [35], Memory-based learning [36], Decision Trees [37], and Support Vector Machines [38]. Most of the previous taggers have been evaluated on the English WSJ corpus, using the Penn Treebank set of POS categories and a lexicon constructed directly from the annotated corpus. Although the evaluations were performed with slight variations, there was a wide consensus in the late 90s that the state-of-the-art accuracy for English POS tagging was between 96.4% and 96.7%. In recent years, the most successful and popular taggers in the NLP community have been the HMM-based TnT tagger, the Transformation-based learning (TBL) tagger [35] and several variants of the Maximum Entropy (ME) approach [34].

The SVMTool [38] is intended to comply with all the requirements of modern
NLP technology, by combining simplicity, flexibility, robustness, portability and
efficiency with state–of–the–art accuracy. This is achieved by working in the Support
Vector Machines (SVM) learning framework, and by offering NLP researchers a highly
customizable sequential tagger generator.

TnT is an example of a really practical tagger for NLP applications. It is available to anybody, simple and easy to use, considerably accurate, and extremely efficient, allowing training on one-million-word corpora in just a few seconds and tagging thousands of words per second [39]. In the case of the TBL and ME approaches, their great success has been due to the flexibility they offer in modeling contextual information, with ME being slightly more accurate than TBL.

2.1.1 Part-of-Speech Tagger for Indian Languages


Various approaches have been used for developing Part-of-Speech taggers for Indian languages.

Smriti Singh et al. (2006) [40] proposed a tagger for Hindi that uses the affix information stored in a word and assigns a POS tag using no contextual information. By considering the previous and the next word in the Verb Group (VG), it correctly identifies the main verb and the auxiliaries. Lexicon lookup is used for identifying the other POS categories.

A Hidden Markov Model (HMM) based tagger for Hindi was proposed by Manish Shrivastava and Pushpak Bhattacharyya (2008) [41]. The authors attempted to utilize the morphological richness of the language without resorting to complex and expensive analysis. The core idea of their approach was to explode the input in order to increase its length and reduce the number of unique types encountered during learning. This in turn increases the probability score of the correct choice while simultaneously decreasing the ambiguity of the choices at each stage.

In the NLPAI ML contest, Dalal et al. (2006) [42] achieved accuracies of 82.22% and 82.4% for Hindi POS tagging and chunking respectively using maximum entropy models. Karthik et al. (2006) [43] obtained 81.59% accuracy for Telugu POS tagging using HMMs.

Nidhi Mishra and Amit Mishra [44] proposed a Part-of-Speech tagging system for a Hindi corpus in 2011. In the proposed method, the system scans the Hindi corpus and extracts the sentences and words from it. The system also searches for the tag pattern in a database and displays the tag of each Hindi word, such as a noun tag, adjective tag, number tag or verb tag.

Based on lexical sequence constraints, a POS tagging algorithm for Hindi was proposed by Pradipta Ranjan Ray (2003) [45]. The proposed algorithm acts as a first-level part-of-speech tagger, using constraint propagation based on ontological information, morphological analysis information and lexical rules. Even though the performance of the POS tagger has not been statistically tested due to the lack of lexical resources, it covers a wide range of language phenomena and accurately captures the four major local dependencies in Hindi.

Sivaji Bandyopadhyay et al. (2006) [46] came up with a rule-based chunker for Bengali which gave an accuracy of 81.64%. The chunker was developed using a rule-based approach since adequate training data was not available. A list of suffixes was prepared for handling unknown words; they used 435 suffixes, many of which usually appear at the end of verb, noun and adjective words.

For Telugu, three POS taggers have been proposed using different POS tagging approaches, viz. (1) a rule-based approach, (2) the Transformation-based learning (TBL) approach of Eric Brill, and (3) a Maximum Entropy model, a machine learning technique [47].

For Bengali, Sandipan et al. (2007) [48] developed a corpus-based semi-supervised learning algorithm for POS tagging based on HMMs. Their system uses a small tagged corpus (500 sentences) and a large unannotated corpus along with a Bengali morphological analyzer. When tested on a corpus of 100 sentences (1003 words), their system obtained an accuracy of 95%.

Antony P J and Soman K P [49] of Amrita University, Coimbatore proposed a statistical approach to building a POS tagger for the Kannada language using SVMs. They proposed a tagset consisting of 30 tags. The proposed POS tagger for Kannada is based on a supervised machine learning approach and was modeled using an SVM kernel.

A stochastic Hidden Markov Model (HMM) based part-of-speech tagger has been proposed for Malayalam. To build a Part-of-Speech tagger using the stochastic approach, an annotated corpus is required; due to the non-availability of an annotated corpus, a morphological analyzer was also developed to generate a tagged corpus from the training set [50]. Antony P. J. et al. (2010) [51] developed a tagset and a tagged corpus of 180,000 words for the Malayalam language; this tagged corpus is used for training the system. The SVM-based tagger achieves 94% accuracy, an improved result compared to the HMM-based tagger.

2.1.2 Part of Speech Tagger for Tamil Language


Various methodologies have been developed for POS tagging of the Tamil language. A rule-based POS tagger for Tamil was developed and tested by Arulmozhi P. et al. [52]; this system gives only the major tags, and the sub-tags are overlooked during evaluation. A hybrid POS tagger for Tamil using the HMM technique and a rule-based system was also developed [53].

A part-of-speech tagging scheme tags a word in a sentence with its part of speech. It is done in three stages: pre-editing, automatic tag assignment, and manual post-editing. In pre-editing, the corpus is converted to a suitable format so that a part-of-speech tag can be assigned to each word or word combination. Because of orthographic similarity, one word may have several possible POS tags. After the initial assignment of possible POS tags, the words are manually corrected to disambiguate them in the texts.

Vasu Ranganathan’s Tagtamil (2001)

Tagtamil by Vasu Ranganathan [55] is based on a lexical phonological approach. Tagtamil handles the morphotactics of morphological processing of verbs using an index method, and performs both tagging and generation.

Ganesan’s POS tagger (2007)

Ganesan [56] has prepared a POS tagger for Tamil. His tagger works well on the CIIL corpus; its efficiency on other corpora has yet to be tested. He has a rich tagset for Tamil. He tagged a portion of the CIIL corpus using a dictionary as well as a morphological analyzer, corrected the output manually, and trained on the rest of the corpus with it. The tags are added morpheme by morpheme, for example:

pUkkaLai : pU_N_PL_AC

vawthavan : va_IV_wth_PT_avan_3PMS

Kathambam of RCILTS-Tamil

Kathambam attaches part-of-speech tags to the words of a given Tamil document. It uses heuristic rules based on Tamil linguistics for tagging and uses neither a dictionary nor a morphological analyzer. It gives 80% efficiency on large documents, uses 12 heuristic rules, and identifies the tags based on PNG (person-number-gender), tense and case markers. Standalone words are checked against lists stored in the tagger. It uses a 'fill-in rule' to tag unknown words, and also uses bigrams to identify an unknown word from the previous word's category.

Lakshmana Pandian S and Geetha T V (2008) [54] developed a morpheme-based language model for Tamil Part-of-Speech tagging. A language model based on information about the stem type, the last morpheme, and the morpheme preceding the last morpheme of a word was developed for categorizing its part of speech. For estimating the contribution factors of the model, they followed the generalized iterative scaling technique.

Lakshmana Pandian S and Geetha T V (2009) [57] developed CRF models for Tamil part-of-speech tagging and chunking. This method avoids a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models. The language models are developed using CRFs and are designed based on the morphological information of Tamil.

Selvam and Natarajan (2009) [58] developed a rule-based morphological analyser and POS tagger for Tamil and improved these systems using projection and induction techniques. A rule-based morphological analyzer and POS tagger can be built from well-defined morphological rules of Tamil, while projection and induction techniques allow POS tagging, base noun-phrase bracketing, named entity tagging and morphological analysis to be transferred from a resource-rich language to a resource-deficient language. They applied alignment and projection techniques for projecting POS tags, and alignment, lemmatization and morphological induction techniques for inducing root words from English to Tamil. Categorical information and root words are obtained from POS projection and morphological induction respectively, from English via alignment across sentence-aligned corpora. They generated more than 600 POS tags for rule-based morphological analysis and POS tagging.

2.2 MORPHOLOGICAL ANALYZER AND GENERATOR


Different methodologies have been adopted for developing morphological analyzers in various languages. Generally, rule based approaches have dominated solutions to the morphological analysis problem; nowadays, statistical methods are being introduced for solving the analysis task. For example, Thai morphological analysis based on the theoretical background of Conditional Random Fields (CRFs) formulates an unsegmented language as a sequential supervised learning problem [59]. Memory-based learning has been successfully applied to morphological analysis and part-of-speech tagging in Western and Eastern-European languages [60]: MBMA (Memory-Based Morphological Analysis) is a memory-based learning system, where memory-based learning is a class of inductive supervised machine learning algorithms that learn by storing examples of a task in memory. A corpus-based morphological analyzer for unvocalized Modern Hebrew was developed by combining statistical methods with rule-based syntactic analysis [61].

John Goldsmith (2001) [62] shows how stems and affixes can be inferred from a large un-annotated corpus. A data-driven method for automatically analyzing the morphology of ancient Greek used a nearest-neighbor machine learning framework [63]. A language modeling technique that selects the optimal segmentation rather than using heuristics has been proposed for Thai morphological analysis [64].

2.2.1 Morphological Analyzer and Generator for Indian Languages

T. N. Vikram and Shalini R (2007) [65] developed a prototype morphological analyzer for the Kannada language based on finite state machines (FSMs). The prototype can simultaneously serve as a stemmer, part-of-speech tagger and spell checker. The proposed morphological analyzer tool does not handle compound-formation morphology and can handle a maximum of 500 distinct nouns and verbs.

More recently, Shambhavi B. R. and Ramakanth Kumar (2011) [66] developed a paradigm-based morphological generator and analyzer using a trie-based data structure. The disadvantage of a trie is that it consumes more memory, as each node can have at most y children, where y is the alphabet count of the language. As a result, the system can handle up to a maximum of 3700 root words and around 88K inflected words.

Uma Maheshwar Rao G. and Parameshwari K. of CALTS, University of Hyderabad (2010) [67] attempted to develop morphological analyzers and generators for the South Dravidian languages. MORPH, a network and process model for Kannada morphological analysis and generation, was developed by K. Narayana Murthy (2001) [68]; the performance of that system is 60 to 70% on general texts.

For Bengali, an unsupervised methodology was used in developing a morphological analyzer system [69], and a two-level morphology approach was used to handle Bengali compound words. Rule-based morphological analyzers were developed for the Sanskrit [71] and Oriya [70] languages.

2.2.2 Morphological Analyzer and Generator for Tamil Language

In the Tamil language, the first step towards the preparation of a morphological analyzer was initiated by the Anusaraka group. Ganesan (2007) [56] developed a morphological analyzer for Tamil to analyze the CIIL corpus, taking phonological and morphophonemic rules into account in its construction. The Resource Centre for Indian Language Technological Solutions (RCILTS)-Tamil prepared a morphological analyzer (Atcharam) for Tamil, adopting a finite-automata state table for its development [72].

Tamil morphological analyzers and generators have been built based on various techniques and constraints such as morphotactics, morphological alternations, phonology and morphophonemics. Some of these works were reported by the authors Rajendran, Ganesan, Kapilan, Deivasundaram, Vishnavi, Ramasamy, Winston Cruz and Dhurai Pandi, and by organizations such as AU-KBC (Anna University-KBC) at the Madras Institute of Technology (MIT), Chennai and the Resource Centre for Indian Language Technological Solutions-Tamil (RCILTS-T) at Anna University, Chennai. A simple morphological tagger which identifies suffixes, labels them and separates root words from transliterated Tamil words has also been reported.

Parameswari K. (2010) [74] developed a Tamil morphological analyzer and generator using the APERTIUM toolkit. This attempt involves a practical adoption of lttoolbox for modern standard written Tamil in order to develop an improved open source morphological analyzer and generator. The tool uses finite state transducers (FSTs) for one-pass analysis and generation, and the database is developed in the morphological model called word and paradigm.

Vijay Sundar Ram R et al. (2010) [75] designed a Tamil morphological analyzer using a paradigm based approach and finite state automata (FSA), which work efficiently in recursive tasks and consider only the current state when making a transition. In this approach, complex affixations are easily handled by the FSA, and the required orthographic changes are handled in every state. They built an FSA using all possible suffixes, categorized the root word lexicon based on the paradigm approach to optimize the number of orthographic rules, and used morpho-syntax rules to obtain the correct analysis for a given word. The FSA analyzes a word suffix by suffix; FSA are a proven technology for efficient and speedy processing. There are three major components in their morphological analyzer system: the first is the finite state automaton, which is modeled using all possible suffixes (allomorphs); the next is the lexicon, categorized based on the paradigm approach; and the final component is the morpho-syntax rules for filtering the correct parse of the word.

Akshar Bharati et al. (2001) [76] developed an algorithm for unsupervised learning of morphological analysis and generation of inflectionally rich languages. The algorithm uses the frequency of occurrence of word forms in a raw corpus. They introduce the concept of an "observable paradigm" by forming equivalence classes of feature structures which are not obvious, and the frequency of word forms for each equivalence class is collected from such data for known paradigms. If the morphological analyzer cannot recognize an inflectional form, the possible stem and paradigm are guessed using the corpus frequencies. The method assumes that the morphological package makes use of paradigms, and the package is able to guess the stem-paradigm pair for an unknown word. The method depends only on the frequencies of the word forms in raw corpora and does not require any linguistic rules or tagger; the performance of the system depends on the size of the corpora.

Vasu Ranganathan (2001) [55] built a Tamil tagger by implementing the theory of lexical phonology and morphology; this system was tested with an English-Tamil machine translation system and on a number of natural language processing tasks. The tagger tool was written in Prolog and built with successive stages of knowledge-based morphological rules of the Tamil language, and it accounts for all morphological information during the process of recognition and generation. In this method, three different coding procedures were adopted to recognize written literary Tamil word forms. The output contains morphological information such as the type of the word, the root form of the word and suitable morphological tags for the affixes. The tagger is capable of recognizing and generating Tamil word forms including finite and non-finite verbs, covering aspect, modality and tense forms, as well as noun forms like participial nouns, verbal nouns and case forms. A dictionary built as part of the system contains information about the root words and their grammatical information. The tagger was tested and included in a machine translation system.

Duraipandi (2002) [77] designed morpho-phonemic rules for Tamil computing. These rules are a primary resource for spell checkers and machine-aided translation systems. His morphological generator and parsing engine for Tamil verb forms is a full-fledged engine covering verb patterns in modern Tamil.

Dhanabalan et al. (2003) [13] developed a spell checker for the Tamil language. Lexicons with morphological and syntactic information are used in this spell checker, which can be integrated with word processors. Each word is compared against a dictionary of correctly spelled words, and the tool needs syntactic and semantic knowledge to catch misspelled words. It also provides a facility to customize the spell checker's dictionary so that technical words and proper nouns can be appended. Initially, the spell checker reads a word from the document. If the word is present in the dictionary, it is interpreted as a valid word; otherwise, the word is forwarded to the error-correcting process. The tool consists of three phases: text parsing, spelling verification and generation. The spell checker uses the Tamil morphological analyzer and morphological generator: the analyzer is used for analyzing a given word, and the generator for generating the different suggestions. The morphological analyzer first tries to split off the suffix of the word; if there is a spelling mistake, the word is passed to the spelling verification and correction module. After finding the root word, the system compares it with the dictionary entries; if the root word is not present in the dictionary, the nearest root word is taken and given to the morphological generator system.

M. Ganesan (2007, 2009) [56] described the analysis and generation of Tamil corpora and developed various tools to analyze them, including a POS tagger, a morphological analyzer, a frequency counter and a KWIC (KeyWord In Context) concordance. At the word level the tagset contains 22 tags, and at the morph level it contains 82 tags. Using this POS tagger and morphological analyzer he also developed a syntactic tagger for tagging Tamil sentences at the phrase and clause levels.

K. Rajan et al. (2009) [79] developed an unsupervised approach to Tamil
morpheme segmentation. The main objective of this work is the production of a list of
morphemes for the Tamil language. Morpheme identification is done using Letter
Successor Varieties (LSV) and an n-gram based approach. The words used in this
segmentation are collected from the CIIL Tamil corpus. The segmentation algorithm is
based on the peak-and-plateau model, one of the variants of Letter Successor
Varieties. The basic idea of LSV is to count the number of different letters
encountered after a prefix of a word and to compare it with the counts before and after
that position. A successive split method is used for splitting the Tamil words. Initially
all the words are treated as stems. In the first pass these stems are split into new stems
and suffixes based on the similarity of the characters; they are split at the position
where two words differ. The right substring is stored in a suffix list and the left
substring is kept as a stem. In the second pass the same procedure is followed, and the
suffixes are stored in a separate suffix list.
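
A minimal sketch of the successor-variety computation behind this kind of
segmentation is given below; the toy word list and the simple peak test are
illustrative, and the actual peak-and-plateau criterion is more elaborate:

# Count letter successor varieties: the number of distinct letters that
# follow each prefix in a word list. A peak suggests a morpheme boundary.
WORDS = ["maram", "marangkaL", "maraththai", "vIdu", "vIdukaL"]

def successor_variety(prefix, words):
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

def boundaries(word, words):
    """Report positions where the successor variety peaks."""
    sv = [successor_variety(word[:i], words) for i in range(1, len(word))]
    return [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] >= sv[i + 1]]

print(boundaries("marangkaL", WORDS))   # [4]: variety peaks after 'mara'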

Uma Maheswar Rao (2004) [67] proposed a modular model based on a
hybrid approach that combines two basic concepts: the analysis of word forms and the
paradigm model. The morphological model underlying this description is influenced
by the need for a flexible design that permits language-specific databases to be
plugged in with minimum changes, and by simplicity of computation. The architecture
involves identifying different layers among the affixes, which enter into concatenation
to generate word forms. Unlike in the traditional item-and-arrangement model,
allomorphic variants are not given any special status if they are generable through
simple morphophonemic rules. Automatic phonological rules are also used to derive
surface forms.

Menon et al. (2009) [131] developed a Finite State Transducer (FST) based
Tamil morphological analyzer and generator, built using the AT&T Finite State
Machine toolkit. An FST maps between two sets of symbols; it is used as a transducer
that accepts an input string if it is in the language and generates another string on its
output. The system is based on a lexicon and orthographic rules from a two-level
morphological system. For the morphological generator, if a string consisting of a root
word and its morphemic information is accepted by the automaton, the corresponding
root word and morpheme units are generated in the first level.

2.3 MACHINE TRANSLATION SYSTEMS

2.3.1 Machine Translation Systems for Indian Languages

Machine translation systems have been developed in India for translation from
English to Indian languages and between Indian languages. These systems are also
used for teaching machine translation to students and researchers. Most of these
systems work in the English to Hindi domain, with the exceptions of a Hindi to
English and an English to Kannada machine translation system. English is an SVO
language, while Indian regional languages are SOV and relatively free in word order.
The translation domains are mostly government documents, health, tourism, news
reports and stories. The University of Hyderabad, under K. Narayana Murthy, has
worked on an English-Kannada MT system called "UCSG-based English-Kannada
MT", using the Universal Clause Structure Grammar (UCSG) formalism. A survey of
the machine translation systems that have been developed in India for translation from
English to Indian languages and among Indian languages reveals that the machine
translation software is either in field testing or available as a web translation service.
The Indian machine translation systems [80] are presented below; most of them
translate from English to Hindi.

R. M. K. Sinha et al. [81] proposed Anglabharti, a machine-aided translation
system specifically designed for translating English to Indian languages. Instead of
designing separate translators for English to each Indian language, Anglabharti uses a
pseudo-interlingua approach. It analyses English only once and creates an
intermediate structure called PLIL (Pseudo Lingua for Indian Languages), with most
of the disambiguation already performed. The PLIL structure is then converted to each
Indian language through a process of text generation. The effort in analysing the
English sentences and translating into PLIL is estimated to be about 70%, and text
generation accounts for the remaining 30%. Thus, with only an additional 30% effort,
a new English to Indian language translator can be built.

Anglabharti is a pattern-directed rule-based system with a context-free-grammar-like
structure for English, the source language, which generates a 'pseudo-target' (PLIL)
applicable to a group of Indian (target) languages. A set of rules obtained through
corpus analysis is used to identify plausible constituents, with respect to which
movement rules for the PLIL are constructed. The idea of using PLIL is primarily to
exploit structural similarity to obtain advantages similar to those of an interlingua
approach. It also uses an example base to identify noun and verb phrases and resolve
their ambiguities [82].

AksharBharati et al. proposed Anusaaraka, a project which was started at IIT
Kanpur and is now being continued at IIIT Hyderabad, with the explicit aim of
translation from one Indian language to another. It produces output which a reader
can understand but which is not exactly grammatical. For example, a Bengali to Hindi
Anusaaraka can take a Bengali text and produce output in Hindi which can be
understood by the user but will not be grammatically perfect. Likewise, a person
visiting a site in a language he does not know can run Anusaaraka, read the text and
understand the content. Anusaarakas have been built from Telugu, Kannada, Bengali
and Marathi to Hindi [83].

MaTra [84] is a human-assisted translation project for English to Indian
languages, currently Hindi, essentially based on a transfer approach using a frame-like
structured representation. The focus is on the innovative use of man-machine synergy:
the user can visually inspect the analysis of the system and provide disambiguation
information using an intuitive GUI, allowing the system to produce a single correct
translation. The system uses rule bases and heuristics to resolve ambiguities to the
extent possible; for example, a rule base is used to map English prepositions onto
Hindi postpositions. The system can work in a fully automatic mode and produce
rough translations for end users, but is primarily meant for translators, editors and
content providers. Currently it works for simple sentences, and work is on to extend
the coverage to complex sentences. The MaTra lexicon and approach are
general-purpose, but the system has been applied mainly in the domains of news,
annual reports and technical phrases, and has been funded by TDIL.

The Mantra (MAchiNe assisted TRAnslation tool) translates English text into
Hindi in a specified domain of personal administration, specifically gazette
notifications, office orders, office memorandums and circulars. Initially, the Mantra
system was started with the translation of administrative documents such as
appointment letters, notifications and circulars issued by the central government from
English to Hindi. The system is ready for use in its domains.

The system has a text categorization component at the front, which determines the
type of news story, such as political, terrorism, economic, etc., before operating on the
given story. Depending on the type of news, it uses an appropriate dictionary. It also
requires considerable human assistance in analysing the input. Another novel
component of the system is that, given a complex English sentence, it breaks it up into
simpler sentences, which are then analysed and used to generate Hindi. The
translation system is being used in a project on Cross Lingual Information Retrieval
(CLIR) that enables a person to query the web in Hindi for documents related to
health issues [85].

The Anubharti approach to machine-aided translation is a hybridized
example-based machine translation (EBMT) approach: a combination of
example-based and corpus-based approaches with some elementary grammatical
analysis. Example-based approaches emulate the human learning process of storing
knowledge from past experience and using it in the future. In Anubharti, the
traditional EBMT approach has been modified to reduce the requirement of a large
example base. This is done primarily by generalizing the constituents and replacing
them with abstracted forms derived from the raw examples. The abstraction is
achieved by identifying the syntactic groups. Matching of the input sentence with
abstracted examples is done based on the syntactic categories and semantic tags of the
source language structure [85].

Two machine translation systems for English to Hindi, Shiva and Shakti [86],
were developed jointly by Carnegie Mellon University, USA, the Indian Institute of
Science, Bangalore, and the International Institute of Information Technology,
Hyderabad. The Shakti machine translation system has been designed so that machine
translation systems for new languages can be produced rapidly. Shakti combines a
rule-based approach with a statistical approach, whereas Shiva is an example-based
machine translation system. The rules are linguistic in nature, and the statistical
approach tries to infer or use linguistic information. Some modules also use semantic
information. Currently Shakti works for three target languages (Hindi, Marathi and
Telugu).

An English-Telugu machine translation system was developed jointly at CALTS
with IIIT Hyderabad, Telugu University, Hyderabad and Osmania University,
Hyderabad. This system uses an English-Telugu lexicon consisting of 42,000 words.
A word-form synthesizer for Telugu was developed and incorporated into the system
[85].

An English-Kannada machine-aided translation system was developed at the
Resource Centre for Indian Language Technology Solutions, University of
Hyderabad, by Dr. K. Narayana Murthy [85]. The approach is based on the Universal
Clause Structure Grammar (UCSG) formalism. It is essentially a transfer-based
approach, has been applied to the domain of government circulars, and was funded by
the Karnataka government.

Anubaad [87] is a hybrid machine translation system for translating English news
headlines to Bengali, developed by Sivaji Bandyopadhyay at Jadavpur University,
Kolkata. The current version of the system works at the sentence level.

R. Mahesh et al. [88] proposed Hinglish, a machine translation system for pure
(standard) Hindi to pure English. It has been implemented by incorporating an
additional layer into the existing English to Hindi (AnglaBharti-II) and Hindi to
English (AnuBharti-II) translation systems developed by Sinha. The system is claimed
to produce satisfactory, acceptable results in more than 90% of the cases. Only in the
case of polysemous verbs, due to the very shallow grammatical analysis used in the
process, is the system unable to resolve their meaning.

An English-Hindi example-based machine translation system was developed by
the IBM India Research Lab at New Delhi. They have recently initiated work on
statistical machine translation between English and Indian languages, building on
IBM's existing work on statistical machine translation [89].

Gurpreet Singh Josan et al. [90] developed a Punjabi to Hindi machine translation
system at Punjabi University, Patiala. The system is based on a direct word-to-word
translation approach and consists of modules for pre-processing, word-to-word
translation using a Punjabi-Hindi lexicon, morphological analysis, word sense
disambiguation, transliteration and post-processing. The system has a reported
accuracy of 92.8%.

Sampark, a machine translation system for translation among Indian languages,
was developed by a consortium of institutions, including IIIT Hyderabad, University
of Hyderabad, CDAC (Noida and Pune), the AU-KBC Research Centre of Anna
University, Chennai, IIT Kharagpur, IIT Kanpur, IISc Bangalore, IIIT Allahabad,
Tamil University and Jadavpur University. Currently, experimental systems have been
released, namely {Punjabi, Urdu, Tamil, Marathi} to Hindi and Tamil-Hindi machine
translation systems [85].

Vishal Goyal et al. [91] developed a Hindi to Punjabi machine translation system
at Punjabi University, Patiala. This system is likewise based on a direct word-to-word
translation approach, with modules for pre-processing, word-to-word translation using
a Hindi-Punjabi lexicon, morphological analysis, word sense disambiguation,
transliteration and post-processing. The system has a reported accuracy of 95%.

2.3.2 Machine Translation Systems for Tamil

Prashanth Balajapally et al. [92] developed English to {Hindi, Kannada, Tamil}
and Kannada to Tamil language-pair example-based machine translation. It is based
on a bilingual dictionary comprising a sentence dictionary, a phrase dictionary, a word
dictionary and a phonetic dictionary. Each of these dictionaries contains parallel
corpora of sentences, phrases and words, and phonetic mappings of words, in their
respective files. The Example Based Machine Translation (EBMT) system has a set of
75,000 of the most commonly spoken sentences, originally available in English,
which have been manually translated into three target Indian languages, namely
Hindi, Kannada and Tamil.

A Tamil-Hindi machine-aided translation system was developed by Prof. C. N.
Krishnan at the AU-KBC Research Centre, MIT Campus, Anna University, Chennai.
This system is based on the Anusaaraka machine translation system; it uses
lexical-level translation and has 80-85% coverage. Stand-alone, API and web-based
on-line versions have been developed. A Tamil morphological analyser and a
Tamil-Hindi bilingual dictionary are by-products of this system. They also developed
a prototype English-Tamil MAT system, which includes exhaustive syntactic
analysis. At present it has a limited vocabulary and a small set of transfer rules.

A Telugu-Tamil machine translation system is also being developed at CALTS.
This system uses the Telugu morphological analyser and the Tamil generator
developed at CALTS. The backbone of the system is a Telugu-Tamil dictionary [85].

Ruvan Weerasinghe (2004) [93] developed an SMT system for Sinhala to Tamil.
In this work, corpora were taken from a Sri Lankan newspaper published in both
languages. He also collected corpora from a website that contains translations of
English articles into Sinhala and Tamil. These resources formed a small trilingual
parallel corpus for the research, consisting of news items and articles related to
politics and culture in Sri Lanka. The fundamental task of sentence boundary
detection was performed using a semi-automatic approach: a basic heuristic was first
applied to identify sentence boundaries, and the situations that were exceptions to the
heuristic were then identified. Sentences were aligned manually. After cleaning up the
texts and the manual alignment, a total of 4,064 Sinhala and Tamil sentences were
used for SMT. All language processing used raw words and was based on statistical
information. The CMU-Cambridge Statistical Language Modeling Toolkit (version 2)
was used to build n-gram language models.

Vasu Renganathan (2002) [94] worked on the development of an English-Tamil
web-based machine translation system. The system is rule based; it contains around
five thousand words in its lexicon and a wide range of transfer rules written in Prolog.
The system also considers frequently occurring English structures mapped to
corresponding Tamil structures. An interesting feature of this system is that it can be
updated easily by adding words to the lexicon and rules to the rule base. Two types of
lexicons are described: one based on the grammatical categories of head and target
words, and the other based on the semantic and syntactic properties of words. The
former lexicon type is used to translate technical, colloquial and news documents,
whereas the latter type is mandatory for translating complex literary texts comprising
fiction, poems, biographies, etc. The former type of lexicon is used to build this
English to Tamil translation system. The programming language Prolog is used to
code the complex rules in a robust way. He also states the feasibility of further
research in this area. The morphological transducer built as part of this system uses
this information to generate correct inflectional forms. It is constructed following the
concepts of the theory of lexical phonology, which accounts for the interrelationship
between phonological and morphological rules in terms of lexical and post-lexical
rules.

Ulrich Germann (2001) [95] reported his experience with building a statistical
MT system from scratch, including the creation of a small parallel Tamil-English
corpus. A parallel corpus of about 100,000 words on the Tamil side was created
within one month, using several translators. The paper describes the complete
experience of creating the parallel corpus and gives advice for similar future projects.
The overall organization of the project covers source data retrieval, hiring and
management of the translators, the design and implementation of the web interface for
managing the project via the Internet, and the development of a transliterator and
stemmer. In order to boost text coverage, they built a simple text stemmer for Tamil
based on the Tamil inflection tables. The stemmer uses regular expression matching to
cut off inflectional endings and introduces some extra tokens for negation and certain
case markings (such as locative and genitive), which are all marked morphologically
in Tamil. Finally, it is shown that the performance of the MT system increases when
the stemmer is used.

Fredric C. Gey (2002) [96] reported on the prospects of machine translation for
the Tamil language. He mentions the major problems in connection with machine
translation and cross-language retrieval of Tamil (and other Indian languages). The
primary issue is the lack of machine-readable resources for either machine translation
or cross-language dictionary lookup; most Tamil language research has been on the
rich classical literature. He assembled a corpus of Tamil news stories from the
Thinaboomi website. This corpus contains over 3,000 news stories in the Tamil
language and provides a rich source for modern Tamil linguistic studies and retrieval.
The corpus has been used to develop an experimental statistical machine translation
system from Tamil to English by the Information Sciences Institute
(http://www.isi.edu), one of the leading machine translation research organizations.

K. C. Chellamuthu (2002) [97] explained the role of machine translation in
information dissemination, with a brief history of MT and its strategies. The various
components and functions of an early MT system developed at Tamil University,
Tanjore, for Russian to Tamil translation are also explained. The Russian to Tamil
MT system uses an intermediate language with a syntax closer to that of the target
language. It consists of various functional components such as a preprocessor, parser,
lexical analyzer, bilingual dictionary, morphological analyzer, translator and
generation modules. Depending upon the strategy adopted in an MT system, the
functional organization may vary from system to system. Here, the primary tasks are
analyzing the input text, parsing the sentences, analyzing the words lexically and
morphologically, conceptualizing the SL sentences, table lookup using the bilingual
dictionary, and translating the input words using the linguistic knowledge already
defined in the system. The bilingual dictionary used in this MT system contains about
1,200 vocabulary entries with certain lexical markers and attributes, and is a major
database of the system. The translation strategy adopted in the Russian to Tamil MT
system is a transfer methodology involving an intermediate language: the given
Russian sentence is lexically analyzed, and its syntax is transformed to the grammar
of the intermediate language after carrying out syntactic and morphological analysis
and word-by-word translation.

The Computational Engineering and Networking research centre of Amrita
School of Engineering, Coimbatore, proposed an English-Tamil translation memory
system. The system is based on a phrase-based approach incorporating concept
labeling over a translation memory of parallel corpora. The translation system consists
of 50,000 English-Tamil parallel sentences, 5,000 proverbs, and 1,000 idioms and
phrases, with a dictionary containing more than 200,000 technical words and 100,000
general words, and has an accuracy of 70% [98].

Loganathan R (2010) [124] developed an English-Tamil machine translation
system using rule-based and corpus-based approaches. For the rule-based approach,
the structural difference between English and Tamil is considered and a
syntax-transfer-based methodology is adopted for translation.

Saravanan et al. (2010) [99] developed a rule-based machine translation system
for English to Tamil. Using the statistical machine translation approach, Google
developed a web-based machine translation engine for the English to Tamil language
pair. This system also has the facility to identify the source language automatically.

2.4 ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM

Statistical translation models have evolved from the word-based models originally
proposed by Brown et al. [100] to syntax-based and phrase-based techniques. The
beginnings of phrase-based translation can be seen in the alignment template model
introduced by Och et al. [101]. A joint probability model for phrase translation was
proposed by Marcu et al. [102]. Koehn et al. [103] propose certain heuristics to extract
phrases that are consistent with the bidirectional word alignments generated by the
IBM models [100]. Phrases extracted using these heuristics are also shown to perform
better than syntactically motivated phrases, the joint model, and IBM Model 4 [103].
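
The consistency criterion behind such phrase extraction can be sketched as follows;
this is a simplified version of the heuristic (unaligned-word extension is omitted), and
the toy sentence lengths and alignment are illustrative:

# Extract phrase pairs consistent with a word alignment: no alignment
# link may connect the target span to a word outside the source span.
def extract_phrases(n_src, n_tgt, alignment, max_len=3):
    phrases = []
    for s1 in range(n_src):
        for s2 in range(s1, min(s1 + max_len, n_src)):
            # target positions linked to the source span [s1, s2]
            tgt = [t for (s, t) in alignment if s1 <= s <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            # consistent iff every link inside [t1, t2] stays in [s1, s2]
            if all(s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2):
                phrases.append(((s1, s2), (t1, t2)))
    return phrases

# toy alignment for a 3-word source and 3-word target sentence
print(extract_phrases(3, 3, [(0, 0), (1, 2), (2, 1)]))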

Syntax-based models use parse-tree representations of the sentences in the
training data to learn, among other things, tree transformation probabilities. These
methods require a parser for the target language and, in some cases, the source
language too. Yamada et al. [104] propose a model that transforms target-language
parse trees to source-language strings by applying reordering, insertion, and
translation operations at each node of the tree. Graehl et al. and Melamed propose
methods based on tree-to-tree mappings [105] [106]. Imamura et al. present a similar
method that achieves significant improvements over a phrase-based baseline model
for Japanese-English translation [107].

Recently, various pre-processing approaches have been proposed for handling
syntax within SMT. These algorithms attempt to reconcile the word-order differences
between the source and target language sentences by reordering the source-language
data prior to the SMT training and decoding cycles. Nießen et al. [108] propose some
restructuring steps for German-English SMT. Popovic et al. [109] report the use of
simple local transformation rules for Spanish-English and Serbian-English translation.
Collins et al. [110] propose German clause restructuring to improve German-English
SMT.

The use of morphological information for SMT has been reported in [108] and
[109]. The detailed experiments described by Nießen et al. [108] show that the use of
morpho-syntactic information drastically reduces the need for bilingual training data.

Recent work by Koehn et al. [10] proposes factored translation models that
combine feature functions to handle syntactic, morphological, and other linguistic
information in a log-linear model. The following paragraphs address the various
approaches to handling idioms and phrasal verbs, one of the most important problems
in machine translation. Sahar Ahmadi et al. [111] focused on analysing the
translatability of colour idiomatic expressions in English-Persian and Persian-English
texts, to explore the translation strategies applied in translating colour idiomatic
expressions and to find cultural similarities and differences between colour idiomatic
expressions in English and Persian.

Martine Smets et al. [112] developed their machine translation system in such
a way that it handles verbal idioms. Verbal idioms constitute a challenge for machine
translation systems: their meaning is not compositional, preventing a word-for-word
translation, and they can be discontinuous, preventing a match during tokenization.

Elisabeth Breidt et al. [113] suggested describing the syntactic restrictions and
idiosyncratic peculiarities of multi-word lexemes with local grammar rules, which at
the same time permit the expression of regularities valid for a whole class of
multi-word lexemes, such as word-order variation in German.

Digital Sonata provides NLP services and products. It has released a toolkit
called the Carabao Language Kit, in which idioms serve as the backbone of the
architecture, and which is mainly rule based. Here, idioms are considered as
sequences, and each sequence is a combination of one or more lexical units.

Panagiotis (2005) [114] proposes a novel algorithm for incorporating
morphological knowledge into an English to Greek Statistical Machine Translation
(SMT) system. They suggest a method of improving the translation quality of existing
SMT systems by incorporating word stems. Initially, word stems are acquired
automatically for the source and target languages using an unsupervised
morphological acquisition algorithm. The stems are then incorporated into the SMT
system using a general statistical framework which combines a word-based and a
stem-based SMT system. The combined lexical and morphological SMT system is
implemented using late integration and lattice re-scoring. They used the Linguistica
system to perform morphological analysis for both the source and target languages.
The system was trained on parts of the Europarl corpus [115], a parallel corpus in 11
European languages extracted from the proceedings of the European Parliament, and
was then evaluated on the Europarl corpus using automatic evaluation methods for
various training corpus sizes.

Soha Sultan (2011) [116] introduces two approaches to augmenting
English-Arabic statistical machine translation (SMT) with linguistic knowledge. The
first approach improves SMT by adding linguistically motivated syntactic features to
particular phrases. These added features are based on English syntactic information,
namely part-of-speech tags and dependency parse trees. The second approach
improves morphological agreement in the machine translation output through
post-processing. This method uses the projection of the English dependency parse
tree onto the Arabic sentence, in addition to Arabic morphological analysis, in order
to extract the agreement relations between words in the Arabic sentence. Individual
morphological features are trained using syntactic and morphological information
from both the source and target languages. The predicted morphological features are
then used to generate the correct surface forms.

Adrià de Gispert Ramis (2006) [117] addresses the use of morpho-syntactic
information to improve the performance of Statistical Machine Translation (SMT)
systems, providing them with additional linguistic information beyond the surface
level of the words in parallel corpora. The author proposes a translation model
tackling verb-form generation through an additional verb instance model, reporting
experiments on English to Spanish tasks. The importance of word alignment as the
first step in training SMT systems is highlighted, and morpho-syntactic information is
included prior to word alignment. Improvements in terms of word alignment and
translation quality are also studied. A classification approach is proposed and attached
to standard SMT decoding, with results reported for the English to Spanish translation
task.

Ann Clifton (2010) [118] examines various methods of augmenting SMT
models with morphological information to improve the quality of translation into
morphologically rich languages, comparing them on an English to Finnish translation
task. Unsupervised morphological segmentation methods are integrated into the
translation model, and this segmentation-based system is combined with a Conditional
Random Field morphology prediction model. The morphology-aware models yield
significantly more fluent translation output compared to a baseline word-based model.

Sara Stymne (2009) [119] explores how compound processing can be used to
improve phrase-based statistical machine translation (PBSMT) between English and
German/Swedish. For translation into Swedish and German, the compound parts are
merged after translation. The effects of different splitting algorithms for translation
between English and German, and of different merging algorithms for German, are
also investigated. For translation between English and German, different splitting
algorithms work best for different translation directions. A novel merging algorithm
based on part-of-speech matching is designed and evaluated.

Rabih M. Zbib (2010) [120] presented methods for using linguistically
motivated information to enhance the performance of statistical machine translation
(SMT), applying linguistic knowledge at various levels to improve Arabic-English
translation. In the first part, morphological information is used to preprocess the
Arabic text for Arabic-to-English and English-to-Arabic translation. This
preprocessing reduces the gap in morphological complexity between Arabic and
English. The second method addresses the issue of long-distance reordering in
translation, to account for the difference in the syntax of the two languages. The third
part shows how additional local context information on the source side is
incorporated; this helps to reduce lexical ambiguity. Two methods are also proposed
for using binary decision trees to control the amount of context information
introduced. Finally, the system combines the outputs of an SMT system and a
rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical
approach and the rich linguistic knowledge embedded in the rule-based MT system.

Young-Suk Lee (2004) [121] presents a novel morphological analysis technique
which induces morphological and syntactic symmetry between two languages with
highly asymmetrical morphological structures, in order to improve statistical machine
translation quality. The technique was applied to Arabic-English sentence alignment.
The algorithm identifies morphemes to be merged or deleted in the morphologically
richer language to induce the desired morphological and syntactic symmetry, utilizing
two sets of translation probabilities to determine the merge/deletion analysis of a
morpheme. Additional morphological analysis induced from noun-phrase parsing of
Arabic is applied to accomplish syntactic as well as morphological symmetry between
the two languages. They used an Arabic part-of-speech tagger with around 120 tags
and an English part-of-speech tagger with around 55 tags. The technique improves
Arabic-to-English translation quality significantly.

Irimia Elena and Alexandru Ceauşu (2010) [122] present a method for
extracting translation examples using the dependency linkage of both the source and
target language sentences. They identified two types of dependency link-structures,
super-links and chains, and used these structures to set the translation example
borders. They used a Romanian-English parallel corpus containing about 600,000
translation units. The GIZA++ tool is used to build the translation models from the
linguistically analyzed parallel corpora, and unidirectional translation models are also
constructed. The performance of the dependency-based approach is measured with
the BLEU-NIST score, in comparison with a baseline system.

Sriram Venkatapathy et al. (2010) [123] propose a dependency-based statistical
system that uses discriminative techniques to train its parameters. Experiments are
conducted on English-Hindi parallel corpora. The use of syntax (the dependency tree)
allows the large word-reordering between English and Hindi to be addressed. They
grouped the function words with their corresponding content words; these groups of
words are called local word groups, and within them the function words are treated as
factors of the content words. Three types of transformation features are explored:
local features, syntactic features and contextual features. MIRA, an online
large-margin algorithm, is used for updating the weights learned in the training
algorithm.

2.5 RELATED NLP WORKS IN TAMIL

Kumaran A and Tobias Kellner (2007) [125] proposed a machine transliteration
framework based on a core algorithm modelled as a noisy channel, where the source
string gets garbled into the target string. Viterbi alignment was used to align source
and target language segments. The transliteration is learned by estimating the
parameters of the distribution that maximizes the likelihood of observing the garbling
seen in the training data, using the Expectation Maximization (EM) algorithm.
Subsequently, given a target language string t, the most probable source language
string s that gave rise to t is decoded. The method is applied for forward
transliteration from English to Hindi, Tamil, Arabic and Japanese, and backward
transliteration from Hindi, Tamil, Arabic and Japanese to English.
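
A minimal sketch of the noisy-channel decision rule used in such frameworks is
shown below, with toy probabilities standing in for the learned source and channel
models:

# Noisy-channel transliteration decoding: pick the source string s that
# maximizes P(s) * P(t | s) for an observed target string t.
import math

source_lm = {"anand": 0.6, "anant": 0.4}          # toy P(s)
channel   = {("anand", "AnAnd"): 0.7,             # toy P(t | s)
             ("anant", "AnAnd"): 0.2}

def decode(t, candidates):
    return max(candidates,
               key=lambda s: math.log(source_lm[s]) +
                             math.log(channel.get((s, t), 1e-9)))

print(decode("AnAnd", ["anand", "anant"]))        # -> 'anand'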

Afraz and Sobha (2008) [126] developed a statistical transliteration engine
using an n-gram based approach. The algorithm uses the n-gram frequencies of the
transliteration units to find the probabilities; each transliteration unit is a
consonant-vowel pattern in the word. This transliteration engine is used in their Tamil
to English CLIR system.

Srinivasan C. Janarthanam et al. (2008) [127] proposed an efficient algorithm for
the transliteration of English named entities to Tamil. In the first stage of the
transliteration process, a Compressed Word Format (CWF) algorithm is used to
compress both English and Tamil named entities from their actual forms. The
compressed word format of a word is created using an ordered set of rewrite and
remove rules: rewrite rules replace characters and clusters of characters with other
characters or clusters, while remove rules simply delete characters or clusters. The
CWF algorithm is used for both English and Tamil names, but with different rule sets,
and the final CWF forms retain only the minimal consonant skeleton. In the second
stage, Levenshtein's edit distance algorithm is modified to incorporate Tamil
characteristics like long/short vowels and ambiguities in consonants like 'n', 'r', 'l',
etc. Finally, the CWF mapping transliteration algorithm takes an input
source-language named-entity string, converts it into CWF form, and then maps it to
similar Tamil CWF words using the modified edit distance. This method produces a
ranked list of transliterated names in the target language, Tamil, for an English
source-language name.
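
A minimal sketch of an edit distance weighted for such confusions is given below;
the confusable pairs and the 0.5 substitution cost are illustrative, not the authors'
actual rule set:

# Levenshtein distance with a reduced substitution cost for letters that
# are easily confused in Tamil transliteration (e.g. long/short vowels).
CONFUSABLE = {("a", "A"), ("i", "I"), ("u", "U"), ("n", "N"), ("r", "R")}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if (a, b) in CONFUSABLE or (b, a) in CONFUSABLE else 1.0

def edit_distance(x, y):
    d = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        d[i][0] = i
    for j in range(1, len(y) + 1):
        d[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return d[len(x)][len(y)]

print(edit_distance("arun", "ArUN"))   # 1.5: three cheap substitutions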

Vijaya et al. (2010) [128] developed an English to Tamil transliteration system
using a one-class Support Vector Machine (SVM) algorithm. This is a statistics-based
transliteration system, where training, testing and evaluation were performed with a
publicly available SVM tool. The experimental results show that the SVM-based
transliteration outperformed previous methods.

Basically, a chunker divides a sentence into its major non-overlapping phrases
and attaches a label to each. Chunkers differ in terms of their precise output and the
way in which a chunk is defined; many do more than simple chunking, while others
just find NPs. Chunking falls between tagging (which is feasible but sometimes of
limited use) and full parsing (which is more useful but is difficult on unrestricted text
and may result in massive ambiguity). The structure of individual chunks is fairly easy
to describe, while relations between chunks are harder and more dependent on
individual lexical properties. Chunking is thus a compromise between the currently
available and the ideal processing output. Chunkers tokenize and tag the sentence;
most chunkers simply use the information in the tags, but others look at the actual
words.

Noun Phrase Chunking in Tamil

Noun phrase chunking deals with extracting the noun phrases from a sentence.
While NP chunking is much simpler than parsing, it is still a challenging task to build
an accurate and efficient NP chunker. The importance of NP chunking derives from
the fact that it is used in many applications. Noun phrases can be used as a
pre-processing tool before parsing the text: due to the high ambiguity of natural
language, exact parsing of the text may become very complex, and in these cases
chunking can be used as a preprocessing step to partially resolve these ambiguities.
Noun phrases can also be used in information retrieval systems, where chunking
allows data to be retrieved from documents based on chunks rather than words;
nouns and noun phrases in particular are more useful for retrieval and extraction
purposes. Most recent work on machine translation uses texts in two languages
(parallel corpora) to derive useful transfer patterns, and noun phrases also have
applications in aligning text in parallel corpora. The sentences in a parallel corpus can
be aligned by using the chunk information and by relating the chunks in the source
and target languages, which can be done a lot more easily than word alignment
between the texts of the two languages. Furthermore, chunked noun phrases can be
used in other applications where in-depth parsing of the data is not necessary [129].

AUKBCRC’s Noun Phrase Chunker for Tamil

The approach is a rule-based one. In this method a corpus is taken and divided
into two or more sets, one of which is used as the training data. The training data set
is manually chunked for noun phrases, thus evolving rules that can be applied to
separate the noun phrases in a sentence. These rules serve as the base for chunking:
the chunker program uses them to chunk the test data, and the coverage of the rules is
tested on this test data set. Precision and recall are calculated, and the results are
analyzed to check whether more rules are needed to improve the coverage of the
system. If more rules are needed, additional rules are added and the same process is
repeated to check for an increase in the precision and recall of the system. The system
is then tested on various other applications [130].

Vaanavil of RCILTS-Tamil

Vaanavil identifies the syntactic constituents of a Tamil sentence and gives the
parse tree in a list form. It tackles both simple and complex sentences: simple
sentences can have a verb, many noun phrases, simple adverbs and adjectives, while
complex sentences can have multiple adjectival, adverbial and noun clausal forms. In
the case of sentences with multiple clauses, Vaanavil syntactically groups the clauses
based on cue words and phrases. It makes use of phrase structure grammar and uses
look-ahead to handle free word order. It handles ambiguity using 15 heuristic rules
and uses the morphological analyzer to obtain the root word [129].

2.6 SUMMARY

This chapter presented the literature survey for linguistic tools and available
machine translation systems for Indian languages. The literature on linguistic tools
such as POS taggers, morphological analyzers and morphological generators was
described. Most of these tools are developed using rule-based methods, and a few are
developed using data-driven methods. In India, machine translation systems have
been developed using the direct machine translation approach for closely related
language pairs; some of these systems are very successful and still operational.
Statistical machine translation methods are frequently applied to unrelated language
pairs. Thus, it is concluded that the statistical approach is the most appropriate for
unrelated languages.

CHAPTER 3

THEORETICAL BACKGROUND
3.1 GENERAL

Natural Language Processing (NLP) research has a long tradition in European
countries. It has taken giant leaps in the last decade with the introduction of efficient
machine learning algorithms and the creation of large annotated corpora for various
languages. In a country like India, where more than a thousand languages are in use,
the importance of NLP is very relevant. However, NLP research in Indian languages
has mainly focused on the development of rule-based techniques, due to the lack of
annotated corpora. The pre-requisites for developing NLP applications in the Tamil
language are the availability of speech corpora, annotated text corpora, parallel
corpora, lexical resources and computational models. The sparseness of these
resources is one of the major reasons for the slow growth of NLP work in Tamil.
Like processing in other languages, Tamil language processing also involves
morphological analysis, syntax analysis and semantic analysis.

3.1.1 Tamil Language

Tamil belongs to the southern branch of the Dravidian languages, a family of around
twenty-six languages native to the Indian subcontinent. It flourished in India as a
language with rich literature during the Sangam period (300 BCE to 300 CE). Tamil
scholars categorize the history of the language into three periods: Old Tamil (300 BC -
700 CE), Middle Tamil (700 - 1600) and Modern Tamil (1600 - present). Epigraphic
attestation of Old Tamil begins with rock inscriptions from the 3rd century BC,
written in Tamil-Brahmi, an adapted form of the Brahmi script. The earliest extant
literary text is the தொல்காப்பியம் (tholkAppiyam), a work on grammar and poetics
which describes the language of the classical period. The Sangam literature comprises
about 50,000 lines of poetry in 2,381 poems attributed to 473 poets, including many
women poets [9].

During the Modern Tamil period, i.e., in the early 20th century, the chaste Tamil
movement called for the removal of all Sanskrit and other foreign elements from
Tamil. It received support from Dravidian parties and nationalists who supported
Tamil independence. This led to the replacement of a significant number of Sanskrit
loan words by Tamil equivalents. An important factor specific to Tamil is the
existence of two main varieties of the language, colloquial Tamil and formal Tamil
செந்தமிழ் (sewthamiz), which are sufficiently divergent that the language is classed
as diglossic. Colloquial Tamil is used for most spoken communication, while formal
Tamil is spoken in a restricted number of high contexts, such as lectures and news
bulletins, and is also used in writing. The two varieties differ in terms of their lexis,
morphology and segmental phonology.

Tamil is the official language of the Indian state of Tamilnadu and one of the 22
languages under the Eighth Schedule of the Constitution of India. It is also one of the
official languages of the Union Territories of Puducherry and the Andaman &
Nicobar Islands, as well as of Sri Lanka and Singapore, and is widely spoken in
Malaysia. Tamil became the first legally recognized classical language of India in the
year 2004 [9].

3.1.2 Tamil Grammar

Traditional Tamil grammar consists of five parts, namely எழுத்து (ezuththu), சொல்
(sol), பொருள் (poruL), யாப்பு (yAppu) and அணி (aNi). Of these, the last two are
applicable mostly to poetry. Table 3.1 gives additional information about these parts.
The தொல்காப்பியம் (tholkAppiyam) is the oldest work on the grammar of the Tamil
language [132].


Table 3.1 Tamil Grammar

Division             Meaning   Main grammar books
எழுத்து (ezuththu)    Letter    தொல்காப்பியம் (tholkAppiyam), நன்னூல் (wannUl)
சொல் (sol)           Word      தொல்காப்பியம் (tholkAppiyam), நன்னூல் (wannUl)
பொருள் (poruL)       Meaning   தொல்காப்பியம் (tholkAppiyam)
யாப்பு (yAppu)        Form      யாப்பெருங்கலாக்காரிகை (yApperungkalAkkArikai)
அணி (aNi)            Method    தனியலங்காரம் (thaniyalangkAram)

3.1.3 Tamil Characters

Tamil is written using a script called the vattEzuththu. The Tamil script has twelve
vowels, uyirezuththu (உயிரெழுத்து) "soul letters", eighteen consonants, meyyezuththu
(மெய்யெழுத்து) "body letters", and one character, the Aytha ezuththu (ஆய்த எழுத்து, ஃ),
which is classified in Tamil grammar as being neither a consonant nor a vowel,
though it is often considered part of the vowel set. The script, however, is syllabic and
not alphabetic.

The complete script, therefore, consists of the thirty-one letters in their independent
form and an additional 216 compound letters, representing a total of 247
combinations. These compound letters are formed by adding a vowel marker to the
consonant. The details of the Tamil vowels are given in Table 3.2. Some vowels
require the basic shape of the consonant to be altered in a way that is specific to that
vowel. Others are written by adding a vowel-specific suffix to the consonant, yet
others a prefix, and some vowels require adding both a prefix and a suffix to the
consonant. Table 3.3 lists the vowel letters across the top and the consonant letters
along the side; their combinations give all the Tamil compound (uyirmei) letters.

Table 3.2 Tamil Vowels

Short vowel   Long vowel   Diphthong
அ             ஆ
இ             ஈ            ஐ
உ             ஊ
எ             ஏ            ஔ
ஒ             ஓ

In every case, the vowel marker is different from the standalone character for the
vowel. The Tamil script is written from left to right. Vowels are also called the 'life'
(uyir) or 'soul' letters; they are divided into short (kuril) and long (nedil) vowels, five
of each type, and two diphthongs. Tamil compound (uyirmei) letters are formed by
adding a vowel marker to the consonant, giving 216 compound letters in all. The
Tamil transliteration scheme is given in Appendix A.

Table 3.3 Tamil Compound Letters

Vowel →    அ (a)  ஆ (A)  இ (i)  ஈ (I)  உ (u)  ஊ (U)  எ (e)  ஏ (E)  ஐ (ai)  ஒ (o)  ஓ (O)  ஔ (au)
↓ Consonant
க் (k)     க கா கி கீ கு கூ கெ கே கை கொ கோ கௌ
ங் (ng)    ங ஙா ஙி ஙீ ஙு ஙூ ஙெ ஙே ஙை ஙொ ஙோ ஙௌ
ச் (s)     ச சா சி சீ சு சூ செ சே சை சொ சோ சௌ
ஞ் (nj)    ஞ ஞா ஞி ஞீ ஞு ஞூ ஞெ ஞே ஞை ஞொ ஞோ ஞௌ
ட் (d)     ட டா டி டீ டு டூ டெ டே டை டொ டோ டௌ
ண் (N)     ண ணா ணி ணீ ணு ணூ ணெ ணே ணை ணொ ணோ ணௌ
த் (th)    த தா தி தீ து தூ தெ தே தை தொ தோ தௌ
ந் (w)     ந நா நி நீ நு நூ நெ நே நை நொ நோ நௌ
ப் (p)     ப பா பி பீ பு பூ பெ பே பை பொ போ பௌ
ம் (m)     ம மா மி மீ மு மூ மெ மே மை மொ மோ மௌ
ய் (y)     ய யா யி யீ யு யூ யெ யே யை யொ யோ யௌ
ர் (r)     ர ரா ரி ரீ ரு ரூ ரெ ரே ரை ரொ ரோ ரௌ
ல் (l)     ல லா லி லீ லு லூ லெ லே லை லொ லோ லௌ
வ் (v)     வ வா வி வீ வு வூ வெ வே வை வொ வோ வௌ
ழ் (z)     ழ ழா ழி ழீ ழு ழூ ழெ ழே ழை ழொ ழோ ழௌ
ள் (L)     ள ளா ளி ளீ ளு ளூ ளெ ளே ளை ளொ ளோ ளௌ
ற் (R)     ற றா றி றீ று றூ றெ றே றை றொ றோ றௌ
ன் (n)     ன னா னி னீ னு னூ னெ னே னை னொ னோ னௌ
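
In Unicode, a compound letter is encoded as the base consonant letter (which carries
the inherent 'a') followed by a combining vowel sign, while the bare consonant
carries the pulli (virama). A minimal sketch of forming uyirmei letters
programmatically (a toy illustration, not a component of the thesis system):

# Form Tamil compound (uyirmei) letters in Unicode: strip the pulli
# (virama, U+0BCD) from the bare consonant, then append the vowel sign.
PULLI = "\u0BCD"
VOWEL_SIGN = {"அ": "", "ஆ": "ா", "இ": "ி", "ஈ": "ீ", "உ": "ு", "ஊ": "ூ",
              "எ": "ெ", "ஏ": "ே", "ஐ": "ை", "ஒ": "ொ", "ஓ": "ோ", "ஔ": "ௌ"}

def uyirmei(consonant, vowel):
    """Combine a bare consonant (e.g. 'க்') with a vowel (e.g. 'ஔ')."""
    return consonant.rstrip(PULLI) + VOWEL_SIGN[vowel]

print(uyirmei("க்", "ஔ"))   # கௌ
print(uyirmei("ம்", "இ"))   # மி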

3.1.4 Morphological Richness of Tamil Language

Tamil is an agglutinative language: Tamil words consist of a lexical root to which one
or more affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes can be
derivational suffixes, which change either the part of speech of the word or its
meaning, or inflectional suffixes, which mark categories such as person, number,
mood and tense. There is no absolute limit on the length and extent of agglutination,
which can lead to long words with a large number of suffixes that would require
several words or a whole sentence in English.

Tamil is a morphologically rich language in which most of the morphemes
combine with root words in the form of suffixes. Suffixes are used to perform the
functions of cases, plural markers, euphonic increments and postpositions in the noun
class. Tamil verbs are inflected for tense, person, number, gender, mood and voice.
Other features of the Tamil language are the use of the plural for honorific nouns,
frequent echo words, and the null-subject property, i.e., not all sentences have an
explicit subject. Computationally, each root word can take more than ten thousand
inflected word forms, out of which only a few hundred will exist in a typical corpus
[129]. Tamil is a consistently head-final language: the verb comes at the end of the
clause, with a typical word order of Subject-Object-Verb (SOV). However, Tamil
allows the word order to be changed, making it a relatively free word-order language.
Subject-verb agreement is required for the grammaticality of a Tamil sentence.

3.1.5 Challenges in Tamil NLP

There are many issues that make Tamil language processing tasks difficult. These
relate to the problems of representation and interpretation. Language computing
requires a precise representation of context; since natural languages are highly
ambiguous and vague, achieving such representations is very hard. The various
sources of ambiguity in the Tamil language are described below.

3.1.5.1 Ambiguity in Morphemes

Tamil morphemes are ambiguous in their grammatical category and in the position
they take in a word construction.

Ambiguity in morpheme’s grammatical category

A morpheme can have more than one grammatical category. For example, the
morphemes athu, ana and thu can occur either as nominalizing suffixes or as 3rd
person neuter suffixes.

Ambiguity in morpheme’s position

The position at which a morpheme is suffixed also leads to ambiguity. Table 3.4
gives a few examples of morphemes and their possible grammatical features.

Table 3.4 Ambiguity in Morpheme's Position

Morpheme     Possible Grammatical Features
அ (a)        Infinitive / Relative Participle
கல் (kal)    Root / Nominal Suffix
ஆக (Aka)     Benefactive / Adverbial Suffix
த் (th)      Sandhi / Tense
செய் (sey)   Root / Auxiliary Root

3.1.5.2 Ambiguity in Word Class

A word may be ambiguous in its part of speech or word class, i.e., it may have more
than one interpretation. For example, the word படி (padi) can belong to the noun class
or the verb class. Such ambiguity has to be disambiguated with reference to the
context.

padi - study (V) or step (N)

கீழே படி உள்ளது, கவனமாக செல்லவும். - step (N)

தினமும் பாடங்களை படி என ஆசிரியை மாணவர்களிடம் கூறினார். - study (V)

3.1.5.3 Ambiguity in Word Sense

Even though a word belongs to a specific grammatical category, it may be ambiguous
in its sense. For instance, the Tamil word காட்டு (kAddu) has 11 senses in the noun
class and 18 senses in the verb class [kiriyAvin tharkAla Tamil akarAthi, 2006] [133].
For example, the following sentence has two different meanings.

அவன் பாடல் கேட்டான்.

(He heard the song.)

(He asked for the song.)

3.1.5.4 Ambiguity in Sentence

A sentence may be ambiguous even if its words are not. For example, the following
sentence has two interpretations.

"நான் ஒரு அழகான பெண்ணையும் ஆணையும் பார்த்தேன்"

(I saw the beautiful woman and a man.)

(I saw the beautiful woman and the beautiful man.)

The words are not ambiguous, but the sentence is.

3.2 MORPHOLOGY

Morphology is the field within linguistics that studies the internal structure of words.
While words are generally accepted as being the smallest units of syntax, it is clear
that in most (if not all) languages words can be related to other words by rules.
Morphology studies patterns of word formation within and across languages, and
attempts to formulate rules that model the knowledge of the speakers of those
languages.

3.2.1 Types of Morphology

Morphology is traditionally classified into three main divisions: inflection, derivation
and compounding. Inflectional morphology deals with the formation of the different
forms in the paradigm of a lexeme. In inflectional morphology, words undergo a
change in form to express some grammatical function, but their syntactic category
remains unchanged. Many inflectional features appear on words for agreement
purposes (agreement in person, number and gender) as well as to express case, aspect,
mood and tense.

Derivational morphology is concerned with the creation of a new lexeme
via affixation. In English, the process of word formation through derivation involves
two types of affixation: prefixation, which means placing a morpheme before a word,
e.g. un-happy; and suffixation, which means placing a morpheme after a word, e.g.
happi-ness. Derivation poses a problem for translation in that not all derived words
have a straightforward compositional translation as derived words. In English, for
example, the same meaning can be expressed by different affixes; moreover, the same
affix can have more than one meaning. This can be exemplified by the suffix -er,
which can express the agent, as in player and singer, but can also describe
instruments, as in mixer and cooker. In this way an affix can have a range of
equivalents in the target language, and any attempt to set up one-to-one
correspondences for affixes will be greatly misguided.

Compounding is the process of forming a new word by combining two or
more words. It is a process of word formation that involves combining complete word
forms into a single compound form: dog catcher is a compound, because both dog
and catcher are complete word forms in their own right before the compounding
process is applied, and they are subsequently treated as one form. An important notion
in compounding is that of the head: a compound noun is divided into a head and one
or more modifiers. For instance, in the compound noun watchtower, tower is the head
and watch is the modifier.

3.2.2 Lexemes

A lexical database is organized around lexemes, which include all the morphemes of a
language. A lexeme is conventionally listed in a dictionary as a separate entry.
Generally, a lexeme corresponds to the set of forms taken by a single word. For
example, in English, run, runs, ran and running are forms of the same lexeme "run".

3.2.3 Lemma and Stems

A lemma in morphology is the canonical form of a lexeme. In lexicography, this unit
is usually the citation form or headword by which it is indexed. Lemmas have special
significance in highly inflected languages such as Tamil. The process of determining
the lemma for a given word is called lemmatization.

A stem is the part of a word that never changes even when the word is
morphologically inflected, whilst a lemma is the base form of the word. For example,
for the word "produced", the lemma is "produce", but the stem is "produc-", because
there are related words such as production. In linguistic analysis, the stem is defined
more generally as the analyzed base form from which all inflected forms can be
formed. When phonology is taken into account, the definition of the stem as the
unchangeable part of the word is not useful, as can be seen in the phonological forms
of the words in the preceding example: "produced" vs. "production".

3.2.4 Inflections and Word Forms

Given the notion of a lexeme, it is possible to distinguish two kinds of morphological
rules: some relate different forms of the same lexeme, while others relate two different
lexemes. Rules of the first kind are called inflectional rules, while those of the second
kind are called word-formation rules. The English plural, as illustrated by dog and
dogs, is an inflectional rule; compounds like dog-catcher or dishwasher provide
examples of word-formation rules. Informally, word-formation rules form "new
words" (that is, new lexemes), while inflection rules yield variant forms of the "same"
word (lexeme).

Derivation involves affixing bound (non-independent) forms to existing lexemes,
whereby the addition of the affix derives a new lexeme. One example of derivation is
clear in this case: the word independent is derived from the word dependent by
prefixing it with the derivational prefix in-, while dependent itself is derived from the
verb depend.

3.2.5 Morphemes and Types

A morpheme is the minimal meaningful unit in a word. The concepts of word and
morpheme are different: a morpheme may or may not stand alone, and one or several
morphemes compose a word.

• Free morphemes, like town and dog, can appear with other lexemes (as in town
hall or dog house) or they can stand alone, i.e. "free".

• Bound morphemes like "un-" appear only together with other morphemes to
form a lexeme. Bound morphemes in general tend to be prefixes and suffixes.

• Derivational morphemes can be added to a word to create (derive) another
word: the addition of "-ness" to "happy," for example, to give "happiness." They
carry semantic information.

• Inflectional morphemes modify a word's tense, number, aspect, and so on,


without deriving a new word or a word in a new grammatical category (as in the
"dog" morpheme if written with the plural marker morpheme "-s" becomes
"dogs"). They carry grammatical information.

Agglutinative languages have words containing several morphemes that are always
clearly differentiable from one another in that each morpheme represents only one
grammatical meaning and the boundaries between those morphemes are easily
demarcated. The bound morphemes are affixes, and they may be individually
identified. Agglutinative languages tend to have a high number of morphemes per
word, and their morphology is highly regular [134].

3.2.6 Allomorphs

One of the largest sources of complexity in morphology is that a one-to-one correspondence
between meaning and form scarcely applies to every case in the language.
English has word-form pairs like ship/ships, ox/oxen, goose/geese, and sheep/sheep,
where the difference between the singular and the plural is signaled in a way that
departs from the regular pattern, or is not signaled at all. Even cases considered
"regular", with the final -s, are not so simple; the -s in dogs is not pronounced the same
way as the -s in cats, and in a plural like dishes, an "extra" vowel appears before the -s.
These alternative forms that realize the same distinction are called allomorphs.

3.2.7 Morpho-Phonemics

Morpho-phonology or Morpho-phonemics studies the phonemic changes when a


morpheme is inflected with another. This phenomenon is called ‘sandhi’ in Tamil.
Sandhi occurs very frequently in Tamil and must be handled when building
morphological analyzers or generators. For instance, the noun root ‘pU’ (flower),
when pluralized, becomes ‘pUkkaL’ instead of ‘pUkaL’: when the root is
monosyllabic, ending with a long vowel, and the following morpheme starts with a vallinam
consonant, the consonant geminates. Sandhi changes can occur between two
morphemes or words. Although sandhi rules are mostly dependent on phonemic
properties of the morphemes, they sometimes depend on the grammatical relations of
the words on which they operate. Sometimes gemination may be invalid when the
words are in subject-predicate relation, but valid if they are in modifier-modified
relation. Sandhi changes can occur in four different ways: Gemination, Insertion,
Deletion and Modification. Gemination is a case of insertion where the vallinam
consonants double themselves. In general, the insertion happens when new characters
are inserted between words or morphemes. Deletion happens when existing characters
at the end of the first word or the start of the second word are dropped. Modification
happens when characters get replaced by some other characters with close phonological
properties.
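The gemination case above can be captured by a small rewrite rule. Below is a minimal sketch in Python; the romanization scheme, the vallinam set and the simplified triggering condition are illustrative assumptions rather than the thesis's actual rule set.

# Toy sandhi (gemination) rule: pU + kaL -> pUkkaL.
VALLINAM = set("kcTtp")            # hard (vallinam) consonants, romanized

def join_with_sandhi(root, suffix):
    """Join root and suffix, doubling a vallinam-initial suffix consonant
    after a short root that ends in a long vowel (written here as a capital)."""
    if root.endswith(("A", "I", "U", "E", "O")) and suffix[0] in VALLINAM:
        return root + suffix[0] + suffix   # gemination (a case of insertion)
    return root + suffix

print(join_with_sandhi("pU", "kaL"))       # pUkkaL ("flowers")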

3.2.8 Morphotactics

The morphemes of a word cannot occur in random order. In every language, there are
well-defined ways to sequence the morphemes. The morphemes can be divided into a
number of classes and the morpheme sequences are normally defined in terms of the
sequence of classes. For instance, in Tamil, the case morphemes follow the number
morpheme in noun constructions: a form such as க்கைள (root + கள் + ஐ, the plural
morpheme followed by the accusative case morpheme) is valid, whereas the other way
around, ஐக்கள் (root + ஐ + கள்), is invalid. The order in which
morphemes follow each other is strictly governed by a set of rules called morphotactics.
In Tamil, these rules play a very important role in word construction and derivation as
the language is agglutinative and words are formed by a long sequence of morphemes.
Rules of morphotactics also serve to disambiguate the morphemes that occur in more
than one class of morphemes. The analyzer uses these rules to identify the structure of
words.
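Such ordering constraints amount to a simple automaton over morpheme classes. The following toy Python check uses a three-class inventory (root, plural, case), which is a deliberate simplification of real Tamil morphotactics:

ORDER = ["root", "plural", "case"]          # allowed class sequence

def valid_sequence(classes):
    """Accept a morpheme-class sequence only if its classes appear in
    ORDER's relative order, with no repeats (classes may be skipped;
    classes outside the toy inventory are ignored)."""
    positions = [ORDER.index(c) for c in classes if c in ORDER]
    return positions == sorted(set(positions))

print(valid_sequence(["root", "plural", "case"]))   # True  (root + kaL + ai)
print(valid_sequence(["root", "case", "plural"]))   # False (root + ai + kaL)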

3.3 MACHINE LEARNING FOR NLP

3.3.1 Machine Learning

Machine learning deals with techniques that allow computers to automatically learn and
make accurate predictions based on past observations. The major focus of machine
learning is to extract information from data automatically, by using computational and
statistical methods. Machine learning techniques are being used for solving various
tasks of Natural Language Processing, including speech recognition, document
categorization, document segmentation, part-of-speech tagging, word-sense
disambiguation, named entity recognition, parsing, machine translation and
transliteration.

There are two main tasks involved in machine learning: learning/training and
prediction. The system is given a set of examples called training data. The primary
goal is to automatically acquire an effective and accurate model from the training data.
The training data provides the domain knowledge i.e., characteristics of the domain
from which the examples are drawn. This is a typical task for inductive learning and is
usually called concept learning or learning from examples. The larger the amount of
training data, usually the better the model will be. The second phase of machine
learning is the prediction, wherein a set of inputs is mapped into the corresponding
target values. The main challenge of machine learning is to create a model, with good
prediction performance on the test data i.e., model with good generalization on
unknown data.

Machine learning algorithms are categorized based on the desired outcome of


the algorithm. Types of machine learning algorithms include Supervised learning,
Unsupervised learning, Semi-supervised learning, Reinforcement learning and
Transduction [135]. In supervised learning the target function is completely specified
by the training data. There is a label associated with each example. If the label is
discrete, then the task is called classification. Otherwise, for real valued labels, the task
becomes a regression problem. Based on the examples in the training data, the label for
new case is predicted. Hence, learning is not only a question of remembering but also
of generalization to unseen cases. Any change in the learning system can be seen as

acquiring some kind of knowledge. So, depending on what the system learns, the
learning is categorized as

• Model Learning: The system predicts the values of an unknown function. This is called
prediction and is a task well known in statistics. If the function is discrete, the
task is called classification. For continuous-valued functions it is called regression.

• Concept learning: The systems acquire descriptions of concepts or classes of


objects.

• Explanation-based learning: Using traces (explanations) of correct (or incorrect)


performances, the system learns rules for more efficient performance of unseen
tasks.

• Case-based (exemplar-based) learning: The system memorizes cases (exemplars)


of correctly classified data or correct performances and learns how to use them (e.g.
by making analogies) to process unseen data.

3.3.2 Support Vector Machines

The Support Vector Machine (SVM) represents a new approach to supervised pattern
classification which has been successfully applied to a wide range of pattern
recognition problems. As a supervised machine learning technology, the SVM is attractive
because it rests on an extremely well developed learning theory, namely statistical learning theory.
SVM is based on strong mathematical foundations and results in simple yet very
powerful algorithms. A simple way to build a binary classifier is to construct a
hyperplane separating class members from non-members in the input space.
Unfortunately, most real world problems involve non-separable data for which there
does not exist a hyperplane that successfully separates the class members from non-
class members in the training set. One solution to the inseparability is to map the data
into a higher dimensional space and define a separating hyperplane in that space. This
higher dimensional space is called the feature space, as opposed to the input space
occupied by training examples. With an appropriately chosen feature space of sufficient
dimensionality, any consistent training set can be made separable.

However, translating the training set into a higher dimensional space incurs both
computational and learning-theoretic costs. Representing the feature vectors

corresponding to the training set can be extremely expensive in terms of memory and
time. Furthermore, artificially separating the data in this way exposes the learning
system to the risk of finding trivial solutions that overfit the data.

Support Vector Machines elegantly sidestep both difficulties [136]. Support


vector machines avoid overfitting by choosing a specific hyperplane among the many
that can separate the data in the feature space. SVMs find the maximum margin
hyperplane, the hyperplane that maximises the minimum distance from the hyperplane
to the closest training point. The maximum margin hyperplane can be represented as a
linear combination of training points. Consequently, the decision function for
classifying points with respect to the hyperplane only involves dot products between
points. Furthermore, the algorithm that finds a separating hyperplane in the feature
space can be stated entirely in terms of vectors in the input space and dot products in
the feature space. Thus, a support vector machine can locate a separating hyperplane in
the feature space and classify points in that space without ever representing the space
explicitly, simply by defining a function, called a kernel function that plays the role of
the dot product in the feature space. This technique avoids the computational burden of
explicitly representing the feature vectors.

Another appealing feature of SVM classification is the sparseness of its


representation of the decision boundary. The location of the separating hyperplane in
the feature space is specified via real-valued weights on the training set examples.
Those training examples that lie far away from the hyperplane do not participate in its
specification and therefore receive weights of zero. Only the training examples that lie
close to the decision boundary between the two classes receive nonzero weights. These
training examples are called the support vectors, since removing them would change
the location of the separating hyperplane. It is believed that all the information about
classification in the training samples can be represented by these Support vectors. In a
typical case, the number of support vectors is quite small compared to the total number
of training samples.

The maximum margin allows the SVM to select among multiple candidate
hyperplanes. However, for many data sets, the SVM may not be able to find any
separating hyperplane at all, either because the kernel function is inappropriate for the
training data or because the data contains mislabeled examples. The latter problem can

be addressed by using a soft margin that accepts some misclassifications of the training
examples. A soft margin can be obtained in two different ways. The first is to add a
constant factor to the kernel function output whenever the given input vectors are
identical. The second is to define a priori an upper bound on the size of the training set
weights. In either case, the magnitude of the constant factor is to be added to the kernel
or to tie the size of the weights which controls the number of training points that the
system misclassifies. The setting of this parameter depends on the specific data at hand.
Completely specifying a support vector machine therefore requires specifying two
parameters: the kernel function and the magnitude of the penalty for violating the soft
margin.

Thus, a support vector machine finds a nonlinear decision function in the input
space by mapping the data into a higher dimensional feature and separating it there by
means of a maximum margin hyperplane. The computational complexity of the
classification operation does not depend on the dimensionality of the feature space,
which can even be infinite. Overfitting is avoided by controlling the margin. The
separating hyperplane is represented sparsely as a linear combination of points. The
system automatically identifies a subset of informative points and uses them to
represent the solution. Finally, the training algorithm solves a simple convex
optimization problem. All these features make SVMs an attractive classification
system.

3.3.3 Geometrical Interpretation of SVM

Typically, the machine is presented with a set of training examples, (xi,yi) where the xi
are the real world data instances and the yi are the labels indicating which class the
instance belongs to. For the two class pattern recognition problem, yi = +1 or yi = -1. A
training example (xi,yi) is called positive if yi = +1 and negative otherwise. SVMs
construct a hyperplane that separates two classes (this can be extended to multi-class
problems). While doing so, the SVM algorithm tries to achieve maximum separation
between the classes.

Separating the classes with a large margin minimizes a bound on the expected
generalization error [137]. A ‘minimum generalization error’, means that when new
examples (data points with unknown class values) arrive for classification, the chance

of making an error in the prediction (of the class which it belongs) based on the learned
classifier (hyperplane) should be minimum. Intuitively, such a classifier is one which
achieves maximum separation-margin between the classes. Figure 3.1 illustrates the
concept of ‘maximum margin’. The two planes parallel to the classifier and which
pass through one or more points in the data set are called ‘bounding planes’. The
distance between these bounding planes is called the ‘margin’ and SVM ‘learning’,
means, finding a hyperplane which maximizes this margin. The points (in the dataset)
falling on the bounding planes are called ‘support vectors’. These points play a crucial
role in the theory and hence the name support vector machines. ‘Machine’ means
algorithm. Vapnik (1998) has shown that if the training vectors are separated without
errors by an optimal hyperplane, the expected error rate on a test sample is bounded by
the ratio of the expected number of support vectors to the number of training vectors.
Since this ratio is independent of the dimension of the problem, if one can find a small
set of support vectors, good generalization is guaranteed [136].

Figure 3.1 Maximum Margin and Support Vectors

In the case where the data points are as shown in Figure 3.2, one may simply minimize
the number of misclassifications whilst maximizing the margin with respect to the
correctly classified examples. In such a case it is said that the SVM training algorithm
allows a training error. There may be another situation wherein the points are clustered
such that the two classes are not linearly separable as shown in Figure 3.3, that is, if one
tries for a linear classifier, it may have to tolerate a large training error. In such cases,
one prefers non-linear mapping of data into some higher dimensional space called
‘feature space’, F, where it is linearly separable. In order to distinguish between these
two spaces, the original space of data points is called ‘input space’. The hyperplane in
‘feature space’ corresponds to a highly non-linear separating surface in the original
input space. Hence the classifier is called a non-linear classifier.

Figure 3.2 Training Errors in Support Vector Machine

Figure 3.3 Non-linear Classifier

The process of mapping the data into a higher dimensional space involves heavy
computation, especially when the data itself is of high dimension.
However, there is no need to do any explicit mapping to the higher dimensional space for
finding the hyperplane classifier; all computations can be done in the input space itself
[138].

3.3.4 SVM Formulation

Notation used:

$m$ = number of data points in the training set

$n$ = number of features (variables) in the data

$x_i = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{in} \end{bmatrix}^T$, an $n$-dimensional vector which represents a data point in the “input space”

$d_i = D_{ii}$ = target value of the $i$-th data point; it takes the value $+1$ or $-1$

$d = \begin{bmatrix} d_1 & d_2 & \cdots & d_m \end{bmatrix}^T$, the vector representing the target values of the $m$ data points

$D = \mathrm{diag}(d)$, the $m \times m$ diagonal matrix with $d_1, d_2, \ldots, d_m$ on its diagonal

$w = \begin{bmatrix} w_1 & w_2 & \cdots & w_n \end{bmatrix}^T$, the weight vector orthogonal to the hyperplane $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n - \gamma = 0$, where $\gamma$ is a scalar which is generally known as the bias term

$A = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_m^T \end{bmatrix}$, the $m \times n$ data matrix; $AA^T$, whose $(i,j)$-th element is $x_i^T x_j$, is called the linear kernel of the dataset

$\phi(\cdot)$ : a nonlinear mapping function that maps an input vector $x$ into a high dimensional feature vector $\phi(x)$

$K$ = the $m \times m$ matrix whose $(i,j)$-th element is $\phi(x_i)^T \phi(x_j)$; $K$ is called the non-linear kernel of the input dataset

$Q$ = the $m \times m$ matrix whose $(i,j)$-th element is $d_i d_j \phi(x_i)^T \phi(x_j)$, i.e. $Q = K \, .\!\!* \, (d\,d^T)$, where $.*$ represents element-wise multiplication

From the geometric point of view, the support vector machine constructs an optimal hyperplane given by $w^T x - \gamma = 0$ between two classes of examples. The free parameters are the vector of weights $w$, which is orthogonal to the hyperplane, and the threshold value $\gamma$. The aim is to find maximally separated bounding planes

$$w^T x - \gamma = 1 \qquad \text{and} \qquad w^T x - \gamma = -1$$

such that data points with $d = -1$ satisfy the constraint $w^T x - \gamma \le -1$ and data points with $d = +1$ satisfy $w^T x - \gamma \ge 1$.

The perpendicular distance of the bounding plane $w^T x - \gamma = 1$ from the origin is $|{-\gamma + 1}| / \|w\|$, and the perpendicular distance of the bounding plane $w^T x - \gamma = -1$ from the origin is $|{-\gamma - 1}| / \|w\|$. The margin between the optimal hyperplane and a bounding plane is $1/\|w\|$, and so the distance between the bounding hyperplanes is $2/\|w\|$.

The learning problem is then formulated as the optimization problem

$$\min_{w,\,\gamma} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad D_{ii}(w^T x_i - \gamma) \ge 1, \quad i = 1, \ldots, m.$$

The ‘training of the SVM’ consists of finding $w$ and $\gamma$, given the matrix of data points $A$ and the corresponding class vector $d$. Once $w$ and $\gamma$ are obtained, the decision boundary is $w^T x - \gamma = 0$ and the decision function is $f(x) = \mathrm{sign}(w^T x - \gamma)$; that is, for a new point $x$ the sign of $w^T x - \gamma$ is assigned as the class value. The problem is easily solved in terms of its Lagrangian dual variables.
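To make the two parameters that must be specified (the kernel function and the soft-margin penalty) concrete, here is a minimal sketch using scikit-learn's SVC; the library, the toy data and the parameter values are assumptions for illustration, not part of the formulation above.

# Minimal soft-margin kernel SVM sketch (assumes scikit-learn is installed).
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1]]   # toy points in XOR layout,
d = [-1, -1, 1, 1]                     # not linearly separable in input space

clf = SVC(kernel="rbf", C=10.0)        # kernel + soft-margin penalty C
clf.fit(X, d)

print(clf.support_vectors_)                 # the points that define the hyperplane
print(clf.decision_function([[0.9, 0.2]]))  # sign plays the role of sign(w^T x - gamma)
print(clf.predict([[0.9, 0.2]]))            # predicted class (+1 / -1)

Note that the classifier is trained and applied without the feature space ever being represented explicitly, exactly as described above.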

3.4 VARIOUS APPROACHES FOR POS TAGGING

There are different approaches to POS tagging. Figure 3.4 shows the different
POS tagging models. Most tagging algorithms fall into one of two classes:
rule-based taggers or stochastic taggers.

3.4.1 Supervised POS Tagging

The supervised POS tagging models require pre-tagged corpora which are used for
training to learn rule sets, information about the tagset, word-tag frequencies etc. The
learning tool generates trained models along with the statistical information. The
performance of the models generally increases with increase in the size of pre-tagged
corpus. 

[Figure: taxonomy of POS tagging models. Supervised and unsupervised approaches each divide into rule based (e.g. Brill), stochastic (n-gram based, maximum likelihood, hidden Markov model with the Viterbi algorithm, Baum-Welch algorithm) and neural methods.]

Figure 3.4 Classification of POS Tagging Models

3.4.2 Unsupervised POS Tagging

Unlike the supervised models, the unsupervised POS tagging models do not require a
pre-tagged corpus. Instead, they use advanced computational methods like the Baum-
Welch algorithm to automatically induce tagsets, transformation rules etc. Based on the
information, they either calculate the probabilistic information needed by the stochastic
taggers or induce the contextual rules needed by rule-based systems or transformation
based systems.

3.4.3 Rule based POS Tagging

The rule based POS tagging models apply a set of hand written rules and use contextual
information to assign POS tags to words in a sentence. These rules are often known as
context frame rules. For example, a context frame rule might say something like:

“If an ambiguous/unknown word X is preceded by a Determiner and followed by a


Noun, tag it as an Adjective.”

On the other hand, the transformation based approaches use a pre-defined set of
handcrafted rules as well as automatically induced rules that are generated during
training. Some models also use information about capitalization and punctuation, the
usefulness of which are largely dependent on the language being tagged. The earliest
algorithms for automatically assigning Part-of-Speech were based on two-stage
architecture. The first stage used a dictionary to assign each word a list of potential
parts of speech. The second stage used large lists of hand-written disambiguation rules
to bring down this list to a single Part-of-Speech for each word [139].

The ENGTWOL [140] tagger is based on the same two-stage architecture,


although both the lexicon and the disambiguation rules are much more sophisticated
than the early algorithms. The ENGTWOL lexicon is based on the two-level
morphology. It has about 56,000 entries for English word stems, counting a word with
multiple parts of speech (e.g. nominal and verbal senses of hit) as separate entries, and
of course not counting inflected and many derived forms. Each entry is annotated with
a set of morphological and syntactic features. In the first stage of the tagger, each word
is run through the two-level lexicon transducer and the entries for all possible parts of
speech are returned.

3.4.4 Stochastic POS Tagging

A stochastic approach makes use of frequency, probability or statistics. The simplest
stochastic approach finds the most frequently used tag for a specific word in the
annotated training data and uses this information to tag that word in the unannotated
text. The problem with this approach is that it can come up with sequences of tags for
sentences that are not acceptable according to the grammar rules of a language.

An alternative to the word frequency approach is known as the n-gram approach


that calculates the probability of a given sequence of tags. It determines the best tag for
a word by calculating the probability that it occurs with the n previous tags, where the
value of n is set to 1, 2 or 3 for practical purposes. These are known as the unigram,
bigram and trigram models. The most common algorithm for implementing an n-gram
approach for tagging a new text is known as the Viterbi Algorithm, which is a search
algorithm that avoids the polynomial expansion of a breadth first search by trimming
the search tree at each level using the best m Maximum Likelihood Estimates (MLE)
where m represents the number of tags of the following word.
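As a concrete illustration of the bigram case, the following minimal Viterbi sketch finds the best tag sequence over a toy hand-set HMM; in a real tagger the start, transition and emission probabilities are estimated from a tagged corpus, and the tiny tagset and numbers below are purely illustrative.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    # V[k][t] = (probability of the best path ending in tag t, that path)
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            # choose the best previous tag for reaching tag t at word w
            p, path = max((V[-1][pt][0] * trans_p[pt][t] *
                           emit_p[t].get(w, 1e-6), V[-1][pt][1]) for pt in tags)
            layer[t] = (p, path + [t])
        V.append(layer)
    return max(V[-1].values())[1]

tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}
trans_p = {"DT": {"DT": 0.01, "NN": 0.9, "VBZ": 0.09},
           "NN": {"DT": 0.01, "NN": 0.3, "VBZ": 0.69},
           "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1}}
emit_p = {"DT": {"the": 0.9}, "NN": {"dog": 0.5}, "VBZ": {"walks": 0.5}}

print(viterbi(["the", "dog", "walks"], tags, start_p, trans_p, emit_p))
# ['DT', 'NN', 'VBZ']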

Advantages of the Statistical Approach:

• Very robust; can process any input string
• Training is automatic and very fast
• Can be retrained for different corpora/tagsets without much effort
• Language independent
• Minimizes human effort and human error.

3.4.5 Other Techniques

Apart from these, a few different approaches for tagging have been developed.

Support Vector Machines: This is the powerful machine learning method used for
various applications in NLP and other areas like bio-informatics, data mining, etc.

Neural Networks: These are potential candidates for the classification task since
they learn abstractions from examples [141].

Decision Trees: These are classification devices based on hierarchical clusters of
questions. They have been used for natural language processing such as POS Tagging.
“Weka” can be used for classifying the ambiguous words [141].

Maximum Entropy Models: These avoid certain problems of statistical


interdependence and have proven successful for tasks such as parsing and POS tagging.

Example-Based Techniques: These techniques find the training instance that is


most similar to the current problem instance and assume the same class for the new
problem instance as for the similar one.

3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER

3.5.1 Two level Morphological Analysis

Koskenniemi (1985) [26] describes two-level morphology as a “general, language


independent framework which has been implemented for a host of different languages
(Finnish, English, Russian, Swedish, German, Swahili, Danish, Basque, Estonian,
etc.)”. It consists of two representations and one relation.

The surface representation of a word-form:

This is the actual spelling of the final valid word. For example English words
eating and swimming, are both surface representations.

The lexical (also called morphophonemic) representation of a word-form:

This shows a simple concatenation of base forms and tags. Consider the
following examples showing the lexical and surface form of English words.

Lexical Form Surface Form

talk + Verb talk

walk + Verb + 3PSg walks

eat +Verb + Prog eating

swim +Verb + Prog swimming

It may be noted that the lexical representation (or form) is often invariant or
constant. In contrast, affixes and bases of the surface form tend to have alternating
shapes. This can be seen in the above examples. The same tag “+Verb + Prog” is used
with both eat and swim, but swim is realized as swimm in the context of ing, while eat
shows no alternation in the context of ing. The rule component consists of rules which
map the two representations to each other. Each rule is described through a Finite-State
Transducer (FST). Figure 3.5 schematically depicts two-level morphology.

Figure 3.5 Two Level Morphology
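As a toy illustration of the rule component, the sketch below realizes the lexical form directly in Python; real two-level systems compile such rules into finite-state transducers, and the regular-expression gemination condition here is a simplifying assumption.

import re

def realize(lexical):
    """Map a lexical form such as 'swim +Verb +Prog' to its surface form."""
    stem = lexical.split()[0]
    if lexical.endswith("+Prog"):
        # gemination: a final consonant after a short vowel doubles before -ing
        if re.search(r"[^aeiou][aeiou][^aeiou]$", stem):
            stem += stem[-1]
        return stem + "ing"
    return stem

print(realize("swim +Verb +Prog"))  # swimming (alternating surface shape)
print(realize("eat +Verb +Prog"))   # eating   (no alternation)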

3.5.2 Unsupervised Morphological Analyzer

The definition of Unsupervised Learning of Morphology is given below.

“Input: Raw (un-annotated, non-selective) natural language text data.”

“Output: A description of the morphological structure

(there are various levels to be distinguished) of the language of the input text.”

Some approaches have explicit or implicit biases towards certain kinds of


languages; they are nevertheless considered to be Unsupervised Learning of
Morphology. Morphology may be narrowly taken as to include only derivational and
grammatical affixation, where the number of affixations a root may take is finite and
the order of affixation may not be permuted. A number of approaches focus on
concatenative morphology/compounding only. All works considered are designed to
function on orthographic words, i.e., raw text data in an orthography that segments
at the word level.

3.5.3 Memory based Morphological Analysis

Memory based learning approach models morphological analysis (including


compounding) of complex word-forms as sequences of classification tasks. MBMA
(Memory-Based Morphological Analysis) is a memory-based learning system (Stanfill
and Waltz, 1986) [142]. Memory-based learning is a class of inductive, supervised
machine learning algorithms that learn by storing examples of a task in memory.
Computational effort is invested on a "call-by-need" basis for solving new examples
(henceforth called instances) of the same task. When new instances are presented to a
memory-based learner, it searches for the best matching instances in memory,
according to a task-dependent similarity metric. When it has found the best matches
(the nearest neighbors), it transfers their solution (classification, label) to the new
instance.

3.5.4 Stemmer based Approach

A stemmer uses a set of rules containing a list of stems and replacement rules for stripping
affixes. It is a program-oriented approach in which the developer has to specify all
possible affixes together with replacement rules. The Porter algorithm is one of the most widely used
stemming algorithms, and it is freely available. The advantage of the stemmer approach is
that it is well suited to highly agglutinative languages like the Dravidian languages for
creating a Morphological Analyzer and Generator.

3.5.5 Suffix Stripping based Approach

For highly agglutinative languages such as the Dravidian languages, a Morphological Analyzer
and Generator (MAG) can be successfully built using the suffix stripping approach. The advantage
of the Dravidian languages is that no prefixes or circumfixes exist for words. Words
are usually formed by adding suffixes to the root word serially. This property is
well suited to a suffix stripping based Morphological Analyzer and Generator. Once the
suffix is identified, the stem of the whole word can be obtained by removing that suffix
and applying the proper orthographic (sandhi) rules. Using a set of dictionaries, such as a stem
dictionary and a suffix dictionary, together with morphotactics and sandhi rules, a suffix
stripping algorithm successfully implements a MAG.
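A minimal sketch of this idea follows: match the longest known suffix, strip it (undoing the gemination sandhi by listing the geminated consonant as part of the suffix entry), and recurse on the remaining stem. The romanized suffix table is an illustrative assumption, not the thesis's actual suffix dictionary.

SUFFIXES = [("kkaL", "+PL"), ("kaL", "+PL"), ("ai", "+ACC")]  # longest match first

def analyze(word):
    """Strip one suffix layer at a time, longest match first."""
    for suffix, tag in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            # stripping "kkaL" removes the geminated consonant too,
            # which undoes the sandhi change and recovers the bare root
            return analyze(word[: -len(suffix)]) + [tag]
    return [word]

print(analyze("pUkkaL"))    # ['pU', '+PL']
print(analyze("pUkkaLai"))  # ['pU', '+PL', '+ACC'] (plural + accusative)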

3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION

Since the first idea of using machines for language translation, many different
approaches to machine translation have been proposed, implemented and put into use.
The main approaches to machine translation are:

• Linguistic or Rule Based Approaches


o Direct Approach
o Interlingua Approach
o Transfer Approach
• Non-Linguistic Approaches
o Dictionary Based Approach
o Corpus Based Approach
o Example Based Approach
o Statistical Approach
• Hybrid Approach

Direct, Interlingua and Transfer approaches are linguistic approaches which require
some sort of linguistic knowledge to perform translations, whereas the dictionary based,
example based and statistical approaches fall under the non-linguistic approaches that do not
require any linguistic knowledge to translate sentences. The hybrid approach is a
combination of both linguistic and non-linguistic approaches.

3.6.1 Linguistic or Rule Based Approaches

Rule based approaches require a great deal of linguistic knowledge during translation.
They use grammar rules and computer programs to analyse the text, determining
grammatical information and features for each and every word in the source language,
and then translate it by replacing each word with a lexicon entry or word that has
the same context in the target language. The rule based approach is the principal
methodology that was developed in machine translation. Linguistic knowledge is
required in order to write the rules for this type of approach, and these rules play a
vital role during the different levels of translation. This approach is also called
Theory based Machine Translation.

The benefit of the rule based machine translation method is that it can deeply
examine a sentence at the syntactic and semantic levels. The complications of this
method include the prerequisite of vast linguistic knowledge and the very large number of rules
needed in order to cover all the features of a language. An advantage of the approach
is that the developer has more control over the translations than is the case with corpus-
based approaches. The three different approaches that require linguistic knowledge are
as follows.

3.6.1.1 Direct Approach

The direct translation approach can be considered the first approach to machine
translation. In this type of approach, the machine translation system is designed
specifically for one particular pair of languages. There is no need to identify
thematic roles or universal concepts in this approach. It involves analysing
morphological information, identifying the constituents, reordering the words of
the source language according to the word order pattern of the target language,
replacing the words of the source language with target language words using a lexical
dictionary of that particular language pair and, as a last step, inflecting the words
appropriately to produce the translation. This approach may look as though a lot of
work has to be done in order to produce translations, but each of the steps involved
is simple and can be accomplished very easily, in a short span of time.
Figure 3.6 illustrates the block diagram of the direct approach to machine translation.

This approach performs only a simple and minimal syntactic and semantic analysis,
by which it differs from the other rule based translation systems such as the interlingua and
transfer-based approaches; for this reason, the direct approach is considered ad-hoc
and is found to be unsuitable for general machine translation. Table 3.5 describes,
by example, how the sentence “he came late to school yesterday” is translated from
English to Tamil using the direct approach.

Figure 3.6 Block Diagram of Direct Approach to Machine Translation

Table 3.5 An Example to Illustrate the Direct Approach

Input Sentence in English: He came late to school yesterday
Morphological Analysis: He come PAST late to school yesterday
Constituent Identification: <He><come PAST><late><to school><yesterday>
Word Reordering: <He><yesterday><to school><late><come PAST>
Dictionary Lookup: mtd; new;W gs;spf;F neuk; fHpj;J th PAST
Inflection (the final translated sentence): mtd; new;W gs;spf;F neuk; fHpj;J te;jhd;.

3.6.1.2 Interlingua Approach

Interlingua approach to machine translation mainly aims at transforming the texts in the
source language to a common representation which is applicable to many languages.
Using this representation the translation of text to the target language is performed and
it should be possible to translate to every language from the same Interlingua
representation with the right rules.

Interlingua approach sees machine translation as a two stage process:

1. Analysing and transforming the source language texts into a common language
independent representation.

2. From the common language independent form generate the text in the target
language.

The first stage is particular to source language and doesn’t require any knowledge
about the target language whereas the second stage is particular to the target language
and doesn’t require any knowledge from the source language. The main advantage of
interlingua approach is that it creates an economical multilingual environment that
requires 2n translation systems to translate among n languages where in the other case,
the direct approach requires n(n-1) translation systems. Table 3.6 shows the Interlingua
representation of the sentence, “he will reach the hospital in ambulance”.

Table 3.6 An Example for Interlingua Representation

Predicate Reach

Agent Boy (Number: Singular)

Theme Hospital (Number: Singular)

Instrument Ambulance (Number: Singular)

Tense FUTURE

The concepts and relations that are used are the most important aspect in any
interlingua-based system. The ontology should be powerful enough that all subtleties of
meaning that can be expressed using any language should be representable in the
Interlingua. Interlingua approach can be found more economical when translation is

carried out with three or more languages, but the complexity of this approach also
increases dramatically. This is clearly evident from the Vauquois triangle, which is
shown in Figure 3.7.

Figure 3.7 The Vauquois Triangle

3.6.1.3 Transfer Approach

The less determined transfer approach has three stages, comprising the intellectual
representations of the source and target language texts, instead of the two stages in the
Interlingua approach. The transfer approach can be done either by considering the
syntactic or the semantic information of the text. In general, transfer can either be syntactic
or semantic depending on the need.

The transfer model involves three stages, which are analysis, transfer, and
generation. In the analysis stage, the source language sentence is parsed, and the
sentence structure and the constituents of the sentence are identified. In the transfer
stage, transformations are applied to the source language parse tree to convert the
structure to that of the target language. The generation stage translates the words and
expresses the tense, number, gender etc. in the target language. Figure 3.8 shows the
block diagram of the transfer approach.

Figure 3.8 Block Diagram for Transfer Approach

Consider the sentence, “he will come to school in bus”. Table 3.7 illustrates the
three stages of the translation of this sentence using the transfer approach. The sentence
representation after the analysis stage of the transfer approach is shown in the analysis
row. The representation of the sentence after reordering it according to the Tamil word
order, as a result of the transfer stage of the transfer approach, is also shown in Table 3.7. The
final generation stage replaces the source language words with target language
words.

From the above example, it is clear that the analysis stage of this approach
produces a representation that is source language dependent and the generation stage
generates the final translation from the target language dependent representation of the
sentence. Thus, using this approach in a multilingual machine translation system to
translate n languages will require n analyser components, n(n-1) transfer components,
since individual transfer components are required for translation between each pair of
languages in each direction, and n generation components.

Table 3.7 An Example for Transfer Approach

Input Sentence He will come to school in bus

Analysis <he><will come><to school><in bus>

Transfer <he><in bus><to school><will come>

Generation (Output) அவன் ேப ந்தில் பள்ளிக்கு வ வான்

3.6.2 Non-Linguistic Approaches

The non-linguistic approaches are those which do not require any explicit linguistic knowledge
to translate texts in the source language to the target language. The only resource
required by these approaches is data: either dictionaries, for the dictionary
based approach, or bilingual and monolingual corpora, for the empirical or corpus based
approaches.

3.6.2.1 Dictionary based Approach

The dictionary based approach to machine translation uses dictionary for the language
pair to translate the texts in the source language to target language. In this approach,
word level translations are done. The dictionary based approach may be
preceded by pre-processing stages that analyse the morphological information and
lemmatize the word before it is retrieved from the dictionary. This kind of approach can be
used to translate the phrases in a sentence and found to be least useful in translating a
full sentence. This approach will be very useful in accelerating the human translation,
by providing meaningful word translations and limiting the work of humans to
correcting the syntax and grammar of the sentence.

3.6.2.2 Empirical or Corpus based Approach

The corpus based approaches do not require any explicit linguistic knowledge to
translate a sentence, but a bilingual corpus of the language pair and a monolingual
corpus of the target language are required to train the system to translate a sentence.
This approach has drawn a lot of interest worldwide.

3.6.2.3 Example based Approach

This approach to machine translation is a technique that is mainly based on how human
beings interpret and solve problems. That is, normally humans split the problem
into sub-problems, solve each of the sub-problems with the idea of how they solved this
type of similar problem in the past, and integrate them to solve the problem as a whole.
This approach needs a huge bilingual corpus of the language pair between which
translation has to be performed.

The EBMT system functions like a translation memory. A translation memory
is a computer aided translation tool that is able to reuse previous translations: if the
sentence or a similar sentence has been translated previously, the previous translation is
returned. In contrast, the EBMT system can translate novel sentences and not just
reproduce previous sentence translations. EBMT translates in three steps: matching,
alignment, and recombination [143]. 1) In matching, the system looks in its database of
previous examples and finds the pieces of text that together give the best coverage of
the input sentence. This matching is done using various heuristics, from exact character
match to matches using higher linguistic knowledge to calculate the similarity of words
or identify generalized templates. 2) The alignment step is then used to identify which
target words these matching strings correspond to. This identification can be done by
using existing bilingual dictionaries or automatically deduced from the parallel data. 3)
Finally, these correspondences are recombined and the rejoined sentences are judged
using either heuristic or statistical information. Figure 3.9 shows the block diagram of
the example-based approach.

Figure 3.9 Block Diagram of EBMT System

In order to get a clear idea of this approach, consider the English sentence “He
bought a home” and the Tamil translations given in Table 3.8.

Table 3.8 Example of English and Tamil Sentences

English Tamil
He bought a pen mtd; xU ngdh th';fpdhd;
He has a home mtDf;F xU tPL ,Uf;fpwJ

The parts of the sentence to be translated are matched with these two
sentences in the corpus. Here, the part of the sentence ‘He bought’ gets matched with
the words in the first sentence pair and ‘a home’ gets matched with the words in the
second sentence pair. Therefore, the corresponding Tamil parts of the matched segments
of the sentences in the corpus are taken and combined appropriately. Sometimes, post-
processing may be required in order to handle number and gender if exact words are
not available in the corpus.

3.6.2.4 Statistical Approach

The statistical approach to machine translation generates translations using statistical
methods by deriving the parameters for those methods from an analysis of bilingual
corpora. This approach differs from the other approaches to machine translation in
many aspects. Figure 3.10 shows the simple block diagram of a Statistical Machine
Translation (SMT) system.

Figure 3.10 Block Diagram of SMT System

The advantages of the statistical approach over other machine translation approaches are as
follows:

• The enhanced usage of resources available for machine translation, such as
manually translated parallel and aligned texts of a language pair, books available in
both languages and so on. That is, a large amount of machine readable natural
language text is available to which this approach can be applied.

• In general, statistical machine translation systems are language independent i.e., it


is not designed specifically for a pair of language.

• Rule based machine translation systems are generally expensive as they employ
manual creation of linguistic rules and also these systems cannot be generalised for
other languages, whereas statistical systems can be generalised for any pair of
languages, if bilingual corpora for that particular language pair is available.

• Translations produced by statistical systems are more natural compared to that of


other systems, as it is trained from the real time texts available from bilingual
corpora and also the fluency of the sentence will be guided by a monolingual corpus
of the target language.

Statistical parameters are analysed and determined from bilingual and
monolingual corpora. Using these parameters, translation and language models are
generated. Designing a statistical system for a particular language pair is a rapid
process because the main work lies in creating a bilingual corpus for that particular language
pair. In order to obtain better translations from this approach, the system needs at least
two million words for a particular domain. Moreover, Statistical Machine
Translation requires an extensive hardware configuration to create translation models in
order to reach average performance levels.

3.6.3 Hybrid Machine Translation System

Hybrid machine translation approach makes use of the advantages of both statistical
and rule-based translation methodologies. Commercial translation systems such as Asia
Online and Systran provide systems that were implemented using this approach. Hybrid
machine translation approaches differ in a number of aspects:

Rule-based system with post-processing by the statistical approach: Here the rule based
machine translation system produces translations for a given text in the source language to the
target language. The output of this rule based system will be post-processed by a
statistical system to provide better translations. Figure 3.11 shows the block diagram
for this system.

Figure 3.11 Rule based Translation System with Post-processing

Statistical translation system with pre-processing by the rule based approach: In this
approach a statistical machine translation system is incorporated with a rule based
system to pre-process the data before providing the data for training and testing. Also,
the output of the statistical system can be post-processed using the rule based
system to provide better translations. The block diagram for this type of system is
shown in Figure 3.12.

Figure 3.12 Statistical Machine Translation System with Pre-processing

3.7 EVALUATING STATISTICAL MACHINE TRANSLATION

This section provides evaluation methods to find the quality of a machine translation
system. Evaluation of machine translation is a very active field of research. There are
two important types of evaluation techniques in machine translation, which are
automatic evaluation and manual evaluation (human evaluation). This subdivision
shows how to evaluate the performance of an MT system, both manually and
automatically. The most reliable method for evaluating translation adequacy and
fluency is through human evaluation. But human evaluation is a slow and expensive
process. The judgments of more than one human evaluator are usually averaged. A

quick, cheap and consistent approach is required to judge the MT systems. A precise
automated evaluation technique would require linguistic understanding. Methods for
automatic evaluation usually find the similarity between the translation output and one
or more translation references.

3.7.1 Human Evaluation Techniques

Statistical Machine Translation outputs are very hard to evaluate. To judge the quality
of a translation, one may ask human translators to score a machine
translation output or compare a system output with a gold standard output. These gold
standard outputs are generated by human translators. In human evaluation, different
translators translate the same sentence in different ways; there is no single correct answer
for the translation task because a sentence can be translated in many ways. The
reasons for translation variation are the choice of words, word order and style of the translators.
So machine translation quality is very hard to predict.

The human evaluation tasks provide the best insight into the performance of an
MT system, but they come with some major drawbacks. It is an expensive and time
consuming evaluation method. To overcome some of these drawbacks, automatic
evaluation metrics have been introduced. These are much faster and cheaper than
human evaluation, and they are consistent in their evaluation, since they will always
provide the same evaluation given the same data. The disadvantage of automatic
evaluation metrics is that their judgments are often not as correct as those provided by a
human. The evaluation process, however, has the advantage that it is not tied to a
realistic translation scenario. Most often, evaluation is performed on sentences where
one or more gold standard reference translations already exist [143].

In human evaluation method, the judges are presented with a gold-standard


sentence and some translations. Table 3.9 shows the scales used for evaluation when
the language being translated into is English. Using this scale, the judges are asked to
assign a score to each of the presented translations. Adequacy and fluency are a
widespread means of doing manual evaluation.

Table 3.9 Scales of Evaluation

Score Adequacy Fluency

5 All Flawless

4 Most Good

3 Much Non-native

2 Little Disfluent

1 None Incomprehensible

3.7.2 Automatic Evaluation Techniques


Automatic evaluation is a method which uses a computer program to judge whether a
translation output is better or not. Currently, automatic evaluation metrics are widely used
to evaluate machine translation systems, and systems are improved based on the rise
and fall of scores in automatic evaluation. The major advantages of this technique are
time and money: it requires less time to judge a huge amount of output. In situations
like everyday system evaluation, human evaluation can be too expensive, slow, and
inconsistent. Therefore, a reliable automatic evaluation metric is very
important to the progress of the machine translation field. In this section, the most widely
used automatic evaluation metrics, BLEU, NIST, edit distance measures, and precision
and recall, are described.

3.7.2.1 BLEU Score

The first and most widely-used automatic evaluation measure is BLEU (BiLingual
Evaluation Understudy) [144]. It was introduced by IBM in Papineni et al. (2002). It
finds the geometric mean of modified n-gram precisions. BLEU considers not only
single word matches between the output and the reference sentence, but also n-gram
matches, up to some maximum n. It is the ratio of correct n-gram of a certain order n in
relation to the total number of generated n-gram of that order. The maximum order n
for n-gram to be matched is typically set to four. This mean is then called BLEU-4.
Multiple reference are also be used to compute BLEU. Evaluating system translation
against multiple reference translation provides a more robust assignment of the
translation quality [144]. The BLEU metric then takes the geometric mean of the scores
assigned to all n-gram lengths. Equation 3.1 shows the formula for BLEU, where N is

the order of n-grams that are used, usually 4, pn is a modified n-gram precision, where
each n-gram in the reference can be matched by at most one n-gram from the
hypothesis. BP is a brevity penalty, which is used to penalize too short translations. It is
based on the length of the hypothesis c, and the reference length r. If several references
are used, there are alternative ways of calculating the reference length, using the
closest, average or shortest reference length. BLEU can only be used to give accurate
system wide scores, since the geometric mean formulation means it will be zero if there
are no overlapping 4-grams, which is often the case in single sentences.

$$\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} \frac{1}{N} \log p_n \right) \qquad (3.1)$$

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,(1 - r/c)} & \text{if } c \le r \end{cases}$$
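The modified n-gram precisions and brevity penalty of equation (3.1) can be sketched as follows for a single reference; this is an illustrative simplification (library implementations such as NLTK's add smoothing), and, as noted above, the geometric mean collapses to nearly zero on a single sentence with no 4-gram overlap.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, N=4):
    log_sum = 0.0
    for n in range(1, N + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum(min(cnt, r[g]) for g, cnt in h.items())  # clip to reference counts
        total = max(sum(h.values()), 1)
        log_sum += math.log(max(clipped, 1e-9) / total) / N    # geometric mean of p_n
    c, rl = len(hyp), len(ref)
    bp = 1.0 if c > rl else math.exp(1 - rl / c)               # brevity penalty
    return bp * math.exp(log_sum)

hyp = "the boy is going to the school".split()
ref = "the boy goes to the school".split()
print(bleu(hyp, ref))   # near zero: no 4-gram overlaps in this single sentence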

3.7.2.2 NIST Metric

The NIST metric (Doddington, 2002) is an extension of the BLEU metric [145]. The
introduction of this metric tried to meet two characteristics of BLEU. First, the
geometric average of BLEU makes the overall score more sensitive to the modified
precision of the individual n’s, than if the arithmetic average is used. This may be a
problem if not many high n-gram matches exist. Second, all word forms are weighted
equally in BLEU. Less frequent word forms may be of higher importance for the
translation than for example high frequent function words, which NIST tries to
compensate for by introducing an information weight. Additionally, the BP is also
changed to have less impact for small variations in length. The information weight of
an n-gram abc is calculated by the following equation:

$$\text{info}(abc) = \log_2\left( \frac{\text{count}(ab)}{\text{count}(abc)} \right) \qquad (3.2)$$

This information weight is used in equation (3.4) instead of the actual count of
matching n-grams. In addition, the arithmetic average is used instead of the geometric,
and the BP is calculated based on the average reference length instead of the closest
reference length. The lengths of these are summed for the entire corpus (r) and the same
for the translations (t).

$$BP = \exp\left\{ \beta \cdot \log^2\left[ \min\left( \frac{t}{r},\, 1 \right) \right] \right\} \qquad (3.3)$$

$$\text{NIST} = BP \cdot \sum_{n=1}^{N} \frac{\sum_{\text{all } w_1 \ldots w_n \text{ that co-occur}} \text{info}(w_1 \ldots w_n)}{\sum_{\text{all } w_1 \ldots w_n \text{ in the system output}} 1} \qquad (3.4)$$

The NIST metric is very similar to the BLEU metric, and their correlations with human
evaluations are also close. Perhaps NIST correlates a bit better with adequacy, while
BLEU correlates a bit better with fluency (Doddington, 2002) [145].

3.7.2.3 Precision and Recall

In Automatic evaluation metrics each sentence in system translation is compared


against gold standard or human translations. This gold standard human translation is
called Reference translation. This precision and recall approach is based on word
matches. Precision is a fraction of retrieved docs that are relevant and Recall is defined
as fraction of relevant docs that are retrieved. This metric is mainly used in information
retrieval systems. The significant drawback of this metric while using in Machine
translation is, not considerable of word order.

Precision: number of relevant documents retrieved by a search divided by the total


number of documents retrieved by that search

P(relevant | retrieved)

Recall: the number of relevant documents retrieved by a search divided by the total
number of existing relevant documents (which should have been retrieved)

P(retrieved | relevant)

For Example,

SMT OUTPUT: Israeli officials responsibility of airport safety

REFERENCE: Israeli officials are responsible for airport security

Precision = Correct / Output length = 3 / 6 = 50%

Recall = Correct / Reference length = 3 / 7 = 42.85%

The F Measure (weighted harmonic mean) is a combined measure that assesses the
precision/recall tradeoff.

F= 2( P x R) / (P+R)

F= 2(.5 x .4285 ) / (.5 + .4285)

F= 46%
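The worked example above can be reproduced with a small bag-of-words function (word order ignored, which is the drawback noted earlier):

from collections import Counter

def prf(hyp, ref):
    h, r = Counter(hyp), Counter(ref)
    correct = sum(min(cnt, r[w]) for w, cnt in h.items())  # bag-of-words matches
    p = correct / len(hyp)
    rec = correct / len(ref)
    f = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f

hyp = "Israeli officials responsibility of airport safety".split()
ref = "Israeli officials are responsible for airport security".split()
print(prf(hyp, ref))   # (0.5, 0.4286..., 0.4615...), matching the example above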

3.7.2.4 Edit Distance Measures

Edit Distance Measures provide an estimate of translation quality based on the number
of changes which must be applied to the automatic translation so as to transform it into
a reference translation

• WER - Word Error Rate (Nießen et al., 2000) [147]. This measure is based on
the Levenshtein distance (Levenshtein, 1966) [146], the minimum number of
substitutions, deletions and insertions that have to be performed to convert the
automatic translation into a reference translation (a minimal implementation
sketch follows this list).

• PER- Position-independent Word Error Rate (Tillmann et al., 1997) [148]. A


shortcoming of the WER is that it does not allow reordering of words. In order
to overcome this problem, the position independent word error rate (PER)
compares the words in the two sentences without taking the word order into
account.

• TER- Translation Edit Rate (Snover et.al., 2006) [149]. TER measures the
amount of post-editing that a human would have to perform to change a system
output so it exactly matches a reference translation. Possible edits include
insertions, deletions, and substitutions of single words as well as shifts of word
sequences. All edits have equal cost. 

TER = # of edits to closest reference / average # of reference words

The edits that TER considers are insertion, deletion and substitution of
individual words, as well as shifts of contiguous words. TER has also been
shown to correlate well with human judgment. 
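As referenced in the WER item above, the following is a minimal word-level Levenshtein/WER sketch in which substitutions, deletions and insertions all have unit cost:

def wer(hyp, ref):
    """Word error rate: Levenshtein distance over words / reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[-1][-1] / len(ref)

hyp = "Israeli officials responsibility of airport safety".split()
ref = "Israeli officials are responsible for airport security".split()
print(round(wer(hyp, ref), 2))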

3.8 SUMMARY

This chapter provided background on Tamil language processing and various


approaches for developing linguistic tools and Machine Translation system. This
chapter also gives an overview of Tamil language and its morphology. Machine
learning for Natural Language Processing and evaluation methods for Machine
translation are also discussed in this chapter.

   

CHAPTER 4

PREPROCESSING FOR ENGLISH SENTENCE


Current phrase based Statistical Machine Translation systems do not use any linguistic
information; they operate only on surface word forms. It has been shown that adding
linguistic information helps to improve the translation process, and this can be done
through preprocessing steps. Moreover, a machine translation system for a language
pair with disparate morphological structures needs good pre-processing or modeling
before translation. This chapter explains how preprocessing is applied to the raw
source language sentence to make it more appropriate for translation.

4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE

The grammar of a language is divided into syntax and morphology. Syntax describes how words are combined to form a sentence, and morphology deals with the formation of words. Morphology is also defined as the study of how meaningful units can be combined to form words. One of the reasons to process morphology and syntax together in language processing is that a single word in one language may be equivalent to a combination of words in another. The term "morpho-syntax" is a hybrid word derived from morphology and syntax. It plays a major role in processing different types of languages and is closely related to machine translation, because the fundamental units of machine translation are words and phrases. Retrieving syntactic information is a primary step in preprocessing English sentences. The process of retrieving the syntactic structure of a given sentence is called parsing, and the tool used to retrieve morphological features from a word is called a morphological analyzer. Syntactic information includes dependency relations, syntactic structure and POS tags; morphological information consists of the lemma and morphological features.

Klein and Manning (2003) [150] from Stanford University proposed a statistical technique for retrieving the syntactic structure of English sentences. Based on this technique, the Stanford Parser tool was developed. This parser provides dependency relationships as well as phrase structure trees for a given sentence. The Stanford parser package is a Java implementation of probabilistic natural language parsers, including a highly optimized PCFG parser, a lexicalized dependency parser, and a lexicalized PCFG parser. The parser has also been developed for other languages such as Chinese, Italian, Bulgarian, and Portuguese. It uses the knowledge gained from hand-parsed sentences to produce the most likely analysis of new sentences. In this preprocessing, the Stanford parser is used to retrieve the morpho-syntactic information of English sentences.

4.1.1 POS and Lemma Information

Part-of-Speech (POS) tagging is the task of labeling each word in a sentence with its appropriate part of speech, such as noun, verb or adjective. The process takes an untagged sentence as input, assigns a POS tag to each word, and produces the tagged sentence as output. The most widely used part-of-speech tagset for English is the Penn Treebank tagset, which is given in Appendix A. In this thesis, English sentences are tagged using this tagset. The POS and lemma of several word forms are shown in Table 4.1. The example shown below illustrates POS tagging for an English sentence.

English Sentence

The boy is going to the school.

Part-of-Speech Tagging

The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN ./.

Table 4.1 POS and Lemma of Words

Word       POS   Lemma

Playing    NN    Playing
Playing    VBG   Play
Walked     VBD   Walk
Pens       NNS   Pen
Training   VBG   Train
Training   NN    Training
Trains     NNS   Train
Trains     VBZ   Train

A morphological analyzer or lemmatizer is used to find the lemma of a word. Lemmas have special importance in highly inflected languages. A lemma is a dictionary word or root word. For example, the word "play" is available in a dictionary, but word forms like "playing", "played", and "plays" are not. The word "play" is therefore called the lemma or dictionary word for these word forms.
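The following sketch shows how POS tags and lemmas of the kind listed in Table 4.1 can be obtained programmatically. It uses NLTK's off-the-shelf English tagger and WordNet lemmatizer purely as stand-ins; the thesis itself retrieves this information from the Stanford parser.

    import nltk
    from nltk.stem import WordNetLemmatizer

    # One-time downloads assumed: punkt, averaged_perceptron_tagger, wordnet
    lemmatizer = WordNetLemmatizer()

    def pos_and_lemma(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # Penn Treebank tags
        result = []
        for word, tag in tagged:
            # Map the Penn tag to the coarse POS expected by the lemmatizer
            wn_pos = {'V': 'v', 'J': 'a', 'R': 'r'}.get(tag[0], 'n')
            result.append((word, tag, lemmatizer.lemmatize(word.lower(), wn_pos)))
        return result

    print(pos_and_lemma("The boy is going to the school."))
    # e.g. ('going', 'VBG', 'go'), ('school', 'NN', 'school')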

4.1.2 Syntactic Information

Syntactic information of a language is used in NLP tasks like machine translation, question answering, information extraction and language generation. Syntactic information can be extracted by parsing. Parsing extracts information such as part-of-speech tags, phrases and relationships between the words in a sentence. In addition, noun phrases, verb phrases and prepositional phrases can be identified from the parse tree of a sentence. Figure 4.1 shows an example of an English syntactic tree. The parser output is a tree structure with a sentence label as the root. The example shown below illustrates the syntactic information of an English sentence.

Figure 4.1 Example of English Syntactic Tree

English Sentence

The boy is going to the school.

Parts of speech for each word

(NN = Noun, VBZ = Verb, DT = Determiner, VBG = Verbal Gerund)


The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN ./.

Parsing information

(ROOT
(S
(NP (DT The) (NN boy))
(VP (VBZ is)
(VP (VBG going)
(PP (TO to)
(NP (DT the) (NN school)))))
(. .)))

Phrases

Noun Phrases (NP): “the boy”, “the school”

Verb Phrases (VP): “is”, “going”

Sentences (S): “the boy is going to the school”
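Given the bracketed parse above, the phrases can be collected mechanically by walking the tree. The sketch below reads the Stanford-style bracketing with NLTK's Tree class; it is an illustration, not part of the thesis pipeline.

    from nltk.tree import Tree

    parse = Tree.fromstring(
        "(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ is) (VP (VBG going) "
        "(PP (TO to) (NP (DT the) (NN school))))) (. .)))")

    # Collect every noun phrase and verb phrase in the tree
    for subtree in parse.subtrees(lambda t: t.label() in ('NP', 'VP')):
        print(subtree.label(), ' '.join(subtree.leaves()))
    # NP: The boy / VP: is going to the school / NP: the school ...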

4.1.3 Dependency Information

Dependency information represents relations between individual words. A typed dependency parser additionally labels dependencies with grammatical relations, such as subject, direct object and indirect object. It is used in several NLP applications, which benefit particularly from having access to dependencies between words typed with grammatical relations, since these relations also provide information about predicate-argument structure that is not readily available from phrase structure parse trees. The Stanford typed dependency representation was designed to provide a simple description of the grammatical relationships in a sentence. It can be easily understood and used even by people without linguistic knowledge, and it is also used to extract textual relations. An example of the typed dependency relations for an English sentence is given below.

English Sentence

The boy is going to the school.

The boy [Subject] is going [Verb] to the school [Object]

Typed dependencies

det(boy-2, The-1)

nsubj(going-4, boy-2)

aux(going-4, is-3)

root(ROOT-0, going-4)

prep(going-4, to-5)

det(school-7, the-6)

pobj(to-5, school-7)
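Typed dependencies of this kind can be retrieved with any dependency parser. The sketch below uses spaCy purely as an illustration; its label set differs in places from the Stanford typed dependencies used in this thesis, and the model name is an assumption of the sketch.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes this English model is installed
    doc = nlp("The boy is going to the school.")
    for token in doc:
        # Print relation(head-position, dependent-position), mirroring
        # the typed-dependency listing above
        print(f"{token.dep_}({token.head.text}-{token.head.i + 1}, "
              f"{token.text}-{token.i + 1})")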

4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES

Recently, SMT systems have been augmented with linguistic information in order to address the problems of word order and morphological variance between language pairs. This preprocessing of the source language is applied consistently to both the training and testing corpora. Source-side preprocessing brings the source language sentence closer to the target language sentence.

This section explains the preprocessing methods applied to English sentences to improve the quality of the English to Tamil statistical machine translation system. The preprocessing module for an English sentence includes three stages: reordering, factorization and compounding. Figure 4.2 shows the preprocessing stages of an English sentence. The first step in preprocessing an English sentence is to retrieve linguistic features such as the lemma, POS tag and syntactic relations using the Stanford parser. These linguistic features, along with the sentence, are passed to the reordering and factorization stages. Reordering applies the reordering rules to the syntactic trees to rearrange the phrases in the English sentence. Factorization takes the surface words in the sentence, factors them using the syntactic tool, and appends this information to the words in the sentence. Part-of-speech tags are simplified and included as a factor during factorization. The factored sentence is then given to the compounding stage. Compounding is defined as adding additional morphological information to the morphological factor of the source (English) words. This additional morphological information includes function words, subject information, dependency relations, auxiliary verbs and modal verbs, and is based on the morphological structure of the target language. After this information is added, a few function words and auxiliary words are removed, and the reordered information is incorporated in the integration phase.

[Figure: English sentence → Stanford parser tool → reordering and factorization → compounding → integration → preprocessed English sentence]

Figure 4.2 Preprocessing Stages of English Sentence

4.2.1 Reordering English Sentences

Reordering transforms the source language sentence into a word order that is closer to that of the target language. In machine translation, the order of the words in the source sentence is often different from the order of the words in the target sentence, and this word-order difference is one of the most significant sources of error in a machine translation system. Phrase-based SMT systems are limited in handling long-distance reordering. A set of syntactic reordering rules was therefore developed and applied to the English sentences to align them better with the Tamil sentences. The reordering rules capture the structural differences between English and Tamil sentences. These transformation rules are applied to the parse trees of the English source sentences, which are produced using the Stanford parser tool. The quality of the parse trees plays an important role in syntactic reordering. In this thesis the source language is English, so the parses are relatively accurate and the reordering based on them matches the target language well. English parsers generally perform better than parsers for other languages because they benefit from a longer research tradition and more advanced statistical parsing techniques.

Reordering has been successfully applied in French to English (Xia and McCord, 2004) [151] and German to English (Collins et al., 2005) [152] translation systems. Xia and McCord (2004) used reordering rules to improve translation quality, with the rules being automatically learned from the parse trees of both source and target sentences [151]. Marta Ruiz Costa-jussà (2006) proposed a novel reordering algorithm for SMT [153], introducing two new approaches: block reordering and Statistical Machine Reordering (SMR). The same author also describes various reordering methods, such as syntax-based reordering and heuristic reordering, in later work (2009) [154]. Ananthakrishnan R. et al. (2008) developed syntactic and morphological preprocessing for an English to Hindi SMT system; they reorder the English source sentence as per Hindi syntax, and segment Hindi suffixes for morphological processing [155]. Recent developments have shown an improvement in translation quality when using explicit syntax-based reordering. One of these developments is the pre-translation approach, which alters the word order of the source sentence to the target language word order before translation. This is done based on predefined linguistic rules that are either manually created or automatically learned from parallel corpora.

4.2.1.1 Syntactic Comparison between English and Tamil

This subsection takes a closer look at notable differences between the syntax of English and Tamil. Syntax is a theory of sentence structure, and it guides reordering when the two languages of a translation pair have disparate sentence structures. English and Tamil are from different language families: English is an Indo-European language and Tamil is a Dravidian language. English has Subject-Verb-Object (SVO) word order, while Tamil has Subject-Object-Verb (SOV) word order. For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object. English is a fixed word order language, whereas Tamil word order is flexible. Flexibility in word order means that the order may change freely without affecting the grammatical meaning of the sentence. While translating from English to Tamil, English verbs have to be moved from after the subject to the end of the sentence.

English prepositions are postpositions in Tamil. Tamil is a head-final language, also called a verb-final language: the Tamil verb comes at the end of the clause. Demonstratives and modifiers precede the noun within the noun phrase. The simplest Tamil sentence can consist of only two noun phrases, with no verb (not even a linking verb) present. For example, when the pronoun (இது) idu 'this' is followed by the noun (புத்தகம்) puththakam 'book', the sequence is translated into English as 'This is a book.' Tamil is a null-subject language: not all Tamil sentences have subjects, verbs and objects, and grammatically valid and meaningful sentences can be formed without some of them. For example, a sentence may have only a verb, such as ("completed"), or only a subject and object without a verb, such as (அது என் வீடு) athu en vIdu ("That [is] my house").

Tamil does not have a copula verb (a linking verb equivalent to the word 'is'); the word 'is' is included in the translations only to convey the meaning more easily. Schiffman (1999) observed that Tamil syntax is the mirror image of English sentence order, especially when there are relative clauses, quotations, adjectival and adverbial clauses, conjoined verbal constructions, and aspectual and modal auxiliaries, among others [156].

4.2.1.2 Reordering Methodology

Reordering is the first step applied in preprocessing the source language sentence. It is an important step for languages that differ in their syntactic structure. The reordering rules are handcrafted based on the word-order differences between English and Tamil. During training, the English sentences in the parallel corpus are reordered; during decoding, the English sentence given for testing is reordered. Lexicalized automatic reordering is implemented in the Moses toolkit, but it is only effective over short distances, so an external module is needed to handle long-range reordering. The reordering stage is also a way of indirectly integrating syntactic information into the source language. The example given below shows the reordering of an English sentence: the first sentence is the original English sentence, the second is the pre-translation reordered source sentence, and the third is the translated Tamil sentence.

English Sentence: I bought vegetables to my home.

Reordered English: I my home to vegetables bought

Tamil Sentence: நான் என்னுடைய வீட்டிற்கு காய்கறிகள் வாங்கினேன்.

(wAn ennudaiya vIddiRkku kAikaRikaL vAmginEn)

Figure 4.3 shows the method for reordering English sentences. The English sentence is given to the Stanford parser to retrieve its syntactic information. The sentence is then reordered using predefined syntactic rules. In order to obtain a word order similar to the target language, reordering is applied to the source sentence prior to translation. 180 rules were developed according to the syntactic differences between English and Tamil. These syntax-based rules reorder the English sentence to match the Tamil sentence better, and they are applied to the English parse tree at the sentence level. The rules are specific to this particular language pair. Reordering is carried out on phrases as well as on words. Each created rule is compared with the production rules of the English sentence; if a match is found, the transformation is performed according to the target rule. Examples of English reordering rules are shown in Table 4.2; all the rules are included in Appendix B.

[Figure: English sentence → Stanford parser tool → syntactic information → syntactic reordering rules → reordered English sentence]

Figure 4.3 Process of Reordering

In general, PBSMT reordering performance degrades when encountering unknown phrases or long sentences. The main advantages of pre-translation reordering are improved word order in the translation and better utilization of the phrase-based SMT system [157]; incorporating reordering into the search process instead would imply a high computational cost.

A reordering rule consists of three units (Table 4.2):

i. the production rule of the original English sentence (source);
ii. the transformed production rule according to the Tamil sentence (target);
iii. source part numbers and target part numbers, which indicate how the source sentence is reordered (transformation).

Table 4.2 Reordering Rules

S.no   Source                  Target                  Transformation

1      S -> NP VP              S -> NP VP              0:0,1:1
2      PP -> TO NP-PRP         PP -> TO NP-PRP         0:0,1:1
3      VP -> VB NP* SBAR       VP -> NP* VB SBAR       0:1,1:0,2:2
4      VP -> VBD NP            VP -> NP VBD            0:1,1:0
5      VP -> VBD NP-TMP        VP -> NP-TMP VBD        0:1,1:0
6      VP -> VBP PP            VP -> PP VBP            0:1,1:0
7      VP -> VBD NP NP-TMP     VP -> NP NP-TMP VBD     0:2,1:0,2:1
8      VP -> VBD NP PP         VP -> PP NP VBD         0:2,1:1,2:0
9      VP -> VBD S             VP -> S VBD             0:1,1:0
10     VP -> VB S              VP -> S VB              0:1,1:0
11     VP -> VB NP             VP -> NP VB             0:1,1:0
12     PP -> TO NP             PP -> NP TO             0:1,1:0
13     VP -> VBD PP            VP -> PP VBD            0:1,1:0

For instance, take the sixth reordering rule from Table 4.2:

VP -> VBP PP # VP -> PP VBP # 0:1,1:0

Here, # separates the units of the reordering rule, and the last unit gives the source and target indexes. In this example, "0:1, 1:0" indicates that the first child of the target rule comes from the second child of the source rule, and the second child of the target rule comes from the first child of the source rule.

Source → 0(VBP) 1(PP)

Target → 1(PP) 0(VBP)

For example, take the English sentence "I bought vegetables to my home". The syntactic tree for this sentence is shown in Figure 4.4.

English Sentence: I bought vegetables to my home.

Figure 4.4 English Syntactic Tree

Production rules of English Sentence


i. S->NP VP
ii. VP->VBD NP PP
iii. PP->TO NP
iv. NP->PRP$ NN

The first production rule, (i) S -> NP VP, matches the first reordering rule in Table 4.2. The target transformation is the same as the source pattern, so the first production rule is unchanged. The next production rule, (ii) VP -> VBD NP PP, matches the eighth reordering rule in the table, whose transformation is 0:2, 1:1, 2:0. This means that the source child order (0,1,2) is transformed into (2,1,0), where 0, 1 and 2 are the indexes of VBD, NP and PP; the transformed pattern is therefore PP NP VBD. This process is applied to each of the production rules in turn. The transformed production rules are given below.

Reordered Production rules of English sentence


i. S->NP VP
ii. VP->PP NP VBD
iii. PP->NP TO
iv. NP->NN PRP$

Reordered English Sentence: I my home to vegetables bought.
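To make the procedure concrete, the following sketch applies two of the Table 4.2 rules to the parse tree of the running example. It represents each rule as a permutation of child indexes and walks the tree bottom-up; this is a simplified illustration of the rule-application step, not the actual reordering module.

    from nltk.tree import Tree

    # Rules from Table 4.2 as {(parent, source children): permutation};
    # the permutation lists, per target position, the source child index.
    RULES = {
        ('VP', ('VBD', 'NP', 'PP')): (2, 1, 0),   # rule 8: VP -> PP NP VBD
        ('PP', ('TO', 'NP')):        (1, 0),      # rule 12: PP -> NP TO
    }

    def reorder(tree):
        """Recursively apply the reordering rules to a parse tree."""
        if isinstance(tree, str):                 # leaf: a surface word
            return tree
        children = [reorder(child) for child in tree]
        rhs = tuple(c.label() for c in children if isinstance(c, Tree))
        perm = RULES.get((tree.label(), rhs))
        if perm is not None and len(rhs) == len(children):
            children = [children[i] for i in perm]
        return Tree(tree.label(), children)

    parse = Tree.fromstring(
        "(S (NP (PRP I)) (VP (VBD bought) (NP (NNS vegetables)) "
        "(PP (TO to) (NP (PRP$ my) (NN home)))))")
    print(' '.join(reorder(parse).leaves()))      # I my home to vegetables bought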

The English side of the parallel corpus used for training is reordered, and the testing sentences are reordered as well. 80% of English sentences are reordered correctly according to the developed rules. Original and reordered English sentences are shown in Table 4.3. After reordering, the English sentences are given to the compounding stage.

Table 4.3 Original and Reordered Sentences

Original Sentences Reordered Sentences

I saw a beautiful child I a beautiful child saw

He came last week He last week came

Sharmi gave her book to Arthi Sharmi her book Arthi to gave

She went to shop for buying fruits She fruits buying for shop to went

Cat is sleeping on the table Cat the table on sleeping is.

4.2.2 Factoring English Sentence

The current phrase-based models are limited to the mapping of small text chunks without the use of any explicit linguistic information, morphological or syntactic. Such information plays a significant role for morphologically rich languages. Furthermore, for many language pairs the availability of bilingual corpora is very limited, and SMT performance depends on the quality and quantity of the corpora. SMT therefore needs a method which uses linguistic information explicitly with smaller amounts of parallel data. Koehn and Hoang (2007) developed the factored translation framework for statistical translation models to tightly integrate linguistic information [10]. It is an extension of phrase-based statistical machine translation that allows the integration of additional morphological and lexical information, such as the lemma, word class, gender and number, at the word level on both the source and target languages. Factoring the English sentence is a basic step in the factored translation system, and the factored translation model is one way of representing morphological knowledge explicitly for statistical machine translation. The factors considered in preprocessing English, and their descriptions, are shown in Table 4.4. Here, "word" refers to the surface word, "lemma" represents the dictionary or root word, "word class" represents the word-class category, and "morphology" represents a compound tag which contains morphological information and/or function words. In some cases the morphology tag also contains dependency relations and/or PNG information. For instance, the English sentence "I bought vegetables to my home" is factored into the linguistic factors shown in Table 4.5. The factored representation of the English sentence is shown in Table 4.6.

Table 4.4 Description of Factors in English Word

FACTORS      DESCRIPTION                                     EXAMPLE

Word         Surface words or word forms                     coming, went, beautiful, eyes
Lemma        Root word or dictionary word                    play, run, home, pen
Word Class   Minimized POS tag                               N, V, ADJ, ADV
Morphology   POS tag, dependency information, function       VBD, NNS, nsubj, pobj, to,
             words, subject information, auxiliary and       has been, will
             modal verbs

Table 4.5 Example of English Word Factors

WORD         LEMMA       POSTAG   W-C   DEPLABEL

I            I           PRP      PRP   nsubj
bought       buy         VBD      V     root
vegetables   vegetable   NNS      N     dobj
to           to          TO       PRE   prep
my           my          PRP$     PRP   poss
home         home        NN       N     pobj

Table 4.6 Factored Representation of English Language Sentence

WORD FACTORS 1

I i|PRP|PRP_nsubj

bought buy|V|VBD

vegetables vegetable|N|NNS_dobj

to to|PRE|TO_prep

my my|PRP|PRP$_poss

home home|N|NN_pobj

English factorization is considered one of the important preprocessing steps. Factorization splits the surface word into linguistic factors and integrates them as a vector. Instead of mapping surface words in translation, factored models map the linguistic units (factors) of the language pair. The Stanford parser is used for factorizing the English sentence. From the parser output, linguistic information such as the lemma, part-of-speech tags, syntactic information and dependency information is retrieved. This linguistic information is integrated as factors attached to the surface word.

1 The Stanford Parser 1.6.5 tool is used for factorization.
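The factored representation itself is just a '|'-joined vector per word, as in Table 4.6. Below is a minimal sketch of the conversion, with the analysis hand-written here instead of being read from the parser output:

    def factorize(tokens):
        """Render (word, lemma, word class, morphology) tuples in the
        word|lemma|class|morph format of a factored corpus."""
        return ' '.join('|'.join(t) for t in tokens)

    tokens = [
        ('I',          'i',         'PRP', 'PRP_nsubj'),
        ('bought',     'buy',       'V',   'VBD'),
        ('vegetables', 'vegetable', 'N',   'NNS_dobj'),
        ('to',         'to',        'PRE', 'TO_prep'),
        ('my',         'my',        'PRP', 'PRP$_poss'),
        ('home',       'home',      'N',   'NN_pobj'),
    ]
    print(factorize(tokens))
    # I|i|PRP|PRP_nsubj bought|buy|V|VBD vegetables|vegetable|N|NNS_dobj ...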

4.2.3 Compounding English Language Sentence

A baseline statistical machine translation (SMT) system considers only surface word forms and does not use linguistic information. Producing the correct target surface word form depends not only on the source word form but also on additional morpho-syntactic information. When translating from a morphologically simpler language into a morphologically rich one, it is very hard to retrieve the required morphological information from the source sentence, yet this information is essential for producing the target word form. The compounding preprocessing phase is used to retrieve the required linguistic information from the source sentence. Morphologically rich languages have a large number of surface forms in the lexicon to compensate for free word order, and this large number of Tamil word forms is very difficult to generate from English words. Compounding is defined as adding additional morphological information to the morphological factor of the source (English) words. This additional morphological information includes subject information, dependency relations, auxiliary verbs, modal verbs and a few function words, and it is based on the morphological structure of Tamil. In the compounding phase, dependency relations are used to identify the function words in the English factored corpora. During integration, a few function words are deleted from the factored sentence and attached as a morphological factor to the corresponding content word.

In Tamil, function words are not available directly; they are fused with the corresponding content words. So, instead of making the sentences similar in some other representation, function words are removed from the English sentence. This process reduces the length of the English sentences. Like function words, auxiliary verbs and modal verbs are also identified and attached to the morphological factor of the head word of the source sentence. The morphological factor representation of the English sentence is then similar to that of the Tamil sentence. This compounding step indirectly integrates dependency information into the source language factors.

4.2.3.1 Morphological Comparison between English and Tamil

Morphology is the study of the structure of words in a language. Words are made up of morphemes, the smallest meaningful units in a word. For example, "pens" is made of "pen" + "s", where "s" is a plural marker, and "talked" is made of "talk" + "ed", where "ed" represents past tense. English is a morphologically simple language, but Tamil is morphologically rich. Morphology is one of the significant factors in improving the performance of a machine translation system. Morphological changes in English verbs are due to tense, and in nouns due to number. Each root verb (or lemma) in English is inflected into four or five word forms; word forms of English are shown in Table 4.7. For example, the English verb 'give' has four word forms. In English, tense-based morphological changes of verbs are sometimes expressed through auxiliary verbs.
Table.4.7 Word forms of English

Word Class   Lemma or Root word   Word-forms

Noun         Cat                  cats
Verb         Give                 gave, giving, given, gives
Adjective    Green                greener, greenest
Adverb       Soon                 sooner, soonest

English words are divided into two types: content words, which carry meaning by referring to objects and actions, and function words, which express the relationships between content words. Tables 4.8 and 4.9 show content words and function words of English. The relationship between content words can also be encoded in morphology. Content words are called open-class words and function words are called closed-class words. The part-of-speech categories of content words are verbs, nouns, adjectives and adverbs; for function words the categories are prepositions, conjunctions and determiners.

Generally, languages differ not only in word order but also in how they encode the relationships between words. English has a strictly fixed word order and makes heavy use of function words but little use of morphology. Tamil has a rich morphological structure and makes heavy use of content words, but it is a free word-order language.

Table 4.8 Content Words of English

Content Words Examples

Nouns John, room, answer, Kumar

Adjectives happy, new, large, grey

Full verbs search, grow, hold, play

Adverbs really, completely, very, also, enough

Numerals one, thousand, first

Yes/No answers yes, no (as answers)

Table.4.9 Function Words of English

Function Words Examples

Prepositions of, at, in, without, between

Pronouns he, they, anybody, it, one

Determiners the, a, that, my, more, much, either, neither

Conjunctions and, that, when, while, although, or

Modal verbs can, must, will, should, ought, need, used

Auxiliary verbs be (is, am, are), have, got, do

Particles no, not, nor, as

Because of the function words, the average number of words in an English sentence is higher than in the equivalent Tamil sentence. Some English function words do not exist in Tamil because their content is fused into Tamil content words. English contains more function words than content words, whereas Tamil has more content words. Since the translations of English function words are fused into Tamil content words, no equivalent translation is available for English function words while translating from English to Tamil, and this leads to alignment problems. Table 4.10 shows various English word forms based on tense.

In Tamil, verbs are morphologically inflected for tense and PNG (Person-Number-Gender) markers, and nouns are inflected for number and case. Each Tamil verb root is inflected into more than ten thousand surface word forms because of the agglutinative nature of the language. This morphological richness of Tamil leads to a sparse data problem in a statistical machine translation system. Examples of Tamil word forms based on tense are given in Table 4.11.

Table 4.10 English Word Forms based on Tenses

Root Word   Tense                Word Form

Play        Simple Present       play
            Present Continuous   is playing
            Present Perfect      have played
            Past                 played
            Past Perfect         had played
            Future               will play
            Future Perfect       will have played
Table 4.11 Tamil Word Forms based on Tenses

Root Word              Tense+PNG      Word Form           Transliteration

விளையாடு (vilayAdu)    Present+1S     விளையாடுகின்றேன்      vilayAdu-kinR-En
                       Present+3SN    விளையாடுகின்றது       vilayAdu-kinR-athu
                       Present+3PN    விளையாடுகின்றன       vilayAd-kinR-ana
                       Past+1S        விளையாடினேன்         vilayAd-in-En
                       Past+3SM       விளையாடினான்         vilayAd-in-An
                       Future+2S      விளையாடுவாய்          vilayAdu-v-Ay
                       Future+3SF     விளையாடுவாள்          vilayAdu-v-AL

4.2.3.2 Compounding Methodology for English Sentence

The morphological difference between English and Tamil makes statistical machine translation a complex task. English mostly conveys the relationships between words using function words or word positions, while Tamil expresses them through morphological variations of the word. Tamil therefore has a larger vocabulary of surface forms, which leads to a sparse data problem in an English to Tamil SMT system. Solving this directly would require a very large parallel training corpus covering all Tamil surface forms, which is very difficult to create or collect because Tamil is a less-resourced language. Instead of trying to cover all surface forms, a method is needed that handles all word forms with a limited amount of data.

Consider the example English sentence (Figure 4.5) "the cat went to the room". In this sentence, the word "to" is a function word which has no separate translation unit in Tamil; its translation is fused into the Tamil content word "அறை" (aRai). Statistical machine translation uses phrase-based models, so it will treat "to the room" as a single phrase aligned to the Tamil word "அறைக்கு" (aRaikku), and there is no difficulty in alignment and decoding. The problem arises for a new sentence which contains a phrase like "to the X" (e.g. "to the school"), where X stands for any noun. Even if X is available in the bilingual corpus, the system cannot decode a correct translation for "to the X", because phrase-based SMT treats "to the X" as an unknown phrase even when X itself is aligned correctly. The function words should therefore be handled separately prior to the SMT system. Here, these words are taken care of by the preprocessing step called compounding. Compounding identifies some of the function words and attaches them to the morphological factor of the related content word in the factored model. It retrieves the morphological information of English content words from dependency relations and function words.

Figure 4.5 English to Tamil Alignment

Compounding also identifies the subject information from the English dependency relations [158]. This subject information is folded into the morphological factor of the English verb, and it helps to identify the PNG (Person-Number-Gender) marker for Tamil during translation. The PNG marker plays an important role in Tamil morphology because of the subject-verb agreement nature of the language; most Tamil verb forms are generated using this marker. English auxiliary verbs are also identified from the dependency information, then removed and folded into the morphological factor of the head word (verb). Figure 4.6 shows the block diagram of compounding an English sentence. The English sentence is factorized and then subjected to the compounding phase. A word in the factorized sentence includes part-of-speech and morphological information as factors. Compounding takes the dependency relations from the Stanford parser and produces the compounded sentence using predefined linguistic rules. These rules are developed from the dependency information, based on the morphological differences between English and Tamil, and they specify the transformations from English morphological factors to Tamil morphological factors. Sample compounding rules are shown in Table 4.12.

[Figure: English sentence → Stanford parser → factorization plus dependency information → dependency rules → deletion of function words, update of the head word's morphological factor, update of the child word's morphological factor → compounded English sentence]

Figure 4.6 Block Diagram for Compounding

Another important advantage of compounding is that it also handles the difficulty of copula constructions in English sentences. The copula is a special type of verb in English, while in other languages other parts of speech serve the role of copula. A copula links the subject of a sentence with a predicate; it is also referred to as a linking verb because it does not describe an action. Examples of copula sentences are given below.

1. Sharmi is a doctor.
2. Mani and Arthi are lawyers.

Table 4.12 Compounding Rules for English Sentence

Dependency      Removed Word   Morphological feature   Morphological feature
Information                    added to Head           added to Child

aux, auxpass    Child word     +Child word             -
dobj            -              -                       +ACC
pobj            Head word      -                       +Head word
poss            Child word     +poss                   -
nsubj           -              +Subject                -

Alignment is one of the important factors in improving translation performance. Compounding helps to improve the quality of word alignment and reduces the length of the English sentences; Table 4.13 shows the average word counts per sentence. It also indirectly helps target word generation. In this thesis, factored SMT is used only for mapping the linguistic factors between English and Tamil. After compounding, the morphological factors of English and Tamil words are relatively similar, so it is easier for the SMT system to align and decode the morphological information. Table 4.14 and Table 4.15 show a factored and a compounded sentence respectively; the reordered sentence is taken as input for compounding.

Table 4.13 Average Words per Sentence

Method     Sentences   Words   Average words per sentence

SMT/FSMT   8300        49632   5.98
C-FSMT     8300        33711   4.06

Table 4.14 Factored English Sentence

I | i | PN | prn
bought | buy | V | VBD
vegetables | vegetable | N | NNS
to | to | TO | TO
my | my | PN | PRP$
home | home | N | NN

Table 4.15 Compounded English Sentence

I | i | PN | prn
bought | buy | V | VBD_i
vegetables | vegetable | N | NNS
to | to | TO | TO
my | my | PN | PRP$
home | home | N | NN_to
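A simplified sketch of the rule application in Table 4.12 is given below. It folds auxiliary, preposition and subject information into the morphological factors of the related content words; for brevity it deletes the affected function words immediately, although in the thesis pipeline that deletion happens later, during integration, and the mapping of the subject lemma to a PNG code such as 1S is also a later step.

    def compound(tokens, deps):
        """tokens: mutable [word, lemma, wclass, morph] lists;
        deps: (label, head index, child index) triples."""
        remove = set()
        for label, head, child in deps:
            if label in ('aux', 'auxpass'):   # fold auxiliary into the head verb
                tokens[head][3] += '_' + tokens[child][0].lower()
                remove.add(child)
            elif label == 'pobj':             # fold the preposition into its noun
                tokens[child][3] += '_' + tokens[head][0].lower()
                remove.add(head)
            elif label == 'dobj':             # mark the direct object for ACC case
                tokens[child][3] += '_ACC'
            elif label == 'nsubj':            # subject info, later used for PNG
                tokens[head][3] += '_' + tokens[child][1].lower()
        return [t for i, t in enumerate(tokens) if i not in remove]

    tokens = [['I', 'i', 'PN', 'prn'], ['bought', 'buy', 'V', 'VBD'],
              ['vegetables', 'vegetable', 'N', 'NNS'], ['to', 'to', 'TO', 'TO'],
              ['my', 'my', 'PN', 'PRP$'], ['home', 'home', 'N', 'NN']]
    deps = [('nsubj', 1, 0), ('pobj', 3, 5)]
    for t in compound(tokens, deps):
        print('|'.join(t))
    # bought|buy|V|VBD_i ... home|home|N|NN_to, with "to" removed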

4.2.4 Integrating Reordering and Compounding

Integration is the final stage in source-side preprocessing. Here the preprocessed English sentence is obtained from the reordering and compounding stages. Reordering takes the raw sentence and reorders it according to the predefined rules. Compounding takes the factored sentence and alters the morphological factors of the content words using the compounding rules. Function words are identified in the compounding stage, and a few of them are removed during the integration process. Figure 4.7 shows the integration of the preprocessing stages, and Table 4.16 shows examples of preprocessed sentences.

Original English sentence:
I bought vegetables to my home.
0 1 2 3 4 5

Reordered English sentence:
I my home to vegetables bought.
0 4 5 3 2 1

Factored English sentence:
I | i | PN | prn  bought | buy | V | VBD  vegetables | vegetable | N | NNS
to | to | TO | TO  my | my | PN | PRP$  home | home | N | NN

Compounded English sentence:
I | i | PN | prn  bought | buy | V | VBD_i  vegetables | vegetable | N | NNS
to | to | TO | TO  my | my | PN | PRP$  home | home | N | NN_to

Preprocessed English sentence:
I | i | PN | prn_i  my | my | PN | PRP$  home | home | N | NN_to
vegetables | vegetable | N | NNS  bought | buy | V | VBD_1S

[Figure: the reordered sentence, the factored/compounded sentence and the function word index are combined by integration into the preprocessed English sentence]

Figure 4.7 Integration Process

Table 4.16 Preprocessed English Sentences

Original: She may not come here
Preprocessed: she|she|PN|prp_she here|here|AD|adv come|come|V|vb_3SF_may_not

Original: I went to school
Preprocessed: I|i|PN|prp_i school|school|N|nn_to went|go|V|vb.past_1S

Original: I gave a book to him
Preprocessed: I|i|PN|prp_i a|a|AR|det book|book|N|nn_ACC him|him|PN|prp_to gave|give|V|vb.past_1S

Original: The cat was killing the rat
Preprocessed: the|the|AR|det cat|cat|N|nn the|the|AR|det rat|rat|N|nn_ACC killing|kill|V|vb.prog_3SN_was

4.3 SUMMARY

This chapter presented linguistic preprocessing of English sentences for better matching with Tamil sentences. The preprocessing stages include reordering, factoring and compounding; a final integration process combines their outputs. The chapter also discussed the effects of the syntactic and morphological differences between English and Tamil. It was shown that the reordering and compounding rules produce significant gains in the factored translation system. Reordering plays an especially important role for language pairs with disparate sentence structure: the difference in word order between two languages is one of the most significant sources of errors in machine translation, and while phrase-based MT systems do well at reordering inside short windows of words, long-distance reordering remains challenging. Translation accuracy can be significantly improved if reordering is done prior to translation. The reordering rules developed here are valid only for English and Tamil, though they can be adapted to other Dravidian languages with small modifications. In future work, automatic rule creation for reordering from bilingual corpora should improve accuracy and make the approach applicable to any language pair. Compounding and factoring are used in order to reduce the amount of English-Tamil bilingual data required, and preprocessing also reduces the number of words per English sentence. The accuracy of preprocessing depends heavily on the quality of the parser. Different studies have shown that preprocessing is an effective method for obtaining word order and morphological information that match the target language; moreover, this preprocessing approach is generally applicable to other language pairs that differ in word order and morphology. This research work has shown that adding linguistic knowledge in the preprocessing of training data can lead to remarkable improvements in translation performance.

CHAPTER 5
PART OF SPEECH TAGGER FOR TAMIL

5.1 GENERAL

Knowledge of the language pair has been shown to improve translation performance, and preprocessing in an SMT system makes the two languages of a pair more similar. As discussed in Chapter 4, the factored translation framework of Koehn and Hoang (2007) [10] tightly integrates linguistic information such as the lemma, word class, gender and number at the word level on both the source and target languages. Preprocessing methods are used to convert Tamil sentences into factored Tamil sentences. The preprocessing module for Tamil includes two stages: POS tagging and morphological analysis. The first step in preprocessing a Tamil sentence is to retrieve the part-of-speech information of each word; this information is included in the factors of the surface word. This chapter explains the development of the Tamil POS tagger system. In the next stage, Tamil morphological analysis is used to retrieve the lemma and morphological information, which are also included in the factors of the surface word. The next chapter (Chapter 6) explains the implementation details of the Tamil morphological analyzer.

5.1.1 Part of Speech Tagging

Part-of-Speech (POS) tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a sentence. It is analogous to tokenization for computer languages. POS tagging is therefore considered an important process in speech recognition, natural language parsing, morphological parsing, information retrieval and machine translation. Generally, a word in a language carries both a grammatical category and grammatical features. Tamil, being a morphologically rich language, inflects with many grammatical features, which makes a POS tagger system complex. Here, a POS tagger system has been developed based only on grammatical categories; a morphological analyzer has been developed separately for handling grammatical features. An automatic part-of-speech tagger can help in building automatic word-sense disambiguation algorithms. Parts of speech are very often used for shallow parsing of texts, or for finding noun and other phrases for information extraction applications. Corpora that have been marked for part of speech are very useful for linguistic research, for example to find instances or frequencies of particular words or sentence constructions in large corpora.

Apart from these, many natural language processing (NLP) activities such as summarization, document classification, natural language understanding (NLU) and question answering (QA) systems depend on part-of-speech tagging. Words are divided into different classes called parts of speech (POS), word classes, morphological classes, or lexical tags. In traditional grammar there are only a few parts of speech (noun, verb, adjective, adverb, etc.), whereas many recent models have a much larger number of word classes (POS tags). Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both their definition and their context.

Parts-of-speech can be divided into two broad super categories:


• CLOSED CLASS types
• OPEN CLASS types.

Closed classes are those that have relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English; new prepositions are rarely coined. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages. There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs. Tamil and English both have all four of these, although not every other language does.

Part-of-speech (POS) tagging means assigning grammatical classes, i.e. suitable part-of-speech tags, to each word in a natural language sentence. Assigning a POS tag to each word of an unannotated text by hand is a laborious and time-consuming process, which has led to the development of various approaches to automate POS tagging. An automatic POS tagger takes a sentence as input, assigns a POS tag to each word in the sentence, and produces the tagged text as output. Tags are also applied to punctuation markers; thus tagging natural language is the same kind of process as tokenization for computer languages. The input to a tagging algorithm is a string of words and a specified tagset; the output is a single best tag for each word. For example, in English:

Take that book.
VB   DT   NN    (tagged using the Penn Treebank tagset)

Even in this simple sentence, automatically assigning a tag to each word is not trivial. For example, the word "book" is ambiguous: it has more than one possible usage and part of speech. It can be a verb (as in "book that bus" or "to book the suspect") or a noun (as in "hand me that book" or "a book of matches"). Similarly, "that" can be a determiner (as in "Does that flight serve dinner?") or a complementizer (as in "I thought that your flight was earlier").
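This contextual disambiguation is easy to observe with any trained English tagger. The sketch below uses NLTK's off-the-shelf tagger (not the Tamil tagger developed in this chapter); the exact tags produced depend on the trained model.

    import nltk
    # one-time setup: nltk.download('averaged_perceptron_tagger')

    # The same surface form should receive different tags in context:
    # ideally VB (verb) in the first sentence and NN (noun) in the second.
    print(nltk.pos_tag("book that bus".split()))
    print(nltk.pos_tag("hand me that book".split()))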

As Tamil is a morphologically rich language, a word may have many possible grammatical categories, which leads to ambiguity. For example, consider the following sentence and its corresponding POS tags:

Tamil Example:

கோவிலில் ஆறு அடி உயரமான மணி உள்ளது.
kOvilil ARu adi uyaramAna maNi uLLathu
NN CRD NN ADJ NN VF
(tagged using the AMRITA tagset)

Here,
"adi" can be tagged as noun (NN) or finite verb (VF),
"ARu" can be tagged as noun (NN) or cardinal (CRD), and
"maNi" can be tagged as common noun (NN) or proper noun (NNP).

Considering the syntax or context of the sentence, the word "adi" here should be tagged as a noun (NN). The problem of automatic POS tagging is to resolve such ambiguities by choosing the proper tag for the context; part-of-speech tagging is thus a disambiguation task. Another important point, discussed and agreed upon, is that POS tagging is not a replacement for a morphological analyzer. A word in a text carries the following linguistic knowledge:

• grammatical category, and

• grammatical features such as gender, number and person.

The POS tag should be based on the category of the word; the features can be acquired from the morphological analyzer.

5.1.2 Tamil POS Tagging

Words can be classified into various part-of-speech classes based on the role they play in the sentence. The traditional Tamil grammarian Tholkappiar classified Tamil word categories into four major classes:

• பெயர் peyar (noun)

• வினை vinai (verb)

• இடை idai (part of speech which modifies the relationships between verbs and nouns)

• உரி uri (word that further qualifies a noun or verb)

Examining the grammatical properties of words in modern Tamil, Thomas Lehman (1983) proposed eight POS categories [134]. The following are the major POS classes in Tamil:

1. Nouns
2. Verbs
3. Adjectives
4. Adverbs
5. Determiners
6. Postpositions
7. Conjunctions
8. Quantifiers
Other POS categories for Tamil

Apart from nouns and verbs, the other open-class POS categories are adverbs and adjectives. Most adjectives and adverbs can be placed in the lexicon in their root form, but there are also adjectives and adverbs derived from noun and verb stems. The morphotactics of adjectives derived from noun roots and verb stems are as follows.

Examples:

Noun_root + adjective_suffix
uyaram + Ana = uyaramAna <ADJ>
உயரம் + ஆன = உயரமான

verb_stem + relative_participle
cey + tha = ceytha <VNAJ>
செய் + த = செய்த

The morphotactics of adverbs derived from noun roots and verb stems are as follows.

noun_root + adverb_suffix
uyaram + Aka = uyaramAka <ADV>
உயரம் + ஆக = உயரமாக

verb_stem + adverbial_participle
cey + thu = ceythu <VNAV>
செய் + து = செய்து

There are a number of non-finite verb forms in Tamil. Apart from participle forms, they are grammatically classified into structures such as infinitives and conditionals.

Examples:
verb_stem + infinitive_marker
paRa + kka = paRakka <VINT> (to fly)
பற + க்க = பறக்க

verb_stem + conditional_suffix
vA + wth + Al = vawthAl <CVB> (if one comes)
வா + ந்த் + ஆல் = வந்தால்

There are other categories such as conjunctions and complementizers. Some of these may be derived forms, but there are not many, so they can be listed in the lexicon. Other categories that need to be listed in the lexicon as roots are postpositions, which form a closed class. This is because they can occur as words in isolation even though they are semantically bonded to the noun or verb preceding them.

5.2 COMPLEXITY IN TAMIL POS TAGGING

As Tamil is an agglutinative language, nouns are inflected for number and case, and verbs are inflected for tense, person, number and gender. Verbs can be adjectivalized and adverbialized, and verbs and adjectives can be nominalized by means of certain nominalizers. Adjectives and adverbs themselves do not inflect. Many postpositions in Tamil [159] come from nominal and verbal sources, so one often has to depend on the syntactic function or context to decide whether a form is a noun, adjective, adverb or postposition. All of this contributes to the complexity of POS tagging for Tamil.

5.2.1 Root Ambiguity


The root word can be ambiguous. It can have more than one sense, sometimes roots
belong to more than one POS category. Though the POS can be disambiguated using
contextual information like co-occurring morphemes, it is not possible always. These
issues should be taken care of when POS taggers are built for Tamil language. For
example, the Tamil root words like adi, padi, isai, mudi, kudi can take both noun and
verb category which leads to the root ambiguity problem in POS tagging.

5.2.2 Noun Complexity


Nouns are the words which denote a person, place, thing, time, etc. In Tamil language,
nouns are inflected for the number and case in morphological level. However on
phonological level, four types of suffixes can occur with noun stem.

Morphological level inflection

Noun (+ number) (+ case)
Example: பூக்களை pUk-kaL-ai <NN>
flower-plural-accusative case suffix

Noun (+ number) (+ oblique) (+ euphonic) (+ case)
Example: பூக்களினை pUk-kaL-in-ai <NN>
flower-plural-euphonic suffix-accusative case suffix
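The noun morphotactics above can be expressed as simple suffix concatenation. The sketch below works at the transliteration level, hard-codes a couple of case suffixes, and ignores sandhi (phonological) changes entirely, so it only illustrates the slot order Noun (+number) (+euphonic) (+case).

    # Illustrative slot order: noun stem (+ plural) (+ euphonic) (+ case)
    PLURAL = 'kaL'
    CASES = {'accusative': 'ai', 'instrumental': 'Al', 'dative': 'ukku'}

    def inflect_noun(stem, plural=False, euphonic=False, case=None):
        form = stem
        if plural:
            form += PLURAL
        if euphonic:
            form += 'in'                       # euphonic increment -in-
        if case:
            form += CASES[case]
        return form

    print(inflect_noun('pUk', plural=True, case='accusative'))
    # pUkkaLai
    print(inflect_noun('pUk', plural=True, euphonic=True, case='accusative'))
    # pUkkaLinai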

Nouns need to be annotated as common noun, compound noun, proper noun, compound proper noun, pronoun, cardinal and ordinal; pronouns need to be further annotated as personal pronouns. Confusion arises between common noun and compound noun, and also between proper noun and compound proper noun. A common noun can also occur as part of a compound noun, for example:

ஊராட்சி UrAdci <NNC> தலைவர் thalaivar <NNC>

When UrAdci and thalaivar come together they form a compound noun (<NNC>), but when they occur separately in a sentence each should be tagged as a common noun (<NN>). Such complexity also occurs with the proper noun (<NNP>) and compound proper noun (<NNPC>). Moreover, confusion can arise between noun and adverb, and between pronoun and emphasis, at the syntactic level.

5.2.3 Verb Complexity

Verbal forms are complex in Tamil. A finite verb has the following morphological structure:

Verb stem + Tense + Person-Number + Gender

Example: நட + ந்த் + ஏன் wada + wth + En = நடந்தேன் <VF>
'I walked'

A number of non-finite forms are possible: adverbial forms, adjectival forms, infinitive forms, and conditionals.

Verb stem + adverbial participle
Example: cey + thu = ceythu <VNAV>
செய் + து = செய்து 'having done'

Verb stem + relative participle
Example: cey + tha = ceytha <VNAJ>
செய் + த = செய்த 'who did'

Verb stem + infinitive suffix
Example: azu + a = aza <VINT>
அழு + அ = அழ 'to weep'

Verb stem + conditional suffix
Example: kEL + d + Al = kEddAl <CVB>
கேள் + ட் + ஆல் = கேட்டால் 'if asked'
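The finite-verb structure can be sketched the same way. The tiny tense and PNG suffix tables below are excerpts taken from the examples in this chapter and Table 4.11, written at the transliteration level with no sandhi handling; note that tense allomorphs actually vary by verb class (e.g. the past of vilayAdu uses -in-), which this sketch does not model.

    # Finite verb: stem + tense marker + person-number-gender suffix
    TENSE = {'past': 'wth', 'present': 'kinR', 'future': 'v'}
    PNG = {'1S': 'En', '2S': 'Ay', '3SM': 'An', '3SF': 'AL'}

    def finite_verb(stem, tense, png):
        return stem + TENSE[tense] + PNG[png]

    print(finite_verb('wada', 'past', '1S'))          # wadawthEn  'I walked'
    print(finite_verb('vilayAdu', 'present', '1S'))   # vilayAdukinREn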

A distinction needs to be made between a main verb followed by another main verb and a main verb followed by an auxiliary verb. A main verb followed by an auxiliary verb needs to be interpreted together, whereas a main verb followed by another main verb needs to be interpreted separately. This leads to functional ambiguity, as described below.

Functional ambiguity in the adverbial <VNAV> form

The morphological structure of the adverbial verb is:

Verb root + adverbial participle
Example: cey + thu = ceythu <VNAV>
செய் + து = செய்து 'having done'

vawthu <VNAV> sAppidduviddu <VNAV> pO <VF>
'having come and having eaten, went'

Functional ambiguity in the adjectival <VNAJ> form

The adjectival <VNAJ> forms differ by tense marking:

Verb stem + Tense + Adjectivalizer
Example: vandta 'x who came'
varukiRa 'x who comes'
varum 'x who will come'
The adjectival <VNAJ> form allows several interpretations, as in the following examples:

sAppida ilai 'the leaf which is eaten by x'
'the leaf on which x had his food and ate'

The um-suffixed adjectival form clashes with other homophonous forms, which leads to ambiguity:

varum <VNAJ> paiyan 'the boy who will come'
varum <VF> 'it will come'

Functional ambiguity in the infinitival <VINT> form

verb_stem + infinitive suffix
Example: azu + a = aza <VINT>
அழு + அ = அழ

vara-v-iru 'going to come'
vara-k-kuuTaatu 'should not come'
vara-s-sol 'ask to come'

5.2.4 Adverb Complexity

A number of adjectival and adverbial forms of verbs are lexicalized as adjectives and adverbs respectively, and they clash semantically with the corresponding sentential adjectival and adverbial forms, creating ambiguity in POS tagging. Adverbs too need to be distinguished based on their source category. Many adverbs are derived by suffixing aaka to nouns in Tamil, and a functional clash can be seen between noun and adverb in such aaka-suffixed forms. This type of clash is seen in other Dravidian languages too.

அவள் அழகாக இருக்கிறாள்
avaL azakAka irukkiRAL
she beauty_ADV be_PRE_she
'she is beautiful'

5.2.5 Postposition Complexity

Postpositions in Tamil come from various categories: verbal, nominal and adverbial. The demarcation line between verb/noun/adverb and postposition is often thin, which leads to ambiguity. Some postpositions are simple and some are compound, and postpositions are conditioned by the case of the nouns they follow, so simply tagging a form as a postposition can be misleading.

There are postpositions which come after nouns as well as after verbs, which makes the postposition ambiguous (spatial vs. temporal):

pinnAl <PPO> 'behind', as in vIddukkup pinnAl 'behind the house'
pinnAl <ADV> 'after', as in avanukkup pinnAl vawthAn 'he came after him'

5.3 PART OF SPEECH TAGSET DEVELOPMENT

To develop a POS tagged corpus, it is necessary to define the tagset (POS tagset) used in that corpus. The collection of all the possible tags is called a tagset, and tagsets differ from language to language. After referring to and considering the available tagsets for Tamil and other languages, a customized tagset named the AMRITA tagset was developed. The guidelines from AnnCorra, IIIT Hyderabad [160] and EAGLES (1996) were also considered while developing the AMRITA tagset. The guidelines followed are given below.

1. The tags should be simple.
2. Maintain simplicity for ease of learning and consistency in annotation.
3. POS tagging is not a replacement for a morphological analyzer.
4. A word in a text carries a grammatical category and grammatical features such as gender, number and person. The POS tag should be based on the category of the word; the features can be acquired from the morphological analyzer.

Another point considered while deciding the tags was whether to come up with a totally new tagset or to take another standard tagset as a reference and modify it according to the objectives of the new tagger. The latter option is often better because tag names assigned by an existing tagger may already be familiar to users and are thus easier to adopt for a new language than a completely new inventory; it also saves the time spent getting familiar with new tags.

5.3.1 Available POS Tagsets for Tamil

Tagset by AUKBC

AUKBC Research Centre at Chennai developed a tagset with the help of eminent linguists from Tamil University, Tanjore. This is an exhaustive tagset which covers almost all possible grammatical and lexical constituents; it contains 68 tags [161].

Vasu Renganathan’s Tagset

Tagtamil by Vasu Renganathan is based on a lexical phonological approach. It handles the morphotactics of morphological processing of verbs using an index method, and it performs both tagging and generation [162].

Tagset by IIIT, Hyderabad

A POS tagset for Indian languages was developed by IIIT, Hyderabad. The tags were decided on coarse linguistic information, with the idea of expanding to finer distinctions if required. The annotation standard for POS tagging for Indian languages includes 26 tags [163].

CIIL Tagset for Tamil

This tagset was developed by CIIL (Central Institute of Indian Languages), Mysore. It contains 71 tags for Tamil. As the tagset considers noun and verb inflection, the number of tags is large: it has 30 noun forms including pronoun categories and 25 verb forms including participle forms [164].

Ganesan’s POS Tagset

Ganesan prepared a POS tagger for Tamil which works well on the CIIL corpus; its efficiency on other corpora remains to be tested. He has a rich tagset for Tamil. He tagged a portion of the CIIL corpus using a dictionary as well as a morphological analyzer, corrected it manually, and used it to train tagging of the rest of the corpus. The tags are added morpheme by morpheme [165].

Selvam’s POS Tagset

The tagset developed by Selvam considers morphological inflections on nouns for various cases such as accusative, dative, instrumental, sociative, locative, ablative, benefactive, genitive and vocative, as well as clitics, and morphological inflections on verbs for tense, etc. [166].

5.3.2 AMRITA POS Tagset

The main drawback of the majority of tagsets used for Tamil is that they take verb and noun inflections into account for tagging. Hence, at tagging time, each inflected word in the corpus must be split into morphemes, which is a tough and time-consuming process. At the POS level, one needs to determine only the word's grammatical category, which can be done with a limited number of tags; the inflectional forms can be taken care of by the morph analyzer. So there is no need for a large number of tags. Moreover, a large number of tags leads to more complexity, which in turn reduces tagging accuracy. Considering the complexity of Tamil POS tagging and after referring to various tagsets, a customized tagset (the AMRITA POS tagset) was developed. The customized POS tagset used in the present research work contains 32 tags, without considering inflections. The 32 tags are listed in Table 5.1.

In the AMRITA POS tagset, compound tags for common nouns (NNC) and proper nouns (NNPC) were used. The tag VBG is used for verbal nouns and participle nouns. These 32 POS tags are used for the POS tagger and chunker. For the morphological analyzer, these 32 tags were further simplified and reduced to 10 tags.

Table 5.1 AMRITA POS Tagset

5.4 DEVELOPMENT OF TAMIL POS CORPORA FOR PREPROCESSING

Corpus linguistics seeks to further the understanding of language through the analysis
of large quantities of naturally occurring data. Text corpora are used in a number of
different ways. Traditionally, corpora have been used for the study and analysis of
language at different levels of linguistic description. Corpora have been constructed for
the specific purpose of acquiring knowledge for information extraction systems,
knowledge-based systems and e-business systems [167]. Corpora have been used for
studying child language development. Speech corpora play a vital role in the
specification, design and implementation of telephonic communication and for the
broadcast media.

There is a long tradition of corpus linguistic studies in Europe. The needs a language has for corpora are manifold: from the preparation of dictionaries and lexicons to machine translation, a corpus has become an indispensable resource for the technological development of a language. A corpus is a large body of text incorporating various types of material, including newspapers, weeklies, fiction, scientific writing, literary writing and so on, and it represents all the styles of a language. A corpus must be very large, as it is used for many language applications such as the preparation of lexicons of different sizes, purposes and types, NLP tools, machine translation programs and so on.

5.4.1 Untagged and Tagged Corpus

An untagged or un-annotated corpus provides limited information to users. A corpus can be augmented with additional information by labeling each morpheme, word, phrase and sentence with its grammatical values. Such information helps the user to retrieve information selectively and easily. Figure 5.1 presents an example of an untagged corpus.

The frequency of a lemma is useful in the analysis of the corpus. Comparing the frequency of a particular word with that of other context words shows whether the word is common or rare. Frequencies are relatively reliable for the most common words in a corpus; but to analyze the senses and association patterns of words, a very large number of occurrences is needed, drawn from a very large corpus containing many different texts and a wide range of topics, so that word frequencies are less influenced by individual texts. Frequency lists based on an untagged corpus are of limited usefulness, because they do not show which grammatical uses are common or rare. A tagged corpus is an important dataset for NLP applications. Figure 5.2 shows an example of tagged corpus.

Figure 5.1 Example of Untagged Corpus

Figure 5.2 Example of Tagged Corpus

5.4.2 Available Corpus for Tamil

Corpora can be distinguished as tagged corpora, parallel corpora and aligned corpora. A tagged corpus is one annotated for part of speech, morphology, lemma, phrases, etc. A parallel corpus contains texts and their translations in each of the languages involved; it allows wider scope for double-checking translation equivalents. An aligned corpus is a kind of bilingual corpus where text samples of one language and their translations into another are aligned sentence by sentence, phrase by phrase, word by word, or even character by character.

CIIL Corpus for Tamil

As far as building corpora for the Indian languages is concerned, it was the Central Institute of Indian Languages (CIIL) which took the initiative and started preparing corpora for some of the Indian languages (Tamil, Telugu, Kannada, and Malayalam). The Department of Electronics (DOE) financed the corpus-building project. The target was to prepare a corpus of ten million words for each language but, due to financial crunch and time restrictions, the project ended up with three million words per language. The Tamil corpus, with three million words, was built by CIIL in this way. It is a partially tagged corpus.

AUKBC-RC’s Improved Tagged Corpus for Tamil

The AUKBC Research Centre, which has taken up NLP-oriented work for Tamil, has improved upon the CIIL Tamil corpus and tagged it for its MT programs. It also developed an English-Tamil parallel corpus to support its goal of building an MT tool for English-Tamil translation. A parallel corpus is very useful for training and for building example-based machine translation systems.

5.4.3 POS Tagged Corpus Development

The tagged corpus is the immediate requirement for different analyses in the field of Natural Language Processing. Most language processing work needs such large databases of text, which provide real, natural, native language of varying types. Corpora can be annotated at various levels: part of speech, phrase/clause level, dependency level, etc. Part of speech tagging forms the basic step towards building an annotated corpus.

For creating a Tamil part of speech tagger, a grammatically tagged corpus is needed. So a tagged corpus of 500k words was built. Sentences from the Dinamani newspaper, Yahoo Tamil news, Tamil short stories, etc. were collected and tagged.

POS corpus tagging is done in three stages:

1. Pre-editing
2. Manual Tagging
3. Bootstrapping

In pre-editing, the untagged corpus is converted into a format suitable for SVMTool, in order to assign a part of speech tag to each word. Because of orthographic similarity, one word may have several possible POS tags. After an initial assignment of possible POS tags, words are manually tagged using the AMRITA tagset. The tagged corpus is trained using the SVMTlearn component. After training, new untagged text is tagged using SVMTagger. The output of SVMTagger is again manually corrected and added to the tagged corpus to increase its size. The corpus development process is elaborated below.

Pre-editing

Tamil text documents were collected from the Dinamani website, Yahoo Tamil, Tamil short stories, etc. (for example, Figure 5.3). The corpus was cleaned using a simple program that removes punctuation (except dots, commas and question marks), and was then split into sentences. The next step is to change the corpus into column format, because the SVMTool training data must be in column format, i.e. one token (word) per line, sentence by sentence. The column separator is the blank space.

Figure 5.3 Untagged Corpus before Pre-editing
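The following minimal Python sketch, written here only for illustration (it is not the cleaning program actually used in the thesis), shows the kind of transformation pre-editing performs: stripping unwanted punctuation, separating the kept punctuation marks, and emitting one token per line.

import re

def to_column_format(raw_text):
    # Remove all punctuation except dots, commas and question marks.
    cleaned = re.sub(r"[^\w\s.,?]", " ", raw_text)
    # Split the kept punctuation marks off the words they follow.
    cleaned = re.sub(r"([.,?])", r" \1 ", cleaned)
    # One token per line, as SVMTool expects.
    return "\n".join(cleaned.split())

sample = "kUddukkudiwIr thiddam wiRaivERRappadum enRAr muthalvar."
print(to_column_format(sample))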

Manual tagging

After pre-editing, the untagged corpus is tokenized into column format (Figure 5.4). In the second stage, the untagged corpus is manually POS tagged using the AMRITA tagset. Initially, 10,000 words were manually tagged. During the manual POS tagging process, great difficulties were faced while assigning tags to the corpus.

Bootstrapping

After completing the manual tagging, the tagged corpus is given to the learning component of the training algorithm to generate a model. Using the generated model, the decoder of the training algorithm tags the untagged sentences. The output of this component is a tagged corpus with some errors. The tags are then corrected manually, and the corrected corpus is added to the training corpus to increase its size.

Figure 5.4 Untagged Corpus after Pre-editing
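A compact Python sketch of this bootstrapping loop is given below; train, tag and correct stand in for SVMTlearn, SVMTagger and the manual correction step, and the stand-in implementations are placeholders written only so that the sketch runs end to end.

def bootstrap(seed, batches, train, tag, correct):
    corpus = list(seed)
    for batch in batches:
        model = train(corpus)                 # SVMTlearn
        corpus += correct(tag(model, batch))  # SVMTagger + manual fix
    return corpus

# Dummy stand-ins so the sketch runs end to end.
train = lambda corpus: dict(corpus)
tag = lambda model, batch: [(w, model.get(w, "NN")) for w in batch]
correct = lambda pairs: pairs   # manual correction in practice
print(bootstrap([("enRAr", "VF")], [["enRAr", "thiddam"]],
                train, tag, correct))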

5.4.4 Applications of POS Tagged Corpus

The POS tagged corpus is used in the following tasks.

• Chunking
• Parsing
• Information extraction and retrieval
• Machine Translation
• Tree bank creation
• Document classification
• Question answering
• Automatic dialogue system
• Speech processing
• Summarization
• Statistical training of Language models
• Machine Translation using multilingual corpora
• Text checkers for evaluating spelling and grammar
• Computer Lexicography
• Educational application like Computer Assisted Language Learning

5.4.5 Details of POS Tagged Corpus Developed

The details of the POS tagged corpus are given in the corpus statistics and tag count tables (Tables 5.2 and 5.3).

Table 5.2 Corpus Statistics

No. of sentences         45,682

No. of words             510,467

No. of distinct words    70,689

Table 5.3 Tag Count

S.No Tags Counts


1 <ADJ> 17400
2 <ADV> 22150
3 <CNJ> 6853
4 <COM> 8488
5 <COMM> 3955
6 <CRD> 12600
7 <CVB> 2315
8 <DET> 6368
9 <DOT> 43514
10 <ECH> 8
11 <EMP> 2777
12 <INT> 2289
13 <NN> 128349
14 <NNC> 74575
15 <NNP> 34594
16 <NNPC> 9042
17 <NNQ> 1911
18 <ORD> 1773
19 <PPO> 7467
20 <PRID> 399
21 <PRIN> 789
22 <PRP> 15559
23 <QM> 1615
24 <QTF> 922
25 <QW> 2389
26 <RDW> 273
27 <VAX> 7793
28 <VBG> 9294
29 <VF> 34888
30 <VINT> 11604
31 <VNAJ> 17843
32 <VNAV> 20671

5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL

5.5.1 SVMTool

This section presents the SVMTool, a simple, flexible and effective generator of sequential taggers based on Support Vector Machines (SVM), and explains how it is applied to the problem of part-of-speech tagging. This SVM-based tagger is robust and flexible in feature modeling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger [39] under exactly the same conditions, and achieves a very competitive accuracy of 94.6% for Tamil.

Generally, tagging is required to be as accurate and as efficient as possible. But there is certainly a conflict between these two desirable properties, since obtaining higher accuracy relies on processing more and more information. However, depending on the kind of application, a loss in efficiency may sometimes be acceptable in order to obtain more precise results; or, the other way around, a slight loss in accuracy may be tolerated in favor of tagging speed.

Moreover, some languages like Tamil have a richer morphology than others.
This leads the tagger to have a large set of feature patterns. Also, the tagset size and
ambiguity rate may vary from language to language and from problem to problem.
Besides, if few data are available for training, the proportion of unknown words may be
huge. Sometimes, morphological analyzers could be utilized to reduce the degree of
ambiguity when facing unknown words. Thus, a sequential tagger should be flexible
with respect to the amount of information utilized and context shape.

Another very interesting property of sequential taggers is their portability. Multilingual information is a key ingredient in NLP tasks such as Machine Translation, Information Retrieval, Information Extraction, Question Answering and Word Sense Disambiguation. Therefore, having a tagger that works equally well for several languages is crucial for system robustness. For some languages, lexical resources are hard to obtain; therefore, ideally, a tagger should be capable of learning from less annotated data. The SVMTool is intended to comply with all the requirements of modern NLP technology, combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working within the Support Vector Machines (SVM) learning framework, and by offering NLP researchers a highly customizable sequential tagger generator. The SVMTool, a language-independent sequential tagger, is here applied to Tamil POS tagging.

5.5.2 Features of SVMTool

The following are the features of the SVMTool [10].

Simplicity: The SVMTool is easy to configure and train. Learning is controlled by means of a very simple configuration file, and there are very few parameters to tune. The tagger itself is very easy to use, accepting standard input and output pipelining. Embedded usage is also supported by means of the SVMTool API.

Flexibility: The size and shape of the feature context can be adjusted. Rich features can be defined, including word and POS (tag) n-grams as well as ambiguity classes and "may be's", apart from lexicalized features for unknown words and general sentence information. The behavior at tagging time is also very flexible, allowing different strategies.

Robustness: The overfitting problem is well addressed by tuning the C parameter in the soft-margin version of the SVM learning algorithm. A sentence-level analysis may also be performed in order to maximize the sentence score. And, so that unknown words do not punish system effectiveness too severely, several strategies have been implemented and tested.

Portability: The SVMTool is language independent. It has been successfully applied to English and Spanish without a priori knowledge other than a supervised corpus. Moreover, for languages in which labeled data is a scarce resource, the SVMTool may also learn from unsupervised data based on the role of non-ambiguous words, with the only additional help of a morpho-syntactic dictionary.

Accuracy: Compared to state-of-the-art POS taggers reported to date, it exhibits a very competitive accuracy. Clearly, rich feature sets allow most of the information involved to be modeled very precisely. Also, the learning paradigm, SVM, is highly suited to working accurately and efficiently with high-dimensionality feature spaces.

Efficiency: Performance at tagging time depends on the feature set size and the tagging scheme selected. For the default (one-pass, left-to-right, greedy) tagging scheme, it exhibits a tagging speed of 1,500 words/second, whereas the C++ version achieves over 10,000 words/second. This has been achieved by working in the primal formulation of SVM. The use of linear kernels makes the tagger perform more efficiently at both tagging and learning time, but forces the user to define a richer feature space.

5.5.3 Components of SVMTool

The SVMTool [12] software package consists of three main components, namely the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval). Prior to tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component. Different models are learned for different strategies. Then, at tagging time, using the SVMTagger component, one may choose the tagging strategy most suitable for the purpose at hand. Finally, given a correctly annotated corpus and the corresponding SVMTool predicted annotation, the SVMTeval component reports the tagging results.

5.5.3.1 SVMTlearn

Given a set of examples (either annotated or unannotated) for training, SVMTlearn trains a set of SVM classifiers. To do so, it makes use of SVMlight, an implementation of Vapnik's SVMs in C developed by Thorsten Joachims (2002). This SVMlight implementation has been used to train the models.

Training Data Format

Training data must be in column format, i.e. one token per line, sentence by sentence. The column separator is the blank space. The word is the first column of the line, and the tag to predict takes the second column in the output; the rest of the line may contain additional information. An example is given below in Figure 5.5. No special '<EOS>' mark is employed for sentence separation; sentence punctuation is used instead, i.e. the symbols [.!?] are taken as unambiguous sentence separators. In this system, the symbols [.?] are used as sentence separators.

Figure 5.5 Training Data Format
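Since the original figure does not reproduce here, a hypothetical fragment of such training data (tokens and tags borrowed from the worked example in Section 5.5.3.2) illustrates the column format, one token per line with the tag in the second column:

kUddukkudiwIr <NNC>
thiddam <NNC>
wiRaivERRappadum <VF>
enRAr <VF>
muthalvar <NN>
. <DOT>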

Known words features


C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1)
C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-
1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1)
C(1;0) k(0) k(1) k(2) m(0) m(1) m(2)

Unknown words features


C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1)
C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-
1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1)
C(1;0) k(0) k(1) k(2) m(0) m(1) m(2) a(2) a(3) a(4) a(5) a(6) a(7) a(8)
a(9) a(10) a(11) a(12) a(13) a(14) a(15) z(2) z(3) z(4) z(5) z(6) z(7)
z(8) z(9) z(10) z(11) z(12) z(13) z(14) z(15) ca(1) cz(1) L SN CP CN

Models

Five different kinds of models have been implemented in this tool. Models 0, 1 and 2 differ only in the features they consider. Models 3 and 4 are just like Model 0 with respect to feature extraction, but their examples are selected in a different manner. Model 3 is for unsupervised learning: given an unlabeled corpus and a dictionary, at learning time it can count only on knowing the ambiguity class, with POS information available only for unambiguous words. Model 4 achieves robustness by simulating unknown words in the learning context at training time.

Model 0: This is the default model, in which the unseen context remains ambiguous. It was designed with the one-pass on-line tagging scheme in mind, i.e. the tagger goes either left-to-right or right-to-left making decisions, so past decisions feed future ones in the form of POS features. At tagging time, only the parts of speech of already disambiguated tokens are considered; for the unseen context, ambiguity classes are considered instead (Table 5.4).

Model 1: This model considers the unseen context already disambiguated in a previous step. It is therefore intended for a second pass, revisiting and correcting already tagged text (Table 5.5).

Model 2: This model does not consider POS features at all for the unseen context. It is designed to work in a first pass, with Model 1 reviewing the tagging results in a second pass (Table 5.6).

Model 3: The training is based on the role of unambiguous words: linear classifiers are trained with examples of unambiguous words extracted from an unannotated corpus, so less POS information is available. The only additional resource required is a morpho-syntactic dictionary.

Model 4: Errors caused by unknown words at tagging time punish the system severely. To reduce this problem, during learning some words are artificially marked as unknown in order to learn a more realistic model. The process is very simple: the corpus is divided into a number of folders, and before samples are extracted from each folder, a dictionary is generated from the rest of the folders. Thus, words appearing in one folder but not in the rest are unknown words to the learner.
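The following Python fragment is a rough sketch of this idea (an illustration, not SVMTlearn's actual code): a word occurring in one folder but absent from the dictionary built on the remaining folders is treated as unknown.

def simulate_unknown(folds):
    for i, fold in enumerate(folds):
        # Dictionary built from all folders except the current one.
        seen_elsewhere = {w for j, f in enumerate(folds) if j != i
                          for (w, _) in f}
        for word, tag in fold:
            yield word, tag, word not in seen_elsewhere

folds = [[("thiddam", "NN"), ("enRAr", "VF")],
         [("thiddam", "NN"), ("muthalvar", "NN")]]
for w, t, unk in simulate_unknown(folds):
    print(w, t, "UNKNOWN" if unk else "known")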

Table 5.4 Example of Suitable POS Features for Model 0

Ambiguity classes      a0, a1, a2
May_be's               m0, m1, m2
POS features           p-3, p-2, p-1
POS bigrams            (p-2, p-1), (p-1, a+1), (a+1, a+2)
POS trigrams           (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
Single characters      ca(1), cz(1)
Prefixes               a(2), a(3), a(4)
Suffixes               z(2), z(3), z(4)
Lexicalized features   SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info          punctuation ('.', '?', '!')

Table 5.5 Example of Suitable POS Features for Model 1

Ambiguity classes      a0, a1, a2
May_be's               m0, m1, m2
POS features           p-2, p-1, p+1, p+2
POS bigrams            (p-2, p-1), (p-1, p+1), (p+1, p+2)
POS trigrams           (p-2, p-1, a0), (p-2, p-1, p+1), (p-1, a0, p+1), (p-1, p+1, p+2)
Single characters      ca(1), cz(1)
Prefixes               a(2), a(3), a(4)
Suffixes               z(2), z(3), z(4)
Lexicalized features   SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info          punctuation ('.', '?', '!')

Table 5.6 Example of Suitable POS Features for Model 2

Ambiguity classes      a0
May_be's               m0
POS features           p-2, p-1
POS bigrams            (p-2, p-1)
POS trigrams           (p-2, p-1, a0)
Single characters      ca(1), cz(1)
Prefixes               a(2), a(3), a(4)
Suffixes               z(2), z(3), z(4)
Lexicalized features   SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info          punctuation ('.', '?', '!')

SVMTlearn for Tamil POS Tagging

The SVMTlearn is the primary component of SVMTool. It is used to train on the tagged corpus using SVMlight, and runs on the Linux operating system. A POS tagged corpus is required for training. If there is enough data, it is good practice to split it into three working sets (training, validation and test), which allows the system to be trained, tuned and evaluated. With less data one can still train, tune and test the system through cross-validation, but accuracy and efficiency will suffer.

In this component, the input is the POS tagged training corpus. The training corpus is given to the SVMTlearn component, which trains the model using the features given in the configuration file; the features are defined in the configuration file based on the Tamil language. The outputs of SVMTlearn are the dictionary file and the merged files for unknown and known words (for all models). Each merged file contains all the features of known words and unknown words (Figure 5.6).

[Figure: the tagged Tamil corpus is fed to SVMTlearn, which outputs the dictionary, the merged models for known and unknown words, and the feature files.]

Figure 5.6 Implementation of SVMTlearn

Example for training output of SVMTlearn


---------------------------------------------------------------------
-------------------
SVMTool v1.3
(C) 2006 TALP RESEARCH CENTER.
Written by Jesus Gimenez and Lluis Marquez.
---------------------------------------------------------------------
-------------------
TRAINING SET = /media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN
---------------------------------------------------------------------
-------------------
---------------------------------------------------------------------
-------------------
DICTIONARY <TAMIL_CORPUS.DICT> [31605 words]
*********************************************************************
************************
BUILDING MODELS... [MODE = 0 :: DIRECTON = LR]
*********************************************************************
************************

*********************************************************************
************************
C-PARAMETER TUNING by 10-fold CROSS-VALIDATION
on </media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN>
on <MODE 0> <DIRECTION LR> [KNOWN]
C-RANGE = [0.01..1] :: [log] :: #LEVELS = 3 :: SEGMENTATION RATIO =
10
*********************************************************************
************************
=====================================================================
========================
LEVEL = 0 :: C-RANGE = [0.01..1] :: FACTOR = [* 10 ]
=====================================================================
========================
---------------------------------------------------------------------
------------------------
******************************** level - 0 : ITERATION 0 - C = 0.01 -
[M0 :: LR]
---------------------------------------------------------------------
------------------------

TEST ACCURACY: 90.6093%

KNOWN[92.886% ] AMBIG.KNOWN [ 83.3052% ] UNKNOWN [ 78.5781% ]

TEST ACCURACY: 90.392%

KNOWN [ 92.6809% ] AMBIG.KNOWN [ 82.838% ] UNKNOWN [ 78.0815% ]

TEST ACCURACY: 90.1015%

KNOWN [ 92.6128% ] AMBIG.KNOWN [ 83.4766% ] UNKNOWN [ 77.5075% ]

TEST ACCURACY: 89.7127%

KNOWN [ 92.0721% ] AMBIG.KNOWN [ 81.8731% ] UNKNOWN [ 77.5281% ]

TEST ACCURACY: 90.7699%

KNOWN [ 92.7304% ] AMBIG.KNOWN [ 83.4785% ] UNKNOWN [ 80.3874% ]

TEST ACCURACY: 89.8988%

KNOWN [ 92.3462% ] AMBIG.KNOWN [ 81.5675% ] UNKNOWN [ 77.286% ]

TEST ACCURACY: 90.8836%

KNOWN [ 92.9671% ] AMBIG.KNOWN [ 83.6309% ] UNKNOWN [ 79.5591% ]

TEST ACCURACY: 89.9724%

KNOWN [ 92.4002% ] AMBIG.KNOWN [ 82.1854% ] UNKNOWN [ 77.664% ]

TEST ACCURACY: 90.2643%

KNOWN [ 92.5675% ] AMBIG.KNOWN [ 83.0289% ] UNKNOWN [ 78.0907% ]

TEST ACCURACY: 90.7494%

KNOWN [ 92.7798% ] AMBIG.KNOWN [ 82.7494% ] UNKNOWN [ 79.8929% ]

OVERALL ACCURACY [Ck = 0.01 :: Cu = 0.07975] : 90.33539%

KNOWN [ 92.6043% ] AMBIG.KNOWN [ 82.81335% ] UNKNOWN [ 78.45753% ]

MAX ACCURACY -> 90.33539 :: C-value = 0.01 :: depth = 0 :: iter = 1

---------------------------------------------------------------------
----
******************************** level - 0 : ITERATION 1 - C = 0.1 -
[M0 :: LR]
---------------------------------------------------------------------
----
TEST ACCURACY: 91.7702%
KNOWN [ 94.2402% ] AMBIG.KNOWN [ 87.5492% ] UNKNOWN [ 78.7175% ]
TEST ACCURACY: 91.8881%
KNOWN [ 94.4737% ] AMBIG.KNOWN [ 88.4324% ] UNKNOWN [ 77.9821% ]
TEST ACCURACY: 91.3219%
KNOWN [ 94.0596% ] AMBIG.KNOWN [ 88.0441% ] UNKNOWN [ 77.5928% ]
TEST ACCURACY: 91.0615%
KNOWN [ 93.6037% ] AMBIG.KNOWN [ 86.6795% ] UNKNOWN [ 77.9326% ]
TEST ACCURACY: 92.0852%
KNOWN [ 94.2575% ] AMBIG.KNOWN [ 88.3275% ] UNKNOWN [ 80.5811% ]
TEST ACCURACY: 91.3927%
KNOWN [94.1299% ] AMBIG.KNOWN [ 87.4226% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 91.9891%
KNOWN[94.2944% ] AMBIG.KNOWN [ 88.0182% ] UNKNOWN [ 79.4589% ]
TEST ACCURACY: 91.3063%
KNOWN[93.9605% ] AMBIG.KNOWN [ 87.1258% ] UNKNOWN [ 77.8502% ]

TEST ACCURACY: 91.3654%
KNOWN[93.8499% ] AMBIG.KNOWN [87.2127% ] UNKNOWN [ 78.2339% ]
TEST ACCURACY: 91.8693%
KNOWN [ 94.1% ] AMBIG.KNOWN [ 87.0546% ] UNKNOWN [ 79.9416% ]
OVERALL ACCURACY [Ck = 0.1 :: Cu = 0.07975] : 91.60497%
KNOWN [ 94.09694% ] AMBIG.KNOWN [ 87.58666% ] UNKNOWN [ 78.55767% ]
MAX ACCURACY -> 91.60497 :: C-value = 0.1 :: depth = 0 :: iter = 2

5.5.3.2 SVMTagger

Given a text corpus (one token per line) and the path to a previously learned SVM model (including the automatically generated dictionary), SVMTagger performs the POS tagging of a sequence of words. Tagging proceeds on-line, based on a sliding window which gives a view of the feature context to be considered at every decision. In any case, there are two important concepts to be considered:

• Example generation

• Feature extraction

Example generation: This step defines what an example is, according to the concept the machine is to learn. For instance, in POS tagging, the machine has to classify words correctly according to their POS. Thus, every tagged word generates a positive example for its own POS class and a negative example for each of the other classes, so every sentence may generate a large number of examples.
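A toy sketch of this one-vs-all example generation (the tag list is shortened for illustration):

TAGS = ["NN", "NNC", "VF", "VNAJ"]

def examples(word, gold_tag):
    # One positive example for the gold class, negatives for the rest.
    return [(word, tag, +1 if tag == gold_tag else -1) for tag in TAGS]

print(examples("thiddam", "NN"))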

Feature Extraction: The set of features to be used by the algorithm has to be defined. For instance, POS tags may be guessed according to the preceding and following words. Every example is then represented by a set of active features, and these representations are the input for the SVM classifiers. To study the working of SVMTool, it is necessary to run SVMTlearn (Perl version). Setting the REMOVE_FILES option (in the configuration file) to 0 keeps the intermediate files; if 1 is given, all intermediate files are removed.

Feature extraction is performed by the sliding window object. A sliding window works on a very local context (as defined in the CONFIG file), usually a 5-word context [-2, -1, 0, +1, +2], with the current word under analysis at the core position.

Taking this context into account, a number of features may be extracted. The feature set depends on how the tagger is going to proceed later, i.e., on the context and information that will be available at tagging time. Generally, all the words are known before tagging, but the POS tag is available only for some words (those already tagged).

In the tagging stage, if the input word is known and ambiguous, the word is tagged (i.e., classified), and the predicted tag feeds forward into the next decisions. This is done in the "sub classify_sample_merged()" subroutine in the SVMTAGGER file. In order to speed up SVM classification, the mapping and the SVM weights and biases are merged into a single file. Therefore, when a new example is to be tagged, the tagger just accesses the merged model and, for every active feature, retrieves the associated weight. Then, for every possible tag, the bias is also retrieved. Finally, the SVM classification rule (i.e., scalar product + bias) is applied.

Example:

கூட்டுக்குடிநீர் திட்டம் நிறைவேற்றப்படும் என்றார் முதல்வர்

kUddukkudiwIr thiddam wiRaivERRappadum enRAr muthalvar

Tags:

<NNC> <NNC> <VNAJ>/<VF> (ambiguity) <VF> <NN>

For tagging this sentence, first take the active features, such as w-1 and w+1, i.e., predict POS tags based only on the preceding and following words. Here the ambiguous word is "wiRaivERRappadum", and its correct tag is <VF>. With w0, the current word, being "wiRaivERRappadum", the active features are "w-1 is thiddam" and "w+1 is enRAr".

While applying the SVM classification rule for a given POS tag, it is necessary to go to the merged model and retrieve the weights for these features, and the bias (first line after the header, beginning with "BIASES") corresponding to the given POS. For instance, consider this ".MRG" file:

BIASES <ADJ>:0.37059487 <ADV>:-0.19514606 <CNJ>:0.43007979
<COM>:-0.037037037 <CRD>:0.55448766 <CVB>:-0.19911161 <DET>:-1.1815452
<EMP>:-0.86491783 <INT>:0.61775334 <NN>:-0.21980137 <NNC>:1.3656117
<NNP>:0.072242349 <NNPC>:0.7906585 <NNQ>:0.44012828 <ORD>:0.30304924
<PPO>:-0.2182171 <PRI>:0.89491131 <PRID>:-0.15550162 <PRIN>:0.56913633
<PRP>:0.35316978 <QW>:0.039121434 <RDW>:0.84771943 <VAX>:0.041690388
<VBG>:0.23199934 <VF>:0.33486366 <VINT>:0.0048185684 <VNAJ>:0.42063524
<VNAV>:0.18009116

C0~-1:thiddam <CRD>:0.00579042912371902 <NN>:0.532690716551973


<NNC>:-0.508699048073652 <ORD>:-0.000698015879911668
<VBG>:0.142313085089229 <VF>:0.296699729267891 <VNAJ>:-0.32
C0~1:enRAr <VAX>:0.132726597682121 <VF>:0.66667135122578
<VNAJ>:-0.676332541749603

The SVM score for "wiRaivERRappadum" being <VNAJ> is:

weight("w-1:thiddam", "VNAJ") + weight("w+1:enRAr", "VNAJ") - bias("VNAJ")
= (-0.32) + (-0.676332541749603) - (0.42063524) = -1.416967781749603

The SVM score for "wiRaivERRappadum" being <VF> is:

weight("w-1:thiddam", "VF") + weight("w+1:enRAr", "VF") - bias("VF")
= (0.296699729267891) + (0.66667135122578) - (0.33486366) = 0.628507420493671

Here, the SVM score for <VF> is higher than that for <VNAJ>, so the tag VF is assigned to the word "wiRaivERRappadum".
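The same computation can be replayed in a few lines of Python; the weights and biases below are copied from the ".MRG" excerpt, and the score follows the convention used above (scalar product minus bias).

weights = {
    ("w-1:thiddam", "VNAJ"): -0.32,
    ("w+1:enRAr",   "VNAJ"): -0.676332541749603,
    ("w-1:thiddam", "VF"):    0.296699729267891,
    ("w+1:enRAr",   "VF"):    0.66667135122578,
}
biases = {"VNAJ": 0.42063524, "VF": 0.33486366}

def score(tag, active_features):
    # Sum the weights of the active features, then subtract the bias.
    s = sum(weights.get((f, tag), 0.0) for f in active_features)
    return s - biases[tag]

features = ["w-1:thiddam", "w+1:enRAr"]
for tag in ("VNAJ", "VF"):
    print(tag, score(tag, features))
# VF gets the higher score, matching the hand computation.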

Computed part-of-speech tags feed directly forward into subsequent tagging decisions as context features. The SVMTagger component works on standard input/output. It processes a one-token-per-line corpus in a sentence-by-sentence fashion. The token is expected to be the first column of the line, and the predicted tag takes the second column in the output; the rest of the line remains unchanged. Lines beginning with '##' are ignored by the tagger. Figure 5.7 is an example of an input file; SVMTagger considers only its first column. Figure 5.8 shows an example of the output file.

Figure 5.7 Example Input Figure 5.8 Example Output

SVMTagger for Tamil

In the SVMTagger component, the important options are the strategies and the backup lexicon. It is important to choose the tagging strategy that is going to be used; this may depend, for instance, on efficiency requirements. If tagging must be as fast as possible, then strategies 1, 5 and 6 should be avoided, because strategy 1 goes in two passes and strategies 5 and 6 perform sentence-level tagging. Strategy 3 is only for unsupervised learning (no hand-annotated data is needed). To choose among strategies 0, 2 and 4, the best solution is to try them all; if unknown words are expected at tagging time, strategies 2 and 4 are more robust than strategy 0. In the absence of speed requirements or information about future data, tagging strategies 4 and 6 systematically show the best results.

The format of the backup lexicon file is the same as the dictionary format, so a Perl program can be used to convert a tagged corpus into dictionary format. Tagging is complex for open tag categories, and the main difficulty in POS tagging is tagging proper nouns. For English, capitalization helps in tagging proper noun words, but this is not possible in Tamil; therefore, a large backup lexicon with proper nouns is provided to the system. A large dataset of proper nouns (Indian place and person names) was collected and given as input to the morphological generator (via a Perl program). The morphological generator generates nearly twelve inflections for every proper noun. This new dataset was converted into the SVMTool dictionary format and given to SVMTagger as a backup lexicon. Figure 5.9 shows the steps in the implementation of SVMTagger for Tamil. The input to the system is an untagged, cleaned Tamil corpus and the output is a tagged (annotated) corpus. Supporting files are the training corpus, dictionary file, merged models for unknown and known words, and the backup lexicon.
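A sketch of such a conversion is shown below. The exact field layout of the SVMTool .DICT file should be taken from the SVMTool manual; this illustration simply emits a plausible "word total tag1 count1 tag2 count2 ..." line per word.

from collections import defaultdict

def build_lexicon(lines):
    # Count tag occurrences per word from "word tag" lines.
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if line.strip():
            word, tag = line.split()[:2]
            counts[word][tag] += 1
    for word, tags in counts.items():
        total = sum(tags.values())
        pairs = " ".join(f"{t} {n}" for t, n in sorted(tags.items()))
        yield f"{word} {total} {pairs}"

corpus = ["thiddam NN", "thiddam NN", "enRAr VF"]
print("\n".join(build_lexicon(corpus)))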

[Figure: the untagged Tamil corpus, backup lexicon, training data, dictionary, features, strategies and merged model from SVMTlearn feed SVMTagger, which outputs the tagged Tamil corpus.]
Figure 5.9 Implementation of SVMTagger

5.5.3.3 SVMTeval

Given an SVMTool predicted tagging output and the corresponding gold standard, SVMTeval evaluates the performance in terms of accuracy. It is a very useful component for tuning system parameters, such as the C parameter, the feature patterns and filtering, the model compression, etc. Based on a given morphological dictionary (e.g., the one automatically generated at training time), results may also be presented for different sets of words (known vs. unknown words, ambiguous vs. unambiguous words). The same results can also be viewed from the class-of-ambiguity perspective, i.e., words sharing the same kind of ambiguity may be considered together. Words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can also be grouped.

Usage: SVMTeval [mode] <model> <gold> <pred>

- mode: 0 - complete report (everything)

1 - overall accuracy only [default]

2 - accuracy of known vs. unknown words

3 - accuracy per level of ambiguity

4 - accuracy per kind of ambiguity

5 - accuracy per class


- model: model name
- gold: correct tagging file
- pred: predicted tagging file
Example: SVMTeval TAMIL_CORPUS_4L TAMIL.GOLD TAMIL.OUT

SVMTeval for Tamil

SVMTeval is the last component of SVMTool. It is used to evaluate the tagger output in different modes. The main input of this component is a correctly tagged corpus, also called the gold standard (Figure 5.10).

[Figure: the correctly tagged file (gold standard), the SVM tagged output and the model name feed SVMTeval, which produces the evaluation report.]

Figure 5.10 Implementation of SVMTeval

SVMTeval report

Brief report
By default, a brief report mainly returning the overall accuracy is produced. It also provides information about the number of tokens processed, and how many were known/unknown and ambiguous/unambiguous according to the model dictionary. Results are always compared to the most-frequent-tag (MFT) baseline.


*=========================SVMTevalreport========================
======
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*
================================================================
========
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs.
<E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-
1.3\\bin\\CORPUS>...
*=================TAGGINGSUMMARY================================
=======================
#TOKENS = 1063

AVERAGE_AMBIGUITY = 6.4901 tags per token
* --------------------------------------------------------------
---------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*=================OVERALLACCURACY===============================
=======================
HITS TRIALS ACCURACY MFT
* --------------------------------------------------------------
---------------------------
1002 1063 94.2615% 71.2135%
*
================================================================
=========================

Known vs. unknown tokens

Accuracy for four different sets of words is returned. The first set is that of all known tokens, i.e. tokens seen during training. The second and third sets contain, respectively, the ambiguous and the unambiguous tokens among these known tokens. Finally, there is the set of unknown tokens, which were not seen during training.
*=========================SVMTevalreport
==============================

* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*
======================================================================
==
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test1.out> vs. <E:\\SVMTool-
1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*=================TAGGINGSUMMARY======================================
================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token

* --------------------------------------------------------------------
---------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*=================KNOWNvsUNKNOWNTOKENS================================
===============
HITS TRIALS ACCURACY
* --------------------------------------------------------------------
---------------------
*=======known=========================================================
================
816 854 95.5504%
-------- known unambiguous tokens ------------------------------------
---------------------
604 623 96.9502%
-------- known ambiguous tokens --------------------------------------
---------------------
212 231 91.7749%
*=======unknown=======================================================
=================
186 209 88.9952%
*
======================================================================
===================
*=================OVERALLACCURACY=====================================
=================
HITS TRIALS ACCURACY MFT
* --------------------------------------------------------------------
---------------------
1002 1063 94.2615% 71.2135%
*
======================================================================
===================

Level of ambiguity

This view of the results groups together all words having the same degree of
POS–ambiguity.

*=========================SVMTevalreport
==============================
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*
======================================================================
==
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test1.out> vs. <E:\\SVMTool-
1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*=================TAGGINGSUMMARY======================================
================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* --------------------------------------------------------------------
---------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*=================ACCURACY PER LEVEL OF AMBIGUITY
=======================================
#CLASSES = 5
*=====================================================================
====================
LEVEL HITS TRIALS ACCURACY MFT
*---------------------------------------------------------------------
--------------------
1 605 624 96.9551% 96.6346%
2 204 220 92.7273% 66.8182%
3 7 9 77.7778% 66.6667%
4 2 3 66.6667% 33.3333%
28 184 207 88.8889% 0.0000%
*=================OVERALLACCURACY=====================================
=================
HITS TRIALS ACCURACY MFT
*---------------------------------------------------------------------
--------------------
1002 1063 94.2615% 71.2135%

*
======================================================================
===================

Kind of ambiguity

This view is much finer. Every class of ambiguity is studied separately.


*=========================SVMTevalreport
==============================
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*
======================================================================
==
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs. <E:\\SVMTool-
1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*=================TAGGINGSUMMARY======================================
================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* --------------------------------------------------------------------
---------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*=================ACCURACY PER CLASS OF AMBIGUITY
=======================================
#CLASSES = 55
*
======================================================================
===================
CLASS HITS TRIALS ACCURACY MFT
----------------------------------------------------------------------
-------------------
<ADJ> 28 28 100.0000% 100.0000%

<ADJ>_<ADV>_<CNJ>_<COM>_<CRD>_<CVB>_<DET>_<ECH>_<INT>_<NN>_<NNC>_<NNP>
_<NNPC>_<NNQ>_<ORD>_<PPO>_<PRID>_<PRIN>_<PRP>_<QTF>_<QW>_<RDW>_<VAX>_<

VBG>_<VF>_<VINT>_<VNAJ>_<VNAV> 184 207
88.8889% 0.0000%

<ADJ>_<NN> 1 1 100.0000% 100.0000%


<ADJ>_<VNAJ> 1 1 100.0000% 100.0000%
<ADV> 31 32 96.8750% 96.8750%
<ADV>_<NN>_<PPO> 1 2 50.0000% 50.0000%
<ADV>_<RDW> 2 2 100.0000% 100.0000%
<CNJ> 20 20 100.0000% 100.0000%
<CNJ>_<PPO> 1 1 100.0000% 100.0000%
<COM> 17 17 100.0000% 100.0000%
<COMM> 49 49 100.0000% 100.0000%
<CRD> 23 23 100.0000% 95.6522%
<CRD>_<DET> 22 22 100.0000% 100.0000%
<CRD>_<NN>_<NNC>_<PPO> 2 2 100.0000% 50.0000%
<CRD>_<ORD> 2 2 100.0000% 0.0000%
<CVB> 6 6 100.0000% 100.0000%
<CVB>_<VF> 0 1 0.0000% 0.0000%
<DET> 14 14 100.0000% 100.0000%
<DOT> 77 77 100.0000% 100.0000%
<EMP>_<PRP> 1 1 100.0000% 100.0000%
<INT> 6 6 100.0000% 100.0000%
<INT>_<NN> 0 1 0.0000% 0.0000%
<NN> 81 91 89.0110% 89.0110%
<NN>_<NNC> 148 161 91.9255% 58.3851%
<NN>_<NNC>_<NNPC> 1 1 100.0000% 100.0000%
<NN>_<PRID>_<PRIN>_<VNAJ>0 1 0.0000% 0.0000%
<NN>_<VBG> 1 1 100.0000% 100.0000%
<NN>_<VF> 1 1 100.0000% 100.0000%
<NNC> 46 47 97.8723% 97.8723%
<NNC>_<NNP>_<NNPC> 4 5 80.0000% 80.0000%
<NNC>_<VAX> 1 1 100.0000% 100.0000%
<NNC>_<VF>_<VNAJ> 1 1 100.0000% 0.0000%
<NNP> 34 37 91.8919% 91.8919%
<NNP>_<NNPC> 0 1 0.0000% 0.0000%
<NNPC> 0 1 0.0000% 0.0000%
<NNQ> 4 4 100.0000% 100.0000%
<ORD> 3 3 100.0000% 66.6667%
<PPO> 6 6 100.0000% 100.0000%
<PPO>_<VNAV> 1 1 100.0000% 100.0000%
<PRID> 2 2 100.0000% 100.0000%

<PRIN> 2 2 100.0000% 100.0000%
<PRP> 33 33 100.0000% 100.0000%
<QM> 4 4 100.0000% 100.0000%
<QTF> 5 5 100.0000% 100.0000%
<QW> 4 4 100.0000% 100.0000%
<RDW> 1 1 100.0000% 100.0000%
<VAX> 7 7 100.0000% 100.0000%
<VAX>_<VF> 8 8 100.0000% 100.0000%
<VBG> 11 11 100.0000% 100.0000%
<VBG>_<VF> 2 2 100.0000% 100.0000%
<VF> 39 40 97.5000% 97.5000%
<VF>_<VNAJ> 12 12 100.0000% 91.6667%
<VINT> 15 16 93.7500% 93.7500%
<VNAJ> 20 20 100.0000% 100.0000%
<VNAV> 17 18 94.4444% 94.4444%
*=================OVERALLACCURACY=====================================
=================
HITS TRIALS ACCURACY MFT
* --------------------------------------------------------------------
---------------------
1002 1063 94.2615% 71.2135%
*
======================================================================
=========

Class

Every class is studied individually.


*=========================SVMTevalreport========================
=====
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*===============================================================
=========
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs.
<E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-
1.3\\bin\\CORPUS>...
*=================TAGGINGSUMMARY================================
=======================

#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* --------------------------------------------------------------
---------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER PART-OF-SPEECH
===========================================
POS HITS TRIALS ACCURACY MFT
* --------------------------------------------------------------
---------------------------
<ADJ> 30 31 96.7742% 90.3226%
<ADV> 47 48 97.9167% 70.8333%
<CNJ> 21 21 100.0000% 95.2381%
<COM> 17 17 100.0000% 100.0000%
<COMM> 49 49 100.0000% 100.0000%
<CRD> 26 26 100.0000% 84.6154%
<CVB> 7 8 87.5000% 75.0000%
<DET> 36 36 100.0000% 100.0000%
<DOT> 77 77 100.0000% 100.0000%
<EMP> 1 1 100.0000% 100.0000%
<INT> 6 7 85.7143% 85.7143%
<NN> 243 259 93.8224% 57.9151%
<NNC> 145 162 89.5062% 46.2963%
<NNP> 43 44 97.7273% 86.3636%
<NNPC> 0 16 0.0000% 0.0000%
<NNQ> 4 4 100.0000% 100.0000%
<ORD> 2 2 100.0000% 100.0000%
<PPO> 9 9 100.0000% 100.0000%
<PRID> 2 3 66.6667% 66.6667%
<PRIN> 2 2 100.0000% 100.0000%
<PRP> 34 34 100.0000% 97.0588%
<QM> 4 4 100.0000% 100.0000%
<QTF> 5 5 100.0000% 100.0000%
<QW> 6 6 100.0000% 66.6667%

<RDW> 1 1 100.0000% 100.0000%
<VAX> 18 18 100.0000% 77.7778%
<VBG> 20 22 90.9091% 54.5455%
<VF> 68 68 100.0000% 66.1765%
<VINT> 16 18 88.8889% 83.3333%
<VNAJ> 41 42 97.6190% 69.0476%
<VNAV> 22 23 95.6522% 73.9130%
*=================OVERALLACCURACY===============================
======================
HITS TRIALS ACCURACY MFT
* --------------------------------------------------------------
---------------------------
1002 1063 94.2615% 71.2135%*
================================================================
=========================

5.6 RESULTS AND COMPARISON WITH OTHER TOOLS

Apart from SVMTool, three other taggers, namely TnT [39], MBT [60] and WEKA [168], were trained on the same corpus. The accuracy of SVMTool is compared with these tools on the same test corpus. A brief description of the above-mentioned taggers follows.

TnT (Trigrams'n'Tags) is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second-order Markov models. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words [39].

MBT (Memory Based Tagger) is an approach to POS tagging based on memory-based learning. It is an extension of the classical k-Nearest Neighbor (k-NN) approach to statistical pattern classification. All instances are fully stored in memory, and classification involves a pass over all stored instances. The approach is based on the assumption that reasoning rests on the direct reuse of stored experiences rather than on the application of knowledge (such as rules or decision trees) abstracted from experience. Hence, the tagging accuracy for unknown words is low [60].

WEKA is a collection of machine learning algorithms for solving real-world data mining problems. The J48 classifier was used for the implementation of Tamil POS tagging. All three tools were trained on the same corpus as SVMTool, with the same data format in all cases [168].

The experiments were conducted on our tagged corpus, which was divided into a training set and a test set. SVMTool obtained 94.6% overall accuracy, much higher than the other taggers. POS tagging using MBT gave a very low accuracy (65.65%) for unknown words, since the algorithm is based on direct reuse of stored experiences. Ambiguous words were handled poorly by TnT, whereas WEKA gave a high accuracy of 90.11% for ambiguous words (Table 5.7). Though SVMTool gave very high accuracy in all cases, its training time was significantly higher than that of the other tools. The unknown-word accuracy of SVMTool is 86.25%, and accuracy drops for some specific tags. The accuracy results of SVMTool compared with the other tools on the same corpus are given in Table 5.7.

Table 5.7 Comparison of Accuracies

                    WEKA      MBT       TnT       SVMTool
Known               --        88.45%    92.54%    96.74%
Ambiguous           90.11%    80.23%    78.72%    94.57%
Unknown             --        65.65%    74.18%    86.25%
Overall accuracy    --        78.48%    89.56%    94.6%

5.7 ERROR ANALYSIS

A detailed error analysis was conducted to identify mistagged tokens. About 1,200 untagged sentences (10k words) were taken for testing the system. For analyzing the errors, the 8 tags on which errors occurred most frequently were considered. The tags and their trials and errors are shown in Table 5.8; for instance, the tagger failed to identify the CRD tag in 30 of its 642 occurrences.

Table 5.9 shows the confusion matrix for these 8 POS tags. This matrix shows the performance of the tagger.

Table 5.8 Trials and Error

Tags      Trials    Hits    Errors


CRD 642 612 30
NN 4200 3989 211
NNC 2317 2264 53
NNP 1768 1721 47
NNPC 47 33 14
ORD 32 22 10
VBG 274 258 16
VNAJ 682 662 20

Table 5.9 Confusion Matrix

CRD NN NNC NNP NNPC ORD VBG VNAJ O


CRD 0.953 0.019 0 0 0 0.028 0 0 0
NN 0 0.95 0.016 0.003 0.001 0.001 0.01 0.003 0.016
NNC 0.001 0.018 0.977 0.002 0.001 0 0 0 0.001
NNP 0.001 0.013 0.001 0.973 0.007 0 0.001 0 0.004
NNPC 0 0.106 0.106 0.085 0.702 0 0 0 0
ORD 0.125 0.063 0 0 0 0.688 0 0 0.125
VBG 0 0.04 0 0.004 0 0 0.942 0 0.015
VNAJ 0.001 0.004 0 0 0 0 0.007 0.971 0.016
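A row-normalized confusion matrix like Table 5.9 can be computed from parallel lists of gold and predicted tags; the sketch below uses toy data for illustration.

from collections import Counter, defaultdict

def confusion(gold, pred):
    # Count predictions per gold tag, then normalize each row.
    rows = defaultdict(Counter)
    for g, p in zip(gold, pred):
        rows[g][p] += 1
    return {g: {p: n / sum(c.values()) for p, n in c.items()}
            for g, c in rows.items()}

gold = ["NN", "NN", "NN", "NNC", "CRD"]
pred = ["NN", "NNC", "NN", "NNC", "CRD"]
for tag, row in confusion(gold, pred).items():
    print(tag, row)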

5.8 SUMMARY

This chapter gave the details of the development of the POS tagger and the tagged corpora. Part of speech tagging plays an important role in various speech and language processing applications, and many statistical tools are currently available for it. The SVMTool has already been successfully applied to English and Spanish POS tagging, exhibiting state-of-the-art performance (97.16% and 96.89%, respectively); in both cases, the results clearly outperform the HMM-based TnT part-of-speech tagger. For Tamil, an accuracy of 94.6% has been obtained. Any language can be trained easily using the existing statistical tagger tools, and this POS tagging work can be extended to other languages. The main obstacle for POS tagging of Indian languages is the lack of annotated (tagged) corpora; here, a POS-annotated corpus of 45k sentences (5 lakh words) was developed to train the POS tagger.

CHAPTER 6
MORPHOLOGICAL ANALYZER FOR TAMIL
6.1 GENERAL

A morphological analyzer is one of the most important basic tools in the automatic processing of any human language. It analyzes the naturally occurring word forms in a sentence and identifies the root word and its features. In spite of their significance, the Dravidian languages do not have morphological analyzers available in the public domain. The absence of such a tool severely impedes the development of language technologies and applications, such as natural language interfaces and machine translation, in these languages. In this thesis, a Tamil morphological analyzer is used for preprocessing Tamil sentences.

6.1.1 Morphology in Language

The grammar of any language can be broadly divided into morphology and syntax. The term morphology was coined by August Schleicher in 1859. Morphology deals with words and their construction; syntax deals with how words are put together in some order to make meaningful sentences. Morphology is the field within linguistics that studies the internal structure of words. While words are generally accepted as being the smallest units of syntax, it is clear that in most languages words can be related to other words by rules, and morphology attempts to formulate rules that model the knowledge of the speakers of those languages. Morphemes are the smallest elements of which words are built. Two broad classes of morphemes are stems and affixes; affixes are morphemes added to a base to express grammatical relations between words. Morphemes can either be free (they can stand alone, i.e. be words in their own right), e.g. dog, or bound (they must occur as part of a word), e.g. the plural suffix -s in dogs.

6.1.2 Computational Morphology

Computational morphology deals with developing theories and techniques for the computational analysis and synthesis of word forms. Through computational analysis of morphology, any information encoded in a word can be extracted and exposed so that later layers of processing can make use of it.

6.1.3 Morphological Analyzer

Morphological analysis segments words into lemmas and morpho-lexical information. It is a primary step for various types of text analysis in any language. Morphological analyzers are used in search engines for retrieving documents matching a keyword [169], and they increase the recall of search engines. They are also used in speech recognition, lemmatization, information retrieval/extraction, summarization, spell and grammar checking, and machine translation.

A word is defined as a sequence of characters delimited by spaces, punctuation marks, etc. in written text. There is no difficulty in identifying words in written text entered into the computer, because one simply has to look for the delimiters. A word can be of two types: simple and compound. A simple word consists of a root or stem together with suffixes or prefixes. A compound word (also called a conjoined word) can be broken up into two or more independent words; each constituent of a compound word is either a compound word or a simple word and may be used independently as a word. On the other hand, the root and the affixes, which are the constituents of a simple word, are not all independent words and cannot occur as separate words in the text.

Constituents of a simple word are called morphemes or meaning units. The overall meaning of a simple word comes from its morphemes and their relationships. Similarly, in the case of a compound word, its meaning follows from its constituent words and their inter-relationships. It should be noted that a pragmatic position has been taken regarding words: anything identifiable using the delimiters is a word. This is a convenient position to take from the processing viewpoint, and the same holds for the definition of compound words.

With the above definition, an analyzer of words in a sentence does not have to do much work to identify a word: it simply has to look for the delimiters. Having identified the word, it must determine whether it is a compound word or a simple word. If it is a compound word, it must first be broken up into its constituent simple words before they are analyzed. The former task is performed by a sandhi analyzer and the latter by a morphological analyzer, both of which are important parts of a word analyzer.

The detailed linguistic analysis of a word can be useful for NLP. However, most NLP researchers have concentrated on other aspects, like grammatical analysis and semantic interpretation; as a result, NLP systems use rather simple morphological analyzers. A generator does the reverse of an analyzer: given a root and its features (or affixes), a morphological generator generates a word form. Similarly, a sandhi generator can take the output of a morphological generator and group simple words into compound words where possible.

Examples for Morphological Analysis;

books = book+Noun+PluraL(or)book+Verb+Pres+3SG.
stopping = stop+Verb+Cont
happiest = happy+Adj+Superlative
went = go+Verb+Past

Examples for Morphological Generator;

book+Noun+Plural = books
stop+Verb+Cont = stopping
happy+Adj+Superlative = happiest
go+Verb+Past = went
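
The analysis and generation examples above are inverse mappings between surface forms
and lemma+feature strings. The following is a minimal Python sketch (purely
illustrative, not the system described in this thesis) that treats a toy lexicon as a
lookup table; note how the ambiguous form "books" returns two readings.

    # A toy analyzer/generator over a hand-listed lexicon (illustrative only).
    ANALYSES = {
        "books":    ["book+Noun+Plural", "book+Verb+Pres+3SG"],  # ambiguous form
        "stopping": ["stop+Verb+Cont"],
        "happiest": ["happy+Adj+Superlative"],
        "went":     ["go+Verb+Past"],
    }

    # Generation is obtained by inverting the analysis table.
    FORMS = {a: w for w, readings in ANALYSES.items() for a in readings}

    def analyze(word):
        """Return every lemma+feature reading of a surface form."""
        return ANALYSES.get(word, [word + "+Unknown"])

    def generate(analysis):
        """Return the surface form for a lemma+feature string."""
        return FORMS.get(analysis)

    print(analyze("books"))          # ['book+Noun+Plural', 'book+Verb+Pres+3SG']
    print(generate("go+Verb+Past"))  # went

Real analyzers replace the lookup table with morphotactics and sandhi rules, since
listing every inflected form of a morphologically rich language is impractical.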

6.1.4 Role of Morphological Analyzer in NLP

The morphological analyzer plays an important role in the field of Natural Language
Processing. Figure 6.1 shows the role of the morphological analyzer in NLP. Some of
the applications are:

• Spell checker
• Search Engines
• Information extraction and retrieval
• Machine Translation system
• Grammar checker
• Content analysis
• Question Answering system

165
• Automatic sentence Analyzer
• Dialogue system
• Knowledge representation in learning
• Language Teaching
• Language based educational exercises

[Figure 6.1 shows the morphological analyzer at the center, linked to text processing tools, speech processing, machine translation systems and search engines (IR/IE).]
Figure 6.1 Role of Morphological Analyzer in NLP

6.2 TAMIL MORPHOLOGY

6.2.1 Tamil Morphology and Language

Tamil Morphology is very rich. It is an agglutinative language, like the other Dravidian
languages. Tamil words are made up of lexical roots followed by one or more affixes.
The lexical roots and the affixes are the smallest meaningful units and are called
morphemes. Tamil words are therefore made up of morphemes concatenated to one
another in a series. The first one in the construction is always a lexical morpheme
(lexical root). This may or may not be followed by other functional or grammatical
morphemes. For instance, the word புத்தகங்கள் 'puththakangkaL' in Tamil can be
meaningfully divided into புத்தகம் 'puththakam' and கள் 'kaL'.

In this example, புத்தகம் 'puththakam' is the lexical root, representing a real
world entity, and கள் 'kaL' is the plural feature marker (suffix). கள் 'kaL' is a

grammatical morpheme that is bound to the lexical root to add plurality to the lexical
root. Unlike English, Tamil words can have a large sequence of morphemes. For
instance,

புத்தகங்களை 'puththakangkaLai' = புத்தகம் 'puththakam' (book) + கள் 'kaL' (plural)
+ ஐ 'ai' (ACC case marker).

Tamil nouns can take case suffixes after the plural marker. They can also have
post positions after that. Tamil words consist of a lexical root to which one or more
affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes can be derivational
suffixes, which either changes the part of speech of the word or its meaning, or
inflectional suffixes, which mark categories such as person, number, mood, tense, etc.
Words can be analyzed as above by identifying the constituent morphemes and their
features.

6.2.2 Syntax of Tamil Morphology

Tamil is a consistently head-final language. The verb comes at the end of the clause,
with typical word order Subject Object Verb (SOV). Tamil is also a free word-order
language. Due to this relatively free word-order nature of Tamil language, the Noun
Phrase arguments before a final verb can appear in any permutation, yet it conveys the
same sense of a sentence. Tamil has postpositions rather than prepositions.
Demonstratives and modifiers precede the noun within the noun phrase. Subordinate
clauses precede the verb of the matrix clause.

Tamil is a null subject language. Not all Tamil sentences have subjects, verbs
and objects. It is possible to construct valid sentences that have only a verb, such as

முடிந்தது "mudi-wth-athu" ("completed"), or only a subject and object without a verb,

such as அது என் வீடு "athu en viitu" ("that, my house"). Tamil does not have a

copula (a linking verb equivalent to the word "is"). The word is included in the
translations only to convey the meaning more easily.

6.2.3 Word Formation Rules (WFR) in Tamil

Any new word created by Word Formation Rules (WFR) must be a member of a major
lexical category. The WFR determines the category of the output of the rule. In Tamil,
the grammatical category may change or may not change after the operation of WFR.
The following is the list of inputs and outputs of different kinds of WFRs in the
derivation of simple words in Tamil [170].

1. Noun → Noun

[ [vElai ]N + kAran ]suf ]N 'servant'

[ [வேலை ]N + காரன் ]suf ]N வேலைக்காரன்

[ [thozil ]N + ALi ]suf ]N 'laborer'

[ [தொழில் ]N + ஆளி ]suf ]N தொழிலாளி

2. Verb → Noun

[ [padi ]V + ppu ]suf ]N 'education'

[ [படி ]V + ப்பு ]suf ]N படிப்பு

[ [ezuthu- ]V + thu ]suf ]N 'letter'

[ [எழுது ]V + து ]suf ]N எழுத்து

[ [kEL ]V + vi ]suf ]N 'question'

[ [கேள் ]V + வி ]suf ]N கேள்வி

3. Adjective → Noun

[ [walla ]adj + thanam ]suf ]N 'good quality'

[ [நல்ல ]adj + தனம் ]suf ]N நல்ல தனம்

[ [periya ]adj + tu ]suf ]N 'big one'

[ [பெரிய ]adj + து ]suf ]N பெரியது

4. Noun → Verb

[ [uyir ]N + ppi ]suf ]V 'to give life'

[ [உயிர் ]N + ப்பி ]suf ]V உயிர்ப்பி

5. Adjective → Verb

[ [veLLai ]adj + aakku ]suf ]V 'to make (something) white'

[ [வெள்ளை ]adj + ஆக்கு ]suf ]V வெள்ளையாக்கு

[ [karuppu ]adj + aakku ]suf ]V 'to make (something) black'

[ [கருப்பு ]adj + ஆக்கு ]suf ]V கருப்பாக்கு

6. Verb → Verb

[ [cey ]V + vi ]suf ]V 'cause to do'

[ [செய் ]V + வி ]suf ]V செய்வி

[ [wada ]V + thth-u ]suf ]V 'cause to walk'

[ [நட ]V + த்து ]suf ]V நடத்து

[ [vidu ]V + vi ]suf ]V 'to liberate'

[ [விடு ]V + வி ]suf ]V விடுவி

7. Noun → Adjective

[ [uyaram ]N + Ana ]suf ]adj 'high'

[ [உயரம் ]N + ஆன ]suf ]adj உயரமான

[ [azaku ]N + Ana ]suf ]adj 'beautiful'

[ [அழகு ]N + ஆன ]suf ]adj அழகான

[ [nErmai ]N + Ana ]suf ]adj 'honest'

[ [நேர்மை ]N + ஆன ]suf ]adj நேர்மையான

8. Verb → Adverb

[ [cey ]V + tu ]suf ]adv 'having done'

[ [செய் ]V + து ]suf ]adv செய்து

[ [ezuth- ]V + i ]suf ]adv 'having written'

[ [எழுது- ]V + இ ]suf ]adv எழுதி

[ [padi ]V + ththu ]suf ]adv 'having read'

[ [படி ]V + த்து ]suf ]adv படித்து

Compound Word forms

Table 6.1 shows the possible combinations for compound word formation.
Examples:

{ [kalvi ]N # [kUdam ]N # }N 'educational institution'

{ [கல்வி ]N # [கூடம் ]N # }N கல்விகூடம்

{ [paNi ]N # [puri ]V # }V '(perform) work'

{ [பணி ]N # [புரி ]V # }V பணிபுரி

{ [ezuthu ]V # [kOl ]N # }N 'writing instrument'

{ [எழுது ]V # [கோல் ]N # }N எழுதுகோல்

{ [periya ]adj # [wakaram ]N # }N 'big city'

{ [பெரிய ]adj # [நகரம் ]N # }N பெரிய நகரம்

{ [wERRu ]adv # [iravu ]N # }N 'last night'

{ [நேற்று ]adv # [இரவு ]N # }N நேற்று இரவு

Table 6.1 Compound Word-forms Formation

No. Surface form Inflection Compound form

1 Noun Noun Noun

2 Noun Verb Verb

3 Verb Noun Noun

4 Adjective Noun Noun

5 Adverb Verb Verb

6.2.4 Tamil Verb Morphology

Tamil verbs are inflected by means of suffixes. Tamil verbs can be finite or non-finite
forms. Finite verb forms occur in the main clause of the sentence and non-finite forms
occur as the predicate of subordinate or embedded clauses. Morphologically, finite verb
forms are inflected for tense, mood, aspect, person, number and gender.

The simple finite verb forms are given in Table 6.2. The first column presents the
PNG (Person-Number-Gender) tag and the subsequent columns present the present, past
and future tenses respectively. For the word "படி" padi (study), the various simple
finite inflected forms with tense markers and PNG markers are given in Table 6.2.

PNG_suffix is a portmanteau morpheme that encodes the person, number and
gender all in one. Finite verbs take the form Verb_stem + Tense + PNG_suffix.
Tamil recognizes four kinds of non-finite verbs: infinitive, verbal participle,
adjectival participle and conditional. They take the following morphotactics.

Verb_stem + infinitive_suffix (infinitive)

Verb_stem + vp_suffix (verbal participle)

Verb_stem + tense + rp_suffix (adjectival participle)

Verb_stem + conditional_suffix (conditional)

Modal verbs can be defective, in that they cannot take any more inflectional
suffixes, or they can be regular verbs that get inflected for tense and PNG suffixes.

Verb_stem + infinitive_suffix + modal_verb

Verb_stem + infinitive_suffix + modal_stem + tense + png_suffix

Table 6.2 Simple Verb Finite Forms

PNG Root-Pres-PNG Root-Past-PNG Root-Fut-PNG


3SE padi-kinR-Ar padi-thth-Ar padi-pp-Ar
3SM padi-kinR-An padi-thth-An padi-pp-An
3SF padi-kinR-AL padi-thth-AL padi-pp-AL
2S padi-kinR-Ay padi-thth-Ay padi-pp-Ay
1P padi-kinR-Om padi-thth-Om padi-pp-Om
1S padi-kinR-En padi-thth-En padi-pp-En
2SE padi-kinR-Ir padi-thth-Ir padi-pp-Ir
3SN padi-kinR-athu padi-thth-athu padi-pp-athu
2PE padi-kinR-IrkaL padi-thth-IrkaL padi-pp-IrkaL
3PE padi-kinR-ArkaL padi-thth-ArkaL padi-pp-ArkaL
3PN padi-kinR-ana padi-thth-ana padi-pp-ana
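
As an illustration of the Verb_stem + Tense + PNG_suffix morphotactics, the following
minimal Python sketch regenerates the forms of Table 6.2 for the weak verb padi from
its tense and PNG markers. Sandhi is deliberately ignored here; a real generator must
also apply the morphophonemic rules discussed in this chapter.

    TENSE = {"PRES": "kinR", "PAST": "thth", "FUT": "pp"}
    PNG = {"3SE": "Ar", "3SM": "An", "3SF": "AL", "2S": "Ay", "1P": "Om",
           "1S": "En", "2SE": "Ir", "3SN": "athu", "2PE": "IrkaL",
           "3PE": "ArkaL", "3PN": "ana"}

    def finite_form(stem, tense, png):
        """Realize Verb_stem + Tense + PNG as in Table 6.2 (no sandhi)."""
        return stem + "-" + TENSE[tense] + "-" + PNG[png]

    print(finite_form("padi", "PAST", "3SM"))  # padi-thth-An (padiththAn)
    print(finite_form("padi", "FUT", "1P"))    # padi-pp-Om  (padippOm)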

6.2.5 Tamil Noun Morphology

Tamil nouns (and pronouns) are classified into two super-classes, "rational" and
"irrational", which together include a total of five classes. Humans and deities are
classified as "rational", and all other nouns (animals, objects, abstract nouns) are
classified as "irrational". The "rational" nouns and pronouns belong to one of three
classes: masculine singular, feminine singular and rational plural. The "irrational"
nouns and pronouns belong to one of two classes: irrational singular and irrational
plural. The plural form for rational nouns may be used as an honorific, gender-neutral,
singular form [132].

Suffixes are used to perform the functions of cases or postpositions. Traditional


grammarians tried to group the various suffixes into eight cases corresponding to the
cases used in Sanskrit: the nominative, accusative, dative, sociative,

genitive, instrumental, locative, and ablative. The various noun forms are given in the
Table 6.3. The table represents the singular and plural forms of the word "எலி" eli (rat)
with the case markers.

Table 6.3 Noun Case Markers

Case Singular Plural


Nominative eli eli-kaL
Accusative eli-ai eli-kaL-ai
Dative eli-uku eli-kaL-uku
Benefactive eli-ukk-Aka eli-kaL-ukk-Aka
Instrumental eli-Al eli-kaL-Al
Sociative-Odu eli-Odu eli-kaL-Odu
Sociative-udan eli-udan eli-kaL-udan
Locative eli-il eli-kaL-il
Ablative eli-il-iruwthu eli-kaL-il-iruwthu
Genitive eli-in-athu eli-kaL-in-athu

A noun form without any inflection is called a noun stem. Nouns in their stem
forms are singular.

aaciriyarkaL = aaciriyar (teacher) + kaL (pl. marker)

ஆசிரியர்கள் = ஆசிரியர் + கள்

peenaakkaL = peenaa (pen) + kaL (pl. marker)

பேனாக்கள் = பேனா + கள்

puththakangkaL = puththakam (book) + kaL (pl. marker)

புத்தகங்கள் = புத்தகம் + கள்

The examples shown above are a few instances of plural inflection. Creating the
plural form of a noun is not simply a matter of concatenating 'kaL': in
"peenaakkaL", for instance, the 'k' of 'kaL' is doubled, and in "puththakangkaL"
the stem (puththakam) is inflected to puththakangkaL ('am' in the stem is replaced
by 'ang', followed by 'kaL'). These differences are due to the 'Sandhi'

changes that take place when the noun stem is concatenated to the ‘kaL’ morpheme.
Tamil uses case suffixes and post positions for case marking instead of prepositions.
Case markers indicate the relationship between the noun phrases and the verb phrase. It
indicates the semantic role of the noun phrases in the sentence. Genitive case, tells the
relationship between noun phrases. This is expressed by ‘in’ morpheme. Case suffixes
are concatenated to the nouns in their stem form or after the plural morpheme if it’s a
plural noun.

noun_stem + {kaL} + case_suffix

e.g. kaththiyAl (with a knife) = kaththi (knife) + Al (with)

கத்தியால் = கத்தி + ஆல்
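
The following is a minimal Python sketch of the morphotactics noun_stem + {kaL} +
case_suffix on romanized forms. Only two sandhi rules mentioned in this chapter are
modeled: stem-final "am" becomes "ang" before the plural kaL, and a y-glide is
inserted when an i/ai-final stem meets a vowel-initial suffix. Everything else is
plain concatenation; a full generator needs the complete sandhi rule set.

    CASE = {"ACC": "ai", "INS": "Al", "LOC": "il"}  # a few case suffixes
    VOWELS = set("aAiIuUeEoO")

    def join(stem, suffix):
        """Concatenate with y-glide sandhi for i/ai-final stems."""
        if stem.endswith(("i", "ai")) and suffix[0] in VOWELS:
            return stem + "y" + suffix
        return stem + suffix

    def inflect(stem, plural=False, case=None):
        word = stem
        if plural:
            if word.endswith("am"):            # puththakam -> puththakang + kaL
                word = word[:-2] + "ang"
            word = word + "kaL"
        if case:
            word = join(word, CASE[case])
        return word

    print(inflect("puththakam", plural=True))              # puththakangkaL
    print(inflect("puththakam", plural=True, case="ACC"))  # puththakangkaLai
    print(inflect("kaththi", case="INS"))                  # kaththiyAl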

Post positions are of two kinds: bound and free. In case of bound post positions,
they occur with their respective governing case suffixes. In such a case, the
Morphotactics would be,

noun_stem + {kaL} + case_suffix + bound_post_position

e.g. vIddiliruwthu (from the house) = vIdu (house) + il + iruwthu (from)

வீட்டிலிருந்து = வீடு + இல் + இருந்து

Sometimes the post positions follow a blank space after the case suffix as
another word. Free post positions follow noun stems without any case suffixes.
However they are written as another word and do not concatenate with the noun.
Basically, there are eight cases in Tamil. Verbs can take the form of nouns when
followed by nominal suffixes. Nominalized verb forms are an example of derivational
Morphology in Tamil. They occur in the following format.

Verb_stem + tense + nominal_suffix

e.g. ceythavar (one who did) = cey (do) + th (past) + avar

(3rd person singular honorific)

செய்தவர் = செய் + த் + அவர்

6.2.6 Tamil Morphological Analyzer

Tamil language is morphologically rich and agglutinative. Such a morphologically rich


language needs deep analysis at the word level to capture the meaning of the word from
its morphemes and their categories. Each root is affixed with several morphemes to
generate a word. In general, Tamil morphemes are attached after the root word as
suffixes. Each root word can take more than ten thousand inflected word forms. Tamil
exhibits both lexical and inflectional morphology. Lexical morphology changes the
meaning of the word and its class by adding derivational and compounding morphemes to
the root; inflectional morphology changes the form of the word and adds information to
it by attaching inflectional morphemes to the root.

6.2.7 Challenges in Tamil morphological Analyzer

The morphological structure of Tamil is quite complex, since verbs inflect for
person, gender and number markings and also combine with auxiliaries that indicate
aspect, mood, causation, attitude, etc. A single verb root can inflect into more than
ten thousand word forms including auxiliaries. A noun root inflects with plural,
oblique, case, postposition and clitic suffixes; a single noun root can inflect into
more than five hundred word forms including postpositions. The root and morphemes
have to be identified and tagged for further language processing at the word level.

The structure of the verbal complex is unique, and capturing this complexity in a
machine-analyzable and generatable format is a challenging job. The formation of the
verbal complex involves the arrangement of the verbal units and the interpretation of
their combinatory meaning. Phonology also plays its part in the formation of the
verbal complex, in terms of morphophonemic or 'Sandhi' rules which account for the
shape changes due to inflection.

Understanding verbal complexity involves identifying the structure of simple
finite verbs and compound verbs. By understanding the nature of the verbal complexity,
it is possible to evolve a methodology to recognize it. In order to analyze the verbal
forms, in which the inflections vary from one set of verbs to another, a
classification of Tamil verbs based on tense markers is evolved. The inflections
include finite, infinitive, adjectival, adverbial and conditional forms of verbs.

Compared to verb morphological analysis, noun morphological analysis is
relatively easy. A noun can occur separately or with plural, oblique, case,
postposition and clitic suffixes. A corpus was developed with all morphological
feature information, so that the machine by itself captures all morphological rules,
including the 'Sandhi' and morphotactic rules.

6.3 TAMIL MORPHOLOGICAL ANALYZER SYSTEM

The morphological analyzer is the second stage of pre-processing Tamil language
sentences. In the first stage, Tamil sentences are tagged by the Tamil POS Tagger
tool. The system developed for the Tamil morphological analyzer consists of five
modules (Figure 6.3).

[Figure 6.3 shows a POS tagged sentence passing through the minimized POS tagger, which routes each word to the Noun/Verb, Pronoun, Proper Noun or other word class analyzers to produce a morphologically annotated sentence.]
Figure 6.3 General Framework for Morphological Analyzer System

The five modules are,


1. Minimized POS Tagger

2. Noun/Verb Analyzer

3. Pronoun Analyzer

4. Proper Noun Analyzer

5. Other Word Class Analyzers

The input to the morphological system is a POS tagged sentence. In the first
module, the POS tagged sentence is refined according to the simplified POS tagset
given in Table 6.4, and the refined sentence is split according to the simplified POS
tags. The second module morphologically analyzes the noun (<N>) and verb (<V>) forms.
The third and fourth modules morphologically analyze pronouns (P) and proper nouns
(PN). Other word classes are analyzed in the fifth module, which considers the POS
tag itself as the morphological feature.

Table 6.4 Minimized POS Tagset

6.4 TAMIL MORPHOLOGICAL ANALYZER FOR NOUNS AND VERBS

6.4.1 Morphological Analyzer using Machine Learning

The morphological analyzer identifies the root and suffixes of a word. Generally,
rule based approaches are used for morphological analysis; these are based on a set of
rules and a dictionary that contains root words and morphemes. In a rule based
approach, a particular word is given as input to the morphological analyzer, and if
the corresponding morphemes or root word are missing from the dictionary, the system
fails. Moreover, each rule depends on the previous rule, so if one rule fails, it
affects the entire set of rules which follows.

Recently, machine learning approaches have come to dominate the Natural
Language Processing field. Machine learning is a branch of Artificial Intelligence
(AI) concerned with the design of algorithms that learn from examples. Machine
learning algorithms can be supervised or unsupervised: input and corresponding output
data are used in supervised learning, while in unsupervised learning only input
samples are used. The goal of a machine learning approach is to use the given examples
to find generalization and classification rules automatically. All the rules,
including complex spelling rules, can be handled by this method. A morphological
analyzer based on machine learning does not require any hand coded morphological
rules; it only needs morphologically segmented corpora. H. Poon et al. (2009) [189]
reported the first log-linear model for unsupervised morphological segmentation; for
Arabic and Hebrew it outperforms the state-of-the-art systems by a large margin.
Sequence labeling is a significant generalization of the supervised classification
problem: a single label is assigned to each input element in a sequence, where the
elements are typically units like parts of speech or syntactic chunk labels [171].
Many tasks in fields such as natural language processing and bioinformatics are
formalized as sequence labeling problems. There are two types of sequence labeling
approaches [171].

• Raw labeling.

• Joint segmentation and labeling.

In raw labeling, each element gets a single tag, whereas in joint segmentation
and labeling, whole segments get a single label. In a morphological analyzer, the
sequence is usually a word and each character is an element. As mentioned earlier, the
input to the morphological analyzer is a word and the output is a root and
inflections. The input word is denoted as 'W', and the root word and inflections are
denoted by 'R' and 'I' respectively.

[W]Noun/Verb = [R]Noun/Verb + [I]Noun/Verb

In turn, 'I' can be expressed as i1 + i2 + ... + in, where 'n' refers to the
number of inflections or morphemes. Further, 'W' is converted into a sequence of
characters: the morphological analyzer accepts a sequence of characters as input and
generates a sequence of characters as output. Let X be the finite set of input
characters and Y be the finite set of output characters. If the input string is 'x',
it is segmented as x1x2...xn, where each xn ∈ X. Similarly, if y is an output string,
it is segmented as y1y2...yn with yn ∈ Y, where 'n' is the number of segments.

Inputs: x = (x1, x2, x3, ..., xn)

Labels: y = (y1, y2, y3, ..., yn)

The main objective of the sequence labeling approach is to predict y from the
given x. In the training data, the input sequence 'x' is mapped to the output sequence
'y'. The morphological analyzer problem is thus transformed into a sequence labeling
problem. The training data are described in the following subsections. Finally,
morphological analysis is redefined as a classification task which is solved using
sequence labeling methodology.
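
The following minimal Python sketch shows how one (word, segmentation) pair can be
recast as the (x, y) sequences just described. The B/I boundary labels here are
illustrative; the label set actually used in this thesis additionally encodes the
characters of the output form (see Table 6.13).

    def make_example(word, morphemes):
        """word: surface form; morphemes: its morphemes in order."""
        x = list(word)
        y = []
        for m in morphemes:
            y.extend(["I"] * (len(m) - 1) + ["B"])  # boundary after last char
        assert len(x) == len(y)
        return x, y

    # padiththAn = padi (root) + thth (past) + An (3SM)
    x, y = make_example("padiththAn", ["padi", "thth", "An"])
    print(list(zip(x, y)))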

6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer

Data formulation plays a key role in supervised machine learning approaches. The
first step in the corpora development for the morphological analyzer is classifying
paradigms for verbs and nouns; Tamil verbs and nouns are classified based on tense
markers and case markers respectively. Each paradigm inflects with the same set of
inflections. The second step is to collect the list of root words for each paradigm.

6.4.2.1 Paradigm Classification

A paradigm provides information about all the possible word forms of a root word in a
particular word class. Tamil noun and verb paradigm classification is done based on
case and tense markers respectively. The number of paradigms for each word class
(noun/verb) is defined. For the sake of computational data modeling, Tamil verbs were
classified into 32 paradigms [13], and nouns into 25 paradigms to resolve the
challenges in noun morphological analysis. Based on the paradigm, the root words are
grouped accordingly. Table 6.5 shows the number of paradigms and inflections of verbs
and nouns handled in the system; Total represents the total number of word forms
handled in this analyzer system. The noun and verb paradigm lists are shown in Tables
6.6 and 6.7.

Table 6.5 Number of Paradigms and Inflections

        Paradigms   Inflections   Auxiliaries   Postpositions   Total word forms
Verb    32          164           67            --              10988
Noun    25          30            --            290             320

Table 6.6 Noun Paradigms

6.4.2.2 Word-forms

Noun word forms

The Morphological System for noun handles more than three hundred word
forms including postpositions. Traditional grammarians group the various suffixes into
8 cases corresponding to the cases used in Sanskrit: the nominative, accusative,
dative, sociative, genitive, instrumental, locative and ablative. The
sample word forms which are used in this thesis are shown in Table 6.8. Remaining
word forms are included in Appendix B.

Table 6.7 Verb Paradigms

Table 6.8 Noun Word Forms

நரம்பினை (narampinai)              நரம்பது (narampathu)
நரம்பை (narampai)                  நரம்பினது (narampinathu)
நரம்பினத்தை (narampinathai)        நரம்பின்கண் (narampnkaN)
நரம்போடு (narampOdu)               நரம்பதுகண் (narampathukaN)
நரம்பினோடு (narampinOdu)           நரம்புக்காக (narampukkAka)
நரம்பினத்தோடு (narampinaththOdu)   நரம்பாலான (narampAlAna)
நரம்பினால் (narampinAl)            நரம்புடைய (narampudaiya)
நரம்பால் (narampAl)                நரம்பினுடைய (narampinudaiya)
நரம்புக்கு (narampukku)            நரம்பில் (narampil)
நரம்பிற்கு (narampiRku)            நரம்பினில் (narampinil)
நரம்பின் (narampin)                நரம்புடன் (narampudan)

Verb word forms

Verbs are also morphologically deficient, i.e., some verbs do not take all the
suffixes meant for verbs. The verb is an obligatory part of a sentence except in
copula sentences. Verbs can be classified into different types based on
morphological, syntactic and semantic characteristics. Based on the tense suffixes,
verbs can be classified into weak verbs, strong verbs and medium verbs. Based on form
and function, verbs can be classified into finite verbs (e.g. va-ndt-aan
'come_PAST_he') and non-finite verbs (e.g. va-ndt-a 'come_PAST_RP' and va-ndt-u
'come_PAST_VPAR'). Depending on whether the non-finite form occurs before a noun or a
verb, they can be classified as adjectival or relative participle forms (e.g. vandta
paiyan 'the boy who came') and adverbial or verbal participle forms (e.g. vandtu
poonaan 'having come he went'). The morphological system for verbs handles more than
ten thousand word forms including auxiliaries and clitics. The sample verb forms which
are used in this research are shown in Table 6.9. Remaining word forms are given in
Appendix B.

Table 6.9 Verb Word Forms

படி (padi)                          படிக்கின்றேன் (padikkinREn)
படித்தான் (padiththAn)              படிக்கின்றோம் (padikkinROm)
படித்தாள் (padiththAL)              படிப்பான் (padippAn)
படித்தார் (padiththAr)              படிப்பாள் (padippAL)
படித்தார்கள் (padiththArkaL)        படிப்பார் (padippAr)
படித்தது (padiththathu)             படிப்பார்கள் (padippArkaL)
படித்தன (padiththana)               படிப்பது (padippathu)
படித்தாய் (padiththAy)              படிப்பன (padippana)
படித்தீர் (padiththIr)              படிப்பாய் (padippAy)
படித்தீர்கள் (padiththIrkaL)        படிப்பீர் (padippIr)
படித்தேன் (padiththEn)              படிப்பீர்கள் (padippIrkaL)
படித்தோம் (padiththOm)              படிப்பேன் (padippEn)
படிக்கிறான் (padikkiRAn)            படிப்போம் (padippOm)
படிக்கிறாள் (padikkiRAL)            படிக்கும் (padikkum)
படிக்கிறார் (padikkiRAr)            படித்த (padiththa)
படிக்கிறார்கள் (padikkiRArkaL)      படிக்கின்ற (padikkinRa)
படிக்கின்றது (padikkinRathu)        படிக்காத (padikkAtha)
படிக்கின்றன (padikkinRana)          படித்தவன் (padiththavan)
படிக்கின்றாய் (padikkinRAy)         படித்தவள் (padiththavaL)
படிக்கின்றீர் (padikkinRIr)         படித்தவர் (padiththavar)
படிக்கின்றீர்கள் (padikkinRIrkaL)   படித்தது (padiththathu)
6.4.2.3 Morphemes

Noun morphemes

The Morphological analyzer system for noun handles 92 morphemes in the Morpho-
lexical Tagging (Phase II). The morphemes which are used in this thesis are given in
Table 6.10.
Table 6.10 Noun Morphemes

ஐ          மேலே         வெளியே
ஆக         மேல்          வைத்து
ஆன         ற்கு           அடியில்
க்          அண்டை        அப்பால்
ச்          அருகே         அருகில்
த்          உக்கு          இடையில்
ப்          உள்ளே         இருந்து
அது        எதிரே          எதிரில்
ஆல்        ஒட்டி          கிழக்கு
இன்        கிட்ட          நடுவில்
இல்        க்காக          வரையில்
உள்        பதில்          அப்புறம்
ஓடு         பற்றி          அல்லாமல்
கண்        பிறகு          இல்லாமல்
கள்        முதல்          குறித்து
ப          மூலம்          குறுக்கே
போல       ஆட்டம்        பார்த்து
வரை        கொண்டு        பின்னால்
விட         சுற்றி          முன்னால்
அதன்       தாண்டி         வெளியில்
இடம்       தெற்கு         எதிர்க்கு
இனம்       நோக்கி         எதிர்க்கே
உடன்       பக்கம்         தவிர்த்து
உடைய      பதிலாக        வரைக்கும்
ஒழிய       பிந்தி          அடுத்தாற்
கீழே        பின்னே         போல்
கீழ்         பின்           அருகிலிருந்து
க்கு         மாதிரி         எதிர்த்தாற்
தவிர       மேற்கு         போல்
பின்        வடக்கு        எதிர்த்தாற்
போல்       வழியாக       போல்
முன்        விட்டு

Verb morphemes

The morphological system for verbs handles 170 morphemes in the Morpho-
lexical tagging (Phase II). The morphemes which are used in this analyzer are shown in
Table 6.11.

Table 6.11 Verb Morphemes

அ       ய
ஆ       வா       யல்
இ       வே       ரல்
உ       வை       லல்
ஏ       வ்        ளல்
ஓ       அ        வன்
க       அல்       வர்
ட       ஆத்       வள்
ண       ஆன்       ஆமல்
ன       ஆமை      இயல்
ய       ஆம்       காத்
ர       ஆய்       காமை
ற       ஆர்       கிற்
ல       ஆல்       கிழி
ள       ஆள்       கும்
கை      இன்       கூ
க்       இ        கொ
சா      ஈர்        கொள்
ச்       உம்       சாகு
        உள்       செய்
ட்       ஏன்       ணாத்
        ஓன்       ணாமை
        ஓம்       ம்
த்       கல்       தான்
        கள்       திரி
ன்       கிட       தீர்
போ     க்க        தொலை
ப்       டல்       த்
யி      ணல்       த்த்
        தல்       ந்
ற்       னர்       ந்த்
        னல்       னாத்
        ப         னாமை

Ambiguous Morphemes of Noun and Verb

A morpheme may have one or more morpho-syntactic categories, which leads to
ambiguity in morphemes. The ambiguous morphemes of nouns and verbs are shown in
Table 6.12.

Table 6.12 Verb/Noun Ambiguous Morphemes

Morphemes     Ambiguous Tags
அ            <INF> <RELATIVE_PARTICIPLE>
ஆ            <CLITIC> <NEG_MARKER>
ஆள்          <3SF> <VERB_ROOT>
இ            <PAST_TENSE> <VERBAL_PARTICIPLE>
இயல்         <VERB_AUX> <VERB_ROOT>
இரு          <VERB_AUX> <VERB_ROOT>
உள்          <VERB_AUX> <VERB_ROOT>
கட்டு         <VERB_AUX> <VERB_ROOT>
கல்          <NOM_kkal> <VERB_ROOT>
காட்டு        <VERB_AUX> <VERB_ROOT>
கிட          <VERB_AUX> <VERB_ROOT>
கிழி          <VERB_AUX> <VERB_ROOT>
கூடு          <VERB_AUX> <VERB_ROOT>
கொடு         <VERB_AUX> <VERB_ROOT>
கொண்டு       <VERB_AUX> <VERB_ROOT>
கொள்         <VERB_AUX> <VERB_ROOT>
செய்          <VERB_AUX> <VERB_ROOT>
தள்ளு         <VERB_AUX> <VERB_ROOT>
திரி          <VERB_AUX> <VERB_ROOT>
தொலை        <VERB_AUX> <VERB_ROOT>
த்            <PAST_TENSE> <SAN>
படு           <VERB_AUX> <VERB_ROOT>
பண்ணு        <VERB_AUX> <VERB_ROOT>
பார்          <VERB_AUX> <VERB_ROOT>
பெறு          <VERB_AUX> <VERB_ROOT>
போ           <VERB_AUX> <VERB_ROOT>
ப்            <FUT_TENSE> <SAN>
மாட்டு        <VERB_AUX> <VERB_ROOT>
             <VERB_AUX> <VERB_ROOT>
ய            <INF> <RELATIVE_PARTICIPLE>
ள            <INF> <RELATIVE_PARTICIPLE>
வா           <VERB_AUX> <VERB_ROOT>
விடு          <VERB_AUX> <VERB_ROOT>
பக்கம்        <PPO> <Noun_ROOT>
ஆக           <Benefactive> <ADV_Suffix>
த்            <Sandhi> <Oblique>
வேண்டு        <VERB_AUX> <VERB_ROOT>
ன            <3PN> <INF> <RELATIVE_PARTICIPLE>

6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer

The data creation for the first phase of the Noun/Verb morphological analyzer system
proceeds in the following stages.

• Preprocessing

• Mapping

• Bootstrapping

Preprocessing

Preprocessing is an important step in data creation, involved in the training
stage as well as the decoding stage. Figure 6.4 shows the preprocessing steps involved
in the development of the corpora. The morphological corpus used for machine learning
is developed through the following steps: Romanization, Segmentation and Alignment.

Romanization

The input word forms are converted to Romanized forms using the Unicode to
Roman mapping. Romanization is done for easy computational processing: in Tamil, a
syllable (compound character) exists as a single character in which the vowel and
consonant cannot be separated, so Tamil graphemes are converted into Roman forms for
this separation. The Tamil-Roman mapping is given in Appendix A.

Segmentation

After Romanization, each word in the corpora is segmented into Tamil graphemes
and, additionally, each syllable in the word is further segmented into consonants and
vowels. The segmented consonants and vowels are suffixed with "-C" and "-V"
respectively; this is named the C-V (Consonant-Vowel) representation. The C-V
representation is given only for the input data. In the output data, morpheme
boundaries are indicated by the "*" symbol.

Alignment

The segmented words are aligned vertically, segment by segment, using the gaps
between them.

Figure 6.4 Preprocessing Steps
Mapping and Bootstrapping

The aligned input segments are consequently mapped to output segments in the
mapping stage. Bootstrapping is done to increase the training data size. The sample
data format for the word படித்தான் 'padiththAn' is given in Table 6.13. The first
column represents the input data and the second the output data; "*" indicates the
morpheme boundaries.

Table 6.13 Sample Data Format

I/P    O/P
p-C    p
a-V    a
d-C    d
i-V    i*
th     th
th-C   th*
A-V    A
n      n*

6.4.2.5 Issues in Data Creation

Mapping mismatched segments

Mismatching is the main problem that occurs when mapping the input characters
to output characters. It arises in two cases: the input units are either more numerous
or fewer than the output units. The mismatching problem is solved by inserting a null
symbol "$" or by combining two units based on the morpho-phonemic rules, after which
the input segments are mapped to output segments. After mapping, the machine learning
tool is used for training on the data.

In case 1, the input sequence has more segments (14) than the output sequence
(13). The Tamil verb "padikkayiyalum" has 14 segments in the input sequence, but only
13 segments are present in the output. The first occurrence of "y" (8th segment) in
the input sequence becomes null due to a morpho-phonemic rule, so there is no output
segment to map it to. For this reason, in training, the input segment "y" is mapped to
the "$" symbol ("$" indicates null) in the output sequence. Now the input and output
segments match equally.

Case 1:
Input Sequence:
P-C | a-V | d-C | i-V | k | k-C | a-V | y-C | i-V | y-C |a-V | l-C | u-V | m
(14 segments)
Mismatched Output Sequence:
p | a | d | i* | k | k | a* | i | y | a | l* | u | m*
(13 segments)
Corrected Output Sequence:
p | a | d | i* | k | k | a* | $ | i | y | a | l* | u | m* (14 segments)

In case 2, the input sequence has fewer segments than the output sequence. The
Tamil verb OdinAn has 6 segments in the input sequence, but the output has 7 segments.
Using a morpho-phonemic rule, the segment "d-C" (2nd segment) in the input sequence is
mapped to the two segments "d" and "u*" (2nd and 3rd segments) in the output sequence.
For this reason, in training, "d-C" is mapped to "du*". Now the input and output
segments are equalized and the problem of sequence mismatching is solved.

Case 2:
Input Sequence:
O | d-C | i-V | n-C | A-V | n (6 segments)

Mismatched Output Sequence:


O | d | u* | i | n* | A | n (7 segments)

Corrected Output Sequence:


O | du* | i | n* | A | n (6 segments)
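
A minimal Python sketch of the "$" padding for Case 1 is given below; the greedy
first-character match used to locate the deleted segment is only illustrative (the
combining fix of Case 2 is not modeled here).

    def equalize(input_segs, output_segs):
        """Pad the output with '$' so both sequences have equal length."""
        padded, j = [], 0
        for seg in input_segs:
            base = seg.split("-")[0].lower()          # strip the -C/-V tag
            out = output_segs[j].rstrip("*").lower() if j < len(output_segs) else ""
            if out.startswith(base[0]):
                padded.append(output_segs[j])
                j += 1
            else:
                padded.append("$")                    # deleted segment -> null label
        return padded

    inp = "P-C a-V d-C i-V k k-C a-V y-C i-V y-C a-V l-C u-V m".split()
    out = "p a d i* k k a* i y a l* u m*".split()
    print(equalize(inp, out))
    # ['p', 'a', 'd', 'i*', 'k', 'k', 'a*', '$', 'i', 'y', 'a', 'l*', 'u', 'm*']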

6.4.3 Morphological Tagging Framework using SVMTool

6.4.3.1 Support Vector Machine (SVM)

Support Vector Machine (SVM) approaches have been around since the mid 1990s,
initially as a binary classification technique, with later extensions to regression and
multi-class classification. Here, Morphological analyzer problem is converted into a
classification problem. These classifications can be done through supervised machine
learning algorithms [12]. Support Vector Machine is a machine learning algorithm for
binary classification, which has been successfully applied to a number of practical

problems, including NLP [12]. Let {(x1, y1), ..., (xN, yN)} be the set of N training
examples, where each instance xi is a vector in R^N and yi ∈ {−1, +1} is the class label.
SVM is attractive because it has an extremely well developed statistical learning


theory. SVM is based on strong mathematical foundations and results in simple, yet
very powerful, algorithms. SVMs are learning systems that use a hypothesis space of
linear functions in a high dimensional feature space, trained with a learning algorithm
from optimization theory that implements a learning bias derived from statistical
learning theory.

6.4.3.2 SVMTool

The SVMTool is an open source generator of sequential taggers based on Support


Vector Machines [12]. Originally, SVMTool was developed for POS tagging, but here
the tool is used for morphological analysis. The SVMTool software package consists of
three main components, namely the model learner (SVMTlearn), the tagger
(SVMTagger) and the evaluator (SVMTeval). SVM models (weight vectors and biases)
are learned from a training corpus using SVMTlearn.

Different models are learned for the different strategies. Given a training set of
annotated examples, SVMTlearn is responsible for training a set of SVM classifiers.
To do so, it makes use of SVM-light, an implementation of Vapnik's SVMs in C
developed by Thorsten Joachims (2002). Given a text corpus (one token per line) and
the path to a previously learned SVM model (including the automatically generated
dictionary), SVMTagger performs tagging of a sequence of characters. Finally, given a
correctly annotated corpus and the corresponding SVMTool-predicted annotation, the
SVMTeval component displays tagging results, evaluating the performance in terms of
accuracy.

6.4.3.3 Implementation of Morphological Analyzer System

Existing Tamil morphological analyzers are explained in Chapter 2. Jan Hajic et al.
(1998) [190] developed morphological tagging for inflectional languages using an
exponential probabilistic model based on automatically selected features; the
parameters of the model are computed using simple estimates.
Using the machine learning approach, the morphological analyzer for Tamil is
developed, with separate engines for nouns and verbs. The noun morphological analyzer
can handle inflected noun forms and postpositionally inflected nouns; the verb
analyzer handles all the verb forms (finite, non-finite and auxiliaries). The
morphological analyzer is redefined as a classification task, and the classification
problem is solved using the SVM. In this machine learning approach, two training
models are developed for morphological analysis, grouped as Model-I (segmentation
model) and Model-II (morpho-syntactic tagging model). The first model (Model-I) is
trained using the sequence of input characters and their corresponding output labels;
this trained model is used for predicting the morpheme boundaries. The second model
(Model-II) is trained using sequences of morphemes and their grammatical categories;
this trained model is used for assigning grammatical classes to each morpheme. Figure
6.5 illustrates the three phases involved in the process of the morphological
analyzer.

• Pre-processing.

• Morpheme Segmentation.

• Morpho syntactic tagging.

Preprocessing

The word that has to be morphologically analyzed is given as the input to the
pre-processing phase. The word primarily undergoes Romanization process. The
romanized word is segmented based on Tamil graphemes. Tamil grapheme consists of
vowel, consonant and syllable. The syllables are broken into vowel and consonant. To
these consonant and vowel, –C and –V are suffixed.

Segmentation of morphemes

In the morpheme segmentation process, words are segmented into morphemes
according to their morpheme boundaries. The input sequence is given to the trained
Model-I, which predicts a label for each input segment. This output sequence is
aligned into morpheme segments using an alignment program.

[Figure 6.5 shows the input word படித்தான் passing through preprocessing, the trained Model-I (morph analyzer and morpheme alignment), the trained Model-II and postprocessing, producing the output படி <ROOT> + த்த் <PAST> + ஆன் <3SM>.]
Figure 6.5 Implementation of Noun/Verb Morph Analyzer

Morpho-syntactic tagging

The segmented morpheme sequence is given to the trained Model-II, which predicts
a grammatical category for each segment (morpheme) in the sequence.
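
The decode path just described can be summarized by the following minimal Python
sketch, in which the two trained SVM models are abstracted behind placeholder
functions that simply replay the worked example படித்தான் (padiththAn).

    def model1(cv_segments):
        # Placeholder for the trained Model-I: one output label per character.
        return ["p", "a", "d", "i*", "th", "th*", "A", "n*"]

    def model2(morphemes):
        # Placeholder for the trained Model-II: one grammatical tag per morpheme.
        tags = {"padi": "<ROOT>", "thth": "<PAST>", "An": "<3SM>"}
        return [tags.get(m, "<UNK>") for m in morphemes]

    def align_morphemes(labels):
        """Group character labels into morphemes at '*' boundaries."""
        morphemes, cur = [], ""
        for lab in labels:
            cur += lab.rstrip("*")
            if lab.endswith("*"):
                morphemes.append(cur)
                cur = ""
        if cur:
            morphemes.append(cur)
        return morphemes

    labels = model1(["p-C", "a-V", "d-C", "i-V", "th", "th-C", "A-V", "n"])
    morphemes = align_morphemes(labels)      # ['padi', 'thth', 'An']
    print(list(zip(morphemes, model2(morphemes))))
    # [('padi', '<ROOT>'), ('thth', '<PAST>'), ('An', '<3SM>')]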

6.5 MORPHOLOGICAL ANALYZER FOR PRONOUN USING
PATTERN MATCHING
The morphological analyzer for Tamil pronouns is developed using a pattern matching
approach. Personal pronouns play an important role in the Tamil language and therefore
need special attention in generation as well as analysis. Figure 6.7 shows the
implementation of the pronoun morphological analyzer. Morphological processing of
Tamil pronoun word forms is handled independently by the pronoun morphological
analysis system. Morphological analysis of pronouns is based on pattern matching and
the pronoun root word. The structure of the pronoun word form is used for creating a
pattern file. The pronoun word structure is divided into four stages:

i. PRN - ROOT
ii. CASE - CLITIC
iii. PPO - CL
iv. CLITIC

These stages are illustrated in Figure 6.6.

[Figure 6.6 shows the pronoun root combining in successive layers with a case suffix, a postposition and a clitic, each layer optionally closed by a clitic.]
Figure 6.6 Structure of Pronoun Word-form

Example for the Structure of Pronoun

அவன் (avan) → அவனா (avan + clitic)
அவன் + க்கு → அவனுக்கு (avan + DAT) → அவனுக்கா (avan + DAT + clitic)
அவனுக்கு + அருகில் → அவனுக்கருகில் (avan + DAT + PPO) → அவனுக்கருகிலா (avan + DAT + PPO + clitic)

The pronoun word class is a closed class, so it is easy to collect all pronoun root
words. In the pronoun morphological system, the word form is processed from left to
right. Generally, morphological analysis systems handle the word from right to left,
but here the limited vocabulary of pronouns makes it possible to formulate a system
that works from left to right. The pronoun word form is Romanized using the Unicode
to Roman mapping, and this Romanized word is first compared with the pronoun stem
file, which consists of all the stems and roots of pronoun words. If the Roman form
matches any entry in the pronoun stem file, the matched part of the Roman form is
replaced with the value of the corresponding entry. After this process, the remaining
part is compared with three different suffix files, and each matched part is replaced
with the corresponding value of the suffix element. Finally, the root word is
converted back into Unicode form. Figure 6.7 shows the implementation of the pronoun
morphological analyzer.

[Figure 6.7 shows the pronoun word form being matched against the pattern dataset to produce the pronoun root plus its morpho-lexical information (MLI).]
Figure 6.7 Implementation of Pronoun Morph Analyzer

Steps

Step 1: Check whether the input word is present in the dictionary.
Step 2: If present, go to Step 3; else go to Step 4.
Step 3: Retrieve the root and morpho-lexical information (MLI).
Step 4: Assign the input word as the root word and null as the MLI.
Step 5: The final output is the combination of the root word and the MLI.
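
A minimal Python sketch of these steps, with a tiny illustrative pattern dataset (the
real system uses the full pronoun stem and suffix files, processed left to right):

    PRONOUN_DB = {
        # word form -> (root, morpho-lexical information); entries illustrative
        "avan":     ("avan", "3SM"),
        "avanukku": ("avan", "3SM+DAT"),
        "avanukkA": ("avan", "3SM+DAT+CLITIC"),
        "avanA":    ("avan", "3SM+CLITIC"),
    }

    def analyze_pronoun(word):
        if word in PRONOUN_DB:              # Steps 1-3: look up and retrieve
            root, mli = PRONOUN_DB[word]
        else:                               # Step 4: fall back to the word itself
            root, mli = word, None
        return root, mli                    # Step 5: combined output

    print(analyze_pronoun("avanukku"))      # ('avan', '3SM+DAT')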

6.6 MORPHOLOGICAL ANALYZER FOR PROPER NOUNS USING SUFFIXES

The morphological analyzer for proper nouns is developed using suffixes. Figure 6.8
shows the implementation of the proper noun morphological analyzer. The proper noun
word form, identified by the POS tag <PN> in the minimized POS tagged sentence, is
taken as input for proper noun morphological analysis. Initially, the proper noun word
form is converted into Roman form for easy computation. This Roman conversion is done
using a simple key pair mapping of Roman and Tamil characters: the mapping program
recognizes each Tamil character unit and replaces it with the corresponding Roman
character.

This Roman form is given to the proper noun analyzer system. The system
compares the word with the predefined suffixes: first, it identifies the suffix,
which is then replaced with the corresponding information from the proper noun suffix
data set. The suffix data set is created using various proper noun inflections and
their end characters. For example, from Table 6.14, the word "sithamparam"
(சிதம்பரம்) ends with 'm' (ம்), while the word "pANdisEri" (பாண்டிச்சேரி) ends with
'ri' (ரி); the possible inflections of both words are given in the table. The
morphological changes differ for proper nouns based on their end characters, so end
characters are used in creating the rules. From the various inflections of the word
form, the suffix is identified and the remaining part is the stem. The suffix is
mapped to the original morphological information: the algorithm replaces the
encountered suffix with the morphological information in the suffix table.

Steps

Step 1: The suffix of the input word is identified using the suffix table.
Step 2: The identified suffix is stripped from the word.
Step 3: Suffix stripping also gives the stem of the word.
Step 4: Based on the suffix, the stem is converted into the root word.
Step 5: The morpho-lexical information is identified for the suffix.
Step 6: The final output is the combination of the root word and the morpho-lexical
information.
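
A minimal Python sketch of this suffix-stripping procedure is shown below; the suffix
table entries cover only the "am"-final pattern of Table 6.14 (sithamparam) and are
illustrative.

    SUFFIX_TABLE = [
        # (suffix, restoration that recovers the root, MLI) - longest first
        ("aththiliruwthu", "am", "ABL"),
        ("aththiRku",      "am", "DAT"),
        ("aththil",        "am", "LOC"),
        ("aththai",        "am", "ACC"),
        ("amum",           "am", "UM"),
    ]

    def analyze_proper_noun(word):
        for suffix, restore, mli in SUFFIX_TABLE:   # Step 1: find the suffix
            if word.endswith(suffix):
                stem = word[: -len(suffix)]         # Steps 2-3: strip, get stem
                return stem + restore, mli          # Steps 4-6: root + MLI
        return word, None                           # uninflected proper noun

    print(analyze_proper_noun("sithamparaththil"))  # ('sithamparam', 'LOC')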

[Figure 6.8 shows the input word checked for a suffix; if one is found, the word is split into stem and suffix using the suffix database, the stem is converted into the lemma, the suffix is mapped to its morpho-lexical information, and the output is Lemma + MLI.]
Figure 6.8 Implementation of Proper Noun Morph Analyzer


Table 6.14 Example for Proper Noun Inflections

Root        சிதம்பரம்              பாண்டிச்சேரி
Root+ACC    சிதம்பரத்தை            பாண்டிச்சேரியை
Root+LOC    சிதம்பரத்தில்           பாண்டிச்சேரியில்
Root+DAT    சிதம்பரத்திற்கு          பாண்டிச்சேரியிற்கு
Root+ABL    சிதம்பரத்திலிருந்து       பாண்டிச்சேரியிலிருந்து
Root+UM     சிதம்பரமும்             பாண்டிச்சேரியும்

6.7 RESULTS AND EVALUATION

The efficiency of the system is compared in this section. Various machine learning
tools are also compared using the same morphologically annotated data. The system
accuracy is estimated at various levels, which are briefly discussed below.

Training Data Vs Accuracy

In Figure 6.9, the X axis represents training data and the Y axis represents
accuracy. From the graph, it is found that the Morphological Analyzer accuracy
increases with increase in the volume of training data. Accuracies are calculated for
training corpus sizes from 10k to 300k.

[Figure 6.9 plots accuracy (Y axis, 30 to 100) against training data size (X axis, 10k to 300k).]

Figure 6.9 Training Data Vs Accuracy

Tagged and Untagged Accuracies

In the sequence based morphological system, output is obtained in two different
stages using the trained models. The first stage takes a sequence of characters as
input and gives untagged morphemes as output using the trained Model-I; this
corresponds to morpheme identification. In the second stage, these morphemes are
tagged using the trained Model-II. Accuracies of the untagged and tagged morphemes
for verbs and nouns are shown in Table 6.15.

Table 6.15 Tagged Vs Untagged Accuracies

Accuracy              Verb      Noun
Untagged (Model-I)    93.56%    94.34%
Tagged (Model-II)     91.73%    92.22%

Word level and character level accuracies

Accuracies are compared at the word level as well as the character level. 2300
verb words and 1750 noun words were taken randomly from the POS tagged corpus for
testing the system. Table 6.16 shows the number of words and characters in the whole
testing data set, as well as the efficiency of prediction.

Table 6.16 Number of Words and Characters and level of Efficiencies

VERB NOUN
Category
Words Characters Words Characters
Testing data 2300 20627 1750 10534
Predicted correctly 2071 19089 1639 9645
Efficiency 90.4% 92.5% 91.5% 93.6%

The percentages of word level efficiency and character level efficiency are
calculated by the following formulae.

Word level efficiency = Number of words split correctly / Total number of words in the testing set

Character level efficiency = Number of characters tagged correctly / Total number of characters in the testing set
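
A minimal Python sketch of the two computations (with invented toy inputs, not the
reported test set):

    def word_level(gold_words, predicted_words):
        """Percentage of test words segmented exactly right."""
        correct = sum(g == p for g, p in zip(gold_words, predicted_words))
        return 100.0 * correct / len(gold_words)

    def character_level(gold_tags, predicted_tags):
        """One gold/predicted label per character over the test set."""
        correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
        return 100.0 * correct / len(gold_tags)

    gold = ["padi+thth+An", "padi+pp+Om"]
    pred = ["padi+thth+An", "padip+p+Om"]
    print(word_level(gold, pred))   # 50.0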

Sentence level accuracies

The POS tagged sentences are given to the morphological analyzer tool;
therefore, the accuracy of POS tagging affects the performance of the analyzer. Here,
1200 POS tagged sentences consisting of 8358 words were taken for testing the
morphological system. Table 6.17 shows the sentence level accuracy of the
morphological analyzer system. For other categories of simplified POS tags,
part-of-speech information is considered as morphological information.

Table 6.17 Sentence Level Accuracies

WORD COUNT
Categories
Input Correct Output Percentage
N 2821 2642 93.65
V 2794 2543 91.00
P (Pronoun) 562 543 96.61
PN 279 258 92.47
O (Others) 1902 1817 95.53
Overall Accuracy 93.86

6.8 PREPROCESSED ENGLISH AND TAMIL SENTENCE


English language sentences are preprocessed using an existing parser and developed
rules (Chapter 4). For Tamil, the POS tagger (Chapter 5) and morphological analyzers
(Chapter 6) are used to preprocess the sentences. Preprocessing of Tamil sentences is
the same as the factorization of Tamil sentences. Table 6.18 shows an example of a
preprocessed English and Tamil sentence.

Table 6.18 Preprocessed English and Tamil Sentence

Preprocessed English Sentence       Preprocessed Tamil Sentence
I | i | PN | prn_i                  நான் | நான் | P | null
my | my | PN | PRP$                 என்னுடைய | என் | P | poss
home | home | N | NN_to             வீட்டிற்கு | வீடு | N | DAT
vegetables | vegetable | N | NNS    காய்கறிகள் | காய்கறி | N | PL
bought | buy | V | VBD_1S           வாங்கினேன் | வாங்கு | V | PAST_1S

6.9 SUMMARY

This chapter explained the development of a morphological analyzer for the Tamil
language using a machine learning approach. Capturing the agglutinative structure of
Tamil words by an automatic system is a challenging job. Generally, rule based
approaches are used for building morphological analyzers; here, the Tamil
morphological analyzer for nouns and verbs is developed using a state-of-the-art
machine learning approach. The morphological analyzer problem is redefined as a
classification problem. This approach is based on sequence labeling and training by
kernel methods, which captures the nonlinear relationships of the morphological
features from the training data samples in a better and simpler way. An SVM based
tool is used for training the system with 6 lakh morphologically tagged verbs and
nouns. The same methodology has been implemented for other Dravidian languages like
Malayalam, Telugu and Kannada. Tamil pronouns and proper nouns are handled using
separate analyzer systems. Other word classes need not be further analyzed for
morphological features, so the POS tag information is considered as the morphological
information.

CHAPTER 7
FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL
This chapter outlines the Statistical Machine Translation system and its components.
It also explains the integration of linguistic knowledge in factored translation
models and the development of factored corpora.
7.1 STATISTICAL MACHINE TRANSLATION

The noisy channel model is used in speech recognition and machine translation. If one
needs to translate a sentence f in the source language F to a sentence e in the target
language E, the noisy channel model describes the situation as follows: the sentence f
to be translated was initially conceived in language E as some sentence e, and during
transmission the sentence e was corrupted by the channel into the sentence f. Now an
assumption is made that each sentence in E is a translation of the sentence f with
some probability, and the sentence chosen as the translation (ê) is the one that has
the highest probability, as given in equation (7.1). In mathematical terms [172],

ê = arg max_e P(e | f)                                      (7.1)

According to Bayes theorem,

P(e | f) = P(e) P(f | e) / P(f)                             (7.2)

Therefore, equation (7.1) becomes

ê = arg max_e P(e) P(f | e) / P(f)                          (7.3)

Since f is fixed, P(f) can be omitted from the maximization in equation (7.3):

ê = arg max_e P(e) P(f | e)                                 (7.4)


 
Here, P(e | f) is split into P(e) and P(f | e) according to Bayes rule. This is done
because practical translation models give high probabilities to P(f | e) or P(e | f)
when the words in f are generally translations of the words in e. For instance, when
the English sentence "He will come tomorrow" is translated to Tamil, both P(e | f)
and P(f | e) give equal probabilities to the following sentences.

அவன் நாளை வருவான். (avan wALai varuvAn)

அவன் வருவான் நாளை. (avan varuvAn wALai)

The above problem can be avoided when equation (7.4) is used instead of
equation (7.1). The second sentence will be ruled out, as the first sentence has a
much higher value of P(e), and so the first sentence will be taken into consideration
during the translation process.

This leads to another perspective on the statistical machine translation model:
the best translation is the output sentence that is both faithful to the original
sentence and fluent in the target language [173]. This is shown in equation (7.4),
where P(e) is the probability of the sentences that are likely in the English
language, i.e. the language model, which is responsible for the fluency of the
translation, and P(f | e) is the probability of the way in which the sentences in E
get translated to sentences in F, i.e. the translation model, which is responsible
for the faithfulness of the translated sentence. A block diagram of the noisy channel
model for Statistical Machine Translation (SMT) is given in Figure 7.1.

[Figure 7.1 depicts the noisy channel model: a sentence e passes through a noisy channel to become f, and decoding searches for the best ê.]

Figure 7.1 The Noisy Channel Model for Machine Translation

7.2 COMPONENTS OF SMT

The three main components in statistical machine translation are,

1. Translation model

 
2. Language model

3. The Statistical Machine Translation Decoder

7.2.1 Translation Model

A translation system must be capable of producing words that preserve the original
meaning and of arranging those words in a sequence that forms fluent sentences in the
target language. The role of the translation model is to find P(f | e), the probability
of the source sentence f given the translated sentence e. Note that it is P(f | e) that
is computed by the translation model, and not P(e | f). The training corpus for the
translation model is a sentence-aligned parallel corpus of the languages F and E.

An obvious approach is to compute P(f | e) from counts of the sentences f and e
in the parallel corpus. Again, the problem is data sparsity. The solution that is
immediately apparent is to find (or approximate) the sentence translation probability
using the translation probabilities of the words in the sentences; the word
translation probabilities in turn can be found from the parallel corpus. There is,
however, a remaining issue: the parallel corpus gives us only the sentence
alignments; it does not tell us how the words in the sentences are aligned.

A word alignment between sentences tells us exactly how each word in sentence
f is translated in e. The problem is obtaining the word alignment probabilities given
a training corpus that is only sentence aligned. This problem is solved by using the
Expectation-Maximization (EM) algorithm.

7.2.1.1 Expectation Maximization

The key intuition behind Expectation-Maximization is that if the number of times a
word aligns with another in the corpus is known, then it is easy to calculate the word
translation probabilities. Conversely, if the word translation probabilities are
known, then it is possible to find the probabilities of the various alignments.

One can thus start with uniform word translation probabilities, calculate
alignment probabilities, and then use these alignment probabilities to get better
translation probabilities. This iterative procedure, which is called the Expectation-


 
Maximization algorithm, works because words that are actually translations of each
other co-occur in the sentence-aligned corpus.
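
The following minimal Python sketch runs this iterative procedure in the spirit of
IBM Model 1 (a simplification: no NULL word, no fertility) on an invented
two-sentence corpus. Word translation probabilities t(f | e) start uniform; expected
alignment counts are collected (E-step) and renormalized (M-step).

    from collections import defaultdict

    corpus = [(["the", "house"], ["viitu"]),          # hypothetical sentence pairs
              (["the", "book"], ["puththakam"])]

    e_vocab = {e for es, _ in corpus for e in es}
    f_vocab = {f for _, fs in corpus for f in fs}
    t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

    for _ in range(10):                               # EM iterations
        count = defaultdict(float)                    # expected counts c(f, e)
        total = defaultdict(float)                    # normalizers per e
        for es, fs in corpus:
            for f in fs:                              # E-step: fractional counts
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():               # M-step: renormalize
            t[(f, e)] = c / total[e]

    print(sorted(t.items(), key=lambda kv: -kv[1])[:2])
    # t(viitu | house) and t(puththakam | book) rise to the top, because
    # "house" and "book" each co-occur with only one target word.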

7.2.1.2 Word based Translation Model

As explicitly introduced as a model parameter in the IBM formulation, word alignment
becomes a function from source positions j to target positions i, so that a(j) = i.
This definition implies that the resulting alignment solutions will never contain
many-to-many links, but only many-to-one, as only one function result is possible for
a given source position j.

Although this limitation does not account for many real-life alignment
relationships, in principle the IBM models can address it by estimating the
probability of generating the source empty word, which can translate into non-empty
target words. However, many current statistical machine translation systems do not
use the IBM model parameters in their training methods, but only the most probable
alignment (found using a Viterbi search) given the estimated IBM models. Therefore,
in order to obtain many-to-many word alignments, alignments from source-to-target and
target-to-source are usually performed, and symmetrization strategies have to be
applied.

In the word-based translation model [174], the translation elements are words.
Typically, the number of words in translated sentences differs due to compound words,
morphology and idioms. The ratio of the lengths of sequences of translated words is
called fertility, which tells how many English words each native word produces.
Simple word-based translation is not able to translate language pairs with fertility
rates different from one. To make word-based translation systems handle high fertility
rates, for instance, the system could be allowed to map a single word to multiple
words, but not vice versa. For instance, if one is translating from English to Tamil,
each word in Tamil could produce zero or more English words, but there is no way to
group two Tamil words producing a single English word.

An example of a word-based translation system is the freely available GIZA++
package, which includes the training programs for the IBM models and HMM models.
Word-based translation is not widely used today compared to phrase-based systems;
however, most phrase-based systems still use GIZA++ to align the corpus. The
alignments are then used to extract phrases or induce syntactic rules, and the word
alignment problem is still actively discussed in the community. Because of the
importance of GIZA++, there are now several distributed implementations of GIZA++
available online.

Statistical machine translation is based on the assumption that every sentence t


in a target language is a possible translation of a given sentence e in a source language.
The main difference between two possible translations of a given sentence is the
probability assigned to each, which is to be learned from a bilingual text corpus. The
first statistical machine translation models applied these probabilities to words,
therefore considering words to be the translation units of the process.

7.2.1.3 Phrase based Translation Model

In the phrase-based translation model [175], the aim is to reduce the restrictions of
word-based translation by translating whole sequences of words, whose lengths may
differ. These sequences of words are called blocks or phrases, but they are typically
not linguistic phrases; rather, they are phrases found using statistical methods from
corpora.

The job of the translation model, given a Tamil sentence T and an English
sentence E, is to assign a probability that T generates E. While one can estimate
these probabilities by thinking about how each individual word is translated, modern
statistical machine translation is based on the intuition that a better way to compute
these probabilities is by considering the behavior of phrases. The intuition of
phrase-based statistical machine translation is to use phrases, i.e., sequences of
words, as well as single words, as the fundamental units of translation.

The generative story of phrase-based translation has three steps. First, the source
words are grouped into phrases E1, E2, ..., El. Second, each Ei is translated into Ti.
Finally, each translated phrase is reordered.

The probability model for phrase-based translation relies on a translation
probability and a distortion probability. The factor ϕ(Ti | Ei) is the translation
probability of generating phrase Ti from phrase Ei. The reordering of the phrases is
captured by the distortion probability d. The distortion probability in phrase-based
translation means the probability of two consecutive Tamil phrases being separated in
English by a span of English words of a particular length. The distortion is
parameterized by d(ai − b(i−1)), where ai is the start position of the source English
phrase generated by the i-th Tamil phrase, and b(i−1) is the end position of the
source English phrase generated by the (i−1)-th Tamil phrase. One can use a very
simple distortion probability which penalizes large distortions by giving lower and
lower probability for larger distortions. The final translation model for
phrase-based machine translation is based on equation (7.5).

P(T | E) = ∏_i ϕ(Ti | Ei) d(ai − b(i−1))                    (7.5)

Phrase-based models work successfully only if the source and the target
language have almost the same word order. Differences in word order are handled in
phrase-based models by calculating distortion probabilities, and reordering is done
by the phrase-based models themselves. It has been shown that restricting the phrases
to linguistic phrases decreases the quality of translation. By the turn of the
century it became clear that in many cases specifying translation models at the level
of words turned out to be inappropriate, as much local context seemed to be lost
during translation. Novel approaches needed to describe their models in terms of
longer units, typically sequences of consecutive words, or phrases.

The translation process takes three steps:

1. The sentence is first split into phrases - arbitrary contiguous sequences of words.

2. Each phrase is translated.

3. The translated phrases are permuted into their final order. The permutation
problem and its solutions are identical to those in word-based translation.

Consider the following particular set of phrases for our example sentences:

Position      1           2       3        4
Tamil         நேற்று       நான்     அவளை     பார்த்தேன்
(romanized)   Netru       naAn    avaLai   pArththEn
English       yesterday   i       saw      her

Since the phrases do not all follow each other directly in order, the distortions are not all 1, and the probability P(E | T) can be computed as:

P(E | T) = P(yesterday | Netru) × d(1)
         × P(i | naAn) × d(1)
         × P(her | avaLai) × d(2)
         × P(saw | pArththEn) × d(2)
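To make the arithmetic concrete, the following is a minimal Python sketch of the product in equation (7.5), using toy translation probabilities and a simple exponential distortion penalty. All numeric values here are illustrative assumptions, not estimates from a corpus.

ALPHA = 0.5  # distortion base; smaller values penalize reordering more

def distortion(d):
    """d(x) = alpha^|x| -- penalizes large jumps between phrase positions."""
    return ALPHA ** abs(d)

# (Tamil phrase, English phrase, phi(E_i | T_i), distortion argument)
phrase_pairs = [
    ("Netru",     "yesterday", 0.8, 1),
    ("naAn",      "i",         0.9, 1),
    ("avaLai",    "her",       0.7, 2),
    ("pArththEn", "saw",       0.6, 2),
]

score = 1.0
for tamil, english, phi, d in phrase_pairs:
    score *= phi * distortion(d)
print(f"P(E|T) = {score:.6f}")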

Phrase-based models produce better translations than word-based models, and


they are widely used. They successfully model many local re-orderings, and individual
passages are often fluent. However, they cannot easily model long-distance reordering
without invoking the expense of arbitrary permutation.

7.2.2 Language Model

In general, the language model is used to estimate the fluency of the translated sentence. This plays an important role in the statistical approach, as it chooses the most fluent sentence, with a high value of P(e), among all possible translations generated by the translation model P(f | e). The language model can be defined as the model which estimates and assigns a probability P(e) to a sentence e. A high value will be assigned to the most fluent sentence and a low value to the least fluent sentence. The language model can be estimated from a monolingual corpus of the target language of the translation process.

For example, consider the English sentence,

“The pen is on the table.”

If it is translated to Tamil by a system without the language model, the following are a few possible translations output by the system with the translation model alone:

எழுதுகோல் மேஜையின் மேல் உள்ளது.

எழுதுகோல் மேல் மேஜையின் உள்ளது.

மேஜையின் மேல் எழுதுகோல் உள்ளது.

Even though the second and third translations look awkward to read, the probability assigned by the translation model to each sentence will be the same, as the translation model is mainly concerned with producing the best output words for each word in the source sentence e. But when the fluency and accuracy of the translation come into the picture, only the first translation of the given English sentence is correct. This problem can be very well handled by the language model, because the probability assigned by the language model to the first sentence will be greater when compared with the other two sentences.

That is,

P(எழுதுகோல் மேஜையின் மேல் உள்ளது.) > P(எழுதுகோல் மேல் மேஜையின் உள்ளது.)

P(எழுதுகோல் மேஜையின் மேல் உள்ளது.) > P(மேஜையின் மேல் எழுதுகோல் உள்ளது.)

Consider a sentence e which consists of n words w_1, w_2, ..., w_n. The naive estimation of P(e) on a monolingual corpus of the target language with N words is done as in equation (7.6):

P(e) = \prod_{i=1}^{n} P(w_i)    (7.6)

where

P(w_i) = \frac{count(w_i)}{N}    (7.7)

The above equation (7.6) will assign a zero probability to the sentence e if a word in the sentence has not occurred in the monolingual corpus. This in turn will affect the accuracy of the translation process. In order to overcome this problem, the probabilities estimated from the corpus have to be approximated, and this can be done by using n-gram language models.

7.2.2.1 N-gram Language Models

One of the methods that has been widely used for language modelling is the n-gram language model. In general, n-gram language models are based on statistics of how likely words are to follow each other. For the above example, when analysing a corpus to determine the probability of the sentence using an n-gram language model, the probability of the word மேல் following the word மேஜையின் will be greater than that of the other words.

In n-gram language modelling, the process of predicting a word sequence W is broken up into predicting one word at a time. Thus the probability is decomposed using the chain rule as in equation (7.8):

P(w_1, w_2, \ldots, w_n) = P(w_1) \, P(w_2 \mid w_1) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})    (7.8)

The language model probability can be defined as the probability of a word given a history of preceding words. In order to estimate these probability distributions for words, it is possible to limit the history to m words as in equation (7.9):

P(w_n \mid w_1, w_2, \ldots, w_{n-1}) = P(w_n \mid w_{n-m}, \ldots, w_{n-2}, w_{n-1})    (7.9)

A chain in which only a limited history is considered is called a Markov chain, and the number of previous words considered is termed the order of the model. This follows from the Markov assumption, which states that only a limited number of previous words affect the probability of the next word. The assumption can be shown to be wrong with counter-examples where a longer history is needed. Typically, the order of the language model is based on the amount of training data available: limited data restricts the model to short histories, i.e., a small order. Generally, trigram language models are used, but language models of smaller order, such as unigrams and bigrams, as well as models of higher orders, are also used. In most cases this depends mainly on the amount of data from which the language model probabilities are estimated.

The language model estimates the probability P(e) of a sequence of words {w_1, w_2, ..., w_m} in a sentence e using the n-gram approach as the product of conditional probabilities of each word w_i given the previous N-1 words. In other words, an n-gram model [43] can be defined as the probability of a word given the previous N-1 words instead of the probability of a word given all the previous words. Thus the probability assigned by the language model using the n-gram approach to a sequence of words {w_1, w_2, ..., w_m} in sentence e is given by equation (7.10):

P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})    (7.10)

In the n-gram approach, the probability of each word w_i is conditioned only on the previous N-1 words. Though the n-gram approach is simple, it has been incorporated in many applications such as speech recognition, spell checking, translation and many other tasks where language modelling is required. In general, an n-gram model that generates the probability by taking account of the word and the previous one word is termed a bigram model, and with the previous two words, a trigram model. Language model probabilities with the n-gram approach can be directly calculated from a monolingual corpus. The equation for calculating trigram probabilities is given by equation (7.11):

P(w_n \mid w_{n-2}, w_{n-1}) = \frac{count(w_{n-2} w_{n-1} w_n)}{\sum_{w} count(w_{n-2} w_{n-1} w)}    (7.11)

Here count(w_{n-2} w_{n-1} w_n) denotes the number of occurrences of the sequence w_{n-2} w_{n-1} w_n in the corpus. The denominator on the right-hand side sums, over all words w in the corpus, the number of times w_{n-2} w_{n-1} occurs before any word. Since this is just count(w_{n-2} w_{n-1}), equation (7.11) can be written as in equation (7.12):

P(w_n \mid w_{n-2}, w_{n-1}) = \frac{count(w_{n-2} w_{n-1} w_n)}{count(w_{n-2} w_{n-1})}    (7.12)
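As an illustration of equation (7.12), the following Python sketch estimates maximum-likelihood trigram probabilities from a toy corpus; a real language model would add smoothing (e.g. modified Kneser-Ney) for unseen n-grams, and the two sentences below are assumed purely for the example.

from collections import Counter

corpus = [
    "the pen is on the table".split(),
    "the book is on the table".split(),
]

trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>", "<s>"] + sent + ["</s>"]
    for i in range(2, len(tokens)):
        trigrams[tuple(tokens[i-2:i+1])] += 1
        bigrams[tuple(tokens[i-2:i])] += 1

def p_trigram(w, w1, w2):
    """P(w | w1, w2) = count(w1 w2 w) / count(w1 w2), as in equation (7.12)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w)] / bigrams[(w1, w2)]

print(p_trigram("table", "on", "the"))  # 1.0 in this toy corpus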

7.2.3 The Statistical Machine Translation Decoder

The statistical machine translation decoder performs decoding, which is the process of finding a target translated sentence for a source sentence using the translation model and the language model. In general, decoding is a search problem that maximizes the translation and language model probability. Statistical machine translation decoders use best-first search based on heuristics. In other words, the decoder is responsible for the search for the best translation in the space of possible translations. Given a translation model and a language model, the decoder constructs the possible translations and looks for the most probable one. There are numerous decoders for statistical machine translation; among them are greedy decoders and beam-search decoders. In greedy decoders, the initial hypothesis is a word-to-word translation which is refined iteratively using hill-climbing heuristics. Beam-search decoders use a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set.
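The following is a toy Python sketch of beam-style decoding under the simplifying assumptions of monotone (no-reordering) translation and a translation-model-only score; the phrase table, romanized words and log-probabilities are invented for illustration, and a real decoder such as Moses also adds language-model and distortion scores.

import heapq

PHRASE_TABLE = {            # source phrase -> [(target phrase, log prob)]
    ("naan",): [("i", -0.2)],
    ("pallikku",): [("to school", -0.5), ("school", -1.0)],
    ("senREn",): [("went", -0.3)],
}
BEAM = 2
source = ["naan", "pallikku", "senREn"]

# hypothesis: (score, number of source words covered, output words)
beam = [(0.0, 0, [])]
while any(cov < len(source) for _, cov, _ in beam):
    candidates = []
    for score, cov, out in beam:
        if cov == len(source):          # already complete; keep as-is
            candidates.append((score, cov, out))
            continue
        for span in range(1, len(source) - cov + 1):
            src = tuple(source[cov:cov + span])
            for tgt, logp in PHRASE_TABLE.get(src, []):
                candidates.append((score + logp, cov + span, out + tgt.split()))
    beam = heapq.nlargest(BEAM, candidates, key=lambda h: h[0])  # prune

best = max(beam, key=lambda h: h[0])
print(" ".join(best[2]), best[0])  # "i to school went" with score -1.0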

7.3 INTEGRATING LINGUISTIC INFORMATION IN SMT

7.3.1 Factored Translation Models

Factored translation models [10] can be defined as an extension of phrase-based models where every word is substituted by a vector of factors such as word, lemma, part-of-speech information, morphology, etc. Here, the translation process becomes a combination of pure translation and generation steps. Figure 7.2 provides a simple block diagram to illustrate the translation and generation steps.

Factored translation models differ from the standard phrase-based models in the following ways [10]:

• The parallel corpus must be annotated with factors such as lemma, part-of-
speech, morphology, etc., before training.

• Additional language models for every factor annotated can be used in training
the system.

• Translation steps will be similar to standard phrase-based systems, but generation steps imply training only on the target side of the corpus.

• Models corresponding to the different factors and components are combined in a log-linear fashion.

Figure 7.2 Block Diagram for Factored Translation

The current state-of-the-art approaches to statistical machine translation, so-called phrase-based models, are limited to the mapping of small text chunks (phrases) without any explicit use of linguistic information. Such additional information has been demonstrated to be valuable by integrating it in pre-processing or post-processing.

Therefore, a framework is developed for statistical translation models to integrate additional information. This framework is an extension of the phrase-based approach. It adds additional annotation at the word level.

In baseline SMT, the word house is completely independent of the word houses. Any instance of house in the training data does not add any knowledge to the
translation of houses. In the extreme case, while the translation of house may be known to the model, the word houses may be unknown and the system will not be able to translate it. While this problem does not show up as strongly in English, due to the very limited morphological production in English, it is a significant problem for morphologically rich languages.

Thus, it may be preferable to model translation between morphologically rich languages on the level of lemmas, thus pooling the evidence for different word forms that derive from a common lemma. In such a model, lemma and morphological information are translated separately, and this information is combined on the output side to ultimately generate the output surface words. Such a model can be defined straightforwardly as a factored translation model.

7.3.1.1 Decomposition of Factored Translation

The translation of factored representations of input words into the factored


representations of output words is broken up into a sequence of mapping steps that
either translate input factors into output factors, or generate additional output factors
from existing output factors.

In this model the translation process is broken up into the following three mapping steps (a small sketch follows the list):

• Translate input lemmas into output lemmas

• Translate morphological and POS factors

• Generate surface forms given the lemma and linguistic factors
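As an illustration only (not the thesis implementation), the toy Python sketch below walks through the three mapping steps with table entries assumed for the example; the romanized pair follows the factored sentence pair in Table 7.1.

LEMMA_T = {"go": "cel"}                    # step 1: English lemma -> Tamil lemma
MORPH_T = {"vb.past_1S": "PAST_1S"}        # step 2: English morph factor -> Tamil tags
GENERATE = {("cel", "PAST_1S"): "cenREn"}  # step 3: (lemma, tags) -> surface form

def translate_factored(lemma, morph):
    t_lemma = LEMMA_T[lemma]                  # translate input lemma
    t_morph = MORPH_T[morph]                  # translate morphological factor
    return GENERATE[(t_lemma, t_morph)]       # generate the surface form

print(translate_factored("go", "vb.past_1S"))  # -> cenREn ("went")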

7.3.2 Syntax based Translation Models

Syntax-based translation models [176] use parse-tree representations of the sentences in the training data to learn, among other things, tree transformation probabilities. These methods require a parser for the target language and, in some cases, the source language too. Yamada and Knight (2001) propose a model that transforms target language parse trees into source language strings by applying reordering, insertion, and translation operations at each node of the tree. In general, this model incorporates syntax on the source and/or target side.

Graehl et al. [177] and Melamed [178] propose methods based on tree-to-tree mappings. Imamura et al. (2005) [179] present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation. Recently, various preprocessing approaches have been proposed for handling syntax within statistical machine translation. These algorithms attempt to reconcile the word order differences between the source and target language sentences by reordering the source language data prior to the SMT training and decoding cycles.

Approaches in syntax-based models:

• Syntactic phrase-based models based on tree transducers:

o Tree-to-string: build mappings from target parse trees to source strings.

o String-to-tree: build mappings from target strings to source parse trees.

o Tree-to-tree: build mappings from parse trees to parse trees.

• Synchronous grammar formalisms that learn a grammar which can simultaneously generate both trees:

o Syntax-based: respect linguistic units in translation.

o Hierarchical phrase-based: respect phrases in translation.

7.4 TOOLS USED IN SMT SYSTEM

7.4.1 MOSES

Moses is an open-source toolkit for statistical machine translation. Moses is an extended phrase-based machine translation system with factors and confusion network decoding. Morphological, syntactic and semantic information can be integrated as factors during training. The confusion network allows the translation of ambiguous sentences. This enables, for instance, the strong integration of speech recognition and machine translation: instead of passing along the one best output of the recognizer, a network of different word choices may be examined by the machine translation system [180]. Moses has an efficient data structure that allows memory-intensive translation models and language models to exploit larger data resources with limited hardware. It implements an efficient representation of the phrase translation table using a prefix tree structure, which allows loading only the fraction of the phrase table into memory that is needed to translate the test sentences. Moses uses a beam-search algorithm that quickly finds the highest probability translation among the exponential number of choices.

• Moses offers two types of translation models: phrase-based and tree-based.
• Moses features factored translation models, which enable the integration of linguistic and other information at the word level.
• Moses allows the decoding of confusion networks and word lattices, enabling easy integration with ambiguous upstream tools, such as automatic speech recognizers or morphological analyzers.

The development of Moses is mainly supported under the EuroMatrix,


EuroMatrixPlus, and LetsMT projects funded by the European Commission under
Framework Program 6 and 7.

7.4.2 GIZA++ & MKCLS

GIZA++ is a statistical machine translation toolkit that is used to train IBM models 1-5 and an HMM word alignment model. It is an extension of GIZA, which was designed as part of the SMT toolkit EGYPT. It includes IBM models 3 and 4. It uses the mkcls [181] tool for unsupervised word classification to help the models. It also implements the HMM alignment model and various smoothing techniques for fertility, distortion and alignment parameters. A bilingual dictionary is built by this tool from the bilingual corpus. More details about GIZA++ can be found in [182].

7.4.3 SRILM

SRILM is a toolkit for language modelling that can be used in speech recognition, statistical tagging and segmentation, and statistical machine translation. It is a freely available collection of C++ libraries, executable programs, and supporting scripts. It can build and manage language models. SRILM implements various smoothing algorithms such as Good-Turing, absolute discounting, Witten-Bell and modified Kneser-Ney. Besides the standard word-based n-gram back-off models, SRILM implements several other language model types [182], such as word-class based n-gram models, cache-based models, disfluency and hidden event language models, HMMs of n-gram models and many more.

7.5 DEVELOPMENT OF FACTORED CORPORA

7.5.1 Parallel Corpora Collection

Corpus is the term used in linguistics for a (finite) collection of texts (in a specific language); its plural is corpora. A collection of documents in more than one language is called a multilingual corpus. A parallel corpus is a collection of texts in different languages where one of them is the original text and the others are its translations. A bilingual corpus is a collection of texts in two different languages where each one is a translation of the other.

Parallel corpora are very important resources for tasks in the translation field, such as linguistic studies, information retrieval systems development and natural language processing. In order to be useful, these resources must be available in reasonable quantities, because most application methods are based on statistics. The quality of the results depends greatly on the size of the corpora, which means robust tools are needed to build and process them. Alignment at the sentence and word levels makes parallel corpora both more interesting and more useful.

Aligned bilingual corpora have proved useful in many ways, including machine translation, sense disambiguation and bilingual lexicography. Parallel sentences for the English-Tamil language pair are available, but not abundantly. In European countries, parallel data for many European language pairs are available from the proceedings of the European Parliament. But in the case of Tamil, no such parallel data are readily available. Hence English sentences have to be collected and manually translated to Tamil in order to create a bilingual corpus for the English-Tamil language pair. Even when parallel data are available for the English-Tamil language pair, there are chances that they might not be aligned properly, and the paragraphs have to be separated into individual sentences (for example, newspaper corpora). This employs a lot of human resources. This is time-intensive work, and as the corpus is the main resource for the statistical machine translation system, more time and importance have to be given to developing a bilingual corpus for the English-Tamil language pair. During manual translation of English sentences to Tamil, terminology data banks for the English-Tamil language pair were found to be very useful.

7.5.2 Monolingual Corpora Collection


The situation for developing a bilingual corpus for the English-Tamil language pair is not the same as for the development of a monolingual corpus for Tamil. Tamil data is available in the form of news on many websites of Tamil newspapers, so it is not a tedious job to develop a monolingual corpus for the Tamil language. But some human resource is necessary to perform pre-processing to manually remove unnecessary words or characters from the data.

7.5.3 Automatic Creation of Factored Corpora


Before the bilingual corpus of the English-Tamil language pair and the monolingual corpus of Tamil are provided to the statistical machine translation decoder and the language modelling kit SRILM, respectively, for training the system to create translation models and language models, both corpora have to be pre-processed. They are tokenized in order to separate words and punctuation (i.e., 'coming,' will be separated as 'coming' and ',' with a space in between), and lowercased so that words that differ only in case are treated as a single word (for example, 'He' and 'he', if not lowercased, will be considered different entities by the statistical systems; this problem is avoided by lowercasing). In some cases the corpus is also cleaned, removing sentences that exceed the maximum length of parallel sentences to be considered in the corpus. Cleaning is not necessary for the monolingual corpus of Tamil.

Pre-processing plays a major role in creating factored training corpora. For English, reordering and compounding steps are used for creating the factored corpora. For Tamil, linguistic tools such as the POS tagger and morphological analyser are used in the creation of factored corpora. Factored parallel sentences are given in Table 7.1.

Table 7.1 Factored Parallel Sentences

Factored English Sentence:
I|i|PN|prp_i school|school|N|nn_to went|go|V|vb.past_1S
Factored Tamil Sentence:
நான்|நான்|PN|null பள்ளிக்கு|பள்ளி|NN|DAT சென்றேன்|செல்|V|PAST_1S .|.|.|.

Factored English Sentence:
I|i|PN|prp_i a|a|AR|det book|book|N|nn_ACC him|him|PN|prp_to gave|give|V|vb.past_1S
Factored Tamil Sentence:
நான்|நான்|PN|null அவருக்கு|அவர்|PN|DAT ஒரு|ஒரு|AD|null புத்தகத்தை|புத்தகம்|MN|ACC கொடுத்தேன்|கொடு|V|PAST-1S .|.|.|.

Factored English Sentence:
the|the|AR|det cat|cat|N|nn rat|rat|N|nn_ACC killing|kill|V|vb.prog_3SN_was
Factored Tamil Sentence:
பூனை|பூனை|NN|INS எலியால்|எலி|NN|null கொல்லப்பட்டிருக்கும்|கொல்|V|VP-IRU-UM .|.|.|.
7.6 FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE


Factored translation is an extension of phrase-based statistical machine translation that allows the integration of additional morphological and lexical information, such as lemma, word class, gender, number, etc., at the word level on the source and the target languages. In SMT, three key components are used: translation modeling, language modeling and decoding. These components are implemented using the GIZA++, SRILM and Moses toolkits.

GIZA++ is a statistical machine translation toolkit that is used to train IBM models 1-5 and an HMM word alignment model. It is an extension of GIZA, which was designed as part of the SMT toolkit. SRILM is a toolkit for language modeling that can be used in speech recognition, statistical tagging and segmentation, and statistical machine translation. Moses is an open-source statistical machine translation toolkit that allows translation models to be trained automatically for any language pair; all that is needed is a collection of translated texts (a parallel corpus).

An efficient search algorithm quickly finds the highest probability translation among the exponential number of choices. Morphological, syntactic and semantic information can be integrated as factors during training. Figure 7.3 explains the mapping of English factors to Tamil factors in factored SMT. Initially, the English factors "Lemma" and "Minimized-POS" are mapped to the Tamil factors "Lemma" and "M-POS"; then the "Minimized-POS" and "Compound-Tag" factors of English are mapped to the "Morphological information" factor of Tamil.

Figure 7.3 Mapping English factors to Tamil Factors

Here, the important point is that Tamil surface word forms are not generated in the SMT decoder. Only factors are generated by SMT, and the word is generated in the post-processing stage. A Tamil morphological generator is used in post-processing to generate a Tamil surface word from the output factors.

7.6.1 Building Language Model


The SRILM language modelling kit can be used to build an n-gram language model from the monolingual corpus of Tamil. A program, 'ngram-count', in SRILM can be used to generate n-gram language models of any order by specifying optional parameters such as interpolation, modified Kneser-Ney smoothing, absolute discounting, Good-Turing smoothing and Witten-Bell smoothing for unseen n-grams. The output is a language model file that contains the n-gram probabilities of each word in the monolingual corpus [183]. The general syntax for executing 'ngram-count' in SRILM is:

218

 
>ngram-count -order n -[options] -text CORPUS_FILE -lm LM_FILE

Where,

order n – the order of the n-gram language model, where 'n' denotes the order of the n-gram model.

[options] – various switches, such as interpolate, kndiscount, ndiscount and so on, that can be used while generating the language model file.

text – the file name of the monolingual corpus file.

lm – the file name of the language model file to be created by the program.
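For instance, a trigram language model with interpolation and modified Kneser-Ney discounting might be built as follows (the file names are placeholders):

>ngram-count -order 3 -interpolate -kndiscount -text tamil.mono.txt -lm tamil.lm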

7.6.2 Building Phrase based Translation Model


To build a phrase-based translation model, the perl script 'train-model.perl' in Moses is used. The train-model.perl script involves the following steps:

• Prepare the data: The parallel corpus is converted into a format that is suitable for the GIZA++ toolkit. Two vocabulary files are generated and the parallel corpus is converted into a numbered format. The vocabulary files contain words, integer word identifiers and word count information. GIZA++ also requires words to be placed into word classes; this is done by automatically calling the mkcls program. Word classes are only used for the IBM reordering model in GIZA++.

• Run GIZA++: GIZA++ is a freely available implementation of the IBM models. It is required as an initial step to establish word alignments. Our word alignments are taken from the intersection of bidirectional runs of GIZA++, plus some additional alignment points from the union of the two runs. Running GIZA++ is the most time-consuming step in the training process, and it also requires a lot of memory. GIZA++ learns the translation tables of IBM Model 4, but what is required here is the word alignment file.

• Aligning words: To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. The default heuristic grow-diag-final starts with the intersection of the two alignments and then adds additional alignment points. Other possible alignment methods are intersection, grow, grow-diag, union, srctotgt and tgttosrc. Alternative alignment methods can be specified with the switch alignment.

• Get lexical translation table: Given the word alignment, it is quite straightforward to estimate a maximum likelihood lexical translation table; w(e | f) and the inverse w(f | e) are estimated from the word alignments.

• Extract phrases: In the phrase extraction step, all phrases are dumped into one big file. Each line of this file contains: foreign phrase, English phrase, and alignment points. Alignment points are pairs (English, Tamil). Also, an inverted alignment file extract.inv is generated, and if the lexicalized reordering model is trained (the default), a reordering file extract.o.

• Score phrases: Subsequently, a translation table is created from the stored phrase translation pairs. The two steps are separated because, for larger translation models, the phrase translation table does not fit into memory. Therefore, there is no need to store the phrase translation table in memory; it can be constructed on disk. To estimate the phrase translation probability φ(e | f) one proceeds as follows: first, the extract file is sorted. This ensures that all English phrase translations for a foreign phrase are next to each other in the file. Thus the file can be processed one foreign phrase at a time, collecting counts and computing φ(e | f) for that foreign phrase f. To estimate φ(f | e), the inverted file is sorted, and then φ(f | e) is estimated one English phrase at a time. Besides the phrase translation probability distributions φ(f | e) and φ(e | f), additional phrase translation scoring functions can be computed, e.g. lexical weighting, word penalty, phrase penalty, etc. Currently, lexical weighting is added for both directions and a fifth score is the phrase penalty. In total, five different phrase translation scores are computed: the phrase translation probability φ(f | e), lexical weighting lex(f | e), the phrase translation probability φ(e | f), lexical weighting lex(e | f) and the phrase penalty (always exp(1) = 2.718).

• Build reordering model: By default, only a distance-based reordering model is included in the final configuration. This model gives a cost linear in the reordering distance; for instance, skipping over two words costs twice as much as skipping over one word. Possible configurations are msd-bidirectional-fe (the default), msd-bidirectional-f, msd-fe, msd-f, monotonicity-bidirectional-fe, monotonicity-bidirectional-f, monotonicity-fe and monotonicity-f.

• Build generation model: The generation model is built from the target side of the parallel corpus. By default, forward and backward probabilities are computed. If the switch generation-type single is used, only the probabilities in the direction of the step are computed.

• Create configuration file: As a final step, a configuration file for the decoder is generated with all the correct paths for the generated models and a number of default parameter settings. This file is called model/moses.ini.
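For reference, a typical invocation covering the steps above might look like the following; the directory and file names are placeholders, and the exact option spellings can vary with the Moses version:

>train-model.perl -root-dir train -corpus corpus/clean -f en -e ta \
    -alignment grow-diag-final -reordering msd-bidirectional-fe \
    -lm 0:4:/path/to/tamil.lm:0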

7.7 SUMMARY

This chapter described the factored translation model, which extends the phrase-based model to incorporate linguistic information as additional factors in the representation of words. The unavailability of large training data increases the advantage of using word-level linguistic information in more linguistically motivated models. Mapping translation factors in the factored model aids the disambiguation of source words and improves the grammaticality of the target factors. SMT's generation model is not utilized in this translation system; the model is tuned only for producing the lemma, POS tag and morphological factors. It is shown that the developed system improves translation over a baseline system and other factored systems.


CHAPTER 8
POSTPROCESSING FOR ENGLISH TO TAMIL SMT

8.1 GENERAL

The aim of Natural Language Processing (NLP) is to study the problems in the automatic generation and understanding of natural languages. Computational models are built for analyzing and generating natural languages. Tamil is a morphologically rich and agglutinative language [134], so Tamil words are inflected with various grammatical features through suffixation. A Tamil verb specifies almost everything: gender, number and person markings, and also auxiliaries which represent mood and aspect [13]. A Tamil noun inflects for plural and case suffixes and postpositions. In the Tamil language, the lemma undergoes morphological change when it gets attached to certain morphemes.

Post-processing transforms the translated output from the SMT system into a standard target language sentence. In pre-processing, every Tamil word as well as every English word is segmented into root and morphological information. In contrast, post-processing generates the Tamil word from the root word and morphological information.

The Tamil morphological generator is utilized in the post-processing stage of the English to Tamil machine translation system. The morphological generator takes a lemma, POS category and morpho-lexical description as input and gives a word-form as output; it is the reverse process of a morphological analyzer. In any natural language generation system, the morphological generator is an essential component of the post-processing stage. Based on the syntactic category of a Tamil word, different approaches are followed for generating a word form. Morphological generators are developed for nouns, verbs and pronouns. The morphological generator systems for verbs and nouns are implemented using a new paradigm-based algorithm, which is simple and efficient. A paradigm classification is done for nouns and verbs based on Dr. S. Rajendran's paradigm classification [13]. Tamil verbs are classified into 32 paradigms and nouns are classified into 25 paradigms. Tables 6.6 and 6.7 show the verb and noun paradigms for Tamil.


The proposed morphological generator algorithm requires only a minimal amount of data for generating a word form. It handles the morpho-phonemic changes without using any hand-coded rules, so this approach can be easily applied to less-resourced and morphologically rich languages. For generating pronoun word forms, a separate morphological generator is developed using a pattern matching approach. The noun morphological generator is used for handling proper nouns. Other categories like adjectives and adverbs are treated based on their suffixes and part-of-speech tags.

8.2 MORPHOLOGICAL GENERATOR

A morphological generator is either an individual module or is integrated with several NLP applications like machine translation (MT), automatic sentence generation, information retrieval, etc. An automated machine translation system requires a morphological analyzer for the source language and a morphological generator for the target language. A well-established approach to morphological generation uses Finite State Transducers (FST). A letter-transducer based morphological analyzer and generator was developed by Alicia Garrido [184]. Perez Aguiar used an intuitive pattern-matching approach for developing a morphological generator for the Spanish language [184]. Guido Minnen and his team developed a morphological generator based on finite state techniques, implemented using the widely available Unix Flex utility [185].

For Indian languages, many attempts have been made to build morphological generators. A Hindi morphological generator has been developed based on a data-driven approach [186]. Tel-More, a morphological generator for Telugu, is based on linguistic rules and implemented as a Perl program [187]. A morphological generator has been designed for the syntactic categories of Tamil using paradigms and sandhi rules [72]. Finite state machines have also been used for developing a morphological generator for Tamil [131].

8.2.1 Challenges in Tamil Morphological Generator

Tamil is morphologically rich and is an agglutinative language. Each verb is inflected in more than ten thousand forms including auxiliaries and clitics. Inflection includes finite, infinitive, adjectival, adverbial and conditional forms of verbs. In generation, the inflections vary from one set of verbs to another, so it is difficult to translate/generate the required word form of Tamil verbs. To resolve this complexity, a classification of Tamil verbs based on tense markers and inflections is made.

Normally, rule-based approaches are used for developing a morphological generator system. In such an approach, the paradigm number of the given root word is identified using a lookup table. The lookup table contains word-class lemmas and their corresponding paradigm numbers. If the given lemma is not present in the lookup table then the system fails to identify its paradigm. It is difficult to create a lookup table with all dictionary words, as all the proper nouns and compound word forms are not easy to include in a dictionary. This challenging task can be solved if the system can automatically identify the paradigm number of the word. When the lemma is given as input, the proposed system automatically identifies the paradigm number based on its end characters.

Another challenging task in Tamil word generation is handling the morpho-phonemic changes. A morpho-phonemic change represents the modification that occurs when an inflection is attached to a root word. The proposed morphological generator system solves this challenging task by using stemming rules and suffixes. Morpho-phonemic changes are shown in Table 8.1. Suffixes play a major role in the proposed Tamil morphological generator system, and the creation of these suffixes for all the word forms is also a challenging job.

Table 8.1 Morpho-phonemic Changes

Word + Inflection           Morpho-phonemic Change    Word-form
பூ+கள் (pU+kaL)              பூ+க்+கள்                   பூக்கள் (pUkkaL)
கத்தி+ஆல் (kaththi+Al)       கத்தி+ய்+ஆல்                கத்தியால் (kaththiyAl)
படம்+ஐ (padam+ai)            பட+த்+ஐ                    படத்தை (padaththai)


8.2.2 Simplified Part-of-Speech Categories

The input to the morphological generator system is a factored sentence from the SMT output. The factored Tamil sentence is categorized according to the simplified POS tag; the simplified POS tagset is shown in Table 8.2. Based on this simplified tag factor, the morphological generator generates the word-form. The morphological generator for nouns handles proper nouns and common nouns, while the generation of Tamil verb forms is taken care of by the morphological generator for verbs.

Table 8.2 Simplified POS Tagset

Figure 8.1 shows the organization of the Tamil sentence generation system. The system contains five different modules. The morphological generators for Tamil nouns and verbs are developed using a suffix-based approach. Tamil pronouns come under the 'closed word-class' category, so a pattern matching technique is followed for generating pronominal word forms.

Figure 8.1 Tamil Sentence Generation (the Tamil sentence generator comprises morphological generators for verbs, nouns, pronouns and other categories)


8.3 MORPHOLOGICAL GENERATOR FOR TAMIL NOUNS


AND VERBS

This section describes a new, simple morphological generator algorithm for Tamil. Generally, a morphological generator tool is developed using a rule-based approach, which requires a set of morpho-phonemic (spelling) rules and a morpheme dictionary. The method proposed here can be applied to any morphologically rich language. In this novel approach, morpheme and word dictionaries are not required; the algorithm only needs the suffix table and the code for paradigm classification. If the lemma, POS category and Morpho-Lexical Inflection (MLI) are given, the proposed algorithm will generate the intended word form.

The morphological generator receives input in the form lemma+word_class+morpho-lexical information, where lemma specifies the root word of the word-form to be generated, word_class specifies the grammatical (POS) category and the morpho-lexical information specifies the type of inflection. The word class information is used to decide whether the particular word is a noun or a verb. The Morpho-lexical Information (MLI) has been extracted from the morphological analyzer tool for Tamil. Examples for the Tamil morphological generator system are given below.

Lemma + WC + MLI = WORD FORM

ஓடு + V + FT_3SM = ஓடுவான்
Odu + V + FT_3SM = OduvAn (run)

காடு + N + ACC = காட்டை
kAdu + N + ACC = kAddai (forest)

In the above examples, "V" represents verb and "FT_3SM" represents future tense with third person singular masculine; "N" symbolizes noun and ACC means accusative case. FT_3SM and ACC are called Morpho-lexical Information (MLI).

Three different modules are used to build the noun and verb generator system. The first module takes the lemma and POS category as input and gives the lemma's paradigm number and the word's stem as output. The second module takes the morpho-lexical information as input and gives its index number as output. The third module uses a suffix table to generate the word form from the information provided by the above two modules.

8.3.1 Algorithm for Noun and Verb Morphological Generator

This subsection illustrates the new algorithm developed for the morphological generator system. The algorithm is implemented as a Java program and is shown in Figure 8.2.

Input = (Lemma +word class + morpho-lexical Information)

lemma,wc,morph =SPLIT(Input)

roman_lemma=ROMAN(lemma)

parnum=PARNUM(roman_lemma,wc)

col-index=parnum

row-index=INDEX(morph,wc)

suff=SUFFIX-TABLE[row-index][col-index]

stem=STEM(roman_lemma,wc,parnum)

word=JOIN(stem,suff)

output=UNICODE(word)

Figure 8.2 Algorithm for Morphological Generator

In this algorithm, lemma represents the root word of the word form, wc denotes the word class and morph stands for the morpho-lexical information. The given input is divided into lemma, word class and morpho-lexical information using the SPLIT function. The lemma (root word), which is in Unicode form, is romanized using the function ROMAN; roman_lemma represents the romanized lemma. parnum represents the paradigm number of the lemma; the PARNUM function identifies the paradigm number for the given lemma using its end suffixes. The romanized lemma and the paradigm number are given as input to the STEM function along with the word class; this function finds the stem of the root word. The given morpho-lexical information is matched against the morpho-lexical information list, and the corresponding index number is retrieved; this index number is referred to as the row-index. The paradigm number of the input lemma is named the col-index. Using the row and column indices, the suffix part is retrieved from the suffix table. The stem and the retrieved suffix are joined to generate the word form, which is then converted to Tamil Unicode form.
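The following Python sketch mirrors this pipeline with toy data (the actual system is implemented in Java with the full suffix tables); the suffix strings, indices and stemming rules below are illustrative assumptions chosen so that the two examples from Section 8.3 come out correctly.

PARADIGM_SUFFIXES = {"i": 8, "u": 15}      # end suffix -> paradigm number
MLI_INDEX = {"PAST_3SM": 0, "FT_3SM": 1}   # morpho-lexical info -> row index
SUFFIX_TABLE = {                           # (row-index, paradigm) -> suffix
    (0, 8): "ththAn", (1, 8): "ppAn",
    (0, 15): "inAn",  (1, 15): "uvAn",
}
STEM_STRIP = {8: "", 15: "u"}              # end characters removed for stemming

def generate(roman_lemma, mli):
    # PARNUM: the longest matching end suffix decides the paradigm
    paradigm = next(PARADIGM_SUFFIXES[s]
                    for s in sorted(PARADIGM_SUFFIXES, key=len, reverse=True)
                    if roman_lemma.endswith(s))
    # STEM: remove the paradigm's stemming characters from the lemma
    strip = STEM_STRIP[paradigm]
    stem = roman_lemma[: -len(strip)] if strip else roman_lemma
    # JOIN: attach the suffix found by row (MLI) and column (paradigm)
    return stem + SUFFIX_TABLE[(MLI_INDEX[mli], paradigm)]

print(generate("padi", "PAST_3SM"))  # padiththAn
print(generate("Odu", "FT_3SM"))     # OduvAn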

Figure 8.3 Architecture of Tamil Morphological Generator

Figure 8.3 shows the architectural view of the Tamil morphological generator system. The morphological generator system needs to process three major components: first the lemma part, then the word class and finally the morpho-lexical information. The way the generator is implemented makes it distinct from other generator systems. The input, which is in Tamil Unicode form, is first romanized and then the paradigm number is identified using the end characters. The romanized form is used for easy computation and efficient processing. The morpho-lexical information of the required word form is given as input. From the morpho-lexical information list the index number of the corresponding input is identified; this is referred to as the row-index. Based on the word class specified, the system uses the corresponding suffix table. In the two-dimensional suffix table, rows are morpho-lexical information indices and columns are paradigm numbers. For each paradigm, a complete set of morphological inflections corresponding to the morpho-lexical information list is created. Finally, using the column index and row index, the morphological suffix is retrieved from the suffix table. This suffix is affixed to the stem to generate the word form.

8.3.2 Word-forms Handled in Morphological Generator

The Tamil verb morphological generator is capable of generating more than ten thousand forms of a single Tamil verb. Some of the word-forms are shown in Table 8.3; the remaining word forms are listed in Appendix-B. The noun morphological generator system handles nearly three hundred word forms including postpositions. These noun word-forms are also given in Appendix-B.

Table 8.3 Verb and Noun Word-forms

Verb Word forms       Noun Word forms
படி                    நரம்பு
படித்தான்               நரம்பை
படித்தாள்               நரம்பினை
படித்தார்               நரம்பினத்தை
படித்தார்கள்            நரம்போடு
படித்தது               நரம்பினோடு
படித்தன                நரம்பினத்தோடு
படித்தாய்               நரம்பினால்
படித்தீர்               நரம்பால்
படித்தீர்கள்            நரம்புக்கு
படித்தேன்               நரம்பிற்கு
படித்தோம்               நரம்பின்
படிக்கிறான்             நரம்பது
படிக்கிறாள்             நரம்பினது
படிக்கிறார்             நரம்பின்கண்
படிக்கிறார்கள்          நரம்பதுகண்
படிக்கின்றது            நரம்புக்காக
படிக்கின்றன             நரம்பாலான
படிக்கின்றாய்           நரம்புடைய
படிக்கின்றீர்           நரம்பினுடைய
படிக்கின்றீர்கள்        நரம்பில்
படிக்கின்றீர்கள்        நரம்பினில்

8.3.3 Data Required for the Algorithm

The Tamil morphological generator for nouns and verbs requires the following resources for generating a word form:

i. Morpho-Lexical Information (MLI)
ii. Paradigm classification rules
iii. Suffix table
iv. Stemmer

8.3.3.1 Morpho Lexical Information File

The morphological features of a word form are considered as morpho-lexical information. For example, the morpho-lexical information of the word படித்தான் (padiththAn) is PAST_3SM. The Morpho-Lexical Information (MLI) file is the collection of possible morphological patterns of a particular word class. Nouns and verbs need separate MLI files. An example MLI file is shown in Table 8.4; all the patterns in the file are given in Appendix-B. According to the different MLI patterns, a suffix table is created.

Table 8.4 MLI file for Tamil Verbs

PT+3SM PRT+1P
PT+3SF FT+3SM
PT+3SE FT+3SF
PT+3SE+PL FT+3SE
PT+3SN FT+3SE+PL
PT+NOM_athu FT+3SN
PT+RP+3SN FT+RP+3SN
PT+3PN FT+NOM_athu
PT+RP+3PN FT+3PN
PT+NOM_ana FT+RP+3PN
PT+2S FT+NOM_ana
PT+2EH FT+2S
PT+2EH+PL FT+2EH
PT+1S FT+2EH+PL
PT+1P FT+1S
PRT+3SM FT+1P
PRT+3SF FT_3SN
PRT+3SE RP_UM
PRT+3SE+PL PT+RP
PRT+3SN PRT+RP
PRT+RP+3SN NM+RP
PRT+NOM_athu PT+RP+3SM
PRT+3PN PT+RP+3SF
PRT+RP+3PN PT+RP+3SE
PRT+NOM_ana PRT+RP+3SM
PRT+2S PRT+RP+3SF
PRT+2EH PRT+RP+3SE
PRT+2EH+PL PRT+3SN
PRT+1S PRT+3SN


8.3.3.2 Paradigm Classification Rules

This section explains how the paradigm is classified based on end characters. Generally, the paradigm is identified using a lookup table. Initially, the root word is romanized using a Tamil Unicode-to-roman mapping file. This romanized form is compared with the end suffixes in the paradigm classification file. If an end suffix matches the end characters of the root word, then the paradigm number is identified. End suffixes are created based on the paradigms and sorted according to their character length. Figure 8.4 shows the algorithm for paradigm classification.

A lookup table is used only for paradigm 25 (படி, padi) and paradigm 26 (நட, wada), as the end suffixes cannot be generalized for these two paradigms. Figure 8.4 shows the pseudo code for paradigm classification. End suffixes with their corresponding paradigm numbers are given in Table 8.5. For example, the verb பயில் (payil) matches the end suffix 'இல்' (il); therefore the first paradigm is 11 and there is no possibility of a second paradigm (N).

In some cases, a word may have two paradigms. For example, words like படி (padi) and தீர் (thIr) have two paradigms: the sense of the word gives padi two paradigms, and the intransitive form makes the word thIr fall under two paradigms. Examples of the various word forms are given below.

padi (படி) (read): படித்தான், படிந்தான், படிக்க, படிய, படித்து, படிந்து

thIr (தீர்) (finish): தீர்த்தான், தீர்ந்தான், தீர, தீர்க்க, தீர்ந்து, தீர்த்து

Root word is romanized
For each end suffix:
    If the end suffix matches the end of the root word:
        The paradigm number is identified
    End if
End for

Figure 8.4 Pseudo Code for Paradigm Classification
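A small Python sketch of this end-suffix match, using a few entries from Table 8.5, is given below; suffixes are tried longest-first so that, for example, "vil" is tested before "il", and each entry maps to (first paradigm, second paradigm or None).

END_SUFFIX = {
    "akal": (11, None), "kal": (18, None), "vil": (18, None),
    "al": (11, 18), "il": (11, None), "ey": (1, None),
}

def paradigms_of(roman_lemma):
    # try longer suffixes first so "akal" wins over "kal", "vil" over "il"
    for suf in sorted(END_SUFFIX, key=len, reverse=True):
        if roman_lemma.endswith(suf):
            return END_SUFFIX[suf]
    return (None, None)

print(paradigms_of("payil"))  # (11, None) -- matches end suffix "il"
print(paradigms_of("sey"))    # (1, None)  -- matches end suffix "ey"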

Table 8.5 shows the lookup table of end suffixes for Tamil verb paradigms. For instance, the head word of the first paradigm is செய் (sey); some other words that fall under this paradigm are பெய் (pey), மெய் (mey) and உய் (uy). So the generalized end suffixes are எய் (ey), ஏய் (Ey) and உய் (uy).

Table 8.5 Look-up Table for Paradigm Classification

End Suffix   1st P-Num   2nd P-Num
A            10          N
O            17          N
ey           1           N
Ey           1           N
oy           1           N
uy           1           N
ozu          2           N
uzu          2           N
azu          2           N
idu          4           N
sAku         3           N
wOku         9           N
padu         25          4
wadu         4           N
kedu         25          4
ai           8           N
akal         11          N
kal          18          N
al           11          18
AL           12          N
aL           12          N
uL           12          N
kEL          19          N
sol          16          N
ol           13          N
el           13          N
oL           14          N
Ul           18          N
Ol           18          N
vil          18          N
El           19          N
UL           19          N
IL           19          12
sudu         4           N
odu          4           N
pOdu         4           N
eRu          5           N
uRu          5           N
RRu          30          15
Ru           29          N
Ez           6           N
Az           6           N
iz           6           N
az           6           N
Uz           6           N
Ay           6           N
Iy           6           N
vizu         7           N
izu          25          N
ezu          7           N
i            8           N
UN           20          N
uN           21          N
AN           22          N
in           23          N
wil          24          N
il           11          N
en           27          N
In           28          N
Aku          31          N
puku         32          N
Nuku         15          N
uku          32          N
iku          32          N
u            15          N
r            6           N

8.3.3.3 Suffix Table

The suffix table is the most essential resource for this algorithm. It is a simple two-dimensional (2D) table where each row corresponds to a morpho-lexical information entry and each column corresponds to a paradigm number. Nouns and verbs have their own suffix tables. The noun suffix table contains 325 rows (word-forms) and 25 columns (paradigms). The verb generator has two suffix tables: the first is for forms without auxiliaries and the second for forms with auxiliaries. The first table contains 164 rows and the second 67 rows; both tables contain 32 columns (paradigms). Table 8.6 shows the number of paradigms and inflections handled for verbs and nouns; Total represents the total number of inflections handled by the generator system.

Table 8.6 Paradigms and Inflections

                     Word forms
        Paradigms    Inflections   Auxiliaries   Postpositions   Total
Verb    32           164           67            --              10988
Noun    25           30            --            290             320


For every paradigm, a word is selected and it is termed the head word. For this head word, all the inflected word forms are created. For example, for the verb "pAdu", the inflected word forms are பாடினான் (pAdinAn), பாடினாள் (pAdinAL), பாடியது (pAdiyathu). A morpho-lexical information list is also created for all the word forms. Using these word-forms, a 2D table is created with columns and rows: each column corresponds to a paradigm and the rows represent morpho-lexical information. The word-forms of every paradigm are romanized and put in the table. The stem of each head word is identified and removed from its word-forms (i.e., the common term is removed in all word forms). Now only the remaining portion is available in the table; this remaining portion is called the suffix, and the table is called the suffix table. Table 8.7 illustrates a sample suffix-table for Tamil verbs. In this table the rows (MLI-1, MLI-2, ...) specify the morpho-lexical inflections and the columns (P-1, P-2, ...) indicate paradigm numbers.

Table 8.7 Suffix Table

8.3.3.4 Stemming Rules

Stemming is the process of reducing a word-form to its stem. The stem need not be identical to the morphological root of the word; stemming is also an important process in search engines and information retrieval. Table 8.8 shows the characters identified for stemming according to the paradigm. In stemming, these characters are removed from the root word. For example, the stemming characters for the second paradigm "azu" are 'zu', so the end characters 'zu' will be deleted for words which fall in the second paradigm. If a paradigm has the end character "*", then no character should be removed from the word.
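A tiny Python sketch of this stemming step, using a few paradigms from Table 8.8, might look as follows; the table excerpt is partial and for illustration only.

STEM_END = {1: "*", 2: "zu", 4: "du", 25: "*"}  # paradigm -> end characters

def stem(roman_lemma, paradigm):
    end = STEM_END[paradigm]
    if end == "*" or not roman_lemma.endswith(end):
        return roman_lemma           # "*" means nothing is removed
    return roman_lemma[: -len(end)]  # strip the paradigm's end characters

print(stem("azu", 2))    # "a"    -- paradigm 2 strips "zu"
print(stem("padi", 25))  # "padi" -- paradigm 25 strips nothing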

Table 8.8 Stemming End Characters

Paradigm Number   End Characters
1                 *
2                 zu
3                 sAku
4                 du
5                 Ru
6                 wOku
7                 zu
8                 *
9                 wOku
10                A
11                L
12                L
13                L
14                L
15                U
16                L
17                *
18                L
19                L
20                *
21                *
22                kAN
23                *
24                L
25                *
26                *
27                *
28                *
29                U
30                Ru
31                U
32                U

8.4 MORPHOLOGICAL GENERATOR FOR TAMIL


PRONOUNS

Tamil pronouns are very essential, and their structures are used in everyday conversation. They include personal pronouns (referring to the persons speaking, the persons spoken to, or the persons or things spoken about), indefinite pronouns, relative pronouns (connecting parts of sentences) and reciprocal or reflexive pronouns (in which the object of a verb is acted on by the verb's subject). Personal pronouns, indefinite pronouns, relative pronouns, and reciprocal or reflexive pronouns play an important role in the Tamil language. Therefore, they need special attention during generation as well as analysis. Tamil pronouns come under the closed word-class category. The morphological generator for Tamil pronouns is developed separately using a pattern matching technique. The primary advantage of using the pattern matching method is that it performs well for closed-class words. The reverse process of the pronoun morphological analyzer is used in generating the pronoun surface word, so the same data used in pronoun morphological analysis is used in generation too. The pronoun word-form structure is shown in Figure 8.5, and an example of the pronoun structure is described below.
Example for the pronoun structure is described bellow.

PP 

Case  Clitic 

Clitic 
Pronoun 
Root

Clitic 

Figure 8.5 Structure of Pronoun Word-form

Example of the structure of a pronoun:

அவன் → அவனா (root + clitic)
அவன் → அவனுக்கு (root + case) → அவனுக்கா (case + clitic)
அவனுக்கு → அவனுக்கருகில் (case + postposition) → அவனுக்கருகிலா (postposition + clitic)


A pattern matching approach is followed for generating pronoun word forms. Figure 8.6 shows the architecture of pronoun morphological generation.

Figure 8.6 Pronoun Morphological Generation (the pronoun root and MLI are romanized and looked up in four suffix databases to produce the pronoun word form)

The input to this morphological generator is the pronoun root word and the morphological information. The pronoun root word is first romanized and then verified against the suffix database which is used in morphological analysis. Four types of suffix lookup tables are used in this system; each table is created based on one of the levels in the pronoun structure.
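A toy Python sketch of this level-by-level lookup is given below; the romanized suffix-table entries are illustrative assumptions that reproduce the அவன் (avan) example above.

CASE = {("avan", "DAT"): "avanukku"}             # level: case forms
PP = {("avanukku", "arukil"): "avanukkarukil"}   # level: postpositions
CLITICS = {"interrogative": "A"}                 # level: clitic suffixes

def generate_pronoun(root, case=None, pp=None, clitic=None):
    form = CASE[(root, case)] if case else root
    if pp:
        form = PP[(form, pp)]
    if clitic:
        form += CLITICS[clitic]
    return form

print(generate_pronoun("avan"))                    # avan
print(generate_pronoun("avan", case="DAT"))        # avanukku
print(generate_pronoun("avan", case="DAT",
                       pp="arukil",
                       clitic="interrogative"))    # avanukkarukilA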

8.5 SUMMARY

Post-processing generates a Tamil sentence from the factored output of the SMT system. Each unit in the factored output contains a root word, word class and morphological information. The morphological generator is used as a post-processing component for word generation. The morphological generator generates a word-form from a lemma, a word class tag, and a morpho-lexical description. It is needed for various applications in Natural Language Processing, and it also acts as a post-processing component for other NLP applications. The structure of the main sentence in Tamil is Subject-Object-Verb (SOV); in this SOV order, the verb agrees with the subject in gender and number, and the morphological generator fulfills this agreement while generating a word-form. Paradigms and suffixes are used to generate Tamil verbs and nouns. A pattern matching based algorithm is used for generating pronoun word-forms. Finally, these generators are integrated to generate a complete Tamil sentence. The important features of the developed morphological generator system are given below:

• Automatic paradigm identification.
• Uses a very small suffix data set for generating more than ten thousand word forms.
• Simple and efficient method.
• Handles compound words and proper nouns.
• Handles transitive and intransitive forms.
• No explicit morpho-phonemic rules.
• No verb/noun dictionary for paradigm identification.
• Easily updatable, and errors can be corrected without difficulty.
• Applicable to any morphologically rich language.

CHAPTER 9
EXPERIMENTS AND RESULTS

9.1 GENERAL

This section explains the experiments and results of the English to Tamil statistical machine translation system. The experiments include installation of the SMT toolkit, and the training and testing regulations. This machine translation system is an integration of various modules, so the accuracy depends on each module in the system. Roughly, the translation system is divided into three modules: preprocessing, translation and post-processing. Hence the results and the errors depend on the preprocessing stages of the English sentence, the factored SMT, and the post-processing. In preprocessing, English sentences are transformed using reordering and compounding phases; preprocessing uses rules for reordering and compounding and an English parser.

The accuracy of preprocessing depends on the parser's output and the rules developed. The next stage is the factored SMT system, whose output depends on the size and quality of the corpora. The parameter tuning and decoding steps also play a major role in producing the output of the SMT system. Finally, the output is given to the post-processing stage. The Tamil morphological generator is utilized in post-processing; therefore the accuracy of the morphological generator decides the precision of the post-processing stage. Sample outputs of the three modules are shown in the appendix. Finally, the output of the machine translation system is evaluated using the BLEU and NIST metrics. Different models are also developed to compare the results of the developed translation engine.

9.2 EXPERIMENTAL SETUP AND RESULTS

This section describes the experimental setup and data used in the English to Tamil statistical machine translation system. The training data consists of approximately 16K English-Tamil parallel sentences. The health-domain English-Tamil parallel corpora of the EILMT (English to Indian Language Machine Translation, a project funded by DIT) project are used in the experiments. The training set is built with 13,500 parallel sentences and the test set is constructed with 1,532 sentences; 1,000 parallel sentences are used for tuning the system. For the language model, 90k Tamil sentences are used.

The average sentence lengths (in words) of the baseline and factored parallel corpora used in these experiments are shown in Tables 9.1 and 9.2.

Table 9.1 Details of Baseline Parallel Corpora

           Total Sentences   Average Sentence Length (words)
                             English     Tamil
Training   13500             22.047      17.145
Tuning     1000              21.8        15.334
Testing    1532              22.169      -

Table 9.2 Details of Factored Parallel Corpora

           Total Sentences   Average Sentence Length (words)
                             English     Tamil
Training   13500             17.955      17.145
Tuning     1000              17.674      15.334
Testing    1532              18.039      -

Nine different types of models are trained, tuned and tested with the help of the parallel corpora. The general categories of the models are baseline and factored systems. The detailed models are:

1. Baseline (BL)
2. Baseline with Automatic Reordering (BL+AR)
3. Baseline with Rule based Reordering (BL+RR)
4. Factored system + Morph-Generator (Fact)
5. Factored system + Auto Reordering + Morph-Generator (Fact+AR)
6. Factored system + Rule based Reordering + Morph-Generator (Fact+RR)
7. Factored system + Compounding + Morph-Generator (Fact+Comp)
8. Factored system + Auto Reordering + Compounding + Morph-Generator (Fact+AR+Comp)
9. Factored system + Rule based Reordering + Compounding + Morph-Generator (Fact+RR+Comp)

For the baseline system, a standard phrase-based system is built using the surface forms of the words, without any additional linguistic knowledge, and with a 4-gram LM in the decoder; a cleaned raw parallel corpus is used for training. The lexicalized reordering model (msd-bidirectional-fe) is used in the baseline with automatic reordering, and another baseline system is built using rule-based reordering. In all the developed factored models, the Tamil morphological generator is used in the post-processing stage. Instead of using only the surface form of a word, the root, part-of-speech and morphological information are included as additional factors, and a factored parallel corpus is used for training. English factorization is done using the Stanford Parser; on the Tamil side, the POS tagger and morphological analyzer are used to factor the sentences. In these factored models, a token/word is represented with four factors as Surface|Root|Wordclass|Morphology, where the Morphology factor contains morphological information and function words on the English side, and morphological tags on the Tamil side. In the factored model with rule-based reordering and compounding (Fact+RR+Comp), the English words are factored and reordered; in addition, compounding is performed on the English side.
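For illustration, building one line of such a factored corpus in the Surface|Root|Wordclass|Morphology layout can be sketched as below; joining factors with "|" is the standard Moses factored-corpus convention, and the example annotations are hypothetical:

def to_factored_line(tokens):
    # tokens: list of (surface, root, word_class, morphology) tuples.
    return " ".join("|".join(factors) for factors in tokens)

example = [("went", "go", "VBD", "PAST"), ("home", "home", "NN", "N")]
print(to_factored_line(example))
# -> went|go|VBD|PAST home|home|NN|N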

All the developed models are evaluated with the same test set of 1,532 English sentences. The well-known machine translation metrics BLEU [144] and NIST [145] are used to evaluate the developed models. In addition, the existing "Google Translate" online English-Tamil machine translation system is evaluated for comparison with the developed models. The results, in terms of BLEU-1, BLEU-4 and NIST scores, are shown in Table 9.3. In Figures 9.1 and 9.2, the X axis represents the various machine translation models and the Y axis denotes the BLEU-1 and BLEU-4 scores; Figure 9.3 shows the NIST scores of the developed models. The graphs clearly show that the proposed system (Fact+RR+Comp) improves the BLEU and NIST scores compared to the other developed models and the "Google Translate" system. The Google translation system's output is shown in Figure 9.4: both sentences fail to produce noun-verb agreement, and the case markers are not identified in the second sentence; a grammatically correct output is not available among the alternate translations either. A detailed output comparison is shown in Appendix B. The F-SMT based system developed in this thesis does handle noun-verb agreement, which is an important and challenging task when translating into morphologically rich languages like Tamil.
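For reference, corpus-level BLEU and NIST scores of the kind reported in Table 9.3 can be reproduced with NLTK's implementations. The evaluation in this thesis used the standard mteval-style script (see Appendix B.6), so the snippet below, with toy data, is only an illustrative sketch:

from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.nist_score import corpus_nist

# One (or more) reference translations per hypothesis, all pre-tokenized.
references = [[["the", "cat", "sat", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
nist = corpus_nist(references, hypotheses, n=4)
print(bleu1, bleu4, nist)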

Table 9.3 BLEU and NIST Scores

  Category    Model               BLEU-1    BLEU-4    NIST
  BASELINE    BL                  0.2924    0.0368    2.7221
              BL+AR               0.2929    0.0403    2.7488
              BL+RR               0.2594    0.0258    2.4148
  FACTORED    Fact+Mgen           0.6406    0.3722    3.9831
              Fact+AR+Mgen        0.6405    0.3725    3.9876
              Fact+RR+Mgen        0.6285    0.3653    3.3887
              Fact+Comp           0.6239    0.3573    3.9626
              Fact+AR+Comp+Mgen   0.6237    0.3577    3.9673
              Fact+RR+Comp+Mgen   0.6753    0.3894    4.2667
              Google Translate    0.3105    0.3125    2.9526

[Figure: bar chart omitted; X axis: various MT models, Y axis: BLEU-1 score]

Figure 9.1 BLEU-1 Scores for Various Models
[Figure: bar chart omitted; X axis: various MT models, Y axis: BLEU-4 score]

Figure 9.2 BLEU-4 Scores for Various Models

[Figure: bar chart omitted; X axis: various MT models, Y axis: NIST score]

Figure 9.3 NIST Scores for Various Models
[Figure: screenshot omitted]

Figure 9.4 Google Translation System¹
9.3 SUMMARY

This chapter described the results used to test the effectiveness of the English to Tamil machine translation systems. The evaluation results clearly confirm that the new techniques proposed in this thesis are significant: the pre-processing and post-processing allow the developed system to achieve a relative improvement in BLEU and NIST scores. Furthermore, the Tamil linguistic tools which form the modules of the translation system were also implemented in this research, and the preprocessing techniques developed in this work help to increase the translation quality. The BLEU and NIST evaluation scores clearly show that the factored model with integrated linguistic knowledge gives better results for English to Tamil statistical machine translation.

                                                            
¹ Google Translate output tested on 31-07-2012.


CHAPTER 10
CONCLUSION AND FUTURE WORK

In this chapter, the main contributions and most significant achievements of this research work are summarized. The conclusion, which follows the summary, attempts to highlight the research contributions in the field of Tamil language processing. At the same time, the limitations and future scope of the developed systems are also mentioned, so that researchers who are interested in extending any of this work can easily explore the possibilities.

10.1 SUMMARY

Due to the multilingual nature of the present information society, human language processing and machine translation have become essential. Several linguistic features (both morphological and syntactic) make translation a truly challenging problem. The importance of machine translation will grow as the need for translating knowledge resources from one language to another increases.

Machine translation is a challenging task for language pairs that differ in morphological structure and word order. In many applications, only small amounts of bilingual corpora are available for the desired domain and language pair. Using linguistic knowledge in SMT can reduce the need for massive amounts of data by raising the level of generalization, thereby providing a basis for more efficient data exploitation. This is especially desirable for language pairs (like English and Tamil) for which massive parallel corpora are not available. For training an SMT system, both monolingual corpora and sentence-aligned bilingual parallel corpora of significant size are essential.

This thesis presents novel methods for incorporating linguistic knowledge into SMT to enhance English to Tamil machine translation. Most of the techniques presented in this thesis can be applied directly to other language pairs, especially for translating from a morphologically simple language to a morphologically rich one.

The precision of the translation system depends on the performance of each and every module and linguistic tool used in the system. The experimental results clearly


demonstrate that the new techniques proposed in this thesis are significant. The different machine translation models were evaluated and their BLEU and NIST scores compared. The developed model (factored SMT with pre- and post-processing) reported a NIST score of 4.2667 for English to Tamil translation. Adding pre- and post-processing to the factored SMT provided about a 0.38 BLEU-1 improvement over the word-based baseline system and a 0.03 improvement over the factored baseline system. Finally, these scores were compared with the "Google Translate" online machine translation system: the developed model reported a 1.3 improvement in NIST score and about a 0.36 improvement in BLEU-1. These improvements in the BLEU and NIST evaluation scores show that the proposed approach is appropriate for English to Tamil machine translation.

10.2 SUMMARY OF WORK DONE

In this section, the research work done is summarized with reference to the objectives
of the proposed work. This thesis mainly highlights the five different developments in
the field of language processing which are listed below.

• A Pre-processing module (reordering, compounding and factorization) for English sentences is developed to transform the sentence structure into one more similar to that of Tamil. The compounding step in pre-processing is a novel method designed specifically for the English-to-Tamil machine translation system.

• A Tamil POS tagger is developed using an SVM-based machine learning tool. The major challenge in developing a statistical POS tagger for Indian languages is the unavailability of annotated (tagged) corpora. The developed Tamil POS tagger is trained on 5 lakh POS-annotated words; this tagged corpus was also built as part of this research and is a major resource for Tamil language processing.

• A Morphological Analyzer tool is built for the Tamil language using a machine learning approach: the morphological analysis problem is redefined as a classification problem. An SVM-based tool is used to train the system on 6 lakh morphologically tagged verbs and nouns. This tagged corpus was also developed as part of this research work. These morphologically

segmented and tagged words are also a most significant resource for analyzing Tamil word forms. The same methodology has been successfully implemented for other Dravidian languages like Malayalam, Telugu, and Kannada.

• A Morphological Generator is developed for the Tamil language using a new suffix-based algorithm, and a post-processing module is developed to tackle the challenges in agreement and word-form generation. The Tamil morphological generator tool is used in post-processing to assist surface word generation; the algorithm is capable of automatically generating more than ten thousand word forms from a single Tamil verb.

• An English to Tamil Factored Statistical Machine Translation system is developed by integrating the above modules and various linguistic tools. The scarcity of large training corpora increases the advantage of using word-level information in more linguistically motivated models. Mapping translation factors in the factored model aids the disambiguation of source words and improves the grammar of the target word factors. It is shown that the developed system improves translation accuracy over the baseline system and the other factored machine translation models.

The major outcomes of the research are mapped to the publications which have resulted from this work, as shown in Table 10.1.

Table 10.1 Mapping of Major Research Outcomes to Publications

  Major Outcomes                              Publications
  Development of Tamil POS Tagger             "Tamil Part-of-Speech Tagger based on SVMTool"
                                              "POS Tagger and Chunker for Tamil Language"
  Development of Morphological Analyzer       "A Sequence Labelling Approach to Morphological Analyzer for Tamil"
  for Tamil language                          "Morphological Analyzer for Agglutinative Languages Using Machine Learning Approaches"
  Development of Morphological Generator      "A Novel Data Driven Algorithm for Tamil Morphological Generator" (IJCA)
  for Tamil language                          "A Novel Algorithm for Tamil Morphological Generator" (ICON)
  Development of English to Tamil             "Morphology based Factored Statistical Machine Translation System from English to Tamil"
  Statistical Machine Translation system      "Factored Statistical Machine Translation System for English to Tamil using Tamil Linguistic Tools" (Accepted for Publication)

10.3 CONCLUSION

10.3 CONCLUSION

The major achievement of this research has been the development of a Factored Statistical Machine Translation system for English to Tamil by integrating linguistic tools. Linguistic tools such as the POS tagger and morphological analyzer were also developed as part of this research work. Developing these linguistic tools is a challenging and demanding task, especially for a highly agglutinative language like Tamil.

The performance of statistical and machine learning methods mainly depends on the size and correctness of the corpus. If the corpus contains all types of surface word forms, word categories and sentence structures, then a learning algorithm can extract all the required features. The preprocessing systems are automated for creating factored parallel and monolingual corpora; factored corpora are an essential resource for developing a factored machine translation system from English to Tamil.

The proposed work aims at incorporating more lexical information of the Tamil language and generates language processing models that solve the problem more effectively. The methods presented in this thesis, such as the machine learning based morphological analyzer, the suffix-based morphological generator and the compounding step in the English preprocessing, are novel methods in language processing research.


The Tamil linguistic tools can be used in the future to implement machine translation systems between Tamil and other languages, especially Dravidian languages like Telugu, Malayalam and Kannada. The developed linguistic tools and annotated corpora can also be used in other language processing tasks such as information retrieval and extraction, speech processing, question answering and word sense disambiguation.

Finally, the conclusion is that morphologically rich languages need extensive morphological preprocessing before SMT training, to make the source language structurally similar to the target language, and an efficient post-processing stage in order to generate the surface words correctly.

10.4 FUTURE DIRECTIONS

This thesis addresses techniques to improve the quality of machine translation by separating the root and morphological information of surface words. The main limitation of the approach presented here is that it is not directly applicable in the reverse direction (Tamil to English). All the developed computational linguistic tools and MT systems are domain-specific and scalable, so researchers who are interested in extending any of this work can easily explore the possibilities. There are a number of possible directions for future work based on the findings of this thesis; some of them are given below.

• Increasing the size of the parallel corpora always helps to improve the accuracy of the system. Adding different sentence structures and handling idioms and phrases externally would also help to improve the system.

• The tools and methodologies developed here were applied to translation from English to Tamil. It would be interesting to apply similar methods for translating English to other morphologically rich languages.

• The tools and methodologies developed here can also be used to develop translation systems that translate other languages into Tamil.


• The reordering and compounding rules suggested in this thesis are relatively simple and do not perform very well on long sentences. It would be possible to replace the handcrafted rules with automatically learned ones.

• Applying advanced SMT models such as syntax-based models, hierarchical models and hybrid approaches would improve the system.

• In the future, an advancement of this system lies in integrating it with speech recognition systems. Additionally, there is the possibility of porting the system to a mobile environment for translating simple sentences.

• It would be useful to perform a thorough error analysis of the translation output; such an analysis would guide future improvements.

APPENDIX-A

A.1 TAMIL TRANSLITERATION

Tamil Roman ேக்ஷ xE ேஙா ngO


அ a ைக்ஷ xai ெஙௗ ngau
அ a ெக்ஷா xo ச sa
ஆ A ேக்ஷா xO ச் s
இ i கா kA சா sA
ஈ I கி ki சி si
உ u கீ kI சீ sI
ஊ U கு ku சு su
எ e கூ kU சூ sU
ஏ E ெக ke ெச se
ஐ ai ேக kE ேச sE
ஒ o ைக kai ைச sai
ஓ O ெகா ko ெசா so
ஔ au ேகா kO ேசா sO
ஃ q ெகௗ kau ெசௗ sau
க ka ங nga ஞ nja
க் k ங் ng ஞ் nj
க்ஷ xa ஙா ngA ஞா njA
x ஙி ngi ஞி nji
க்ஷா xA ஙீ ngI ஞீ njI
க்ஷி xi ஙு ngu nju
க்ஷீ xI ஙூ ngU njU
க்ஷு xu ெங nge ெஞ nje
க்ஷூ xU ேங ngE ேஞ njE
ெக்ஷ xe ைங ngai ைஞ njai
ெக்ஷௗ xau ெஙா ngo ெஞா njo

ேஞா njO தா thA pU
ெஞௗ njau தி thi ெப pe
ட da தீ thI ேப pE
ட் d thu ைப pai
டா dA thU ெபா po
di ெத the ேபா pO
டீ dI ேத thE ெபௗ pau
du ைத thai ம ma
dU ெதா tho ம் m
ெட de ேதா thO மா mA
ேட dE ெதௗ thau மி mi
ைட dai ந wa மீ mI
ெடா do ந் w mu
ேடா dO நா wA mU
ெடௗ dau நி wi ெம me
ண Na நீ wI ேம mE
ண் N wu ைம mai
ணா NA wU ெமா mo
ணி Ni ெந we ேமா mO
ணீ NI ேந wE ெமௗ mau
Nu ைந wai ய ya
NU ெநா wo ய் y
ெண Ne ேநா wO யா yA
ேண NE ெநௗ wau யி yi
ைண Nai ப pa யீ yI
ெணா No ப் p yu
ேணா NO பா pA yU
ெணௗ Nau பி pi ெய ye
த tha பீ pI ேய yE
த் th pu ைய yai

ெயா yo வ் v Lu
ேயா yO வா vA LU
ெயௗ yau வி vi ெள Le
ர ra vI ேள LE
ர் r vu ைள Lai
ரா rA vU ெளா Lo
ாி ri ெவ ve ேளா LO
ாீ rI ேவ vE ெளௗ Lau
ru ைவ vai ற Ra
rU ெவா vo ற் R
ெர re ேவா vO றா RA
ேர rE ெவௗ vau றி Ri
ைர rai ழ za றீ RI
ெரா ro ழ் z Ru
ேரா rO ழா zA RU
ெரௗ rau ழி zi ெற Re
ல la ழீ zI ேற RE
ல் l zu ைற Rai
லா lA zU ெறா Ro
li ெழ ze ேறா RO
லீ lI ேழ zE ெறௗ Rau
lu ைழ zai ன na
lU ெழா zo ன் n
ெல le ேழா zO னா nA
ேல lE ெழௗ zau னி ni
ைல lai ள La னீ nI
ெலா lo ள் L nu
ேலா lO ளா LA nU
ெலௗ lau ளி Li ென ne
வ va ளீ LI ேன nE

ைன nai ெஸௗ Sau
ெனா no ஷ sha
ேனா nO ஷ் sh
ெனௗ nau ஷா shA
ஜ ja ஷி shi
ஜ் j ஷீ shI
ஜா jA ஷு shu
ஜி ji ஷூ shU
ஜீ jI ெஷ she
ஜு ju ேஷ shE
ஜூ jU ைஷ shai
ெஜ je ெஷா sho
ேஜ jE ேஷா shO
ைஜ jai ெஷௗ shau
ெஜா jo ஹ ha
ேஜா jO ஹ் h
ெஜௗ jau ஹா hA
ஸ Sa ஹி hi
ஸ் S ஹீ hI
sri ஹு hu
ஸா SA ஹூ hU
Si ெஹ he
ஸீ SI ேஹ hE
ஸு Su ைஹ hai
ஸூ SU ெஹா ho
ெஸ Se ேஹா hO
ேஸ SE ெஹௗ hau
ைஸ Sai
ெஸா So
ேஸா SO

A.2 DETAILS OF AMIRTA POS TAGS
The major POS tags are noun, verb, adjective and adverb. The noun tag is further classified into 10 categories and the verb tag into 7; the remaining tags cover adjectives, adverbs and others.

Noun Tags

Nouns are words which denote a person, place, thing, time, etc. In the Tamil language, nouns are inflected for number and case at the morphological level. At the phonological level, however, four types of suffixes can occur with a noun stem.

Noun (+ number) (+ case)

Example: pUk-kaL-ai <NN>
flower-plural-accusative case suffix

Noun (+ number) (+ oblique) (+ euphonic) (+ case)

Example: pUk-kaL-in-Al <NN>
flower-plural-euphonic suffix-instrumental case suffix

As mentioned earlier, distinct tags based on grammatical information are avoided, since plurality and case suffixation can be obtained from a morphological analyzer. This brings the number of tags down and helps to achieve simplicity, consistency and better machine learning. Therefore, only two tags, common noun <NN> and common compound noun <NNC>, are used, without further subdivision based on the grammatical information contained in the noun word.

Example for Common Nouns (NN)


paRavai <NN>
‘bird’
paRavai-kaL <NN>
‘bird-s’
paRavai-kku <NN>
‘to- bird’
paRavai-yAl <NN>

256
‘by- bird’
Example for Compound Nouns (NNC)
UrAdsi <NNC> thalaivar <NNC>
‘Township leader’
vanap <NNC> pakuthi <NNC>
‘forest area’

Proper Noun tag

Proper nouns are words which denote a particular person, place, or thing. Indian languages, unlike English, do not have any specific marker for proper nouns in their orthographic conventions; English proper nouns begin with a capital letter, which distinguishes them from common nouns. Most of the words which occur as proper nouns in Indian languages can also occur as common nouns. For example, in English, John, Harry and Mary occur only as proper nouns, whereas in Tamil, thAmarai, maNi, pissai, arasi, etc. are used as proper nouns as well as common nouns. Given below is a list of Tamil words with their grammatical category and English glosses; these words can occur in text as both common and proper nouns.
thAmarai noun lotus
maNi noun bell
pissai noun beg
arasi noun queen
Two tags have been used for proper noun,
• Proper noun <NNP>
• Compound proper noun <NNPC>

Example for Proper Nouns (NNP)


raja <NNP> wERRu vawthAn.
“Raja came yesterday”

Example for Compound Proper Nouns (NNPC)


apthul<NNPC> kalAm <NNPC> inRu cennai varukiRAr.
“Today, Abdul kalam is coming to Chennai”

257
Cardinal Tag
Any word denoting a cardinal number is tagged as <CRD>.

Example for Cardinals <CRD>

enakku 150 <CRD> rupAy vENdum.
“I need 150 rupees”.
mUnRu <CRD> waparkaL angku amarwthirukkiRArkaL.
“Three people were sitting there”

Ordinal Tag
Ordinals are an extension of the natural numbers, distinct from integers and cardinals. In Tamil, ordinals are formed by adding the suffixes Am and Avathu. Expressions denoting ordinals are marked as <ORD>.

Example for Ordinals <ORD>


muthalAm <ORD> vakuppu.
“first class”
12-Am <ORD> wURRANdu
“12th century”
Pronoun Tag
Pronouns are words that take the place of nouns; a pronoun is used in place of a noun to refer to the same entity. Linguistically, a pronoun is a variable which functions as a noun; a separate tag for pronouns is helpful for anaphora resolution.

Example for personal Pronouns (PRP)


avan <PRP> weRRu ingku vawthAn.
“He came here yesterday”

Adjective Tag

Adjectives are noun modifiers. In modern Tamil, both simple and derived adjectives are present; the derived adjectives are formed by adding the suffix (Ana) to the noun root. The <ADJ> tag is used for representing adjectives.

258
Example for Adjectives <ADJ>
iwtha walla <ADJ> paiyan
“this nice boy”
oru azakAna <ADJ> peN
“A beautiful girl”

Adverb Tag
Adverbs are words which tell more about verbs. In modern Tamil, both simple and derived adverbs are present; the derived adverbs are formed by adding the suffixes (Aka and Ay) to the verb root. Temporal and spatial entities are also tagged as adverbs. The <ADV> tag is used for representing adverbs.

Example for Adverbs <ADV>


Avan adikkadi <ADV> vidumuRai eduththAn.
“He took leave frequently”
kuthirai vEkam-Aka <ADV> Odiyathu.
“The horse ran fast”’
VerbTags

Verbs are defined as action words, which can take tense suffixes; person, number and gender suffixes; and a few other verbal suffixes. Tamil verb forms can be divided into finite and non-finite verbs.

Finite verb tag


Finite verbs as the predicate of the main clause occur at the end of the sentence.
Example for Finite Verb <VF>
avan paSsil vawthAn <VF>
“he came by bus”

Non finite verb tag

Tamil distinguishes four types of non-finite verb forms. They are the verbal participle <VNAV>, adjectival participle <VNAJ>, infinitive <VINT> and conditional <CVB>.

259
Example for verbal participle <VNAV>
      wowthu<VNAV> poo
“become vexed”

Example for adjectival participle <VNAJ>


vawtha <VNAJ> paiyan
“‘the boy who came”

Example for infinitive <VINT>


awtha maraththil idi viza <VINT>

“may thunder fall on the tree”

Example for conditional <CVB>


wI wEraththOdu vawthAl <CVB>
“If you would come in time”

Nominalized verb tag


Nominalized verbal forms are verbal nouns, participial nouns and adjectival nouns. The <VBG> tag is used for all forms of nominalized verbs.

Example for Nominalized verb forms <VBG>


seythal <VBG> (doing)
seyAthu <VBG> (a neuter which is doing)
seythavan <VBG> (a man who did)

Anaw enna seyvathu <VBG>?


“What shall (we) do Anand?”

Auxiliary verb tag


Auxiliary verbs are verbs which lose their original syntactic and semantic properties when they collocate with main verbs. They signify various grammatical meanings which are auxiliary to the main verb in a sentence. The <VAX> tag is used for denoting auxiliary verbs.

Example for auxiliary verb <VAX>


260
sItha angkE iruwthAL “Seetha was there” - Main Verb
sItha angkE vawthirukkiRAL “Seetha has come there” – Auxiliary verb

Other Tags

Postposition tag
Tamil has a few free forms which are referred to as postpositions. They are added after the case marker.

Example
kovil kiri vIddukku pakkaththil <PPO> uLLathu

“The temple is near Giri’s house”

In the above example, the postposition pakkaththil 'near' occurs after the dative noun phrase vIddukku. Here the form pakkaththil is not considered a suffix; it is a free form, and because of its place of occurrence it is termed a postposition. Postpositions are historically derived from verbs; Schiffman (1999) describes various postpositions. Postpositions are conditioned by the case of the nouns they follow. In Tamil, some postpositions are simple and some are compound. The <PPO> tag is used for representing postposition words.

Example for Postposition <PPO>


avan wAyaip pOl <PPO> kaththinAn.
“He cried like a dog”

Conjunction tag
Co-ordination or conjunction in Tamil is mainly realized by certain noun forms, verb forms and clitics. The <CNJ> tag is used for representing conjunctions.

Example for Conjunction <CNJ>


ciRiYA AnAl <CNJ > walla pen.
“a small but nice girl”

Determiner tag
A determiner is a noun modifier that expresses the reference of a noun or noun phrase in context, rather than attributes expressed by adjectives. This function is usually performed by articles, demonstratives or possessive determiners. The <DET> tag is used for annotating determiners.

Example for determiners <DET>


awthath <DET>thittam.
“That plan”

Complimentizer
A complementizer is a syntactic category roughly equivalent to the term subordinating conjunction in traditional grammar. For example, the word “that” is a complementizer in the English sentence “Mary believes that it is raining”. <COM> is used for tagging complementizers.

Example for complementizer <COM>


avan vawthAn enRu<COM> kELvippattEn.
“I heard that he had come”

Emphasis tag
Force or intensity of expression that gives impressiveness or importance to a word is termed emphasis. The <EMP> tag is used for annotating emphasis.

Example for Emphasis <EMP>


avan than<EMP> sonnAn.
“He only said”

Echo word tag


The tag < ECH> is used to denote echo words.

Example for Echo words (ECH)


kAppi kIppi <ECH>
“coffee keeffee”

Reduplication word tag


Reduplication words are words written twice for various purposes, such as indicating emphasis or deriving one category from another. The <RDW> tag is used for tagging reduplication words.

Example for Reduplication words <RDW>


palapala <RDW> thiddam.
“many plans”

Question word and Question marker


Tags <QW> and <QM> are used for question word and question marker.

Example for question word and marker <QW> and <QM>


Avan vawthAnA <QW>? <QM>.
“Did he come?”

Symbol tags

Only two symbol tags, dot <DOT> and comma <COMM>, are considered in the corpus. The <DOT> tag is used to mark sentence separation; the <COMM> tag is used between multiple nouns and proper nouns.

Example for <DOT>


avan thAn sonnAn .<DOT>
“He only said”

Example for <COMM>


rithaniyA, <COMM> tharunyAudan vawthaL.
“Rethanya came with Dharunya”

APPENDIX-B

B.1 PENN TREEBANK POS TAGS

S.No Tag Description


1 CC Coordinating conjunction
2 CD Cardinal number
3 DT Determiner
4 EX Existential there
5 FW Foreign word
6 IN Preposition or subordinating conjunction
7 JJ Adjective
8 JJR Adjective, comparative
9 JJS Adjective, superlative
10 LS List item marker
11 MD Modal
12 NN Noun, singular or mass
13 NNS Noun, plural
14 NNP Proper noun, singular
15 NNPS Proper noun, plural
16 PDT Predeterminer
17 POS Possessive ending
18 PRP Personal pronoun
19 PRP$ Possessive pronoun
20 RB Adverb
21 RBR Adverb, comparative
22 RBS Adverb, superlative
23 RP Particle
24 SYM Symbol
25 TO To
26 UH Interjection
27 VB Verb, base form
28 VBD Verb, past tense
29 VBG Verb, gerund or present participle
30 VBN Verb, past participle
31 VBP Verb, non-3rd person singular present
32 VBZ Verb, 3rd person singular present
33 WDT Wh-determiner
34 WP Wh-pronoun
35 WP$ Possessive wh-pronoun
36 WRB Wh-adverb
 

B.2 DEPENDENCY TAGS  

Dependency Label    Meaning                         Dependency Label    Meaning
dep Dependent acomp adjectival complement
aux Auxiliary agent Agent
auxpass passive auxiliary ref Referent
cop Copula expl expletive (expletive there)
conj Conjunct mod Modifier
cc Coordination advcl adverbial clause modifier
arg Argument purpcl purpose clause modifier
subj Subject tmod temporal modifier
nsubj nominal subject rcmod relative clause modifier
nsubjpass passive nominal subject amod adjectival modifier
csubj clausal subject infmod infinitival modifier
comp Complement partmod participial modifier
obj Object num numeric modifier
dobj direct object number element of compound number
iobj indirect object appos appositional modifier
pobj object of preposition nn noun compound modifier
attr Attributive abbrev abbreviation modifier
ccomp clausal complement+internal Sub advmod adverbial modifier
xcomp clausal complement+external Sub neg negation modifier
compl Complementizer poss possession modifier
mark marker ( introducing an advcl) possessive possessive modifier (’s)
rel relative (introducing a rcmod) prt phrasal verb particle
acomp adjectival complement det Determiner
agent Agent prep prepositional modifier
xsubj controlling subject sdep semantic dependent
 

B.3 TAMIL VERB MLI FILE
Null  PT+RP 
PT+3SM  PRT+RP
PT+3SF  NM+RP 
PT+3SE  PT+RP+3SM 
PT+3SE+PL  PT+RP+3SF 
PT+3SN  PT+RP+3SE
PT+NOM_athu  PRT+RP+3SM 
PT+RP+3SN  PRT+RP+3SF 
PT+3PN  PRT+RP+3SE
PT+RP+3PN  PRT+3SN 
PT+NOM_ana  PRT+3SN 
PT+2S  PRT+NOM_athu 
PT+2EH  FT+RP+3SM
PT+2EH+PL  FT+RP+3SF 
PT+1S  FT+RP+3SE 
PT+1P  NM+RP+3SM 
PRT+3SM  NM+RP+3SF
PRT+3SF  NM+RP+3SE 
PRT+3SE  NM+RP+NOM_athu 
PRT+3SE+PL  PT+RP+3PN_vai 
PRT+3SN  NM+NOM_ana
PRT+RP+3SN  VP 
PRT+NOM_athu  INF 
PRT+3PN  NM+3SN
PRT+RP+3PN  NM+VP
PRT+NOM_ana  NM+VP_aamal 
PRT+2S  NOM_thal 
PRT+2EH  NOM_al
PRT+2EH+PL  NOM_kai 
PRT+1S  NM+NOM_mai 
PRT+1P  NOM_kkal+MOOD_aam 
FT+3SM  NOM_kkal+MOOD_aakum 
FT+3SF  NOM_kkal+NM+3SN 
FT+3SE  VP+PAAR_AUX+PT+3SM 
FT+3SE+PL  VP+PAAR_AUX+PT+3SF 
FT+3SN  VP+PAAR_AUX+PT+3SE 
FT+RP+3SN  VP+PAAR_AUX+PT+2S 
FT+NOM_athu  VP+PAAR_AUX+PT+1P 
FT+3PN  VP+PAAR_AUX+PT+1S
FT+RP+3PN  VP+PAAR_AUX+PRT+3SE 
FT+NOM_ana  VP+PAAR_AUX+PRT+3SE+PL 
FT+2S  VP+PAAR_AUX+PRT+3SF 
FT+2EH  VP+PAAR_AUX+PRT+3SM 
FT+2EH+PL  VP+PAAR_AUX+PRT+1P 
FT+1S  VP+PAAR_AUX+PRT+2S 
FT+1P  VP+PAAR_AUX+PRT+1S 
FT_3SN  VP+PAAR_AUX+FT+3SE 
RP_UM  VP+PAAR_AUX+FT+3SF 

VP+PAAR_AUX+FT+3SM  VP+KODU_AUX+FT+2EH 
VP+PAAR_AUX+FT+2S  VP+KODU_AUX+FT+2EH+PL 
VP+PAAR_AUX+FT+1S  VP+KODU_AUX+FT+1S 
VP+PAAR_AUX+FT+1P  VP+KODU_AUX+FT+1P 
VP+IRU_AUX+PT+3SE  VP+POO_AUX+PT+3SM 
VP+IRU_AUX+PT+3SF  VP+POO_AUX+PT+3SF 
VP+IRU_AUX+PT+3SM  VP+POO_AUX+PT+3SE 
VP+IRU_AUX+PT+2S  VP+POO_AUX+PT+3SE+PL 
VP+IRU_AUX+PT+1S  VP+POO_AUX+PT+3SN
VP+IRU_AUX+PT+1P  VP+POO_AUX+PT+3PN 
VP+IRU_AUX+PRT+3SE  VP+POO_AUX+PT+2S 
VP+IRU_AUX+PRT+3SF  VP+POO_AUX+PT+2EH 
VP+IRU_AUX+PRT+3SM  VP+POO_AUX+PT+2EH+PL 
VP+IRU_AUX+PRT+1S  VP+POO_AUX+PT+1S 
VP+IRU_AUX+PRT+1P  VP+POO_AUX+PT+1P 
VP+IRU_AUX+FT+3SE  VP+POO_AUX+PRT+3SM 
VP+IRU_AUX+FT+3SF  VP+POO_AUX+PRT+3SF 
VP+IRU_AUX+FT+3SM  VP+POO_AUX+PRT+3SE 
VP+IRU_AUX+FT+2S  VP+POO_AUX+PRT+3SE+PL 
VP+IRU_AUX+FT+1S  VP+POO_AUX+PRT+3SN 
VP+IRU_AUX+FT+1P  VP+POO_AUX+PRT+3PN 
VP+KODU_AUX+PT+3SM  VP+POO_AUX+PRT+2S 
VP+KODU_AUX+PT+3SF  VP+POO_AUX+PRT+2EH 
VP+KODU_AUX+PT+3SE  VP+POO_AUX+PRT+2EH+PL 
VP+KODU_AUX+PT+3SE  VP+POO_AUX+PRT+1S 
VP+KODU_AUX+PT+3SN  VP+POO_AUX+PRT+1P 
VP+KODU_AUX+PT+3PN  VP+POO_AUX+FT+3SM 
VP+KODU_AUX+PT+2S  VP+POO_AUX+FT+3SF
VP+KODU_AUX+PT+2SE  VP+POO_AUX+FT+3SE 
VP+KODU_AUX+PT+2SE+PL  VP+POO_AUX+FT+3SE+PL 
VP+KODU_AUX+PT+1S  VP+POO_AUX+FT+3SN
VP+KODU_AUX+PT+1P  VP+POO_AUX+FT+3PN
VP+KODU_AUX+PRT+3SM  VP+POO_AUX+FT+2S 
VP+KODU_AUX+PRT+3SF  VP+POO_AUX+FT+2EH 
VP+KODU_AUX+PRT+3SE  VP+POO_AUX+FT+2EH+PL 
VP+KODU_AUX+PRT+3SE+PL  VP+POO_AUX+FT+1S 
VP+KODU_AUX+PRT+3SN  VP+POO_AUX+FT+1P 
VP+KODU_AUX+PRT+3PN  VP+VIDU_AUX+PT+3SE 
VP+KODU_AUX+PRT+2S  VP+VIDU_AUX+PT+3SF
VP+KODU_AUX+PRT+2EH  VP+VIDU_AUX+PT+3SM 
VP+KODU_AUX+PRT+2EH+PL  VP+VIDU_AUX+PT+2S 
VP+KODU_AUX+PRT+1S  VP+VIDU_AUX+PT+1P 
VP+KODU_AUX+PRT+1P  VP+VIDU_AUX+PT+1S
VP+KODU_AUX+FT+3SM  VP+VIDU_AUX+PRT+3SE 
VP+KODU_AUX+FT+3SF  VP+VIDU_AUX+PRT+3SM 
VP+KODU_AUX+FT+3SE  VP+VIDU_AUX+PRT+3SF 
VP+KODU_AUX+FT+3SE+PL  VP+VIDU_AUX+PRT+2S 
VP+KODU_AUX+FT+3SN  VP+VIDU_AUX+PRT+1P 
VP+KODU_AUX+FT+3PN  VP+VIDU_AUX+PRT+1S 
VP+KODU_AUX+FT+2S  VP+VIDU_AUX+FT+3SE

VP+VIDU_AUX+FT+3SF  VP+THOLAI_AUX+PRT+3SM 
VP+VIDU_AUX+FT+3SM  VP+THOLAI_AUX+PRT+2S 
VP+VIDU_AUX+FT+2S  VP+THOLAI_AUX+PRT+1P 
VP+VIDU_AUX+FT+1S  VP+THOLAI_AUX+PRT+1S 
VP+VIDU_AUX+FT+1P  VP+THOLAI_AUX+FT+3SE 
VP+KI_AUX+PT+3SE  VP+THOLAI_AUX+FT+3SM 
VP+KI_AUX+PT+3SM  VP+THOLAI_AUX+FT+3SF 
VP+KI_AUX+PT+3SF  VP+THOLAI_AUX+FT+2S 
VP+KI_AUX+PT+2S  VP+THOLAI_AUX+FT+1S 
VP+KI_AUX+PT+1P  VP+THOLAI_AUX+FT+1P 
VP+KI_AUX+PT+1S  VP+THALLU_AUX+PT+3SE 
VP+KI_AUX+PRT+3SE  VP+THALLU_AUX+PT+3SF 
VP+KI_AUX+PRT+3SF  VP+THALLU_AUX+PT+3SM 
VP+KI_AUX+PRT+3SM  VP+THALLU_AUX+PT+2S 
VP+KI_AUX+PRT+2S  VP+THALLU_AUX+PT+1P 
VP+KI_AUX+PRT+1P  VP+THALLU_AUX+PT+1S 
VP+KI_AUX+PRT+1S  VP+THALLU_AUX+FT+3SE 
VP+KI_AUX+FT+3SE  VP+THALLU_AUX+FT+3SM 
VP+KI_AUX+FT+3SM  VP+THALLU_AUX+FT+3SF 
VP+KI_AUX+FT+3SF  VP+THALLU_AUX+FT+2S 
VP+KI_AUX+FT+2S  VP+THALLU_AUX+FT+1S 
VP+KI_AUX+FT+1P  VP+THALLU_AUX+FT+1P 
VP+KI_AUX+FT+1S  VP+KIZI_AUX+PT+3SE 
VP+AUX_aayiRRu  VP+KIZI_AUX+PT+3SM
VP+FT+POODU_AUX+PT+3SE  VP+KIZI_AUX+PT+3SF 
VP+FT+POODU_AUX+PT+3SM  VP+KIZI_AUX+PT+2S 
VP+FT+POODU_AUX+PT+3SF  VP+KIZI_AUX+PT+1S 
VP+FT+POODU_AUX+PT+2S  VP+KIZI_AUX+PT+1P
VP+FT+POODU_AUX+PT+1S  VP+KIZI_AUX+FT+3SE 
VP+FT+POODU_AUX+PT+1P  VP+KIZI_AUX+FT+3SM 
VP+FT+POODU_AUX+PRT+3SE  VP+KIZI_AUX+PRT+3SF
VP+FT+POODU_AUX+PRT+3SM VP+KIZI_AUX+FT+2S
VP+FT+POODU_AUX+PRT+3SF  VP+KIZI_AUX+FT+1S 
VP+FT+POODU_AUX+PRT+2S  VP+KIZI_AUX+FT+1P 
VP+FT+POODU_AUX+PRT+1S  VP+KIZI_AUX+PRT+3SE
VP+FT+POODU_AUX+PRT+1P  VP+KIZI_AUX+PRT+3SM 
VP+FT+POODU_AUX+FT+3SE  VP+KIZI_AUX+PRT+3SF 
VP+FT+POODU_AUX+FT+3SM  VP+KIZI_AUX+PRT+2S 
VP+FT+POODU_AUX+FT+3SF  VP+KIZI_AUX+PRT+1S
VP+FT+POODU_AUX+FT+2S  VP+KIZI_AUX+PRT+1P 
VP+FT+POODU_AUX+FT+1P  VP+KIZI_AUX+FT+3SE 
VP+FT+POODU_AUX+FT+1S  VP+KIZI_AUX+FT+3SM 
VP+THOLAI_AUX+PT+3SE  VP+KIZI_AUX+FT+3SF
VP+THOLAI_AUX+PT+3SM  VP+KIZI_AUX+FT+2S 
VP+THOLAI_AUX+PT+3SF  VP+KIZI_AUX+FT+1S 
VP+THOLAI_AUX+PT+2S  VP+KIZI_AUX+FT+1P
VP+THOLAI_AUX+PT+1P  VP+KIDA_AUX+PT+3SE
VP+THOLAI_AUX+PT+3SM  VP+KIDA_AUX+PT+3SM 
VP+THOLAI_AUX+PRT+3SE  VP+KIDA_AUX+PT+3SF 
VP+THOLAI_AUX+PRT+3SF  VP+KIDA_AUX+PT+2S

VP+KIDA_AUX+PT+1S  INF+AUX_attum
VP+KIDA_AUX+PT+1P  INF+MOD_vendum 
VP+KIDA_AUX+PRT+3SE  INF+MOD_vendam 
VP+KIDA_AUX+PRT+3SM  INF+MOD_koodum 
VP+KIDA_AUX+PRT+3SF  INF+MOD_koodathu
VP+KIDA_AUX+PRT+2S  INF+MAATTU_AUX+3SM 
VP+KIDA_AUX+PRT+1S  INF+MAATTU_AUX+3SF 
VP+KIDA_AUX+PRT+1P  INF+MAATTU_AUX+3SE 
VP+KIDA_AUX+FT+3SE  INF+MAATTU_AUX+1S
VP+KIDA_AUX+FT+3SM  INF+MAATTU_AUX+2S 
VP+KIDA_AUX+FT+3SF  INF+MAATTU_AUX+1P 
VP+KIDA_AUX+FT+2S  INF+MOD_illai 
VP+KIDA_AUX+FT+1S  INF+IYAL_AUX+RP_UM 
VP+KIDA_AUX+FT+1P  INF+IYAL_AUX+FT_3SN 
VP+THEER_AUX+PT+3SE  INF+IYAL_AUX+PT+3SN 
VP+THEER_AUX+PT+3SM  INF+IYAL_AUX+PRT+3SN 
VP+THEER_AUX+PT+3SF  INF+MUDI_AUX+PT+3SN 
VP+THEER_AUX+PT+2S  INF+MUDI_AUX+PRT+3SN 
VP+THEER_AUX+PT+1P  INF+MUDI_AUX+FT_3SN 
VP+THEER_AUX+PT+1S  INF+MUDI_AUX+RP_UM 
VP+THEER_AUX+PRT+3SE  INF+IRU_AUX+PT+3SM 
VP+THEER_AUX+PRT+3SM  INF+IRU_AUX+PT+3SF 
VP+THEER_AUX+PRT+3SF  INF+IRU_AUX+PT+3SE 
VP+THEER_AUX+PRT+2S  INF+POO_AUX+PT+3SM 
VP+THEER_AUX+PRT+1P  INF+POO_AUX+PT+3SF 
VP+THEER_AUX+PRT+1S  INF+POO_AUX+PT+3SE 
VP+THEER_AUX+FT+3SE  INF+VAA_AUX+PT+3SM 
VP+THEER_AUX+FT+3SM  INF+VAA_AUX+PT+3SF
VP+THEER_AUX+FT+3SF  INF+VAA_AUX+PT+3SE 
VP+THEER_AUX+FT+2S  INF+PAAR_AUX+PT+3SM 
VP+THEER_AUX+FT+1P  INF+PAAR_AUX+PT+3SF 
VP+THEER_AUX+FT+1S  INF+PAAR_AUX+PT+3SE 
VP+MUDI_AUX+PT+3SE  INF+VAI_AUX+PT+3SE 
VP+MUDI_AUX+PT+3SM  INF+VAI_AUX+PT+3SM 
VP+MUDI_AUX+PT+3SF  INF+VAI_AUX+PT+3SF
VP+MUDI_AUX+PT+2S  INF+PANNU_AUX+PT+3SM 
VP+MUDI_AUX+PT+1P  INF+PANNU_AUX+PT+3SF 
VP+MUDI_AUX+PT+1S  INF+PANNU_AUX+PT+3SE 
VP+MUDI_AUX+PRT+3SE  INF+SEY_AUX+PT+3SM 
VP+MUDI_AUX+PRT+3SM  INF+SEY_AUX+PT+3SF 
VP+MUDI_AUX+PRT+3SF  INF+SEY_AUX+PT+3SE 
VP+MUDI_AUX+PRT+2S  INF+PERU_AUX+PT+3SM 
VP+MUDI_AUX+PRT+1P  INF+PERU_AUX+PT+3SF 
VP+MUDI_AUX+PRT+1S  INF+PERU_AUX+PT+3SE 
VP+MUDI_AUX+FT+3SE  INF+PADU_AUX+PT+3SN 
VP+MUDI_AUX+FT+3SM  INF+PADU_AUX+PT+3SM 
VP+MUDI_AUX+FT+3SF  INF+PADU_AUX+PT+3SF 
VP+MUDI_AUX+FT+2S  INF+PADU_AUX+PT+3SE 
VP+MUDI_AUX+FT+1S  INF+PADU_AUX+PRT+3SN 
VP+MUDI_AUX+FT+1P  INF+PADU_AUX+PRT+3SM 

INF+PADU_AUX+PRT+3SF  VP+PAAR_AUX+INF+MOD_vendum 
INF+PADU_AUX+PRT+3SE  VP+IRU_AUX+INF+MOD_vendum 
INF+PADU_AUX+FT_3SN  VP+KODU_AUX+INF+MOD_vendu 
INF+PADU_AUX+RP_UM  VP+POO_AUX+INF+MOD_vendum 
INF+VVA_AUX+PT+3SN  VP+VIDU_AUX+INF+MOD_vendum 
INF+VVA_AUX+PRT+3SN  VP+KI_AUX+INF+MOD_vendum 
INF+VVA_AUX+FT_3SN  VP+POODU_AUX+INF+MOD_vendu 
INF+VVA_AUX+RP_UM  VP+THALLU_AUX+INF+MOD_vendum 
INF+VI_AUX+PT+3SN  VP+KIDA_AUX+INF+MOD_vendum 
INF+VI_AUX+PT+3SN  VP+THEER_AUX+INF+MOD_vendum 
INF+VI_AUX+PRT+3SN  VP+MUDI_AUX+INF+MOD_vendum 
INF+VI_AUX+PRT+3SN  VP+PAAR_AUX+INF+MOD_vendum 
INF+VI_AUX+FT_3SN  VP+SEY_AUX+INF+MOD_vendum 
INF+VI_AUX+RP_UM  VP+KAADDU_AUX+INF+MOD_venda 
VP+KAADDU_AUX+PT+3SM  VP+VAI_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PT+3SF  VP+THOLAI_AUX+INF+MOD_venda 
VP+KAADDU_AUX+PT+3SE  VP+KIZI_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PT+2S  VP+PAAR_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PT+1S  VP+IRU_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PT+1P  VP+KODU_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PRT+3SM  VP+POO_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PRT+3SF  VP+VIDU_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PRT+3SE  VP+KI_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PRT+2S  VP+POODU_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+PRT+1P  VP+THALLU_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+FT+3SE  VP+KIDA_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+FT+3SM  VP+THEER_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+FT+3SF  VP+MUDI_AUX+INF+MOD_venda 
VP+KAADDU_AUX+FT+2S  VP+PAAR_AUX+INF+MOD_vendam 
VP+KAADDU_AUX+FT+1P  VP+SEY_AUX+INF+MOD_vendam 
VP+VAI_AUX+PT+3SE  VP+PAAR_AUX+INF+MOD_illai 
VP+VAI_AUX+PT+3SM  VP+IRU_AUX+INF+MOD_illai 
VP+VAI_AUX+PT+3SF  VP+KODU_AUX+INF+MOD_illai 
VP+VAI_AUX+PT+2S  VP+POO_AUX+INF+MOD_illai 
VP+VAI_AUX+PT+1P  VP+VIDU_AUX+INF+MOD_illai 
VP+VAI_AUX+PT+1S  VP+KI_AUX+INF+MOD_illai 
VP+VAI_AUX+PRT+3SE  VP+POODU_AUX+INF+MOD_illai 
VP+VAI_AUX+PRT+2S  VP+THOLAI_AUX+INF+MOD_illai 
VP+VAI_AUX+PRT+3SM  VP+THALLU_AUX+INF+MOD_illai 
VP+VAI_AUX+PRT+3SF  VP+KIZI_AUX+INF+MOD_illai 
VP+VAI_AUX+PRT+1P  VP+KIDA_AUX+INF+MOD_illai 
VP+VAI_AUX+FT+3SE  VP+THEER_AUX+INF+MOD_illai 
VP+VAI_AUX+FT+3SM  VP+MUDI_AUX+INF+MOD_illai 
VP+VAI_AUX+FT+3SF  VP+VAA_AUX+INF+MOD_illai 
VP+VAI_AUX+FT+2S  VP+VAI_AUX+INF+MOD_illai 
VP+VAI_AUX+FT+1P  VP+KAADDU_AUX+INF+MOD_illai 
VP+KAADDU_AUX+INF+MOD_vend VP+KODU_AUX+INF+MOD_illai 
VP+VAI_AUX+INF+MOD_vendum  INF+VAA_AUX+INF+MOD_illai 
VP+THOLAI_AUX+INF+MOD_vendu  INF+VAI_AUX+INF+MOD_illai 
VP+KIZI_AUX+INF+MOD_vendum INF+SEY_AUX+INF+MOD_illai 

INF+PANNU_AUX+INF+MOD_illai INF+PADU_AUX+VP+VIDU_AUX+RP_UM
INF+PERU_AUX+INF+MOD_illai  INF+PADU_AUX+VP+VIDU_AUX+PT+3SN 
INF+PADU_AUX+INF+MOD_illai  VP+KII_AUX+PRT+1S 
INF+VENDU_AUX+VP+NOM_athu+ VP+KII_AUX+PT+1S 
VP+KII_AUX+FT+1S
INF+VENDU_AUX+VP+3SN+MOD_illai  VP+KII_AUX+PRT+1P 
FT+NOM_athu+MOD_illai  VP+KII_AUX+PT+1P 
FT+3SN+MOD_illai  VP+KII_AUX+FT+1P 
FT+NOM_athu+CM_dat+MOD_illai  VP+KII_AUX+PRT+2S
INF+UL_AUX+RP+3SM  VP+KII_AUX+PT+2S 
INF+UL_AUX+RP+3SF  VP+KII_AUX+FT+1S 
INF+UL_AUX+RP+3PE  VP+KII_AUX+PRT+3SM 
INF+UL_AUX+RP+3SN  VP+KII_AUX+PT+3SM
Ungal  VP+KII_AUX+FT+3SM 
INF+CL_um  VP+KII_AUX+PRT+3SF 
INF+PADU_AUX+VP+IRU_AUX+PT+3S  VP+KII_AUX+PT+3SF
INF+PADU_AUX+VP+IRU_AUX+PT+3 VP+KII_AUX+FT+3SF 
VP+KII_AUX+PRT+3SN 
INF+PADU_AUX+VP+IRU_AUX+PT+3 VP+KII_AUX+PT+3SN 
VP+KII_AUX+FT+3SN
INF+PADU_AUX+VP+IRU_AUX+PT+2 VP+KII_AUX+PRT+3PE 
VP+KII_AUX+PT+3PE 
INF+PADU_AUX+VP+IRU_AUX+PT+1s  VP+KII_AUX+FT+3PE 
INF+PADU_AUX+VP+IRU_AUX+PT+1S VP+IRU_AUX+PRT+1P
INF+PADU_AUX+VP+IRU_AUX+PRT+3SE  VP+IRU_AUX+PRT+2S 
INF+PADU_AUX+VP+IRU_AUX+P3SE+PL  VP+IRU_AUX+PT+2S 
INF+PADU_AUX+VP+IRU_AUX+PRT+3SF  VP+IRU_AUX+PRT+3SN 
INF+PADU_AUX+VP+IRU_AUX+PRT+3S VP+IRU_AUX+PT+3SN
VP+IRU_AUX+FT+3SN 
INF+PADU_AUX+VP+IRU_AUX+PRT+2S VP+KI_AUX+PRT+3SN 
INF+PADU_AUX+VP+IRU_AUX+PRT+1P  VP+KI_AUX+PT+3SN
INF+PADU_AUX+VP+IRU_AUX+PRT+1S  VP+KI_AUX+FT+3SN
INF+PADU_AUX+VP+IRU_AUX+PRT+3S VP+IRU_AUX+PRT+3PE 
VP+IRU_AUX+PT+3PE 
INF+PADU_AUX+VP+IRU_AUX+FT+3SE VP+IRU_AUX+FT+3PE
INF+PADU_AUX+VP+IRU_AUX+FT+3SM VP+KI_AUX+PRT+3PE 
INF+PADU_AUX+VP+IRU_AUX+FT+3SF  VP+KI_AUX+PT+3PE 
INF+PADU_AUX+VP+IRU_AUX+FT+2S  VP+KI_AUX+FT+3PE 
INF+PADU_AUX+VP+IRU_AUX+FT+1P VP+IRU_AUX+FT_3SN
INF+PADU_AUX+VP+IRU_AUX+FT+1S  VP+IRU_AUX+RP_UM 
INF+PADU_AUX+VP+IRU_AUX+PT+3SN  VP+KI_AUX+FT_3SN 
INF+PADU_AUX+VP+IRU_AUX+FT_3SN  VP+KI_AUX+RP_UM 
INF+PADU_AUX+VP+IRU_AUX+RP_UM VP+KII_AUX+FT_3SN
INF+PADU_AUX+VP+IRU_AUX+INF+MOD  VP+KII_AUX+RP_UM 
INF+PADU_AUX+VP+IRU_AUX+INF+MOD 
INF+PADU_AUX+VP+UL_AUX+RP+3PE 
INF+PADU_AUX+VP+VIDU_AUX+FT_3SN
 

B.4 TAMIL NOUN WORD FORMS

நரம்ைப நரம்பின் லம்


நரம்பிைன நரம்பின லம்
நரம்பினத்ைத நரம்பதன் லம்
நரம்ேபா நரம் ப்பக்கம்
நரம்பிேனா நரம்பின்பக்கம்
நரம்பினத்ேதா நரம்பதின்பக்கம்
நரம்பினால் நரம்பண்ைட
நரம்பால் நரம்பினண்ைட
நரம் க்கு நரம் க்கண்ைட
நரம்பிற்கு நரம்பின ேக
நரம்பின் நரம் க்க ேக
நரம்ப நரம்பதன ேக
நரம்பின நரம்ப கில்
நரம்பின்கண் நரம்பின கில்
நரம்ப கண் நரம்பதன கில்
நரம் க்காக நரம் க்க கில்
நரம்பாலான நரம் கிட்ட
நரம் ைடய நரம்பின்கிட்ட
நரம்பி ைடய நரம் ேமல்
நரம்பில் நரம்பின்ேமல்
நரம்பினில் நரம் க்குேமல்
நரம் டன் நரம்பின ேமல்
நரம்பி டன் நரம் ேமேல
நரம் வைரக்கும் நரம்பின்ேமேல
நரம் வைரயில் நரம் க்குேமேல
நரம்பின்வைர நரம்ப ேமேல
நரம்பில்லாமல் நரம்பின ேமேல
நரம்ேபா ல்லாமல் நரம்பின்கீழ்
நரம் க்கில்லாமல் நரம் க்குங்கீழ்
நரம் டனில்லாமல் நரம்பதின்கீழ்
நரம்பாட்டம் நரம்பின்கீேழ
நரம் க்காட்டம் நரம் க்குங்கீேழ
நரம் தல் நரம்பதின்கீேழ
நரம் வழியாக நரம்ைபப்பற்றி
நரம்பின்வழியாக நரம்பிைனப்பற்றி
நரம்பின வழியாக நரம்பினத்ைதப்பற்றி
நரம்பிடம் நரம்ைபக்குறித்

நரம்பிைனக்குறித் நரம்பிைனப்ேபால்
நரம்பினத்ைதக்குறித் நரம்பினத்ைதப்ேபால்
நரம்ைபப்பார்த் நரம் மாதிாி
நரம்பிைனப்பார்த் நரம்ைபமாதிாி
நரம்பினத்ைதப்பார்த் நரம்ைபவிட
நரம்ைபேநாக்கி நரம்பிைனவிட
நரம்பினத்ைதேநாக்கி நரம்பினத்ைதவிட
நரம்பிைனேநாக்கி நரம் க்குப்பதிலாக
நரம்ைபச்சுற்றி நரம் க்காக
நரம்பினத்ைதச்சுற்றி நரம்பி க்காக
நரம்பிைனச்சுற்றி நரம் க்குப்பிறகு
நரம்ைபத்தாண் நரம் க்கப் றம்
நரம்பிைனத்தாண் நரம் க்கப்பால்
நரம்பினத்ைதத்தாண் நரம் க்குேமல்
நரம்ைபத்தவிர்த் நரம் க்குேமேல
நரம்பிைனத்தவிர்த் நரம் க்குங்கீழ்
நரம்பினத்ைதத்தவிர்த் நரம் க்குங்கீேழ
நரம்ைபத்தவிர நரம் க்குள்
நரம்பிைனத்தவிர நரம் க்குள்ேள
நரம்பினத்ைதத்தவிர நரம் க்குெவளியில்
நரம்ெபாழிய நரம் க்குெவளிேய
நரம்ைபெயாழிய நரம் க்க யில்
நரம்பினத்ைதெயாழிய நரம் க்க கில்
நரம்ைபெயாட் நரம் ன்
நரம்பிைனெயாட் நரம் க்கு ன்
நரம்பினத்ைதெயாட் நரம் ன்னால்
நரம்ைபக்ெகாண் நரம் க்கு ன்னால்
நரம்பிைனக்ெகாண் நரம் பின்னால்
நரம்பினத்ைதக்ெகாண் நரம் க்குப்பின்னால்
நரம்ைபைவத் நரம் க்குப்பின்
நரம்பினத்ைதைவத் நரம் க்குப்பிந்தி
நரம்பிைனைவத் நரம் க்குகு க்ேக
நரம்ைபவிட் நரம் க்குள்
நரம்பிைனவிட் நரம்பி ள்
நரம்பினத்ைதவிட் நரம் க்குள்ேள
நரம்ைபப்ேபால நரம்பி க்குள்ேள
நரம்பிைனப்ேபால நரம்ெபதிேர
நரம்பினத்ைதப்ேபால நரம் க்ெகதிேர
நரம்ைபப்ேபால் நரம்ெபதிர்க்கு

நரம்ெபதிாில் நரம் கள்வைரக்கும்
நரம் க்ெகதிாில் நரம் கள்வைரயில்
நரம் க்ெகதிர்த்தாற்ேபால் நரம் களில்லாமல்
நரம் க்கிைடயில் நரம் களல்லாமல்
நரம் க்குந வில் நரம் களாட்டம்
நரம் க்க த்தாற்ேபால் நரம் கள் தல்
நரம்பி ந் நரம் களின்ப
நரம் டன் நரம் களின்வழியாக
நரம்பிடமி ந் நரம் களிடம்
நரம் லமி ந் நரம் களின் லம்
நரம் ப்பக்கமி ந் நரம் களின்பக்கம்
நரம்பண்ைடயி ந் நரம் களினண்ைட
நரம்ப ேகயி ந் நரம் கள ேக
நரம் க்க ேகயி ந் நரம் களின ேக
நரம்ப கி ந் நரம் களின கில்
நரம் க்க கி ந் நரம் கள்கிட்ட
நரம் கிட்டயி ந் நரம் களின்ேமல்
நரம் ேம ந் நரம் களின்ேமேல
நரம் க்குேம ந் நரம் களின்கீழ்
நரம் ேமேலயி ந் நரம் களின்கீேழ
நரம் க்குேமேலயி ந் நரம் கைளப்பற்றி
நரம் க்குங்கீழி ந் நரம் கைளக்குறித்
நரம்பின்கீழி ந் நரம் கைளப்பார்த்
நரம்பதின்கீழி ந் நரம் கைளேநாக்கி
நரம்பின்கீேழயி ந் நரம் கைளச்சுற்றி
நரம்பதன்கீேழயி ந் நரம் கைளத்தாண்
நரம் கள் நரம் கைளத்தவிர்த்
நரம் கைள நரம் கைளத்தவிர
நரம் களிைன நரம் கைளெயாழிய
நரம் களின் நரம் கைளெயாட்
நரம் க க்காக நரம் கைளக்ெகாண்
நரம் க க்கான நரம் கைளைவத்
நரம் க க்கு நரம் கைளவிட்
நரம் களில் நரம் கைளப்ேபால்
நரம் க டன் நரம் கைளப்ேபால
நரம் கேளா நரம் கைளமாதிாி
நரம் கள நரம் கைளவிட
நரம் களால் நரம் க க்குப்பதிலாக
நரம் க டன் நரம் க க்காக

B.5 TAMIL VERB WORD FORMS

ப ப க்கின்ற
ப த்தான் ப க்காத
ப த்தாள் ப த்தவன்
ப த்தார் ப த்தவள்
ப த்தார்கள் ப த்தவர்
ப த்த ப த்த
ப த்தன ப க்கின்றவன்
ப த்தாய் ப க்கின்றவள்
ப த்தீர் ப க்கின்றவர்
ப த்தீர்கள் ப க்கின்ற
ப த்ேதன் ப ப்பவன்
ப த்ேதாம் ப ப்பவள்
ப க்கிறான் ப ப்பவர்
ப க்கிறாள் ப ப்ப
ப க்கிறார் ப க்காதவன்
ப க்கிறார்கள் ப க்காதவள்
ப க்கின்ற ப க்காதவர்
ப க்கின்றன ப க்காத
ப க்கின்றாய் ப த்தைவ
ப க்கின்றீர் ப க்காதன
ப க்கின்றீர்கள் ப த்
ப க்கின்ேறன் ப க்க
ப க்கின்ேறாம் ப க்கா
ப ப்பான் ப க்காமல்
ப ப்பாள் ப த்தல்
ப ப்பார் ப க்காைம
ப ப்பார்கள் ப க்கலாம்
ப ப்ப ப க்கலாகும்
ப ப்பன ப க்கலாகா
ப ப்பாய் ப த் ப்பார்த்தான்
ப ப்பீர் ப த் ப்பார்த்தாள்
ப ப்பீர்கள் ப த் ப்பார்த்தார்
ப ப்ேபன் ப த் ப்பார்த்தாய்
ப ப்ேபாம் ப த் ப்பார்த்ேதாம்
ப க்கும் ப த் ப்பார்த்ேதன்
ப த்த ப த் ப்பார்க்கின்றார்

ப த் ப்பார்க்கின்றார்கள் ப த் க்ெகா த்ேதாம்
ப த் ப்பார்க்கின்றாள் ப த் க்ெகா க்கிறான்
ப த் ப்பார்க்கின்றான் ப த் க்ெகா க்கிறாள்
ப த் ப்பார்க்கின்ேறாம் ப த் க்ெகா க்கிறார்
ப த் ப்பார்க்கின்றாய் ப த் க்ெகா க்கிறார்கள்
ப த் ப்பார்க்கின்ேறன் ப த் க்ெகா க்கின்ற
ப த் ப்பார்ப்பார் ப த் க்ெகா க்கின்றன
ப த் ப்பார்ப்பாள் ப த் க்ெகா க்கின்றாய்
ப த் ப்பார்ப்பான் ப த் க்ெகா க்கின்றீர்
ப த் ப்பார்ப்பாய் ப த் க்ெகா க்கின்றீர்கள்
ப த் ப்பார்ப்ேபன் ப த் க்ெகா க்கின்ேறன்
ப த் ப்பார்ப்ேபாம் ப த் க்ெகா க்கின்ேறாம்
ப த்தி ந்தார் ப த் க்ெகா ப்பான்
ப த்தி ந்தாள் ப த் க்ெகா ப்பாள்
ப த்தி ந்தான் ப த் க்ெகா ப்பார்
ப த்தி ந்தா ப த் க்ெகா ப்பார்கள்
ப த்தி ந்ேதன் ப த் க்ெகா ப்ப
ப த்தி ந்ேதாம் ப த் க்ெகா ப்பன
ப த்தி க்கின்றார் ப த் க்ெகா ப்பாய்
ப த்தி க்கின்றாள் ப த் க்ெகா ப்பீர்
ப த்தி க்கின்றான் ப த் க்ெகா ப்பீர்கள்
ப த்தி க்கின்ேறன் ப த் க்ெகா ப்ேபன்
ப த்தி க்கின்ேறாம் ப த் க்ெகா ப்ேபாம்
ப த்தி ப்பார் ப த் ப்ேபானான்
ப த்தி ப்பாள் ப த் ப்ேபானாள்
ப த்தி ப்பான் ப த் ப்ேபானார்
ப த்தி ப்பாய் ப த் ப்ேபானார்கள்
ப த்தி ப்ேபன் ப த் ப்ேபான
ப த்தி ப்ேபாம் ப த் ப்ேபாயின
ப த் க்ெகா த்தான் ப த் ப்ேபானாய்
ப த் க்ெகா த்தாள் ப த் ப்ேபானீர்
ப த் க்ெகா த்தார் ப த் ப்ேபானீர்கள்
ப த் க்ெகா த்தார் ப த் ப்ேபாேனன்
ப த் க்ெகா த்த ப த் ப்ேபாேனாம்
ப த் க்ெகா த்தன ப த் ப்ேபாகிறான்
ப த் க்ெகா த்தாய் ப த் ப்ேபாகிறாள்
ப த் க்ெகா த்தீர் ப த் ப்ேபாகிறார்
ப த் க்ெகா த்தீர்கள் ப த் ப்ேபாகிறார்கள்
ப த் க்ெகா த்ேதன் ப த் ப்ேபாகின்ற

ப த் ப்ேபாகின்றன ப த் க்ெகாண் ந்ேதாம்
ப த் ப்ேபாகின்றாய் ப த் க்ெகாண் ந்ேதன்
ப த் ப்ேபாகின்றீர் ப த் க்ெகாண் க்கிறார்
ப த் ப்ேபாகின்றீர்கள் ப த் க்ெகாண் க்கிறாள்
ப த் ப்ேபாகின்ேறன் ப த் க்ெகாண் க்கிறான்
ப த் ப்ேபாகின்ேறாம் ப த் க்ெகாண் க்கிறாய்
ப த் ப்ேபாவான் ப த் க்ெகாண் க்கிேறாம்
ப த் ப்ேபாவாள் ப த் க்ெகாண் க்கிேறன்
ப த் ப்ேபாவார் ப த் க்ெகாண் ப்பார்
ப த் ப்ேபாவார்கள் ப த் க்ெகாண் ப்பான்
ப த் ப்ேபாவ ப த் க்ெகாண் ப்பாள்
ப த் ப்ேபாவன ப த் க்ெகாண் ப்பாய்
ப த் ப்ேபாவாய் ப த் க்ெகாண் ப்ேபாம்
ப த் ப்ேபா ர் ப த் க்ெகாண் ப்ேபன்
ப த் ப்ேபா ர்கள் ப த்தாயிற்
ப த் ப்ேபாேவன் ப த் ப்ேபாட்டார்
ப த் ப்ேபாேவாம் ப த் ப்ேபாட்டான்
ப த் விட்டார் ப த் ப்ேபாட்டாள்
ப த் விட்டாள் ப த் ப்ேபாட்டாய்
ப த் விட்டான் ப த் ப்ேபாட்ேடன்
ப த் விட்டாய் ப த் ப்ேபாட்ேடாம்
ப த் விட்ேடாம் ப த் ப்ேபா கிறார்
ப த் விட்ேடன் ப த் ப்ேபா கிறான்
ப த் வி கின்றார் ப த் ப்ேபா கிறாள்
ப த் வி கின்றான் ப த் ப்ேபா கிறாய்
ப த் வி கின்றாள் ப த் ப்ேபா கிேறன்
ப த் வி கின்றாய் ப த் ப்ேபா கிேறாம்
ப த் வி கின்ேறாம் ப த் ப்ேபா வார்
ப த் வி கிேறன் ப த் ப்ேபா வான்
ப த் வி வார் ப த் ப்ேபா வாள்
ப த் வி வாள் ப த் ப்ேபா வாய்
ப த் வி வான் ப த் ப்ேபா ேவாம்
ப த் வி வாய் ப த் ப்ேபா ேவன்
ப த் வி ேவன் ப த் த்ெதாைலத்தார்
ப த் வி ேவாம் ப த் த்ெதாைலத்தான்
ப த் க்ெகாண் ந்தார் ப த் த்ெதாைலத்தாள்
ப த் க்ெகாண் ந்தான் ப த் த்ெதாைலத்தாய்
ப த் க்ெகாண் ந்தாள் ப த் த்ெதாைலேதாம்
ப த் க்ெகாண் ந்தாய் ப த் த்ெதாைலேதான்

ப த் த்ெதாைலகிறார் ப த் க்கிழிக்கிறாய்
ப த் த்ெதாைலகிறாள் ப த் க்கிழிக்கிேறன்
ப த் த்ெதாைலகிறான் ப த் க்கிழிக்கிேறாம்
ப த் த்ெதாைலகிறாய் ப த் க்கிழிப்பார்
ப த் த்ெதாைலகிேறாம் ப த் க்கிழிப்பான்
ப த் த்ெதாைலகிேறன் ப த் க்கிழிப்பாள்
ப த் த்ெதாைலப்பார் ப த் க்கிழிப்பாய்
ப த் த்ெதாைலப்பான் ப த் க்கிழிப்ேபன்
ப த் த்ெதாைலப்பாள் ப த் க்கிழிப்ேபாம்
ப த் த்ெதாைலப்பாய் ப த் க்கிடந்தார்
ப த் த்ெதாைலப்ேபன் ப த் க்கிடந்தான்
ப த் த்ெதாைலப்ேபாம் ப த் க்கிடந்தாள்
ப த் த்தள்ளினார் ப த் க்கிடந்தாய்
ப த் த்தள்ளினாள் ப த் க்கிடந்ேதன்
ப த் த்தள்ளினான் ப த் க்கிடந்ேதாம்
ப த் த்தள்ளினாய் ப த் க்கிடக்கிறார்
ப த் த்தள்ளிேனாம் ப த் க்கிடக்கிறான்
ப த் த்தள்ளிேனன் ப த் க்கிடக்கிறாள்
ப த் த்தள் வார் ப த் க்கிடக்கிறாய்
ப த் த்தள் வான் ப த் க்கிடக்கிேறன்
ப த் த்தள் வாள் ப த் க்கிடக்கிேறாம்
ப த் த்தள் வாய் ப த் க்கிடப்பார்
ப த் த்தள் ேவன் ப த் க்கிடப்பான்
ப த் த்தள் ேவாம் ப த் க்கிடப்பாள்
ப த் க்கிழித்தார் ப த் க்கிடப்பாய்
ப த் க்கிழித்தான் ப த் க்கிடப்ேபன்
ப த் க்கிழித்தாள் ப த் க்கிடப்ேபாம்
ப த் க்கிழித்தாய் ப த் த்தீர்த்தார்
ப த் க்கிழித்ேதன் ப த் த்தீர்த்தான்
ப த் க்கிழித்ேதாம் ப த் த்தீர்த்தாள்
ப த் க்கிழிப்பார் ப த் த்தீர்த்தாய்
ப த் க்கிழிப்பான் ப த் த்தீர்த்ேதாம்
ப த் க்கிழிக்கிறாள் ப த் த்தீர்த்ேதன்
ப த் க்கிழிப்பாய் ப த் த்தீர்க்கிறார்
ப த் க்கிழிப்ேபன் ப த் த்தீர்க்கிறான்
ப த் க்கிழிப்ேபாம் ப த் த்தீர்க்கிறாள்
ப த் க்கிழிக்கிறார் ப த் த்தீர்க்கிறாய்
ப த் க்கிழிக்கிறான் ப த் த்தீர்க்கிேறாம்
ப த் க்கிழிக்கிறாள் ப த் த்தீர்க்கிேறன்

ப த் த்தீர்ப்பார் ப க்க ந்த
ப த் த்தீர்ப்பான் ப க்க கிற
ப த் த்தீர்ப்பாள் ப க்க ம்
ப த் த்தீர்ப்பாய் ப க்கயி ந்தான்
ப த் த்தீர்ப்ேபாம் ப க்கயி ந்தாள்
ப த் த்தீர்ப்ேபன்
ப த் த்தார்
ப த் த்தான்
ப த் த்தாள்
ப த் த்தாய்
ப த் த்ேதாம்
ப த் த்ேதன்
ப த் க்கிறார்
ப த் க்கிறான்
ப த் க்கிறாள்
ப த் க்கிறாய்
ப த் க்கிேறாம்
ப த் க்கிேறன்
ப த் ப்பார்
ப த் ப்பான்
ப த் ப்பாள்
ப த் ப்பாய்
ப த் ப்ேபன்
ப த் ப்ேபாம்
ப க்கட் ம்
ப க்கேவண் ம்
ப க்கேவண்டாம்
ப க்கக்கூ ம்
ப க்கக்கூடா
ப க்கமாட்டான்
ப க்கமாட்டாள்
ப க்கமாட்டார்
ப க்கமாட்ேடன்
ப க்கமாட்டாய்
ப க்கமாட்ேடாம்
ப க்கவில்ைல
ப க்கயிய ம்
ப க்கயியன்ற
ப க்கவிய கிற

B.6 MOSES INSTALLATION AND TRAINING

This subsection explains the installation of the Moses toolkit and the issues that occurred while training the system. Remedies for the issues are also given in detail.

The required packages are in the packages folder.


mkdir smt/moses/tools

copy & paste gizapp into tools and extract. Then,

cd smt/moses/tools/gizapp
make

if error, then
1) yum install gcc
2) yum install gcc-c++
3) yum install glibc-static (for can't find -lm error)
4) yum install libstdc++-static (for can't find -lstdc++
error)

cd ../
mkdir bin

cp giza-pp/GIZA++-v2/GIZA++ bin/
cp giza-pp/mkcls-v2/mkcls bin/
cp giza-pp/GIZA++-v2/snt2cooc.out bin/

copy & paste srilm into tools and extract

cd srilm

change the path SRILM to srilm dir in srilm/MakeFile

in srilm/commom, open makefile.machine.i686


-> under TCL_support
--> comment other two options
--> add NO_TCL = X

yum install automake


yum install zlib-devel
yum install boost-devel

install c shell package i.e., tcsh package

make World
make all

export PATH=/home/anand/smt/moses/tools/srilm/bin/i686:/home/jschroe1/demo/tools/srilm/bin:$PATH

cd ..
copy & paste moses into tools and extract it

yum install libtool

./regenerate-makefiles.sh

./configure --with-srilm=/home/anand/smt/moses/tools/srilm --with-irstlm=/home/anand/smt/moses/tools/irstlm
make -j 2
To confirm setup:

cd /home/anand/smt/moses

mkdir data
cp sample-models

cd sample-models/phrase-model/
../../../tools/moses/moses-cmd/src/moses -f moses.ini < in > out

two sentences -> o/p: this is a small house .

compile moses support scripts-

cd ../../../tools/

mkdir moses-scripts
cd moses/scripts

Then edit makefile


13,14c13,14

< TARGETDIR?=/home/s0565741/terabyte/bin
< BINDIR?=/home/s0565741/terabyte/bin
---
> TARGETDIR?=/home/anand/smt/moses/tools/moses-scripts
> BINDIR?=/home/anand/smt/moses/tools/bin

In moses/scripts/makefile, line 79, change ./check-dependencies.pl to perl check-dependencies.pl. Then:

make release

export SCRIPTS_ROOTDIR=/home/anand/smt/moses/tools/moses-scripts/scripts-YYYYMMDD-HHMM

Additional Scripts
cd ../../
extract scripts
also cp mteval.v11b.pl

For training errors

If the following errors appear:

Use of uninitialized value $a in scalar chomp at tools/moses-scripts/scripts-20110204-0333/training/train-model.perl line 1079.
Use of uninitialized value $a in split at tools/moses-scripts/scripts-20110204-0333/training/train-model.perl line 1082.

then type perl in line 1019 before $GIZA2BAL in train-model.perl.

For tuning errors

If the following error appears:

sh: /home/anand/smt-moses/tools/moses-scripts/scripts-20101111-0136//training/cmert-0.5/score-nbest.py: /usr/bin/python^M: bad interpreter: No such file or directory

then in /home/anand/smt/moses/tools/moses-scripts/scripts-20110204-0333/training/mert-moses.pl, type python before $cmertdir, i.e., in the line $SCORENBESTCMD = "$cmertdir/score-nbest.py" if ! defined $SCORENBESTCMD;

Creating language model

ngram-count -order 4 -interpolate -kndiscount -text SMT-project/smt-cmd/lm/monolingual -lm SMT-project/smt-cmd/lm/monolingual.lm

./ngram-count -order 5 -text 300.morph.txt -lm 300.morph.lm
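A note on these flags (standard SRILM options): -order sets the n-gram order, -text names the training corpus, -lm the output model file, and -kndiscount together with -interpolate selects interpolated modified Kneser-Ney smoothing.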

Training

SMT-project/moses/bin/moses-scripts/scripts-20090302-0358/training/train-factored-phrase-model.perl -scripts-root-dir SMT-project/moses/bin/moses-scripts/scripts-20090302-0358/ -root-dir . -corpus SMT-project/smt-cmd/corpus/corpus -f en -e ma -alignment grow-diag-final -reordering msd-bidirectional-fe -lm 0:4:SMT-project/smt-cmd/lm/monolingual.lm:0

Testing

SMT-project/moses/moses-cmd/src/moses -config SMT-project/smt-cmd/model/moses.ini -input-file SMT-project/smt-cmd/testing/input > SMT-project/smt-cmd/testing/output

Training factored model with alignment and translation factors

bin/moses-scripts/scripts-20100311-1743/training/train-factored-phrase-model.perl -scripts-root-dir bin/moses-scripts/scripts-20100311-1743/ -root-dir running_files -corpus pc/twofactfour -f eng -e tam -lm 0:4:/root/Desktop/D5/srilm/bin/i686/300.lm:0 -lm 2:5:/root/Desktop/D5/srilm/bin/i686/300.pos.lm:0 -lm 3:5:/root/Desktop/D5/srilm/bin/i686/300.morph.lm:0 --alignment-factors 0,1,2,3-0,1,2,3 --translation-factors 0-0+1-1+2-2+3-3 --reordering-factors 0-0+1-1+2-2+3-3 --generation-factors 3-2+3,2-1+1,2,3-0 --decoding-steps t3,t2,t1,g0,g1,g2
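A note on the factor indices in this command (standard Moses conventions, stated here for readability): factors are numbered from 0 in the order they appear in each token, so --translation-factors 0-0+1-1+2-2+3-3 builds one translation table per factor, --generation-factors specifies target-side generation models between sets of factors, and --decoding-steps t3,t2,t1,g0,g1,g2 fixes the order in which the translation (t) and generation (g) steps are applied during decoding.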

Moses Config file

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0
1
2
3

# mapping steps
[mapping]
0 T 0
0 T 1

# translation tables: source-factors, target-factors, number of scores, file
[ttable-file]
1,2 1,2 5
/media/DISK5/tools/trunk/scripts/running_files/model/phr
ase-table.1,2-1,2.gz
2,3 3 5
/media/DISK5/tools/trunk/scripts/running_files/model/phr
ase-table.2,3-3.gz

# no generation models, no generation-file section

# language models: type (srilm/irstlm), factors, order, file
[lmodel-file]
0 1 4 /root/Desktop/D5/srilm/bin/i686/300.lem.lm
0 2 4 /root/Desktop/D5/srilm/bin/i686/300.pos.lm
0 3 4 /root/Desktop/D5/srilm/bin/i686/300.morph_.lm

# limit on how many phrase translations e for each phrase f are loaded
# 0 = all elements loaded
[ttable-limit]
20
0

# distortion (reordering) weight
[weight-d]
0.6

# language model weights
[weight-l]
0.1667
0.1667
0.1667

# translation model weights
[weight-t]
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2

# no generation models, no weight-generation section


# word penalty
[weight-w] -1

[distortion-limit] 6

Testing factored model

./moses-cmd/src/moses -report-all-factors -config ./scripts/running_files/model/moses.ini -input-file ./scripts/test/test.eng > ./scripts/test/test.tam

B.7 COMPARISON WITH GOOGLE OUTPUT
 

English Sentences Google Output


I went to his home. நான் அவர ட் ற்கு ெசன்றார்.
அவள் நண்பர்க டன் விைளயா
She is playing with her friends.
ெகாண் க்கிறார்.
He is a doctor. அவர் ஒ டாக்டர்.
அவர்கள் என் பள்ளியில் பயின்
They are studying in my school.
வ கிறார்கள்.
The book is on the table. த்தகம் அட்டவைண உள்ள .
They will sing a song. அவர்கள் ஒ பாட்ைட பாட ேபாகிறார்.
The rat was killed by the cat . எ ைன ெகால்லப்பட்டார்.
She is studying with me . அவள் என் டன் ப த் ெகாண் க்கிறார்.
She did not come with me. அவள் என்ைன வரவில்ைல.
I deposited money to the bank. நான் வங்கிக்கு பணம் ெடபாசிட்.
We will study today. இன் நாம் ப ப்ேபாம்.

English Sentences F-SMT Output


I went to his home. நான் அவன ட் ற்கு ேபாேனன்.
அவள் நண்பர்க டன் விைளயா
She is playing with her friends.
ெகாண் க்கிறாள்.
He is a doctor. அவர் ஒ டாக்டர்.
அவர்கள் என் பள்ளியில் ப த்
They are studying in my school.
வ கிறார்கள்.
The book is on the table. த்தகம் ேமைசயின் ேமல் உள்ள .

They will sing a song. அவர்கள் ஒ பாட்ைட பாட ேபாகிறார்கள்.

The rat was killed by the cat . எ ைனயால் ெகால்லப்பட்ட .


அவள் என் டன் ப த்
She is studying with me .
ெகாண் க்கிறாள்.

She did not come with me. அவள் என் டன் வரவில்ைல.

I deposited money to the bank. நான் வங்கிக்கு பணம் ெசன்ேறன்.


We will study today. இன் நாம் ப ப்ேபாம்.

B.8 GRAPHICAL USER INTERFACES
 

Figure B.1 Tamil POS-Tagger GUI

 

Figure B.2 Tamil Morphological Analyzer GUI

 

 
 

Figure B.3 Tamil Morphological Generator GUI

   

 

Figure B.4 English to Tamil Machine Translation System GUI

REFERENCES

[1] Lee, L. (2004). ‘‘I’m sorry Dave, I’m afraid I can’t do that’’: Linguistics,
statistics, and natural language processing circa 2001. In On the Fundamentals
of Computer Science: Challenges C, Opportunities CS, Telecommunications
Board NRC (Eds.), Computer science: Reflections on the field, reflections from
the field (pp. 111–118). Washington, DC: The National Academies Press.

[2] Jurafsky Daniel and Martin James H (2005), “An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition”,
Prentice Hall, ISBN: 0130950696, contributing writers: Andrew Kehler, Keith
Vander Linden, and Nigel Ward.

[3] Hutchins John 2001, Machine translation and human translation: in competition
or in complementation?, International Journal of Translation, 13, 1-2, p. 5-20

[4] Allen James (1995), “Natural Language Processing”, Redwood City: Benjamin/Cummings.

[5] http://www.pangea.com.mt/en/q2-why-statistical-mt/

[6] http://cordis.europa.eu/fp7/ict/language-technologies/project-
euromatrixplus_en.html

[7] Ma, Xiaoyi. 1999. Parallel text collections at the Linguistic Data Consortium. In
Machine Translation Summit VII, Singapore.

[8] Durgesh Rao. 2001. Machine Translation in India: A Brief Survey. In “Proceedings of SCALLA 2001 Conference”, Bangalore, India.

[9] http://en.wikipedia.org/wiki/Tamil_language

[10] Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proc.
EMNLP+CoNLL, pages 868–876, Prague

[11] http://en.wikipedia.org/wiki/Subject%E2%80%93verb%E2%80%93object

[12] Jesús Giménez and Lluís Màrquez (2006), SVMTool: Technical manual v1.3, August 2006.

[13] S. Rajendran, Arulmozi S., Ramesh Kumar, Viswanathan S. 2001. “Computational morphology of verbal complex”. Language in India, Volume 3:4, April 2003.

[14] Bahl L and Mercer R. L (1976), “Part-Of-Speech assignment by a statistical


decision algorithm”, IEEE International Symposium on Information Theory.

[15] Church, K. W. (1988). A stochastic parts program and noun phrase parser for
unrestricted text. In Proceedings of the Second Conference on Applied Natural
Language Processing, pages 136–143.

[16] Cutting D, Kupiec J, Pederson J and Nipun P (1992), “A Practical Part-of-


speech Tagger”, Proceedings of the 3rd Conference of Applied Natural
Language Processing, ANLP, 1992, pp. 133-140.

[17] DeRose, S. (1988). Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14, 31-39.

[18] Schmid H (1994), “Probabilistic Part-Of-Speech Tagging using Decision


Trees”, Proceedings of the International Conference on new methods in
language processing, Manchester, UK.

[19] Brill E (1992), “A simple rule based part of speech tagger”, Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.

[20] Brill E (1993), “Automatic grammar induction and parsing free text: A
transformation based approach”, Proceedings of 31st Meeting of the
Association of Computational Linguistics, Columbus.

[21] Brill E (1993), “Transformation based error driven parsing”, Proceedings of the
Third International Workshop on Parsing Technologies, Tilburg, The
Netherlands.

[22] Brill E (1994), “Some advances in rule based part of speech tagging”, Proceedings of The Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington.

[23] Prins R, and Van Noord G (2001), “Unsupervised Pos- Tagging Improves
Parsing Accuracy And Parsing Efficiency”, Proceedings of the International
Workshop on Parsing Technologies.

[24] Pop M (1996), “Unsupervised Part-of-speech Tagging”, Department of


Computer Science, Johns Hopkins University.

[25] Brill E (1997), “Unsupervised Learning of Disambiguation Rules for Part of


Speech Tagging”, In Proceeding of The Natural Language Processing Using
Very Large Corpora, Boston.

[26] Kimmo Koskenniemi (1985), Compilation of automata from morphological


two-level rules. In F. Karlsson (ed.), Papers from the fifth Scandinavian
Conference of Computational Linguistics, Helsinki, pp. 143-149.

[27] Greene, B. B. & Rubin, G. M. 1971. Automatic grammatical tagging of English.


Technical Report, Brown University. Providence, RI.

[28] Francis, W.N., & Kucera, H. (1982). Frequency analysis of English usage:
Lexicon and grammar. Boston: Houghton Mifflin.

[29] Derouault A M and Merialdo B (1986), “Natural language modeling for


phoneme-to-text transcription”, IEEE Transactions on Pattern Analysis and
Machine Intelligence, PAMI-8, pp. 742-749.

[30] Jelinek, F. 1985. Self-organized language modeling for speech recognition.


Technical report, IBM T.J. Watson Research Center, Continuous Speech
Recognition Group, Yorktown Heights, NY.

[31] Kupiec J M (1992), “Robust part-of-speech tagging using a Hidden Markov


Model”, Computer Speech and Language, pp.113-118.

[32] Yahya O and Mohamed Elhadj (2004), “Statistical Part-of-Speech Tagger for
Traditional Arabic Texts”, Journal of Computer Science 5 (11): 794-800, ISSN
1549-3636.

[33] Weischedel R, Schwartz R, Pahnueci J, Meteer M and Ramshaw L (1993),


“Coping with Ambiguity and Unknown words through Probabilistic Models”.
Computational Linguistics, pp. 19(2):260-269.

[34] Ratnaparkhi Adwait (1996), “A Maximum Entropy Model for Part-Of-Speech


Tagging”, Proceedings of the Empirical Methods in Natural Language
Processing Conference (EMNLP-1996), University of Pennsylvania.

[35] Brill E (1995), “Transformation-based Error-driven Learning and Natural


Language Processing: A Case Study in Part-of-speech Tagging”. Computational
Linguistics, 1995, pp 21 (4):543-565.

[36] Ray Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Adaptive Language Modeling Using The Maximum Entropy Principle. In Proceedings of the Human Language Technology Workshop, pages 108-113. ARPA.

[37] Quinlan J R (1986), “Induction of Decision Trees”, Machine Learning, 1:81-


106.

[38] Jes´us Gim´enez and Llu´ıs M`arquez. (2004), “SVMTool: A general POS
tagger generator based on support vector machines”, Proceedings of the 4th
LREC Conference.

[39] Brants, T. (2000b). TnT – A Statistical Part-of-Speech Tagger. In Proceedings


of the Sixth Conference on Applied Natural Language Processing ANLP-2000.
Seattle, WA

[40] Smriti Singh, Kuhoo Gupta, Manish Shrivastava and Pushpak Bhattacharyya
(2006), “Morphological richness offsets resource demand – experiences in
constructing a pos tagger for Hindi”, Proceedings of the COLING/ACL 2006,
Sydney, Australia Main Conference Poster Sessions, pp. 779–786.

[41] Manish Shrivastava and Pushpak Bhattacharyya, Hindi POS Tagger Using
Naive Stemming: Harnessing Morphological Information Without Extensive
Linguistic Knowledge, International Conference on NLP (ICON08), Pune,
India, December, 2008.

[42] Dalal Aniket, Kumar Nagaraj, Uma Sawant and Sandeep Shelke (2006), “Hindi
Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach”,
Proceedings of NLPAI-2006, Machine Learning Workshop on Part Of Speech
and Chunking for Indian Languages.

[43] Karthik Kumar G, Sudheer K and Avinesh P V S (2006), “Comparative study of


various Machine Learning methods For Telugu Part of Speech tagging”,
Proceedings of NLPAI Machine Learning Workshop on Part Of Speech and
Chunking for Indian Languages.

[44] Nidhi Mishra and Amit Mishra (2011), “Part of Speech Tagging for Hindi Corpus”, International Conference on Communication Systems and Network Technologies.

[45] Pradipta Ranjan Ray, Harish V., Sudeshna Sarkar and Anupam Basu, (2003)
“Part of Speech Tagging and Local Word Grouping Techniques for Natural
Language Parsing in Hindi” , Indian Institute of Technology, Kharagpur,
INDIA 721302. www.mla.iitkgp.ernet.in/papers/hindipostagging.pdf.

[46] Sivaji Bandyopadhyay, Asif Ekbal and Debasish Halder (2006), “HMM based
POS Tagger and Rule-based Chunker for Bengali”, Proceedings of NLPAI
Machine Learning Workshop on Part Of Speech and Chunking for Indian
Languages.

[47] RamaSree, R.J and Kusuma Kumari, P (2007), “Combining Pos Taggers For
Improved Accuracy To Create Telugu Annotated Texts For Information
Retrieval”, Available at http://www.ulib.org/conference/2007/RamaSree.pdf.

[48] Sandipan Dandapat (2007), “Part Of Speech Tagging and Chunking with
Maximum Entropy Model”, Proceedings of IJCAI Workshop on Shallow
Parsing for South Asian Languages.

[49] Antony P.J and K.P. Soman. 2010. Kernel based part of speech tagger for Kannada. In Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, volume 4, pages 2139–2144, July.

[50] Manju K, Soumya S, and Sumam Mary Idicula (2009), “Development of a POS
Tagger for Malayalam - An Experience”, International Conference on Advances
in Recent Technologies in Communication and Computing, pp.709-713.

[51] Antony P J, Santhanu P Mohan and Soman K P (2010), “SVM Based Part of Speech Tagger for Malayalam”, International Conference on Recent Trends in Information, Telecommunication and Computing (ITC 2010).

[52] Arulmozhi P, Sobha L, Kumara Shanmugam. B (2004), “Parts of Speech


Tagger for Tamil”, Proceedings of the Symposium on Indian Morphology,
Phonology & Language Engineering, Indian Institute of Technology,
Kharagpur.

[53] Arulmozhi P and Sobha L (2006), “A Hybrid POS Tagger for a Relatively Free
Word Order Language”, Proceedings of MSPIL-2006, Indian Institute of
Technology, Bombay.

[54] Lakshmana Pandian S and Geetha T V (2008), “Morpheme based Language


Model for Parts-of-Speech Tagging”, POLIBITS – Research Journal on
Computer Science and Computer Engineering with applications, Volume 38,
Mexico. pp. 19-25.

[55] Vasu Renganathan,(2001),“Development of Part-of-Speech Tagger for Tamil”,


Tamil Internet 2001 Conference, Kuala Lumpur, August 26-28, 2001

[56] Ganesan M (2007), “Morph and POS Tagger for Tamil” (Software), Annamalai
University, Annamalai Nagar.

[57] Lakshmana Pandian S and Geetha T V (2009), “CRF Models for Tamil Part of
Speech Tagging and Chunking “, Proceedings of the 22nd ICCPOL.

[58] M. Selvam, A.M. Natarajan (2009), “Improvement of Rule Based Morphological Analysis and POS Tagging in Tamil Language via Projection and Induction Techniques”, International Journal of Computers, Issue 4, Volume 3, 2009.

[59] Canasai Kruengkrai, Virach Sornlertlamvanich and Hitoshi Isahara (2006), “A


Conditional Random Field Framework for Thai Morphological Analysis”,
Proceedings of the Fifth International Conference on Language Resources and
Evaluation (LREC-06), Genoa, Italy.

[60] Daelemans Walter, Zavrel J, Van den Bosch A and Van der Sloot K (2003),
“MBT: Memory Based Tagger, version 2.0, reference guide”, Technical Report
ILK 03-13, ILK Research Group, Tilburg University.

[61] Alon Itai and Erel Segal (2003), “A Corpus Based Morphological Analyzer for Unvocalized Modern Hebrew”, Department of Computer Science, Technion-Israel Institute of Technology, Haifa, Israel.

[62] John Goldsmith (2001), “Unsupervised Learning of the Morphology of a


Natural Language”, Computational Linguistics, 27(2):153–198.

[63] Asanee Kawtrakul and Chalatip Thumkanon (1997), “A statistical approach to Thai morphological analyzer”, Proceedings of the 5th Workshop on Very Large Corpora.

[64] John Lee (2008), “A Nearest-Neighbour Approach to the Automatic Analysis of


Ancient Greek Morphology”, CoNLL-2008: Proceedings of the 12th
Conference on Computational Natural Language Learning, Manchester.

[65] T.N. Vikram & Shalini R, (2007), “Development of Prototype Morphological


Analyzer for the South Indian Language of Kannada”, Lecture Notes In
Computer Science: Proceedings of the 10th international conference on Asian
digital libraries: looking back 10 years and forging new frontiers. Vol.
4822/2007, 109-116.

[66] Shambhavi B.R, Ramakanth Kumar P, Srividya K, Jyothi B J, Spoorti Kundargi, Varsha Shastri (2011), “Kannada Morphological Analyser and Generator using Trie”, International Journal of Computer Science and Network Security (IJCSNS), Vol. 11, No. 1, pp. 112-116, Jan 2011.

[67] Uma Maheswara Rao G and Parameshwari K (2010), “On the description of morphological data for morphological analyzers and generators: A case of Telugu, Tamil and Kannada”, CALTS, University of Hyderabad.

[68] K. Narayana Murthy, "Issues in the Design of a Spell Checker for


Morphologically Rich Languages", 3rd International Conference on South
Asian Languages, ICOSAL-3, 4th to 6th January 2001, University of
Hyderabad

[69] Sajib Dasgupta and Vincent Ng. Unsupervised Morphological Parsing of Bengali. In the journal of Language Resources and Evaluation, 2007, 40:3-4, pp. 311-330.

[70] Mohanty, S., Santi, P.K., Adhikary, K.P.D. 2004. Analysis and Design of Oriya
Morphological Analyser: Some Tests with OriNet. In Proceeding of symposium
on Indian Morphology, phonology and Language Engineering, IIT Kharagpur

[71] Girish Nath Jha, Muktanand Agarwal, Subash, Sudhir K Mishra, Diwakar Mishra, Manji Bhadra, Surjit K Singh. 2007. Inflectional Morphology for Sanskrit. In Proceedings of First International Symposium on Sanskrit Computational Linguistics. 46-77.

[72] Anandan P, Ranjani Parthasarathy and Geetha T.V (2002), “Morphological


Analyzer for Tamil”, ICON 2002, RCILTS-Tamil, Anna University, India.

[73] Viswanathan, S., Ramesh Kumar, S., Kumara Shanmugam, B., Arulmozi, S.
and Vijay Shanker, K. (2003). A Tamil Morphological Analyser, Proceedings of
the International Conference On Natural language processing ICON 2003,
Central Institute of Indian Languages, Mysore, India, pp. 31–39.

[74] Parameshwari K, “An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil”, Language in India, www.languageinindia.com, 11:5, May 2011, Special Volume: Problems of Parsing in Indian Languages, 2011.

[75] Vijay Sundar Ram R, Menaka S and Sobha Lalitha Devi (2010), Tamil
Morphological Analyser, In Mona Parakh (ed.) Morphological Analyser For
Indian Languages, CIIL, Mysore, pp. 1 -18.

[76] Akshar Bharat, Rajeev Sangal, S. M. Bendre, Pavan Kumar and Aishwarya,
“Unsupervised improvement of morphological analyzer for inflectionally rich
languages,” Proceedings of the NLPRS, pp. 685-692, 2001.

[77] Duraipandi (2002), “The Morphological Generator and Parsing Engine for Tamil Verb Forms”, Tamil Internet Conference 2002.

[78] Dalal Aniket, Kumar Nagaraj, Uma Sawant and Sandeep Shelke (2006), “Hindi
Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach”,
Proceedings of NLPAI-2006, Machine Learning Workshop on Part Of Speech
and Chunking for Indian Languages.

[79] K. Rajan, V. Ramalingam, M. Ganesan, S. Palanivel, B. Palaniappan,


Automatic classification of Tamil documents using vector space model and
artificial neural network, Expert Systems with Applications 36 (2009) 10914 –
10918.

[80] Naskar, S. and S. Bandyopadhyay, 2002. Use of machine translation in India: Current status. Proceeding of the 7th EAMT Workshop on Teaching Machine Translation (TMT’02), MT-Archive, Manchester, UK, pp: 23-32.

[81] R.M.K. Sinha, Jain R. and Jain A, “Translation from English to Indian languages, ANGLABHARTI Approach,” In Proceedings of Symposium on Translation Support System STRANS2001, IIT Kanpur, India, Feb 15-17, 2001.

[82] Murthy, B. K. and Deshpande, W. R., “Language technology in India: past, present and future,” 1998.

[83] Akshar Bharati, Vineet Chaitanya, Amba P Kulkarni, and Rajeev Sangal, “Anusaaraka: Machine Translation in Stages,” In Vivek: A Quarterly in Artificial Intelligence, Vol. 10, No. 3, pp. 22-25, 1997.

[84] Durgesh Rao, “Machine Translation in India: A Brief Survey,” In Proceedings of the SCALLA 2001 Conference, Bangalore, India, 2001.

[85] Murthy, B. K. and Deshpande, W. R., “Language technology in India: past, present and future,” 1998.

[86] Bharati A., R. Moona, P. Reddy, B. Sankar and D.M. Sharma, “Machine translation: The Shakti approach,” In Proceedings of the 19th International Conference on Natural Language Processing, India, pp: 1-7, Dec. 2003.

[87] Bandyopadhyay, S., “ANUBAAD, the translator for English to Indian languages,” In Proceedings of the 7th State Science and Technology Congress (SSTC’00), Calcutta, India, pp. 1-9, 2000.

[88] R. Mahesh K. Sinha and Anil Thakur, “Machine translation of bilingual Hindi-English (Hinglish) text,” In Conference Proceedings: the Tenth Machine Translation Summit, Phuket, Thailand, pp. 149-156, September 13-15, 2005.

[89] Lata Gore and Nishigandha Patil, “English To Hindi - Translation System,” In Proceedings of Symposium on Translation Support Systems STRANS-2002, IIT Kanpur, 15-17 March, 2002.

[90] G.S. Josan and G.S. Lehal, “Punjabi to Hindi machine translation system,” In Proceedings of the 22nd International Conference on Computational Linguistics, MT-Archive, Manchester, UK, pp. 157-160, Aug. 21-24, 2001.

[91] Vishal Goyal and Gurpreet Singh Lehal, “Hindi to Punjabi Machine Translation System,” Springer Berlin Heidelberg, Information Systems for Indian Languages, Communications in Computer and Information Science, Vol: 139, pp: 236-241, 2011.

[92] Prashanth Balajapally, Phanindra Pydimarri, Madhavi Ganapathiraju, N. Balakrishnan and Raj Reddy, “Multilingual Book Reader: Transliteration, Word-to-Word Translation and Full-text Translation,” In Proceedings of VALA 2006: 13th Biennial Conference and Exhibition, Melbourne, Australia, February 8-10, 2006.

[93] Ruvan Weerasinghe. 2004. A statistical machine translation approach to Sinhala-Tamil language translation. In SCALLA 2004.

[94] Vasu Renganathan (2002), “An interactive approach to development of English to Tamil machine translation system on the web”. INFITT, (TI2002).

[95] Germann, U. (2001). “Building a statistical machine translation system from


scratch: how much bang for the buck can we expect?”, In Proceedings of the
workshop on Data-driven methods in machine translation, 1–8, ACL,
Morristown, NJ, USA.

[96] Fredric C. Gey, “Prospects for Machine Translation of the Tamil Language”, in the proceedings of Tamil Internet 2002, California, USA.

[97] Chellamuthu (2001), “Russian language to Tamil machine translation system”. INFITT, (TI2001).

[98] R. Harshawardhan, Mridula Sara Augustine and K.P. Soman, Phrase based English – Tamil Translation System by Concept Labeling using Translation Memory, International Journal of Computer Applications (0975 – 8887), Volume 20, No. 3, April 2011.

[99] Saravanan S., Menon A. G., Soman K. P. (2010), English to Tamil Machine Translation System, INFITT 2010, at Coimbatore.

[100] Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin, “A statistical approach to machine translation,” In Journal of Computational Linguistics, 16(2):79-85, 1990.

[101] F.J.Och. “An Efficient method for determining bilingual word classes,” In
Proceedings of Ninth Conference of the European Chapter of the Association
for Computational Linguistics (EACL), 1999.

[102] Daniel Marcu and William Wong,“A Phrase-Based, Joint Probability Model for
Statistical Machine Translation,” In Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP-2002),
Philadelphia, PA, July 6-7, 2002.

[103] Philipp Koehn, Franz Josef Och, and Daniel Marcu, “Statistical Phrase-Based Translation,” In Proceedings of HLT/NAACL, 2003.

[104] Kenji Yamada and Kevin Knight, “A Syntax-based Statistical Translation Model,” In Proceedings of ACL 2001, pp. 523-530, 2001.

[105] J. Graehl and K. Knight, “Training tree transducers,” In Proceedings of HLT-NAACL 2004: Main Proc., pp. 105–112, Boston, Massachusetts, USA, May 2 - May 7, 2004.

[106] Melamed. “Statistical machine translation by parsing,” In the Companion


Volume to the Proc. of 42nd Annual Meeting of the Association for
Computational Linguistics, pp. 653–660, 2004.

[107] K. Imamura, H. Okuma, and E. Sumita, Practical approach to syntax-based


statistical machine translation, In Proceedings of MT Summit X, pp. 267–274,
2005.

[108] Sonja Nießen and Hermann Ney, “Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information,” In Journal of Computational Linguistics, 30(2), pp. 181–204, 2004.

[109] Maja Popovic and Hermann Ney, “Statistical Machine Translation with a Small Amount of Bilingual Training Data,” 5th LREC SALTMIL Workshop on Minority Languages, pp. 25–29, 2006.

[110] Michael Collins, Philipp Koehn, and Ivona Kucerova, “Clause Restructuring for Statistical Machine Translation,” In Proceedings of ACL, pp. 531–540, 2005.

[111] Sahar Ahmadi and Saeed Ketabi. “Translation Procedures and problems of
Color Idiomatic Expressions in English and Persian,” In the Journal of
International Social Research, Volume: 4 Issue: 17, 2011.

[112] Martine Smets, Joseph Pentheroudakis and Arul Menezes, “Translation of verbal idioms,” Microsoft Research, 2005.

[113] Breidt, E., Segond F and Valetto G., “Local grammars for the description of multi-word lexemes and their automatic recognition in texts,” In Proceedings of COMPLEX96, Budapest, 1996.

[114] P. Karageorgakis, A. Potamianos, and I. Klasinas, “Towards incorporating


language morphology into statistical machine translation systems,” in Proc.
Automatic Speech Recogn. and Underst. Workshop (ASRU), 2005.

[115] P. Koehn (2002), “Europarl: A Multilingual Corpus for Evaluation of Machine Translation,” Draft, Unpublished.

[116] Sultan, Soha (2011). Applying morphology to English-Arabic statistical machine translation. Master's Thesis Nr. 11, ETH Zurich in collaboration with Google Inc.

[117] Adria de Gispert Ramis (2006). Introducing Linguistic Knowledge into Statistical Machine Translation. Ph.D. thesis, TALP Research Center, Speech Processing Group, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya.

[118] Ann Clifton, (2010) Unsupervised Morphological Segmentation For Statistical


Machine Translation. Master of Science thesis, Simon Fraser University.

[119] Sara Stymne, (2009) Compound Processing for Phrase-Based Statistical


Machine Translation. Licentiate thesis, Linköping University, Sweden.

[120] Rabih M. Zbib (2010), Using Linguistic Knowledge in Statistical Machine


Translation, Ph.D. thesis, Massachusetts Institute Of Technology, September
2010.

[121] Lee, Y. S. (2004). Morphological analysis for statistical machine translation.


Defense Technical Information Center.

[122] Elena Irimia, Alexandru Ceausu, Dependency-based translation equivalents for


factored machine translation, 11th International Conference on Intelligent Text
Processing and Computational Linguistics - CICLing 2010

[123] Sriram Venkatapathy, Rajeev Sangal, Aravind Joshi and Karthik Gali, A Discriminative Approach for Dependency Based Statistical Machine Translation (2010).

[124] Loganathan R (2010). English-Tamil Machine Translation System. Master of


Science by Research Thesis, Amrita Vishwa Vidyapeetham, Coimbatore.

[125] Kumaran A and Tobias Kellner (2007) A Generic Framework for Machine
Transliteration. Proceedings in 30th annual international ACM-SIGIR
conference on Research and development in information retrieval, Pages 721-
722.

[126] Mohammad Afraz and Sobha L (2008), “English to Dravidian Language Machine Transliteration: A Statistical Approach Based on N-grams”, In the Proceedings of International Seminar on Malayalam and Globalization, Trivandrum, Kerala.

[127] Srinivasan Janarthanam, Sethuramalingam S and Udhyakumar Nallasamy,


Named Entity Transliteration for Cross Language Information Retrieval using
Compressed Word Format Algorithm, 2nd International ACM Workshop
Improving Non-English Web Searching (iNEWS-08), California.

[128] Vijaya M. S., Shivapratap G., Soman K. P (2010), ‘English to Tamil Trans-
literation using One Class Support Vector Machine’, International Journal of
Applied Engineering Research, Volume 5, Number 4, 641-652.

[129] Rajendran S (2006), “Parsing in Tamil –Present State of Art”, Language in


India, www.languageinindia.com, Volume 6 : 8.

[130] Sobha, L., Vijay Sundar Ram. R, (2006) "Noun Phrase Chunker for Tamil", In
Proceedings of Symposium on Modeling and Shallow Parsing of Indian
Languages, Indian Institute of Technology, Mumbai, pp 194-198.

[131] Menon A. G.; Saravanan S; Loganathan R; Soman K. P. (2009): Amrita Morph


Analyzer and Generator for Tamil: A Rule-Based Approach, TIC, Cologne,
Germany, pp. 239-243.

[132] http://en.wikipedia.org/wiki/Tamil_grammar.

[133] kiriyAvin thaRkAla thamiz akarAthi (2006), Cre-A, pp. 230-231.

[134] Lehmann Thomas (1983), “A Grammar of Modern Tamil”, Pondicherry:


Pondicherry Institute of Linguistics and Culture.

[135] http://en.wikipedia.org/wiki/Machine_learning

[136] Vapnik, V. (1998). Statistical Learning Theory. Wiley. & Sons, Inc., New York.

[137] K.P. Soman, Shyam Diwakar, V. Ajay, Insight into Data Mining, Theory and
Practice. Prentice Hall of India, Pages174-198, 2008.

[138] Dr. K.P. Soman, Ajav. V, Loganathan R., "Machine Learning with SVM and
other Kernel Methods", Prentice-Hall India, ISBN: 978-81-203-3435-9, 2009.

[139] Harris Z S (1962), String Analysis of Sentence Structure. Mouton, The Hague.

[140] Voutilainen Atro (1995), A syntax-based part-of-speech analyzer, EACL 1995.

[141] Schmid H (1994), “Probabilistic Part-Of-Speech Tagging using Decision


Trees”, Proceedings of the International Conference on new methods in
language processing, Manchester, UK.

[142] Stanfill C, and Waltz D (1986),“Toward memory-based reasoning”,


Communications of the ACM, Vol. 29, pp. 1213-1228.

[143] Jakob Elming (2008), Syntactic Reordering In Statistical Machine Translation,


PhD Thesis. Copenhagen Business School.

[144] Kishore Papineni, Salim Roukos, Todd Ward & Wei-Jing Zhu (2002). BLEU: a
method for automatic evaluation of machine translation. In Proceedings of the
40th Meeting of the Association for Computational Linguistics (ACL’02) (pp.
311–318). Philadelphia, PA.

[145] George Doddington (2002). Automatic evaluation of machine translation


quality using n-gram co-occurrence statistics. In Proceeding of the ARPA
Workshop on Human Language Technology.

[146] V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions
and reversals. Soviet Physics Doklady, 10(8), pp. 707–710, February.

[147] S. Nießen, F. J. Och, G. Leusch, and H. Ney. 2000. An evaluation tool for
machine translation: Fast evaluation for MT research. In Proc. Second Int. Conf.
on Language Resources and Evaluation, pp. 39–45, Athens, Greece, May.

[148] Christoph Tillmann, Stefan Vogel, Hermann Ney & Alex Zubiaga (1997). A
DP-based search using monotone alignments in statistical translation. In
Proceedings of the 35th Meeting of the Association for Computational
Linguistics and 8th Conference of the European Chapter of the Association for
Computational Linguistics (pp. 289–296). Somerset, New Jersey.

[149] Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul: 2006, “A Study of Translation Edit Rate with Targeted Human Annotation”. In: Proceedings of Association for Machine Translation in the Americas, pp. 223-231.

[150] http://nlp.stanford.edu/software/lex-parser.shtml

[151] Fei Xia and Michael McCord. Improving a statistical MT system with
automatically learned rewrite patterns. In Proceedings of the 20th International
Conference on Computational Linguistics, COLING ’04, pages 508–514,
Geneva, Switzerland, August 2004. Association for Computational Linguistics.

[152] Michael Collins, Philipp Koehn, and Ivona Kučerová. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 531–540, Ann Arbor, Michigan, USA, June 2005. Association for Computational Linguistics.

[153] Marta Ruiz Costa-jussà (2006). On developing novel reordering algorithms for Statistical Machine Translation. Ph.D. thesis, Speech Processing Group, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya.

[154] Marta Ruiz Costa-jussà and J. A. R. Fonollosa, “State-of-the-art word reordering approaches in statistical machine translation,” IEICE Transactions on Information and Systems, vol. 92, no. 11, pp. 2179–2185, November 2009.

[155] Ananthakrishnan Ramanathan, Pushpak Bhattacharyya, Jayprasad Hegde, Ritesh M. Shah, and Sasikumar M. 2008. Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation. In IJCNLP 2008, Hyderabad, India.

[156] Schiffman, Harold (1999) A Reference Grammar of Spoken Tamil. Cambridge


University Press.

[157] Simon Zwarts and Mark Dras. 2007. Syntax-Based Word Reordering in Phrase-
Based Statistical Machine Translation: Why Does it Work? In Proceedings of
MT Summit XI, pages 559–566.

[158] Marie-Catherine de Marneffe and Christopher D. Manning (2008), “Stanford


typed dependencies manual”.

[159] Rajendran S, (2007) Complexity of Tamil in POS tagging, Language in India,


Jan 2007.

[160] Akshar Bharati, Rajeev Sangal, Dipti Misra Sharma and Lakshmi Bai (2006),
“AnnCorra: Annotating Corpora Guidelines for POS and Chunk Annotation for
Indian Languages”, Language Technologies Research Centre IIIT, Hyderabad.

[161] http://www.au-kbc.org/research_areas/nlp/projects/postagger.html.

[162] http://www.infitt.org/ti2001/papers/vasur.pdf.

[163] http://shiva.iiit.ac.in/SPSAL2007/SPSAL-Proceedings.pdf.

[164] http://www.ldcil.org/up/conferences/pos%20tag/presentation.html.

[165] http://www.infitt.org/ti2009/papers/ganesan_m_final.pdf

[166] http://tdil.mit.gov.in/Tamil-AnnaUniversity-ChennaiJuly03.pdf

[167] Rajan K (2002), “Corpus Analysis And Tagging for Tamil”, Annamalai
University, Annamalai nagar.

[168] http://www.cs.waikato.ac.nz/~ml/weka/

[169] Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) 2004, Morphology. A Handbook on Inflection and Word Formation, Berlin and New York: Walter De Gruyter, 1893-1900.

[170] N. Ramaswami, 2001, Lexical Formatives and Word Formation Rules In Tamil.
Volume 1: 8 December 2001.

[171] Hal Daume (2006), http://nlpers.blogspot.com/2006/11/getting-started-in-sequence-labeling.html.

[172] Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and
P. Roossin, A statistical approach to machine translation, In Journal of
Computational Linguistics, 16(2):79-85, 1990.

[173] Jurafsky, Daniel, and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition. Prentice-Hall, 2009.

[174] Philipp Koehn and Kevin Knight, Knowledge Sources for Word-Level
Translation Models, In Proceedings of EMNLP, 2001.

[175] Philipp Koehn, Franz Josef Och, and Daniel Marcu, “Statistical Phrase-Based Translation,” In Proceedings of HLT/NAACL, 2003.

[176] Kenji Yamada and Kevin Knight, A Syntax-based Statistical Translation Model,
In Proceedings of ACL 2001, pp.523-530, 2001.

[177] J. Graehl and K. Knight, “Training tree transducers,” In Proceedings of HLT-NAACL 2004: Main Proc., pp. 105–112, Boston, Massachusetts, USA, May 2 - May 7, 2004.

[178] Melamed. Statistical machine translation by parsing, In the Companion Volume


to the Proc. of 42nd Annual Meeting of the Association for Computational
Linguistics, pp. 653–660, 2004.

[179] K. Imamura, H. Okuma, and E. Sumita, Practical approach to syntax-based


statistical machine translation, In Proceedings of MT Summit X, pp.267–274,
2005.

[180] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, et al., Moses: Open source toolkit for statistical machine translation, In Proceedings of ACL, Demonstration Session, 2007.

[181] F.J.Och. An Efficient method for determining bilingual word classes, In


Proceedings of Ninth Conference of the European Chapter of the Association
for Computational Linguistics (EACL), 1999.

[182] F.J.Och and H.Ney. A systematic comparison of various statistical alignment


models, In Journal of Computational Linguistics, 29(1):19-51, 2003.

[183] A.Stolcke. SRILM – an extensible language modelling toolkit, In Proceedings


of the ICSLP, 2002.

[184] Garrido Alicia, Amaia Iturraspe, Sandra Montserrat, et al. (1999). “A compiler for morphological analysers and generators based on finite state transducers”. Procesamiento del Lenguaje Natural, 25:93–98.

[185] Guido Minnen, John Carroll, and Darren Pearce. 2000. “Robust applied morphological generation.” Proceedings of the First International Natural Language Generation Conference, pages 201-208, 12-16 June.

[186] Goyal, V. and Singh Lehal, G., “Hindi Morphological Analyzer and Generator”, Emerging Trends in Engineering and Technology, ICETET '08, 2008.

[187] Madhavi Ganapathiraju and Lori Levin, 2006, “TelMore: Morphological Generator for Telugu Nouns and Verbs”. Proc. Second International Conference on Universal Digital Library, Alexandria, Egypt, Nov 17-19, 2006.

[188] Reyyan Yeniterzi and Kemal Oflazer, Syntax-to-Morphology Mapping in


Factored Phrase-Based Statistical Machine Translation from English to Turkish,
in Proceedings of ACL 2010, Uppsala, Sweden, July 2010.

[189] Hoifung Poon, Colin Cherry, and Kristina Toutanova, Unsupervised


Morphological Segmentation with Log-Linear Models, in Proceedings of
NAACL-HLT, Association for Computational Linguistics, June 2009.

[190] Jan Hajič, Barbora Vidová-Hladká: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: Proceedings of the Conference COLING-ACL '98, 1998.

PUBLICATIONS
International Journals

1. Anand Kumar M., Dhanalakshmi V., Rekha R. U., Soman K. P. and Rajendran S., A Novel Data Driven Algorithm for Tamil Morphological Generator, International Journal of Computer Applications (IJCA) - Foundation of Computer Science, 6(12):52-56, 2010.

2. Anand Kumar M., Dhanalakshmi V., Soman K. P. and Rajendran S., A Sequence Labeling Approach to Morphological Analyzer for Tamil Language, International Journal on Computer Science and Engineering (IJCSE), Vol. 02, No. 06, 2201-2208, 2010.

3. Dhanalakshmi V., Anand Kumar M., Soman K. P. and Rajendran S., Natural Language Processing Tools for Tamil Grammar Learning and Teaching, International Journal of Computer Applications (IJCA) - Foundation of Computer Science, October 2010.

4. Dhanalakshmi V, Anand Kumar M, Shivapratap G, Soman K.P and Rajendran


S. (2009), “Tamil POS Tagging using Linear Programming”, International
Journal of Recent Trends in Engineering, Vol. 1, No. 2, ISSN 1797-9617.

5. Antony P. John, Anand Kumar M., Soman K. P., A Paradigm Based Morphological Analyzer for English to Kannada using a Machine Learning Approach, Research India Publication (RIP), October 2010.

6. Poornima C, Dhanalakshmi V, Anand Kumar M, Soman K P, Rule based


Sentence Simplification for English to Tamil Machine Translation System.
International Journal of Computer Applications (IJCA) - Foundation of
Computer Science, August 2011.

7. Tirumeni, Anand Kumar M, Dhanalakshmi V and Soman K. P., An Approach to handle Idioms and Phrasal verbs in English to Tamil Machine Translation System, International Journal of Computer Applications (IJCA) (Impact Factor-0.835) - Foundation of Computer Science, July 2011.

8. Anand Kumar M, Dhanalakshmi V, Soman K.P, Factored Statistical Machine
Translation System for English to Tamil using Tamil Linguistic Tools, Journal
of Computer Science, Science publications. [Indexed by IET- ISI Thomson
Scientific Index, SCOPUS]. (Accepted for Publication).

International Conferences

1. Anand Kumar M, Dhanalakshmi V, Rekha R U, Soman K P and Rajendran S (2010), “A Novel Algorithm for Tamil Morphological Generator”, 8th International Conference on Natural Language Processing 2010 (ICON2010), IIT-Kharagpur, India. (Received the Second Best Paper award)

2. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2009), “A Novel Approach for Tamil Morphological Analyzer”, Proceedings of the 8th Tamil Internet Conference 2009, Cologne, Germany.

3. Dhanalakshmi V, Anand Kumar M, Vijaya M.S, Loganathan R, Soman K.P and Rajendran S (2008), “Tamil Part-of-Speech tagger based on SVMTool”, Proceedings of International Conference on Asian Language Processing 2008 (IALP 2008), Chiang Mai, Thailand.

4. Dhanalakshmi V, Anand Kumar M, Rekha R U, Arun Kumar C, Soman K P and Rajendran S (2009), “Morphological Analyzer for Agglutinative Languages Using Machine Learning Approaches”, Proceedings of International Conference on Advances in Recent Technologies in Communication and Computing, India. Papers are archived in the IEEE Xplore and IEEE CS Digital Library. (Indexed in SCOPUS)

5. Dhanalakshmi V, Anand Kumar M, Padmavathy P, Soman K P and Rajendran S, “Chunker for Tamil”, Proceedings of International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom 2009), India. Papers are archived in the IEEE Xplore and IEEE CS Digital Library. (Indexed in SCOPUS)

6. Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S (2009), “Postagger and Chunker for Tamil Language”, Proceedings of the 8th Tamil Internet Conference, Cologne, Germany.

7. Dhanalakshmi V, Padmavathy P, Anand Kumar M, Soman K P and Rajendran S (2009), “Chunker for Tamil using Machine Learning”, 7th International Conference on Natural Language Processing 2009 (ICON2009), IIIT Hyderabad.

8. Anand Kumar M, Dhanalakshmi V, R U Rekha, Soman K P and Rajendran S (2010), “Morphological generator for Tamil: a new data driven approach”, 9th Tamil Internet Conference, Chemmozhi Maanaadu, Coimbatore, India.

9. Dhanalakshmi V, Anand Kumar M, Rekha R U, Soman K P and Rajendran S (2010), “Grammar Teaching Tools for Tamil”, Technology for Education Conference (T4E), IIT Bombay, India. Papers are archived in the IEEE Computer Society. (Indexed in SCOPUS)

10. Rekha R U, Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2010), “A Novel Approach to Morphological Generator for Tamil”, 2nd International Conference on Data Engineering and Management (ICDEM 2010), Trichy, India. Conference proceedings published by Lecture Notes in Computer Science (LNCS), Springer Verlag-Germany. (Indexed in SCOPUS)

11. Abeera V P, Aparna S, Rekha R U, Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2010), “Morphological Analyzer for Malayalam Using Machine Learning”, 2nd International Conference on Data Engineering and Management (ICDEM 2010), Trichy, India. Conference proceedings published by Lecture Notes in Computer Science (LNCS), Springer Verlag-Germany. (Indexed in SCOPUS)

12. Kiranmai G, Mallika K, Anand Kumar M, Dhanalakshmi V and Soman K P (2010), “Morphological analyzer for Telugu using Support Vector Machine”, International Conference on Advances in Information and Communication Technologies (ICT 2010), Kochi, India. The Proceedings are published by Springer CCIS and available in the Springer Digital Library. (Indexed in SCOPUS)

13. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2011), “Morphology based factored Statistical Machine Translation system from English to Tamil”, INFITT 2011. The conference was held at the University of Pennsylvania, Philadelphia, USA during June 17-19, 2011.

14. Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S (2011), “Shallow Parser for Tamil”, INFITT 2011. The conference was held at the University of Pennsylvania, Philadelphia, USA during June 17-19, 2011.

15. Keerthana S, Dhanalakshmi V, Anand Kumar M and Soman K. P, Tamil To Hindi Machine Transliteration Using Support Vector Machines, in International Joint Conference on Advances in Signal Processing and Information Technology (SPIT 2011). The Proceedings are published by Springer and available in the Springer Digital Library.

16. Dhivya R, Dhanalakshmi V, Anand Kumar M and Soman K. P, Clause Boundary Identification For Tamil Language Using Dependency Parsing, in International Joint Conference on Advances in Signal Processing and Information Technology (SPIT 2011). The Proceedings are published by Springer and available in the Springer Digital Library.

The following paper has been accepted but not yet published:

1. Anand Kumar M, Dhanalakshmi V, Soman K.P and Rajendran S, English to Tamil Factored-Statistical Machine Translation using Morphological Tools, International Conference on Asian Language Processing 2011 (IALP 2011), jointly organized by the Chinese and Oriental Languages Information Processing Society (COLIPS) of Singapore and the IEEE Singapore Computer Chapter. Conference proceedings will be included in the IEEE Xplore digital library.
