Statistical Tools for Linguists

        Cohan Sujay Carlos
           Aiaioo Labs
           Bangalore
Text Analysis and Statistical Methods

 • Motivation
 • Statistics and Probabilities
 • Application to Corpus Linguistics
Motivation
• Human Development is all about Tools
  – Describe the world
  – Explain the world
  – Solve problems in the world
• Some of these tools
  – Language
  – Algorithms
  – Statistics and Probabilities
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1000 teachers, 100 students per
   class, and 3 years of teaching per student

   – 12,000 years
 • If we had 100,000 teachers

   – 120 years
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1 teacher, 10 students per class,
   and 3 years of teaching per student,
 • and then each student teaches 10 more students

   – about 30 years
 • We could turn the whole world literate in

   – about 34 years
Motivation – Algorithms for Education Policy


 Difference:

 Policy 1 is O(n) time
 Policy 2 is O(log n) time
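
A quick back-of-the-envelope check in Python makes the difference concrete. This is a sketch using assumed numbers from the slides (400 million learners, 3-year teaching cycles); the function names are ours.

```python
POPULATION = 400_000_000   # assumption from the slides: 300-400 million
YEARS_PER_CYCLE = 3

def policy_1_years(teachers, class_size=100):
    # Fixed teaching capacity: time grows linearly with the population.
    taught_per_cycle = teachers * class_size
    return (POPULATION / taught_per_cycle) * YEARS_PER_CYCLE

def policy_2_years(class_size=10):
    # Every graduate teaches a class in turn, so capacity multiplies
    # each cycle and the number of cycles grows only logarithmically.
    literate, years = 1, 0
    while literate < POPULATION:
        literate *= class_size
        years += YEARS_PER_CYCLE
    return years

print(policy_1_years(1_000))     # 12000.0 years
print(policy_1_years(100_000))   # 120.0 years
print(policy_2_years())          # 27 years (the slides' "about 30")
```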
Motivation – Statistics for Linguists

 We have shown that:
 Using a tool from computer science, we can
 solve a problem in quite another area.

                   SIMILARLY

 Linguists will find statistics to be a handy tool
 to better understand languages.
Applications of Statistics to Linguistics


     • How can statistics be useful?
     • Can probabilities be useful?
Introduction to Aiaioo Labs
• Focus on Text Analysis, NLP, ML, AI
• Applications to business problems
• Team consists of
  – Researchers
     • Cohan
     • Madhulika
     • Sumukh
  – Linguists
  – Engineers
  – Marketing
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – WordNet
  – Google terabyte corpus (with annotations?)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – WordNet (set of rules about the real world)
  – Google terabyte corpus (real world)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
    – WordNet (not countable)
    – Google terabyte corpus (countable)



For training machine learning algorithms, the latter might be more valuable,
simply because it is possible to tally up evidence in it by counting.

Of course I am simplifying things a lot and I don’t mean that the former is not
valuable at all.
Approach to corpus construction
 So if you are constructing a corpus on
 which machine learning methods might
 be applied, construct your corpus so that
 you retain as many examples of surface
 forms as possible.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Problem: Spelling

1.   Field
2.   Wield
3.   Shield
4.   Deceive
5.   Receive
6.   Ceiling

                       Courtesy of http://norvig.com/chomsky.html
Rule-based Approach


    “I before E except after C”

-- an example of a linguistic insight




                 Courtesy of http://norvig.com/chomsky.html
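
As a concrete (hypothetical) illustration, the rule can be encoded in a few lines; the function name and word list are just for this sketch:

```python
import re

def violates_rule(word):
    # "I before E except after C": flag 'ei' not preceded by 'c',
    # and 'ie' immediately after 'c'.
    word = word.lower()
    return bool(re.search(r'(?<!c)ei', word) or 'cie' in word)

for w in ['field', 'wield', 'shield', 'deceive', 'receive', 'ceiling']:
    print(w, violates_rule(w))   # all False: the rule fits this list
```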
Probabilistic Statistical Model
• Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’
  and ‘cei’ in a large corpus

P(IE) = 0.0177
P(EI) = 0.0046
P(CIE) = 0.0014
P(CEI) = 0.0005

                         Courtesy of http://norvig.com/chomsky.html
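
A sketch of how such counts might be gathered, assuming `big.txt` is a large plain-text corpus you supply (e.g. the one used on Norvig's page):

```python
from collections import Counter

def sequence_prob(corpus, seq):
    # Relative frequency of `seq` among all letter sequences of the
    # same length, matching the estimation formula shown below.
    n = len(seq)
    counts = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    return counts[seq] / sum(counts.values())

corpus = open('big.txt').read().lower()   # assumed corpus file
for seq in ['ie', 'ei', 'cie', 'cei']:
    print(seq, sequence_prob(corpus, seq))
```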
Words where ie occurs after c
•   science
•   society
•   ancient
•   species




                   Courtesy of http://norvig.com/chomsky.html
But you can go back to a Rule-based
             Approach


  “I before E except after C only if C is not
               preceded by an S”

    -- an example of a linguistic insight


                      Courtesy of http://norvig.com/chomsky.html
What is a probability?

• A number between 0 and 1
• The sum of the probabilities over all outcomes is 1

Heads                  Tails




• P(heads) = 0.5
• P(tails) = 0.5
Estimation of P(IE)



P(“IE”) = C(“IE”) / C(all two letter sequences in my corpus)
What is Estimation?



P(“UN”) = C(“UN”) / C(all words in my corpus)
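
The same relative-frequency estimate at the word level might look like this (the file name and whitespace tokenization are assumptions for the sketch):

```python
tokens = open('corpus.txt').read().split()   # assumed corpus file
p_un = tokens.count('UN') / len(tokens)      # C("UN") / C(all words)
print(p_un)
```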
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong



      It depends on whether you care about
      precision or recall or both.
What are Precision and Recall



 Classification metrics used to compare ML
 algorithms.
Classification Metrics

 Politics: “The UN Security Council adopts its first clear condemnation of …”

 Sports: “Warwickshire's Clarke equalled the first-class record of seven …”

    How do you compare two ML algorithms?
Classification Quality Metrics
               Point of view = Politics

                            Gold - Politics       Gold - Sports

Observed - Politics   TP (True Positive)      FP (False Positive)


Observed - Sports     FN (False Negative)     TN (True Negative)
Classification Quality Metrics
               Point of view = Sports

                           Gold - Politics       Gold - Sports

Observed - Politics   TN (True Negative)     FN (False Negative)


Observed - Sports     FP (False Positive)    TP (True Positive)
Classification Quality Metric - Accuracy
                   Point of view = Sports

                               Gold - Politics      Gold - Sports

    Observed - Politics   TN (True Negative)     FN (False Negative)


    Observed - Sports     FP (False Positive)    TP (True Positive)

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
Metrics for Measuring Classification Quality
                      Point of View – Class 1

                              Gold Class 1               Gold Class 2

   Observed Class 1     TP                          FP


   Observed Class 2     FN                          TN




                Precision and recall are great metrics for highly unbalanced
                corpora, where plain accuracy can be misleading!
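
Read off the Class-1 matrix, the standard definitions are Precision = TP / (TP + FP) and Recall = TP / (TP + FN). A minimal sketch:

```python
def precision(tp, fp):
    # Of everything observed as Class 1, how much really was Class 1?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything that really was Class 1, how much did we observe?
    return tp / (tp + fn)
```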
Metrics for Measuring Classification Quality




  F-Score = the harmonic mean of Precision and Recall
F-Score Generalized


 F = 1 / ( α · (1/P) + (1 − α) · (1/R) )

 With α = 1/2 this reduces to the harmonic mean of P and R (the F1 score).
Precision, Recall, Average, F-Score

                 Precision          Recall         Average           F-Score

 Classifier 1   50%              50%             50%                50%


 Classifier 2   30%              70%             50%                42%


 Classifier 3   10%              90%             50%                18%




                 What sort of classifier fares worst? The more lopsided the
                 precision/recall trade-off, the lower the F-score.
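
A quick check of the table's numbers using the generalized formula with α = 1/2 (a sketch; the function name is ours):

```python
def f_score(p, r, alpha=0.5):
    # Generalized F-score; alpha = 0.5 is the harmonic mean (F1).
    return 1.0 / (alpha / p + (1 - alpha) / r)

for p, r in [(0.5, 0.5), (0.3, 0.7), (0.1, 0.9)]:
    print(p, r, round(f_score(p, r), 2))   # 0.5, 0.42, 0.18
```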
How do you annotate?
So if you are constructing a corpus for a
machine learning tool where only
precision matters, all you need is a corpus
of presumed positives marked as right or
wrong (or as the label of interest versus ‘other’).

If you need good recall as well, you will
need a corpus annotated with all the
relevant labels.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How much data should you annotate?
  • The problem: ‘named entity classification’
  • What is better?
    – 2000 words per category (each of Per, Org,
      Loc, Prod, Time)
    – 5000 words per category (each of Per, Org,
      Loc, Prod, Time)
Small Corpus – 4 Fold Cross-Validation

          Split          Train Folds         Test Fold

       First Run      1, 2, 3             4

       Second Run     2, 3, 4             1

       Third Run      3, 4, 1             2

       Fourth Run     4, 1, 2             3
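
A minimal sketch of the 4-fold loop in the table; `examples`, `train`, and `evaluate` are hypothetical placeholders for your data, learner, and metric:

```python
def cross_validate(examples, train, evaluate, k=4):
    folds = [examples[i::k] for i in range(k)]        # k disjoint folds
    scores = []
    for i in range(k):
        held_out = folds[i]
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(rest)                           # train on k-1 folds
        scores.append(evaluate(model, held_out))      # test on held-out fold
    return sum(scores) / k                            # average score
```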
Statistical significance in a paper


[Figure: whether a difference between systems is significant depends on the
size of the estimate relative to its variance]

    Remember to take Inter-Annotator Agreement into account
How much do you annotate?
So you increase the corpus size until the
error margins drop to a value that the
experimenter considers sufficient.

The smaller the error margins, the finer
the comparisons the experimenter can
make between algorithms.
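
One rough way to see why: a sketch treating accuracy as a binomial proportion (an assumption; the 1.96 factor gives an approximate 95% margin):

```python
import math

def error_margin(accuracy, n):
    # Standard error of a proportion shrinks like 1/sqrt(n).
    return 1.96 * math.sqrt(accuracy * (1 - accuracy) / n)

for n in [500, 2000, 5000]:
    print(n, round(error_margin(0.9, n), 3))   # 0.026, 0.013, 0.008
```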
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the
    corpus
Avoid Mistakes
• The problem: ‘train a classifier’
• What is better?
  – Train with all the data that you have, and
    then test on all the data that you have?
  – Train on half and test on the other half?
Avoid Mistakes
• Training a classifier on the full corpus
  and then testing on that same corpus is
  a bad idea: it is a bit like revealing the
  questions in the exam before the exam.
• A simple algorithm that can game such a
  test is a plain memorization algorithm
  that memorizes all the possible inputs
  and the corresponding outputs (see the
  sketch below).
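
A toy version of that memorization algorithm (a hypothetical class, purely for illustration):

```python
class Memorizer:
    def train(self, pairs):              # pairs of (text, label)
        self.table = dict(pairs)

    def classify(self, text):
        return self.table.get(text, 'unknown')

m = Memorizer()
m.train([('The UN Security Council adopts...', 'Politics')])
print(m.classify('The UN Security Council adopts...'))  # 'Politics'
# 100% "accuracy" on its own training data, useless on unseen text.
```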
Corpus Splits

        Split            Percentage

Training            60%

Validation          20%

Testing             20%

Total               100%
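
A minimal sketch of producing that 60/20/20 split, shuffling first so the sections are comparable (assumes `examples` is a list of annotated items):

```python
import random

def split_corpus(examples, seed=0):
    random.Random(seed).shuffle(examples)
    n = len(examples)
    return (examples[:int(0.6 * n)],              # training: 60%
            examples[int(0.6 * n):int(0.8 * n)],  # validation: 20%
            examples[int(0.8 * n):])              # testing: 20%
```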
How do you avoid mistakes?
Do not train a machine learning algorithm on the
‘testing’ section of the corpus.

During the development/tuning of the algorithm,
do not make any measurements using the
‘testing’ section, or you’re likely to ‘cheat’ on the
feature set and settings. Use the ‘validation’
section for that.

I have seen researchers claim 99.7% accuracy on
Indian language POS tagging because they failed
to keep the different sections of their corpus
sufficiently well separated.
