
NLP Unit-1 Notes

The document provides an overview of Natural Language Processing (NLP), detailing its definition, applications, key tasks, and prerequisites for learning. It discusses the structure of words, including morphological analysis, tokenization, and various techniques for understanding language components. Additionally, it highlights challenges faced in NLP, such as ambiguity and language-specific issues.

1. FINDING THE STRUCTURE OF WORDS


2. WORDS AND THEIR COMPONENTS
3. ISSUES AND CHALLENGES
4. MORPHOLOGICAL MODELS

5. FINDING THE STRUCTURE OF DOCUMENTS


6. INTRODUCTION
7. METHODS
8. COMPLEXITY OF THE APPROACHES
9. PERFORMANCES OF THE APPROACHES
NATURAL LANGUAGE PROCESSING
• NLP is a branch of AI that focuses on the interaction between
computers and human languages.
• NLP enables machines to understand, interpret, and generate human
language in a way that is meaningful and useful.
• Natural refers to human languages like English, Hindi, Telugu, etc.
These languages are used by human beings to communicate.
• Language is a structured system of communication consisting of
words, grammar, and syntax used to convey ideas, thoughts, and
information.
• Examples are spoken languages and text.
NATURAL LANGUAGE PROCESSING
•Processing is the act of analyzing, transforming, or
manipulating data.
•In NLP, processing involves working on natural language
data to extract information, understand meaning, or generate
language.
•NLP refers to methods, technologies that enable machines to
process, understand and work with human languages in a
meaningful way.
•It bridges the gap between human communication and
computer understanding.
APPLICATIONS OF NLP
1. Chatbots and virtual assistants
2. Search engines
3. Sentiment analysis
4. Recommendation systems
5. Content moderation
6. Machine translation
7. Speech/Voice recognition
8. Spam detection
9. Text summarization
10. Customer support automation
11. Document processing and analysis
12. Healthcare application
13. Autocorrect and predictive text
14. Spelling and grammar check
COMMON APPLICATIONS OF NLP
• 1. Chatbots (ChatGPT, Google Assistant) & Virtual Assistants (like Siri &
Alexa)
• 2. Search Engines (Google)
• 3. Recommendation Systems (Netflix, Amazon, YouTube, Instagram,
etc.)
• 4. Customer Support Automation (Zendesk, Intercom, Drift, Freshdesk)
• 5. Sentiment Analysis (Hootsuite, Brandwatch, MonkeyLearn)
• 6. Machine Translation (Google Translate, DeepL, Amazon Translate)
• 7. Speech Recognition (Google Assistant, Siri, Amazon Alexa, Microsoft
Azure Speech to Text)
COMMON APPLICATIONS OF NLP
• 8. Text Summarization (SMMRY, Resoomer, ChatGPT, SummarizeBot)
• 9. Autocorrect and Predictive Text (Google Keyboard, SwiftKey,
Grammarly Keyboard)
• 10. Content Moderation (Microsoft Content Moderator, hate speech
detection by IBM)
• 11. Document Processing & Analysis (Hypothesis, TextRazor, Zotero)
• 12. Spam Detection (Google Gmail, Microsoft Outlook, Mailchimp)
• 13. Healthcare Applications (IBM Watson Health, Google Health)
KEY TASKS IN NLP

• 1. Text Preprocessing
• 2. Language Understanding
• 3. Language Generation
• 4. Speech Processing
• 5. Information Retrieval
KEY TASKS IN NLP
Text Preprocessing:
Important tasks performed here include:
• Tokenization
• Stemming
• Lemmatization
• Removing stop words
KEY TASKS IN NLP
Language Understanding:
Tasks performed include:
• Syntax analysis(Parsing)
• Semantic analysis(understanding meaning)
• Sentiment analysis
KEY TASKS IN NLP
Language generation:
Tasks performed here include:
• Text generation
• Summarization
• Machine translation(translating between two different languages)
KEY TASKS IN NLP
Speech Processing:
Tasks performed include:
• Speech-to-text and text-to-speech systems
• Voice assistants like Siri and Alexa use NLP
KEY TASKS IN NLP
Information Retrieval:
It includes tasks like:
• Search engines and question-answering systems.
PREREQUISITES TO LEARN NLP
1. Programming Skills:
• Python and its libraries like NumPy and Pandas for data manipulation
• Libraries like Matplotlib and Seaborn for visualization
2. Mathematics:
• Linear algebra: matrices and vectors are used in word embeddings
and neural networks
• Probability and statistics: essential for models like Naïve Bayes and
language modeling
• Calculus: basics for understanding gradient descent and optimization
in ML
PREREQUISITES TO LEARN NLP
3. Machine Learning:
• ML algorithms like Logistic regression, decision trees, SVM
• Overfitting, training/testing splits, and evaluation metrics
• ML libraries like scikit-learn
4. Linguistic Basics:
• Syntax: sentence structure
• Semantics: meaning of words and phrases
• Morphology: word formation
• Phonology: sounds in languages
PREREQUISITES TO LEARN NLP
5. NLP-specific libraries:
• NLTK for text processing tasks
• spaCy for advanced NLP tasks
• Hugging Face Transformers for working with pretrained deep learning
models like BERT and GPT
• Gensim for topic modeling and word embeddings
PREREQUISITES TO LEARN NLP
6. Deep Learning:
• Neural network frameworks like TensorFlow and PyTorch
• Understanding architectures like RNNs, LSTMs, and Transformers
7. Understanding of Text Data:
• Ability to preprocess and clean text data (e.g., tokenization, stopword
removal, stemming, lemmatization)
• Understanding challenges in text data such as ambiguity and context
sensitivity.
1.1 FINDING THE STRUCTURE OF WORDS
• In NLP, finding the structure of words refers to understanding the
morphological (word-formation) components and syntactic (structure
and arrangement) patterns of words in order to analyze and process
text more effectively.
• Finding the structure of words involves breaking words down into their
constituent parts and identifying the relationships between those
parts. This process is known as morphological analysis, and it helps
NLP systems understand the structure of language.
• It also involves analyzing the components of words and understanding
how they function in context.
• This process includes identifying the morphological, syntactic, and
semantic structure of words.
FINDING THE STRUCTURE OF WORDS
There are several ways to find the structure of words in NLP:
1. Tokenization
2. Stemming
3. Lemmatization
4. Part of speech tagging
5. Parsing
6. Named entity recognition
7. Dependency parsing
FINDING THE STRUCTURE OF WORDS
• By finding the structure of words in text, NLP systems can perform a
wide range of tasks such as: machine translation, text classification,
sentiment analysis and information extraction.
Tokenization:
• It is nothing but splitting of text into individual words, subwords or
tokens.
Word Tokenization:
• It is the breaking of text into words
• Ex: "I am happy" → [I, am, happy]
FINDING THE STRUCTURE OF WORDS
Subword Tokenization:
• Dividing words into smaller units for better handling of rare or
unknown words
• Ex: byte pair encoding(BPE)
Character Tokenization:
• Breaking words into individual characters
• Ex: computer → [c, o, m, p, u, t, e, r]
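The word- and character-level splits above can be sketched in plain Python. This is a minimal illustration using a regex word splitter, not a production tokenizer:

```python
import re

def word_tokenize(text):
    # Naive word tokenization: grab runs of word characters. Real
    # tokenizers also handle punctuation, contractions, and Unicode.
    return re.findall(r"\w+", text)

def char_tokenize(word):
    # Character tokenization: one token per character.
    return list(word)

print(word_tokenize("I am happy"))  # ['I', 'am', 'happy']
print(char_tokenize("computer"))    # ['c', 'o', 'm', 'p', 'u', 't', 'e', 'r']
```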
FINDING THE STRUCTURE OF WORDS
Morphological Analysis:
• It is nothing but decomposing words into their morphemes.
Morpheme: the smallest unit of meaning in a language. It can be a
word like cat or a part of a word like -s in cats, which indicates plural.
Morphemes are of two types:
1. Free morphemes: can stand alone as words. Ex: run, happy
2. Bound morphemes: cannot stand alone as a word and must attach
to other morphemes. Ex: -ing in running, un- in unhappy.
FINDING THE STRUCTURE OF WORDS
Morphological Analysis:
• The -s in cats is a morpheme because it carries grammatical meaning:
it indicates plurality, i.e., more than one cat.
• Even though -s cannot stand alone as a word, it still adds meaning to
the base word cat by changing it from singular to plural.
• That makes -s a bound morpheme, because it must attach to another
word to have meaning.
• A morpheme does not have to be a full word. It just has to carry
meaning, whether lexical like cat or grammatical like -s for plural or
-ed for past tense.
FINDING THE STRUCTURE OF WORDS
Morphological Analysis:
Techniques in Morphological Analysis:
1. Stemming
2. Lemmatization
3. Affix Stripping
4. Part of Speech Tagging
5. Morphological Parsing
6. Word Embedding
7. Dependency Parsing
8. Sub-word Technique in NLP Models
FINDING THE STRUCTURE OF WORDS
Affix Stripping:
• Removing prefixes, suffixes, and infixes (if any) to analyze the
structure of the word.
Morphological Parsing:
• Parsing a word to identify its root and affixes and understand its
grammatical features.
• Ex: input: unhappiness
• Root: happy
• Prefix: un-
• Suffix: -ness
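The unhappiness example can be imitated with a toy affix stripper. The affix lists and the i → y spelling repair below are illustrative assumptions, not a real morphological analyzer:

```python
# Illustrative affix inventories (assumed for this sketch).
PREFIXES = ["un", "dis", "re"]
SUFFIXES = ["ness", "ful", "ing", "ed", "s"]

def parse(word):
    prefix = suffix = None
    for p in PREFIXES:                 # strip one prefix if present
        if word.startswith(p):
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:                 # strip one suffix if present
        if word.endswith(s):
            suffix, word = s, word[:-len(s)]
            break
    if word.endswith("i"):             # crude spelling repair: happi -> happy
        word = word[:-1] + "y"
    return {"prefix": prefix, "root": word, "suffix": suffix}

print(parse("unhappiness"))
# {'prefix': 'un', 'root': 'happy', 'suffix': 'ness'}
```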
FINDING THE STRUCTURE OF WORDS
Word embedding:
• Representing words as numerical vectors in a continuous space to
capture their structure, meaning, and relationships.
Techniques of word embedding:
1. Word2Vec:
consider the sentences:
• The cat sits on the mat.
• The dog lies on the rug.
• Using Word2Vec, words with similar meanings will have closer vector
representations.
FINDING THE STRUCTURE OF WORDS
2. FastText: it uses subword information, it is useful for
morphologically rich languages like Telugu.
3. GloVe embedding example:
• GloVe (Global Vectors for Word Representation) creates word
embeddings based on word co-occurrence in large text corpora.
• Ex: king and queen have similar vector representations.
• Relationships like king - man + woman ≈ queen can be derived:
King → [.52, .68, .12, .35, …]
Queen → [.51, .67, .11, .36, …]
Apple → [.23, .45, .67, .32, …]
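The king - man + woman ≈ queen relationship can be demonstrated with cosine similarity over toy vectors. The 4-dimensional values below are invented for illustration; real GloVe vectors are learned from co-occurrence counts and typically have 50-300 dimensions:

```python
import math

# Invented toy vectors, chosen so that the analogy works out.
vec = {
    "king":  [0.8, 0.9, 0.1, 0.3],
    "queen": [0.8, 0.1, 0.1, 0.9],
    "man":   [0.7, 0.9, 0.0, 0.2],
    "woman": [0.7, 0.1, 0.0, 0.8],
    "apple": [0.1, 0.2, 0.9, 0.1],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman should land nearest to queen.
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
nearest = max((w for w in vec if w != "king"), key=lambda w: cosine(target, vec[w]))
print(nearest)  # queen
```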
FINDING THE STRUCTURE OF WORDS
FastText embedding example:
It is similar to word2vec but it uses subword information, making it
useful for morphologically rich languages like Telugu or Hindi.
FastText understands prefixes, suffixes and word stems better than
GloVe
• Ex: apple → [.25, .45, .67, .32, …]
apples → [.26, .46, .68, .31, …]
• The two vectors stay similar because the words share most of their
subwords.
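Why apple and apples stay close can be seen by extracting FastText-style character n-grams with boundary markers. Only the subword extraction is shown here, not the learned embeddings themselves:

```python
def char_ngrams(word, n=3):
    # FastText pads each word with boundary markers < and > before
    # extracting character n-grams.
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

a, b = char_ngrams("apple"), char_ngrams("apples")
overlap = len(a & b) / len(a | b)  # Jaccard similarity of the subword sets
print(sorted(a & b))     # trigrams shared by both words
print(round(overlap, 2))
```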
FINDING THE STRUCTURE OF WORDS
BERT embedding:
Bidirectional encoder representations from transformers
generates contextual word embeddings, meaning the vector
representation changes depending on the sentence.
• Ex: Sentence 1: "He went to the bank to withdraw money"
• Sentence 2: "She sat on the bank of the river"
• In BERT, the two occurrences of bank get different vectors:
Bank in sentence 1: [.78, -.45, .32, …]
Bank in sentence 2: [.35, -.62, .29, …]
• But GloVe or FastText assign the same vector to both, regardless of
context.
FINDING THE STRUCTURE OF WORDS
Model:
• GloVe, FastText, BERT
Strengths:
• GloVe: good for word similarities but context independent
• FastText: handles spelling variations and subwords
• BERT: context-aware, understands different word senses
Example use cases:
• GloVe: sentiment analysis, topic modelling
• FastText: morphologically rich languages, noisy text
• BERT: chatbots, question answering, machine translation
FINDING THE STRUCTURE OF WORDS
Dependency parsing:
• Analyzing the syntactic structure of sentences showing how words
relate to each other.
• Ex: sentence: "She loves coding"
Dependency tree:
• loves → she (subject)
• loves → coding (object)
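The tree above can be stored as head → dependent arcs. The arcs are hard-coded here for illustration; a real parser (e.g., spaCy) would produce them automatically from the sentence:

```python
# Each arc: (head, relation, dependent), hand-written for "She loves coding".
arcs = [
    ("loves", "nsubj", "she"),     # she is the subject of loves
    ("loves", "obj", "coding"),    # coding is the object of loves
]

def dependents(head):
    # Collect the dependents of a given head word.
    return {rel: dep for h, rel, dep in arcs if h == head}

print(dependents("loves"))  # {'nsubj': 'she', 'obj': 'coding'}
```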
FINDING THE STRUCTURE OF WORDS
Sub-word techniques in NLP models:
• Sub-word segmentation methods break words into meaningful units
to handle rare or unseen words.
Techniques:
• BPE(byte pair encoding): splits words into frequent subword units.
• Ex: unhappiness->[“un”, “happy”, “ness”]
• SentencePiece: used in models like BERT and GPT for unsupervised
tokenization.
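The core BPE idea, repeatedly merging the most frequent adjacent symbol pair, can be sketched over a toy corpus. Real implementations train on large corpora and also handle word frequencies and byte-level fallback:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Start from character sequences and greedily merge the most
    # frequent adjacent pair of symbols.
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for seq in seqs:                 # apply the merge everywhere
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == (a, b):
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

merges, seqs = bpe_merges(["hug", "pug", "hugs"], 2)
print(merges)  # ['ug', 'hug']
print(seqs)    # [['hug'], ['p', 'ug'], ['hug', 's']]
```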
1.2 WORDS AND THEIR COMPONENTS
Here we need to learn about:

•Morphemes
•Typology
•Tokens
•Lexemes
Typology:
It refers to the classification of languages based on their structural and
functional properties or grammatical features.

Types of Typological Features:


1. Word Order Typology
2. Morphological Typology
3. Phonological Typology
4. Syntactic Typology
Word Order Typology:
• Languages are categorized based on the typical sequence of subject,
verb, and object in their sentences.
• For machine translation (from one language to another), the NLP
system must account for the differences between the two languages.
• Ex: converting from English to Telugu.
• Without understanding linguistic typology, translations might lose
grammatical or semantic accuracy.
Morphological Typology:
• It is the study of how words are formed such as Analytic,
Synthetic, or Polysynthetic Languages.
• Morphological Typology classifies languages based on how
words are formed using morphemes.
• It studies how languages use prefixes, suffixes, infixes or
standalone words to express grammatical relationships such
as tense, case, number and gender.
Morphological Typology
Languages can be classified into the following types based on their
morphological structures:
1. Analytic(isolating) language.
2. Synthetic language
3. Polysynthetic language

Analytic(Isolating) Language:
Words consist of a single morpheme with no affixation.
Grammatical relationships are expressed through word order and
auxiliary(supporting) words.
Ex: Mandarin Chinese
Morphological Typology
Synthetic Language:
• Words are made up of multiple morphemes(root + affixes)
• Grammatical information is embedded in the word through affixation.
• Ex: Spanish, Turkish
Morphological Typology
Polysynthetic Language:
Entire sentence can be expressed in a single word by combining several
morphemes.
Morphological Typology
Use of Morphological Typology in NLP:
1. Language modeling: this typology is used in modeling
morphologically complex languages like Turkish
2. Machine translation: this typology is useful when translating
between a morphologically simple language and a complex one
3. Speech recognition: it helps recognize speech correctly across
morphological variations
1.3 ISSUES & CHALLENGES IN FINDING THE STRUCTURE OF WORDS
1. Ambiguity
2. Morphology
3. Word order
4. Informal language
5. Out of vocabulary word
6. Named entities recognition
7. Language specific challenges
8. Domain specific challenges
Issues and Challenges in Finding the
Structure of Words:
Ambiguity:
• Ex: Bank(river/financial), Bat(cricket/animal)
• I saw the man with a telescope. (the telescope may be with me or
with the man)
• Anirudh saw her duck. (duck may be the bird, or the act of ducking
down)
• Can you pick up the ball? (it may be a request or a question about
ability)
• Anirudh told Abhishek that he was leaving. (he may be Anirudh or
Abhishek)
• Kick the bucket(to die)
Issues & Challenges in Finding the
Structure of Words:
Morphological:
• Ex: went → go
• Different languages have different ways of expressing the same
meaning.

Word-Order:
• "The dog bit the man"
• "The man bit the dog"
• The two sentences have completely opposite meanings.
Issues & Challenges in Finding the
Structure of Words
Informal Language:
• "Let's go to the market, bhai. Need some chai."
• gonna grab food u coming?
• That test was 😭😭. Teacher be like 100 marks or fail.
• Wat r u doin 2mrw?
• Bruh, that movie was lit! IDK why ppl hate it
Issues & Challenges in Finding the
Structure of Words
Out-of-Vocabulary Words:
• "Quantum supremacy is a game-changer!" (if quantum supremacy was
not in the training data, the system will not give correct output)
• "CRISPR-Cas9 is revolutionizing genomics." (NLP systems might not
work well with new or scientific terms)
• "That's a whole vibe, fr." (new slang terms such as fr, meaning "for
real", may not be recognized and may be misinterpreted)
• "That biryani was muy delicioso!" (muy delicioso is Spanish, so the
system may not process it properly)
• "Unhappiness is contagious."
Issues & Challenges in Finding the
Structure of Words
Language Specific Challenges:
1. Morphological Complexity
2. Lack of Spaces Between Words
3. Low-Resource Languages
Example (Turkish):
"ev" (house) → "evlerimden" (from my houses)
NLP Issue: Tokenization & Word Segmentation become difficult
Example (Finnish):
• "talo" (house) → "talossani" (in my house)
• NLP Issue: Many variations of a root word make vocabulary handling
complex.
Issues & Challenges in Finding the
Structure of Words
Domain Specific Challenges:
• Medical NLP challenges:
• Financial and Business NLP challenges:
• Scientific and Research NLP challenges:
• Social Media NLP challenges
• Education and e-learning NLP challenges
• Cyber Security NLP challenges
• Legal NLP challenges
Issues & Challenges in Finding the
Structure of Words
Domain Specific Challenges:
• Medical NLP challenges:
Complex medical jargon, medical acronyms and abbreviations,
context-dependent meanings like cold (temperature or illness)

• Financial and Business NLP challenges:
Financial terms & ratios, industry-specific acronyms (YoY: year over
year), market sentiment analysis.
Issues & Challenges in
Finding Structure of Words:
Scientific and Research NLP challenges:
• Highly technical vocabulary ("DSB" = Double-Strand Breaks),
• Rapidly evolving knowledge (New terms introduced frequently)
• Context-dependent understanding (Same term may mean different
things in different subfields)
• Social Media NLP challenges
• Slang & informal language ("🔥" = amazing, "chef’s kiss" = perfection)
• Code-switching & multilingual content ("That song is dope, bhai!")
• Sarcasm & irony detection ("Oh great, another update that slows my
phone…")
Issues & Challenges in Finding
Structure of Words:
Education and e-learning NLP challenges:
• Simplifying complex concepts (Adapting content for different age
groups)
• Personalized learning paths (Recommending study materials)
• Automated grading & feedback (Evaluating open-ended answers)
Issues & Challenges in Finding
Structure of Words:
Cyber Security NLP challenges
• Log file & network analysis (Parsing structured/unstructured data)
• Anomaly detection (Identifying phishing emails, fraud patterns)
• Entity linking (Matching threats to known vulnerabilities)

Legal NLP challenges


• Complex legal language & structure
• Jurisdiction-specific differences (laws vary by country)
• Ambiguity and multiple interpretations (liable vs. not liable)
1.4 MORPHOLOGICAL MODELS
Morphological models help machines understand and generate
human language efficiently.
They analyze how words are formed and how their structure affects
meaning.
Morphological models enhance NLP applications by:
• breaking down words,
• understanding their meaning, and
• improving text analysis in tasks like machine translation, search
engines, and speech processing.
Morphological Models

There are 5 morphological models:


1. Dictionary lookup
2. Finite state morphology
3. Unification based morphology
4. Functional morphology
5. Morphology induction
Morphological models:
Key Uses of Morphological Models in NLP
• Word Segmentation & Tokenization
• Part-of-Speech (POS) Tagging
• Lemmatization & Stemming
• Machine Translation
• Speech Recognition & Text-to-Speech (TTS)
• Spell Checking & Grammar Correction
• Named Entity Recognition (NER)
Dictionary lookup
•It is a morphological model which is used to analyze the
structure of words.
•Take a word → convert to canonical/base form → root word →
search that word in the dictionary → if the word is found →
retrieve information about it (part of speech, meaning, related
words, etc.)
•If no match is found, or if multiple matches are found, this
model is combined with another model to analyze the word.
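The lookup pipeline above can be sketched with an ordinary Python dict standing in for the dictionary. The two entries and the crude -s stripping rule are illustrative assumptions:

```python
# Toy lexicon (illustrative entries, not a real dictionary).
LEXICON = {
    "run": {"pos": "verb", "meaning": "move fast on foot"},
    "cat": {"pos": "noun", "meaning": "a small domesticated feline"},
}

def lookup(word):
    base = word.lower()                # canonicalize: lowercase
    if base not in LEXICON and base.endswith("s"):
        base = base[:-1]               # crude plural/3rd-person -s stripping
    # On a miss (None), a real system would hand the word over to
    # another morphological model.
    return LEXICON.get(base)

print(lookup("Cats"))   # {'pos': 'noun', 'meaning': 'a small domesticated feline'}
print(lookup("xyzzy"))  # None
```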
Finite state morphology
•It is based on finite automata and formal language theory.
•It is used for generation and recognition tasks.
•Ex: word → grace
•We use finite-state transducers to analyze such words.
•Grace, Dis-grace, Grace-ful, Dis-grace-ful
•Generation means that, given the stem as input, we generate all
the remaining related words.
•Recognition means that, given any word form as input, we
recognize its base word.
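Generation and recognition over the grace example can be imitated by enumerating the paths of a tiny prefix-stem-suffix pattern. This is a simplification of a true finite-state transducer:

```python
from itertools import product

def generate(stem, prefixes=("", "dis"), suffixes=("", "ful")):
    # Generation: walk every prefix-stem-suffix path of the automaton.
    return ["".join(parts) for parts in product(prefixes, [stem], suffixes)]

def recognize(word, stem="grace", prefixes=("", "dis"), suffixes=("", "ful")):
    # Recognition: accept the word iff some path generates it.
    return word in generate(stem, prefixes, suffixes)

print(generate("grace"))         # ['grace', 'graceful', 'disgrace', 'disgraceful']
print(recognize("disgraceful"))  # True
print(recognize("gracedis"))     # False
```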
Morphological Models
Unification based morphology:
Morphology: The study of how words are formed from smaller units called
morphemes (e.g., "un-happi-ness").
Unification: A powerful operation that combines linguistic information (like
features) by merging them if they are compatible, or signaling a conflict if
they are not.
When forming a sentence, unification ensures that the subject and verb
agree by matching the required features.
• She runs every day. (Unification successful: "she" → 3rd person singular,
"runs" → 3rd person singular)
• She run every day. (Unification fails: "she" → 3rd person singular, "run" is
base form)
Unification Based Morphology:

• Example: English Verb Conjugation (Feature Unification)


• Let’s consider the verb "run" in different forms:
• Lexical Entry for "run" (Base Form)
• Lexeme: run, Category: verb, Base Form: run
• Form: runs: Base: run, Tense: present, Person: 3rd, Number: singular
• Form: ran: Base: run, Tense: past
• Form: running: Base: run, Tense: present, Aspect: progressive
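The agreement check can be sketched as unification of feature dictionaries: merge them if they are compatible, and fail on a clash. Treating the base form run as plural-agreeing is a simplification assumed for this sketch:

```python
def unify(f1, f2):
    # Merge two feature structures; return None on a feature conflict.
    result = dict(f1)
    for key, value in f2.items():
        if key in result and result[key] != value:
            return None          # clash: unification fails
        result[key] = value
    return result

she  = {"person": "3rd", "number": "singular"}
runs = {"person": "3rd", "number": "singular", "tense": "present"}
run  = {"number": "plural", "tense": "present"}  # simplified lexical entry

print(unify(she, runs))  # merged features -> "She runs" agrees
print(unify(she, run))   # None -> "She run" fails agreement
```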
Functional Based Morphology
It is used to analyze the role of a morpheme in a word.
It is also used to analyze the contribution of the morpheme to the
overall grammatical structure of a sentence.
It considers how morphological elements (prefixes, root, suffixes)
contribute to the overall grammatical and semantic roles of words in
communication
Ex: English past tense
The suffix -ed in worked signals past tense
Functional Based Morphology:
• Function-based morphology analyzes word formation based on the
grammatical function that morphemes serve in a sentence.
• Instead of just focusing on the structure of words, it looks at how
morphemes contribute to meaning and syntax.
• Example: Past Tense Formation in English
• Consider the verb "walk" and its past tense "walked".
• Lexical Base: "walk" (root verb)
• Function of "-ed": It serves as a past tense marker.
• In function-based morphology, "-ed" is not just a suffix; it is analyzed
as a functional unit that modifies the verb to indicate past tense.
Functional Based Morphology
• Sentence Example:
• Present Tense: I walk to school.
• Past Tense: I walked to school.
• Here, the function of "-ed" is to change the tense of the verb.
Function-based morphology helps understand how morphemes
contribute to sentence structure and meaning.
Morphological induction
It is used to capture the pattern of a particular word
It involves deriving the rules or structures of word formation in a
language.
It focuses on identifying patterns in how words are formed by
analyzing their morphemes, the smallest units of meaning.
Key concepts in morphological induction:
1. Morphemes
2. Word formation rules
3. Paradigms analysis
4. Unsupervised learning
Morphological induction
Morphological induction would identify:
Ex: for the words walk, walks, walking, walked:
- The root: walk
- The suffixes: -s, -ing, -ed
- Rules: add -ed for past tense,
- add -ing for progressive aspect, etc.
Morphological induction
Morphemes: the smallest grammatical units that carry meaning.
Ex: un- in undo. –ed in walked.
Word formation rules: understanding how morphemes combine to
create words.
Ex: root + suffix
Paradigm analysis: looking at sets of related words to generalize rules
Ex: run, runs, running.
Unsupervised learning: often involves using algorithms to discover
these patterns without prior labeled data.
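The walk paradigm can be induced automatically by taking the longest common prefix of the related forms as the candidate root. This is a deliberately naive sketch of unsupervised induction:

```python
import os

def induce(forms):
    # Paradigm analysis: the longest common prefix is the candidate
    # root; what remains of each form is a candidate suffix.
    root = os.path.commonprefix(forms)
    suffixes = sorted({f[len(root):] for f in forms} - {""})
    return root, suffixes

root, suffixes = induce(["walk", "walks", "walking", "walked"])
print(root)      # walk
print(suffixes)  # ['ed', 'ing', 's']
```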
1.5 FINDING THE STRUCTURE OF DOCUMENTS
Finding The Structure of Documents:
Finding the structure of documents in NLP refers to analyzing and
understanding how information is organized and presented within a
document.
The goal is to uncover the underlying framework or hierarchy which
includes identifying sections, topics, relationships between ideas, and
the overall logical flow of content.
Finding The Structure of Documents:
Understanding the document structure is critical for tasks like:

1. Summarization
2. Information extraction
3. Content categorization
4. Question-answering
5. Search engine optimization
6. Knowledge graph building
Finding The Structure of Documents:
1. Logical Hierarchy:
Understanding the high level organization of a document such as:
-Title
-Heading/Subheadings
-Paragraphs
-Bulleted or Numbered List
Finding The Structure of Documents:
2. Semantic Organization:
Semantic organization refers to how words, phrases, or
concepts are structured based on their meanings and
relationships.
In word embeddings (like Word2Vec), words with similar
meanings are grouped closer in a vector space.
For example, "king" and "queen" are semantically related,
whereas "king" and "carrot" are not.
Finding The Structure of Documents:
3. Relationships between components:
The relationship between components refers to how different
linguistic elements (like words, phrases, or entities) are
connected based on meaning and context.
In dependency parsing, a sentence like "The cat sits on the
mat." shows relationships:
•"cat" (subject) → "sits" (verb)
•"sits" (verb) → "on" (preposition) → "mat" (object)
•This helps NLP models understand sentence structure and
meaning.
Finding The Structure of Documents:
4. Formatting & Layout:
Formatting and layout refer to how text is organized and structured
for easier processing by models.
This includes aspects like punctuation, spacing, capitalization, and
paragraph breaks.
Proper formatting helps algorithms understand sentence boundaries
and relationships between words.
The sentence "Hello, world!" has punctuation, which helps the
model recognize it as a complete thought. Without punctuation, it
would be harder to identify where one idea ends and another begins.
Finding The Structure of Documents:
4. Formatting and layout in NLP:
For documents like pdf & web pages, structure includes:
• Table of contents
• Page numbers
• Margins, fonts, spacing
❖ Paragraph Structure:
1. Example:
Paragraph 1: "NLP is exciting."
Paragraph 2: "It has many applications."
Paragraph breaks guide models in understanding the structure
and flow of the document.
Finding The Structure of Documents:
• 4. Formatting & Layout in NLP:
❖ Sentence Segmentation:
1. Example: "I love programming. It's fun!"
Proper punctuation helps models identify that these are two
separate sentences.
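The punctuation cue above can be turned into a naive regex segmenter. Real segmenters also handle abbreviations like "Dr.", ellipses, and quoted speech:

```python
import re

def split_sentences(text):
    # Split after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("I love programming. It's fun!"))
# ["I love programming.", "It's fun!"]
```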
❖ Tokenization:
1. Example: "I can't wait!"
Tokenization breaks this into smaller parts: ["I", "can't", "wait", "!"].
The layout affects how the model breaks the text into words or
phrases.
Finding The Structure of Documents:
4. Formatting & Layout in NLP:
❖ Bullet Points/Lists:
1. Example:
1. "Named Entity Recognition"
2. "Part-of-Speech Tagging"

Lists help NLP models organize information into discrete chunks,
improving comprehension.
• Each of these layout choices helps NLP models process and
understand text more effectively.
Finding The Structure of Documents:
4. Formatting & Layout in NLP:
❖ Headings/Subheadings:
1. Example:
Title: "NLP Overview"
Subheading: "Applications"
This helps the model understand the main topic vs. specific sections
within a document.
Finding The Structure of Documents:
5. Rhetorical Structure:
❖ Using language effectively to please or persuade
❖ “ask not what your country can do for you, ask what you can do for
your country”
❖ Understanding rhetorical structure helps in text summarization,
machine translation, and discourse analysis.
Finding The Structure of Documents:
5. Rhetorical Structure:
Examples:
❖ Cause-Effect:
1. "He studied hard, so he passed the exam."
NLP can recognize the cause ("studied hard") and effect ("passed
the exam") relationship.

❖ Cause and Effect:


• Text: "The storm damaged many homes, leading to widespread power
outages."
• Rhetorical Structure: The storm (cause) results in power outages
(effect).
Finding The Structure of Documents:
5. Rhetorical Structure:
❖ Contrast:
• Text: "While the economy is growing, unemployment rates remain
high."
• Rhetorical Structure: The positive aspect (economic growth) is
contrasted with the negative one (high unemployment).
❖ Contrast:
1. "Although it was raining, they went for a walk."
NLP detects the contrast between the weather condition and the
action.
Finding The Structure of Documents:
5. Rhetorical Structure:
❖ Elaboration
Text: "It's raining today, so don't forget your umbrella."
Rhetorical Structure: The first part ("It's raining today") provides the
premise, while the second part ("don't forget your umbrella") is the
conclusion or recommendation based on that premise.
❖ Elaboration:
1. "She bought a car. It’s a red Tesla Model 3."
NLP identifies the second sentence as an elaboration of the first.
Finding The Structure of Documents:
5. Rhetorical Structure:
❖ Problem and Solution:
Text: "The city faced severe traffic congestion. To address this, they
implemented a new public transportation system."
Rhetorical Structure: Traffic congestion (problem) is addressed by the
new transportation system (solution).
Finding The Structure of Documents:
5. Rhetorical Structure:
❖ Claim and Evidence:
• Text: "Eating vegetables improves health, as shown by numerous
studies linking plant-based diets to lower risks of chronic diseases."
• Rhetorical Structure: The claim ("Eating vegetables improves health")
is supported by evidence (studies linking plant-based diets to lower
disease risks).
Finding The Structure of Documents:
5. Rhetorical Structure:
❖ Generalization and Example:
Text: "Many countries have implemented renewable energy solutions.
For instance, Denmark relies heavily on wind energy."
Rhetorical Structure: The general statement (many countries use
renewable energy) is supported by a specific example (Denmark using
wind energy).
Finding The Structure of Documents:
5. Rhetorical Structure:
❖ Comparison:
• Text: "Unlike traditional medicine, alternative therapies often focus
on holistic approaches."
• Rhetorical Structure: Traditional medicine is compared with
alternative therapies, emphasizing the difference in approach.

These structures help in tasks like summarization, question answering,
and sentiment analysis in NLP by clarifying how text parts relate to each
other.
Finding The Structure of Documents:
6. Syntactic Structure:
In NLP, syntactic structure refers to the arrangement of words in a
sentence according to grammar rules.
It focuses on how words and phrases are organized to convey
meaning.
Understanding the syntactic structure helps in tasks like parsing
sentences, extracting relationships, and machine translation.
Finding The Structure of Documents:
6. Syntactic Structure:
Example:
• Sentence: "The cat chased the mouse."
• Syntactic structure: "The cat" (subject) + "chased" (verb) + "the
mouse" (object).
In NLP, identifying this structure helps computers understand sentence
components and their relationships, which is crucial for accurate language
processing.
Syntactic Structure in NLP, showing how sentence components are
organized:
Finding The Structure of Documents:

6. Syntactic Structure:
Examples:
❖ Simple Sentence:
Sentence: "She reads books."
Syntactic Structure: Subject ("She") + Verb ("reads") + Object ("books")
Finding The Structure of Documents:
6. Syntactic Structure:
❖ Compound Sentence:
Sentence: "He studied hard, and he passed the exam."
• Syntactic Structure: Independent clause 1 ("He studied hard") +
Coordinating conjunction ("and") + Independent clause 2 ("he passed
the exam").
❖ Complex Sentence:
• Sentence: "Although it was raining, she went for a walk."
• Syntactic Structure: Subordinate clause ("Although it was raining") +
Main clause ("she went for a walk").
Finding The Structure of Documents:
6. Syntactic Structure:
❖ Question Sentence:
• Sentence: "Did you finish your homework?"
• Syntactic Structure: Auxiliary verb ("Did") + Subject ("you") + Verb
("finish") + Object ("your homework").

❖ Passive Sentence:
• Sentence: "The cake was eaten by the children."
• Syntactic Structure: Subject ("The cake") + Verb ("was eaten") + Agent
("by the children").
Finding The Structure of Documents:
6. Syntactic Structure:
❖ Noun Phrase and Verb Phrase:
• Sentence: "The quick brown fox jumped over the lazy dog."
• Syntactic Structure: Noun phrase ("The quick brown fox") + Verb
phrase ("jumped over the lazy dog").

In NLP, these syntactic structures are analyzed using techniques like
dependency parsing or constituency parsing to understand how words
are related and to enable tasks like sentence generation, question
answering, or machine translation.
Finding The Structure of Documents:
Some of the common techniques used to analyze and determine the
structure of documents are:
1. Preprocessing the document
2. Analyzing document layout
3. Topic and semantic analysis
4. Sentence level analysis
5. Document embeddings
6. Discourse parsing
7. Visual features
8. Graph based representations
Finding The Structure of Documents:
1. Preprocessing a document:
Preprocessing a document in NLP involves cleaning and preparing
raw text data before further analysis.
This step is important because text data is often messy, and
preprocessing helps to standardize it for tasks like text classification,
sentiment analysis, or machine translation.
This makes the document more manageable and helps algorithms
better understand the underlying structure and meaning of the text.
Finding The Structure of Documents:
1. Preprocessing a document:
Common steps in Preprocessing the document:
1. Tokenization.
2. Stop-word Removal.
3. Stemming / Lemmatization.
4. POS Tagging.
5. Lowercasing
6. Removing punctuation/special characters
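The steps above can be sketched in a few lines of Python. This is a minimal illustration: the stop-word list is a tiny invented subset (real pipelines use larger lists, e.g. NLTK's stopwords corpus), and stemming/lemmatization and POS tagging are omitted.

```python
import re

# A small stop-word list for illustration only; real systems use
# much larger lists.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "in", "and", "of"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and remove stop words."""
    text = text.lower()                      # lowercasing
    text = re.sub(r"[^\w\s]", "", text)      # removing punctuation
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("The cat sleeps on the mat!"))
# prints ['cat', 'sleeps', 'mat']
```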
Finding The Structure of Documents:
2. Analyzing document layout:
Analyzing document layout in NLP refers to understanding the
structure and organization of a document to extract meaningful
information.
It involves identifying elements like headings, paragraphs, lists,
tables, and other formatting features that give context to the content.
Examples:
1. Headings: Identifying section titles to understand the main topics of
the document (e.g., "Introduction," "Conclusion").
2. Paragraphs: Recognizing different paragraphs to understand the flow
of ideas or arguments.
Finding The Structure of Documents:
2. Analyzing document layout:
3. Lists: Detecting bullet points or numbered lists, which may contain
important items or steps.
4. Tables: Extracting data or structured information from tables for
analysis.

By analyzing the layout, NLP systems can interpret the organization of
content, making it easier to process, summarize, or extract relevant
data.
Finding The Structure of Documents:
2. Analyzing document layout:
❖ Heading Detection: identifying headings and subheadings using
formatting ex: bold text, font size
❖ Section Segmentation: breaking the document into sections based on
structure (ex: headers, bullet points, numbered lists)
❖ Hierarchical Structure: detecting nested structures such as sections,
subsections and paragraphs.
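Heading detection of this kind can be approximated with simple heuristics. The rules below (all-caps lines, numbered headings like "2.1", short title-case lines without trailing punctuation) are illustrative assumptions, not a standard algorithm.

```python
import re

def is_heading(line):
    """Heuristic heading detector: short lines that are ALL CAPS,
    numbered ('2.1 Methods'), or title-case with no trailing
    punctuation are treated as headings."""
    line = line.strip()
    if not line or len(line.split()) > 8:   # headings are short
        return False
    if line.isupper():                      # e.g. "1. INTRODUCTION"
        return True
    if re.match(r"^\d+(\.\d+)*\s+\w", line):  # e.g. "2.1 Related Work"
        return True
    return line.istitle() and not line.endswith((".", ",", ";"))

doc = ["1. INTRODUCTION",
       "This section motivates the problem.",
       "2.1 Related Work",
       "Earlier systems used rules."]
print([l for l in doc if is_heading(l)])
# prints ['1. INTRODUCTION', '2.1 Related Work']
```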
Finding The Structure of Documents:
3. Topic and Semantic Analysis:
Both topic and semantic analysis help machines understand content
at a deeper level for tasks like summarization, question answering, or
sentiment analysis.
Topic Analysis in NLP involves identifying the main subject or theme
of a document. It helps in determining what the document is about.
Document: "Climate change is impacting global weather patterns,
causing floods and droughts."
Topic: Climate change.
Finding The Structure of Documents:
3. Topic and Semantic Analysis:
Semantic Analysis in NLP focuses on understanding the meaning of
words, phrases, or sentences, considering context.
It involves interpreting how different words relate to each other and
the overall meaning of the text.
Example:
• Sentence: "The bank was full of customers."
• Semantics: The word "bank" could refer to a financial institution or
the side of a river, and semantic analysis helps determine the correct
meaning based on context.
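A toy version of this disambiguation can score each sense of "bank" by counting overlapping context words. The cue-word lists below are invented for illustration; real systems use lexical resources or contextual embeddings.

```python
# Each sense gets a hypothetical set of cue words; the sense whose cues
# overlap most with the sentence wins.
SENSES = {
    "financial institution": {"money", "customers", "loan", "account"},
    "river side": {"river", "water", "fishing", "shore"},
}

def disambiguate(sentence):
    """Pick the sense with the largest context-word overlap."""
    words = set(sentence.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate("the bank was full of customers"))   # financial institution
print(disambiguate("we sat on the bank of the river"))  # river side
```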
Finding The Structure of Documents:
4. Sentence Level Analysis:
Sentence-level analysis in NLP involves understanding the structure,
meaning, and components of individual sentences.
It helps break down the sentence into parts (like subject, verb,
object) and interpret its meaning, sentiment, or intent.
Examples:
1. Syntactic Analysis:
1. Sentence: "The cat sleeps on the mat."
Analysis: Identifies subject ("The cat"), verb ("sleeps"), and the
prepositional phrase ("on the mat").
Finding The Structure of Documents:
4. Sentence Level Analysis:
2. Sentiment Analysis:
Sentence: "I love this movie!"
Analysis: Determines the sentiment is positive.
3. Dependency Parsing:
Sentence: "She gave him a book."
Analysis: Identifies relationships: "She" → "gave", "him" → "gave", "book" →
"gave".
Sentence-level analysis helps understand the meaning, grammar,
and relationships within a sentence for tasks like translation or text
classification.
Finding The Structure of Documents:
4. Sentence Level Analysis:
Dependency Parsing:
• Sentence: "She loves apples."
Dependency Parsing Output:
• "loves" (root verb)
• "She" (subject, depends on "loves")
• "apples" (object, depends on "loves")
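The parse above can be stored as a small set of (dependent, relation, head) triples. The nsubj/obj labels follow Universal Dependencies conventions; a real parser (e.g. spaCy) would produce such triples automatically, so this is just the data structure, not the parsing itself.

```python
# The slide's parse, as (dependent, relation, head) triples.
parse = [
    ("She",    "nsubj", "loves"),  # subject depends on the verb
    ("apples", "obj",   "loves"),  # object depends on the verb
]

def dependents_of(head, edges):
    """Return the words that attach to a given head word."""
    return [dep for dep, rel, h in edges if h == head]

print(dependents_of("loves", parse))  # prints ['She', 'apples']
```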
Finding The Structure of Documents:
4. Sentence Level Analysis:
Coreference Resolution:
It is the process of determining which mentions of a name or entity
in a document refer to the same real-world object.
It is commonly used in Natural Language Processing (NLP) to resolve
pronouns and repeated references.
• Text: "John went to the store. He bought some milk."
Resolution: "He" refers to "John."
1. Improves text understanding in AI.
2. Helps in chatbots, search engines, and summarization.
3. Used in machine translation and information extraction.
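A deliberately naive sketch of the idea: resolve "he"/"she" to the most recent capitalized name seen so far. Real coreference resolvers use learned models; this heuristic only works on simple texts like the slide's example.

```python
def resolve_pronouns(tokens):
    """Replace 'he'/'she' with the most recent capitalized name.
    (A toy heuristic: it would fail on sentence-initial words,
    multiple entities, 'it', etc.)"""
    last_name = None
    resolved = []
    for tok in tokens:
        if tok.lower() in {"he", "she"} and last_name:
            resolved.append(last_name)
        else:
            if tok[0].isupper() and tok.lower() not in {"he", "she"}:
                last_name = tok          # remember the latest name
            resolved.append(tok)
    return resolved

text = "John went to the store . He bought some milk .".split()
print(" ".join(resolve_pronouns(text)))
# prints: John went to the store . John bought some milk .
```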
Finding The Structure of Documents:
4. Sentence Level Analysis:
Semantic Role Labeling:
Semantic Role Labeling (SRL) is a Natural Language Processing (NLP)
technique that identifies the roles of words in a sentence to understand
who did what to whom, when, and where.
It assigns labels like Agent (doer), Action (verb), and Patient
(receiver) to different words.
1. Improves AI comprehension in chatbots, search engines, and
translation.
2. Helps extract meaning from sentences efficiently.
Finding The Structure of Documents:
4. Sentence Level Analysis:
Semantic Role Labeling:
• Sentence: "John gave Mary a book."
• John → Agent (Who did the action?)
• gave → Action (What happened?)
• Mary → Recipient (Who received it?)
• a book → Theme (What was given?)
Finding The Structure of Documents:
5. Document Embeddings:
1. Document embedding is a technique that converts an entire
document into a numerical vector. This vector captures the
meaning, context, and relationships between words in the
document, allowing machines to understand and compare texts
efficiently.
2. Fixed-Length Representation – Converts varying-length documents
into fixed-size numerical vectors.
3. Captures Semantic Meaning – Similar documents have similar
embeddings.
Helps in search engines, text classification, and recommendation
systems.
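A count-based sketch of the idea: each document becomes a fixed-length vector over a shared vocabulary, and cosine similarity compares them. Modern document embeddings are learned by neural models (e.g. Doc2Vec or sentence transformers); this toy version only shows the fixed-length-vector principle.

```python
import math
from collections import Counter

def embed(doc, vocab):
    """Fixed-length representation: one count per vocabulary word."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["cat", "dog", "mat", "stocks", "market"]
d1 = embed("the cat sat on the mat", vocab)
d2 = embed("a cat and a dog on a mat", vocab)
d3 = embed("the stocks market fell", vocab)

print(cosine(d1, d2) > cosine(d1, d3))  # prints True: d1 is closer to d2
```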
Finding The Structure of Documents:
6. Discourse Parsing:
Discourse parsing is the process of analyzing how different parts of a
text or conversation are connected logically.
It helps computers understand the structure and meaning of longer
pieces of text beyond individual sentences.
Understanding Relationships – It identifies how sentences or
phrases relate to each other, like cause-effect, contrast, or elaboration.
Finding The Structure of Documents:
6. Discourse Parsing:
Breaking Down Text – It splits a text into smaller meaningful parts
called discourse units.
Building a Structure – It organizes these units into a tree-like structure
showing how ideas flow.

• Text: "I was tired. So, I went to bed early."
Parsing Output:
• "I was tired" (Cause)
• "I went to bed early" (Effect)
Finding The Structure of Documents:
7. Visual features:
Visual features in understanding the structure of documents refer to
elements like formatting, layout, and typography that help organize and convey
meaning beyond just the text itself.
In NLP, visual features help understand the context, hierarchy, and emphasis
of the content.
• Visual features, combined with text analysis, help NLP systems
understand and process documents in a way that preserves both
structure and content.
• Tables and Images:
• Tables organize data, and images can provide visual context,
reinforcing the document's meaning.
Finding The Structure of Documents:
7. Visual features:
Examples:
1. Headings and Subheadings:
1. A document with bolded headings ("Introduction", "Conclusion") helps
identify the structure and flow of topics, indicating important sections or
changes in focus.
2. Bullet Points or Numbered Lists:
1. A list like:
1. Item one
2. Item two
2. This structure signals that the information is organized in a sequence,
highlighting key points or steps.
Finding The Structure of Documents:
7. Visual features:
• Text Formatting (Bold, Italics, Underline):
• Bolded or italicized text, like important or emphasized, indicates
emphasis or keywords that need special attention.

• Paragraph Indentation or Line Spacing:
• Different paragraph styles (like indented paragraphs) can indicate a
new section or change in topic.
Finding The Structure of Documents:
8. Graph based representation:
Graph-based representation in NLP involves modeling a document’s
structure as a graph, where words, sentences, or other components are
represented as nodes, and relationships between them are
represented as edges.
This representation helps understand the connections,
dependencies, and flow of information in the document.
Graph-based representations allow NLP systems to visualize
relationships and structures, making it easier to analyze meaning,
context, and dependencies in documents.
Finding The Structure of Documents:
8. Graph based representation:
• Dependency Parsing:
In a sentence like "The cat chased the mouse," a graph-based
representation would show:
• "cat" → "chased" (subject-verb relationship)
• "chased" → "mouse" (verb-object relationship)
This graph helps visualize how words depend on each other for
meaning.
Finding The Structure of Documents:
8. Graph based representation:
❖ Document Structure with Topics:
A document about climate change might be represented as a graph,
where nodes represent different topics (e.g., "carbon emissions,"
"global warming") and edges show relationships between them.
Example:
"Carbon emissions" → "global warming," "global warming" →
"temperature rise."
Finding The Structure of Documents:
❖ Co-occurrence Graph:
In topic modeling, words that frequently appear together in a
document can be connected in a graph.
For example, if "climate" and "change" appear together often, they
might form a connected node in the graph.
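A minimal co-occurrence graph can be built by counting word pairs that share a sentence. The edge weights below use whole sentences as the window; real systems often use sliding windows and filter stop words first.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences):
    """Edges weighted by how often two words appear in the same sentence."""
    edges = Counter()
    for s in sentences:
        words = sorted(set(s.lower().split()))  # unique words, stable order
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1
    return edges

docs = ["climate change impacts weather",
        "climate change causes floods",
        "floods damage crops"]
g = cooccurrence_graph(docs)
print(g[("change", "climate")])  # prints 2: the pair co-occurs in two sentences
```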
1.6 Finding the Structure of Documents: Methods
Finding the Structure of Documents :Methods

1. Rule-Based Methods (Heuristics)
2. Layout-Based Methods (Visual and Formatting Features)
3. Machine Learning-Based Methods
4. NLP-Based Structural Analysis
5. Deep Learning & Transformer Models
Finding the Structure of Documents: Methods
1. Rule-Based Methods (Heuristics)
A strict rule: "Always check every possible solution before choosing the
best one."
A heuristic: "Choose the option that looks most promising based on past
experience."
In everyday life, if you assume that expensive products are of higher
quality without researching, you’re using a heuristic.
In computer science and mathematics, heuristics are used to find
approximate solutions when exact solutions are too complex to compute.
These methods rely on predefined rules to identify structure based on
formatting, keywords, or patterns.
Finding the Structure of Documents: Methods
1. Rule-Based Methods (Heuristics):
Example: Parsing a Resume
❖ Name → First bold text at the top
❖ Education → Section with keywords like "Education" or "Degree"
❖ Experience → Section with job titles and dates

Advantages: Simple, interpretable
Disadvantages: Hard to generalize across different document formats
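The resume-parsing rules above can be sketched with a regular expression over section keywords. The keyword list here is a hypothetical minimal set; production parsers use far larger lists plus layout cues such as bold text and font size.

```python
import re

# Hypothetical section keywords for illustration only.
SECTION_HEADERS = re.compile(r"^(education|experience|skills)\b", re.I)

def segment_resume(lines):
    """Split a resume's lines into keyword-named sections."""
    sections, current = {}, "header"   # text before any keyword -> "header"
    sections[current] = []
    for line in lines:
        m = SECTION_HEADERS.match(line.strip())
        if m:
            current = m.group(1).lower()   # start a new section
            sections[current] = []
        else:
            sections[current].append(line.strip())
    return sections

resume = ["Jane Doe",
          "Education",
          "B.Tech, CSE, 2021",
          "Experience",
          "NLP Engineer, 2021-2024"]
print(segment_resume(resume)["education"])  # prints ['B.Tech, CSE, 2021']
```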
Finding the Structure of Documents: Methods
2. Layout-Based Methods (Visual & Formatting Features)
Documents often have headings, bullet points, tables, and font
variations, which provide structural clues.
Techniques Used:
❖ PDF Parsing: Extracting headers, footers, and layout using tools like
PDFMiner, Apache Tika
❖ HTML Parsing: Using BeautifulSoup or XPath for web pages
❖ OCR (Optical Character Recognition): For scanned documents (Tesseract,
Google Vision API)
Advantages: Works well for structured documents like invoices and reports.
Disadvantages: Struggles with inconsistent layouts
Finding the Structure of Documents: Methods
3. Machine Learning-Based Methods
Machine learning models classify or segment documents based on
patterns learned from labelled data.
Approaches:
❖ Supervised Learning: Training classifiers (SVM, Random Forest, BERT) on
labelled document sections
❖ Unsupervised Learning: Clustering similar sections (K-Means, LDA for topic
modelling)
Advantages: More adaptable to different document types
Disadvantages: Requires labelled training data
Finding the Structure of Documents: Methods
4. NLP-Based Structural Analysis
Documents are analyzed at different levels using NLP techniques.
Methods Used:
❖ Named Entity Recognition (NER): Extracts names, dates, locations
(Spacy, BERT-based models)
❖ Part-of-Speech (POS) Tagging: Identifies nouns, verbs, etc., to
structure content
❖ Dependency Parsing: Finds grammatical relationships between words
Finding the Structure of Documents: Methods
4. NLP-Based Structural Analysis
❖ Coreference Resolution: Links pronouns to entities for better
document understanding.
❖ Coreference resolution helps recognize that both refer to the same
person, improving entity tracking and information extraction.
❖ Ex: "John Doe signed the contract. He agreed to the terms and
conditions." Here, "He" refers to "John Doe".

Advantages: Helps extract meaning beyond simple layout detection
Disadvantages: Can struggle with ambiguous or noisy text
Finding the Structure of Documents: Methods
5. Deep Learning & Transformer Models
Advanced models like BERT, GPT, T5, LayoutLM understand the content
and layout together.
Example Applications:
❖ LayoutLM: Analyzes both text and formatting for structured documents
(e.g., invoices, forms)
❖ Document BERT (DocBERT): Extracts relationships between different parts
of a document
❖ T5 for Summarization: Learns structure implicitly for summarization and
question-answering
Advantages: Highly effective for complex and unstructured documents
Disadvantages: Requires large datasets and computational resources
Finding the Structure of Documents: Methods

Document Type → Recommended Method
❖ Structured (Invoices, Forms) → Layout-based (PDF parsing, OCR)
❖ Unstructured (Articles, Blogs) → NLP (NER, Dependency Parsing)
❖ Mixed (Legal, Financial Reports) → Machine Learning + NLP
❖ Scanned Documents → OCR + Deep Learning (LayoutLM)
1.7 Complexity of Approaches (Finding the Structure of Documents)
Complexity of Approaches
(Finding the Structure of Documents)
Approach → Best For → Challenges
1. Rule-Based (RegEx, Keywords) → Structured documents (Invoices, Resumes) → Hard to adapt, brittle
2. Layout-Based (PDF, OCR) → Scanned documents, Forms → Fails with inconsistent formats
3. ML-Based (SVM, RF, BERT) → Emails, Reports, Research papers → Needs labeled data
4. NLP-Based (NER, Parsing) → Legal, Financial, Medical documents → Struggles with ambiguity
5. Transformers (BERT, LayoutLM) → Complex PDFs, Multi-format documents → High computational cost
Complexity of Approaches
(Finding the Structure of Documents)
1. Rule-Based Methods (Heuristics):
❖ How It Works: Uses predefined patterns (e.g., regex, keyword
matching) to identify sections.
❖ Example: Extracting headings based on bold/uppercase words.
❖ Pros & Cons:
• Low computational cost (simple string matching)
• Interpretable and easy to implement
• Fails when document format changes
• Limited generalizability
❖ Best For: Fixed-format documents (resumes, invoices, structured
reports)
Complexity of Approaches
(Finding the Structure of Documents)
2. Layout-Based Methods (Visual & Formatting Features):
❖ How It Works: Uses font size, indentation, tables etc to determine
structure.
❖ Example: Extracting sections from PDFs using tools like PDFMiner or
Tesseract (OCR).
❖ Pros & Cons
• Effective for structured documents
• Works well with table-heavy documents
• Fails if formatting is inconsistent
• OCR errors can degrade performance
Best For: Scanned documents, invoices, forms, and reports
Complexity of Approaches
(Finding the Structure of Documents)
3. Machine Learning-Based Methods
❖ How It Works: Trains classifiers (SVM, Random Forest, deep learning
models) on labeled document sections.
❖ Example: Categorizing paragraphs as "Introduction," "Methods," or
"Conclusion" using logistic regression.
❖ Pros & Cons
• More adaptable than rule-based methods
• Can handle varied document types
• Needs labeled training data
• Can be computationally expensive
❖ Best For: Emails, reports, research papers, financial documents
Complexity of Approaches
(Finding the Structure of Documents)
4. NLP-Based Structural Analysis:
❖ How It Works: Uses NLP techniques like Named Entity Recognition (NER),
Dependency Parsing, and Coreference Resolution.
❖ Example: Extracting entities like "Company Name," "Date," and "Amount"
from legal documents.
❖ Pros & Cons
• Extracts semantic meaning, not just format-based
• Can process unstructured text
• Requires pre-trained models
• Struggles with ambiguous phrasing
❖ Best For: Legal documents, research articles, customer support transcripts
Complexity of Approaches
(Finding the Structure of Documents)
5. Deep Learning & Transformer Models:
❖ How It Works: Uses pre-trained models like BERT, T5, LayoutLM to
understand both content and layout.
❖ Example: LayoutLM extracts information from invoices by jointly analyzing
text and positioning.
❖ Pros & Cons
• Best generalization ability
• Handles both text & layout context
• Requires large datasets & GPUs
• Interpretability is lower than rule-based methods
❖ Best For: Complex multi-page documents, scanned PDFs, and mixed-format
reports
Complexity of Approaches
(Finding the Structure of Documents)
❖ For structured documents, rule-based & layout-based methods work
well (efficient, interpretable).
❖ For semi-structured/unstructured text, ML & NLP-based methods
provide better adaptability.
❖ For high accuracy & generalization, deep learning (BERT, LayoutLM) is
best but costly.
1.8 Performance of the Different Approaches of Finding the Structure of Documents
Performance of Different Approaches of Finding
the Structure of Documents
The performance of different approaches for finding the structure of
documents depends on various factors, including accuracy, efficiency, scalability,
and adaptability.
The Different Approaches are:
1. Rule Based
2. Template Based
3. ML
4. DL
5. Graph Based
6. Hybrid
Performance of Different Approaches of Finding the
Structure of Documents
1. Rule-Based Methods
❖ Description: Uses predefined rules and heuristics to extract structure.
❖ Pros: Fast, interpretable, and works well for well-structured
documents.
❖ Cons: Inflexible, requires manual tuning, and struggles with variations
in document formats.
❖ Performance: High precision in controlled environments but poor
generalization.
Performance of Different Approaches of Finding the
Structure of Documents
2. Template-Based Approaches
❖ Description: Uses predefined document templates to extract
structure.
❖ Pros: High accuracy for standard document types (e.g., invoices,
forms).
❖ Cons: Limited to known templates, lacks adaptability.
❖ Performance: Good for structured documents but fails with unseen
formats.
Performance of Different Approaches of Finding the
Structure of Documents
3. Machine Learning (ML) Models
❖ Description: Uses statistical models trained on labeled data.
❖ Pros: More adaptable than rule-based methods, can handle noisy or
semi-structured documents.
❖ Cons: Requires labeled training data, computationally expensive.
❖ Performance: Good for moderately complex documents, but
struggles with highly unstructured content.
Performance of Different Approaches of Finding the
Structure of Documents
4. Deep Learning & NLP-Based Approaches (e.g., BERT, LayoutLM)
❖ Description: Uses neural networks and transformer models to learn
document structures.
❖ Pros: High accuracy, can generalize across different document types,
supports complex layouts.
❖ Cons: Requires large datasets, expensive to train and deploy.
❖ Performance: Best for unstructured and complex documents, but
high computational cost.
Performance of Different Approaches of Finding the
Structure of Documents
5. Graph-Based and Structural Parsing Methods
❖ Description: Represents document structures as graphs (e.g.,
tree-based parsing).
❖ Pros: Effective for hierarchical and relational structures.
❖ Cons: Requires preprocessing and is sensitive to document variations.
❖ Performance: Good for structured and semi-structured documents
but may struggle with freeform text.
Performance of Different Approaches of Finding the
Structure of Documents
6. Hybrid Approaches
❖ Description: Combines rule-based, ML, and deep learning techniques.
❖ Pros(Advantage): Balances accuracy and efficiency, adaptable to
different document types.
❖ Cons(Disadvantage): Complex implementation and higher
computational requirements.
❖ Performance: Generally superior, as it leverages strengths from
multiple techniques.
Performance of Different Approaches of Finding
the Structure of Documents
Approach → Accuracy → Flexibility → Computational Cost → Best For
1. Rule Based → Medium → Low → Low → Structured Documents
2. Template Based → High → Low → Low → Standard Forms
3. ML Models → High → Medium → Medium → Semi-Structured Documents
4. Deep Learning → Very High → High → High → Complex, Unstructured Data
5. Graph Based → High → Medium → Medium → Hierarchical Data
6. Hybrid → Very High → High → High → Versatile Applications
SOME
IMPORTANT
CONCEPTS
BYTE PAIR ENCODING
• Byte Pair Encoding is a clever algorithm that helps computers
understand and process language more efficiently.
• It works by finding the most frequent pairs of characters
or "bytes" in a text and merging them into a new symbol.
• This process repeats until you have a set of the most common
"subword units" (like parts of words or whole words) that
can be used to represent your text.
BPE
• Example:
Let's say we have the following sentence:
"The quick brown fox jumps over the lazy fox"
1. Start with individual characters:
Our initial vocabulary is:
{"T", "h", "e", " ", "q", "u", "i", "c", "k", "b", "r", "o", "w", "n", "f", "x",
"j", "m", "p", "s", "v", "l", "a", "z", "y"}
BPE
2. Find the most frequent pair:
• One of the most frequent pairs is "e " (e followed by a space), which
occurs twice in this sentence ("The " and "the ").
3. Merge the pair:
• We add "e " to our vocabulary and replace all occurrences of "e " with
the new symbol.
4. Our vocabulary becomes:
• {"T", "h", "e", " ", "q", "u", "i", "c", "k", "b", "r", "o", "w", "n", "f",
"x", "j", "m", "p", "s", "v", "l", "a", "z", "y", "e "}
BPE
• Repeat:
• We continue this process, finding the next most frequent pair and
merging them.
• For example, "th" might be the next most frequent pair.
• We add "th" to our vocabulary and replace all occurrences of "th"
with it.

• Continue until a limit:
• We keep merging pairs until we reach a desired vocabulary size or
until no frequent pairs remain.
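The merge loop above can be implemented in a few lines of Python. Words are kept as space-separated symbols with corpus frequencies (a toy corpus here, not the slide's sentence); note that the plain string replace is a simplification, and a robust implementation would match whole symbols only.

```python
from collections import Counter

def get_pair_counts(tokens):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for word, freq in tokens.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, tokens):
    """Merge the pair into one symbol everywhere it occurs.
    (Simplified: plain string replace.)"""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in tokens.items()}

# Toy corpus: words as space-separated characters, with frequencies.
tokens = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(3):                         # perform 3 merges
    best = get_pair_counts(tokens).most_common(1)[0][0]
    tokens = merge_pair(best, tokens)
    merges.append(best)

print(merges)  # prints [('e', 's'), ('es', 't'), ('l', 'o')]
```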
BPE IS USEFUL FOR…
• Handles unknown words: BPE can break down rare or unknown
words into subword units that it already knows, making it easier for
the computer to understand them.
• Reduces vocabulary size: By using subword units, BPE can
represent a large amount of text with a smaller vocabulary, which is
more efficient for computers.
• In the real world:
BPE is used in many popular language models, like those that power
chatbots and translation systems. It helps these models understand and
generate text more effectively.
LEXEME
A lexeme is the base or fundamental form of a word, which represents a
set of related word forms.
It is the abstract unit of meaning in a language, ignoring variations in
tense, number, or case.
• The verb "run" is a lexeme. It represents all its different word forms:
run (base form)
runs (third-person singular)
ran (past tense)
running (present participle)
All these variations belong to the same lexeme "run" because they share the
same core meaning.
LEMMA
A lemma is the dictionary or base form of a word.
• It is the form under which related word variations (inflected forms) are
grouped in dictionaries.
• Example:
For the verb "go", the lemma is "go", and its different inflected forms include:
go (base form)
goes (third-person singular)
went (past tense)
gone (past participle)
going (present participle)
In simple terms, the lemma is the standard form, while a lexeme refers to the
whole group of related forms.
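Lemma lookup can be modeled as a dictionary from inflected forms to their lemma. The table below covers only the slide's examples; real lemmatizers (e.g. WordNet-based ones) use large lexicons plus POS information.

```python
# Miniature lemma dictionary: each inflected form maps to its lemma.
LEMMAS = {
    "go": "go", "goes": "go", "went": "go", "gone": "go", "going": "go",
    "run": "run", "runs": "run", "ran": "run", "running": "run",
}

def lemmatize(word):
    """Dictionary lookup; unknown words fall back to themselves."""
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in ["went", "running", "goes"]])
# prints ['go', 'run', 'go']
```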
Morphological analysis:
The process of finding all the morphemes in a word and
describing the word's properties.
Morphological analysis aims to break down the words into
their constituent parts such as: root, prefixes, suffixes, and
understand their roles and meanings.
This process is essential for various NLP tasks such as
language modeling, text analysis, and machine translation.
Key techniques used in morphological analysis:
1. Stemming
2. Lemmatization
3. Morphological parsing
4. Neural network models
5. Rule based methods
6. Machine learning models
Morphological analysis applications:
1. Information retrieval
2. Machine translation
3. Text-to-speech systems
4. Named entity recognition
5. Spell checkers and grammar checkers
Lexeme and Morpheme
Lexemes: lexeme refers to a word and all the forms of that word. Their
forms change depending on subject or tense but their core meaning remains
the same.
Ex: run, running, runs etc. are all the inflected forms of the lexeme ‘run’
The headwords in a dictionary are all lexemes.

Morphemes: these are the smallest units of meaning in a language.
Ex: "unhappiness" is built from the morphemes un- + happy + -ness.
Morpheme
Morphemes types:
1. Free morphemes: these are stand alone words like dog
2. Bound morphemes: these words cant exist independently.
Ex: -s in pens.
3. Inflectional morphemes: these are suffixes that carry grammatical
information such as tense, number, person, gender, case etc.
Stemming and Lemmatization
• Stemming reduces words to their base or root forms by removing affixes,
often resulting in non-real words.
• Ex: running → run
• better → bet
• Lemmatization reduces words to their base form (lemma) using vocabulary
and morphological analysis, resulting in real words.
• Ex: running → run
• better → good
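The contrast can be demonstrated with a crude suffix-stripping stemmer against a dictionary-based lemmatizer. Both the suffix list and the lemma table below are toy assumptions; note that this stemmer's outputs ("runn", "stud") are non-words, which is exactly the weakness lemmatization avoids.

```python
import re

def toy_stem(word):
    """Crude suffix stripping with no dictionary: can produce non-words
    (e.g. 'running' -> 'runn', 'studies' -> 'stud')."""
    return re.sub(r"(ing|ly|ies|s)$", "", word)

# Toy lemma table; real lemmatizers use full lexicons plus POS tags.
LEMMAS = {"running": "run", "better": "good", "studies": "study"}

def lemmatize(word):
    """Dictionary lookup always yields a real word (or the input itself)."""
    return LEMMAS.get(word, word)

for w in ["running", "better", "studies"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", lemmatize(w))
```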
