NATURAL LANGUAGE PROCESSING 5TH AI
UNIT III SYNTACTIC ANALYSIS
Context-Free Grammars, Grammar rules for English, Treebanks, Normal Forms for grammar –
Dependency Grammar – Syntactic Parsing, Ambiguity, Dynamic Programming parsing – Shallow
parsing – Probabilistic CFG, Probabilistic CYK, Probabilistic Lexicalized CFGs - Feature
structures, Unification of feature structures.
Context-Free Grammars
Context-free grammars (CFGs) are foundational in natural language processing (NLP) for formally
describing the structure of natural language sentences. A CFG consists of a set of production rules used to
generate all well-formed sentences in a language.
Key Elements of a Context-Free Grammar
Terminals (T): The basic symbols from which strings are formed (e.g., actual words in a
language).
Non-terminals/Variables (V): Syntactic categories or placeholders for groups of terminals (e.g.,
Sentence 'S', Noun Phrase 'NP', Verb Phrase 'VP').
Production Rules (P): Rules of the form A → β, where A is a non-terminal and β is a
sequence of terminals and/or non-terminals.
Start Symbol (S): A special non-terminal symbol from which generation begins (often the symbol
'S' for 'Sentence')
Formal Definition
A CFG is defined as a 4-tuple (V, T, P, S), where:
V: Set of non-terminals
T: Set of terminals
P: Set of production rules (A → β)
S: Start symbol (S ∈ V)
Example
A simple CFG for part of English might include:
V = {S, NP, VP, Det, N}
T = {"the", "cat", "sat"}
P = {S → NP VP, NP → Det N, Det → "the", N → "cat", VP → "sat"}
Start symbol: S
This CFG can generate the sentence "the cat sat" as a valid structure
Use in NLP
CFGs are vital for:
Describing syntactic structure (syntax trees) of natural language.
Building parsing algorithms to analyze sentence structure.
Modeling hierarchical and recursive constituents in language (like nested noun phrases)
Example:
Here is how the sentence "I can go to school" can be represented in context-free grammar (CFG)
format:
Non-terminals
- S: Sentence
- NP: Noun Phrase
- VP: Verb Phrase
- MOD: Modal
- V: Verb
- PP: Prepositional Phrase
- P: Preposition
- N: Noun
Terminals
- "I", "can", "go", "to", "school"
Production Rules
S → NP VP
NP → "I"
VP → MOD V PP
MOD → "can"
V → "go"
PP → P N
P → "to"
N → "school"
Derivation
S
→ NP VP
→ "I" VP
→ "I" MOD V PP
→ "I" "can" V PP
→ "I" "can" "go" PP
→ "I" "can" "go" P N
→ "I" "can" "go" "to" N
→ "I" "can" "go" "to" "school"
This CFG generates the sentence "I can go to school" and shows the structure required to parse it in
natural language processing applications.
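The same grammar can be tried out directly. Below is a minimal sketch using NLTK (an assumption; any CFG toolkit would do) that encodes the rules above and parses "I can go to school".

```python
# Minimal sketch of the CFG above using NLTK (assumes the nltk package is installed).
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I'
VP -> MOD V PP
MOD -> 'can'
V -> 'go'
PP -> P N
P -> 'to'
N -> 'school'
""")

parser = nltk.ChartParser(grammar)
sentence = "I can go to school".split()

for tree in parser.parse(sentence):
    tree.pretty_print()   # prints the constituency tree for the sentence
```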
Grammar rules for English
Grammar rules for English in NLP provide the structural foundation to parse and generate well-formed
sentences. Here are the key types of grammar rules and concepts used in natural language processing for
English:
1. Parts of Speech (POS)
Labeling each word as a noun, verb, adjective, adverb, etc.
Example: In "The dog runs fast", "the" is an article, "dog" is a noun, "runs" is a verb, "fast" is an
adverb.
Used for POS tagging in NLP.
2. Syntax Rules
Define the arrangement of words in a sentence.
Example: S → NP VP (A sentence can be a Noun Phrase followed by a Verb Phrase.)
In context-free grammar (CFG), these are production rules applied recursively.
3. Sentence Structure
The basic format is Subject + Verb + Object.
Example: ("Ravi" is the subject, "eats" the verb, "mangoes" the object).
NLP tools break sentences into such chunks for analysis.
4. Tense and Subject-Verb Agreement
Ensures verbs correspond correctly with the subject and tense.
Example: "He walks" (singular subject) versus "They walk" (plural subject).
5. Noun Phrases and Verb Phrases
Groups of words acting as a single noun (noun phrase) or single verb (verb phrase).
Example: "The big brown dog" (noun phrase), "is barking loudly" (verb phrase).
6. Dependency Grammar
Focuses on word-to-word relationships within a sentence (e.g., subject-verb-object
dependencies).
Helps machines understand roles and relationships.
7. Grammar Ambiguity
Some sentences can be interpreted in more than one way.
Example: "I saw the man with the telescope" (multiple possible attachments/interpretations).
Example Grammar Rules in CFG Format
S → NP VP
NP → Det N | Det Adj N | "I" | "he"
VP → V NP | V NP PP | V | MOD V NP
PP → P NP
Det → "the" | "a"
N → "cat" | "dog" | "school"
V → "chased" | "slept" | "go"
MOD → "can"
P → "to"
These rules enable NLP systems to check syntax, perform parsing, translation, sentence segmentation,
and other language tasks
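As a quick illustration of how such rules license well-formed strings, the sketch below feeds them to NLTK's sentence generator. This is a minimal sketch assuming NLTK is installed; the Adj productions ("big", "lazy") are added here as illustrative assumptions so that every non-terminal can be expanded.

```python
# Minimal sketch: enumerating strings licensed by the CFG rules above with NLTK.
import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N | 'I' | 'he'
VP -> V NP | V NP PP | V | MOD V NP
PP -> P NP
Det -> 'the' | 'a'
N -> 'cat' | 'dog' | 'school'
V -> 'chased' | 'slept' | 'go'
MOD -> 'can'
P -> 'to'
Adj -> 'big' | 'lazy'
""")

# Print the first ten strings the grammar derives from S.
for words in generate(grammar, n=10):
    print(' '.join(words))
```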
TREEBANKS
In Natural Language Processing (NLP), a Treebank is a linguistically annotated text corpus where each
sentence is paired with a syntactic or semantic structure represented as a tree. These structures typically
represent phrase structure or dependency relations that explicitly show how words in a sentence are
syntactically related.
What is a Treebank?
A Treebank is a parsed corpus where sentences are manually or semi-automatically annotated
with their syntactic trees or dependency trees.
The tree structure encodes grammatical relations and hierarchical phrase organization, such as
noun phrases (NP), verb phrases (VP), and clauses.
Treebanks are created from corpora already annotated with simpler linguistic information, such as
part-of-speech (POS) tags.
Types of Treebanks
Syntactic Treebanks: Annotate the syntactic structure of sentences, focusing on phrase
structure or dependencies.
Examples: Penn Treebank (phrase structure), Universal Dependencies (dependency trees).
Semantic Treebanks: Annotate semantic relationships and roles within sentences, extending the
syntactic annotation with meaning representation.
Examples: PropBank (annotates verbal propositions and arguments), Groningen Meaning Bank.
How Treebanks are Built
Manual annotation by expert linguists or semi-automatic methods refined by linguists.
Parsing tools generate initial trees that human annotators check and correct for accuracy.
Annotation schemes may follow specific linguistic theories or be more general.
Representations are often stored in simple bracketed text form, XML, or specialized formats.
Uses of Treebanks in NLP
Training and evaluating parsing models: Treebanks serve as "gold standard" data, enabling
supervised machine learning for syntactic parsers.
Grammar induction: Extract production rules and probabilities for statistical grammars like
probabilistic context-free grammars (PCFGs).
Linguistic research: Study syntactic phenomena, frequency of specific constructions, and test
linguistic theories.
Improving NLP systems: POS taggers, dependency parsers, semantic role labelers, and machine
translation models use treebanks extensively.
Benchmarking: Parsing accuracy is often evaluated by comparing automatic parses with gold
treebank annotation using metrics like labeled attachment score (LAS).
Examples of Notable Treebanks
Penn Treebank: Early, influential phrase-structure treebank for English.
Universal Dependencies: Multilingual project providing consistent dependency annotations.
PropBank: Focuses on semantic role annotation (who did what to whom).
Arabic Treebank, Chinese Treebank: Language-specific treebanks covering syntactic
annotation for those languages.
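A treebank sample ships with NLTK, which makes it easy to see both the gold-standard trees and the grammar rules that can be read off them. This is a minimal sketch assuming NLTK and its 'treebank' corpus data are installed (e.g., via nltk.download('treebank')).

```python
# Sketch: reading a Penn Treebank sample from NLTK and extracting grammar rules.
import nltk
from nltk.corpus import treebank

tree = treebank.parsed_sents('wsj_0001.mrg')[0]   # first annotated sentence of the sample
tree.pretty_print()                               # display the gold-standard parse tree

# Grammar induction: production rules can be read directly off the treebank trees.
for production in tree.productions()[:10]:
    print(production)
```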
NORMAL FORMS FOR GRAMMAR
Normal Forms for grammars are standardized ways to represent context-free grammars (CFGs) with
restrictions on the form of production rules. These normal forms simplify parsing algorithms and
theoretical analysis.
Chomsky Normal Form (CNF)
A context-free grammar is in Chomsky Normal Form if every production rule is of one of the
following forms:
A → BC, where A, B, C are non-terminal symbols and B, C are not the start symbol.
A → a, where a is a terminal symbol.
S → ε, where S is the start symbol and ε is the empty string (allowed only if the language
contains the empty string).
Key properties of CNF:
Each rule either produces two non-terminals or one terminal (or the empty string for the
start symbol).
Any CFG can be converted into an equivalent CNF grammar.
CNF is widely used in parsing algorithms like the CYK algorithm.
Derivations in CNF for strings of length n take exactly 2n − 1 steps, aiding parsing
efficiency.
Conversion process usually involves:
1. Eliminating null productions.
2. Eliminating unit productions.
3. Removing useless symbols.
4. Ensuring productions produce either two non-terminals or a single terminal.
5. Creating new non-terminals for terminals when they appear in longer productions.
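Whether a grammar is in CNF can be checked mechanically, since every rule must expand to either two non-terminals or a single terminal. Below is a minimal sketch of such a check using NLTK's grammar classes (an assumption; the small grammar is illustrative).

```python
# Minimal sketch: checking whether every production of a CFG is in Chomsky Normal Form.
import nltk
from nltk.grammar import Nonterminal

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'dog'
V -> 'chased'
""")

def in_cnf(production):
    """A rule is in CNF if it is A -> B C (two non-terminals) or A -> a (one terminal)."""
    rhs = production.rhs()
    if len(rhs) == 2:
        return all(isinstance(sym, Nonterminal) for sym in rhs)
    if len(rhs) == 1:
        return not isinstance(rhs[0], Nonterminal)   # a single terminal
    return False

print(all(in_cnf(p) for p in grammar.productions()))   # True for this grammar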
Greibach Normal Form (GNF)
Another important normal form where each production is of the form:
A → aα, where a is a terminal and α is a possibly empty string of non-terminals.
It ensures the leftmost symbol on the right side of each production is always a terminal.
Useful for top-down parsing and establishing certain theoretical properties.
Importance in NLP
Normal forms like CNF simplify parsing by restricting production shapes, making algorithms
more efficient and easier to implement.
They provide a foundation for efficient syntax analysis and automated parsing tools.
The CNF conversion aids in developing probabilistic grammars and facilitates parsers used in
syntactic analysis in natural language understanding systems.
DEPENDENCY GRAMMAR
Dependency Grammar in NLP is a framework that represents sentence structure based on direct
relationships between words. It models how words depend on one another, focusing on word-to-word
connections rather than hierarchical phrase structures. In this system, a sentence is viewed as a
dependency graph where words (nodes) are linked by directed edges indicating dependencies from a
"head" word to its "dependents."
Key concepts include:
The head of a sentence, usually the main verb, governs other words.
Each word (except the root/head) depends on another word, forming a directed graph or tree
structure.
Dependency relations are labeled to show grammatical functions such as subject, object, or
modifier.
The structure is generally flatter than phrase-structure grammars and well suited for languages
with flexible word order.
An example of dependency grammar can be illustrated with the sentence:
"The quick brown fox jumps over the lazy dog."
In its dependency structure:
"jumps" is the root (main verb).
"fox" is a dependent of "jumps" as the subject.
"The," "quick," and "brown" are dependents of "fox," acting as determiners and adjectives
describing the subject.
"over" is a preposition dependent on "jumps."
"dog" is a dependent of "over," the object of the preposition.
"The" and "lazy" are dependents of "dog," acting as determiner and adjective describing the
object.
This structure shows how each word relates directly to another word (its head), forming labeled
dependencies such as subject, object, or modifier. The connections represent grammatical roles directly
between words rather than constituent phrases.
This relational structure helps NLP systems analyze sentence meaning and syntax efficiently by focusing
on word-to-word dependencies.
Another example is the sentence:
"Kevin can hit the baseball with a bat."
Here "hit" is the root; "Kevin" is its subject, "can" is an auxiliary, "baseball" is its object, and "with"
heads a prepositional phrase attached to "hit", with "bat" as the object of the preposition.
SYNTACTIC PARSING
Syntactic parsing in NLP is the process of analyzing a sentence's grammatical structure according to
formal grammar rules and constructing a representation, such as a parse tree, that shows how words and
phrases are related syntactically. This process enables machines to understand the arrangement of words,
parts of speech, and grammatical relationships within the sentence, which is fundamental for many
language understanding tasks.
There are two main types of syntactic parsing:
Constituency Parsing: Breaks a sentence into nested constituents or phrases (e.g., noun phrases,
verb phrases), forming a hierarchical tree that reflects phrase structure.
Dependency Parsing: Focuses on the relationships between individual words by building a
directed graph or tree that shows which words depend on which others.
Syntactic parsing helps resolve structural ambiguities in sentences and facilitates downstream tasks like
information extraction, semantic role labeling, machine translation, and text-to-speech. It can be rule-
based or use statistical and machine learning methods due to the complexity and ambiguity of natural
language.
Example:
Here's an example illustrating syntactic parsing using a parse tree for the sentence:
"John hit the ball."
In syntactic parsing with a constituency-based grammar:
The root of the parse tree is "S" (Sentence).
"S" branches into two main constituents: "NP" (Noun Phrase) and "VP" (Verb Phrase).
The "NP" consists of the word "John," which is a noun acting as the subject.
The "VP" consists of the verb "hit," and the noun phrase "the ball" as its object.
The noun phrase "the ball" further branches into a determiner "the" and a noun "ball."
This hierarchical tree visually represents the grammatical structure, showing how words group into
phrases and how those phrases relate to each other. The parse tree reflects the syntactic structure
according to grammar rules, making relationships like subject-verb-object explicit.
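The same tree can be written in bracketed form and displayed programmatically. A minimal sketch with NLTK's Tree class (an assumption) is shown below.

```python
# Sketch: the constituency parse described above, written in bracketed form and
# displayed with NLTK's Tree class (assumes nltk is installed).
from nltk import Tree

parse = Tree.fromstring("(S (NP (N John)) (VP (V hit) (NP (Det the) (N ball))))")
parse.pretty_print()   # draws the hierarchical S -> NP VP structure as ASCII art
```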
SYNTACTIC AMBIGUITY
Syntactic ambiguity, also known as structural ambiguity, occurs in syntactic analysis when a sentence can
be parsed in multiple valid ways, leading to different interpretations of its grammatical structure. This
ambiguity arises not because of word meanings but due to the sentence’s syntax—how words and phrases
are organized and related.
Examples of Syntactic Ambiguity
Prepositional Phrase Attachment: In the sentence "I saw the man with the telescope," it is
ambiguous whether "with the telescope" modifies "saw" (meaning the instrument used to see) or
"the man" (describing which man was seen).
Attachment of Clauses: "While Don was reading the newspaper, his sister knocked on the door."
The phrase "the newspaper" could initially be interpreted as the object of "was reading," or as the
subject of a new clause ("the newspaper lay unnoticed"), causing temporary ambiguity.
Multiple Interpretations in News Headlines or Sentences: Headlines like “Kids Make
Nutritious Snacks” can be interpreted either as children producing snacks or as snacks made from
kids, showing ambiguity due to syntax.
Garden Path Sentences: Sentences that lead the reader to an initial incorrect parse requiring
reanalysis, such as "The old man the boats," where "man" is a verb, not a noun.
Syntactic ambiguity poses significant challenges for parsing algorithms because multiple parse trees
(structures) may be possible for one sentence, and disambiguation is required to find the correct meaning.
Parsers may generate a set of all possible structures (parse forests) and use semantic or statistical
information to resolve ambiguity.
Here are some classic examples of syntactic ambiguity sentences:
1. "I saw the man with the telescope."
This can mean either:
The observer used a telescope to see the man, or
The man being seen has a telescope.
2. "Kids make nutritious snacks."
This humorous example could mean:
Kids prepare nutritious snacks, or
Kids themselves are nutritious snacks.
3. "The woman held the baby in the green blanket."
Possible interpretations include:
The baby wrapped in the green blanket is being held,
The woman is using the green blanket to hold the baby, or
The woman herself is wrapped in the green blanket while holding the baby.
4. "Miners refuse to work after death."
5. Ambiguity:
Miners stop working because of a death, or
Miners continue refusing to work even after dying.
6. "John saw the man on the hill with a telescope."
Ambiguity about who has the telescope and whose location is on the hill.
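Such ambiguity shows up directly in a parser's output: the same sentence yields more than one parse tree. The sketch below, a minimal example assuming NLTK, uses a small grammar in which the PP can attach either to the verb phrase or to the noun phrase, so "I saw the man with the telescope" receives two parses.

```python
# Sketch: a toy grammar under which "I saw the man with the telescope" is ambiguous.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP | 'I'
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()

trees = list(parser.parse(sentence))
print(f"Number of parses: {len(trees)}")   # 2 -- one per PP attachment
for tree in trees:
    tree.pretty_print()
```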
DYNAMIC PROGRAMMING PARSING
Dynamic Programming parsing in NLP refers to parsing algorithms that efficiently construct parse trees
by breaking down the parsing task into overlapping subproblems and solving each subproblem only once,
storing and reusing the results to avoid redundant computations.
Key Concepts
Parsing is the process of analyzing a sentence's syntactic structure based on a grammar.
Dynamic programming parsing leverages a tabular approach, where partial parse results for
substrings are stored in a table (chart).
It reduces the exponential search space to polynomial time by reusing smaller sub-constituent
parses.
Common Dynamic Programming Parsing Algorithms
1. CKY (Cocke-Younger-Kasami) Algorithm
Works on grammars converted to Chomsky Normal Form (CNF).
Uses a bottom-up approach filling a table to recognize constituents over substrings.
Runs in O(n³ · |G|) time, where n is the sentence length and |G| is the grammar size.
Can produce all possible parse trees for ambiguous sentences.
2. Earley Parser
Works with any context-free grammar (no CNF restriction).
Combines top-down prediction, bottom-up recognition, and completion steps.
Uses dynamic programming to avoid redundant parsing of substructures.
Practical for natural language parsing due to flexibility with left recursion and ambiguity.
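The table-filling idea behind the CKY algorithm described above can be sketched in a few lines of Python. The toy CNF grammar, encoded as dictionaries, is an illustrative assumption, and the code is a recognizer only: it reports whether a parse exists rather than building trees.

```python
# Minimal CKY recognizer sketch for a toy grammar in Chomsky Normal Form.

# CNF rules: lexical (A -> word) and binary (A -> B C).
lexical = {"the": {"Det"}, "cat": {"N"}, "dog": {"N"}, "chased": {"V"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

def cky_recognize(words):
    n = len(words)
    # table[i][j] holds the non-terminals that can derive words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):                 # width of the substring
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for B in table[i][k]:
                    for C in table[k][j]:
                        table[i][j] |= binary.get((B, C), set())
    return "S" in table[0][n]

print(cky_recognize("the dog chased the cat".split()))   # True
```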
Advantages
Efficiently handles ambiguous and recursive grammars.
Avoids exponential backtracking by caching results.
Supports extraction of multiple parse trees or most probable parse in probabilistic models.
Foundation for many statistical and neural parsing algorithms.
Applications in NLP
Syntactic parsing to generate phrase structure or dependency trees.
Part-of-speech tagging and named entity recognition improvements.
Semantic parsing and machine translation.
Text understanding, question answering, and information extraction.
Shallow parsing
Shallow parsing, also known as chunking or light parsing, is an NLP technique that focuses on identifying
and extracting the main syntactic constituents or phrases from sentences without performing a full,
detailed grammatical analysis.
What is Shallow Parsing?
It segments a sentence into non-overlapping phrases or "chunks," such as noun phrases (NP), verb
phrases (VP), and prepositional phrases (PP).
Unlike full parsing, which generates a complete syntactic tree showing detailed relationships
between all words, shallow parsing provides a simpler, flatter structure highlighting key phrase
boundaries.
The goal is to capture important functional units for use in downstream NLP tasks efficiently.
How Shallow Parsing Works
Uses part-of-speech (POS) tagging as input to detect phrase boundaries.
Applies techniques like rule-based pattern matching, machine learning (e.g., Hidden Markov
Models, Conditional Random Fields, support vector machines), or neural networks to identify
chunks.
Can also include named entity recognition to extract entities like people, locations, and
organizations.
Advantages
Computationally less expensive and faster than full parsing.
Provides sufficient structure for many NLP applications without the complexity of deep parsing.
More robust to imperfect or ambiguous input.
Useful for large-scale text processing in real-time or near real-time scenarios.
Applications in NLP
Information Extraction: Extract key phrases and entities from unstructured text.
Text Summarization: Identify important sentence components.
Sentiment Analysis: Focus on relevant phrases to detect sentiment.
Machine Translation: Improve phrase-level translation quality.
Question Answering: Detect meaningful constituents for query understanding.
Example
For sentence: "The black cat sat on the mat."
Noun Phrase (NP): "The black cat"
Verb Phrase (VP): "sat"
Prepositional Phrase (PP): "on the mat"
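Chunks like these can be extracted with a simple tag-pattern chunker. Below is a minimal sketch using NLTK's RegexpParser (an assumption); it also assumes the tokenizer and POS-tagger data used by word_tokenize and pos_tag have been downloaded.

```python
# Sketch: shallow parsing (chunking) of the example sentence with an NLTK RegexpParser.
import nltk

sentence = "The black cat sat on the mat."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk grammar: an NP is an optional determiner, any adjectives, then a noun;
# a PP is a preposition followed by an NP; a VP here is just a verb.
chunk_grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>}
  PP: {<IN><NP>}
  VP: {<VB.*>}
"""
chunker = nltk.RegexpParser(chunk_grammar)
print(chunker.parse(tagged))   # flat tree with NP, VP and PP chunks
```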
PROBABILISTIC CFG
Probabilistic Context-Free Grammar (PCFG) is an extension of the standard Context-Free Grammar
(CFG) used in Natural Language Processing (NLP) that associates probabilities with each production
rule. It helps to address ambiguity by modeling the likelihood of different parse trees for a given sentence.
Definition
A PCFG is formally defined as a 5-tuple G = (N, T, S, R, P), where:
N is a set of non-terminal symbols.
T is a set of terminal symbols.
S is the start symbol.
R is a set of production rules of the form A → α, where A ∈ N and α ∈ (N ∪ T)*.
P assigns a probability to each production rule, such that the probabilities of rules
sharing the same left-hand-side non-terminal sum to 1.
How It Works
Each production rule A → α has a probability P(A → α) representing how likely that
rule is to be chosen given A.
The probability of a particular parse tree (derivation) is the product of the probabilities of the
production rules used to generate it.
PCFGs model the uncertainty and ambiguity inherent in natural language by providing a
probabilistic ranking of possible parses.
Probabilities are often learned from annotated corpora (e.g., treebanks) using maximum
likelihood estimation or more advanced machine learning methods.
Example
Consider the PCFG with productions and probabilities like:
S → NP VP [1.0]
NP → Det Noun [0.4]
NP → NP PP [0.6]
VP → Verb NP [1.0]
etc.
The probability of a parse tree is the product of the probabilities of all productions used in that tree.
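Filling the toy grammar above out into a complete PCFG and handing it to NLTK's ViterbiParser gives the most probable parse directly. The lexical rules and the second VP rule below are illustrative assumptions added so the grammar covers a full sentence; NLTK is assumed to be installed.

```python
# Sketch: a complete toy PCFG and Viterbi parsing with NLTK.
import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP            [1.0]
NP -> Det Noun        [0.4]
NP -> NP PP           [0.6]
VP -> Verb NP         [0.7]
VP -> Verb NP PP      [0.3]
PP -> P NP            [1.0]
Det -> 'the'          [1.0]
Noun -> 'dog' [0.4] | 'cat' [0.3] | 'telescope' [0.3]
Verb -> 'saw'         [1.0]
P -> 'with'           [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
sentence = "the dog saw the cat with the telescope".split()

# The parser returns the single most probable tree, with its probability.
for tree in parser.parse(sentence):
    print(tree)
    print("P(tree) =", tree.prob())
```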
Benefits in NLP
Helps resolve syntactic ambiguity by choosing the most probable parse.
Enables statistical parsing algorithms.
Provides a principled framework to incorporate corpus-derived statistical information.
Supports efficient parsing through dynamic programming algorithms like the probabilistic CYK
parser.
Probabilistic CYK
Probabilistic CYK (Cocke–Younger–Kasami) parsing is an extension of the classic CYK parsing
algorithm used for parsing sentences with Probabilistic Context-Free Grammars (PCFGs). It finds
the most probable syntactic parse tree for a sentence by systematically combining probabilities of
grammar rules.
How Probabilistic CYK Works
Assumes the grammar is in Chomsky Normal Form (CNF).
Uses a dynamic programming table where each cell represents a substring of the input sentence.
Each cell stores the probability of the best parse for that substring with respect to each non-
terminal.
The algorithm proceeds bottom-up, filling the table by combining smaller substrings:
For a substring w_i … w_j, it considers all split points k where i ≤ k < j.
For each production rule A → BC, it computes:
P(A, i, j) = max_{i ≤ k < j} [ P(A → BC) × P(B, i, k) × P(C, k+1, j) ]
It selects the split point and rule that yield the highest probability.
The final answer is the probability of the start symbol spanning the entire sentence
w_1 … w_n.
The algorithm can also reconstruct the best parse tree using backpointers.
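The recurrence above can be sketched directly in Python. The toy CNF grammar, its rule probabilities, and the example sentence are illustrative assumptions; the sketch returns only the probability of the best parse (backpointers for tree reconstruction are omitted).

```python
# Minimal probabilistic CKY sketch implementing the max-product recurrence above.
from collections import defaultdict

lexical = {            # P(A -> word)
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)], "cat": [("N", 0.5)],
    "chased": [("V", 1.0)],
}
binary = {             # P(A -> B C)
    ("Det", "N"): [("NP", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("NP", "VP"): [("S", 1.0)],
}

def pcky(words):
    n = len(words)
    best = defaultdict(dict)   # best[(i, j)][A] = highest probability for A over words[i:j]
    for i, w in enumerate(words):
        for A, p in lexical.get(w, []):
            best[(i, i + 1)][A] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                      # split point
                for B, pb in best[(i, k)].items():
                    for C, pc in best[(k, j)].items():
                        for A, rule_p in binary.get((B, C), []):
                            p = rule_p * pb * pc
                            if p > best[(i, j)].get(A, 0.0):
                                best[(i, j)][A] = p
    return best[(0, n)].get("S", 0.0)

print(pcky("the dog chased the cat".split()))   # probability of the best parse: 0.25
```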
Applications and Advantages
Resolves ambiguity by ranking multiple parses probabilistically.
Provides a principled approach for statistical parsing.
Widely used in NLP parsers trained on annotated treebanks.
Guarantees polynomial-time parsing, O(n³ × |G|), where n is the sentence length and |G| is
the grammar size.
PROBABILISTIC LEXICALIZED CFGS
Probabilistic Lexicalized Context-Free Grammars (PLCFGs) are an advanced type of grammar used in
natural language processing that combine two ideas: probabilities of rules (like in Probabilistic CFGs) and
lexical heads (important words) within phrases.
Lexicalized: Each non-terminal symbol in the grammar is paired with a head word that
represents the key lexical item of that phrase. For example, instead of just having a noun phrase
(NP), you have NP(head word), like NP(dog) or VP(run).
Probabilistic: Each production rule has an associated probability. These probabilities specify
how likely it is that a particular rule will apply in a given context.
Together, PLCFGs model not just the structure of sentences, but also how the choice of lexical
items affects the structure.
Why Lexicalization Matters
A simple PCFG treats categories like NP or VP as abstract entities without considering important
lexical details.
However, the choice of words (e.g., a particular verb or noun) strongly influences how phrases
combine.
Lexicalized rules capture dependencies like verb subcategorization (which arguments a verb
takes) and agreements.
How it Works in Parsing
Parsing with PLCFGs means finding the most probable parse tree that respects both the grammar
rules and the lexical heads.
The grammar has many more rules because each non-terminal includes the lexical head, so the
state space grows.
Algorithms based on dynamic programming, similar to probabilistic CYK, are used but with
additional bookkeeping for the heads.
Example
For sentence: "Workers dumped sacks into a bin"
The start symbol might be S(dumped) because "dumped" is the lexical head of the sentence.
Its children might be NP(workers) and VP(dumped), showing that the VP is headed by "dumped".
This detailed lexical info helps decide between competing parses by weighting parses where
heads fit together well more highly.
FEATURE STRUCTURES
Feature structures are a formal way to represent and organize linguistic information in Natural Language
Processing (NLP). They are commonly used in advanced grammar formalisms to describe properties of
linguistic elements, such as words or phrases, in a structured and flexible manner.
What Are Feature Structures?
A feature structure is essentially a set of attribute-value pairs, where each attribute (called a
feature) describes some property and is paired with a value.
Values can be atomic (like "singular" for number, or "nominative" for case) or they can
themselves be feature structures, allowing nested, hierarchical information.
Feature structures are often visualized as attribute-value matrices (AVMs) or directed graphs
with features as labeled arcs and their values as nodes.
Example
A noun phrase (NP) might have features for number, case, and gender like:
Feature Value
NUM singular
CASE nominative
GENDER feminine
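Such an attribute-value matrix can be written down directly with NLTK's FeatStruct class (an assumption), as in the minimal sketch below.

```python
# Sketch: the feature structure above as an NLTK FeatStruct (assumes nltk is installed).
import nltk

np_features = nltk.FeatStruct(NUM='singular', CASE='nominative', GENDER='feminine')
print(np_features)   # printed as an attribute-value matrix
```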
Why Use Feature Structures?
They enable rich, precise modeling of linguistic information beyond simple category labels.
Allow encoding syntactic, semantic, and morphological information compactly.
Facilitate unification, a process that merges two feature structures and checks for compatibility
of features, used in parsing and grammar checking.
Support constraint-based grammars like Lexical Functional Grammar (LFG) and Head-driven
Phrase Structure Grammar (HPSG).
Applications in NLP
Representing word properties such as tense, number, or agreement.
Encoding relationships and constraints in parsing and generation.
Improving robustness and expressiveness of syntactic and semantic grammars.
Supporting functional and dependency analyses for deeper language understanding.
UNIFICATION OF FEATURE STRUCTURES
Unification of feature structures in NLP is an operation that combines two feature structures into a single
one that contains all the information from both, provided they are compatible. It is a key process used to
merge and reconcile linguistic information from different sources or constraints.
What is Unification?
Unification attempts to merge two sets of attribute-value pairs (feature structures).
If the features conflict (e.g., one has number=singular and the other number=plural), unification
fails.
If compatible, unification produces a new feature structure that is more specific, integrating all
the information from the inputs.
How It Works
Features are recursively checked and merged.
For atomic attributes, values must match or be compatible.
For nested feature structures, unification is applied recursively.
It is monotonic (information only grows) and order-independent (unifying in any order yields the
same result if successful).
Example
Feature structure 1: {num: singular, person: 3rd}
Feature structure 2: {num: singular, gender: feminine}
Unification result: {num: singular, person: 3rd, gender: feminine}
Conflicting example:
FS1: {num: singular}
FS2: {num: plural}
Unification fails because of incompatible values.
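Both cases can be reproduced with NLTK's FeatStruct.unify method (an assumption), as in the minimal sketch below: the compatible pair merges, while the conflicting pair returns None.

```python
# Sketch: unification of the two feature structures above with NLTK.
import nltk

fs1 = nltk.FeatStruct(num='singular', person='3rd')
fs2 = nltk.FeatStruct(num='singular', gender='feminine')
print(fs1.unify(fs2))   # merged structure containing num, person and gender

fs3 = nltk.FeatStruct(num='plural')
print(fs1.unify(fs3))   # None: conflicting num values, unification fails
```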
Importance
Ensures that linguistic constraints like agreement and subcategorization are respected during
parsing.
Helps integrate information from lexical entries and syntactic rules.
Central in constraint-based grammar formalisms like HPSG and LFG.