Unit 3: Semantic Parsing - Detailed Notes (From System Paradigms Onward)
System Paradigms in Semantic Parsing
Semantic parsing systems can be grouped along three dimensions: system architecture, scope, and coverage.
1. System Architectures
- Knowledge-Based Systems:
- Rely on human-crafted rules.
- Good for domains like medicine or law.
- Example: Rule-based hospital chatbot.
- Unsupervised Systems:
- No labeled data required.
- Use clustering or patterns.
- Example: Clustering word "java" by context (coffee, island, programming).
- Supervised Systems:
- Trained using labeled datasets.
- Use ML models like SVM or MaxEnt.
- Example: QA systems trained on SQuAD.
- Semi-Supervised Systems:
- Combine small labeled datasets with large unlabeled sets.
- Example: the Yarowsky algorithm bootstrapping from a small set of seed examples.
2. Scope
- Domain-Dependent:
- Specific to a field.
- Example: Airline booking assistant.
- Domain-Independent:
- Works across domains.
- Example: Alexa, Google Assistant.
3. Coverage
- Shallow Coverage:
- Produces intermediate outputs (e.g., POS tags).
- Example: POS tagging of "Book a flight".
- Deep Coverage:
- Produces logical representations.
- Example: Logical form of "Who is the president of India?"
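The shallow/deep contrast above can be illustrated with a toy sketch. The POS lookup table and the template mapping below are hand-written stand-ins for a real tagger and a real semantic parser, not actual implementations:

```python
# Illustrative sketch only: contrast shallow coverage (intermediate
# output such as POS tags) with deep coverage (a logical form) for the
# same utterance. Both "parsers" are hand-written toys (assumption).

def shallow_parse(sentence):
    # Toy POS lookup standing in for a real tagger.
    pos = {"book": "VERB", "a": "DET", "flight": "NOUN"}
    return [(w, pos.get(w.lower(), "X")) for w in sentence.split()]

def deep_parse(sentence):
    # Toy template mapping standing in for a real semantic parser.
    if sentence.lower() == "book a flight":
        return "book(flight)"
    return None

print(shallow_parse("Book a flight"))  # [('Book', 'VERB'), ('a', 'DET'), ('flight', 'NOUN')]
print(deep_parse("Book a flight"))     # book(flight)
```

A shallow system stops at the tag sequence; a deep system commits to a meaning representation that downstream components (e.g., a booking backend) can execute.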
Word Sense
Understanding that words have multiple meanings depending on context.
- Types of Word Sense Ambiguities:
- Homonymy: Same spelling, unrelated meanings (e.g., bat - animal/tool).
- Polysemy: Related senses of one word (e.g., bank as a financial institution vs. the building that houses it).
- Categorial Ambiguity: Multiple POS (e.g., book - noun/verb).
- Word Sense Disambiguation (WSD):
- Process of determining the right meaning of a word.
- Methods:
- Rule-based (e.g., Lesk Algorithm).
- Supervised (ML-based classifiers).
- Unsupervised (Clustering, Information Content (IC), Conceptual Density).
- Semi-Supervised (Yarowsky Algorithm).
Resources
- Corpus (plural: corpora): a structured collection of texts used for training.
- Dictionaries and thesauri: LDOCE (Longman Dictionary of Contemporary English), Roget's Thesaurus.
- WordNet: Lexical database with synonym sets and glosses.
Rule-Based Systems
- Lesk Algorithm:
- Uses dictionary definitions and counts word overlaps in context.
- Example: Resolving bank using words like cash or river.
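A minimal simplified-Lesk sketch, assuming toy glosses for "bank" (made up for illustration, not taken from a real dictionary): pick the sense whose gloss shares the most words with the surrounding context.

```python
# Simplified Lesk: choose the sense whose dictionary gloss overlaps most
# with the context words. Glosses below are illustrative toys (assumption).

def simplified_lesk(context_words, sense_glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

glosses = {
    "bank/finance": "an institution that accepts deposits of cash and lends money",
    "bank/river": "the sloping land alongside a river or stream",
}
print(simplified_lesk("deposit cash and money at the bank".split(), glosses))  # bank/finance
print(simplified_lesk("fishing on the river bank".split(), glosses))           # bank/river
```

In practice, stop words are removed from both gloss and context first, so that function words like "the" do not inflate the overlap count.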
- Roget's Thesaurus Algorithm:
- Classifies based on category matches and word probabilities.
- SSI (Structural Semantic Interconnections):
- Graph-based representation of senses.
- Uses WordNet to construct semantic graphs and iteratively disambiguate.
Supervised Systems
- Use annotated data to train classifiers.
- Popular Classifiers: SVM, MaxEnt.
- Features:
- Lexical context
- POS tags
- Bag of Words
- Collocations
- Syntactic structure
- Topic and voice
- Subject/Object presence
- Prepositional phrase adjuncts
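The first few feature types above (lexical context, bag of words, collocations) can be sketched as a feature extractor for a target word. The feature names and the window size are illustrative choices, not a fixed standard:

```python
# Sketch of features a supervised WSD classifier might extract for a
# target word. Window size and feature naming are illustrative (assumption).

def extract_features(tokens, target_index, window=2):
    features = {}
    # Lexical context: words within +/-window of the target, keyed by offset.
    for offset in range(-window, window + 1):
        i = target_index + offset
        if offset != 0 and 0 <= i < len(tokens):
            features[f"word_{offset:+d}"] = tokens[i].lower()
    # Collocation: the immediate left/right neighbours as one feature.
    left = tokens[target_index - 1].lower() if target_index > 0 else "<s>"
    right = tokens[target_index + 1].lower() if target_index + 1 < len(tokens) else "</s>"
    features["colloc"] = f"{left}_{right}"
    # Bag of words: unordered set of context words, position ignored.
    features["bag"] = sorted(set(t.lower() for j, t in enumerate(tokens) if j != target_index))
    return features

toks = "he sat on the river bank fishing".split()
feats = extract_features(toks, toks.index("bank"))
print(feats["colloc"])   # river_fishing
print(feats["word_-1"])  # river
```

Feature dictionaries of this shape would then be vectorized and fed to a classifier such as an SVM or MaxEnt model.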
Unsupervised Systems
- No labeled data.
- Techniques:
- Clustering senses
- Semantic similarity
- Information Content (IC)
- Conceptual Density using WordNet hierarchy
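The sense-clustering idea can be sketched with the "java" example from earlier: occurrences whose contexts share words end up in the same cluster, each cluster standing for one induced sense. Jaccard similarity and the greedy assignment rule are illustrative choices:

```python
# Toy sense induction by clustering contexts of an ambiguous word.
# The target word itself is excluded, since it appears in every context.
# Jaccard similarity and greedy assignment are illustrative (assumption).

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_contexts(contexts, target="java"):
    clusters = []  # each cluster is a list of context word-sets
    for ctx in contexts:
        words = set(ctx.lower().split()) - {target}
        best, best_sim = None, 0.0
        for cluster in clusters:
            # Compare against all words already in the cluster.
            sim = jaccard(words, set().union(*cluster))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([words])  # no overlap with any cluster: new sense
        else:
            best.append(words)
    return clusters

contexts = [
    "drink a cup of java coffee",
    "java coffee beans roast",
    "write java code program",
    "compile the java program",
]
print(len(cluster_contexts(contexts)))  # 2 induced senses: coffee vs. programming
```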
Semi-Supervised Systems
- Combine a small labeled seed set with a large unlabeled set, expanded by bootstrapping.
- Yarowsky Algorithm:
- Key Principles:
- One Sense per Collocation
- One Sense per Discourse
- Bootstrapping
- Steps:
1. Initialize with seed examples
2. Extract features
3. Train classifier
4. Label data
5. Repeat
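The steps above can be sketched as a toy bootstrapping loop. The data, the sense labels, and the rule-learning criterion (a word becomes a collocation rule once it co-occurs with exactly one sense) are all illustrative simplifications of the actual algorithm:

```python
# Hedged sketch of Yarowsky-style bootstrapping: seed collocations label
# some contexts, new collocation rules are learned from those labels,
# and the process repeats. Toy data and decision rule (assumptions).

def yarowsky_bootstrap(contexts, seeds, rounds=3):
    """contexts: list of word lists containing the ambiguous word.
    seeds: {collocation_word: sense} initial seed rules."""
    rules = dict(seeds)
    labels = {}
    for _ in range(rounds):
        # Label every context matched by a known collocation rule
        # (one sense per collocation).
        for i, ctx in enumerate(contexts):
            for word, sense in rules.items():
                if word in ctx:
                    labels[i] = sense
                    break
        # Learn new rules: any word that so far co-occurs with
        # exactly one sense becomes a collocation rule for that sense.
        seen = {}
        for i, sense in labels.items():
            for w in contexts[i]:
                seen.setdefault(w, set()).add(sense)
        for w, senses in seen.items():
            if len(senses) == 1:
                rules.setdefault(w, next(iter(senses)))
    return labels

contexts = [
    ["plant", "manufacturing", "cars"],
    ["plant", "leaf", "green"],
    ["plant", "cars", "assembly"],  # no seed word; reached via learned rule "cars"
    ["plant", "green", "garden"],   # reached via learned rule "green"
]
seeds = {"manufacturing": "factory-sense", "leaf": "living-sense"}
result = yarowsky_bootstrap(contexts, seeds)
print(result)
```

Note how "plant" itself never becomes a rule: it co-occurs with both senses, so the one-sense-per-collocation criterion rejects it.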
Additional Concepts
- Synset: Set of synonyms from WordNet.
- Example: the synset of "happy": {happy, glad, joyful}.
- Stop Words: Common but semantically weak words (e.g., the, and, is).
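Stop-word removal is a common preprocessing step before overlap counting in algorithms such as Lesk. A small sketch, using a tiny illustrative stop-word list rather than any standard one:

```python
# Toy stop-word filter; the stop-word set is a small illustrative
# sample (assumption), not a standard list.

STOP_WORDS = {"the", "and", "is", "a", "of", "on", "in"}

def remove_stop_words(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The bank is on the side of the river"))
# ['bank', 'side', 'river']
```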