02 Linguistics Essentials

Uploaded by

safat.ahmed.nayeem
CSE440: Natural Language Processing II
Dr. Farig Sadeque
Assistant Professor
Department of Computer Science and Engineering
BRAC University
Lecture 2: Linguistics Essentials
Topics
- Common NLP components
  - Sentence segmentation
  - Tokenization
  - Lemmatization/Stemming
  - Parts-of-Speech tagging
  - Named Entity Recognition
  - Parsing
  - Coreference Resolution
- Hands-on Demonstration
- Next class
  - Install SpaCy and NLTK on your computer!
NLP Annotations
- Associating extra information to a piece of text

- Example:

Dr. Jennifer Smith visited China. She liked it very much.

Dr./propn Jennifer/propn Smith/propn visited/verb China/propn ./punc

She/pron liked/verb it/pron very/adv much/adv ./punc

This is called Parts-of-Speech tagging.

More Annotations: Dependencies, Named Entities, Coreference

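Dr./propn Jennifer/propn Smith/propn visited/verb China/propn ./punc

She/pron liked/verb it/pron very/adv much/adv ./punc

[Figure: the first sentence annotated with dependency arcs (compound, compound, nsubj, dobj, punct) and named entities (Dr. Jennifer Smith: person; China: GPE)]
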

Common NLP Components
● Sentence segmentation
● Tokenization
● Lemmatization/Stemming
● Parts-of-Speech tagging
● Named Entity Recognition
● Parsing
● Coreference Resolution
Sentence Segmentation: Challenges
How do you know where an English sentence ends?
- Consider: Mr. Smith lives in the U.S.A. He said “I am an American citizen!”

Many ‘.’ ‘!’ and ‘?’ end sentences but not all:
● Some ‘.’ are in abbreviations
● Some ‘.’ in abbreviations also end sentences
● Quotes after ‘.’ ‘!’ or ‘?’ are in the same sentence
● etc.
Sentence Segmentation: Solution
Rules:
● Easy to write a few rules
● Large rule sets are hard to maintain
Machine learning:
● Classify each punctuation character: sentence final?
● Features: surrounding characters, words
● Around 99% accuracy
Parsing (spaCy’s approach):
● Let the dependency parser figure it out
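
The rule-based option can be sketched in a few lines. This is only an illustration, not the algorithm of spaCy or any other library: the abbreviation list and the function name `split_sentences` are assumptions made up for this example.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.a."}

def split_sentences(text):
    """Split after a token ending in '.', '!' or '?' unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Drop closing quotes so that a token like 'citizen!”' still counts as sentence-final.
        core = token.rstrip("\"'”’")
        ends_with_mark = core.endswith((".", "!", "?"))
        if ends_with_mark and core.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

easy = "Dr. Smith visited China. She liked it very much."
hard = "Mr. Smith lives in the U.S.A. He said “I am an American citizen!”"

print(split_sentences(easy))
print(split_sentences(hard))
```

The first text is split into two sentences, but the second comes back as a single sentence: the rule treats "U.S.A." as an abbreviation and so misses the boundary after it. Handling such cases means adding more rules, which is why large rule sets become hard to maintain.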
Tokenization: Brainstorming
Someone has told you that words in English can be separated by simply splitting
on whitespace. How many times would that heuristic fail for the following text?

Mr. O’Neill said reaction to Sea Container’s proposal “hasn’t been very positive.”
In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, down 62.5 cents.

What could you do to improve the heuristic?
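
To get a feel for the question, here is a minimal sketch that applies the whitespace heuristic to the text above and flags every resulting token that still contains a non-alphanumeric character. The flagging criterion (`str.isalnum`) is an assumption chosen for this example, not a definition of a tokenization error.

```python
text = (
    "Mr. O’Neill said reaction to Sea Container’s proposal "
    "“hasn’t been very positive.” In New York Stock Exchange composite "
    "trading yesterday, Sea Containers closed at $62.625, down 62.5 cents."
)

# The heuristic under discussion: split on whitespace only.
tokens = text.split()

# Tokens that still contain punctuation or symbols are the likely trouble spots.
suspicious = [t for t in tokens if not t.isalnum()]

print(len(tokens), "tokens;", len(suspicious), "contain punctuation or symbols")
print(suspicious)
```

The check is crude in both directions: it flags 62.5, which is arguably a fine token, and it says nothing about New York, which whitespace splitting breaks into two tokens.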

Tokenization Challenges
● Words with punctuation: C++, C#, M*A*S*H, etc.
● Emoticons: =) :) ;-) etc.
● Contractions: I’ll, isn’t, dog’s, etc.
  ● Typically split to separate, e.g., the pronoun (I) from the verb (’ll)
● Hyphens in words: e-mail, co-operate, etc.
● Hyphens between morphemes: non-lawyer, pro-Arab
● Hyphens between words: once-quiet study, take-it-or-leave-it offer, 26-year-old, etc.
● Names: New York vs. York
● Phrasal verbs: make up, work out, etc.
● Phone numbers: +(880) 1756-111111
Tokenization Challenges
How about other languages?
- Chinese: 我正在教一堂課
  - Means “I am teaching a class.”
  - A word may be one character or several, simpler characters build complex ones, and there are no spaces between words!
- German: Lebensversicherungsgesellschaftsangestellter (pronounce this!)
  - Means “life insurance company employee”
- How about Bangla?
  - Let me present to you the one and only Michael Madhusudan Dutta

নিকুম্ভিলা যজ্ঞ সাংগ করি, আরম্ভিলে/যুদ্ধ দম্ভি মেঘনাদ, বিষম সঙ্কটে/ঠেকিবে বৈদেহীনাথ, কহিনু তোমারে
Tokenization Solutions
Unfortunately, there is no general solution. Each language requires its own tokenization principles.

How do common tools do it then?

spaCy: recursively split on whitespace, known exceptions, affixes, and punctuation
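
The idea of splitting on whitespace and then recursively peeling off affixes and punctuation, with a table of known exceptions, can be sketched as follows. This is a simplified illustration and not spaCy's actual implementation; the tables `EXCEPTIONS`, `PREFIXES`, and `SUFFIXES` and both function names are hypothetical and contain only a handful of entries.

```python
# All three tables are small, made-up examples (assumptions for this sketch).
EXCEPTIONS = {"Mr.": ["Mr."], "hasn’t": ["has", "n’t"], "U.S.A.": ["U.S.A."]}
PREFIXES = ("“", "(", "$")
SUFFIXES = ("”", ")", ",", ".", "!", "?", "’s")

def tokenize_chunk(chunk):
    """Recursively peel prefixes and suffixes off one whitespace-delimited chunk."""
    if not chunk:
        return []
    # Known exceptions win over the affix rules.
    if chunk in EXCEPTIONS:
        return list(EXCEPTIONS[chunk])
    for prefix in PREFIXES:
        if chunk.startswith(prefix) and len(chunk) > len(prefix):
            return [prefix] + tokenize_chunk(chunk[len(prefix):])
    for suffix in SUFFIXES:
        if chunk.endswith(suffix) and len(chunk) > len(suffix):
            return tokenize_chunk(chunk[:-len(suffix)]) + [suffix]
    return [chunk]

def tokenize(text):
    """Split on whitespace first, then refine each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize_chunk(chunk))
    return tokens

print(tokenize("Mr. O’Neill said reaction “hasn’t been very positive.”"))
print(tokenize("closed at $62.625, down 62.5 cents."))
```

Checking exceptions before the affix rules is what keeps "Mr." in one piece while the final period of "positive." is still split off.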

Problem: Similar Words Look Different
The words dog and dogs are closely related, but on a computer "dog" != "dogs"

Solutions:

● Cut out common substrings (stemming/lemmatization)

● Replace words with vectors (embeddings)
Stemming and Lemmatization
Stemming:

● Rules strip pieces of words (not morphemes)

● E.g., Porter stemmer: equivalence → equival
● Fast, but inaccurate, e.g., organization → organ, while European is not reduced to Europe

Lemmatization:
● Hand-built lexicon for all word forms, e.g., walked → walk
● Accurate, but slower, and there is a chicken-and-egg problem with part-of-speech tagging (the right lemma can depend on the POS tag)
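
The contrast between the two approaches can be sketched with toy versions of each. Neither is a real algorithm: `SUFFIX_RULES` is a made-up handful of suffixes (the actual Porter stemmer applies many ordered rule steps), and `LEXICON` is a hypothetical four-entry dictionary.

```python
# Toy suffix rules (an assumption for this sketch; not the Porter stemmer's rule set).
SUFFIX_RULES = ["ization", "ence", "ing", "ed", "s"]

# Toy lexicon mapping (word form, part of speech) to lemma; entries are illustrative.
LEXICON = {
    ("walked", "verb"): "walk",
    ("dogs", "noun"): "dog",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}

def toy_stem(word):
    """Strip the first matching suffix without checking that the result is a real word."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look the form up in the lexicon; the answer depends on the POS tag."""
    return LEXICON.get((word, pos), word)

print(toy_stem("organization"))     # over-stemmed, as in the example above
print(toy_stem("walked"))
print(toy_lemmatize("saw", "verb"))
print(toy_lemmatize("saw", "noun"))
```

The two lookups for "saw" show the chicken-and-egg problem: the lemmatizer needs a POS tag to choose the lemma, while a tagger might itself benefit from knowing the lemma.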
Embedding
If dog = [0.5, 0.4, 0.1] and dogs = [0.5, 0.4, 0.2],
then cos(dog, dogs) ≈ 0.99

Goal: learn an embedding vector for each word such that similar words have
similar vectors.
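
The similarity figure for dog and dogs can be checked directly. A minimal sketch, assuming the usual definition of cosine similarity, cos(u, v) = (u · v) / (|u| |v|); the function name is chosen for this example.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

dog = [0.5, 0.4, 0.1]
dogs = [0.5, 0.4, 0.2]

print(round(cosine_similarity(dog, dogs), 2))  # 0.99
```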

Will be covered in session 3.

Before moving on
Our next overview is going to be on Parts of Speech tagging. Before starting that,
please review these two links:
- Penn TreeBank tags:
- [Link]
- Section 2 and 3
- Universal POS tags:
- [Link]
NLP Libraries

                  SpaCy    NLTK     CoreNLP  Processors
Fast              yes      no       yes      yes
State-of-the-art  yes      no       yes      yes
Large community   yes      yes      yes      no
Simple APIs       yes      yes      no       no
Language          Python   Python   Java     Scala

Before next class, please install SpaCy and NLTK on your computer
Parts-of-Speech (POS) Tagging
Assigning grammatical categories to words
She/pron liked/verb it/pron very/adv much/adv ./punc
closed class
- categories have a fixed set of words
- prepositions, determiners, pronouns, conjunctions, auxiliary verbs, particles, numerals
open class
- categories have a growing set of words
- nouns, verbs, adjectives, adverbs
POS tagging
noun “person, place, or thing”: farmer, Dhaka, dice but also explosion, moment
verb “action or process”: grab, evolve, rain
adjective “property or quality”, modify nouns: green, old
adverb “modify verbs and adjectives”: slowly, very, today
adposition “before/after a noun phrase”: over, before
determiner “express reference of noun”: a, the, that
pronoun “substitute for noun”: you, our, who
conjunction “join two phrases”: and, but, if
particle “associated with other word”: not, maybe rule out
interjection “exclamation”: psst, ouch, hello
POS Tagging Challenges
One word can have different POS tags depending on its use
- I painted the room vs. the painted room
- Is painted a verb or an adjective?
Annotate the following sentence with Penn Treebank POS tags:
“Wow! That first post really blew up.”
Named Entity Recognition (NER)
Identify phrases that name people, locations, organizations, etc.

Dr./propn Jennifer/propn Smith/propn visited/verb China/propn ./punc

Dr. Jennifer Smith: Person; China: GPE

Common named entity types:

- person: Turing is often considered the. . .
- organization: The IPCC said it is likely that. . .
- location: The Mt. Sanitas loop hike. . .
- geo-political entity: Palo Alto will raise parking fees.
- etc.
NER Challenges
Ambiguity:
- Washington was born into slavery. <per>
- Washington went up 2 games to 1. <org>
- Blair arrived in Washington today. <loc>
- Washington passed a primary seatbelt law. <gpe>
Solution
Sequence Tagging
- Will study it in session 4

Simple scheme: label each word as (I)nside or (O)utside

More elaborate schemes:

- BIO: begin, inside, outside
- BILOU: begin, inside, last, outside, unit-length
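
To make the schemes concrete, here is a minimal sketch that converts entity spans into one tag per token under BIO and BILOU. The function name `spans_to_tags` and the span format (start index, exclusive end index, label) are assumptions for this example; the sentence and entity labels are the ones used earlier in the lecture.

```python
def spans_to_tags(tokens, spans, scheme="BIO"):
    """Convert entity spans (start, end_exclusive, label) into one tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        length = end - start
        for i in range(start, end):
            if scheme == "BILOU" and length == 1:
                prefix = "U"   # unit-length entity
            elif i == start:
                prefix = "B"   # first token of the entity
            elif scheme == "BILOU" and i == end - 1:
                prefix = "L"   # last token of a multi-token entity
            else:
                prefix = "I"   # any other token inside the entity
            tags[i] = f"{prefix}-{label}"
    return tags

tokens = ["Dr.", "Jennifer", "Smith", "visited", "China", "."]
spans = [(0, 3, "person"), (4, 5, "GPE")]

for scheme in ("BIO", "BILOU"):
    print(scheme, list(zip(tokens, spans_to_tags(tokens, spans, scheme))))
```

Once every token carries a tag like this, finding entities becomes a per-token labeling problem, which is what makes sequence tagging applicable.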
Parsing and Syntactic Representation
Example: John hit the ball.
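
One common way to represent the syntax of this sentence is as a set of dependency triples. The sketch below is an assumption about how such a structure could be stored, using labels in the style of the nsubj/dobj annotations shown earlier; the helper names are made up for this example.

```python
# Dependency triples (head, relation, dependent) for "John hit the ball."
# The labels are an assumption in the style of the nsubj/dobj labels used earlier.
dependencies = [
    ("hit", "nsubj", "John"),
    ("hit", "dobj", "ball"),
    ("ball", "det", "the"),
    ("hit", "punct", "."),
]

def children(head, deps):
    """Return the (relation, dependent) pairs attached to a head word."""
    return [(rel, dep) for h, rel, dep in deps if h == head]

def find_root(deps):
    """The root is the only head that never appears as a dependent."""
    heads = {h for h, _, _ in deps}
    dependents = {d for _, _, d in deps}
    (root,) = heads - dependents
    return root

print(find_root(dependencies))
print(children("hit", dependencies))
```

Using words as node names only works here because no word repeats; a real representation would use token positions instead.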
Parsing Challenges
Attachment ambiguity: One morning I shot an elephant in my pajamas.
- Who was in my pajamas? Me? The elephant?
Coordination Ambiguity
Old men and women
- Old (men and women)?
- Old (men) and women?

Which one is correct?

Parsing solutions
- Probabilistic grammar-based parsing
- Transition-based parsing

We will learn theories of parsing in session 5

Demo
We will now check out some of the tools that are available to us.
- SpaCy
- NLTK
- Stanford CoreNLP
