NLP Basic - YL
Tokenization for language chunks
Common solution: Use WordNet, a thesaurus containing lists of synonym sets and
hypernyms (“is a” relationships)
Problems with resources like WordNet
● Great as a resource but missing nuance.
○ E.g. “proficient” is listed as a synonym for “good”.
○ This is only correct in some contexts (see the NLTK sketch after this list)
● Subjective
● Requires human effort to create and adapt
● Can’t compute accurate word similarities.
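A minimal sketch of querying WordNet through NLTK (illustrative, not from the slides; assumes the nltk package is installed and nltk.download("wordnet") has been run):

# Minimal sketch: querying WordNet via NLTK.
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for "good"; one adjective sense lists
# "proficient" as a lemma, illustrating the missing-nuance problem above.
for syn in wn.synsets("good"):
    print(syn.name(), [lemma.name() for lemma in syn.lemmas()])

# Hypernyms ("is a" relationships), e.g. for the first noun sense of "dog".
print(wn.synset("dog.n.01").hypernyms())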
Legacy Techniques: counting is everything
One-hot Vector
[Figure: two one-hot vectors over an 8-word vocabulary:
[0 0 1 0 0 0 0 0] and [0 0 0 0 0 0 1 0]]
One-hot Vector
● Pros
○ Simple
○ Easily computed and suitable for parallel computing
● Cons
○ Dimensionality is the size of vocabulary
○ Out-of-Vocabulary (OOV) problem
○ All word vectors are orthogonal, so no similarity between words is captured (see the sketch below)
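A minimal one-hot encoding sketch in Python (the toy vocabulary is made up for illustration; not from the slides):

# Minimal one-hot encoding sketch (hypothetical toy vocabulary).
vocab = ["a", "cat", "dog", "hotel", "motel", "sat", "the", "walked"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Dimensionality equals vocabulary size; OOV words raise KeyError.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))    # [0, 0, 1, 0, 0, 0, 0, 0]
print(one_hot("motel"))  # [0, 0, 0, 0, 1, 0, 0, 0]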
Bag-of-Words
● Steps
○ Build the vocabulary, i.e., the set of all words in the corpus
○ Count the occurrences of each word in each document (sketch below)
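A minimal bag-of-words sketch in Python (the two-document toy corpus is made up for illustration):

# Minimal bag-of-words sketch over a toy two-document corpus.
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({word for doc in corpus for word in doc.split()})

def bow(doc):
    # Count vector over the shared vocabulary; word order is discarded.
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for doc in corpus:
    print(bow(doc))  # [1, 0, 1, 1, 1, 2] then [0, 1, 0, 0, 1, 1]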
Bag-of-Words
● Pros
○ Simple
○ Surprisingly effective
○ Fast
● Cons
○ Order of words does not matter
○ Cannot capture syntactic/semantic information
N-gram model
● Steps
○ Build the vocabulary, i.e., the set of all n-grams in the corpus
○ Count the occurrences of each n-gram in each document (sketch below)
[Figure: count matrix for two documents over a nine-n-gram vocabulary:
[1 1 1 1 1 0 0 0 0] and [1 0 0 0 0 1 1 1 1]]
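A minimal n-gram counting sketch in Python (bigrams over the same made-up toy corpus; illustrative only):

# Minimal n-gram (here: bigram) count sketch over a toy corpus.
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ["the cat sat on the mat", "the dog sat"]
docs = [ngrams(doc.split(), 2) for doc in corpus]
vocab = sorted({g for doc in docs for g in doc})

for doc in docs:
    counts = Counter(doc)
    print([counts[g] for g in vocab])
# Output: [1, 0, 1, 1, 1, 0, 1] then [0, 1, 0, 0, 0, 1, 0]
# Note the bigram vocabulary (7 entries) is already larger than the
# word vocabulary (6 entries), illustrating the vocabulary-size problem.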
N-gram model
● Pros
○ Word order is considered
● Cons
○ Vocabulary size grows very large
○ Cannot capture syntactic/semantic information
○ Only incorporates limited (local) word-order information
Term Frequency-Inverse Document Frequency (TF-IDF)
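For reference, one standard formulation (the slides' exact weighting variant may differ):

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.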