
Get Started with NLP

NLP and Machine Learning

● Machine learning focuses on the development of computer programs
that can access data and use it to learn for themselves.
● Machine learning techniques can be applied to solve NLP problems.
○ Key challenge: how to convert unstructured text into a structured format
○ The representation chosen for text matters
○ Representation learning is a set of techniques that learn features, i.e.,
that transform the raw input into a representation that can be effectively
exploited in machine learning tasks
Representation Matters

● Computer programs do not understand text
● Numeric representations are required for text
● Unlike images, which map directly to an RGB matrix, text has no
direct numeric transformation
NLP pipeline

Raw Text → Preprocessing → Tokenization into language chunks →
Numerical representation for these chunks → Machine Learning Models
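The sketch below walks through the first stages of this pipeline on a toy
sentence. The preprocessing choices (lowercasing, stripping punctuation) and
whitespace tokenization are illustrative assumptions; the slide leaves these
steps open.

```python
# A minimal sketch of the pipeline stages above.
# Lowercasing, punctuation stripping, and whitespace tokenization
# are assumptions, not the only valid recipe.

def preprocess(raw_text):
    # Preprocessing: lowercase and drop punctuation characters
    return "".join(ch for ch in raw_text.lower() if ch.isalnum() or ch.isspace())

def tokenize(text):
    # Tokenization: split the cleaned text into word chunks
    return text.split()

tokens = tokenize(preprocess("The cat and the dog play."))
print(tokens)  # ['the', 'cat', 'and', 'the', 'dog', 'play']
# The remaining stages map each chunk to a numerical representation
# (one-hot, bag-of-words, n-grams, TF-IDF, ...) and feed it to a model.
```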
How do we represent the meaning of a word?
Word Meaning
● The idea that is represented by a word, phrase, etc.
● The idea that a person wants to express by using words, signs, etc.
● The idea that is expressed in a work of writing, art, etc.

Common solution: Use WordNet, a thesaurus containing lists of synonym sets and
hypernyms (“is a” relationships)
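As a quick illustration, WordNet can be queried through NLTK. This sketch
assumes nltk is installed and the WordNet corpus has been downloaded via
nltk.download('wordnet').

```python
# Querying WordNet synonym sets through NLTK (assumes the wordnet
# corpus has been downloaded with nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("good")[:3]:
    print(synset.name(), "-", synset.definition())
    # Each lemma in the synset is a synonym in this sense of the word
    print("  synonyms:", [lemma.name() for lemma in synset.lemmas()])
```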
Problems with resources like WordNet
● Great as a resource but missing nuance
○ E.g. “proficient” is listed as a synonym for “good”
○ This is only correct in some contexts
● Missing new meanings of words
○ Impossible to keep everything up-to-date
● Subjective
● Requires human effort to create and adapt
● Can’t compute accurate word similarities
Legacy Techniques: counting is everything
One-hot Vector

● Map each word to a unique ID

● ID can be the index of the word in the whole vocabulary.


corpus:
The cat and the dog play
The cat is on the mat

vocab (word → ID):
and → 0, the → 1, cat → 2, dog → 3, play → 4, on → 5, mat → 6, is → 7
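A minimal sketch of building such a word-to-ID vocabulary from the corpus.
Assigning IDs in order of first appearance is an assumption, so the exact
numbers differ from the slide's ordering; any consistent mapping works.

```python
# A minimal sketch of building a word-to-ID vocabulary.
# IDs are assigned by first appearance (an assumption); the numbers
# differ from the slide, but any consistent mapping works.
corpus = ["The cat and the dog play", "The cat is on the mat"]

vocab = {}
for document in corpus:
    for word in document.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # next unused index becomes the ID

print(vocab)
# {'the': 0, 'cat': 1, 'and': 2, 'dog': 3, 'play': 4, 'is': 5, 'on': 6, 'mat': 7}
```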
One-hot Vector
● The ID can determine the one-hot word vector

● A vector filled with 0s, except for a 1 at the position of the ID


cat → [0, 0, 1, 0, 0, 0, 0, 0]   (1 at ID 2)
mat → [0, 0, 0, 0, 0, 0, 1, 0]   (1 at ID 6)
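A minimal sketch of turning a word's ID into its one-hot vector, hard-coding
the slide's vocabulary for illustration.

```python
# A minimal sketch of one-hot encoding, using the slide's word → ID mapping.
vocab = {"and": 0, "the": 1, "cat": 2, "dog": 3,
         "play": 4, "on": 5, "mat": 6, "is": 7}

def one_hot(word, vocab):
    vector = [0] * len(vocab)  # a vector filled with 0s...
    vector[vocab[word]] = 1    # ...except a 1 at the position of the ID
    return vector

print(one_hot("cat", vocab))  # [0, 0, 1, 0, 0, 0, 0, 0]
print(one_hot("mat", vocab))  # [0, 0, 0, 0, 0, 0, 1, 0]
```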
One-hot Vector
● Pros
○ Simple
○ Easily computed and suitable for parallel computing

● Cons
○ Dimensionality is the size of vocabulary
○ Out-of-Vocabulary (OOV) problem
○ All words are independent: every pair of one-hot vectors is orthogonal,
so no similarity between words is captured
Bag-of-Words

● Steps
○ Build vocab, i.e., the set of all the words in the corpus
○ Count the occurrence of words in each document

vocab: and, the, cat, dog, play, on, mat, is

The cat and the dog play → [1, 2, 1, 1, 1, 0, 0, 0]
The cat is on the mat    → [0, 2, 1, 0, 0, 1, 1, 1]
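A minimal sketch of computing these bag-of-words counts, reusing the slide's
vocabulary order.

```python
# A minimal sketch of bag-of-words counting over the slide's vocabulary.
vocab = ["and", "the", "cat", "dog", "play", "on", "mat", "is"]

def bag_of_words(document, vocab):
    tokens = document.lower().split()
    # One count per vocabulary entry; word order is discarded
    return [tokens.count(word) for word in vocab]

print(bag_of_words("The cat and the dog play", vocab))  # [1, 2, 1, 1, 1, 0, 0, 0]
print(bag_of_words("The cat is on the mat", vocab))     # [0, 2, 1, 0, 0, 1, 1, 1]
```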
Bag-of-Words

● Pros
○ Simple
○ Surprisingly effective
○ Fast

● Cons
○ Order of words does not matter
○ Cannot capture syntactic/semantic information
N-gram model

● Steps
○ Build vocab, i.e., the set of all n-grams in the corpus
○ Count the occurrence of each n-gram in each document

corpus:
The cat and the dog play
The cat is on the mat

vocab (bigrams): the cat, cat and, and the, the dog, dog play, cat is,
is on, on the, the mat

The cat and the dog play → [1, 1, 1, 1, 1, 0, 0, 0, 0]
The cat is on the mat    → [1, 0, 0, 0, 0, 1, 1, 1, 1]
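A minimal sketch of extracting the bigrams (n = 2) shown above; the
sliding-window slicing is one common idiom, not the only option.

```python
# A minimal sketch of n-gram extraction (bigrams by default) via a
# sliding window over the token list.
def ngrams(document, n=2):
    tokens = document.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("The cat and the dog play"))
# ['the cat', 'cat and', 'and the', 'the dog', 'dog play']
print(ngrams("The cat is on the mat"))
# ['the cat', 'cat is', 'is on', 'on the', 'the mat']
```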
N-gram model

● Pros
○ Word order is considered

● Cons
○ Vocab size grows very large
○ Cannot capture syntactic/semantic information
○ Only incorporates limited word order information (within the n-token window)
Term Frequency-Inverse Document Frequency

● Build vocab, i.e., the set of all the words in the corpus
● Count the occurrence of words in each document
● Use a weighting scheme to determine the value (see the sketch below)
○ TF(w) = (number of times term w appears in a document) / (total number of
terms in the document)
○ IDF(w) = log(total number of documents / number of documents containing
term w)
○ The final weight is TF(w) * IDF(w)
● Intuitive logic:
○ Capture the importance of a word to a document in a corpus
○ The importance of a word is proportional to the number of times it appears
in the document
○ The importance of a word is inversely proportional to the number of
documents containing it

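A minimal sketch of these TF-IDF formulas on the running two-document corpus.
Using the natural log and an unsmoothed IDF is an assumption; libraries often
smooth the IDF term.

```python
# A minimal sketch of TF-IDF as defined above. Natural log and an
# unsmoothed IDF are assumptions; libraries often smooth the IDF term.
import math

corpus = [doc.lower().split() for doc in
          ["The cat and the dog play", "The cat is on the mat"]]

def tf(word, document):
    # Fraction of the document's terms that are this word
    return document.count(word) / len(document)

def idf(word, corpus):
    # Rarer words across the corpus get a larger weight
    docs_with_word = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / docs_with_word)

def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)

print(tf_idf("dog", corpus[0], corpus))  # (1/6) * log(2/1) ≈ 0.1155
print(tf_idf("the", corpus[0], corpus))  # (2/6) * log(2/2) = 0.0
```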