Chapter2 PDF
Chapter2 PDF
Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python
Bag-of-words
Basic method for finding topics in a text
Need to first create tokens using tokenization
... and then count up all the tokens
The more frequent a word, the more important it might be
Can be a great way to determine the significant words in a text
DataCamp Introduction to Natural Language Processing in Python
Bag-of-words example
Text: "The cat is in the box. The cat likes the box. The box is over the
cat."
"The": 3, "box": 3
"cat": 3, "the": 3
"is": 2
"in": 1, "likes": 1, "over": 1
DataCamp Introduction to Natural Language Processing in Python
Bag-of-words in Python
In [1]: from nltk.tokenize import word_tokenize
In [3]: Counter(word_tokenize(
"""The cat is in the box. The cat likes the box.
The box is over the cat."""))
Out[3]:
Counter({'.': 3,
'The': 3,
'box': 3,
'cat': 3,
'in': 1,
...
'the': 3})
In [4]: counter.most_common(2)
Out[4]: [('The', 3), ('box', 3)]
DataCamp Introduction to Natural Language Processing in Python
Let's practice!
DataCamp Introduction to Natural Language Processing in Python
Simple text
preprocessing
Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python
Why preprocess?
Helps make for better input data
When performing machine learning or other statistical methods
Examples:
Tokenization to create a bag of words
Lowercasing words
Lemmatization/Stemming
Shorten words to their root stems
Removing stop words, punctuation, or unwanted tokens
Good to experiment with different approaches
DataCamp Introduction to Natural Language Processing in Python
Preprocessing example
Input text: Cats, dogs and birds are common pets. So are fish.
In [2]: text = """The cat is in the box. The cat likes the box.
The box is over the cat."""
In [5]: Counter(no_stops).most_common(2)
Out[5]: [('cat', 3), ('box', 3)]
DataCamp Introduction to Natural Language Processing in Python
Let's practice!
DataCamp Introduction to Natural Language Processing in Python
Introduction to gensim
Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python
What is gensim?
Popular open-source NLP library
Uses top academic models to perform complex tasks
Building document or word vectors
Performing topic identification and document comparison
DataCamp Introduction to Natural Language Processing in Python
Gensim Example
(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)
DataCamp Introduction to Natural Language Processing in Python
In [6]: dictionary.token2id
Out[6]:
{'!': 11,
',': 17,
'.': 7,
'a': 2,
'about': 4,
...
DataCamp Introduction to Natural Language Processing in Python
In [8]: corpus
Out[8]:
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...
]
Let's practice!
DataCamp Introduction to Natural Language Processing in Python
Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python
What is tf-idf?
Term frequency - inverse document frequency
Allows you to determine the most important words in each document
Each corpus may have shared words beyond just stopwords
These words should be down-weighted in importance
Example from astronomy: "Sky"
Ensures most common words don't show up as key words
Keeps document specific frequent words weighted high
DataCamp Introduction to Natural Language Processing in Python
Tf-idf formula
N
wi,j = tfi,j ∗ log( )
dfi
wi,j = tf-idf weight for token i in document j
tfi,j = number of occurences of token i in document j
dfi = number of documents that contain token i
N = total number of documents
DataCamp Introduction to Natural Language Processing in Python
In [12]: tfidf[corpus[1]]
Out[12]:
[(0, 0.1746298276735174),
(1, 0.1746298276735174),
(9, 0.29853166221463673),
(10, 0.7716931521027908),
...
]
DataCamp Introduction to Natural Language Processing in Python
Let's practice!