Text Tokenization
Tokenization-related APIs:
import nltk.tokenize as tk
# Split the text into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# Split the text into words; word_list: list of words
word_list = tk.word_tokenize(text)
# Split the text into words and punctuation marks; punctTokenizer: tokenizer object
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text)
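These tokenizers rely on NLTK's pre-trained Punkt model, which a fresh NLTK install does not ship with. A minimal sketch of the one-time download (the resource name 'punkt' is the standard NLTK data identifier; very recent NLTK releases may additionally ask for 'punkt_tab'):
import nltk
# One-time download of the Punkt tokenizer data used by sent_tokenize/word_tokenize;
# on newer NLTK releases the 'punkt_tab' resource may be required as well.
nltk.download('punkt')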
Example:
import nltk.tokenize as tk
doc = "Are you curious about tokenization? " \
"Let's see how it works! " \
"We need to analyze a couple of sentences " \
"with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
Stemming
The part of speech and tense of the words in a text sample have little impact on semantic analysis, so the words can be reduced to their stems.
Stemming-related APIs:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
stemmer = pt.PorterStemmer()  # Porter stemmer, relatively lenient
stemmer = lc.LancasterStemmer()  # Lancaster stemmer, relatively aggressive
stemmer = sb.SnowballStemmer('english')  # Snowball stemmer, in between
r = stemmer.stem('playing')  # Extract the stem of the word 'playing'
Example:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))
   table     tabl     tabl     tabl
probably  probabl     prob  probabl
  wolves     wolv     wolv     wolv
 playing     play     play     play
      is       is       is       is
     dog      dog      dog      dog
     the      the      the      the
 beaches    beach    beach    beach
grounded   ground   ground   ground
  dreamt   dreamt   dreamt   dreamt
envision    envis    envid    envis
Lemmatization
Lemmatization serves a purpose similar to stemming, but its output is easier for people to work with afterwards, because many stems are not real words and are harder to read. Lemmatization restores plural nouns to their singular form and verb participles to their base form.
Lemmatization-related APIs:
import nltk.stem as ns
# Get a lemmatizer object
lemmatizer = ns.WordNetLemmatizer()
# Lemmatize the word as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# Lemmatize the word as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')
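WordNetLemmatizer looks words up in the WordNet corpus, which also has to be downloaded once. A minimal sketch (the resource names are the standard NLTK data identifiers; 'omw-1.4' is only needed on some NLTK versions):
import nltk
# One-time download of the WordNet data used by WordNetLemmatizer;
# some NLTK versions also expect the Open Multilingual WordNet resource.
nltk.download('wordnet')
nltk.download('omw-1.4')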
Example:
import nltk.stem as ns
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))
Bag-of-Words Model
The meaning of a sentence is determined to a large extent by how many times particular words appear in it. We can therefore take every word that may appear in the sentences as a feature name, treat each sentence as one sample, and use the number of times each word appears in that sentence as the feature value. The resulting model is called the bag-of-words model.
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden
| sentence | the | brown | dog | is | running | black | in | room | forbidden |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
Bag-of-words-related APIs:
import sklearn.feature_extraction.text as ft
# Build a bag-of-words model object
cv = ft.CountVectorizer()
# Fit the model: every word that may appear in the sentences becomes a feature name, each sentence is one sample, and the number of times a word appears in a sentence is the feature value.
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Get all feature names (in newer scikit-learn this call has been replaced by get_feature_names_out())
words = cv.get_feature_names()
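Note that fit_transform returns a SciPy sparse matrix, which is why toarray() is called before printing. Once the vectorizer has been fitted, new text can be mapped onto the same vocabulary with transform(); words it has not seen are simply ignored. A brief sketch, continuing from the fitted cv above (the extra sentence is made up for illustration):
# Vectorize an unseen sentence with the vocabulary learned during fitting;
# 'chasing' and 'cat' are not in the vocabulary, so they contribute nothing.
new_bow = cv.transform(['The dog is chasing a cat in the room.']).toarray()
print(new_bow)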
Example:
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
'The black dog is in the black room. ' \
'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
[[0 1 1 0 0 1 0 1 1]
[2 0 1 0 1 1 1 0 2]
[0 0 0 1 1 1 1 1 1]]
words = cv.get_feature_names()
print(words)
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
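To make the model concrete, the same matrix can be reproduced by hand: lowercase and tokenize each sentence, build the shared vocabulary, and count occurrences per sentence. A rough sketch of the idea (it only approximates CountVectorizer's defaults, which lowercase the text and keep tokens of two or more word characters):
import re
import collections

sentences = ['The brown dog is running.',
             'The black dog is in the black room.',
             'Running in the room is forbidden.']
# Roughly mimic CountVectorizer's default tokenization: lowercase, 2+ word characters.
tokenized = [re.findall(r'\b\w\w+\b', s.lower()) for s in sentences]
# The vocabulary is the sorted set of all words seen across the corpus.
vocab = sorted(set(word for words in tokenized for word in words))
# Each row counts how often every vocabulary word occurs in one sentence.
bow = [[collections.Counter(words)[word] for word in vocab] for words in tokenized]
print(vocab)
for row in bow:
    print(row)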