Unit 5 Machine Learning
Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input
text into the same casing format so that 'test', 'Test' and 'TEST' are treated the same way. This
is especially helpful for text featurization techniques like frequency counts and TF-IDF, as it
combines identical words, thereby reducing duplication and giving correct
counts/TF-IDF values. We can convert text to lower case simply by calling the string's lower()
method.
Example
import numpy as np
import pandas as pd
# read the sample data set and keep only the text column
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text = df["text"][0]
print("Original Text")
print(text)
# convert the string to lower case and store it back in the data frame
text = text.lower()
df.loc[0, "text"] = text
print("After Converting into Lower Case")
print(text)
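The same conversion can also be applied to the whole column at once with pandas' vectorized string methods instead of handling one row at a time. A minimal sketch, assuming the same sample.csv file with a text column:
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
# .str.lower() lower-cases every row of the column in a single call
df["text"] = df["text"].str.lower()
print(df["text"].head())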
Removal of Punctuations
We also need to carefully choose the list of punctuation symbols to exclude depending on the
use case. For example, string.punctuation in Python contains the following
punctuation symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. We can add or remove
punctuation symbols as per our need.
Example
import pandas as pd
import string
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text = df["text"][2]
print("Original Data")
print(text)
ps = string.punctuation
print("Punctuation Symbols:", ps)
# rebuild the string, keeping only characters that are not punctuation
new_text = ""
for c in text:
    if c not in ps:
        new_text = new_text + c
df.loc[2, "text"] = new_text
print("After Removal of Punctuation Symbols")
print(new_text)
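As an alternative to the character-by-character loop above, Python's built-in str.translate() together with str.maketrans() removes all punctuation in a single call. A minimal sketch, assuming text already holds the string to clean:
import string
# build a translation table that maps every punctuation symbol to None
table = str.maketrans("", "", string.punctuation)
new_text = text.translate(table)
print(new_text)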
Removing Numbers
Sometimes words and digits are written combined in the text, which creates
a problem for machines to understand. Hence, we need to remove words and digits
which are combined, like game57 or game5ts7. This type of word is difficult to process,
so it is better to remove them or replace them with an empty string. We can replace digits
and words containing digits by using the sub() method of the re module. The syntax of the
method is given below.
re.sub(pat, replacement, str)
This function searches for the specified pattern in the given string and replaces the matches
with the specified replacement.
Example: Removing digits
import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text = df["text"][7]
print("Original Data")
print(text)
# replace every digit with an empty string
text = re.sub("[0-9]", "", text)
df.loc[7, "text"] = text
print("After Removal of Digits")
print(text)
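The pattern above deletes only the digits themselves, so game57 becomes game. To drop whole words that contain digits, as described earlier, a pattern such as \w*\d\w* can be used. A small sketch on a made-up sentence:
import re
sample = "we played game57 and game5ts7 yesterday"
# \w*\d\w* matches any word containing at least one digit
cleaned = re.sub(r"\w*\d\w*", "", sample)
print(" ".join(cleaned.split()))   # collapse the extra spaces left behind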
Removal of Stop Words
Stop words are commonly occurring words such as 'the', 'is' and 'and' that carry little
meaning on their own and are usually removed before analysis. Stop word lists are already
compiled for different languages and we can safely use them. For example, the stop word list
for the English language from the NLTK package can be displayed and removed from the text
data as below.
Example
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')   # download the stop word list (only needed once)
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text = df["text"][0]
text = text.lower()
print("Original Text")
print(text)
sw = stopwords.words('english')
print("List of stop words:", sw)
# keep only the tokens that are not in the stop word list
tokens = text.split()
new_tokens = [w for w in tokens if w not in sw]
text = " ".join(new_tokens)
df.loc[0, "text"] = text
print("After Removal of stop words")
print(text)
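The same filtering can be applied to every row of the column by wrapping it in a small function and using pandas' apply(). A minimal sketch, assuming the NLTK stop word corpus has already been downloaded:
import pandas as pd
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))   # a set makes membership tests faster
def remove_stopwords(text):
    # lower-case, split into tokens and keep only the non stop words
    return " ".join(w for w in text.lower().split() if w not in sw)
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
df["text"] = df["text"].apply(remove_stopwords)
print(df["text"].head())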
Removing URLs
The next preprocessing step is to remove any URLs present in the text data. If we scraped
the text data from the web, then there is a good chance that it will have some URLs in
it. We might need to remove them for our further analysis. We can also replace URLs
in the text by using the sub() method of the re module.
Example
import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text = df["text"][7]
print("Original Text")
print(text)
toks = text.split()
new_toks = []
for t in toks:
    # \S represents any character except white space characters
    t = re.sub(r"https?://\S+|www\.\S+", "", t)
    new_toks.append(t)
text = " ".join(new_toks)
df.loc[7, "text"] = text
print("After Removal of URLs")
print(text)
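Splitting the text into tokens is not strictly necessary; the URL pattern can also be compiled once and applied to the whole string. A small sketch on a made-up sentence:
import re
url_pattern = re.compile(r"https?://\S+|www\.\S+")
sample = "read the docs at https://example.com or www.example.org today"
print(url_pattern.sub("", sample))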
Removal of Emojis
With more and more usage of social media platforms, there has been an explosion in the usage
of emojis in our day-to-day life as well. We might need to remove these emojis
for some of our textual analysis. We have to use the 'u' literal to create a Unicode string.
Also, we should pass the re.UNICODE flag and convert our input data to Unicode.
Example
import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
print(df["text"][0])
text = df["text"][0]
toks = text.split()
new_toks = []
pattern = re.compile("["
                     u"\U0001F600-\U0001F64F"  # emoticons
                     u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                     u"\U0001F680-\U0001F6FF"  # transport & map symbols
                     "]+", flags=re.UNICODE)
for t in toks:
    new_toks.append(pattern.sub("", t))
text = " ".join(new_toks)
df.loc[0, "text"] = text
print("After Removal of Emojis")
print(text)
Stemming
Stemming is the process of converting a word to its most general form, or stem. This helps
in reducing the size of our vocabulary. Consider the words learn, learning, learned
and learnt. All these words stem from the common root learn. However, in some
cases, the stemming process produces words that are not correct spellings of the root
word, for example happi. That is because it chooses the most common stem for related
words. For example, we can look at the set of words that comprises the different forms of
happy: happy, happiness and happier. We can see that the prefix happi is more
commonly used. We cannot choose happ because it is the stem of unrelated words like
happen. NLTK has different modules for stemming and we will use the PorterStemmer
module, which uses the Porter stemming algorithm.
Example
import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text = df["text"][0]
text = text.lower()
print("Original Text")
print(text)
stemmer = PorterStemmer()
toks = text.split()
# stem every token and join the stems back into a single string
new_toks = []
for t in toks:
    new_toks.append(stemmer.stem(t))
text = " ".join(new_toks)
df.loc[0, "text"] = text
print("After Stemming")
print(text)
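The behaviour described above, where related forms collapse to a stem that is not itself a valid word, can be checked directly. A small sketch:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
for w in ["happy", "happiness", "learn", "learning", "learned"]:
    # happy and happiness both reduce to "happi"; learning and learned reduce to "learn"
    print(w, "->", stemmer.stem(w))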
Lemmatization
Lemmatization is a text pre-processing technique used in natural language processing
(NLP) models to break a word down to its root meaning to identify similarities. For
example, a lemmatization algorithm would reduce the word better to its root word, or
lemma, good.
In stemming, a part of the word is simply chopped off at the tail end to arrive at the stem
of the word. There are different algorithms used to find out how many characters have
to be chopped off, but these algorithms do not actually know the meaning of the word in
the language it belongs to. In lemmatization, the algorithms do have this knowledge; in
fact, you can even say that these algorithms refer to a dictionary to understand the
meaning of the word before reducing it to its root word, or lemma. Stemming reduces the size
of the text data greatly and is therefore faster when processing large amounts of text data;
however, it may result in meaningless words.
So, a lemmatization algorithm would know that the word better is derived from the
word good, and hence the lemma is good. But a stemming algorithm would not be able
to do the same. There could be over-stemming or under-stemming, and the word
better could be reduced to bet, or bett, or just retained as better. But there is no way
in stemming to reduce better to its root word good. This is the difference between
stemming and lemmatization: lemmatization preserves the meaning of words.
However, it is computationally more expensive and provides less aggressive dimensionality
reduction.
Example
import numpy as np
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')   # download the WordNet data (only needed once)
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text = df["text"][1]
text = text.lower()
print("Original Text")
print(text)
lemmatizer = WordNetLemmatizer()
toks = text.split()
new_toks = []
# lemmatize each token (the default part of speech is noun)
for t in toks:
    rw = lemmatizer.lemmatize(t)
    new_toks.append(rw)
text = " ".join(new_toks)
df.loc[1, "text"] = text
print("After Lemmatization")
print(text)
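The claim that better maps to good can be verified with WordNetLemmatizer, keeping in mind that the part of speech has to be supplied (the default is noun). A small sketch:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better"))            # 'better' (treated as a noun by default)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (treated as an adjective)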