
Unit 5: Text Mining and Big Data

Text Preprocessing
Text mining (also known as text analysis) is the process of transforming unstructured
text into structured data for easy analysis. Text mining uses natural language
processing (NLP), which allows machines to understand human language and process it
automatically. The Natural Language Toolkit (NLTK) package in Python is widely used for
text mining.
To prepare text data for model building, we perform text preprocessing. It is the
very first step of an NLP project. Some of the preprocessing steps are: Tokenization,
Lower casing, Removing punctuations, Removing URLs, Removing
Numbers, Removing Stop words, Removing HTML Tags, etc.
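
The examples in this unit rely on NLTK and a few downloadable resources. A minimal setup sketch (assuming NLTK has already been installed, for example with pip install nltk):

import nltk

# One-time downloads of the resources used by the examples below.
nltk.download('punkt')      # tokenizer models for word_tokenize() and sent_tokenize()
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # WordNet data used by WordNetLemmatizer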
Tokenization
Tokenization is splitting the raw text into small chunks of words or sentences,
called tokens. If the text is split into words, it is called Word Tokenization, and if
it is split into sentences, it is called Sentence Tokenization. Generally, the whitespace
character is used to perform word tokenization, and characters like the period,
exclamation mark, question mark, and newline are used for sentence tokenization.
Example: Word Tokenization
text = """There are multiple ways we can perform tokenization on given
text data. We can choose any method based on langauge, library and purpose
of modeling."""
# Split text by whitespace
tokens = text.split()
print(tokens)

Example 2: Sentence Tokenization


text = """A regular expression is a sequence of characters that define a
search pattern.Using Regular expression we can match character
combinations in string and perform word/sentence tokenization."""
sentences=text.split(".")
print(sentences)
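
As the sample text above hints, regular expressions can also be used to tokenize. A small sketch with Python's re module (the \w+ pattern is just one simple choice):

import re

sample = "Regular expressions can split text into word tokens!"
# \w+ matches maximal runs of letters, digits and underscores, so punctuation is dropped.
print(re.findall(r"\w+", sample))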

The Natural Language Toolkit (NLTK) is a library written in Python for natural language
processing. NLTK provides the function word_tokenize() for word tokenization and
sent_tokenize() for sentence tokenization.
Example
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
text = """Characters like periods, exclamation point and newline char are
used to separate the sentences. But one drawback with split() method, that
we can only use one separator at a time! So sentence tonenization wont be
foolproof with split() method."""
tokens = word_tokenize(text)
print("Words as Tokens")
print(tokens)
tokens=sent_tokenize(text)
print("Sentences as Tokens")
print(tokens)

Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input
text into the same casing format so that 'test', 'Test' and 'TEST' are treated the same way. This
is especially helpful for text featurization techniques like frequency counts and TF-IDF, as it
combines identical words, thereby reducing duplication and giving correct
counts/TF-IDF values. We can convert text to lower case simply by calling the string's lower()
method.
Example
import numpy as np
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
print("Original Text")
print(text)
text=text.lower()
df["text"][0]=text
print("After Converting into Lower Case")
print(text)
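
As a side note, when working with a whole pandas column, the same conversion can be done in a single vectorized call; a brief sketch reusing the df loaded above:

# Lower-case every row of the column at once instead of one value at a time.
df["text"] = df["text"].str.lower()
print(df["text"].head())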

Removal of Punctuations
Another common text preprocessing technique is to remove punctuation from the
text data. This is again a text standardization process that helps to treat 'hurray' and
'hurray!' in the same way.

We also need to carefully choose the list of punctuation characters to exclude depending on the
use case. For example, string.punctuation in Python contains the following
punctuation symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. We can add or remove
punctuation characters as per our need.
Example
import pandas as pd
import string
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][2]
print("Original Data")
print(text)
ps=string.punctuation
print("Punctuation Symbols:",ps)
new_text=""
for c in text:
    if c not in ps:
        new_text=new_text+c
df["text"][2]=new_text
print("After Removal of Punctuation Symbols")
print(new_text)
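
As a side note, the same removal can be done in a single call with str.translate; a brief sketch on a small sample string (not part of the original example):

import string

sample = "Hurray!!! We won, didn't we?"
# str.maketrans with three arguments builds a table that deletes every punctuation character.
print(sample.translate(str.maketrans("", "", string.punctuation)))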

Removing Numbers
Sometimes words and digits are written combined in the text, such as game57 or
game5ts7, which creates a problem for machines to understand. This type of word is
difficult to process, so it is better to remove it or replace it with an empty string.
We can remove digits and words containing digits by using the sub() method of the
re module. The syntax of the method is given below.
re.sub(pat, replacement, str)
This function searches for the specified pattern in the given string and replaces
every match with the specified replacement.
Example: Removing digits
import pandas as pd
import re

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][7]
print("Original Data")
print(text)
text=re.sub("[0-9]","",text)
df["text"][7]=text
print("After Removal of Digits")
print(text)

Example 2: Removing Words Containing Digits


import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][7]
print("Original data")
print(text)
toks=text.split()
new_toks=[]
for w in toks:
    w=re.sub(".*[0-9].*","",w)
    new_toks.append(w)
text=" ".join(new_toks)
df["text"][7]=text
print("After Removal of Words Containing Digits")
print(text)
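
A shorter, single-pass alternative (a sketch using a made-up sample string rather than the CSV file) removes every whitespace-delimited token that contains a digit with one re.sub() call:

import re

sample = "The game57 was played on field7a by 3 teams"
# \S*\d\S* matches any run of non-space characters containing at least one digit.
cleaned = re.sub(r"\S*\d\S*", "", sample)
# Collapse the extra spaces left behind by the removals.
cleaned = " ".join(cleaned.split())
print(cleaned)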

Removing Stop Words


Stop words are commonly occurring words in a language, like 'the', 'a', and so on. They
can be removed from the text most of the time, as they do not provide valuable
information for downstream analysis. However, in tasks like part-of-speech (POS) tagging,
we should not remove them, as they provide very valuable information about the POS.

These stop word lists are already compiled for different languages, and we can safely
use them. For example, the stop word list for the English language from the NLTK package
can be displayed and removed from text data as shown below.
Example
from nltk.corpus import stopwords
import numpy as np
import pandas as pd

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
text=text.lower()
print("Original Text")
print(text)
sw=stopwords.words('english')
print("List of stop words:",sw)
tokens=text.split()
new_tokens=[w for w in tokens if w not in sw]
text=" ".join(new_tokens)
df["text"]=text
print("After Removal of stop words")
print(text)

Removing URLs
The next preprocessing step is to remove any URLs present in the text data. If we scraped
the text data from the web, there is a good chance that it contains URLs. We might need
to remove them for further analysis. We can remove URLs from the text by using the
sub() method of the re module.
Example
import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][7]
print("Original Text")
print(text)
toks=text.split()
new_toks=[]
for t in toks:
    # \S represents any character except whitespace characters
    t=re.sub(r"https?://\S+|www\.\S+","",t)
    new_toks.append(t)
text=" ".join(new_toks)
df["text"][7]=text
print("After Removal of URLs")
print(text)

Removal of HTML Tags
Another common preprocessing technique that comes in handy in many places is the
removal of HTML tags. This is especially useful if we scrape data from different
websites, since we might end up with HTML strings as part of our text. We can remove
HTML tags using regular expressions.
Example
import pandas as pd
import re
text="The HTML <b> element defines bold text, without any extra
importance."
print("Original Text Data")
print(text)
new_toks=[]
tokens=text.split()
for t in tokens:
    t=re.sub("<.*>","",t)
    new_toks.append(t)
text=" ".join(new_toks)
print("Text Data After Removal of HTML Tags")
print(text)
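
Note that the token-wise pattern above works here because each tag forms its own whitespace-separated token. When matching against the whole string, a non-greedy character class such as <[^>]+> is safer than <.*>, which would swallow everything between the first '<' and the last '>'. A small sketch:

import re

html_text = "The HTML <b>element</b> defines <i>bold</i> text."
# <[^>]+> matches one tag at a time instead of greedily spanning across tags.
plain_text = re.sub(r"<[^>]+>", "", html_text)
print(plain_text)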

Removal of Emojis
With the growing usage of social media platforms, there has been an explosion in the use
of emojis in our day-to-day life as well. We might need to remove these emojis for some
of our textual analysis. In the regular expression below, the emoji ranges are written as
Unicode escape sequences, and the re.UNICODE flag is passed when compiling the pattern.
Example
import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
print(df["text"][0])
text=df["text"][0]
toks=text.split()
new_toks=[]
pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons

Prepared By: Arjun Singh Saud


u"\U0001F300-\U0001F5FF" # symbols &
pictographs
u"\U0001F680-\U0001F6FF" # transport & map
symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
"]+", flags=re.UNICODE)
for t in toks:
t=re.sub(pattern,"",t)
new_toks.append(t)
text=" ".join(new_toks)
print("After Removal of Words Containing Digits")
df["text"][7]=text
print(df["text"][7])

Stemming
Stemming is the process of converting a word to its most general form, or stem. This helps
in reducing the size of our vocabulary. Consider the words: learn, learning, learned
and learnt. All these words stem from the common root learn. However, in some
cases, the stemming process produces words that are not correct spellings of the root
word, for example, happi. That is because the stemmer chooses the most common stem for
related words. For example, we can look at the set of words that comprises the different
forms of happy: happy, happiness and happier. We can see that the prefix happi is more
commonly shared. We cannot choose happ because it is the stem of unrelated words like
happen. NLTK has different modules for stemming, and we will use the PorterStemmer
class, which implements the Porter stemming algorithm.
Example
import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
text=text.lower()
print("Original Text")
print(text)
stemmer = PorterStemmer()
toks=text.split()
new_toks=[]
for t in toks:
    rw=stemmer.stem(t)
    new_toks.append(rw)
text=" ".join(new_toks)
df["text"][0]=text
print("After Stemming")
print(text)
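
To see the behaviour described above for yourself, a short standalone check (assuming NLTK is installed) can be run without any data file:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Print the Porter stem of each related word form.
for word in ["happy", "happiness", "happier", "learn", "learning", "learned", "learnt"]:
    print(word, "->", stemmer.stem(word))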

Lemmatization
Lemmatization is a text preprocessing technique used in natural language processing
(NLP) models to break a word down to its root meaning in order to identify similarities. For
example, a lemmatization algorithm would reduce the word better to its root word, or
lemma, good.

In stemming, a part of the word is just chopped off at the tail end to arrive at the stem
of the word. There are different algorithms used to find out how many characters have
to be chopped off, but these algorithms do not actually know the meaning of the word in
the language it belongs to. In lemmatization, the algorithms do have this knowledge. In
fact, you can even say that these algorithms refer to a dictionary to understand the
meaning of the word before reducing it to its root word, or lemma. Stemming reduces the
size of text data massively and is hence faster when processing large amounts of text data;
however, it may result in meaningless words.

So, a lemmatization algorithm would know that the word better is derived from the
word good, and hence the lemma is good. A stemming algorithm would not be able
to do the same. There could be over-stemming or under-stemming, and the word
better could be reduced to bet, or bett, or just retained as better; there is no way
stemming can reduce better to its root word good. This is the difference between
stemming and lemmatization. Lemmatization preserves the meaning of words; however,
it is computationally more expensive and offers less aggressive dimensionality reduction
than stemming.

Example
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][1]
text=text.lower()
print("Original Text")
print(text)
lemmatizer = WordNetLemmatizer()
toks=text.split()
new_toks=[]
for t in toks:
    rw=lemmatizer.lemmatize(t)
    new_toks.append(rw)
text=" ".join(new_toks)
df["text"][1]=text
print("After Lemmatization")
print(text)
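
Note that lemmatize() treats words as nouns by default, so better is only mapped to good when the part of speech is supplied explicitly. A small standalone sketch (assuming the WordNet data has been downloaded via nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# With the default (noun) POS the word is returned unchanged.
print(lemmatizer.lemmatize("better"))            # -> better
# With pos="a" (adjective) WordNet maps it to its lemma.
print(lemmatizer.lemmatize("better", pos="a"))   # -> good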
