Unstructured Data Classification Hands-on
import numpy as np
import pandas as pd
Fill in the command to load the CSV dataset "imdb.csv" with pandas.
imdb=pd.read_csv('imdb.csv')
imdb.columns = ["index","text","label"]
print(imdb.head(5))
Data Analysis
data_size = imdb.shape   # (rows, columns) of the DataFrame
print(data_size)
imdb_col_names =list(imdb.columns)
print(imdb_col_names)
print(imdb.describe(include='all'))
print(imdb.head(3))
(1000, 3)
Target Identification
Execute the cell below to identify the target variable. If the label is 0 it is a bad review; if it is 1 it is a good review.
In [4]: imdb_target=imdb['label']
print(imdb_target)
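As a quick sanity check (not part of the original hands-on steps), the class balance of the target can be inspected; a minimal sketch, assuming imdb_target holds the 0/1 labels loaded above:
# Count how many reviews fall into each class (0 = bad, 1 = good).
print(imdb_target.value_counts())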
Tokenization
import nltk
nltk.download('all')
from nltk.tokenize import word_tokenize
def split_tokens(text):
    message = text.lower()                  # normalise case before tokenizing
    word_tokens = word_tokenize(message)    # split the review into word tokens
    return word_tokens
imdb['tokenized_message'] = imdb.text.apply(split_tokens)
Lemmatization
Apply the function split_into_lemmas to the column tokenized_message.
Print the 55th row from the column tokenized_message.
Print the 55th row from the column lemmatized_message.
from nltk.stem import WordNetLemmatizer

def split_into_lemmas(text):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in text:                    # text is the list of tokens
        a = lemmatizer.lemmatize(word)   # reduce each token to its base form
        lemma.append(a)
    return lemma
imdb['lemmatized_message'] = imdb.tokenized_message.apply(split_into_lemmas)
print('Tokenized message:',imdb.tokenized_message[54] )
print('Lemmatized message:',imdb.lemmatized_message[54] )
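To see what the lemmatizer does to a single token, it can be called directly; the word below is an illustrative example, not taken from the dataset:
# One-off check: WordNetLemmatizer maps a plural noun to its singular lemma.
print(WordNetLemmatizer().lemmatize('movies'))   # prints 'movie'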
from nltk.corpus import stopwords

def stopword_removal(text):
    stop_words = set(stopwords.words('english'))
    # Drop stopwords and rejoin the remaining lemmas into one string,
    # which the scikit-learn vectorizers below expect as input.
    filtered_sentence = ' '.join(word for word in text if word not in stop_words)
    return filtered_sentence
imdb['preprocessed_message'] = imdb.lemmatized_message.apply(stopword_removal)
print('Preprocessed message:',imdb.preprocessed_message[54])
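For reference, the NLTK stopword list that drives the filtering can be previewed directly; a minimal sketch:
# First few entries of NLTK's English stopword list.
print(stopwords.words('english')[:5])   # ['i', 'me', 'my', 'myself', 'we']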
Training_data = pd.Series(list(imdb['preprocessed_message']))
Training_label = pd.Series(list(imdb['label']))
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# The vectorizers were not initialized in the original cell; defaults are assumed here.
tf_vectorizer = CountVectorizer()      # term-document matrix of raw counts
tfidf_vectorizer = TfidfVectorizer()   # TF-IDF weighted term matrix
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)
message_data_TDM = tf_vectorizer.transform(Training_data)
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)
message_data_TFIDF = tfidf_vectorizer.transform(Training_data)
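To confirm what the vectorizers produced, the matrix shapes and a slice of the learned vocabulary can be printed; this is an optional sanity check and assumes scikit-learn >= 1.0 for get_feature_names_out:
# Rows are reviews, columns are vocabulary terms.
print(message_data_TDM.shape)
print(message_data_TFIDF.shape)
# Peek at the first ten terms the count vectorizer learned.
print(tf_vectorizer.get_feature_names_out()[:10])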
Perform a train-test split on message_data_TDM and Training_label with 90% as train data and 10% as test data.
In [12]: seed = 9
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# 90/10 split of the term-document matrix and its labels
train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=0.1, random_state=seed)
train_data_shape = train_data.shape
test_data_shape = test_data.shape
# The original cell does not name this first classifier; a decision tree is assumed here.
classifier = DecisionTreeClassifier(random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
# The output file name is assumed; the hands-on only shows the write call.
with open('output.txt', 'w') as file: file.write(str((imdb['tokenized_message'][55], imdb['lemmatized_message'][55])))
Perform the train-test split on message_data_TDM and Training_label again, this time with 80% as train data and 20% as test data.
Get the shape of the train data and print it.
Get the shape of the test data and print it.
Initialize the SVM classifier (scikit-learn's SGDClassifier) with the following parameters:
loss = 'modified_huber'
shuffle = True
random_state = seed
Train the model with train_data and train_label.
Now predict the output with test_data.
Evaluate the classifier with the score from test_data and test_label.
Print the predicted score.
from sklearn.linear_model import SGDClassifier
train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=0.2, random_state=seed)  # 80/20 split
train_data_shape = train_data.shape
print(train_data_shape)
test_data_shape = test_data.shape
print(test_data_shape)
classifier = SGDClassifier(loss='modified_huber', shuffle=True, random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print(score)
with open('output.txt', 'w') as file: file.write(str(imdb['preprocessed_message'][55]))  # assumed file name, as above
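Beyond the single accuracy score, a per-class breakdown can help diagnose the classifier; this optional sketch (not part of the graded steps) uses scikit-learn's classification_report on the predictions made above:
# Precision, recall and F1 for the bad (0) and good (1) classes.
from sklearn.metrics import classification_report
print(classification_report(test_label, target))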