Unstructured Data Classification Hands-on
import numpy as np
import pandas as pd
Fill in the command to load the CSV dataset "imdb.csv" with pandas.
imdb=pd.read_csv('imdb.csv')
imdb.columns = ["index","text","label"]
print(imdb.head(5))
Data Analysis
data_size = imdb.shape   # (rows, columns) of the DataFrame
print(data_size)
imdb_col_names =list(imdb.columns)
print(imdb_col_names)
print(imdb.describe(include='all'))
print(imdb.head(3))
(1000, 3)
Target Identification
Execute the cell below to identify the target variable. If the label is 0 it is a bad review; if it is 1 it is a good review.
In [4]: imdb_target=imdb['label']
print(imdb_target)
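As a quick sanity check (not part of the original hands-on steps), the class balance of the target can be inspected; a minimal sketch, assuming imdb_target holds the 0/1 labels loaded above:
# Count how many reviews fall into each class (0 = bad, 1 = good).
print(imdb_target.value_counts())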
Tokenization
import nltk
nltk.download('all')
from nltk.tokenize import word_tokenize
def split_tokens(text):
    message = text.lower()                  # normalise case before tokenizing
    word_tokens = word_tokenize(message)    # split the review into word tokens
    return word_tokens
imdb['tokenized_message'] = imdb.text.apply(split_tokens)
Lemmatization
Apply the function split_into_lemmas to the column tokenized_message.
Print the 55th row from the column tokenized_message.
Print the 55th row from the column lemmatized_message.
from nltk.stem import WordNetLemmatizer

def split_into_lemmas(text):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in text:                    # text is the list of tokens
        a = lemmatizer.lemmatize(word)   # reduce each token to its base form
        lemma.append(a)
    return lemma
imdb['lemmatized_message'] = imdb.tokenized_message.apply(split_into_lemmas)
print('Tokenized message:',imdb.tokenized_message[54] )
print('Lemmatized message:',imdb.lemmatized_message[54] )
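To see what the lemmatizer does to a single token, it can be called directly; the word below is an illustrative example, not taken from the dataset:
# One-off check: WordNetLemmatizer maps a plural noun to its singular lemma.
print(WordNetLemmatizer().lemmatize('movies'))   # prints 'movie'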
from nltk.corpus import stopwords

def stopword_removal(text):
    stop_words = set(stopwords.words('english'))
    # Drop stopwords and rejoin the remaining lemmas into one string,
    # which the scikit-learn vectorizers below expect as input.
    filtered_sentence = ' '.join(word for word in text if word not in stop_words)
    return filtered_sentence
imdb['preprocessed_message'] = imdb.lemmatized_message.apply(stopword_removal)
print('Preprocessed message:',imdb.preprocessed_message[54])
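For reference, the NLTK stopword list that drives the filtering can be previewed directly; a minimal sketch:
# First few entries of NLTK's English stopword list.
print(stopwords.words('english')[:5])   # ['i', 'me', 'my', 'myself', 'we']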
Training_data = pd.Series(list(imdb['preprocessed_message']))
Training_label = pd.Series(list(imdb['label']))
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# The vectorizers were not initialized in the original cell; defaults are assumed here.
tf_vectorizer = CountVectorizer()      # term-document matrix of raw counts
tfidf_vectorizer = TfidfVectorizer()   # TF-IDF weighted term matrix
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)
message_data_TDM = tf_vectorizer.transform(Training_data)
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)
message_data_TFIDF = tfidf_vectorizer.transform(Training_data)
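To confirm what the vectorizers produced, the matrix shapes and a slice of the learned vocabulary can be printed; this is an optional sanity check and assumes scikit-learn >= 1.0 for get_feature_names_out:
# Rows are reviews, columns are vocabulary terms.
print(message_data_TDM.shape)
print(message_data_TFIDF.shape)
# Peek at the first ten terms the count vectorizer learned.
print(tf_vectorizer.get_feature_names_out()[:10])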
Perform a train-test split on message_data_TDM and Training_label with 90% as train data and 10% as test data.
In [12]: seed = 9
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# 90/10 split of the term-document matrix and its labels
train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=0.1, random_state=seed)
train_data_shape = train_data.shape
test_data_shape = test_data.shape
# The original cell does not name this first classifier; a decision tree is assumed here.
classifier = DecisionTreeClassifier(random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
# The output file name is assumed; the hands-on only shows the write call.
with open('output.txt', 'w') as file: file.write(str((imdb['tokenized_message'][55], imdb['lemmatized_message'][55])))
Perform the train-test split on message_data_TDM and Training_label again, this time with 80% as train data and 20% as test data.
Get the shape of the train data and print it.
Get the shape of the test data and print it.
Initialize the SVM classifier (scikit-learn's SGDClassifier) with the following parameters:
loss = 'modified_huber'
shuffle = True
random_state = seed
Train the model with train_data and train_label.
Now predict the output with test_data.
Evaluate the classifier with the score from test_data and test_label.
Print the predicted score.
from sklearn.linear_model import SGDClassifier
train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=0.2, random_state=seed)  # 80/20 split
train_data_shape = train_data.shape
print(train_data_shape)
test_data_shape = test_data.shape
print(test_data_shape)
classifier = SGDClassifier(loss='modified_huber', shuffle=True, random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print(score)
with open('output.txt', 'w') as file: file.write(str(imdb['preprocessed_message'][55]))  # assumed file name, as above
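Beyond the single accuracy score, a per-class breakdown can help diagnose the classifier; this optional sketch (not part of the graded steps) uses scikit-learn's classification_report on the predictions made above:
# Precision, recall and F1 for the bad (0) and good (1) classes.
from sklearn.metrics import classification_report
print(classification_report(test_label, target))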