Fake News Classifier

The document shows the steps taken to perform text classification on a dataset using Naive Bayes. It includes data loading and preprocessing steps like removing stop words and special characters, splitting the data into training and test sets, applying CountVectorizer and TfidfVectorizer to generate feature vectors, and fitting and evaluating a Multinomial Naive Bayes classifier on both the count vectorized and tfidf vectorized data. The accuracy scores from the Naive Bayes models on both feature representations are printed.

Uploaded by

K.Thirumal Reddy

Import the necessary libraries

-----------------------------------------------------------

import nlp_utils
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Loading dataset
--------------------------------------------------------------

df=pd.read_csv('train.csv')

Count the number of rows and columns in the dataset

--------------------------------------------------------------
df.shape

Increase the display width of the columns so that long text is not truncated

-------------------------------------------------------------------
pd.set_option('display.max_colwidth', None)

df['title']

df['text']

Print the number of data points belonging to each category

-----------------------------------------------------------------

df['label'].value_counts()

Count the number of null values present in the dataset

----------------------------------------------------------------

df.isnull().sum()

Remove the null values from the dataset

-------------------------------------------------------------------
df=df.dropna()

Check that no null values remain in the dataset

-------------------------------------------------------------------
df.isnull().sum()

Reset the index of the DataFrame so that rows are numbered 0 to len(df)-1 again after dropna

---------------------------------------------------
df.reset_index(drop=True, inplace=True)

Text cleaning
___________________________________________________________________________________

import re
import string

Remove all words that contain digits
___________________________________________________________________________________

alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)

Convert all strings to lowercase and replace punctuation with spaces

___________________________________________________________________________________

punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

Remove all '\n' in the string and replace it with a space

___________________________________________________________________________________

remove_n = lambda x: re.sub("\n", " ", x)

Remove all non-ascii characters

___________________________________________________________________________________

remove_non_ascii = lambda x: re.sub(r'[^\x00-\x7f]',r' ', x)

Apply all the lambda functions written previously through .map on the text column
___________________________________________________________________________________

df['text'] = df['text'].map(alphanumeric).map(punc_lower).map(remove_n).map(remove_non_ascii)

df['text']
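To see what the four cleaning steps do together, here is a small sketch that runs them on one made-up string instead of the DataFrame column. The function names, the `clean` helper, and the final whitespace-collapsing step are illustrative assumptions and are not part of the notes above.

```python
import re
import string

# The same four substitutions as the lambdas above, as named functions.
def drop_words_with_digits(text):
    return re.sub(r'\w*\d\w*', ' ', text)

def lower_and_strip_punctuation(text):
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())

def drop_newlines(text):
    return re.sub(r'\n', ' ', text)

def drop_non_ascii(text):
    return re.sub(r'[^\x00-\x7f]', ' ', text)

def clean(text):
    # Apply the steps in the same order as the .map chain.
    for step in (drop_words_with_digits, lower_and_strip_punctuation,
                 drop_newlines, drop_non_ascii):
        text = step(text)
    # Extra step (assumption): collapse the runs of spaces left behind.
    return ' '.join(text.split())

# Made-up sample with a digit word, punctuation, a newline and curly quotes.
sample = "Breaking: 3rd REPORT says\n\u201cCOVID19\u201d is over!"
cleaned = clean(sample)
print(cleaned)  # breaking report says is over
```

Each substitution inserts a space rather than deleting, so neighbouring words never get glued together; the price is repeated spaces, which split() later ignores.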

Removing stop words

___________________________________________________________________________________

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words=stopwords.words('english')
#DataFrame.apply(Function_to_apply_to_each_row)
def rem_stopword(data):
    # Keep only the words that are not in the NLTK English stop-word list
    li=[]
    for w in data.split():
        if w not in stop_words:
            li.append(w)
    return " ".join(li)

#Testing Removing stop words

data="All the students of Third Year CSM are studying NLP "
print(rem_stopword(data))
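The comparison in rem_stopword is case-sensitive, and the NLTK English list is all lowercase, so a capitalised word such as "All" is only removed if the text was lowercased first (as the cleaning step does for df['text']). The sketch below shows a case-insensitive variant. The tiny STOP_WORDS set is a stand-in assumption so the example runs without downloading NLTK data, and remove_stopwords is a hypothetical name.

```python
# A tiny illustrative stop-word set (assumption); the notes use NLTK's full list.
STOP_WORDS = {"all", "the", "of", "are"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Compare on the lowercased token so "All" and "all" are treated alike.
    # A set gives constant-time membership checks, unlike a list.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

data = "All the students of Third Year CSM are studying NLP "
result = remove_stopwords(data)
print(result)  # students Third Year CSM studying NLP
```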

Removing stop words and stemming the text

___________________________________________________________________________________

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words('english') for every word is slow
stop_set = set(stopwords.words('english'))
corpus = []
for i in range(0, len(df)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_set]
    review = ' '.join(review)
    corpus.append(review)

Splitting the dataframe

___________________________________________________________________________________

We select the label column as Y

Y=df['label']

Making train and test data

___________________________________________________________________________________

Split the data into 70 percent train and 30 percent test

X_train, X_test, Y_train, Y_test = train_test_split(df['text'], Y, test_size=0.30, random_state=40)
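As a rough picture of what a seeded 70/30 split does, the sketch below shuffles the row positions with a fixed seed and cuts them into two parts. It is a plain-Python illustration only, not scikit-learn's actual procedure, so it will not reproduce the same rows as train_test_split; shuffled_split and the toy data are assumptions.

```python
import random

def shuffled_split(items, labels, test_size=0.30, seed=40):
    # Shuffle the row positions reproducibly, then cut off the test part.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Texts and labels are picked with the same indices so they stay aligned.
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data (assumption): ten placeholder articles with alternating labels.
texts = ["article %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, X_test, Y_train, Y_test = shuffled_split(texts, labels)
print(len(X_train), len(X_test))  # 7 3
```

Fixing the seed is what makes the split repeatable from run to run, which is the role random_state=40 plays in the call above.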
Tfidf vectorizer
___________________________________________________________________________________

Applying tfidf to the data set

tfidf_vect = TfidfVectorizer(stop_words = 'english',max_df=0.7)

tfidf_train = tfidf_vect.fit_transform(X_train)
tfidf_test = tfidf_vect.transform(X_test)
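The weights that TfidfVectorizer produces can be worked out by hand on a toy corpus. The sketch below uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is scikit-learn's default behaviour; stop-word removal and max_df filtering are left out to keep it short. The three documents and the tfidf_vector helper are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (assumption), already cleaned and whitespace-tokenisable.
docs = [
    "senate passes budget bill",
    "aliens endorse budget bill",
    "senate denies aliens rumor",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(doc_tokens):
    tf = Counter(doc_tokens)
    # Raw term count times smoothed inverse document frequency.
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t in tf}
    # L2-normalise so every document vector has length 1.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(term, round(weight, 3))
```

In the first document "passes" occurs in one document only, so it gets a higher weight than "senate", "budget" or "bill", which each occur in two.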

Count vectorizer
___________________________________________________________________________________

count_vect = CountVectorizer(stop_words = 'english')

count_train = count_vect.fit_transform(X_train.values)
count_test = count_vect.transform(X_test.values)
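The fit_transform / transform pair matters because the vocabulary is learned from the training texts only, and words that appear only in the test texts are ignored. A minimal bag-of-words sketch of that idea follows; fit_vocabulary, transform and the toy sentences are illustrative assumptions, and stop-word removal is left out.

```python
from collections import Counter

def fit_vocabulary(docs):
    # One column per distinct training term, sorted for a stable order.
    terms = sorted({t for d in docs for t in d.split()})
    return {t: i for i, t in enumerate(terms)}

def transform(docs, vocab):
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for term, count in Counter(d.split()).items():
            if term in vocab:  # terms unseen during fitting are dropped
                row[vocab[term]] = count
        rows.append(row)
    return rows

# Toy data (assumption).
train_docs = ["fake news spreads fast", "real news spreads slowly"]
test_docs = ["fake fake rumor spreads"]

vocab = fit_vocabulary(train_docs)
count_train = transform(train_docs, vocab)
count_test = transform(test_docs, vocab)
print(sorted(vocab))  # ['fake', 'fast', 'news', 'real', 'slowly', 'spreads']
print(count_test)     # [[2, 0, 0, 0, 0, 1]]
```

"rumor" never occurred in the training texts, so it has no column and contributes nothing to the test vector.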

Naive Bayes model on tfidf

___________________________________________________________________________________

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import accuracy_score

clf = MultinomialNB()
clf.fit(tfidf_train, Y_train)
pred = clf.predict(tfidf_test)
score = metrics.accuracy_score(Y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(Y_test, pred)
print(cm)
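For intuition about what MultinomialNB fits, here is a from-scratch sketch of multinomial Naive Bayes on raw word counts with add-one (Laplace) smoothing, which corresponds to the default alpha=1.0. It is a simplified illustration, not scikit-learn's implementation; the function names, the toy headlines and the 1/0 labels are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    vocab = {t for d in docs for t in d.split()}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        term_counts[y].update(d.split())
    model = {}
    for y in class_docs:
        total = sum(term_counts[y].values())
        log_prior = math.log(class_docs[y] / len(docs))
        # Smoothed P(term | class); alpha keeps unseen terms from zeroing the product.
        log_lik = {t: math.log((term_counts[y][t] + alpha) /
                               (total + alpha * len(vocab)))
                   for t in vocab}
        model[y] = (log_prior, log_lik)
    return model

def predict(model, doc):
    scores = {}
    for y, (log_prior, log_lik) in model.items():
        # Sum log-probabilities; words outside the training vocabulary are skipped.
        scores[y] = log_prior + sum(log_lik[t] for t in doc.split() if t in log_lik)
    return max(scores, key=scores.get)

# Toy data (assumption): 1 and 0 are placeholder class labels.
train_docs = ["shocking secret cure revealed", "miracle cure shocking doctors",
              "council approves road budget", "budget vote passes council"]
train_labels = [1, 1, 0, 0]
test_docs = ["shocking miracle revealed", "council budget vote"]
test_labels = [1, 0]

model = train_multinomial_nb(train_docs, train_labels)
preds = [predict(model, d) for d in test_docs]
accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(preds, accuracy)  # [1, 0] 1.0
```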
Naive Bayes model on Count Vectorized
__________________________________________________________________

clf = MultinomialNB()
clf.fit(count_train, Y_train)
pred1 = clf.predict(count_test)
score = metrics.accuracy_score(Y_test, pred1)
print("accuracy: %0.3f" % score)
cm2 = metrics.confusion_matrix(Y_test, pred1)
print(cm2)
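When reading the printed matrices, note that in scikit-learn's confusion_matrix the rows are the true labels and the columns are the predicted labels, so the diagonal holds the correct predictions. The sketch below rebuilds such a matrix by hand and derives the accuracy from it; the label vectors are made up and the function is an illustrative stand-in, not the library code.

```python
def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        # Row = true label, column = predicted label.
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up labels (assumption).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# Accuracy is the diagonal (correct predictions) over all predictions.
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)
print(cm)                    # [[2, 1], [1, 2]]
print("accuracy: %0.3f" % accuracy)  # accuracy: 0.667
```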
