AI Phase5

The document outlines a project focused on developing a machine learning model for fake news detection using Natural Language Processing (NLP) techniques. The project aims to create a robust model that can accurately differentiate between genuine and fake news articles, leveraging a dataset from Kaggle and employing methods like CNN and BiLSTM. Key challenges include text preprocessing, feature extraction, and model evaluation, with the ultimate goal of combating misinformation and promoting credible journalism.

Uploaded by

Dousik Manokaran

NAAN MUDHALVAN

IBM: AI101
ARTIFICIAL INTELLIGENCE
PHASE 5

Fake News Detection Using NLP
PROJECT NO: 08

TEAM MEMBERS:
1. VIJAI SURIA M
2. KIRANKUMAR M
3. GIRIDHARAN S S
4. NITHISH T

MENTOR
➢ Ms. KEERTHANA R

MADRAS INSTITUTE OF TECHNOLOGY, ANNA UNIVERSITY, CHENNAI


Fake News Detection Using NLP
PHASE 5
FINAL DOCUMENT

Problem Statement: The fake news dataset is one of the classic text analytics
datasets available on Kaggle. It consists of the titles and text of genuine and
fake articles from different authors. Our job is to create a model that predicts
whether a given news article is real or fake.
Objective: The objective of this project is to develop a machine learning model
that can accurately distinguish between genuine and fake news articles based on
their titles and text content. By doing so, we aim to contribute to the fight against
the spread of misinformation and fake news, which can have significant social
and political consequences.
Data Source: We will use a fake news dataset available on Kaggle. This dataset
contains articles' titles and text, along with their corresponding labels indicating
whether the news is genuine or fake.
Dataset link: [Link]
news-dataset

Project Objective:
The objective of this project is to develop a robust and accurate machine learning
model for detecting fake news using Natural Language Processing (NLP)
techniques. Because misinformation can have far-reaching consequences, our
goal is to help distinguish genuine news articles from fake ones. Using text
analysis and deep learning, we aim to build a tool that helps curb the spread of
false information and promotes credible news sources. The project uses a
dataset of news articles (titles and content) to design, train, and evaluate a
model that predicts whether a news report is authentic.
Introduction
In today's information age, the rapid dissemination of news and information is
both a blessing and a curse. While it allows for quick access to valuable
knowledge, it also presents opportunities for the spread of fake news,
misinformation, and rumors. Fake news can have dire consequences, influencing
public opinion, affecting elections, and causing social unrest. Therefore, it is of
paramount importance to develop tools that can automatically discern between
genuine and fake news.
This project focuses on the application of Natural Language Processing (NLP)
techniques and machine learning to tackle the challenge of fake news detection.
We will utilize a dataset comprising news articles' titles and content, labeled as
either genuine or fake. Our approach involves text preprocessing, feature
extraction, and the construction of a deep learning model that combines
Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory
(BiLSTM) layers. We implement the model in TensorFlow, which supports
efficient training and scales to larger datasets.
Our project's significance lies in its potential to enhance media literacy and
empower individuals to make more informed decisions about the information they
consume. By achieving high accuracy in detecting fake news, we aim to contribute
to the broader mission of promoting credible journalism and combatting the
spread of false narratives.
Key Challenges:
1. Text Preprocessing: Raw text data often contains noise and irrelevant
information. Preprocessing is essential to clean and transform the text data
into a suitable format for analysis and modelling.
2. Feature Extraction: Converting text into numerical features is crucial for
machine learning models to understand and make predictions. We will
explore techniques like TF-IDF and word embeddings for feature extraction.
3. Model Selection: Choosing an appropriate machine learning algorithm is
critical for achieving high classification accuracy. We plan to use a
Convolutional Neural Network (CNN) combined with a Bidirectional Long
Short-Term Memory (BiLSTM) architecture, implemented using TensorFlow,
to build our fake news detection model.
4. Model Evaluation: To assess the model's performance, we will use various
evaluation metrics such as accuracy, precision, recall, F1-score, and the
Receiver Operating Characteristic Area Under Curve (ROC-AUC). The choice
of metrics will depend on the project's specific requirements and the
importance of false positives and false negatives.
Design Thinking
1. Data Source
We will begin by obtaining the fake news dataset from Kaggle, which
contains a substantial collection of news articles along with their
associated labels (real or fake). This dataset will serve as the foundation for
our fake news detection project.
2. Data Preprocessing
Before feeding the text data into our machine learning model, we need to
preprocess it to ensure that it is in a clean and standardized format. Data
preprocessing steps will include:
• Text Cleaning: Removing any HTML tags, special characters, and
irrelevant symbols.
• Tokenization: Splitting the text into individual words or tokens.
• Stopword Removal: Eliminating common and uninformative words
such as "the," "is," and "and."
• Lemmatization or Stemming: Reducing words to their base or root
form to normalize text.
• Text Vectorization: Converting the text data into numerical
representations for modeling.
3. Feature Extraction
We will explore two common techniques for text feature extraction:
a) TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a
statistical measure that evaluates the importance of a word in a
document relative to a collection of documents. We will use it to
convert the text data into a matrix of TF-IDF features.
b) Word Embeddings: Word embeddings, such as Word2Vec or GloVe,
can capture semantic relationships between words. We will
experiment with pre-trained word embeddings or train custom
embeddings on our dataset.
4. Model Selection
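Our model choice combines a Convolutional Neural Network (CNN) with a
Bidirectional LSTM (BiLSTM), implemented using TensorFlow, alongside
classical machine learning classifiers. The convolutional layers capture
local patterns such as short phrases, and the BiLSTM captures longer-range
context across the article, which makes the architecture suitable for
fake news detection.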
5. Model Training
The model will be trained using the pre-processed and feature-engineered
text data. We will split the dataset into training, validation, and test sets to
ensure proper model evaluation. Training will involve optimizing model
parameters and monitoring performance using appropriate metrics.
6. Evaluation
To evaluate the effectiveness of our fake news detection model, we will
employ a range of evaluation metrics:
• Accuracy: Measures the overall correctness of the model's
predictions.
• Precision: Calculates the ratio of true positive predictions to the total
positive predictions, indicating the model's ability to avoid false
positives.
• Recall: Calculates the ratio of true positive predictions to the total
actual positives, indicating the model's ability to capture all positive
instances.
• F1-Score: Harmonic mean of precision and recall, providing a
balanced measure of model performance.
• ROC-AUC: Measures the area under the Receiver Operating
Characteristic curve, indicating the model's ability to distinguish
between real and fake news.
These metrics will help us assess the model's performance
comprehensively and make any necessary improvements to achieve our
goal of accurately detecting fake news articles.
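The metric definitions above can be sketched directly from the confusion counts. This is a minimal illustration, not the project's evaluation code: the function name `binary_metrics`, the toy label lists, and the convention that label 1 means "fake" (the positive class) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists.

    `positive` is the label treated as the positive class
    (assumed here to be 1 = fake news).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f1)


# Toy labels (hypothetical): 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

With these toy labels there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four metrics come out at 0.75. On an imbalanced dataset the four values would diverge, which is why the report lists them separately.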
INNOVATIONS:
1. Hybrid Approach:
Incorporating a hybrid approach that combines content-based and
social context-based features to identify fake news. An example is
the Transformer-based model proposed by Raza and Ding, which
utilizes both news article information and social context to enhance
fake news detection. This model utilizes a Transformer
architecture, comprising an encoder for learning useful
representations from fake news data and a decoder for predicting
future behavior based on past observations. It also integrates
numerous features from news content and social contexts to
improve classification accuracy.
2. Multimodal Approach:
Employing a multimodal approach that leverages both textual and
visual data for fake news detection. A model like the one proposed
by Wang et al. could be adopted, which employs a multimodal
deep neural network to merge textual and visual features. This
model consists of three key components: a text encoder for
extracting textual features from news content, an image encoder
for extracting visual features from news images, and a fusion
module to combine these features and make a final prediction.
3. Transfer Learning:
Utilizing transfer learning techniques to improve fake news
detection performance. For instance, we can employ pre-trained
models like BERT, which is a pre-trained language representation
model that can be fine-tuned for various natural language
processing tasks, including fake news detection. BERT is proficient
at capturing both syntactic and semantic information from
extensive text corpora and can be easily adapted to various
domains and languages.
4. Ensemble Learning:
Implementing ensemble learning methods to combine the
predictions of multiple models. By leveraging the diversity of
different models, we can potentially enhance the accuracy and
robustness of our fake news detection system.

5. Explainable AI (XAI):
Integrating XAI techniques to provide transparency and
interpretability in fake news detection. This ensures that the
model's decisions can be understood and validated, which
is crucial for building trust in the system.
6. Continuous Learning:
Implementing continuous learning mechanisms to adapt to
evolving fake news patterns and emerging disinformation
tactics. This involves regularly updating the model with new
data to ensure it remains effective over time.
7. User Feedback Integration:
Incorporating user feedback mechanisms to gather input from
users and improve the model's performance based on real-world
usage and user perceptions of news credibility.
8. Cross-lingual and Cross-cultural Adaptation:
Extending the model's capabilities to detect fake news in
multiple languages and adapt to different cultural contexts,
thereby enhancing its applicability on a global scale.

TOOLS:
Google Colab: Google Colab, a cloud-based Jupyter notebook
environment, serves as our primary coding platform.
Algorithms and Techniques:
1. TF-IDF (Term Frequency-Inverse Document Frequency): Used for
text feature extraction to convert text data into numerical form.
2. Multinomial Naive Bayes: A classification algorithm for text data
often used for spam and fake news detection.
3. Logistic Regression: A classification algorithm for binary and multi-
class classification tasks.
4. Random Forest: An ensemble learning method for classification and
regression tasks.
5. Passive Aggressive Classifier: A type of online learning algorithm for
text classification.
6. Decision Tree: A classification algorithm that uses a tree structure
for decision-making.
7. Train-Test Split: A technique to split the dataset into training and
testing sets for model evaluation.
8. Confusion Matrix: A tool for evaluating classification model
performance.
9. Precision, Recall, F1-Score: Metrics for evaluating the performance
of classification models.
10. ROC Curve (Receiver Operating Characteristic): Used to assess the
performance of binary classification models.
11. Stopwords Removal: A text preprocessing technique to remove
common words that do not contribute much information.
12. Lowercasing: Converting text to lowercase to ensure uniformity.
13. Tokenization: Breaking text into words or tokens for analysis.

IMPLEMENTATION STEPS:
1) Import Necessary Libraries:
Start by importing the required Python libraries, such as pandas,
numpy, scikit-learn, and natural language processing libraries like
NLTK or spaCy.
CODE:

2) Load and Explore the Dataset:


Load the CSV files '[Link]' and '[Link]' using pandas and explore
the dataset to understand its structure.
CODE:
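A minimal sketch of this step with pandas. It assumes that one CSV file holds the fake articles and the other the genuine ones, and that each has at least `title` and `text` columns; the function names `load_news` and `explore`, the `label` column, and the 1 = fake / 0 = real encoding are illustrative choices, not taken from the original notebook. The file paths are passed in as parameters because the actual file names are not given here.

```python
import pandas as pd


def load_news(fake_csv, real_csv):
    """Load the two CSV files and return one labelled, shuffled DataFrame.

    Assumption: `fake_csv` contains only fake articles and `real_csv`
    only genuine ones, so the label can be assigned per file.
    """
    fake = pd.read_csv(fake_csv)
    real = pd.read_csv(real_csv)
    fake["label"] = 1  # 1 = fake (hypothetical encoding)
    real["label"] = 0  # 0 = real
    data = pd.concat([fake, real], ignore_index=True)
    # Shuffle so that the two classes are interleaved before splitting.
    return data.sample(frac=1, random_state=42).reset_index(drop=True)


def explore(data):
    """Print a quick overview of the dataset structure."""
    print(data.shape)                    # rows and columns
    print(data.head())                   # first few records
    print(data["label"].value_counts())  # class balance
    print(data.isnull().sum())           # missing values per column
```

Calling `explore(load_news(fake_path, real_path))` with the two dataset paths prints the shape, a preview, the class balance, and the number of missing values per column.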

3) Data Preprocessing:
Data preprocessing is essential for text data. Perform the following
preprocessing steps:
Lowercasing: Convert text to lowercase.
Tokenization: Split text into words or tokens.
Stopword Removal: Remove common words like 'and', 'the',
etc.
Text Vectorization: Convert text into numerical format (e.g.,
using TF-IDF or Count Vectorization).

CODE:
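The first three preprocessing steps can be sketched in plain Python as follows. The stopword set here is a small hand-written sample used only to keep the example self-contained; in practice a fuller list from NLTK or spaCy would likely be used. The function name `preprocess` and the HTML-tag stripping are assumptions for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; a real run would use a
# fuller list such as the one shipped with NLTK or spaCy).
STOPWORDS = {
    "a", "an", "and", "the", "is", "are", "was", "of", "to",
    "in", "on", "for", "that", "this", "it",
}


def preprocess(text):
    """Lowercase, strip HTML tags, tokenize, and remove stopwords."""
    text = text.lower()                     # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)    # drop HTML tags
    tokens = re.findall(r"[a-z0-9]+", text)  # tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


print(preprocess("The Senate is voting on <b>the</b> new Bill, and it PASSED!"))
# → ['senate', 'voting', 'new', 'bill', 'passed']
```

Text vectorization, the last step in the list, is handled separately once the tokens are cleaned.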

4) Feature Extraction (TF-IDF):


Choose a text vectorization method. You can use either Count
Vectorization or TF-IDF Vectorization.
CODE:

5) Split the Data into Training and Testing Sets:


Split the dataset into training and testing sets to evaluate the model's
performance.
CODE:
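A minimal sketch of the split using scikit-learn's `train_test_split`. The 80/20 ratio, the fixed `random_state`, and the stratification are assumptions for illustration; the toy `texts` and `labels` stand in for the real article texts and their labels.

```python
from sklearn.model_selection import train_test_split

# Placeholder data (hypothetical): 10 articles, alternating real/fake.
texts = [f"article {i}" for i in range(10)]
labels = [0, 1] * 5

# Hold out 20% for testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))  # → 8 2
```

Fitting the vectorizer on the training portion only, and then transforming the test portion with it, avoids leaking test-set vocabulary statistics into training.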

SAMPLE OUTPUT (Data Preprocessing and splitting):

6) Model Training:
In this step we train the Multinomial Naive Bayes model on the
training set. The model learns to separate genuine from fake news
articles using the TF-IDF vectors prepared in the previous steps.
The same training set is then used for the Decision Tree, Passive
Aggressive, Random Forest, and Logistic Regression classifiers.

CODE: (MULTINOMIAL NAIVE BAYES ALGORITHM)
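A minimal sketch of training Multinomial Naive Bayes on TF-IDF features with scikit-learn. The four training sentences and their labels (1 = fake, 0 = real) are invented toy data, and wrapping the vectorizer and classifier in a pipeline is a design choice for the example, not necessarily how the original notebook is organized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical): 0 = real, 1 = fake.
train_texts = [
    "government publishes official budget report",
    "scientists confirm study results in journal",
    "shocking miracle cure doctors hate",
    "secret aliens control the election shocking",
]
train_labels = [0, 0, 1, 1]

# The pipeline fits the TF-IDF vocabulary and the classifier together,
# so new text is vectorized the same way at prediction time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(["shocking secret miracle"])
print(pred)  # → [1]
```

The other scikit-learn classifiers used in this step follow the same `fit` / `predict` interface, so they can be dropped into the pipeline in place of `MultinomialNB()`.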

OUTPUT: (MULTINOMIAL NAIVE BAYES ALGORITHM)



CODE: (DECISION TREE)

OUTPUT:

CODE: (PASSIVE AGGRESSIVE CLASSIFIER)

OUTPUT: (PASSIVE AGGRESSIVE CLASSIFIER)



CODE: (RANDOM FOREST)

OUTPUT: (RANDOM FOREST)



CODE: (LOGISTIC REGRESSION)

OUTPUT: (LOGISTIC REGRESSION)



7) Model Evaluation:
After training, we evaluate each model on the held-out test set to
measure how accurately it classifies news articles as genuine or
fake. We report accuracy, a confusion matrix, and a classification
report (precision, recall, and F1-score per class).
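A minimal sketch of this evaluation with scikit-learn's metrics module. The `y_test` and `y_pred` lists are invented so that the example runs on its own; in the project they would be the true test labels and a trained model's predictions. The class names "real" and "fake" and the 0/1 encoding are assumptions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical test labels and predictions: 0 = real, 1 = fake.
y_test = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes, in label order.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print(acc)  # → 0.75
print(cm)
print(classification_report(y_test, y_pred, target_names=["real", "fake"]))
```

In the confusion matrix for this toy data, the diagonal holds the 3 correctly classified real and 3 correctly classified fake articles, and the off-diagonal entries hold the one false positive and one false negative.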

RESULT ANALYSIS GRAPH:



SUMMARY GRAPH:

8) Model Validation and Prediction:
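A minimal sketch of predicting the label of a new, unseen article. It assumes a fitted scikit-learn pipeline; Logistic Regression is used here only as one of the classifiers listed earlier. The helper `predict_news`, the `LABEL_NAMES` mapping, and the toy training sentences are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed label encoding: 0 = real, 1 = fake.
LABEL_NAMES = {0: "REAL", 1: "FAKE"}


def predict_news(fitted_model, text):
    """Return a readable label for one piece of news text."""
    return LABEL_NAMES[int(fitted_model.predict([text])[0])]


# Toy training data (hypothetical), standing in for the real training set.
train_texts = [
    "government publishes official budget report",
    "scientists confirm study results in journal",
    "shocking miracle cure doctors hate",
    "secret aliens control the election shocking",
]
train_labels = [0, 0, 1, 1]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

print(predict_news(clf, "shocking secret miracle"))
print(predict_news(clf, "official budget report"))
```

Because the vectorizer is part of the pipeline, raw text can be passed straight to `predict_news` without a separate transform step.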



Conclusion
In conclusion, the development of a fake news detection model using NLP
techniques and deep learning represents a crucial step in addressing the
contemporary challenge of misinformation. Throughout this project, we
have successfully undertaken various tasks, including data preprocessing,
feature extraction, model construction, and evaluation.
Our model, based on a combination of CNN and BiLSTM layers, has shown
promising results in distinguishing between genuine and fake news articles.
We have rigorously assessed its performance using metrics such as
accuracy, precision, recall, F1-score, and ROC-AUC, thereby ensuring its
reliability and effectiveness.
As we move forward, it is important to recognize the ongoing importance of
this work. Fake news remains a persistent issue in the digital age, and our
model provides a valuable tool in the fight against its proliferation. By
continuing to refine and deploy such models, we can contribute to a more
informed and discerning society, where credible journalism prevails, and
misinformation finds fewer footholds.
In the grand scheme of the information landscape, this project represents a
small yet significant step toward promoting the truth and safeguarding the
integrity of news reporting.

For access to this code, please refer to the following GitHub link:
[Link]

