
Major Project

The document presents a project synopsis on an NLP-based extended lexicon model for sarcasm detection using tweets and emojis, submitted for a Bachelor of Engineering degree in Artificial Intelligence & Data Science. It outlines the challenges of sarcasm detection in natural language processing, the proposed system architecture, and the methodology used for data preparation and analysis. The project aims to enhance the accuracy of sarcasm detection by integrating a broader vocabulary of sarcastic terms and their connections to emojis.

A Project Synopsis Report

on

NLP BASED EXTENDED LEXICON MODEL FOR


SARCASM DETECTION WITH TWEETS AND EMOJIS

Submitted for partial fulfillment of the requirements for the award of the
degree of

BACHELOR OF ENGINEERING

in

ARTIFICIAL INTELLIGENCE & DATASCIENCE

By

Nalapatla Nimitha 160620747029

Pogalla Srinila 160620747036

Ramidi Divya Sri 160620747053

Under the guidance of

Mrs. S. SandhyaRani
(Assistant Professor, Department of AI&DS)

Department of Artificial Intelligence & Datascience

STANLEY COLLEGE OF ENGINEERING AND TECHNOLOGY FOR WOMEN


(Affiliated to Osmania University)
Chapel Road,Abids,Hyderabad-500001
2023-2024
Stanley College of Engineering &Technology for Women
Chapel Road, Hyderabad
(Affiliated to OU & Approved by AICTE)

_________________________________________________________________________

Ref No: SCETW/AI&DS Dept/IV Year 2023-2024 Date:02/02/2024

CERTIFICATE

This is to certify that the project titled “NLP BASED EXTENDED LEXICON MODEL FOR
SARCASM DETECTION USING TWEETS AND EMOJIS” is a synopsis work carried out
by Ms. Nalapatla Nimitha (H.T. No. 160620747029), Ms. Pogalla Srinila (H.T. No.
160620747036), and Ms. Ramidi Divya Sri (H.T. No. 160620747053), in partial fulfillment of the
requirements for the award of the degree of Bachelor of Engineering in Artificial Intelligence &
Datascience from Osmania University, during Semester VII of their B.E. course in the
academic year 2023-2024.

(Dr.K.Vaidehi) (S. SandhyaRani)

Head,Department of AI&DS Project Guide

Project Coordinator:
Dr. D. Shravani
Dr. K. Vaidehi
DECLARATION

I hereby declare that the project work entitled “NLP BASED EXTENDED
LEXICON MODEL FOR SARCASM DETECTION USING TWEETS
AND EMOJIS”, submitted to Osmania University, Hyderabad, is a record
of original work done by me under the guidance of Mrs. S. SandhyaRani,
Assistant Professor, Stanley College of Engineering and Technology for Women,
and that this project work is submitted in partial fulfillment of the requirements for
the award of the degree of Bachelor of Engineering in Artificial Intelligence
& Datascience.

Nalapatla Nimitha 160620747029

Pogalla Srinila 160620747036

Ramidi Divya Sri 160620747053


ACKNOWLEDGEMENT

I express my sincere thanks to our project guide, Mrs. S. SandhyaRani, our HOD, Dr. K.
Vaidehi, our faculty members, and the other staff whose help made it possible for me to
complete my project. I am also grateful to my dearest friends, who encouraged and helped me
at every step. I am grateful to them for guiding me right from the inception until the
successful completion of the project. Their motivation and encouragement helped me gain
knowledge through this project and throughout my engineering studies. I sincerely
acknowledge them for extending their valuable guidance, support for literature, critical
reviews of the project and the report and, above all, the moral support they provided at
all stages of this project. I have put effort into this project. However, it would not have been
possible without the kind support and help of many individuals and organizations. I would
like to extend my sincere thanks to all of them. I am highly indebted to our mentor for their
guidance and constant supervision, for providing the necessary information regarding
the project, and for their support in completing it.

I would like to express my gratitude to my parents and the members of Stanley College of
Engineering and Technology for Women for their kind cooperation and encouragement, which
helped me complete this project. I would also like to express my special gratitude and thanks
to all my teachers for giving me their attention and time.
ABSTRACT

The implicit and context-dependent nature of sarcasm makes it a difficult challenge in the
field of natural language processing (NLP). Detecting sarcastic words is difficult for
traditional rule-based or statistical approaches because sarcasm frequently incorporates irony
and comic intent. Sarcasm has grown common in tweets these days due to the extensive
usage of social media platforms, particularly Twitter, and is frequently accompanied by
emojis that offer further meaning. In natural language processing, the conventional
approaches to sarcasm detection mostly depended on feature engineering and machine
learning algorithms. These techniques included sentiment analysis, part-of-speech patterns,
and lexical clues (e.g., particular keywords or phrases linked with sarcasm).

One of the main features of the extended lexicon model (ELM) is the introduction of a broader
vocabulary of sarcastic terms, phrases, and idioms, and their connection to emojis. The ELM's
improved accuracy and ability to handle social media language make it a valuable tool for
NLP researchers and practitioners who wish to learn more about the nuances of sarcasm in
modern communication.
INTRODUCTION

Sentiment analysis is the technique of examining digital text to identify whether the
emotional tone of the message is positive, negative, or neutral. Corporations store large
volumes of text data, such as emails, customer support chat transcripts, social media
comments, and reviews. Sentiment analysis algorithms can scan this text and automatically
identify the author's viewpoint on a given subject. Sentiment analysis, also called opinion
mining, is a crucial business intelligence technique that helps organizations improve their
goods and services.

A sentiment analysis system helps businesses improve their goods and services by drawing on
detailed and sincere customer input. Sentiment analysis is an application of natural language
processing (NLP) technologies that train computer programs to comprehend text much as
humans do. Tokenization divides a sentence into several parts, or tokens. Lemmatization
reduces words to their base form. Stop-word removal eliminates words that contribute nothing
significant (such as "with", "for", and "at") from sentences. After further analysis, NLP
technology assigns a sentiment score to the retrieved keywords. A sentiment score is a
measurement scale that indicates the emotional component in the sentiment analysis system.
A lexicon algorithm is used to determine the sentiment expressed by textual content.

The various varieties of sentiment analysis

Fine-grained scoring

Categorizing text intent into different emotional levels is known as fine-grained sentiment
analysis. Generally, the process involves grading user emotion on a 0–100 scale, where
extremely positive, positive, neutral, negative, and very negative are represented by equal
segments. A 5-star rating system is used by e-commerce sites as a detailed scoring system to
assess the customer experience during a purchase.
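
As a minimal sketch of the five-level scheme described above, the following Python function maps a 0–100 score onto five equal segments. The function name and the exact label strings are illustrative assumptions, not part of the project.

```python
def fine_grained_label(score: float) -> str:
    """Map a 0-100 sentiment score to one of five equal-width classes."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    labels = ["very negative", "negative", "neutral", "positive", "very positive"]
    # Each class covers 20 points; a score of exactly 100 falls in the top class.
    index = min(int(score // 20), len(labels) - 1)
    return labels[index]

print(fine_grained_label(12))   # very negative
print(fine_grained_label(55))   # neutral
print(fine_grained_label(100))  # very positive
```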

Aspect-oriented

Specific facets of a good or service are the subject of aspect-based analysis. Manufacturers of
laptops, for instance, ask consumers about their experiences with the touchpad, keyboard,
sound, and graphics. They connect client intent with hardware-related terms by using
sentiment analysis methods.

Intent-based

In market research, intent-based analysis aids in the comprehension of consumer emotion.
Opinion mining is a tool used by marketers to determine where a certain client group stands
in the buying cycle. After identifying terms like sales, discounts, and reviews in observed
conversations, they target consumers who are interested in making a purchase with targeted
advertisements.

Emotional detection

Analyzing a person's psychological state at the time of writing a text is known as
emotional detection. It goes beyond simple polarity categorization, which makes
emotional detection a more intricate field. Using this method, sentiment analysis models
attempt to interpret a person's word choice in order to identify a variety of emotions,
including happiness, rage, sadness, and regret.
LITERATURE SURVEY
| S.no | Author of paper | Aim | Dataset Used | Methodology | Results |
|---|---|---|---|---|---|
| 1 | [1] Travis LeCompte | Sentiment Analysis of Tweets Including Emoji Data | Tweets dataset (tweets including emoji data) | Multinomial Naïve Bayes (MNB), support vector machine | Divides data into sad, angry, happy, scared, thankful, surprise, love |
| 2 | [2] Bagus Satria Wiguna, Cinthia Vairra Hudiyanti, Alqis Rausanfita, Agus Zainal Arifin, Rizka W. Sholikah | Sarcasm Detection Engine for Twitter Sentiment Analysis using Textual and Emoji Feature | Tweets dataset | CNN (Convolutional Neural Network) and emoji sentiment classifier | Accuracy: 87.5% |
| 3 | [3] Adithya Raju, Akshay Sonawane, Salil Vartikar, Mandar Kulkarni, Jad Aboul Hosn, Saranya Rajagopalan | Sarcasm Detection in Tweets | Amazon product reviews dataset, Twitter dataset | Logistic Regression, Linear SVM, Decision Trees, Random Forest and Naïve Bayes | Accuracy: SVM 89.79%, Random forest 92.3%, LR 64.77%, NB 79.99%, DT 90.85%, NN 93.05% |
| 4 | [4] Ashwitha Anu, Shruthi Gowda, TC Manjunath | Sarcasm detection using NLP | Twitter messages | Decision tree, random forest classification, support vector machine | The accuracy rate is between 90 and 96 percent |
| 5 | [5] K. Sentamiselvan, P. Suresh, G K Kamalam, S. Mahendran, D. Aneri | Detection on sarcasm using machine learning classifiers and rule based approach | Twitter and Amazon | Support vector machine, Naïve Bayes, Decision Tree | Accuracy: SVM 64%, Decision tree 59%, Random forest 76%, Naïve Bayes 51% |
| 6 | [6] Jayashree Subramanian, Varun Sridharan | Exploiting Emojis for Sarcasm Detection | Twitter, Facebook | SVM, Decision Tree Classifier, Random Forest | Emojis provide a new dimension to social media communication. We study the role of emojis for sarcasm detection on social media. |
| 7 | [7] Arifur Rahaman, Mohammed Humayun Kabir | Sarcasm Detection in Tweets: A Sarcastic Feature Based Approach Using Supervised Machine Learning Model | Twitter | Machine learning: support vector machine, random forest | Decision tree (91.84%), Random forest (91.90%) |
| 8 | [8] S R Tandan | A Research on Detection of Sarcasm Using Machine Learning Techniques | Twitter | Multidimensional Intra-Attention Recurrent Network (MIARN), convolutional neural network, long short-term memory | Each sub category of anger sentiment will be provided and it will be checked against the model developed. |
| 9 | [9] Daniel Sandor and Marina Bagic Babac | Sarcasm detection in online comments using machine learning | Sarcastic comments on Reddit | Preprocessing, BERT tokenizer; algorithms: naive Bayes, support vector machines, random forests, recurrent neural networks and convolutional neural networks (CNNs) | Accuracy: BiLSTM 68.0, BiLSTM 67.0, BERT-based 73.1 |
| 10 | [10] Shaina Gupta, Ravinder Singh and Varun Singla | Emoticon and Text Sarcasm Detection in Sentiment Analysis | Car, Hotel Reviews | Artificial Neural Network, a hybrid algorithm (polarity of sarcasm, emotion), stop word removal, SVM (support vector machine) | Accuracy: Drug 90.216%, Car 88.167%, Hotel 83.295% |
| 11 | [11] Sourav Das, Dipankar Das and Anup Kumar Kolya | Sentiment classification with GST tweet data on LSTM based on polarity popularity | Twitter dataset | Naive Bayes algorithm, conditional random field | Accuracy: 84.17% |
PROPOSED SYSTEM
The lexicon algorithm is used to determine the sentiment conveyed by a text.
This sentiment can be negative, neutral, or positive. Sarcasm, however, is often expressed
through textual content whose surface sentiment is positive or neutral. Therefore, while
lexicon algorithms are helpful, they are not sufficient for detecting sarcasm. To develop
systems that are effective at detecting sarcasm in text with positive or neutral sentiment,
the lexicon algorithm must be extended. For that reason, two sarcasm analysis systems, each
derived from an extension of the lexicon algorithm, are proposed in this study. The first
system combines a lexicon algorithm with a pure sarcasm analysis algorithm. The second
system combines a lexicon algorithm with a sentiment prediction algorithm.

PROPOSED SYSTEM ARCHITECTURE

Fig 1: Proposed System Model


MODULES/METHODOLOGY USED
Data Set Description:
The dataset for sarcasm detection in tweets using emojis, consisting of 84 rows, was
extracted from Kaggle. Each row represents a tweet or a comment and includes both textual
content and emojis. The dataset is curated to facilitate the task of distinguishing between
sarcastic and non-sarcastic tweets, providing a resource for training and evaluating
sarcasm detection models. The Kaggle platform likely hosts additional information, such as
dataset creation details, labels, or potential challenges associated with sarcasm identification
in tweets.

Data Preprocessing
Data preprocessing is the process of getting raw data ready for a machine learning model.
It is the first and most important stage in building a machine learning model. When starting
a machine learning project, clean and prepared data is not always available, and data must
be cleaned and formatted before any processing. The data preprocessing task serves this
purpose.

Real-world data typically contains errors and missing values and may be in a format
that prevents machine learning models from being applied directly. Preprocessing is
necessary to clean the data and prepare it for a machine learning model, which also improves
the model's accuracy and efficiency.

Step 1: Load the hotel review dataset: In this stage, load the dataset into the data
analysis program. Text reviews (the input) in this dataset are usually accompanied by labels
(the output), such as the sentiments "positive" or "negative." The aim is to
train a model to predict these labels from the text.

Step 2: Data cleansing: Data cleansing consists of multiple steps.
• Eliminating special characters and punctuation: Special characters (like @, $, %) and
punctuation (like !, ?, .) can be removed to concentrate on the text's real content
because they are frequently unnecessary for sentiment analysis.
• Managing irrelevant information: Occasionally, text data contains metadata or other
information that is not necessary for the analysis. Such details should be removed so
that the analysis can focus on the review text itself.
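
The cleansing step can be sketched with Python's standard `re` module. This is only an illustration: the function name `clean_text` and the choice to treat links and @mentions as irrelevant information are assumptions, not the project's actual code. Note also that the punctuation pattern below would strip emojis as well, so in a pipeline that relies on emojis they would presumably have to be extracted before this step.

```python
import re

def clean_text(text: str) -> str:
    """Strip URLs, user mentions, punctuation and special characters."""
    text = re.sub(r"https?://\S+", " ", text)   # drop links (assumed irrelevant here)
    text = re.sub(r"@\w+", " ", text)           # drop @user mentions
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation and symbols ($, %, !, ?)
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

print(clean_text("Great stay @hotel!!! Only $300 a night... https://example.com"))
# Great stay Only 300 a night
```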
Step 3: Tokenization:
• Tokenization is the division of the text into more manageable chunks, like words or
sentences. This stage is essential because it divides the text into digestible chunks for
additional examination.
• The phrase "I love this hotel," for instance, might be tokenized as ["I," "love," "this,"
"hotel"].
Step 4: Convert Text to Lowercase:
• Converting all text to lowercase guarantees consistency in the text data. It
keeps "Good" and "good" from being treated by the model as two distinct words,
which could result in inaccurate feature extraction and modeling.
• "Good" and "good," for instance, ought to be regarded as the same term.
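
Steps 3 and 4 can be illustrated together with a small regular-expression tokenizer. This is a simplified sketch (the function name and the token pattern are assumptions); it lowercases first and then splits, which gives the same result as tokenizing and then lowercasing each token.

```python
import re

def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("I love this hotel"))         # ['i', 'love', 'this', 'hotel']
print(tokenize("Good") == tokenize("good"))  # True
```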

Step 5: Eliminate Stop Words:
• Stop words are common terms in a language that frequently carry little meaning and can
be safely removed to lower noise in the data. Words like "the," "is," "and," "in," etc.
are examples.
• Eliminating stop words can decrease the dimensionality of the data and increase
model efficiency without significantly sacrificing important information.
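
A minimal sketch of stop-word removal follows. The stop-word set is a small hand-picked sample for illustration only; a real pipeline would typically use a fuller list from an NLP library.

```python
# Small illustrative stop-word set (an assumption, not the project's list).
STOP_WORDS = {"the", "is", "and", "in", "with", "for", "at", "a", "an", "of", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "room", "is", "clean", "and", "quiet"]))
# ['room', 'clean', 'quiet']
```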

Step 6: Apply Stemming and Lemmatization


• Use lemmatization or stemming. Both techniques reduce words to their root form,
which aids in word standardization and enhances feature extraction.
• Stemming: To get the word stem, words' suffixes are removed. For instance, "flies"
becomes "fli," "jumping" becomes "jump," etc. Stemming is more aggressive and
can produce non-words.
• Lemmatization: This more complex method reduces words to their dictionary or
base form (lemma). For instance, "running" becomes "run," "better" becomes "good,"
and so on. Although lemmatization requires more computing power, it is more
accurate.
• Depending on the particular NLP task and dataset you are using, you can choose
between lemmatization and stemming.
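
The contrast between the two techniques can be shown with a deliberately naive sketch. The suffix list and the three-entry lemma dictionary are toy assumptions; production systems use a full stemmer (such as the Porter stemmer) and a lexical database for lemmatization.

```python
SUFFIXES = ("ing", "ed", "es", "s")   # checked in this order
LEMMAS = {"running": "run", "better": "good", "flies": "fly"}  # tiny toy dictionary

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMAS.get(word, word)

for w in ["jumping", "flies", "better"]:
    print(w, naive_stem(w), lookup_lemma(w))
# jumping jump jumping
# flies fli fly
# better better good
```

The stemmer produces the non-word "fli", while the dictionary lookup returns "fly" and can also map the irregular form "better" to "good", which suffix stripping cannot do.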
Data Splitting
As part of machine learning data preprocessing, we separate our dataset into a training set
and a test set. This is one of the most important phases of preprocessing because it allows
us to assess and improve the machine learning model's performance. Suppose we train our
machine learning model on one dataset and test it on a completely different dataset. The
model will then have trouble, because the relationships it learned during training may not
hold in the new data. Even a model that has been trained very well, with high training
accuracy, can perform worse when given a new dataset. Thus, our constant goal is to
create a machine learning model that performs well on both the training set and the test set.
Training set: A portion of the dataset used to train the machine learning model, for which
the outcomes are already known.

Test set: A portion of the dataset used to evaluate the machine learning model. The model
predicts the outcome using the test set.
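
A simple split can be sketched with the standard library alone. The 80/20 ratio and the fixed seed are assumptions chosen for illustration; the report does not state the ratio actually used.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(84))                      # stand-in for the 84 dataset rows
train, test = train_test_split(rows)
print(len(train), len(test))                # 67 17
```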

Extended Lexicon Algorithm


The lexicon algorithm has been expanded in two different ways in this work to produce two
systems that may be more effective in analyzing sarcasm, particularly in texts with neutral or
positive sentiment.

First system:

The first system, depicted in Figure 2 below, combines a pure sarcasm analysis algorithm
with a lexicon algorithm. Text-based content is fed into this system. This content may
come from Facebook or Twitter, among other social media sites. The text is first passed to
the lexicon algorithm, which computes its polarity. Next, the pure sarcasm analysis
algorithm analyzes the content with positive sentiment to find sarcasm. The system's final
output is a list of textual contents labeled as sarcastic or non-sarcastic.
Fig 2: Component diagram of the first system. An overview of the arrangement of the
components of the first system.
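
The flow of the first system can be sketched as follows. Everything here is a hypothetical stand-in: the tiny polarity lexicon, the cue list of phrases and emojis, and the rule that a positive text containing a cue is sarcastic. The report does not specify the actual sarcasm analysis algorithm, so this only illustrates how the two components are chained.

```python
# Hypothetical word-level polarity lexicon; a real one would be far larger.
LEXICON = {"love": 1, "great": 1, "wonderful": 1, "hate": -1, "terrible": -1}
# Hypothetical sarcasm cues (phrases and emojis) standing in for the
# "pure sarcasm analysis algorithm", which the text does not specify.
SARCASM_CUES = {"yeah right", "oh great", "🙄", "😒"}

def polarity(text):
    """Sum the lexicon scores of the words in the text."""
    words = text.lower().split()
    return sum(LEXICON.get(w.strip(".,!?"), 0) for w in words)

def first_system(texts):
    """Label positive-polarity texts as sarcastic when they contain a cue."""
    results = []
    for text in texts:
        score = polarity(text)
        is_sarcastic = score > 0 and any(cue in text.lower() for cue in SARCASM_CUES)
        results.append((text, score, "sarcastic" if is_sarcastic else "non-sarcastic"))
    return results

for row in first_system(["Oh great, I love waiting in line 🙄",
                         "I love this hotel",
                         "I hate delays"]):
    print(row)
```

Only texts that the lexicon scores as positive reach the cue check, mirroring the description that positive-sentiment content is handed to the sarcasm analysis stage.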
Second System:
This system combines a sentiment prediction algorithm with a lexicon algorithm. This system
employs the lexicon algorithm in the same manner as the first. A technique that can forecast
the sentiment of textual content created in a particular setting makes up the sentiment
prediction algorithm. The sentiment prediction algorithm receives as input various details
about the context in which language content would be created, including the state of the
context, the author's level of education, personality, and knowledge of the subject matter the
author would be discussing. After processing these details, the sentiment prediction algorithm
forecasts the sentiment of the textual content that will emerge.
Fig 3: Component diagram of the second system. An overview of the arrangement of the
components of the second system.
These environment details are processed based on a training data set that was formed as
follows:
• Each environment detail had a polarity. The polarity could be negative, neutral, or
positive, as shown in Table 1.
• An AND operation was performed between the polarities of the details of each textual
content environment to predict the sentiment of the textual content that would be produced
under that environment. The training data set of the sentiment prediction algorithm was
formed from the polarities of the environment details and their predicted sentiment (Table 2).
Bayes' theorem states that P(A|B) = P(B|A) P(A) / P(B), where:
• P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
• P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is
true.
• P(A) and P(B) are the probabilities of observing A and B independently of each other; these
are known as marginal probabilities.
• Naïve Bayes is mainly used for classification. It is an algorithm that discriminates between
different objects based on certain features. The algorithm is built on Bayes' theorem and
assumes that all features within a class are independent of one another, which is why it is
known as "naive".
Table 1: Environment details and their polarity values

Table 2: Training data set
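
To make the Naïve Bayes idea concrete, here is a small categorical Naïve Bayes sketch over environment-detail polarities. The training rows are invented for illustration and are not the contents of Table 2; the four feature positions (context state, education, personality, topic knowledge) follow the details named in the text, and Laplace smoothing is an added assumption.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: polarities of (context state, education,
# personality, topic knowledge) and a sentiment label. Not the data of Table 2.
TRAIN = [
    (("pos", "pos", "pos", "pos"), "positive"),
    (("pos", "neu", "pos", "pos"), "positive"),
    (("neg", "pos", "neg", "neu"), "negative"),
    (("neg", "neg", "neg", "neg"), "negative"),
    (("neu", "neu", "neu", "neu"), "neutral"),
    (("neu", "pos", "neu", "neu"), "neutral"),
]
VALUES = ("neg", "neu", "pos")

def train(rows):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)   # (label, feature index) -> value counts
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict(features, class_counts, value_counts):
    """Return the label maximising P(label) * prod_i P(feature_i | label)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total             # prior P(label)
        for i, value in enumerate(features):
            # Laplace smoothing keeps unseen values from zeroing the product.
            score *= (value_counts[(label, i)][value] + 1) / (count + len(VALUES))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict(("pos", "pos", "neu", "pos"), *model))  # positive
```

The product over features is where the "naive" independence assumption enters: each environment detail contributes its own conditional probability regardless of the others.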


Decision Tree Classifier:
A decision tree classifier is a useful machine learning algorithm for distinguishing between
sarcastic and non-sarcastic tweets based on polarity scores. Using a tree-like structure, the
classifier learns patterns and rules from labeled training data and applies them to
classify new tweets. One notable advantage is its interpretability: the tree
structure provides a clear representation of decision-making, making it easy to see
which features or words influence classification decisions. Additionally, the classifier
accommodates nonlinear relationships and can capture complex interactions between polarity
scores and other features, which can improve accuracy.

Fig 4: Decision Tree


Furthermore, decision trees offer feature importance analysis, automatically determining the
significance of different features in classification. This aids in feature selection and provides
insight into the factors that influence whether a tweet is classified as sarcastic. The
algorithm's robustness to outliers and missing data helps it handle real-world
inconsistencies. In terms of scalability, decision trees handle large datasets by recursively
partitioning data based on features, which makes them suitable for real-time social media
analysis with large volumes of tweets.
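
The recursive partitioning mentioned above starts from a single split. The sketch below finds the best threshold on one feature (a polarity score) using Gini impurity, which is one common splitting criterion; the report does not say which criterion or library was used, and the scores and binary labels (here 1 = sarcastic, 0 = non-sarcastic, in line with the project's goal) are made up for illustration. A full decision tree would repeat this search on each resulting subset.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(scores, labels):
    """Find the polarity threshold with the lowest weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted(set(scores)):
        left = [l for s, l in zip(scores, labels) if s <= t]
        right = [l for s, l in zip(scores, labels) if s > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (t, impurity)
    return best

# Hypothetical polarity scores and labels (1 = sarcastic, 0 = non-sarcastic).
scores = [-0.8, -0.5, -0.1, 0.3, 0.6, 0.9]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(scores, labels))  # (-0.1, 0.0)
```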
CONCLUSION

The purpose of this project was to propose extensions to the lexicon algorithm in order
to develop more effective sarcasm detection systems. This goal has been accomplished:
two systems have been designed for this purpose. However, it was found with the first
system that, in order to produce meaningful results and raise the system's accuracy, the
training set of the sarcasm analysis algorithm needs to be relevant to the real data being
evaluated. A lot of research remains to be done on the second system. In particular, some
work must be done to create a system that enables the gathering of the environment details
under which the textual contents would be generated.
References:
[1] LeCompte, Travis, "Sentiment Analysis of Tweets Including Emoji Data" (2017). Honors Theses. 836.
https://repository.lsu.edu/cgi/viewcontent.cgi?params=/context/honors_etd/article/1854/&path_info=LeCompte__Travis_2017.pdf

[2] Bagus Satria Wiguna, Cinthia Vairra Hudiyanti, Alqis Rausanfita, Agus Zainal Arifin, Rizka
W. Sholikah, "Sarcasm Detection Engine for Twitter Sentiment Analysis using Textual and Emoji Feature"
file:///C:/Users/PC/Downloads/812-Article%20Text-2584-1-10-20210228.pdf

[3] Adithya Raju, Akshay Sonawane, Salil Vartikar, Mandar Kulkarni, Jad Aboul Hosn, Saranya Rajagopalan,
"Sarcasm Detection in Tweets"
https://jadhosn.github.io/projects/CSE575_FinalReport-SarcasmDetection.pdf

[4] Ashwitha Anu, Shruthi Gowda, TC Manjunath, "Sarcasm detection in natural language processing"
https://www.researchgate.net/publication/346519948_Sarcasm_detection_in_natural_language_processing

[5] K. Sentamiselvan, P. Suresh, G K Kamalam, S. Mahendran, D. Aneri, "Detection on sarcasm using machine
learning classifiers and rule based approach"
https://www.researchgate.net/publication/349434281_Detection_on_sarcasm_using_machine_learning_classifiers_and_rule_based_approach

[6] Jayashree Subramanian, Varun Sridharan, "Exploiting emojis for sarcasm detection"
https://www.researchgate.net/publication/333831290_Exploiting_Emojis_for_Sarcasm_Detection

[7] Arifur Rahaman, Mohammed Humayun Kabir, "Sarcasm detection in tweets: A sarcastic feature based
approach using supervised machine learning model", March 2021. Thesis for: M.Sc. (Engg.). Advisor: Mohammed
Humayun Kabir.
https://www.researchgate.net/publication/353523194_Sarcasm_Detection_in_Tweets_A_Sarcastic_Feature_Based_Approach_Using_Supervised_Machine_Learning_Model

[8] S R Tandan, "A research on detection of sarcasm using machine learning techniques", March 2020.
https://www.researchgate.net/publication/339900592_A_Research_on_Detection_of_Sarcasm_using_Machine_Learning_Techniques

[9] Daniel Šandor and Marina Bagic Babac, "Sarcasm detection in online comments using machine learning".
ISSN: 2398-6247. Open Access. Article publication date: 31 July.
https://www.emerald.com/insight/content/doi/10.1108/IDD-01-2023-0002/full/html

[10] Shaina Gupta, Ravinder Singh and Varun Singla, "Emoticon and Text Sarcasm Detection in Sentiment
Analysis", First International Conference on Sustainable Technologies for Computational Intelligence (pp. 1).
https://www.researchgate.net/publication/336989039_Emoticon_and_Text_Sarcasm_Detection_in_Sentiment_Analysis

[11] Sourav Das, Dipankar Das and Anup Kumar Kolya, "Sentiment classification with GST tweet data on LSTM
based on polarity-popularity"
https://www.ias.ac.in/article/fulltext/sadh/045/0140
