2019, Pradha - Effective Text Data Preprocessing Technique For Sentiment Analysis in Social Media Data

This study proposes an effective text data preprocessing technique for sentiment analysis of Twitter data, focusing on the comparison of preprocessing methods like stemming, lemmatization, and spelling correction. The research utilizes a dataset of 1,314,000 tweets to develop an algorithm that weights sentiment scores based on hashtags and cleaned text, achieving an accuracy of 90.3% with the Support Vector Machine classifier. The findings highlight the importance of proper preprocessing in enhancing prediction accuracy and computational efficiency in handling unstructured social media data.

Effective Text Data Preprocessing Technique for Sentiment Analysis in Social Media Data

Saurav Pradha
School of Computing and Mathematics, Charles Sturt University, Melbourne, Victoria, Australia

Malka N. Halgamuge, Senior Member, IEEE
Dep. of Electrical and Electronic Engineering, The University of Melbourne, Victoria 3010, Australia

Nguyen Tran Quoc Vinh
Faculty of Information Technology, The University of Da Nang - University of Science and Education, Vietnam

Abstract: In the big data era, data is generated in real time or close to real time. Businesses can therefore use this ever-growing volume of data for data-driven or information-driven decision making to improve their operations. Social media, like Twitter, generates an enormous amount of such data. However, social media data are often unstructured and difficult to manage. Hence, this study proposes an effective text data preprocessing technique and develops an algorithm to train the Support Vector Machine (SVM), Deep Learning (DL) and Naïve Bayes (NB) classifiers to process Twitter data. We develop an algorithm that weights the sentiment score in terms of the weight of the hashtag and of the cleaned text. In this study, we (i) compare different preprocessing techniques (stemming, lemmatization and spelling correction) on data collected from Twitter to identify the most efficient method, and (ii) develop an algorithm that weights the scores of the hashtag and the cleaned text to obtain the sentiment. We retrieved N = 1,314,000 tweets and compared the popularity of two products, Google Now and Amazon Alexa. Using our data preprocessing algorithm and sentiment weight score algorithm, we train SVM, DL and NB models. The results show that the stemming technique performed best in terms of computational speed. Additionally, the accuracy of the algorithm was tested against manually sorted sentiments and sentiments produced before text data preprocessing. The results demonstrated that the sentiments produced by the algorithm were close to the manually annotated sentiments. In terms of model performance, the SVM performed best with an accuracy of 90.3%, perhaps due to the unstructured nature of Twitter data. Previous studies used conventional techniques; hence, no precise methods were used for cleaning the text. Our approach therefore confirms that a proper text data preprocessing technique plays a significant role in the prediction accuracy and computational time of the classifier when using unstructured Twitter data.

Keywords: Social media data, Twitter, big data, text data preprocessing, sentiment analysis, Deep Learning, Support Vector Machine, Naïve Bayes, Google Now, Amazon Alexa.

I. INTRODUCTION

Background

One of the most significant current discussions in the world is big data. In recent years, there has been a considerable rise in social media giants such as Twitter, which have proven to be a massive source of big data [1, 2, 3]. Those data can be collected in large volume and used to train machine learning and Deep Learning models, which will aid in decision making.

Sentiment analysis is one of the methods of extracting information from text drawn from various sources for personal or commercial use. Due to the popularity of social media, people post a massive amount of data online, which can then be used to generate sentiments. This can give companies an insight into their product. Information such as the performance of products throughout the year and the analysis of competing products can be extracted and used to the company's advantage.

The real-time detection of drug abuse using tweets has been analyzed by Phan et al. [4]. The authors use a dataset on legal and illegal drugs, consisting of the original text of 31,478 tweets. The study does not use any preprocessing and uses the J48, Random Forest, Naïve Bayes, and SVM (Support Vector Machine) classifiers for training. The developed classifier was tested on a real-world tweet dataset and reached a precision of 74.8% with the J48 algorithm. The suggested future work includes using Term Frequency-Inverse Document Frequency (TFIDF) to reflect the relevance of a term in a given document and improve accuracy, and using Mechanical Turk to collect vast amounts of data.

In another study, Bhat et al. [5] used sentiments to develop a system that observes people's opinions on a product or person. It uses the Twitter API to extract the 1000 latest tweets and performs text processing such as stemming and stop word removal. No machine learning algorithm was used for classification. This gave a model that calculates the sentiment by multiplying adverb values instead of summing up the whole sentiment of the tweets. Future work such as the

978-1-7281-3003-3/19/$31.00 ©2019 IEEE

Authorized licensed use limited to: Princess Sumaya University for Technology. Downloaded on April 14,2023 at 11:46:44 UTC from IEEE Xplore. Restrictions apply.
development of an algorithm for identifying offensive statements and improving the efficiency of mapped words is suggested.

It is quite challenging to process unstructured data; hence, social media data is challenging to manage and requires proper preprocessing before the right sentiment can be obtained. Past investigations used standard techniques for cleaning the text; thus, no strategies tailored to such data were applied.

Motivation

Preprocessing is used to produce the correct sentiment for effective decision making. It is implemented to remove the unstructured nature of data obtained from social media. However, applying it to massive datasets requires much time. In this situation, the method that requires less computation time and gives more accuracy should be chosen. To generate a useful sentiment, the sentiment of the cleaned text and that of the hashtag should be used in combination, because the hashtag also gives context on the importance of the topic. Furthermore, as the datasets are large, an effective classifier with more accuracy and less computational time should also be chosen to obtain a faster result.

Paper Contributions

The main contributions of this paper include the following:

• Develop a new algorithm to provide proportional weight between the hashtag and cleaned text combined to obtain sentiment output.
• Conduct an extensive comparison of the popularity of two products, Google Now and Amazon Alexa, using 1,314,000 unstructured tweets.
• Compare three different types of preprocessing technique (Stemming, Lemmatization and Spelling Correction) and their effect on the sentiment produced.
• Compare the sentiment provided by the user, the sentiment provided by the algorithm, the sentiment of the uncleaned text and the sentiment of the cleaned text.
• Compute the computational speed and accuracy of the Support Vector Machine (SVM), Deep Learning (DL) and Naïve Bayes (NB) classifiers to assess the better performer.

The paper is structured as follows: Section 2 elaborates the different cleaning processes and the data analysis, with complete details on the algorithm and classifiers used. Section 3 describes the results of using the methods discussed in Section 2. Possible improvements to the paper are discussed in Section 4, followed by the conclusion in Section 5.

II. MATERIALS AND METHODS

A. Data Collection

The data was collected using the GetOldTweets3 Python 3 library. Tweets were collected to carry out an extensive comparison of the popularity of two products: Google Now and Amazon Alexa. The total data collected was N = 341K tweets for Google Now and N = 1 million tweets for Amazon Alexa. The data contains information about the tweets that customers posted about Google Now and Amazon Alexa.

The data attributes targeted in the collection of the tweets are as follows: original tweets, clean tweets, polarity, sentiments, hashtags, user mention, retweets, favourites, permalink, tweet length, twitter ID.

B. Selection Process of Efficient Data Preprocessing Techniques

Previous studies have used different preprocessing techniques, as given in Table 1. Various experiments were carried out with different techniques or combinations of techniques to observe which method would most likely give a better outcome. Preprocessing was performed using the techniques mentioned in Table 1.

Table 1: Description of the Preprocessing techniques

| No | Technique Name | Description |
| --- | --- | --- |
| 1 | Lower text | The primary purpose of converting the text to lowercase is that words such as "Hello" and "hello" are not treated as different words, since they are the same word. It helps in reducing the number of words that the dictionary needs to hold at a time [6]. |
| 2 | Removal of @mention | Removes the user mentions, as they do not provide any relevant information about the text sentiment. |
| 3 | Removal of URL | Removes any URL from the tweet. This includes URLs starting with http, https and also pic:\\ (the URL for the picture in the tweet). |
| 4 | Removal of punctuation | Removes the punctuation and any non-alphanumeric words from the original text. |
| 5 | Removal of the hashtag | The hashtag is removed from the text and stored in a separate column, so that the weight of the hashtag can be used in a separate process. |
| 6 | Removing whitespace | Whitespace does not provide any meaning to the text, so it is removed for computational purposes. |
| 7 | Tokenization | Splits each sentence into individual tokens. |
| 8 | Removal of the encoded text formats | For this study, the encoded texts are removed, and only the ones that give specific meaning are kept. Examples include the removal of words such as xbf, x9a etc. |
| 9 | Stop word removal | Words such as "a", "an", "the" do not provide any meaning to the text. Hence, those words have been removed, and only the relevant text is used for the sentiment. This improves the precision of the text [7]. |
| 10 | Stemming | Stemming is the conversion of words to their root form. It reduces the total number of words and improves computational speed. |
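The common cleaning steps of Table 1 can be illustrated with a short Python sketch. This is a minimal illustration using only the standard `re` module, not the authors' implementation: the function name `preprocess`, the tiny stop-word list, the `pic.twitter.com` URL pattern and the rule for dropping encoded residue such as `xbf` are all assumptions made for the example. Stemming is left out because it normally relies on a third-party library.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}

def preprocess(tweet):
    """Apply the common cleaning steps and return (tokens, hashtags)."""
    text = tweet.lower()                                             # lower text
    text = re.sub(r"https?://\S+|pic\.twitter\.com/\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                                # remove @mentions
    hashtags = re.findall(r"#(\w+)", text)                           # keep hashtags separately
    text = re.sub(r"#\w+", " ", text)                                # then drop them from the text
    text = re.sub(r"[^a-z0-9\s]", " ", text)                         # remove punctuation
    tokens = text.split()                                            # strip whitespace and tokenize
    # Drop encoded residue such as "xbf" or "x9a" (hypothetical rule).
    tokens = [t for t in tokens if not re.fullmatch(r"x[0-9a-f]{2}", t)]
    # Stop word removal.
    return [t for t in tokens if t not in STOP_WORDS], hashtags

tokens, hashtags = preprocess(
    "@amazon The new #Alexa is GREAT!!! https://t.co/abc123 #SmartHome"
)
print(tokens, hashtags)
```

URLs are removed before hashtags so that a `#fragment` inside a link is not mistaken for a hashtag, and hashtags are extracted before punctuation is stripped so that the `#` marker is still present when they are collected.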

B.1 Techniques Comparison:

A variety of methods are used to assess the preprocessing of the text. Their descriptions are given in Table 2:

Table 2: Data Preprocessing Techniques used in this study

| Techniques | Description |
| --- | --- |
| Technique 1 (Stemming) | The stemming process, in which words are converted to their root words, making the words shorter. This facilitates the computation process. |
| Technique 2 (Lemmatization) | Lemmatization. It is similar to the stemming process and returns the base or dictionary form of the word. |
| Technique 3 (Spelling Correction) | Lemmatization with the addition of spelling correction. The words are first passed to spelling correction, where the spelling is checked and corrected if necessary, and then passed on to lemmatization. |

To use the three techniques, the data was first passed through the common process, which consists of the techniques mentioned in Table 1.

The sentiment obtained was then stored in a list. At the same time, the sentiments were converted to numeric values according to the type of sentiment:

$$f(x) = \begin{cases} 0 & \text{if } x = \text{negative} \\ 1 & \text{if } x = \text{neutral} \\ 2 & \text{if } x = \text{positive} \end{cases}$$

where x is the sentiment (positive, negative or neutral).

In order to identify the performance of the techniques, the steps presented in Figure 1 were used.

[Figure 1 flowchart: Start → get tweets from GetOldTweets → save to CSV → get data from CSV → common process in all three techniques (convert text to lower case, remove URLs, remove special characters, remove punctuation, remove hashtags, remove @mentions, tokenization, stop word removal, encode removal, removal of whitespace) → Technique 1: addition of stemming; Technique 2: addition of lemmatization; Technique 3: lemmatization + spelling correction → calculate the computational speed of the three techniques → pass the best performing technique to the algorithm → pass it to TextBlob → sentiment: if negative write 0, if neutral write 1, if positive write 2 → write to CSV file → End]

Figure 1: Overall selection process of the best performing preprocessing techniques

C. Data Analysis

Data analysis was run on a Lenovo G50 with an i7 processor running the latest Windows 10. Appropriate libraries were installed to run the proposed programs. The data was gathered using a Python library named GetOldTweets3. That data was preprocessed and then passed to the developed algorithm. The algorithm uses the weight of the hashtag and the weight of the cleaned tweet. The hashtag weight is used because hashtags are useful in recommendation systems, classification, categorization and search. Hashtags (i) facilitate the search of topics based on social content with themes and (ii) help the user to identify relevant topics; (iii) recommendation systems built using hashtags have also received much attention. All of this highlights the importance of using hashtags in our algorithm.

D. Proposed Algorithm

We develop a new algorithm to provide proportional weight between the hashtag and cleaned text combined to obtain sentiment output. The weight is calculated using

$$T_w = \alpha H + \beta T \qquad (1)$$

where $T_w$ is the resulting weighted score, $\alpha$ is the weight of the hashtag, $\beta$ is the weight of the tweet, $H$ is the hashtag score, and $T$ is the clean tweet score.
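The weighting of Eq. (1) and the numeric coding of sentiments can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' code: the function names are hypothetical, polarity scores are assumed to lie in [-1, 1] with zero as the neutral point, and the default weights of 0.4 for the hashtag and 0.6 for the cleaned text follow the example split the paper gives for its algorithm, with the full weight going to the text when a tweet has no hashtag.

```python
# Numeric coding of the sentiment labels, f(x) in the text.
SENTIMENT_CODE = {"negative": 0, "neutral": 1, "positive": 2}

def sentiment_label(polarity):
    """Map a polarity score to a label (zero threshold is an assumption)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

def weighted_polarity(tweet_polarity, hashtag_polarity=None, alpha=0.4, beta=0.6):
    """Eq. (1): T_w = alpha * H + beta * T.

    If the tweet has no hashtag (hashtag_polarity is None), the cleaned
    text receives the full weight.
    """
    if hashtag_polarity is None:
        return tweet_polarity
    return alpha * hashtag_polarity + beta * tweet_polarity

# A mildly positive text with a negative hashtag still ends up positive.
score = weighted_polarity(tweet_polarity=0.5, hashtag_polarity=-0.5)
print(sentiment_label(score), SENTIMENT_CODE[sentiment_label(score)])
```

Keeping the hashtag score optional makes the no-hashtag case explicit instead of encoding it as a zero score, which would otherwise shrink the text polarity by the factor β.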

Algorithm for weighting the text:

    While (tweet != 0):
        Calculate the tweet sentiment polarity
        If (hashtag != 0):
            Calculate the hashtag sentiment polarity
        End if
        If (hashtag == 0):
            Set the weight of the tweets to 100%.
        End if
        Repeat:
            Compute Tw = αH + βT
            Assign the final polarity
            Based on polarity determine the sentiments
        Until there are no tweets left

The detailed flow of the algorithm is shown in Figure 2. After the successful selection of the preprocessing technique, its output is passed to the algorithm to generate the sentiment. The algorithm works on the weight of the text and the weight of the hashtag. Hashtags indicate the main subject of the text, have been shown to provide valuable information, and have been used in sentiment analysis. For example, the proportional weight given to the hashtag in the algorithm is 40%, and the weight given to the cleaned text is 60%. If there is no hashtag in the tweet, the full weight (100%) is given to the cleaned text, and the sentiment is produced.

III. RESULTS

Techniques Comparison:

Figure 3 shows the sentiments given by the three techniques. The results of those techniques are then compared against the clean sentiment tweets.

Figure 3: Performance of the three cleaning techniques with the unprocessed sentiment text

The result of comparing the three techniques with the sentiment of the uncleaned (unprocessed) text is shown in Figure 3. Each sentiment given by the three techniques is compared with that of the unprocessed text. As indicated in Figure 3, the rate of dissimilarity for Technique 1 was 28%, compared with 11% and 14% for the other two (Technique 2 and Technique 3). This shows that the result produced by Technique 1 differs most from the unprocessed sentiments.
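A rate of dissimilarity of this kind can be computed by comparing two label lists position by position. The sketch below assumes the rate is the share of tweets whose sentiment code differs between a technique and the unprocessed text; the paper does not spell out the formula, and the function name and sample labels are hypothetical.

```python
def dissimilarity_rate(labels_a, labels_b):
    """Percentage of positions at which two equal-length label lists differ."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        return 0.0
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return 100.0 * differing / len(labels_a)

# Made-up sentiment codes (0 = negative, 1 = neutral, 2 = positive).
unprocessed = [2, 1, 0, 2]
stemmed = [2, 0, 0, 1]
print(dissimilarity_rate(unprocessed, stemmed))
```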

Techniques Comparison:

Table 3: Technique Speed Comparison

| Technique | Speed as Percentage (time relative to the slowest technique) |
| --- | --- |
| Stemming | 0.25 |
| Lemmatization | 0.53 |
| Spelling Correction | 100.00 |

As Table 3 illustrates, there is a significant difference between Technique 3 and the other two techniques. Technique 3 takes nearly 400 times longer than Technique 1 (100.00 / 0.25 = 400), because Technique 3 (spelling correction) goes through each word in a text and checks whether it is spelled correctly. If the spelling is not correct, it fixes the spelling. This requires a considerable amount of time for massive sets of data.
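Relative timings like those in Table 3 can be obtained by timing each technique on the same tokens and scaling by the slowest. The sketch below is illustrative only: the three functions are toy stand-ins written with the standard library (a suffix stripper, a dictionary lookup, and a closest-match spelling correction based on `difflib`), not the stemmer, lemmatizer or spell checker used in the study, and the vocabulary and tokens are made up.

```python
import difflib
import time

VOCAB = ["google", "alexa", "amazon", "speaker", "music", "play", "love", "great"]
LEMMAS = {"speakers": "speaker", "playing": "play", "loved": "love"}

def stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lemmatize(token):
    """Toy dictionary lookup standing in for a real lemmatizer."""
    return LEMMAS.get(token, token)

def correct_then_lemmatize(token):
    """Toy spelling correction (closest vocabulary word), then lemmatization."""
    match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.8)
    return lemmatize(match[0] if match else token)

def relative_times(techniques, tokens, repeats=200):
    """Time each technique and express it as a percentage of the slowest one."""
    elapsed = {}
    for name, func in techniques.items():
        start = time.perf_counter()
        for _ in range(repeats):
            for tok in tokens:
                func(tok)
        elapsed[name] = time.perf_counter() - start
    slowest = max(elapsed.values())
    return {name: 100.0 * (t / slowest) for name, t in elapsed.items()}

tokens = ["alexa", "playing", "musci", "speakers", "gogle", "loved"]
percentages = relative_times(
    {
        "stemming": stem,
        "lemmatization": lemmatize,
        "spelling correction": correct_then_lemmatize,
    },
    tokens,
)
print(percentages)
```

The absolute numbers depend on the machine and on the implementations being timed, so only the ratios between techniques are meaningful.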

Figure 4 shows the visualized form of Table 3.

Figure 2: Proposed algorithm after the selection of the preprocessing technique

Figure 4: Visualization of the computational speed, in seconds, of the three techniques

Technique 1 requires the least computation time, as illustrated in Figure 4, and gives sentiments that differ from those of the unprocessed text, as evident from Figure 3.

Comparison of Different Sentiments:

Figure 5: Comparison of Negative, Neutral and Positive Sentiment produced by three different methods

Figure 5 illustrates the different types of sentiment: the sentiment produced by the user, the sentiment produced by the algorithm, the sentiment generated without cleaning the data and the sentiment produced after cleaning the data. It shows that, compared with the manually sorted sentiment, our algorithm is more accurate than the other cleaning processes.

Classifiers Performance:

The sentiment produced by the algorithm is used to train the classifiers. Computation speed has been calculated based on both the training phase and the decision phase. The computational speed and accuracy of those classifiers are presented in Table 4.

Table 4: Accuracy of three classifiers on a random set of N = 10,000 tweets from the 1 million tweets, used for classifier training

| Classifiers | Accuracy (%) | Computational Speed (sec) |
| --- | --- | --- |
| Deep Learning (epoch = 10) | 70.96 | 224 |
| Naïve Bayes | 65.09 | 752.77 |
| Support Vector Machine | 90.3 | 142.64 |

Note that the speed calculation is based on the time required to train the classifier. The Support Vector Machine outperformed both Naïve Bayes and Deep Learning, with an accuracy of 90.3% on a random sample (N = 10,000) of the 1 million retrieved tweets. The time required to train the classifier was around 143 seconds. A random set of 10,000 tweets was used because of the difficulty of obtaining a large quantity of manually annotated tweets.

Figure 6: The accuracy of the three classifiers against the amount of computational time in seconds.

Figure 7: The computational speed of the three classifiers against the amount of computational time in seconds.

Figures 6 and 7 illustrate the computational speed of the classifiers together with their accuracy. Accuracy is given as a percentage, while computational speed is measured in seconds. Here, the least time is taken by the Support Vector Machine, with higher accuracy than any other classifier.
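Accuracy and training time of the kind reported in Table 4 can be measured with a small harness around any model that offers `fit` and `predict` methods. The sketch below is an assumption-laden illustration: `MajorityBaseline` is a deliberately trivial stand-in, not one of the three classifiers from the study, and the sample data is made up. Accuracy is taken as the share of test labels predicted correctly, and only the training call is timed, following the note on Table 4.

```python
import time
from collections import Counter

class MajorityBaseline:
    """Stand-in classifier (not SVM, NB or DL): predicts the most common label."""

    def fit(self, texts, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.label_ for _ in texts]

def evaluate(model, train, test):
    """Return (accuracy in %, training time in seconds) for a fit/predict model."""
    train_x, train_y = train
    test_x, test_y = test
    start = time.perf_counter()
    model.fit(train_x, train_y)          # only the training phase is timed
    train_seconds = time.perf_counter() - start
    predictions = model.predict(test_x)
    correct = sum(p == y for p, y in zip(predictions, test_y))
    return 100.0 * correct / len(test_y), train_seconds

# Made-up data: labels use the codes 0 = negative, 1 = neutral, 2 = positive.
train = (["love alexa", "great speaker", "bad update"], [2, 2, 0])
test = (["nice", "awful", "good", "fine"], [2, 0, 2, 2])
accuracy, seconds = evaluate(MajorityBaseline(), train, test)
print(accuracy, seconds)
```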

Overall Sentiments throughout the years:

(a) Google Now:

Figure 8: Overall Sentiments curve of Google Now from the start of 2006 to 2019 on N = 341K sets of tweets.

Please note that this sentiment is based on the number of tweets collected for a specific time period. In total, N = 341K tweets were collected from 2006 to 2019. The data collected for 2019 covers January to the 1st of May 2019.

As shown in Figure 8, the peak time for Google Now was in 2016, with a slow decline after that year. The reason for the peak in 2016 is that Google released the Daydream VR (virtual reality) platform, which utilized Google Now. In addition, there was a major change at Google as they launched Google Assistant, the next big evolution of Google Now.

(b) Amazon Alexa:

Figure 9: Overall Sentiment of Amazon Alexa on the year-wise basis

The release date of Amazon Alexa was in 2014. Since then, Amazon launched Amazon Echo in 2017, which uses the Amazon Alexa voice assistant. Because of this, the popularity of Amazon Alexa and its hashtag #alexa has increased a lot. As illustrated in Figure 9, 2018 was the most popular year for Amazon Alexa: it was the year with the most tweets and the most popularity.

Comparison of Amazon Alexa and Google Now for the Year 2016:

(a) Google Now:

As seen from Figure 8, the peak time for Google Now was in 2016. Monthly sentiment values were observed to find out the popularity of Google Now.

Figure 10: Overall Sentiment curve of Google Now monthly for the Year 2016.

Figure 10 demonstrates that the peak time for Google Now in 2016 was in March, which had the highest values.

(b) Amazon Alexa:

Figure 11: Overall Sentiment curve of Amazon Alexa monthly for the Year 2016.

Similarly, Figure 11 shows that September was the most popular month for Amazon Alexa.
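Monthly curves such as these can be derived by grouping the scored tweets by calendar month. The sketch below assumes each tweet is available as a pair of an ISO date string and a sentiment code (0, 1 or 2), and that a month is summarized by its tweet count and mean sentiment code; the paper does not state how its curves were aggregated, so the function names, the input layout and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def monthly_summary(tweets):
    """Group (ISO date string, sentiment code) pairs by month.

    Returns {"YYYY-MM": (tweet_count, mean_sentiment_code)}.
    """
    buckets = defaultdict(list)
    for date_str, code in tweets:
        month = datetime.strptime(date_str, "%Y-%m-%d").strftime("%Y-%m")
        buckets[month].append(code)
    return {m: (len(c), sum(c) / len(c)) for m, c in sorted(buckets.items())}

def busiest_month(summary):
    """Month with the largest tweet count."""
    return max(summary, key=lambda m: summary[m][0])

# Made-up sample; real input would be read from the CSV of scored tweets.
sample = [("2016-03-01", 2), ("2016-03-15", 1), ("2016-09-02", 0)]
summary = monthly_summary(sample)
print(summary, busiest_month(summary))
```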

Comparison of Amazon Alexa and Google Now for the Year 2018:

(a) Google Now:

We examined the most recent full year of data, 2018, to compare the popularity of Google Now and Amazon Alexa. Figure 12 illustrates the monthly sentiment values for Google Now. In contrast to the 2016 data, the best month for Google Now in 2018 was November.

Figure 12: Overall Sentiment of Google Now in month wise basis for the year 2018

(b) Amazon Alexa:

Similarly, in contrast to the 2016 data, the best month for Amazon Alexa in 2018 was November.

By tweet volume, the most popular month for Amazon Alexa was January 2018, with 66,081 tweets overall. Figure 13 presents the monthly sentiment values for Amazon Alexa.

Figure 13: Overall Sentiment of Amazon Alexa in month wise basis for the year 2018

Overall Popularity:

Figures 8 and 9 demonstrate that the most successful year for Google Now was 2016 with N = 172,240 tweets and the most successful year for Amazon Alexa was 2018 with N = 447,911 tweets. For Google Now, the most popular month in 2016 was March with a total of N = 23,882 tweets, as shown in Figure 10. For Amazon Alexa, the most popular month in 2018 was January with a total of N = 66,081 tweets, as shown in Figure 13.

IV. DISCUSSION

Sentiment analysis using freely available Twitter data is essential for gaining an insight into a product or observing the popularity of a product or organization. This information helps businesses to make better-informed decisions and to make a considerable profit from them. Effective text preprocessing is required so that accurate sentiments can be generated. Also, because tweets are short, people use shortcut language and abbreviations in them, which makes it crucially important to preprocess the text effectively; otherwise, the sentiment generated from short tweets will also be wrong.

Some studies have used only simple programming for preprocessing, while others have used standard techniques like tokenization and stemming to obtain a better result. In our study, the combination of preprocessing techniques such as stemming, tokenization, removal of special characters and punctuation, together with simple programming, performed better in the cleaning process. Our results show that the data works well with classifiers such as Naïve Bayes and Deep Learning, with the SVM classifier giving the highest accuracy in the least computation time. In contrast, in [8] the Naïve Bayes and Random Forest classifiers are more sensitive than the Logistic Regression and support vector machine classifiers when different pre-processing techniques are applied. Choosing a learning algorithm depends on the characteristics of the application [9]. Increasing the number of features used by a classifier also increases the dimension of the feature space, causing the "curse of dimensionality". This makes learning harder, with less accuracy and more substantial computation time.

There are obvious limitations to our study. In general, sarcastic tweets include positive words, or even exaggerated positive words, to convey a negative opinion, or the other way around. This decreases the accuracy of the classifier, as it will assign those texts to the wrong sentiments. Moreover, tweets that contain only a picture, links or mentions are excluded. In our study, there were multiple instances of such tweets, which had to be discarded; handling them is left for future improvements.

The results obtained from this study confirm that a proper preprocessing technique plays a significant role in cleaning the text and increasing the accuracy of the classifier. In terms of choosing classifiers, Naïve Bayes and SVM were the most commonly used classifiers, and they performed better than the other classifiers. Even among those two, SVM performed better in most cases.

V. CONCLUSION

In recent years, there has been a considerable rise in social media data, such as data from Twitter, which constitutes a vast amount of big data that could be used in the decision-making process. Nonetheless, those data are unstructured. Text data preprocessing is an effective method for cleaning those unstructured data and making them structured and meaningful. We have compared three different types of text data preprocessing technique (Stemming, Lemmatization and Spelling Correction) and their effect on the sentiment produced. Our algorithm can be used to provide proportional weight between the hashtag and cleaned text combined to obtain sentiment output. First, three different preprocessing methods were compared to determine the best performing method, and then the result was passed to the developed algorithm so that proper sentiments could be produced. These sentiments were then used for training the classifiers (SVM, NB, and DL). Our analysis shows that the SVM classifier performed best, perhaps due to the unstructured nature of Twitter data. Moreover, the evidence from this study suggests that choosing the correct text data preprocessing affects the sentiment produced and facilitates quick and accurate decision making. Additionally, correct sentiments can be used to map the overall sentiments produced throughout the year, to perceive the popularity of products and to obtain insight into their overall performance. This information can then be used by businesses to make profitable and better-informed decisions in the future.

References

[1] M. Khader, A. Awajan and G. Al-Naymat, "The Effects of Natural Language Processing on Big Data Analysis: Sentiment Analysis Case Study", 2018 International Arab Conference on Information Technology (ACIT), 2018.

[2] A. Singh, M. N. Halgamuge, and B. Mouess, "An Analysis of Demographic and Behaviour Trends using Social Media: Facebook, Twitter and Instagram", Social Network Analytics: Computational Research Methods and Techniques, Elsevier, Chapter 5, ISBN: 9780128154588, January 2019.

[3] S. Kalid, A. Syed, A. Mohammad, and M. N. Halgamuge, "Big-Data NoSQL Databases: Comparison and Analysis of 'Big-Table', 'DynamoDB', and 'Cassandra'", IEEE 2nd International Conference on Big Data Analysis (ICBDA'17), Beijing, China, pp 89-93, 10-12 March 2017.

[4] N. Phan, S. Chun, M. Bhole and J. Geller, "Enabling Real-Time Drug Abuse Detection in Tweets", IEEE 33rd International Conference on Data Engineering (ICDE), 2017.

[5] S. Bhat, S. Garg and G. Poornalatha, "Assigning Sentiment Score for Twitter Tweets", 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018.

[6] L. Batista and L. Alexandre, "Text Pre-processing for Lossless Compression", Data Compression Conference, Dhaka, Bangladesh, 2008.

[7] B. Savaliya and C. Philip, "Email fraud detection by identifying email sender", 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), 2017.

[8] J. Zhao and G. Xiaolin, "Comparison research on text pre-processing methods on twitter sentiment analysis", IEEE Access, Vol 5, pp 2870-2879, 2017.

[9] A. Singh, M. N. Halgamuge, R. Lakshmiganthan, "Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and k-Nearest Neighbors Algorithms", International Journal of Advanced Computer Science and Applications (IJACSA), Vol 8, No 12, pp 1-10, December 2017.

Authorized licensed use limited to: Princess Sumaya University for Technology. Downloaded on April 14,2023 at 11:46:44 UTC from IEEE Xplore. Restrictions apply.




978-1-7281-3003-3/19/$31.00 ©2019 IEEE

