∗ Corresponding author.
E-mail addresses: [email protected] (Md.S. Islam), [email protected] (K.M. Alam).
https://doi.org/10.1016/j.nlp.2024.100069
Received 30 July 2023; Received in revised form 4 March 2024; Accepted 30 March 2024
2949-7191/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
Nowadays the internet has become the most valuable part of our life. In this century, one cannot imagine a day without using the internet, browsing different social media accounts and posting different types of content on one's profile. Everyone maintains a huge network on the internet with which they interact daily (Sumit et al., 2018). So, millions of posts, blogs, comments, reviews and opinions are gathered on the internet every day (Habibullah et al., 2023). People generally express their thoughts, feelings, opinions and evaluations on a particular topic in the form of text on the Internet, in different languages and on different platforms. Numerous research studies have been done on sentiment analysis in English, Chinese, Hindi, Japanese, Arabic and Urdu, while sentiment analysis in the Bengali language is still scarce (Nafisa et al., 2023; Bitto et al., 2023; Junaid et al., 2022). Few research works have been conducted on sentiment analysis in Bengali due to the lack of resources and datasets/corpora and the complexity of the Bengali language (Habibullah et al., 2023; Bitto et al., 2023; Amin et al., 2019). Bangla (Bengali), an ancient Indo-European language, is the seventh most spoken language and is used daily by more than 250 million people in the world; it is the primary language of Bangladesh and a secondary language of India (Habibullah et al., 2023; Bhowmik et al., 2022). Its use is becoming more prevalent with the recent growth of online micro-blogging sites (Azmin and Dhar, 2019). Bangladeshis are increasingly involved in online activities such as connecting with friends and family through social media, expressing their opinions and thoughts on popular micro-blogging and social networking sites, and sharing opinions and thoughts through comments on online news portals, online marketplaces and so on (Hassan et al., 2022). This brings about a large amount of user-generated information on various sites, which can be used for many applications. Sites need to examine the millions of messages posted daily, extract all posts relevant to a product or service, analyze various types of user feedback, and finally summarize that feedback to gain useful information. This task can be done manually by humans, but it is very time consuming and tedious. This is why the concept of creating automated systems for sentiment analysis has become so important.

Sentiment analysis is also termed opinion mining, opinion extraction, sentiment extraction, sentiment mining, subjectivity analysis, emotion analysis, review mining, polarity analysis, emotional AI etc. (Hassan et al., 2022; Prottasha et al., 2022). Sentiment analysis has many empirical and practical applications such as product analysis, social media monitoring, market analysis (Nafisa et al., 2023), product review analysis (Bitto et al., 2023), market trend analysis (Bhowmik et al., 2022), customer interest analysis, movie review analysis (Hassan et al., 2022), political review analysis (Tabassum and Khan, 2019) etc. Sentiment analysis is very important for business industries, NGOs, governments and other organizations (Hassan et al., 2022). Sentiment analysis can be performed at three different levels: document level, sentence level and aspect level (Prottasha et al., 2022). The document level considers that a document holds an opinion on an entity, and the task is to classify whether the entire document expresses a positive or negative sentiment (traditional SA). The sentence level works with individual sentences, deciding whether a sentence is positive, negative or neutral (traditional SA). The aspect level, broadly known as aspect-based sentiment analysis, performs a fine-grained analysis that recognizes the aspects of a provided document or sentence and the sentiment expressed towards each aspect (Rahman and Dey, 2018). For example, cricket is an aspect on which sentiment analysis can be performed; all the comments or reviews of people are related to that aspect (cricket). Similarly, restaurant, election, football, world cup, FIFA, movie, cinema, drama and viral person are some examples of different aspects, and SA can be applied to each of them.

Sentiment analysis is a well known application of natural language processing (NLP) (Bitto et al., 2023). SA is widely implemented using machine learning in different areas (Nafisa et al., 2023). The sentiment of texts can be analyzed effectively with the help of machine learning methods (Shafin et al., 2020). If machine learning systems are trained with benchmark instances of different sentiments/emotions, machines can automatically learn how to detect sentiment without the help of human interaction (Shafin et al., 2020). Supervised machine learning (ML) algorithms such as Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (K-NN) and Support Vector Machine (SVM); ensemble learning (EL) algorithms such as Random Forest (RF), Gradient Boost (GB), XGBoost (XGB) and LightGBM; and deep learning (DL) algorithms such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short Term Memory (LSTM) and Bidirectional Long Short Term Memory (BiLSTM) are all greatly applicable for sentiment analysis (Nafisa et al., 2023; Prottasha et al., 2022).

In this work, we have proposed a method for sentiment analysis of the Bangla language based on a new comprehensive document-level dataset and machine learning and deep learning approaches. The dataset contains a total of 203,463 Bangla comments collected from various microblogging sites. We have examined various hybrid feature metrics and various ML, EL and DL algorithms. The following are the contributions of the proposed work:

1. A newly created comprehensive Bangla sentiment corpus of 203,463 comments from 5 microblogging sites (Facebook, YouTube, Instagram, TikTok, Likee), manually tagged into 15 categories and containing 2,046,150 tokens and 165,319 unique tokens.
2. Validation of the dataset by 40 native Bangla speakers with a validation accuracy of 94.67%.
3. Examination of various feature extraction techniques such as BOW, N-gram, TF-IDF, TF-IDF-ICF, Word2Vec, FastText, GloVe and Bangla-BERT, grouping them to form hybrid feature metrics, and a comparative study among the applied techniques on the created Bangla sentiment dataset.
4. A proposed novel hybrid feature extraction method (Bangla-BERT+Skipgram), skipBangla-BERT, which outperforms all other techniques.
5. Application of ML, EL and DL algorithms that generate better performance on different metrics compared to state-of-the-art techniques. The hybrid CNN-BiLSTM model surpasses the existing state-of-the-art methods.

The rest of the paper is organized as follows: Section 1 briefly describes the basic concepts of sentiment analysis or opinion mining and its present perspective on Bangla natural language processing, Section 2 reviews related works, Section 3 presents the proposed methodology for sentiment analysis of the Bangla language, Section 4 analyzes and discusses the experimental results, and Section 5 concludes the work with some future remarks.

2. Related works

In this modern, technically advanced world, sentiment analysis is a topic of great importance in every language. Some of the recent studies on Bangla SA are discussed and summarized here.

An extensive dataset for sentiment classification of Bangladeshi e-commerce reviews (Daraz and Pickaboo) was presented by Rashid et al. (2024); this study proposed a dataset consisting of 78,130 reviews, including 67,268 positive and 10,862 negative reviews, covering a wide range of products. Verse-based emotion analysis of Bangla music from lyrics was proposed by Mia et al. (2024); the authors developed a new dataset comprising 6500 verses from 2152 Bangla songs, manually annotated them into 3 emotion classes, namely Love, Sad and Idealistic, and used several machine learning and neural network based classifiers. Among the models, BERT outperformed the others with an accuracy of 65%. A machine learning based method for the sentiment analysis of Bangla food reviews was proposed by Islam and Alam (2023b), creating a dataset of 44,491 reviews from different Facebook pages and groups.
The authors implemented several machine learning algorithms such as MNB, SVM, KNN, LR and RF, and various deep learning algorithms, namely CNN, LSTM, GRU, Bi-LSTM, Bi-GRU, CNN-LSTM, CNN-BiLSTM, CNN-GRU and CNN-BiGRU. Among these models, RF and CNN-BiGRU outperformed the others in the ML and DL domains respectively, with accuracies of 88.73% and 90.96%; furthermore, they also used the Friedman statistical test and explainable NLP to explain the performance of the best fitted model. BanglaBook, a new large-scale Bangla dataset collected from book reviews, was proposed by Kabir et al. (2023) and consists of 158,065 samples. The authors used BOW and N-grams as feature extraction methods, then implemented ML and DL classifiers and obtained the highest f1-score of 0.9331 with BERT. Another work from Nafisa et al. (2023) compiled a method for bipolar SA of online news comments, implementing six ML models along with BOW and TF-IDF transformers and a DL approach, LSTM, along with the Word2Vec metric. They obtained the highest accuracy of 80% with RF and 83% with LSTM. Another new study was performed by Bitto et al. (2023) on user reviews collected from food delivery startups. They collected 1400 reviews from 4 food delivery Facebook pages and applied bipolar SA. Applying ML and DL algorithms, they obtained the highest accuracies of 89.64% using XGB and 91.07% with LSTM. An extended lexicon dictionary based method was proposed by Bhowmik et al. (2022), where the authors utilized DL algorithms and deployed them on two aspect based datasets collected from Kabir et al. (2023). They obtained the highest accuracy of 84.18% using a hybrid BERT-LSTM model. In Hassan et al. (2022), the authors proposed a method for Bangla conversation reviews; they collected 1141 samples from Bangla movies and short film scripts, implemented seven ML algorithms and recorded the highest accuracy of 85.59% with SVM. Another study carried out by Prottasha et al. (2022) focused on a transfer learning strategy of BERT based supervised fine tuning. They examined 6 different publicly available datasets and obtained the highest results with the hybrid CNN-BiLSTM model along with BERT based fine tuning. They showed experimentally that BERT outperforms the Word2Vec, FastText and GloVe feature extraction techniques. Another DL based study was performed by Alvi et al. (2022) using LSTM, GRU and BLSTM classifiers along with 10-fold cross validation, achieving the highest accuracy score of 78.41%.

A method for Bangla text sentiment analysis using supervised machine learning with an extended lexicon dictionary was proposed by Bhowmik et al. (2022). The authors collected datasets from Rahman and Dey (2018); there were two aspect based datasets, one the Cricket dataset with a total of 2059 comments and the other the Restaurant dataset with 2979 comments, both covering three sentiment categories: positive, negative and neutral. They created a weighted categorical lexicon data dictionary (LDD) for extracting sentiments from Bangla texts. There were a total of 1056 and 1115 active sentimental tokens as well as 970 and 2190 contradictory tokens in the Cricket and Restaurant datasets respectively. They also developed weighted lists of dictionaries for Bangla conjunctions, adjectives and adverb quantifiers. They developed a rule based BTSC algorithm of 30 steps that can classify the polarity of a Bangla document or sentence. The BTSC algorithm basically works on the basis of the LDD and POS tagging and produces an SCS value (score of sentence) that is either less than 0 (belonging to the negative category), greater than 0 (belonging to the positive category) or equal to 0 (belonging to the neutral category).

For product review sentiment analysis, a method was proposed by Shafin et al. (2020) using NLP and machine learning for the Bangla language. Online marketing became very popular after the period of COVID-19, and Bikroy, Daraz, Evaly and Chaldal.com are some popular e-commerce sites in Bangladesh (Shafin et al., 2020). The authors collected 1020 reviews from Bangla e-commerce sites. They preprocessed their data and used TF-IDF to extract important features from the dataset before feeding it to ML models. Their dataset contained 52.2% positive reviews and 47.8% negative reviews. They implemented five supervised ML algorithms: SVM, Random Forest, KNN, Logistic Regression and Decision Tree. They examined 30%, 40%, 50%, 60% and 70% of the data as test sets. Using 30% of the data as the test set, they obtained accuracies of 88.81% with SVM, 85.92% with Random Forest, 80.14% with KNN, 88.09% with Logistic Regression and 83.03% with Decision Tree. In the test dataset the real reviews were distributed as 62.5% positive and 37.5% negative, while the model predicted 58.3% positive and 41.7% negative reviews.

An automated system for sentiment analysis of Bangla text using supervised learning techniques was proposed by Tuhin et al. (2019). The authors implemented sentence and document level sentiment analysis. They created a sentiment dataset consisting of 7500 sentences which they tagged manually into six basic emotion categories, namely happy, sad, excited, angry, tender and scared. From the six emotion categories, they mapped the happy, excited and tender categories to the higher-level positive category and the remaining three emotion categories (sad, angry and scared) to the negative class. They split their corpus into training and test sets using the holdout method, taking 7400 sentences for training and only 100 sentences for testing. They implemented two classification algorithms, Naive Bayes and a Topical approach, and compared their proposed work with two other papers on sentiment analysis of Bengali texts. The Topical approach acquired the highest accuracy, above 90%, but only on the 100 test sentences.

Two new datasets for aspect-based sentiment analysis (ABSA) in the Bangla language were introduced by Rahman and Dey (2018). They created two new datasets (the Cricket dataset and the Restaurant dataset) for ABSA. A total of 2900 comments from the Cricket domain covering 5 aspect categories make up the Cricket dataset, and 2600 reviews from restaurants make up the Restaurant dataset. The authors collected comments from different Facebook pages, BBC Bangla, the Daily Prothom Alo etc. for the Cricket dataset. The Cricket dataset was annotated by six native second-year graduate students, one faculty member, one MS student and two employees from the Institute of Information Technology, University of Dhaka, into batting, bowling, team, team management and other aspects. They validated the Cricket dataset based on Zipf's law and measured an intraclass correlation value of 0.71 to validate the annotation method. To build the Restaurant dataset, they took assistance directly from the English standard Restaurant dataset. Abstract translations of all comments were made into Bangla with their proper annotations. A total of 2800 comments were contained in the main English dataset. The Restaurant dataset was based on five aspect categories: food, price, service, ambiance and miscellaneous. Both datasets were labeled with three target sentiment labels: positive, negative and neutral. They applied TF-IDF to extract features from the texts and implemented three supervised machine learning algorithms: SVM, Random Forest and KNN. On the Cricket dataset, they obtained the highest f1-score of 0.37 with the Random Forest classifier, while the highest f1-score obtained for the Restaurant dataset was 0.42 with KNN.

A method for sentiment analysis of Bangla and Romanized Bangla text using deep recurrent models was proposed by Hassan et al. (2016). The authors considered standard Bangla, Banglish (a mixing of Bangla words with English words) and Romanized Bangla in this research work. They created a dataset with 9337 samples (the BRBT dataset), of which 6698 samples are Bangla and 2639 samples are Romanized Bangla (the RB dataset). They collected data from five different sources: 4621 samples from Facebook, 2610 samples from Twitter, 801 from YouTube, 1255 from online news portals, and 50 samples from product review pages, and tagged them into three emotion categories: positive, negative and ambiguous. They employed two native Bangla human experts to annotate all the test samples for a total of two validations, with each annotator knowing nothing about the other's decisions. To show the validity of the tagging procedure, they presented a confusion matrix over all the annotated test samples, from which it was shown that the annotators agreed on 75% of the test samples.
Table 1
Summary of the recent works on sentiment analysis from Bangla text.

Name | Year | Dataset used | No. of classes | Dataset ownership | Dataset publicly available? | Feature metric | Best model (ML) | Best model (DL)
Mia et al. (2024) | 2024 | 6500 | 3 | Self | No | TF-IDF, GloVe | SGD | BERT
Rashid et al. (2024) | 2024 | 78,130 | 2 | Self | Yes | N/A | N/A | N/A
Islam and Alam (2023b) | 2023 | 44,491 | 3 | Self | No | TF-IDF | RF | CNN-BiGRU
Kabir et al. (2023) | 2023 | 158,065 | 3 | Self | No | BOW, N-Gram | RF | Bangla-BERT
Bitto et al. (2023) | 2023 | 1400 | 2 | Self | No | Word2Vec | XGB | LSTM
Hassan et al. (2022) | 2022 | 1141 | 3 | Self | No | N/A | N/A | Bangla-BERT
Junaid et al. (2022) | 2022 | 1040 | 2 | Self | No | BOW, N-Gram, TF-IDF, Word2Vec, GloVe | LR | LSTM
Prottasha et al. (2022) | 2022 | 2900, 2600 | 3 | Collected | Yes | Word2Vec, FastText, GloVe, BERT | SVM | CNN-BiLSTM
Bhowmik et al. (2022) | 2022 | 2900, 2600 | 3 | Collected | Yes | Word2Vec | N/A | BERT-LSTM
Alvi et al. (2022) | 2022 | 7000 | 3 | Collected | Yes | Word2Vec | N/A | GRU
They mainly used Recurrent Neural Network (RNN) models; more specifically, LSTM based neural networks which contained three layers, namely the embedding layer, the LSTM layer and a fully connected layer containing three nodes. They obtained a maximum accuracy of 78% with 2 categories and 70% with 3 categories on the Bangla dataset, 55% accuracy on the BRBT dataset with 2 categories, and 22% accuracy on 3 categories using categorical cross entropy loss.

A method for performing sentiment analysis on Bangla microblog posts was proposed by Chowdhury and Chowdhury (2014). The authors collected 1300 Bangla tweets using the Twitter API and created their dataset. Instead of manual annotation they applied a semi-supervised bootstrapping method (constructing a lexicon dictionary of 737 single words) to annotate tweets into positive or negative sentiment categories. They split their dataset into training and test sets using the holdout method, leaving 1000 instances for training and 300 for testing. Two state-of-the-art supervised learning models, namely SVM and MaxEnt, were used in this study. Thirteen different features were used for both classifiers and f-scores were measured for both the positive and negative categories. They achieved the highest f-score of 0.93 and an accuracy of 93% using SVM for both categories with the (unigram+emoticon) feature.

2.1. Observations from literature review

We have several observations from the literature studied. A summary of the recent works on sentiment analysis from Bangla text is given in Table 1. The observed findings are summarized in the subsequent sections.

2.1.1. Small dataset size

An annotated high quality dataset is the pre-requisite of any NLP based classification task (Rahman, 2019). The dataset developed by Kabir et al. (2023) contains 158,065 samples, the dataset of Bitto et al. (2023) contains only 1400 reviews from Facebook pages, the dataset of Hassan et al. (2022) consists of 1141 samples from Bangla movies and short film scripts, the dataset of Hassan et al. (2016) has only 9337 samples (the BRBT dataset), the dataset of Shafin et al. (2020) contains 1020 reviews from Bangla e-commerce sites, the dataset of Tuhin et al. (2019) contains 7500 sentences and the dataset of Chowdhury and Chowdhury (2014) contains 1300 Bangla tweets. Two aspect based datasets on Cricket (2900 samples) and Restaurant (2600 samples) were proposed by Rahman and Dey (2018), and later the authors of Bhowmik et al. (2022) and Prottasha et al. (2022) used those datasets. From the above study, it is clear that no benchmark dataset is yet available for sentiment analysis of the Bangla language. So, there is scope to build one or more benchmark datasets for Bangla sentiment analysis.

2.1.2. Fewer number of categories (polarity labels)

The traditional approach generally applied in SA is to classify a human review into the positive, negative or neutral class. The authors of Kabir et al. (2023), Bitto et al. (2023), Hassan et al. (2022, 2016), Shafin et al. (2020), Chowdhury and Chowdhury (2014), Rahman and Dey (2018), Bhowmik et al. (2022) and Prottasha et al. (2022) used this traditional approach. Another work from Tuhin et al. (2019) used six basic emotion categories, namely happy, sad, excited, angry, tender and scared. So, there is scope to consider more polarity labels in Bangla SA instead of using only the traditional classes, which can better reflect the actual feelings of human beings.

2.1.3. The use of traditional feature extraction methods

For text mining related works, TF-IDF, N-gram and BOW are the most common traditionally used feature extraction metrics for ML models. The authors of Kabir et al. (2023), Nafisa et al. (2023), Shafin et al. (2020) and Tuhin et al. (2019) used these metrics. A new feature extractor named the lexicon data dictionary (LDD) was used by Bhowmik et al. (2022). The authors of Nafisa et al. (2023) and Prottasha et al. (2022) used the word embedding models Word2Vec and BERT respectively, while the authors of Chowdhury and Chowdhury (2014) used 13 different hybrid feature extractors. Therefore, it is noticeable that most of the works mainly used traditional feature extraction methods. So, there is scope to use different hybrid feature extraction methods for sentiment analysis of the Bangla language.

2.1.4. Domination of deep learning models over fundamental machine learning models

Most of the recent methods on Bangla sentiment analysis consider both ML and DL models, but the results obtained by DL models surpass the traditional ML models. The methods proposed by Kabir et al. (2023), Bitto et al. (2023), Bhowmik et al. (2022) and Prottasha et al. (2022) achieved better results with DL models, and so do the others. So, there is a clear domination of deep learning models over the fundamental machine learning models.
3. Proposed methodology for sentiment analysis on Bangla language

Sentiment analysis (SA) is the mining study of human opinion that analyzes people's opinions, feelings, evaluations and judgments towards social entities such as services, products, people, events, organizations etc. (Nafisa et al., 2023). An annotated high quality dataset is the pre-requisite of any NLP based classification task (Rahman, 2019). In this work (SA_Bangla), we have proposed a method for Bangla SA using a new comprehensive dataset and applying various hybrid feature extraction techniques. Our work starts with data collection and then gradually adopts several more steps such as data preprocessing, data visualization, splitting the dataset into training and test sets, feature extraction, building models and evaluating results. Fig. 1 illustrates the workflow of the proposed method.

Fig. 1. Workflow of the proposed Bangla SA system.

3.1. Data collection

Currently micro-blogging sites are being used by a large number of Bangla speakers (Chowdhury and Chowdhury, 2014); millions of people are commenting and texting in the Bangla language on various social media platforms such as Facebook, YouTube, Instagram, TikTok, Likee and so on. We have collected Bangla comments using our self-developed crawlers from 5 micro-blogging sites; a total of 203,463 Bangla comments were collected and saved in an Excel file containing 5 columns: comment, basic sentiment category, higher sentiment category, reaction number and data source. Six people were involved in the data collection process (4 males and 2 females) and 5 were involved in the data annotation process (3 males and 2 females); their detailed information is given in Table 2. The overall summary of the domain based data collection is presented in Table 3. We have collected more data from Facebook and YouTube because Bangla comments are more available there. Nowadays Bangladeshis are very interested in using new social sites like TikTok and Likee (short-video communities), so we have also considered these two new sites. Bangla comments are rare on Instagram, so we could not collect more comments from Instagram. We have collected data using our self-developed crawler, named the Sentiment Analysis Dataset Crawler (SAD_crawler). The pseudocode of the developed crawler is shown in Algorithm 1, where X, Y, P and Q are all dynamic class names; getElementsByClassName(), replace(), trim() and include() are the methods used; and copied_content is the output variable that stores the Bangla comments.

Algorithm 1 SAD_Crawler
1: Take a content d (a loaded page) from which to collect Bangla comments
2: Take an empty variable alldata to store comments
3: alldata ← NULL
4: englishWordPattern (EWP) ← [a-zA-Z]
5: comments ← d.getElementsByClassName("X")[0]
6: c ← comments
7: length ← c.getElementsByClassName("Y").length
8: for i in length do
9:     rc ← getElementsByClassName("P")[0]
10:    rc ← rc.innerText.replace(EWP, " ")
11:    co ← c.getElementsByClassName("Y")[i].rc
12:    if co.trim() == NULL then
13:        continue
14:    end if
15:    alldata ← alldata + co + "\t"
16:    try
17:        r ← d.getElementsByClassName("Y")
18:        s ← r[i]
19:        t ← s.getElementsByClassName("Q")
20:        likes ← t[0].innerText
21:        if likes.include("likes") then
22:            alldata ← alldata + likes + "\n"
23:        else
24:            alldata ← alldata + " " + "\n"
25:        end if
26:    catch Expected exception
27:        alldata ← alldata + " " + "\n"
28:    end try
29: end for
30: copied_content ← copy(alldata)
31: Output: In the copied_content variable, each Bangla comment and its visible reaction number (likes) are copied.
32: Paste the copied content into an Excel file to save the data.

3.2. Data annotation and validation

We have annotated the collected 203,463 Bangla comments manually with five human experts (4 graduate students and 1 MS student) into 15 basic sentiment categories: happy, sad, angry, enthusiasm, fun, love, sexual, boring, disgust, surprise, fear, worry, hate, relief and neutral, which took around six months. The 15 base emotion classes are subsequently mapped to 3 higher sentiment categories: happy, enthusiasm, fun, love, surprise and relief belong to the positive sentiment category, the rest belong to the negative sentiment category, and unidentified or mixed comments belong to the neutral category. To test the validity of the annotation process, we conducted an analysis with 40 native Bangla speakers who are graduate students of North Western University, Bangladesh; each student was given 100 different comments and asked to annotate them into the 15 predefined sentiment categories. After running this audit on a total of 4000 comments (with 800 samples each from Facebook, YouTube, Instagram, TikTok and Likee) we found that it yields 94.67% accuracy. The confusion matrix of the annotation process is shown in Fig. 2 and the complete description of the dataset is given in Table 4, where we show category based measurements including the number of total comments, tokens, unique tokens, top 2 topic keywords etc.
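To make the crawling logic easier to follow, the sketch below re-expresses Algorithm 1 in Python over a saved HTML page. It is only an illustration of the same idea: BeautifulSoup is used in place of the browser DOM, and the class names "X", "Y", "P" and "Q" are the same placeholder (dynamic) class names as in the pseudocode, not real selectors.

import re
from bs4 import BeautifulSoup

ENGLISH_PATTERN = re.compile(r"[a-zA-Z]")   # EWP in Algorithm 1

def sad_crawler(html: str) -> list[tuple[str, str]]:
    """Collect (comment, likes) pairs from one page, mirroring Algorithm 1."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="X")                 # comment container (placeholder class)
    rows = container.find_all(class_="Y") if container else []
    collected = []
    for row in rows:
        text_node = row.find(class_="P")              # comment text node (placeholder class)
        if text_node is None:
            continue
        comment = ENGLISH_PATTERN.sub(" ", text_node.get_text())  # drop English letters
        if not comment.strip():
            continue                                  # skip empty comments
        like_node = row.find(class_="Q")              # reaction counter (placeholder class)
        like_text = like_node.get_text() if like_node else ""
        likes = like_text if "likes" in like_text else ""
        collected.append((comment, likes))
    return collected

The collected pairs can then be written to an Excel/CSV file, corresponding to steps 30 to 32 of the pseudocode.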
Table 2
Information of data collectors and annotators.

ID of participant | Gender | Profession | Role | Collection amount | Annotation amount
p1 | Male | MS student/author | Data collection and annotation | 83,652 | 94,197
p2 | Male | Faculty/author | Data collection | 10,545 | N/A
p3 | Male | Graduate student | Data collection and annotation | 35,760 | 35,760
p4 | Male | Graduate student | Data collection and annotation | 40,647 | 40,647
p5 | Female | Graduate student | Data collection and annotation | 17,350 | 17,350
p6 | Female | Graduate student | Data collection and annotation | 15,509 | 15,509

Table 3
Summary of domain based data collection.

Data source | No. of comments | Collection period
Facebook | 71,429 | 2022–2023
YouTube | 42,884 | 2022–2023
Instagram | 12,764 | 2022–2023
TikTok | 52,367 | 2022–2023
Likee | 24,019 | 2022–2023

Classifying comments into categories such as worry, troll, sexual, bully, neutral etc. is basically termed sentiment detection (Bitto et al., 2023). The problem is to classify sentiments correctly from a labeled dataset. Consider a document in the dataset that contains many sentences with a total word count of N, denoted by D_sent, where P_comt is a vocabulary of K words. S_categ of D_sent in P_comt represents the set of different sentiment categories, where r is the total number of sentiment category labels. E_x is the required output sentiment label for a test instance x.
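Read as a classification task, the notation above can be summarized as follows. This is only a restatement of the definitions for clarity, not an additional formula from the paper:

\[
D_{sent} = (w_1, w_2, \ldots, w_N), \quad w_j \in P_{comt}, \quad |P_{comt}| = K,
\]
\[
S_{categ} = \{s_1, s_2, \ldots, s_r\}, \qquad
E_x = \arg\max_{s \in S_{categ}} P(s \mid x),
\]

i.e., for a test comment x the model outputs the sentiment label E_x with the highest predicted probability among the r categories.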
Table 4
Overview of category-wise data collection.

When a dataset's distribution of examples among its various classes is noticeably unbalanced, the term "data imbalance" is used (Chawla et al., 2002). The issue of imbalanced data sets can occur in many real-life situations. A learning model's ability to reliably predict the actual sentiment category may be hampered by this class imbalance. When one class has far more examples than the others, the data is imbalanced, which results in biased models and incorrect predictions for the minority class. By balancing the data, learning models can produce more precise predictions for all classes, particularly for the minority class. This enhances decision-making and guards against biased results. To improve predictions, address real-world events, and improve decision-making by eliminating prejudice towards the dominant class, imbalanced data must be balanced (Chawla et al., 2002).

3.4.1. SMOTE

By creating synthetic samples for the minority class, SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique used to correct class imbalance. It seeks to boost the dataset's representation of the minority class and enhance the effectiveness of machine learning models (Chawla et al., 2002). There are many variants of SMOTE, such as the fundamental SMOTE, SMOTENC, SMOTEN, ADASYN, BorderlineSMOTE, KMeansSMOTE, SVM-SMOTE etc. (Chawla et al., 2002). In this work, we did not implement any specialized variant of SMOTE; rather, we used the fundamental SMOTE only. The description of the fundamental SMOTE is provided below.

Comments on microblogging sites contain Bangla and English punctuation, hashtags (e.g. #), emoticons, slang and so on (Chowdhury and Chowdhury, 2014). So the raw comments always contain irrelevant characteristics and noise, and it is very important to eliminate them from the dataset (Akther et al., 2022). Noisy raw data cannot be categorized correctly into the actual sentiment; this is why preprocessing of Bangla comments is so important. In this work, we have performed tokenization, removal of punctuation, emoticons and non-Bangla words, stop words removal and stemming as preprocessing steps.

3.5.1. Tokenization

During the tokenization process, comments are divided into sentences and the sentences are divided into words. The Bangla comment ‘‘ ’’ [i like your works very much], after tokenizing, becomes ‘‘ ’’ (your), ‘‘ ’’ (works), ‘‘ ’’ (i), ‘‘ ’’ (very), ‘‘ ’’ (like).

3.5.2. Punctuation, non-Bangla words and emoticon removal

The commonly used punctuation marks in Bangla are ‘‘ ’’, ‘‘?’’, ‘‘!’’, ‘‘-’’, ‘‘,’’ and so on. Punctuation marks, special characters and symbols, especially ‘‘#’’ (hashtags), @, & and braces, have been excluded from the dataset. Non-Bangla words, especially English words, and unnecessary emoticons are also removed from one version of the dataset (Shafin et al., 2020). The comment ‘‘what??? ’’ (what??? he also insulted Sheikh Hasina), after the removal of punctuation, non-Bangla words and emoticons, becomes ‘‘ ’’ (he also insulted Sheikh Hasina).
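The cleanup rules of Sections 3.5.1 and 3.5.2 can be sketched in Python roughly as follows. The regular expressions are illustrative assumptions (the paper does not list its exact patterns); the Bangla Unicode block U+0980 to U+09FF is used to keep only Bangla characters.

import re

BANGLA_KEEP = re.compile(r"[^\u0980-\u09FF\s]")   # anything outside the Bangla block
MULTISPACE = re.compile(r"\s+")

def preprocess_comment(text: str) -> list[str]:
    """Remove punctuation, emoticons and non-Bangla (e.g. English) characters, then tokenize."""
    cleaned = BANGLA_KEEP.sub(" ", text)          # drops #, @, &, braces, emoji, Latin letters, ASCII digits
    cleaned = MULTISPACE.sub(" ", cleaned).strip()
    return cleaned.split()                        # whitespace tokenization into word tokens

# Stop-word removal and stemming would follow, e.g.:
# tokens = [t for t in preprocess_comment(raw_comment) if t not in bangla_stopwords]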
Table 5
Top 10 frequent n-grams without removing stop words of the dataset.

3.5.4. Stemming

Stemming is the process of reducing a word to its base or root form (Chowdhury and Chowdhury, 2014). Stemming algorithms aim to remove suffixes from words so that they can be matched with other words that have the same root. For example, the words ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ all have the same root word ‘‘ ’’. By stemming these words, we can match them with other words that have the same root word, such as ‘‘ ’’ or ‘‘ ’’. The Bangla comment ‘‘ ’’ (yesterday i was saddened to see the picture of the motherless child on facebook) becomes ‘‘ ’’ after stemming is performed. We have implemented stemming on our dataset as a final pre-processing step.

3.6. Statistical dataset visualization and analysis

Statistical analysis is a very important aspect of almost every subject for drawing general observations from experiments. In this section, we analyze various statistical language phenomena of the dataset, such as the usage of n-grams, average word length, character level analysis, frequency of characters, Zipf's law and type-token information. From these analyses, we can gain a good understanding of the dataset and its linguistic usefulness.

3.6.1. Unigram

A unigram, in the domains of probability and computational linguistics, is an n-gram made up of just one element from a particular sample of text or speech (Akther et al., 2022). At this level, non-stemmed unigrams for the entire dataset have been extracted from the preprocessed data. Table 5 presents a list of the top 10 unigrams, bigrams and trigrams in the dataset. Any language's function words or stop words are always its most common words. They are crucial for determining a dataset's quality. We found that 32.84% of the total tokens in our dataset are function words. We have taken out the stop words from one version of the dataset in order to get the content words. Table 6 lists the top 10 frequently occurring unigrams, bigrams and trigrams after the stop words have been eliminated. Additionally, we have recorded the total number of unique unigrams present in the dataset, providing insight into its diversity and vocabulary richness. Without removing the stop words, ‘‘ ’’ (no) is the first ranked word in the dataset, with an occurrence frequency of 31,973; after removing the stop words, ‘‘ ’’ (good) is the first ranked word, with an occurrence frequency of 13,461.

3.6.2. Bigram

A bigram is a pair of adjacent words from a specific passage of text; a bigram is an n-gram with n = 2. In many applications such as computational linguistics and NLP based works, the frequency distribution of every bigram in a text is frequently employed for a quick statistical analysis of the text (Akther et al., 2022). At this level, we have extracted bigrams from non-stemmed words. The numbers of bigrams using four threshold frequencies are displayed in Table 7. The total number of times two consecutive words appear together in a dataset is referred to as the frequency or count of that bigram. We have utilized 4 threshold frequencies, 20, 50, 100 and 200, to observe the behavior of bigrams in the dataset. The term threshold frequency describes the cut-off for whether to accept or reject a bigram: a bigram frequency (count) must be at least 20 to be accepted, otherwise it is rejected; this is known as the threshold frequency of 20. The proposed dataset contains 8463 unique bigrams out of a total of 494,682 bigrams using threshold frequency 20. When we increase the threshold frequency to 200, the total number of bigrams and the number of unique bigrams decrease to 131,758 and 321 respectively. The most frequent bigram of the proposed dataset is ‘‘ ’’ (very nice), which appears 4083 times in the dataset.

3.6.3. Trigram

A trigram is a grouping of three adjacent words or phrases from a text or speech sample; a trigram is an n-gram with n = 3 (Akther et al., 2022). Like bigrams, the frequency distribution of trigrams can be useful for a straightforward statistical analysis of text. The same threshold frequencies have also been utilized to extract trigrams from the dataset. Table 8 displays the effect of threshold frequencies on the trigrams. The proposed dataset contains 1875 unique trigrams out of a total of 85,208 trigrams using threshold frequency 20. Interestingly, when the threshold frequency is increased to 200, the total number of trigrams and unique trigrams decreases to 13,425 and 36 respectively (see Table 8).
Table 6
Top 10 frequent n-grams after removing stop words of the dataset.

Table 7
Threshold frequency wise bigrams.

Threshold frequency | No. of all bigrams | No. of all unique bigrams
20 | 494,682 | 8463
50 | 317,844 | 2442
100 | 214,827 | 926
200 | 131,758 | 321

Table 8
Threshold frequency wise trigrams.

Threshold frequency | No. of all trigrams | No. of all unique trigrams
20 | 85,208 | 1875
50 | 39,936 | 358
100 | 23,355 | 106
200 | 13,425 | 36

Fig. 3. Usage of words vs. word length of the dataset.
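The statistics in Tables 5 to 8 can be reproduced with a few lines of Python. The sketch below is an illustration, assuming the comments are already preprocessed into token lists; it counts n-grams and applies the threshold-frequency cut-off described in Sections 3.6.2 and 3.6.3.

from collections import Counter

def ngram_counts(token_lists, n):
    """Count all n-grams over a corpus given as lists of tokens."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def apply_threshold(counts, threshold):
    """Keep only n-grams whose frequency is at least the threshold."""
    kept = {g: c for g, c in counts.items() if c >= threshold}
    total = sum(kept.values())   # "no. of all n-grams" at this threshold
    unique = len(kept)           # "no. of all unique n-grams"
    return total, unique

# Example: bigram statistics at threshold 20, as in Table 7
# total_bigrams, unique_bigrams = apply_threshold(ngram_counts(corpus_tokens, 2), 20)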
Table 9
Top 10 frequent N-letter (n = 1 to 7) words of the dataset.

Table 10
Percentage of occurrence of each letter of the dataset.

3.6.5. Character level analysis

Instead of considering words as the basic unit of analysis, character-level analysis focuses on the individual characters that make up the text. By examining individual characters and their context, different errors or inconsistencies can be identified and appropriate corrections can be suggested (Akther et al., 2022).

Based on our dataset, we have determined the percentage of times each Bangla character occurs. The top 5 most frequently used characters, according to our dataset, are ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ and the least frequently used characters are ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ respectively. Table 10 reports the percentage of times each Bangla character is used in our dataset. A statistical study of the dataset at the character level reveals that 5.36% of the characters are vowels, 45.01% are consonants and 48.79% are allographs. The top 2 most frequently used characters are both allographs: ‘‘ ’’ (Aa-kar) and ‘‘ ’’ (e-kar), covering 10.05% and 6.90% of the characters of the dataset.

Another character level statistical study has been conducted on the vocabulary (the unique words of the dataset) to observe how it behaves. The percentages of occurrence of the initial letters of the unique words of the dataset are listed in Table 11. We have noticed that the likelihood of the letters ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ occurring as the first letter is higher than for the other letters. The words that start with the letter ‘‘ ’’ cover 11.47% of the words of the dataset.

3.6.6. Zipf's law

It is impossible to expect human inspection to guarantee the quality of a dataset with millions of words. So, the Zipf distribution of the dataset is examined to check whether it reflects natural human vocabulary usage. According to Zipf's law, there is a correlation between a word's frequency (d) and its rank (p) in the list obtained by sorting all the words of a large dataset by their frequency of occurrence (Manning and Schutze, 1999). According to Zipf's law:

d ∝ 1/p

For example, this means that the 50th most frequent word should appear with twice the frequency of the 100th most frequent word, and so on. Fig. 5 depicts the Zipf curve of the dataset, where the x and y axes represent the logarithmic rank and frequency of words respectively. The curve is roughly linear (see Fig. 5), so our dataset approximately follows Zipf's law.

3.6.7. Hapax legomena and vocabulary growth

Table 12 shows the type-token information of the dataset. The total number of word types is 165,319 and the number of word types occurring only once is 108,599. Word types that appear only once in a dataset are referred to as hapax legomena (Akther et al., 2022); in our dataset, more than half of the word types are hapax legomena. The vocabulary growth rate is measured as:

G = W(1) / N    (5)

Here the parameters G, W(1) and N represent the vocabulary growth rate, the number of word types occurring once and the total number of words in the dataset respectively. For our dataset, the vocabulary growth rate is 108,599 / 2,046,150 = 0.053, which is reasonable.

Table 11
Percentage of occurrence of initial letter of unique words of the dataset.

Table 13
Feature vectors for BOW.
[3, 1, 1, 2, 0, 0, 0, 0, 0]
[3, 0, 0, 0, 1, 1, 0, 0, 0]
[3, 0, 0, 2, 0, 0, 1, 1, 1]
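The Zipf check of Section 3.6.6 and the vocabulary growth rate of Eq. (5) can be computed directly from the unigram counts. A short illustrative sketch, assuming a Counter of word frequencies such as the one produced by the n-gram sketch above, follows.

import math
from collections import Counter

def zipf_points(word_counts: Counter):
    """(log rank, log frequency) pairs; a roughly linear plot indicates Zipf-like behavior."""
    freqs = sorted(word_counts.values(), reverse=True)
    return [(math.log(rank), math.log(freq)) for rank, freq in enumerate(freqs, start=1)]

def vocabulary_growth_rate(word_counts: Counter) -> float:
    """G = W(1) / N, Eq. (5): number of hapax legomena over total token count."""
    hapax = sum(1 for c in word_counts.values() if c == 1)   # W(1)
    total_tokens = sum(word_counts.values())                 # N
    return hapax / total_tokens

# For the proposed dataset this yields 108,599 / 2,046,150 ≈ 0.053.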
Fig. 6. Word embedding using CBOW.
Fig. 7. Word embedding using Skipgram.

3.7.3. TF-IDF-ICF

The term ICF stands for Inverse Class Frequency, introduced by Wang and Zhang (2010), and is calculated according to Eq. (9):

ICF = log_e(1 + P/Q)    (9)

TF-IDF-ICF = TF × IDF × ICF    (10)

Here P denotes the total number of categories and Q is the number of categories that contain the concerned word. The TF-IDF-ICF weight is therefore measured by Eq. (10).

3.7.4. Word embeddings

Word embedding is a vector based feature extraction metric used in NLP where each word is converted into a fixed sized vector of real numbers (Sumit et al., 2018). Words are represented in a high dimensional space using word embedding, where the proximity of similar words is very high; in fact, similar words together form a cluster. It can be implemented using the Word2Vec, FastText or GloVe methods based on the CBOW or Skipgram mechanism (Sumit et al., 2018). Table 14 describes the generation process of the training samples for word embeddings. The CBOW model learns through context words (Camacho-Collados and Pilehvar, 2018) and tries to predict the target word (Fig. 6), whereas the Skipgram model tries to predict the neighbors (context words) of the current word (Fig. 7). The word embedding models are based on shallow neural networks with an input layer, a projection or hidden layer and an output layer with a softmax activation. Here D is the number of words in the vocabulary and P is the size of the embedding vector (see Figs. 6 and 7).

So far, we have discussed the pros and cons of the Word2Vec model that deals with whole words; a common problem with this approach is out-of-vocabulary (OOV) words, which it cannot handle. An extension of the Word2Vec model was therefore introduced to solve this issue while adding the advantage of morphological analysis on character level n-grams; it is known as the FastText model and can also be implemented using CBOW or Skipgram (Rafat et al., 2019). The mechanism is the same with a slight change in the input layer: instead of pure words it uses character n-grams to fit the neural network. Table 15 describes the generation process of character n-grams for the FastText model. For example, the word ‘‘ ’’ (love), after breaking into Bangla characters (vowels, consonants and their short forms, i.e. kar and fola [allographs]), becomes ‘‘ + + + + + + + ’’ and, considering a character n-gram length of 3, the produced n-grams are ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ and ‘‘ ’’. Instead of relying on context words like Word2Vec and FastText, another word embedding model called GloVe (Global Vectors) works on the basis of global dataset statistics and creates fixed sized vectors using a co-occurrence matrix (Cerqueira et al.). Bangla-BERT (Bangla Bidirectional Encoder Representations from Transformers) is another pre-trained model, built on Google AI's BERT architecture, that works on the basis of transformers and the attention mechanism (Bhattacharjee et al., 2022).

We have utilized different hybrid feature extraction techniques in this work. Fig. 8 briefly describes the process of feature extraction using the hybrid skipBangla-BERT method. Bangla-BERT has two types of encoder representation: Bangla-BERT base (12 encoders) and Bangla-BERT large (24 encoders). We have utilized the publicly available pre-trained Bangla-BERT base model (https://huggingface.co/sagorsarker/bangla-bert-base) together with the Skipgram shallow neural network model. The pre-trained Bangla-BERT base consists of 12 layers, 768 hidden units, 12 self-attention heads and 110 million parameters in total (Bhattacharjee et al., 2022), whereas Skipgram consists of a shallow neural network with 1 input layer, 1 projection or hidden layer and 1 output layer (Rafat et al., 2019). The pre-trained Bangla-BERT base model works on the basis of two mechanisms, namely masked language modeling (MLM) and next sentence prediction (NSP) (Bhattacharjee et al., 2022). The first token of every sequence is represented by a special token ⟨CLS⟩ and, for separating sentences, another special token ⟨SEP⟩ is used. In Fig. 8, we denote the input masked vectors, input embeddings and embedding layers as BW_i, Bt_i and E_i respectively. Consider two consecutive dummy sentences ‘‘ ’’, ‘‘ ’’ (very nice. this is very serious); they become ⟨CLS⟩ ⟨‘‘ ’’⟩ ⟨‘‘ ’’⟩ ⟨SEP⟩ ⟨‘‘ ’’⟩ ⟨‘‘ ’’⟩ ⟨‘‘ ’’⟩. The masked input representation of the sentences passes through the 12 encoders of the pre-trained model and is built by aggregating the corresponding token embeddings, segment embeddings and position embeddings (Bhattacharjee et al., 2022). Thus the encoders generate global padded embeddings for all the tokens. The next sequential layer is the concatenation layer, containing a feed forward neural network that generates the final version of the embedded vectors from the end of the Bangla-BERT base model. The output of the concatenation layer is the input of the next sequential Skipgram layer, which further projects it and outputs a 1×768 vector of final embeddings for each comment in the dataset.

3.8. Bangla sentiment analysis model

For performing sentiment analysis on Bangla, we have implemented Bernoulli Naive Bayes (BNB), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) as ML models; Random Forest (RF), XGBoost (XGB) and Gradient Boost (GB) as EL models; Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), Bidirectional Long Short Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) as DL models; and CNN-BiLSTM as a hybrid DL model. Except for the hybrid model, all the other algorithms are base algorithms. In this section we describe the hybrid CNN-BiLSTM model. The architecture of the CNN-BiLSTM model is depicted in Fig. 9.
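As a rough illustration of the skipBangla-BERT idea, the sketch below combines a sentence-level Bangla-BERT base embedding with Skipgram word vectors trained on the corpus. It is only a sketch under assumptions: the example comments are placeholders, mean pooling over BERT tokens and simple averaging of the two 768-dimensional views stand in for the concatenation and projection layers of Fig. 8, and the exact pipeline of the paper may differ.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from gensim.models import Word2Vec

raw_comments = ["placeholder comment one", "placeholder comment two"]  # preprocessed Bangla comments in practice
tokenized_comments = [c.split() for c in raw_comments]

# Skip-gram (sg=1) word vectors trained on the corpus itself
w2v = Word2Vec(sentences=tokenized_comments, vector_size=768, window=5, sg=1, min_count=1)

tok = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
bert = AutoModel.from_pretrained("sagorsarker/bangla-bert-base")

def skip_bangla_bert_vector(comment: str) -> np.ndarray:
    # 768-d contextual vector from Bangla-BERT base (mean-pooled over tokens)
    enc = tok(comment, return_tensors="pt", truncation=True, max_length=100)
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state        # shape (1, seq_len, 768)
    bert_vec = hidden.mean(dim=1).squeeze(0).numpy()

    # 768-d Skipgram vector, averaged over the in-vocabulary words of the comment
    words = [w for w in comment.split() if w in w2v.wv]
    sg_vec = np.mean([w2v.wv[w] for w in words], axis=0) if words else np.zeros(768)

    # combine the two views into one 1x768 feature vector (simplified stand-in
    # for the concatenation + projection step described in the text)
    return (bert_vec + sg_vec) / 2.0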
Table 14
Training samples generation process for word embeddings using window size 5.

Table 15
Character n-gram generation process of FastText model.

It consists of six consecutive layers: the input layer, convolutional layer, max pooling layer, bidirectional LSTM layer, fully connected dense layer and the output layer.

The first layer is an embedding layer (input layer); it is generally used to convert the integer encoded word indices to dense vectors. It takes the input sequence and converts each word index into a fixed size dense vector of 100 (max_length). Max_length is the size of the word vectors, and the input_length parameter specifies the length of the input sequences (number of words). The second layer is a one-dimensional convolutional layer with 128 filters and a kernel size of 5. The activation function used is ReLU (Rectified Linear Unit), which introduces non-linearity to the model.
The third layer is the max pooling layer with a pool size of 4. It is used to reduce the spatial dimensions of the data and capture the most important features from the convolutional layer. The fourth layer contains a Bidirectional LSTM (Long Short-Term Memory) with 64 units (32 forward memory units and 32 backward memory units). Bidirectional LSTMs process the input sequence in both forward and backward directions, allowing the model to capture contextual information from both sides of the sequence. A dropout value of 0.2 and a recurrent_dropout of 0.2 are used to apply dropout regularization to the LSTM layer to prevent overfitting. The fifth layer is the fully-connected layer, where each neuron is connected to every neuron in the previous layer and each connection has its own weight; thus, it is very expensive in terms of memory (weights) and computation (connections). This layer flattens the input feature representations into a feature vector, performs the function of high-level reasoning and uses a softmax activation. The output layer is responsible for providing the prediction of a test instance as any one of the 15 predefined sentiment categories. So, in the output layer, there are 15 channels, where each channel corresponds to a predefined sentiment category. Section 4 describes the numerical details of the different layers of the CNN-BiLSTM model in brief.

Table 16
Split of the dataset into training and test sets based on 15 categories.

Category | Total samples | Training set | Test set
Love | 56,631 | 45,305 | 11,326
Enthusiasm | 37,965 | 30,372 | 7593
Happy | 26,596 | 21,277 | 5319
Fun | 24,351 | 19,481 | 4870
Surprise | 3501 | 2801 | 700
Relief | 765 | 612 | 153
Angry | 27,054 | 21,644 | 5410
Sad | 6101 | 4881 | 1220
Sexual | 5854 | 4684 | 1170
Disgust | 3888 | 3111 | 777
Boring | 2494 | 1996 | 498
Worry | 1810 | 1448 | 362
Fear | 1442 | 1154 | 288
Hate | 1083 | 867 | 216
Neutral | 3928 | 3143 | 785
Total | 203,463 | 162,776 | 40,687

Table 17
Split of the dataset into training and test sets based on 3 categories.

Category | Total samples | Training set | Test set
Positive | 149,809 | 119,848 | 29,961
Negative | 49,726 | 39,745 | 9981
Neutral | 3928 | 3143 | 785
Total | 203,463 | 162,776 | 40,687
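Tables 16 and 17 correspond to an approximately 80/20 split of the 203,463 comments that preserves the per-category proportions. A sketch of how such a stratified split can be produced with scikit-learn is shown below; the file name and column names are illustrative assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel("bangla_sentiment_dataset.xlsx")   # hypothetical path; columns "comment", "category"

train_df, test_df = train_test_split(
    df,
    test_size=0.2,             # 162,776 training / 40,687 test comments overall
    stratify=df["category"],   # keep the per-category proportions of Table 16
    random_state=42,
)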
Table 18 Table 22
The impact of max length on CNN-BiLSTM model. Parameters of CNN-BiLSTM model with their optimal measurements.
Max Training Validation Training Validation Parameter Measurement/Value
length accuracy accuracy loss loss
Max length 100
32 0.87 0.87 0.47 0.43 Conv1D (filters) 128
64 0.91 0.89 0.41 0.39 Conv1D (kernel size) 5
100 0.95 0.93 0.35 0.36 MaxPooling1D (pool size) 4
128 0.90 0.90 0.40 0.44 Bi-LSTM memory units 64
256 0.88 0.91 0.44 0.46 Batch size 32
Learning rate 2.5e−3
Dropout 0.2
Table 19 Activation function ReLU
The impact of learning rate on CNN-BiLSTM model. Predictive function Softmax
Learning Training Validation Training Validation Loss function Sparse categorical
rate accuracy accuracy loss loss cross-entropy
Optimizer Adam
1e−2 0.85 0.85 0.58 0.51 Metrics Accuracy
1.5e−2 0.88 0.84 0.51 0.54 Epochs 500
2e−3 0.92 0.89 0.47 0.45
2.5e−3 0.96 0.94 0.40 0.39
3e−2 0.89 0.90 0.48 0.43
3.5e−3 0.90 0.88 0.46 0.42
Table 22 illustrates the parameters of CNN-BiLSTM model with
their optimal measurements. The one-dimensional convolutional layer
Table 20 contains 128 filters and a kernel size of 5, the pool size of the one
The impact of batch size on CNN-BiLSTM model. dimensional max pooling layer is 4, the total memory units in the
Batch Training Validation Training Validation bidirectional LSTM layer is 64 having a dropout of 0.2 to prevent
size accuracy accuracy loss loss
overfitting. The activation and predictive functions are ReLU (Recti-
5 0.91 0.90 0.62 0.65 fied Linear Unit) and Softmax respectively. Sparse categorical cross-
6 0.92 0.89 0.65 0.63
entropy is the loss function, adam is the optimizer and accuracy is the
7 0.92 0.90 0.60 0.55
8 0.93 0.92 0.57 0.58 performance metrics for CNN-BiLSTM model.
30 0.95 0.93 0.57 0.54
32 0.96 0.94 0.53 0.54 4.2.2. Analysis of different layers in CNN-BiLSTM model
Fig. 9 depicts the architecture of the proposed CNN-BiLSTM model.
Table 21
The impact of epochs on the CNN-BiLSTM model.

Epochs | Training accuracy | Validation accuracy | Training loss | Validation loss
200 | 0.84 | 0.81 | 0.71 | 0.68
300 | 0.87 | 0.82 | 0.69 | 0.61
400 | 0.92 | 0.85 | 0.58 | 0.60
500 | 0.95 | 0.93 | 0.42 | 0.45
600 | 0.95 | 0.93 | 0.46 | 0.45
700 | 0.95 | 0.92 | 0.49 | 0.47

In a similar fashion, the experimental results show that a learning rate of 2.5e−3 is the best-fitted value for the CNN-BiLSTM model. The impact of the learning rate is given in Table 19. It produces a training accuracy of 0.96 and a validation accuracy of 0.94, which outperform the others.

We have also tuned the batch size parameter for the proposed CNN-BiLSTM model. The batch size is the number of training samples processed before the model weights are updated (Alam et al., 2017). Table 20 briefly describes the impact of batch size on the CNN-BiLSTM model. For batch size tuning, we considered the values 5, 6, 7, 8, 30 and 32. For the small values (5, 6, 7 and 8), the training and validation accuracies are slightly lower, while for the large values (30 and 32) they are slightly better. We recorded the best performance with a batch size of 32 and set it as the optimal value.

Epoch is another hyper-parameter that needs to be tuned; an epoch (also termed an iteration or cycle) is one complete pass over the training data. The impact of the number of epochs on the CNN-BiLSTM model is shown in Table 21. For tuning purposes, we considered 200, 300, 400, 500, 600 and 700 epochs. For the small values (200, 300 and 400), the training and validation accuracies are slightly lower, while for the larger values (500, 600 and 700) they are slightly better but essentially the same; the accuracy did not increase after 500 epochs. So, we set 500 as the optimal number of epochs.
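For illustration, the following minimal sketch shows how the tuned values reported above (learning rate 2.5e−3, batch size 32, 500 epochs) could be passed to Keras training for the CNN-BiLSTM model sketched below. The optimizer family (Adam), the loss function and the validation split are assumptions made for the example and are not taken from the paper.

from tensorflow import keras

def train_with_tuned_hparams(model, X_train, y_train,
                             learning_rate=2.5e-3, batch_size=32, epochs=500):
    """Train a compiled-ready Keras model with the hyper-parameter values
    selected in Tables 19-21 (optimizer choice is an assumption)."""
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(X_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        validation_split=0.1)  # validation fraction is illustrative
    return history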
Fig. 10. Kernel weights of Convolutional layer in CNN-BiLSTM model (initial 16 filters).
Fig. 11. Kernel weights of Convolutional layer in CNN-BiLSTM model (final 16 filters).
The first layer is an embedding layer (input layer), which is generally used to convert the integer-encoded word indices into dense vectors. It takes the input sequence and converts each word index into a fixed-size dense vector of 100 (max length). The second layer is the convolutional layer with 128 filters and a kernel size of 5. The activation function used is ReLU (Rectified Linear Unit), which introduces non-linearity into the model. The kernel weights of the initial 16 filters of the convolutional layer are shown in Fig. 10. The changes in the kernel weights across the first 16 consecutive filters appear random. The weights in output channel 1 change drastically: the weight value starts from 0.038, gradually decreases to negative values and then increases again to 0.084. In output channel 2 of the convolutional layer, the initial weight value is −0.02; it decreases further to −0.043 and then increases again to 0.046. The final weight values are all positive in the initial 7 output channels of the convolutional layer, while the 8th, 9th, 11th and 16th output channels have all negative weights. Fig. 11 shows the kernel weight adaptation schemes of the final 16 output channels of the convolutional layer. The 113th output channel starts with a positive weight of 0.043, which finally adapts to −0.008. The 128th output channel starts with a positive weight of 0.058, then drastically fluctuates to −0.06, then rapidly increases to 0.039, then decreases to 0.031 and finally stabilizes at a weight value of 0.069. The third layer is the max pooling layer with a pool size of 4. It is used to reduce the spatial dimensions of the data and to capture the most important features from the convolutional layer.

The activation status of the initial and final 16 filters of the max pooling layer in the CNN-BiLSTM model is shown in Figs. 12 and 13 respectively. The activation values for some of the channels are zero (channels 1, 4, 5, 7, 10, 11, 13, 14, 16, 113, 118, 119, 121, 122, 124, 127 and 128), while 0.0289, 0.001925 and 0.01878 are the activation values for channels 2, 15 and 126 respectively. So, the spatial dimensions are reduced to a great extent compared to the convolutional layer. The fourth layer is a Bidirectional LSTM (Long Short-Term Memory) layer with 64 units (32 forward memory units and 32 backward memory units). Bidirectional LSTMs process the input sequence in both forward and backward directions, allowing the model to capture contextual information from both sides of the sequence. A dropout value of 0.2 and a recurrent_dropout of 0.2 are used to apply dropout regularization to the LSTM layer and prevent overfitting. A brief graphical overview of the memory units of the bidirectional LSTM layer in the CNN-BiLSTM model is shown in Fig. 14. The fifth and final layer is the fully connected dense layer, which is responsible for predicting, for a test instance, one of the 15 predefined sentiment categories. So, in the output layer there are 15 channels, where each channel corresponds to a predefined sentiment category. Fig. 15 depicts the activation status of the fully connected dense layer for the proposed CNN-BiLSTM model.
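The layer-by-layer description above can be summarized in a short Keras sketch. The layer sizes (128 filters, kernel 5, pool size 4, 32 forward plus 32 backward LSTM units, dropout 0.2, 15 output channels) follow the text; the vocabulary size, the embedding width of 100, the padded input length of 100 and the softmax output activation are assumptions made for the example.

from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 50000   # assumption: vocabulary size is not given in this excerpt
EMBED_DIM = 100      # assumption: embedding width taken equal to the stated vector size
NUM_CLASSES = 15     # one output channel per predefined sentiment category

def build_cnn_bilstm():
    """Sketch of the CNN-BiLSTM layer stack described in the text."""
    model = keras.Sequential([
        keras.Input(shape=(100,)),                        # padded sequences of word indices
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),          # 1) embedding: indices -> dense vectors
        layers.Conv1D(128, 5, activation="relu"),         # 2) convolution: 128 filters, kernel size 5, ReLU
        layers.MaxPooling1D(pool_size=4),                 # 3) max pooling with pool size 4
        layers.Bidirectional(                             # 4) BiLSTM: 32 forward + 32 backward units
            layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)),
        layers.Dense(NUM_CLASSES, activation="softmax"),  # 5) dense output layer, 15 channels
    ])
    return model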
Fig. 12. Activation status of Max pooling layer in CNN-BiLSTM model (initial 16 filters).
Fig. 13. Activation status of Max pooling layer in CNN-BiLSTM model (final 16 filters).
4.3. Experimental results analysis

The proposed work mainly focuses on creating a comprehensive dataset for Bangla SA and discovering an efficient feature extraction metric, then applying various ML, EL and DL algorithms and making a comparative analysis among them to find the optimal model as well as the optimal feature metric. Table 23 describes the proposed dataset at a glance. In this work, we have developed a total of 21 different hybrid feature extraction techniques: BOW+2-Gram, BOW+3-Gram, TF-IDF+2-Gram, TF-IDF+3-Gram, TF-IDF-ICF+2-Gram, TF-IDF-ICF+3-Gram, Word2Vec+CBOW (gensim), Word2Vec+Skipgram (gensim), Word2Vec+CBOW+Skipgram (gensim), Word2Vec+CBOW (tensorflow), Word2Vec+Skipgram (tensorflow), Word2Vec+CBOW+Skipgram (tensorflow), FastText+CBOW, FastText+Skipgram, FastText+CBOW+Skipgram, GloVe+CBOW, GloVe+Skipgram, GloVe+CBOW+Skipgram, Bangla-BERT+CBOW, skipBangla-BERT and Bangla-BERT+CBOW+Skipgram, and we implemented ML, EL and DL algorithms to evaluate them. In this work, we have implemented Bernoulli Naive Bayes (BNB), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) as ML models; Random Forest (RF), XGboost (XGB) and Gradient Boost (GB) as EL models; and Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM), Convolutional Neural Network (CNN) and the hybrid CNN-BiLSTM as DL models.
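As a rough illustration of what a hybrid feature such as skipBangla-BERT could look like, the sketch below concatenates a contextual Bangla-BERT sentence vector with an averaged skip-gram word vector. The combination by concatenation, the averaging of token vectors and the parameter values are assumptions for illustration only; the paper's exact construction of each hybrid metric is not reproduced here.

import numpy as np
from gensim.models import Word2Vec

def skipgram_vector(tokens, w2v):
    """Average the skip-gram vectors of the tokens present in the Word2Vec vocabulary."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def hybrid_feature(tokens, bert_sentence_vec, w2v):
    """Concatenate a precomputed BERT sentence vector with the averaged skip-gram vector.
    bert_sentence_vec is assumed to come from a Bangla-BERT encoder (not shown here)."""
    return np.concatenate([bert_sentence_vec, skipgram_vector(tokens, w2v)])

# Usage sketch: w2v = Word2Vec(corpus_tokens, vector_size=100, sg=1)  # sg=1 selects skip-gram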
Fig. 14. Memory units of Bidirectional LSTM layer in CNN-BiLSTM model (initial and final 4 memory units for both forward and backward steps).
Table 24
Performance of different models with and without balancing the dataset using SMOTE.

Domain | Algorithm | Accuracy without SMOTE, 15 categories (%) | Accuracy with SMOTE, 15 categories (%) | Accuracy without SMOTE, 3 categories (%) | Accuracy with SMOTE, 3 categories (%)
ML BNB 57.91 66.85 75.26 83.31
ML DT 78.15 84.37 76.89 83.21
ML LR 80.12 83.77 78.34 88.74
ML KNN 59.89 72.69 69.95 81.58
ML SVM 80.52 84.88 79.58 92.37
EL RF 80.98 85.77 80.13 91.16
EL XGB 79.47 83.59 79.92 92.55
EL GB 75.67 84.91 80.03 91.87
DL RNN 80.19 88.91 82.17 93.14
DL LSTM 81.11 86.90 83.23 94.37
DL Bi-LSTM 81.73 85.96 82.65 94.68
DL CNN 80.38 88.33 82.91 94.55
DL CNN-BiLSTM 82.16 90.24 83.64 95.71
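Table 24 contrasts each model's accuracy before and after oversampling with SMOTE. A minimal sketch of how such balancing could be applied to the training split is given below; the imbalanced-learn library and its default parameters are assumptions for illustration, not the paper's exact balancing setup.

from imblearn.over_sampling import SMOTE

def balance_training_set(X_train, y_train, random_state=42):
    """Oversample minority sentiment categories with SMOTE (Chawla et al., 2002);
    applied only to the training portion, never to the test data."""
    smote = SMOTE(random_state=random_state)
    X_res, y_res = smote.fit_resample(X_train, y_train)
    return X_res, y_res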
Table 25
Obtained results of applied algorithms using different feature extraction techniques based on 15 sentiment categories.
All values are accuracy (%).
Feature BNB DT LR KNN SVM RF XGB GB RNN LSTM Bi-LSTM CNN CNN-BiLSTM
BOW+2-Gram 61.13 73.25 74.51 39.78 77.63 80.31 79.88 80.03 N/A N/A N/A N/A N/A
BOW+3-Gram 61.33 73.29 74.50 39.91 77.67 80.23 79.87 79.98 N/A N/A N/A N/A N/A
TF-IDF+2-Gram 65.72 78.93 77.67 42.96 80.34 82.52 82.31 82.19 N/A N/A N/A N/A N/A
TF-IDF+3-Gram 65.81 78.97 77.98 43.10 80.77 82.63 81.69 82.33 N/A N/A N/A N/A N/A
TF-IDF-ICF+2-Gram 64.33 80.14 78.40 43.22 82.19 83.12 82.13 82.44 N/A N/A N/A N/A N/A
TF-IDF-ICF+3-Gram 64.27 80.31 78.42 43.23 82.76 83.12 82.15 82.53 N/A N/A N/A N/A N/A
Word2Vec+CBOW (gensim) 28.71 58.16 60.67 57.90 62.61 71.18 70.13 70.69 72.39 73.65 74.68 75.89 82.31
Word2Vec+Skipgram (gensim) 35.95 57.84 62.94 60.62 67.98 71.79 71.05 69.98 74.23 75.63 76.92 80.39 83.26
Word2Vec+CBOW+Skipgram (gensim) 28.55 57.81 60.43 56.89 64.69 71.19 69.72 68.87 73.13 76.98 75.97 80.14 82.79
Word2Vec+CBOW (tensorflow) 24.13 55.79 51.07 43.26 60.53 64.67 63.19 62.74 79.54 81.26 82.39 81.37 83.76
Word2Vec+Skipgram (tensorflow) 25.26 57.81 54.32 44.17 61.29 64.98 63.55 61.87 79.98 82.05 82.31 82.91 82.77
Word2Vec+CBOW+Skipgram (tensorflow) 24.89 55.86 53.68 43.29 60.95 64.83 63.14 61.68 78.81 82.39 80.39 79.68 81.26
FastText+CBOW 56.82 67.64 71.62 72.69 72.86 76.21 73.42 72.87 77.37 79.81 80.45 80.11 82.38
FastText+Skipgram 66.85 67.78 71.61 72.58 72.59 76.29 74.51 73.27 79.91 81.29 84.25 83.22 84.27
FastText+CBOW+Skipgram 59.87 67.74 71.64 72.62 72.89 76.23 75.85 72.96 78.63 80.94 82.71 81.21 83.57
GloVe+CBOW 55.72 66.52 71.23 57.89 72.43 74.89 75.76 70.83 76.39 78.35 79.42 78.98 80.53
GloVe+Skipgram 65.35 67.39 70.54 58.93 71.99 75.61 74.38 69.87 78.13 75.53 81.32 80.32 81.26
GloVe+CBOW+Skipgram 61.37 65.74 71.27 64.79 70.99 72.94 71.29 70.89 73.41 78.91 79.47 78.96 80.93
Bangla-BERT+CBOW 63.47 84.26 83.49 67.91 84.67 85.79 83.41 84.62 88.89 86.41 86.49 87.91 89.13
skipBangla-BERT 64.36 83.19 83.77 66.43 84.88 85.34 83.59 84.91 88.91 85.73 85.96 88.33 90.24
Bangla-BERT+CBOW+Skipgram 64.53 84.37 83.73 67.05 84.79 85.77 82.97 84.77 88.86 86.90 86.81 88.13 89.91
0.85 respectively. Among all the models, KNN got the worst f1-score of 0.79, as it obtained the worst precision and recall values. Among all the implemented algorithms, KNN is the only lazy learner: it has no training phase, which is the reason for its poor outcomes. The proposed dataset contains human-written Bangla comments, so it is very rare for standard language to be used throughout; rather, the comments can contain local Bangla words, folk words, slang etc. So, training is a must to get promising outcomes from a model. Except for KNN, the other models are eager learners, and they produced significant results on both the 15-category and 3-category versions of the dataset.

The performance of a learning model depends mostly on the features extracted from the dataset. Feature extraction is very important in text mining related work, so it is important to know which feature extraction method is most efficient for the task at hand. In our work, we have examined different feature extraction techniques. The average performance of the different feature extraction methods is summarized in Fig. 19, where Bangla-BERT outperforms all other methods with an average accuracy of 92.24%. The FastText model shows an accuracy of 81.09%, the second highest among the feature extraction metrics. The TF-IDF-ICF metric outperforms the traditional TF-IDF and BOW metrics. Word2Vec, with the gensim and tensorflow libraries, shows accuracies of 65.73% and 62.81% respectively. In our experiments, we found that Word2Vec performed worst compared to the other feature metrics.
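The gensim side of the Word2Vec comparison above amounts to training the same tokenized corpus once in CBOW mode and once in skip-gram mode, as in the sketch below. The vector_size, window and min_count values are illustrative assumptions rather than the settings used in the paper.

from gensim.models import Word2Vec

def train_word2vec_variants(tokenized_comments):
    """Train gensim Word2Vec with CBOW (sg=0) and skip-gram (sg=1) on the same corpus."""
    cbow = Word2Vec(tokenized_comments, vector_size=100, window=5, min_count=2, sg=0)
    skipgram = Word2Vec(tokenized_comments, vector_size=100, window=5, min_count=2, sg=1)
    return cbow, skipgram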
Table 26
Obtained results of applied algorithms using different feature extraction techniques based on 3 sentiment categories.
All values are accuracy (%).
Feature BNB DT LR KNN SVM RF XGB GB RNN LSTM Bi-LSTM CNN CNN-BiLSTM
BOW+2-Gram 80.32 82.37 83.04 57.33 82.43 83.78 82.47 82.88 N/A N/A N/A N/A N/A
BOW+3-Gram 80.13 81.69 82.34 59.75 81.66 83.96 82.39 81.89 N/A N/A N/A N/A N/A
TF-IDF+2-Gram 82.69 88.23 88.04 58.39 89.47 90.33 90.45 89.97 N/A N/A N/A N/A N/A
TF-IDF+3-Gram 83.05 88.21 88.19 61.46 89.48 90.41 90.23 90.17 N/A N/A N/A N/A N/A
TF-IDF-ICF+2-Gram 83.18 88.95 88.61 58.20 89.92 91.07 90.86 89.99 N/A N/A N/A N/A N/A
TF-IDF-ICF+3-Gram 83.31 89.02 88.74 59.16 90.01 91.16 89.56 90.45 N/A N/A N/A N/A N/A
Word2Vec+CBOW (gensim) 65.82 78.83 80.84 81.11 81.86 85.78 84.35 83.26 83.49 83.99 84.74 84.56 89.57
Word2Vec+Skipgram (gensim) 69.19 79.92 81.46 80.31 83.20 86.44 84.53 84.36 82.43 87.94 86.24 85.41 90.13
Word2Vec+CBOW+Skipgram (gensim) 64.83 78.54 80.91 81.09 81.89 85.99 84.49 85.23 84.58 86.78 89.05 88.17 92.48
Word2Vec+CBOW (tensorflow) 54.72 73.35 75.66 71.26 78.59 80.46 82.33 81.16 83.45 85.90 87.43 89.99 89.93
Word2Vec+Skipgram (tensorflow) 56.13 77.91 78.52 74.89 79.94 80.09 84.06 83.34 84.65 86.19 89.72 87.70 92.30
Word2Vec+CBOW+Skipgram (tensorflow) 54.97 77.18 76.99 74.57 78.92 79.99 84.14 82.73 81.94 86.31 90.18 90.02 91.61
FastText+CBOW 65.82 76.64 80.61 81.11 81.86 85.21 86.49 84.45 85.73 89.90 88.15 91.15 92.35
FastText+Skipgram 75.85 76.78 81.27 81.58 82.34 85.29 87.95 88.21 89.55 90.31 92.19 90.95 93.54
FastText+CBOW+Skipgram 75.83 76.63 80.97 81.61 81.93 85.23 85.69 87.89 90.14 90.78 91.44 91.08 93.25
GloVe+CBOW 67.23 75.19 78.47 79.63 80.96 85.35 84.14 86.51 88.90 86.05 87.68 89.96 91.55
GloVe+Skipgram 73.89 75.82 80.94 80.38 81.53 85.67 85.19 88.98 84.93 87.34 89.95 90.42 92.33
GloVe+CBOW+Skipgram 76.12 76.07 79.57 80.44 81.29 85.12 85.32 87.93 85.66 86.26 89.92 90.49 92.28
Bangla-BERT+CBOW 80.13 83.16 86.91 78.92 92.14 91.03 92.34 90.68 92.89 94.13 94.45 93.44 94.47
skipBangla-BERT 82.15 83.17 87.72 80.14 92.37 91.09 92.55 91.87 93.03 94.37 94.66 94.26 95.71
Bangla-BERT+CBOW+Skipgram 81.99 83.21 87.79 79.54 92.29 91.14 92.47 91.74 93.14 94.26 94.68 94.55 95.27
Table 27
Best performance measurements of different domains using skipBangla-BERT, based on 15 categories.
Domain Best model Precision Recall F1-score Accuracy (%)
ML SVM 0.81 0.86 0.83 84.88
EL RF 0.78 0.91 0.84 85.34
DL CNN-BiLSTM 0.86 0.93 0.89 90.24

Table 28
Best performance measurements of different domains using skipBangla-BERT, based on 3 categories.
Domain Best model Precision Recall F1-score Accuracy (%)
ML SVM 0.96 0.91 0.93 92.37
EL XGB 0.89 0.99 0.94 92.55
DL CNN-BiLSTM 0.97 0.94 0.95 95.71

The comparison between existing recent works and our proposed work is illustrated in Table 29, where the dataset used, the number of categories, the feature metric, the best model, the f1-score and the accuracy serve as comparative features. In 2022, a method was proposed for Bangla sentiment analysis using an extended lexicon dictionary and deep learning algorithms (Bhowmik et al., 2022); it used two aspect-based datasets on Cricket and Restaurant and obtained its highest accuracy of 84.18% with the hybrid BERT-LSTM method. Another method from 2022 (Prottasha et al., 2022) used six datasets and obtained its highest f1-score and accuracy, 0.93 and 94.15%, with a CNN-BiLSTM model. The very recent work of Kabir et al. (2023) used a moderately large dataset of 158,065 instances and achieved its highest f1-score of 0.93 using a BERT model. Another new work (Bitto et al., 2023) used a dataset of only 1400 reviews from a food delivery startup and achieved an accuracy of 91.07% and an f1-score of 0.85. Our proposed dataset contains more instances than the 4 other existing recent works compared above. To the best of our knowledge, we have created one of the largest document-level Bangla SA corpora, with 203,463 comments from social media. Most of the work on Bangla sentiment analysis considers only 3 basic sentiment categories (positive, negative and neutral), but in this work we have examined sentiment analysis on both 15 categories and 3 categories and noticed that accuracy and the number of categories are inversely proportional to each other. Moreover, the proposed work also introduces a new hybrid feature extraction method (skipBangla-BERT) for Bangla textual data, which outperforms the 20 other hybrid methods. We have implemented 13 different algorithms from 3 different domains (ML, EL and DL); among them, the hybrid method (CNN-BiLSTM) from the DL domain outperforms the others. The best achieved accuracies are 90.24% for 15 categories and 95.71% for 3 categories.

Another graphical performance metric that is very useful for machine learning algorithms is the receiver operating characteristic (ROC) curve (Bradley, 1997). We have shown the ROC curves for both the 15 and 3 categories in Figs. 20 and 21 respectively. Individual area under the curve (AUC) values are also given in the ROC curves to observe the significance of each category (see Table 29).
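The per-category ROC/AUC values shown in Figs. 20 and 21 can be computed one-vs-rest, as in the sketch below. The function assumes y_score holds per-class probabilities from the classifier; the specific variable names and the binarization step are assumptions for illustration.

import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def per_class_roc_auc(y_true, y_score, n_classes):
    """Compute one-vs-rest AUC for each sentiment category (3 or 15 classes)."""
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    results = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        results[c] = auc(fpr, tpr)
    return results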
Table 29
Comparison among existing works and our proposed work.
Name | Year | Dataset used | No. of categories | Feature metric | Best model | F1-score | Accuracy (%)
Bhowmik et al. (2022) | 2022 | 2900 & 2600 | 3 | Word2Vec | BERT-LSTM | N/A | 84.18
Prottasha et al. (2022) | 2022 | 2900 & 2600 & 4 others | 3 | BERT | CNN-BiLSTM | 0.93 | 94.15
Kabir et al. (2023) | 2023 | (BanglaBook) 158,065 | 3 | BOW+N-Gram | BERT | 0.93 | N/A
Bitto et al. (2023) | 2023 | 1400 | 2 | Word2Vec | LSTM | 0.85 | 91.07
Islam and Alam (2023b) | 2023 | 44,491 | 3 | TF-IDF | CNN-BiGRU | 0.90 | 90.96
Mia et al. (2024) | 2024 | 6500 | 3 | TF-IDF, GloVe | BERT | 0.65 | 65.00
Our Proposed (BangDSA) | 2024 | (SA_Bangla) 203,463 | 3 | skipBangla-BERT | CNN-BiLSTM | 0.96 | 95.71
Our Proposed (BangDSA) | 2024 | (SA_Bangla) 203,463 | 15 | skipBangla-BERT | CNN-BiLSTM | 0.91 | 90.24
To observe the statistical significance of the obtained results for both the 15 and 3 categories, we have applied the non-parametric Friedman test (Liu and Xu, 2022) to the results obtained for the best performing models from the ML, EL and DL domains. In case

3 https://www.socscistatistics.com/tests/friedman/default.aspx

Though the proposed dataset (BangDSA) is one of the largest document-level sentiment analysis datasets, it is not a balanced one.
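A minimal sketch of the Friedman test described above is given below, using SciPy. The three score lists are placeholders standing for the matched accuracy measurements of the best ML, EL and DL models; they are not values from the paper.

from scipy.stats import friedmanchisquare

# Placeholder scores: one accuracy per matched evaluation setting for each domain's best model.
ml_scores = [0.84, 0.85, 0.83, 0.86]
el_scores = [0.85, 0.86, 0.84, 0.87]
dl_scores = [0.90, 0.91, 0.89, 0.92]

statistic, p_value = friedmanchisquare(ml_scores, el_scores, dl_scores)
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.4f}")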
time for the design and implementation of this research work, and also thank some students from the Department of Computer Science and Engineering, North Western University, Bangladesh for being involved in the dataset collection and annotation process of this research work.

References

Akther, A., Islam, M.S., Sultana, H., Rahman, A.R., Saha, S., Alam, K.M., Debnath, R., 2022. Compilation, analysis and application of a comprehensive bangla corpus kumono. IEEE Access 10, 79999–80014. http://dx.doi.org/10.1109/ACCESS.2022.3195236.
Alam, M.H., Rahoman, M.M., Azad, M.A.K., 2017. Sentiment analysis for bangla sentences using convolutional neural network. In: 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–6. http://dx.doi.org/10.1109/ICCITECHN.2017.8281840.
Alvi, N., Talukder, K.H., Uddin, A.H., 2022. Sentiment analysis of bangla text using gated recurrent neural network. In: International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing, vol. 1388, pp. 77–86. http://dx.doi.org/10.1007/978-981-16-2597-8_7.
Amin, A., Hossain, I., Akther, A., Alam, K.M., 2019. Bengali VADER: A sentiment analysis approach using modified VADER. In: International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 1–6. http://dx.doi.org/10.1109/ECACE.2019.8679144.
Azmin, S., Dhar, K., 2019. Emotion detection from bangla text corpus using naive bayes classifier. In: 4th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–6. http://dx.doi.org/10.1109/EICT48899.2019.9068797.
Bhattacharjee, A., Hasan, T., Ahmad, W., Mubasshir, K.S., Islam, M.S., Iqbal, A., Rahman, M.S., Shahriyar, R., 2022. BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in bangla. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 1318–1327. http://dx.doi.org/10.18653/v1/2022.findings-naacl.98.
Bhowmik, N.R., Arifuzzaman, M., Mondal, M.R.H., 2022. Sentiment analysis on bangla text using extended lexicon dictionary and deep learning algorithms. Array 3, 100123. http://dx.doi.org/10.1016/j.array.2021.100.
Bitto, A.K., Bijoy, M.H.I., Arman, M.S., Mahmud, I., Das, A., Majumder, J., 2023. Sentiment analysis from Bangladeshi food delivery startup based on user reviews using machine learning and deep learning. Bull. Electr. Eng. Inform. 12, 2282–2291. http://dx.doi.org/10.11591/eei.v12i4.4135.
Bradley, A.P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30 (7), 1145–1159. http://dx.doi.org/10.1016/S0031-3203(96)00142-2.
Camacho-Collados, J., Pilehvar, M.T., 2018. From word to sense embeddings: a survey on vector representations of meaning. J. Artificial Intelligence Res. 63 (1), 743–788. http://dx.doi.org/10.1613/jair.1.11259.
Cerqueira, T., Ribeiro, F.M., Pinto, V.H., Lima, J., Gonçalves, G. Glove prototype for feature extraction applied to learning by demonstration purposes. Appl. Sci. 12 (21), 2076–3417. http://dx.doi.org/10.3390/app122110752.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: Synthetic minority over-sampling technique. J. Artificial Intelligence Res. 16, 321–357. http://dx.doi.org/10.1613/jair.953.
Chowdhury, S., Chowdhury, W., 2014. Performing sentiment analysis in bangla microblog posts. In: IEEE International Conference on Informatics, Electronics & Vision (ICIEV), pp. 1–6. http://dx.doi.org/10.1109/ICIEV.2014.6850712.
Dash, N.S., 2005. Corpus Linguistics and Language Technology: With Reference to Indian Languages. Mittal Publications, New Delhi, India.
Habibullah, M., Islam, M.S., Jahura, F.T., Biswas, J., 2023. Bangla document classification based on machine learning and explainable NLP. In: 6th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–6. http://dx.doi.org/10.1109/EICT61409.2023.10427766.
Hassan, A., Amin, M.R., Azad, A.K.A., Mohammed, N., 2016. Sentiment analysis on bangla and romanized bangla text using deep recurrent models. In: International Workshop on Computational Intelligence (IWCI), pp. 51–56. http://dx.doi.org/10.1109/IWCI.2016.7860338.
Hassan, M., Shakil, S., Moon, N.N., Islam, M.M., Hossain, R.A., Mariam, A., Nur, F.N., 2022. Sentiment analysis on bangla conversation using machine learning approach. Int. J. Electr. Comput. Eng. (IJECE) 12, 5562–5572. http://dx.doi.org/10.11591/ijece.v12i5.pp5562-5572.
Islam, M.S., Alam, K.M., 2023a. An empiric study on bangla sentiment analysis using hybrid feature extraction techniques. In: 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), pp. 1–7. http://dx.doi.org/10.1109/ICCCNT56998.2023.10308114.
Islam, M.S., Alam, K.M., 2023b. Sentiment analysis on bangla food reviews using machine learning and explainable NLP. In: 26th International Conference on Computer and Information Technology (ICCIT), pp. 1–6. http://dx.doi.org/10.1109/ICCIT60459.2023.10441309.
Junaid, M.I.H., Hossain, F., Upal, U.S., Tameem, A., Kashim, A., Fahmin, A., 2022. Bangla food review sentimental analysis using machine learning. In: IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0347–0353. http://dx.doi.org/10.1109/CCWC54503.2022.9720761.
Kabir, M., Mahfuz, O.B., Raiyan, S.R., Mahmud, H., Hasan, M.K., 2023. BanglaBook: A large-scale bangla dataset for sentiment analysis from book reviews. Comput. Lang. http://dx.doi.org/10.48550/arXiv.2305.06595.
Liu, J., Xu, Y., 2022. T-friedman test: A new statistical test for multiple comparison with an adjustable conservativeness measure. Int. J. Comput. Intell. Syst. 15 (29), 29. http://dx.doi.org/10.1007/s44196-022-00083-8.
Manning, C., Schutze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.
Mia, M., Das, P., Habib, A., 2024. Verse-based emotion analysis of bengali music from lyrics using machine learning and neural network classifiers. Int. J. Comput. Digital Syst. 15 (1), 359–370. http://dx.doi.org/10.12785/ijcds/150128.
Nafisa, N., Maisha, S.J., Masum, A.K.M., 2023. Document level comparative sentiment analysis of bangla news using deep learning-based approach LSTM and machine learning approaches. Appl. Intell. Ind. 4.0, 198–211.
Prottasha, N.J., Sami, A.A., Kowsher, M., Murad, S.A., Bairagi, A.K., Masud, M., Baz, M., 2022. Transfer learning for sentiment analysis using BERT based supervised fine-tuning. Sensors 22, 4157. http://dx.doi.org/10.3390/s22114157.
Rafat, A.A.A., Salehin, M., Khan, F.R., Hossain, S.A., Abujar, S., 2019. Vector representation of bengali word using various word embedding model. In: 2019 8th International Conference on System Modeling and Advancement in Research Trends (SMART), pp. 27–30. http://dx.doi.org/10.1109/SMART46866.2019.9117386.
Rahman, F., 2019. An annotated bangla sentiment analysis corpus. In: International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 1–5. http://dx.doi.org/10.1109/ICBSLP47725.2019.201474.
Rahman, M.A., Dey, E.K., 2018. Datasets for aspect-based sentiment analysis in bangla and its baseline evaluation. Data 3, 15. http://dx.doi.org/10.3390/data3020015.
Rashid, M.R.A., Hasan, K.F., Hasan, R., Das, A., Sultana, M., Hasan, M., 2024. A comprehensive dataset for sentiment and emotion classification from Bangladesh e-commerce reviews. Data Brief 53, 110052. http://dx.doi.org/10.1016/j.dib.2024.110052.
Shafin, M.A., Hasan, M.M., Alam, M.R., Mithu, M.A., Nur, A.U., Faruk, M.O., 2020. Product review sentiment analysis by using NLP and machine learning in bangla language. In: 23rd International Conference on Computer and Information Technology (ICCIT), pp. 1–5. http://dx.doi.org/10.1109/ICCIT51783.2020.9392733.
Sharmin, S., Chakma, D., 2021. Attention-based convolutional neural network for bangla sentiment analysis. AI Soc. 36, 381–396. http://dx.doi.org/10.1007/s00146-020-01011-0.
Sumit, S.H., Hossan, M.Z., Al Muntasir, T., Sourov, T., 2018. Exploring word embedding for bangla sentiment analysis. In: International Conference on Bangla Speech and Language Processing, pp. 1–5. http://dx.doi.org/10.1109/ICBSLP.2018.8554443.
Tabassum, N., Khan, M.I., 2019. Design an empirical framework for sentiment analysis from bangla text using machine learning. In: International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 1–5. http://dx.doi.org/10.1109/ECACE.2019.8679347.
Tuhin, R.A., Paul, B.K., Nawrine, F., Akter, M., Das, A.K., 2019. An automated system of sentiment analysis from bangla text using supervised learning techniques. (ICCCS), pp. 360–364. http://dx.doi.org/10.1109/CCOMS.2019.8821658.
Wang, D., Zhang, H., 2010. Inverse-category-frequency based supervised term weighting scheme for text categorization. arXiv preprint arXiv:1012.2609, p. 15.