
Natural Language Processing Journal 7 (2024) 100069


Sentiment analysis of Bangla language using a new comprehensive dataset BangDSA and the novel feature metric skipBangla-BERT
Md. Shymon Islam ∗, Kazi Masudul Alam
Khulna University, Khulna 9208, Bangladesh

ARTICLE INFO

Keywords: Sentiment analysis, Bangla dataset, Bangla-BERT, Skipgram, CNN, Bi-LSTM

ABSTRACT

In this modern, technologically advanced world, Sentiment Analysis (SA) is an important topic in every language due to its many trending applications, but SA in the Bangla language remains underexplored. This work examines different hybrid feature extraction techniques and learning algorithms for Bangla document-level sentiment analysis using a new comprehensive dataset (BangDSA) of 203,493 comments collected from various microblogging sites. The proposed BangDSA dataset approximately follows Zipf's law, covers 32.84% function words with a vocabulary growth rate of 0.053, and is tagged on both 15 and 3 categories. In this study, we implement 21 different hybrid feature extraction methods, including Bag of Words (BOW), N-gram, TF-IDF, TF-IDF-ICF, Word2Vec, FastText, GloVe and Bangla-BERT, with CBOW and Skipgram mechanisms. The proposed novel method (Bangla-BERT+Skipgram), skipBangla-BERT, outperforms all other feature extraction techniques in machine learning (ML), ensemble learning (EL) and deep learning (DL) approaches. Among the models built from the ML, EL and DL domains, the hybrid method CNN-BiLSTM surpasses the others. The best accuracy acquired by the CNN-BiLSTM model is 90.24% on 15 categories and 95.71% on 3 categories. A Friedman test has been performed on the obtained results to assess statistical significance; for both the 15- and 3-category settings, the results are significant.

1. Introduction

Sentiment analysis (SA), also known as opinion mining, is the process of determining a person's views on a particular topic (Kabir et al., 2023; Islam and Alam, 2023a). SA is the study of human opinion that analyzes people's opinions, feelings, evaluations and judgments towards social entities such as services, products, people, events and organizations (Kabir et al., 2023; Islam and Alam, 2023b). Sentiments can vary across cultures and languages (Nafisa et al., 2023). SA classifies the polarity of a document, i.e., whether the opinion communicated through a blog, review, tweet, news item or comment is positive, negative, or neutral. The traditional approach in SA is to classify a human comment, review, blog or news item into the positive, negative or neutral class. However, human opinions, feelings, evaluations and judgments towards social entities cannot be limited to only these three categories. In a broader sense, the positive class itself carries several different sentimental forms such as happy, love, joy, enthusiasm, fun, care, excited, surprise, relief, bliss and satisfaction. Similarly, the negative category carries various emotional forms such as sad, angry, boring, disgust, fear, hate, worry, troll, sexual and bully. Comments that cannot be assigned to any of the predefined categories are referred to as neutral. The task of correctly identifying the polarity of a comment among predefined categories such as positive, negative or neutral (the traditional approach), or in a broader sense among happy, sad, angry, love, fun, enthusiasm, boring, disgust, fear, hate, worry, troll, sexual, bully, neutral etc., is what we term sentiment analysis (Bitto et al., 2023).

Why is sentiment analysis important? People trust human reviews on a topic more than traditional advertising (Prottasha et al., 2022). Nowadays, anybody purchasing a product or service first seeks out the reviews of previous buyers of a similar product or service, so public opinion towards a product or service is an important issue for buyers, who always keep in mind what ordinary people are saying about it. Almost every organization now maintains its own website from which buyers can order products or services online; since the COVID-19 period, this has become the most common form of product or service dealing (Bhowmik et al., 2022). Organizations also need to know customer opinion about their products to stay competitive: to survive in the marketplace, they continuously analyze customer reviews of their own products and those of rival parties, and by analyzing customer sentiment, companies try to protect their reputation. For all these reasons, opinion mining, also known as sentiment analysis, is a very important topic.

∗ Corresponding author.
E-mail addresses: [email protected] (Md.S. Islam), [email protected] (K.M. Alam).

https://doi.org/10.1016/j.nlp.2024.100069
Received 30 July 2023; Received in revised form 4 March 2024; Accepted 30 March 2024

2949-7191/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).

Nowadays the internet has become a central part of our lives. In this century, one can hardly imagine a day without using the internet, browsing different social media accounts, or posting different types of content on one's profile. Everyone maintains a large network on the internet with which they interact daily (Sumit et al., 2018), so millions of posts, blogs, comments, reviews and opinions are gathered on the internet every day (Habibullah et al., 2023). People express their thoughts, feelings, opinions and evaluations on particular topics, generally in the form of text, in different languages and on different platforms. Numerous research studies have been done on sentiment analysis in English, Chinese, Hindi, Japanese, Arabic and Urdu, while sentiment analysis in the Bengali language remains underexplored (Nafisa et al., 2023; Bitto et al., 2023; Junaid et al., 2022). Few research works have been conducted on sentiment analysis in Bengali due to the lack of resources and datasets/corpora and the complexity of the language (Habibullah et al., 2023; Bitto et al., 2023; Amin et al., 2019). Bangla (Bengali), an ancient Indo-European language, is the seventh most spoken language in the world, used daily by more than 250 million people; it is the primary language of Bangladesh and a secondary language of India (Habibullah et al., 2023; Bhowmik et al., 2022). Its use is becoming more prevalent with the recent growth of online micro-blogging sites (Azmin and Dhar, 2019). Bangladeshis are increasingly involved in online activities such as connecting with friends and family through social media, expressing opinions and thoughts on popular micro-blogging and social networking sites, and commenting on online news portals, online marketplaces and so on (Hassan et al., 2022). This produces a large amount of user-generated information on various sites, which can be used for many applications. Sites need to examine the millions of messages posted daily, extract all posts relevant to a product or service, analyze the various types of user feedback, and finally summarize that feedback into useful information. This task can be done manually by humans, but it is very time consuming and tedious, which is why creating automated systems for sentiment analysis has become so important.

Sentiment analysis is also termed opinion mining, opinion extraction, sentiment extraction, sentiment mining, subjectivity analysis, emotion analysis, review mining, polarity analysis, emotional AI etc. (Hassan et al., 2022; Prottasha et al., 2022). Sentiment analysis has many practical applications, such as product analysis, social media monitoring, market analysis (Nafisa et al., 2023), product review analysis (Bitto et al., 2023), market trend analysis (Bhowmik et al., 2022), customer interest analysis, movie review analysis (Hassan et al., 2022) and political review analysis (Tabassum and Khan, 2019). It is very important for business industries, NGOs, governments and other organizations (Hassan et al., 2022). Sentiment analysis can be performed at three different levels: document level, sentence level and aspect level (Prottasha et al., 2022). The document level assumes that a document holds an opinion on an entity, and the task is to classify whether the entire document expresses a positive or negative sentiment (traditional SA). The sentence level works with individual sentences, deciding whether each sentence is positive, negative or neutral (traditional SA). The aspect level, broadly known as aspect-based sentiment analysis, performs a fine-grained analysis that recognizes aspects of a given document or sentence and the sentiment expressed towards each aspect (Rahman and Dey, 2018). For example, cricket is an aspect on which sentiment analysis can be performed, with all comments or reviews related to that aspect; similarly, restaurant, election, football, world cup, FIFA, movie, cinema, drama or a viral person can be aspects, and SA can be applied to each.

Sentiment analysis is a well-known application of natural language processing (NLP) (Bitto et al., 2023) and is widely implemented using machine learning in different areas (Nafisa et al., 2023). The sentiment of texts can be analyzed effectively with machine learning methods (Shafin et al., 2020): if machine learning systems are trained on benchmark instances of different sentiments/emotions, machines can automatically learn how to detect sentiment without human interaction (Shafin et al., 2020). Supervised machine learning (ML) algorithms such as Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (K-NN) and Support Vector Machine (SVM); ensemble learning (EL) algorithms such as Random Forest (RF), Gradient Boost (GB), XGBoost (XGB) and LightGBM; and deep learning (DL) algorithms such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short Term Memory (LSTM) and Bidirectional Long Short Term Memory (BiLSTM) are all widely applicable to sentiment analysis (Nafisa et al., 2023; Prottasha et al., 2022).

In this work, we propose a method for sentiment analysis of the Bangla language based on a new comprehensive document-level dataset and machine learning and deep learning approaches. The dataset contains a total of 203,463 Bangla comments collected from various microblogging sites. We examine various hybrid feature metrics and various ML, EL and DL algorithms. The contributions of the proposed work are as follows:

1. A newly created comprehensive Bangla sentiment corpus of 203,463 comments from 5 microblogging sites (Facebook, YouTube, Instagram, TikTok, Likee), manually tagged into 15 categories, containing 2,046,150 tokens and 165,319 unique tokens.
2. Validation of the dataset by 40 native Bangla speakers, with a validation accuracy of 94.67%.
3. Examination of various feature extraction techniques such as BOW, N-gram, TF-IDF, TF-IDF-ICF, Word2Vec, FastText, GloVe and Bangla-BERT, grouping them to form hybrid feature metrics and carrying out a comparative study among the applied techniques on the created Bangla sentiment dataset.
4. A novel hybrid feature extraction method (Bangla-BERT+Skipgram), skipBangla-BERT, which outperforms all other techniques.
5. Application of ML, EL and DL algorithms that achieve better performance on different metrics than state-of-the-art techniques; the hybrid CNN-BiLSTM model surpasses the existing state-of-the-art methods.

The rest of the paper is organized as follows: Section 1 briefly describes the basic concepts of sentiment analysis, or opinion mining, and its present state in Bangla natural language processing; Section 2 reviews related works; Section 3 presents the proposed methodology for sentiment analysis of the Bangla language; Section 4 analyzes and discusses the experimental results; and Section 5 concludes the work with some future remarks.

2. Related works

In this modern, technically advanced world, sentiment analysis is a topic of great importance in every language. Some of the recent studies on Bangla SA are discussed and summarized here.

An extensive dataset for sentiment classification of Bangladeshi e-commerce reviews (Daraz and Pickaboo) was presented by Rashid et al. (2024); this study proposed a dataset consisting of 78,130 reviews, including 67,268 positive and 10,862 negative reviews, from a wide range of products. Verse-based emotion analysis of Bangla music from lyrics was proposed by Mia et al. (2024); the authors developed a new dataset comprising 6500 verses from 2152 Bangla songs, manually annotated them into 3 emotion classes, namely Love, Sad and Idealistic, and applied several machine learning and neural network based classifiers. Among the models, BERT outperformed the others with an accuracy of 65%. A machine learning based method for sentiment analysis of Bangla food reviews was proposed by Islam and Alam (2023b), creating a dataset of 44,491 reviews from different Facebook


pages and groups. The authors implemented several machine learning algorithms, namely MNB, SVM, KNN, LR and RF, and various deep learning algorithms, namely CNN, LSTM, GRU, Bi-LSTM, Bi-GRU, CNN-LSTM, CNN-BiLSTM, CNN-GRU and CNN-BiGRU. Among these models, RF and CNN-BiGRU outperformed the others in the ML and DL domains, with accuracies of 88.73% and 90.96% respectively; furthermore, the authors applied the Friedman statistical test and explainable NLP to explain the performance of the best fitted model. BanglaBook, a new large-scale Bangla dataset collected from book reviews, was proposed by Kabir et al. (2023) and consists of 158,065 samples; the authors used BOW and N-grams as feature extraction methods, then implemented ML and DL classifiers and obtained the highest f1-score of 0.9331 with BERT. Another work, from Nafisa et al. (2023), compiled a method for bipolar SA of online news comments, implementing six ML models along with BOW and TF-IDF transformers and a DL approach, LSTM, with the Word2Vec metric; they obtained the highest accuracies of 80% with RF and 83% with LSTM. Another new study was performed by Bitto et al. (2023) on user reviews collected from food delivery startups: they collected 1400 reviews from 4 food delivery Facebook pages, applied bipolar SA and, using ML and DL algorithms, obtained the highest accuracies of 89.64% with XGB and 91.07% with LSTM. An extended lexicon dictionary based method was proposed by Bhowmik et al. (2022), where the authors utilized DL algorithms deployed on two aspect-based datasets collected from Kabir et al. (2023) and obtained the highest accuracy of 84.18% using a hybrid BERT-LSTM model. In Hassan et al. (2022), the authors proposed a method for Bangla conversation reviews: they collected 1141 samples from Bangla movies and short film scripts, implemented seven ML algorithms and recorded the highest accuracy of 85.59% with SVM. Another study, by Prottasha et al. (2022), focused on a transfer learning strategy of BERT based supervised fine-tuning; they examined 6 different publicly available datasets, obtained the highest results with the hybrid CNN-BiLSTM model along with BERT based fine-tuning, and showed experimentally that BERT outperforms the Word2Vec, FastText and GloVe feature extraction techniques. Another DL based study was performed in Alvi et al. (2022) using LSTM, GRU and BLSTM classifiers with 10-fold cross validation, achieving the highest accuracy score of 78.41%.

A method for Bangla text sentiment analysis using supervised machine learning with an extended lexicon dictionary was proposed by Bhowmik et al. (2022). The authors collected datasets from Rahman and Dey (2018): two aspect-based datasets, the Cricket dataset with a total of 2059 comments and the Restaurant dataset with 2979 comments, both covering three sentiment categories (positive, negative and neutral). They created a weighted categorical lexicon data dictionary (LDD) for extracting sentiments from Bangla texts. There were a total of 1056 and 1115 active sentimental tokens, as well as 970 and 2190 contradictory tokens, in the Cricket and Restaurant datasets respectively. They also developed weighted lists of dictionaries for Bangla conjunctions, adjectives and adverb quantifiers, along with a rule-based BTSC algorithm of 30 steps that can classify the polarity of a Bangla document or sentence. The BTSC algorithm works on the basis of the LDD and POS tagging and produces an SCS value (score of sentence) that is either less than 0 (negative category), greater than 0 (positive category) or equal to 0 (neutral category).

For product review sentiment analysis, a method was proposed by Shafin et al. (2020) using NLP and machine learning in the Bangla language. Online marketing became very popular after the COVID-19 period, and Bikroy, Daraz, Evaly and Chaldal.com are some popular e-commerce sites in Bangladesh (Shafin et al., 2020). The authors collected 1020 reviews from Bangla e-commerce sites, preprocessed their data, and used TF-IDF to extract important features from the dataset before moving to ML models. Their dataset contained 52.2% positive reviews and 47.8% negative reviews. They implemented five supervised ML algorithms, namely SVM, Random Forest, KNN, Logistic Regression and Decision Tree, and examined 30%, 40%, 50%, 60% and 70% of the data as the test set. Using 30% of the data as the test set, they obtained accuracies of 88.81% with SVM, 85.92% with Random Forest, 80.14% with KNN, 88.09% with Logistic Regression and 83.03% with Decision Tree. In the test set the real reviews were distributed as 62.5% positive and 37.5% negative, while the model predicted 58.3% positive and 41.7% negative.

An automated system for sentiment analysis of Bangla text using supervised learning techniques was proposed by Tuhin et al. (2019). The authors implemented both sentence and document level sentiment analysis. They created a sentiment dataset consisting of 7500 sentences, which they tagged manually into six basic emotion categories, namely happy, sad, excited, angry, tender and scared. From these six emotion categories, they mapped happy, excited and tender to the higher-level positive category and the remaining three (sad, angry and scared) to the negative class. They split their corpus into training and test sets using the holdout method, taking 7400 sentences for training and only 100 sentences for testing. They implemented two classification algorithms, Naive Bayes and a topical approach, and compared their work with two other papers on sentiment analysis of Bengali texts. The topical approach acquired the highest accuracy, above 90%, but only on the 100 test sentences.

Two new datasets for aspect-based sentiment analysis (ABSA) in the Bangla language were introduced by Rahman and Dey (2018): the Cricket dataset and the Restaurant dataset. A total of 2900 comments from the cricket domain covering 5 aspect categories make up the Cricket dataset, and 2600 reviews from restaurants make up the Restaurant dataset. For the Cricket dataset, the authors collected comments from different Facebook pages, BBC Bangla, the Daily Prothom Alo etc. The Cricket dataset was annotated into batting, bowling, team, team management and other aspects by six native annotators — second-year graduate students, one faculty member, one MS student and two employees of the Institute of Information Technology, University of Dhaka. They validated the Cricket dataset based on Zipf's law and measured an intraclass correlation value of 0.71 to validate the annotation method. To build the Restaurant dataset, they took direct assistance from the standard English Restaurant dataset: abstract translations of all comments, with their proper annotations, were made into Bangla. The main English dataset contained a total of 2800 comments. The Restaurant dataset was based on five aspect categories: food, price, service, ambiance and miscellaneous. Both datasets were labeled with three target sentiment labels: positive, negative and neutral. They applied TF-IDF to extract features from the texts and implemented three supervised machine learning algorithms: SVM, Random Forest and KNN. On the Cricket dataset they obtained the highest f1-score of 0.37 with the Random Forest classifier, while the highest f1-score on the Restaurant dataset was 0.42, from KNN.

A method for sentiment analysis of Bangla and Romanized Bangla text using deep recurrent models was proposed by Hassan et al. (2016). The authors considered standard Bangla, Banglish (a mix of Bangla and English words) and Romanized Bangla in this research work. They created a dataset with 9337 samples (the BRBT dataset), of which 6698 samples are Bangla and 2639 samples are Romanized Bangla (the RB dataset). They collected data from five different sources — 4621 samples from Facebook, 2610 from Twitter, 801 from YouTube, 1255 from online news portals, and 50 from product review pages — and tagged them into three emotion categories: positive, negative and ambiguous. They engaged two native Bangla human experts to annotate all the test samples, for a total of two validations, with neither annotator knowing anything about the other's decisions. To show the validity of the tagging procedure, they presented a confusion matrix over all the annotated test samples, from which it was shown that the annotators agreed on 75% of the test samples. They mainly used Recurrent Neural Networks (RNN); if we want to be more specific,


Table 1
Summary of the recent works on sentiment analysis of Bangla text.

Name                     Year  Dataset size  Classes  Ownership  Public?  Feature metric(s)                     Best model (ML)  Best model (DL)
Mia et al. (2024)        2024  6500          3        Self       No       TF-IDF, GloVe                         SGD              BERT
Rashid et al. (2024)     2024  78,130        2        Self       Yes      N/A                                   N/A              N/A
Islam and Alam (2023b)   2023  44,491        3        Self       No       TF-IDF                                RF               CNN-BiGRU
Kabir et al. (2023)      2023  158,065       3        Self       No       BOW, N-Gram                           RF               Bangla-BERT
Bitto et al. (2023)      2023  1400          2        Self       No       Word2Vec                              XGB              LSTM
Hassan et al. (2022)     2022  1141          3        Self       No       N/A                                   N/A              Bangla-BERT
Junaid et al. (2022)     2022  1040          2        Self       No       BOW, N-Gram, TF-IDF, Word2Vec, GloVe  LR               LSTM
Prottasha et al. (2022)  2022  2900, 2600    3        Collected  Yes      Word2Vec, FastText, GloVe, BERT       SVM              CNN-BiLSTM
Bhowmik et al. (2022)    2022  2900, 2600    3        Collected  Yes      Word2Vec                              N/A              BERT-LSTM
Alvi et al. (2022)       2022  7000          3        Collected  Yes      Word2Vec                              N/A              GRU

we will say they used LSTM based neural networks, which contained three layers, namely an embedding layer, an LSTM layer and a fully connected layer containing three nodes. They obtained a maximum accuracy of 78% with 2 categories and 70% with 3 categories on the Bangla dataset, and 55% accuracy with 2 categories and 22% with 3 categories on the BRBT dataset, using categorical cross entropy loss.

A method for performing sentiment analysis on Bangla microblog posts was proposed by Chowdhury and Chowdhury (2014). The authors collected 1300 Bangla tweets using the Twitter API and created their dataset. Instead of manual annotation, they applied a semi-supervised bootstrapping method (constructing a lexicon dictionary of 737 single words) to annotate the tweets into positive or negative sentiment categories. They split their dataset into training and test sets using the holdout method, leaving 1000 instances for training and 300 for testing. Two state-of-the-art supervised learning models, namely SVM and MaxEnt, were used in this study. Thirteen different features were used for both classifiers, and f-scores were measured for both the positive and negative categories. They achieved the highest f-score of 0.93 and accuracy of 93% using SVM for both categories with the (unigram+emoticon) feature.

2.1. Observations from literature review

We have several observations from the literature studied. A summary of the recent works on sentiment analysis of Bangla text is given in Table 1. The observed findings are summarized in the subsequent sections.

2.1.1. Small dataset size

An annotated, high-quality dataset is the prerequisite of any NLP based classification task (Rahman, 2019). The dataset developed by Kabir et al. (2023) contains 158,065 samples; the dataset of Bitto et al. (2023) contains only 1400 reviews from Facebook pages; the dataset of Hassan et al. (2022) consists of 1141 samples from Bangla movies and short film scripts; the dataset of Hassan et al. (2016) has only 9337 samples (the BRBT dataset); the dataset of Shafin et al. (2020) contains 1020 reviews from Bangla e-commerce sites; the dataset of Tuhin et al. (2019) contains 7500 sentences; and the dataset of Chowdhury and Chowdhury (2014) contains 1300 Bangla tweets. Two aspect-based datasets on Cricket (2900 samples) and Restaurant (2600 samples) were proposed by Rahman and Dey (2018) and later used by the authors of Bhowmik et al. (2022) and Prottasha et al. (2022). From the above study, it is clear that no benchmark dataset is yet available for sentiment analysis of the Bangla language, so there is scope to build one or more benchmark datasets for Bangla sentiment analysis.

2.1.2. Fewer number of categories (polarity labels)

The traditional approach in SA is to classify a human review into the positive, negative or neutral class. The authors of Kabir et al. (2023), Bitto et al. (2023), Hassan et al. (2022, 2016), Shafin et al. (2020), Chowdhury and Chowdhury (2014), Rahman and Dey (2018), Bhowmik et al. (2022) and Prottasha et al. (2022) used this traditional approach. Another work, from Tuhin et al. (2019), used six basic emotion categories, namely happy, sad, excited, angry, tender and scared. So there is scope to consider more polarity labels in Bangla SA, instead of only the traditional classes, which can better reflect the actual feelings of human beings.

2.1.3. The use of traditional feature extraction methods

For text mining related works, TF-IDF, N-gram and BOW are the most commonly used traditional feature extraction metrics for ML models; the authors of Kabir et al. (2023), Nafisa et al. (2023), Shafin et al. (2020) and Tuhin et al. (2019) used these metrics. A new feature extractor named the lexicon data dictionary (LDD) was used by Bhowmik et al. (2022). The authors of Nafisa et al. (2023) and Prottasha et al. (2022) used the word embedding models Word2Vec and BERT respectively, while the authors of Chowdhury and Chowdhury (2014) used 13 different hybrid feature extractors. Therefore, most of the works mainly used traditional feature extraction methods, so there is scope to apply different hybrid feature extraction methods to sentiment analysis of the Bangla language.

2.1.4. Domination of deep learning models over fundamental machine learning models

Most of the recent methods for Bangla sentiment analysis focus on both ML and DL models, but the results obtained by DL models surpass those of the traditional ML models. The methods proposed by Kabir et al. (2023), Bitto et al. (2023), Bhowmik et al. (2022) and Prottasha et al. (2022) achieved better results with DL models, as did others. So there is a clear domination of deep learning models over the fundamental machine learning models.


new social sites like TikTok and Likee (small video community) so we
have also considered these two new sites. Bangla comments are rare
on Instagram, so we could not collect more comments from Instagram.
We have collected data using our self-developed crawler and named it
as Sentiment Analysis Dataset Crawler (SAD_crawler). The pseudocode
of the developed crawler is shown in Algorithm 1 where X, Y, P and Q
are all dynamic classes; getElementsByClassName(), replace(), trim(),
include() are several used methods and copied_content is the output
variable to store Bangla comments.
Algorithm 1 SAD_Crawler
1: Take a content d to collect Bangla comments
2: Take an empty variable alldata to store comments
3: alldata ← NULL
4: englishWordPattern (EWP) ← [a-zA-Z]
5: comments ← d.getElementsByClassName(‘‘X")[0]
6: c ← comments
7: length ← c.getElementsByClassName(‘‘Y").length
8: for 𝑖 𝑖𝑛 𝑙𝑒𝑛𝑔𝑡ℎ do
9: rc ← getElementsByClassName("P")[0]
10: rc ← rc.innerText.replace(EWP, " ")
11: co ← c.getElementsByClassName("Y")[i].rc
12: if co.trim() == NULL then
13:   continue
14: end if
15: alldata ← alldata + co + "\t"
16: try
17:   r ← d.getElementsByClassName("Y")
18:   s ← r[i]
19:   t ← s.getElementsByClassName("Q")
20:   likes ← t[0].innerText
21:   if likes.includes("likes") then
22:     alldata ← alldata + likes + "\n"
23:   else
24:     alldata ← alldata + " " + "\n"
25:   end if
26: catch Expected exception
27:   alldata ← alldata + " " + "\n"
28: end try
29: end for
30: copied_content ← copy(alldata)
31: Output: the copied_content variable holds each Bangla comment and its visible reaction number (likes).
32: Paste the copied content into an Excel file to save the data.

Fig. 1. Workflow of the proposed Bangla SA system.

3. Proposed methodology for sentiment analysis on Bangla language

Sentiment analysis (SA) is the mining and study of human opinion: it analyzes people's opinions, feelings, evaluations and judgments towards social entities such as services, products, people, events and organizations (Nafisa et al., 2023). An annotated, high-quality dataset is the prerequisite of any NLP-based classification task (Rahman, 2019). In this work (SA_Bangla), we propose a method for Bangla SA using a new comprehensive dataset and various hybrid feature extraction techniques. The work starts with data collection and then proceeds through data preprocessing, data visualization, splitting the dataset into training and test sets, feature extraction, model building and result evaluation. Fig. 1 illustrates the workflow of the proposed method.

3.1. Data collection

Micro-blogging sites are currently used by a large number of Bangla speakers (Chowdhury and Chowdhury, 2014): millions of people comment and text in Bangla on social media platforms such as Facebook, YouTube, Instagram, TikTok and Likee. Using our self-developed crawlers, we collected a total of 203,463 Bangla comments from 5 micro-blogging sites and saved them in an Excel file with 5 columns: comment, basic sentiment category, higher sentiment category, reaction number and data source. Six people were involved in the data collection process (4 males and 2 females) and 5 in the data annotation process (3 males and 2 females); their details are given in Table 2. The overall summary of the domain-based data collection is presented in Table 3. We collected more data from Facebook and YouTube because Bangla comments are more abundant there.

3.2. Data annotation and validation

We manually annotated the collected 203,463 Bangla comments with five human experts (4 graduate students and 1 MS student) into 15 basic sentiment categories: happy, sad, angry, enthusiasm, fun, love, sexual, boring, disgust, surprise, fear, worry, hate, relief and neutral; the annotation took around six months. The 15 base emotion classes are subsequently mapped to 3 higher sentiment categories: happy, enthusiasm, fun, love, surprise and relief belong to the positive category, the remaining classes belong to the negative category, and unidentified or mixed comments belong to the neutral category. To test the validity of the annotation process, we conducted an analysis with 40 native Bangla speakers, all graduate students of North Western University, Bangladesh; each student was given 100 different comments and asked to annotate them into the 15 predefined sentiment categories. After running this audit on a total of 4000 comments (800 samples each from Facebook, YouTube, Instagram, TikTok and Likee), we found 94.67% accuracy. The confusion matrix of the annotation process is shown in Fig. 2, and the complete description of the dataset is given in Table 4, where we report category-based measurements including the number of total comments, tokens, unique tokens and the top 2 topic keywords.
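The DOM walk in steps 9–32 of the crawler runs in the browser console; the same extraction can be approximated offline with Python's standard library. Below is a minimal sketch that assumes comments sit in elements of class "Y" with the visible like count in a nested class-"Q" element — these class names simply mirror the obfuscated names in the algorithm and are placeholders, not the sites' real markup, and the romanized sample comments are hypothetical:

```python
from html.parser import HTMLParser

class CommentScraper(HTMLParser):
    """Collects (comment, likes) pairs: comment text from class-"Y" blocks,
    the visible like count from a nested class-"Q" element (steps 11-25)."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_comment = False
        self.in_likes = False
        self.comment = ""
        self.likes = ""

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == "Y":
            self.in_comment = True
            self.comment, self.likes = "", ""
        elif cls == "Q" and self.in_comment:
            self.in_likes = True

    def handle_endtag(self, tag):
        if self.in_likes and tag == "span":
            self.in_likes = False
        elif self.in_comment and tag == "div":
            self.in_comment = False
            if self.comment.strip():                                 # step 12: skip empty comments
                likes = self.likes if "likes" in self.likes else ""  # step 21
                self.rows.append((self.comment.strip(), likes))

    def handle_data(self, data):
        if self.in_likes:
            self.likes += data
        elif self.in_comment:
            self.comment += data

# Toy page; the romanized comments are hypothetical samples.
html = ('<div class="Y">darun laglo<span class="Q">25 likes</span></div>'
        '<div class="Y">  </div>'
        '<div class="Y">khub bhalo<span class="Q">7 likes</span></div>')
scraper = CommentScraper()
scraper.feed(html)
# steps 15/22: tab-separated "comment<TAB>likes" rows, one per line
alldata = "".join(f"{c}\t{l}\n" for c, l in scraper.rows)
```

The tab/newline layout of `alldata` matches what the algorithm pastes into the Excel file, one comment and its like count per row.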

5
Md.S. Islam and K.M. Alam Natural Language Processing Journal 7 (2024) 100069

Table 2
Information of data collectors and annotators.

ID of participant  Gender  Profession         Role                           Collection amount  Annotation amount
p1                 Male    MS student/author  Data collection and annotation  83,652             94,197
p2                 Male    Faculty/author     Data collection                 10,545             N/A
p3                 Male    Graduate student   Data collection and annotation  35,760             35,760
p4                 Male    Graduate student   Data collection and annotation  40,647             40,647
p5                 Female  Graduate student   Data collection and annotation  17,350             17,350
p6                 Female  Graduate student   Data collection and annotation  15,509             15,509

Fig. 2. Confusion matrix of annotation process.

Table 3
Summary of domain based data collection.

Data source  No. of comments  Collection period
Facebook     71,429           2022–2023
YouTube      42,884           2022–2023
Instagram    12,764           2022–2023
TikTok       52,367           2022–2023
Likee        24,019           2022–2023

3.3. Problem definition

Now that we have an annotated sentiment dataset, we need to state precisely what we want to do with it. The task of correctly assigning the polarity of a document to predefined categories such as positive, negative or neutral (the traditional approach), or in a broader sense to happy, sad, angry, love, fun, enthusiasm, boring, disgust, fear, hate, worry, troll, sexual, bully, neutral etc., is termed sentiment detection (Bitto et al., 2023). The problem is to classify sentiments correctly from a labeled dataset. Consider a document D_sent in the dataset containing many sentences with a total word count of N, where P_comt is a vocabulary of K words. S_categ represents the categories of the different sentiments, where r is the total number of sentiment category labels. E_x is the required output sentiment label for a test instance x.

D_sent = (B_1, B_2, B_3, B_4, B_5, ..., B_N)    (1)

P_comt = (V_1, V_2, V_3, V_4, V_5, ..., V_K)    (2)

S_categ = (l_1, l_2, l_3, l_4, l_5, ..., l_r)    (3)

The output is represented as:

E_x = F_max(V_r, P_r) D_sent    (4)


Table 4
Overview of category-wise data collection.

3.4. Balance dataset

When a dataset's distribution of examples among its classes is noticeably unbalanced, the term "data imbalance" is used (Chawla et al., 2002). The issue of imbalanced data sets arises in many real-life situations, and this class imbalance can hamper a learning model's ability to reliably predict the actual sentiment category. When one class has many more examples than the others, the resulting models are biased and their predictions for the minority class are incorrect. By balancing the data, learning models can produce more precise predictions for all classes, particularly for the minority class; this enhances decision-making and guards against biased results. To improve predictions and eliminate prejudice towards the dominant class, imbalanced data must be balanced (Chawla et al., 2002).

3.4.1. SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that corrects class imbalance by creating synthetic samples for the minority class. It seeks to boost the dataset's representation of the minority class and enhance the effectiveness of machine learning models (Chawla et al., 2002). There are many variants of SMOTE, such as the fundamental SMOTE, SMOTENC, SMOTEN, ADASYN, BorderlineSMOTE, KMeansSMOTE and SVM-SMOTE (Chawla et al., 2002). In this work we did not implement any specialized variant; we used the fundamental SMOTE only, which works as follows:

1. Consider a member of the minority class, T_i.
2. From the minority class examples, determine T_i's P closest neighbors. The value of P is a user-set parameter.
3. Randomly select one of these P nearest neighbors, T_j.
4. Create a synthetic example T_new by interpolating between T_i and T_j:

T_new = T_i + b (T_j − T_i)

Here, b is a randomly chosen constant in the range [0, 1]. The equation calculates the difference between T_i and T_j and scales it by b; adding this scaled difference to T_i creates a new example, T_new, which lies on the line segment between T_i and T_j. By repeating this process for various minority class examples, SMOTE generates a collection of synthetic instances that effectively expands the minority class space. These artificial instances are subsequently added to the initial dataset to obtain a balanced representation of the classes.

3.5. Data preprocessing

Comments on microblogging sites contain both Bangla and English punctuation, hashtags (e.g. #), emoticons, slang and so on (Chowdhury and Chowdhury, 2014). Raw comments therefore always contain irrelevant characteristics and noise, and it is very important to eliminate them from the dataset (Akther et al., 2022). Noisy raw data cannot be assigned to the actual sentiment correctly, which is why preprocessing of Bangla comments is so important. In this work, we performed tokenization, removal of punctuation, emoticons and non-Bangla words, stop word removal and stemming as preprocessing steps.

3.5.1. Tokenization

During tokenization, comments are divided into sentences and the sentences are divided into words. The Bangla comment ‘‘ ’’ [i like your works very much] after tokenizing becomes ‘‘ ’’(your), ‘‘ ’’(works), ‘‘ ’’(i), ‘‘ ’’(very), ‘‘ ’’(like).

3.5.2. Punctuation, non-Bangla words and emoticon removal

The commonly used punctuation marks in Bangla are ‘‘ ’’, "?", "!", "-", "," and so on. Punctuation marks, special characters and symbols, especially "#" (hashtags), @, & and braces, have been excluded from the dataset. Non-Bangla words, especially English words, and unnecessary emoticons are also removed from one version of the dataset (Shafin et al., 2020). The comment "what??? ‘‘ ’’" (what??? he also insulted Sheikh Hasina) after the removal of punctuation, non-Bangla words and emoticons becomes ‘‘ ’’ (he also insulted Sheikh Hasina).

3.5.3. Stop words removal

Function words that are used repeatedly in a language but carry no domain-specific meaning, i.e. words repeated in any domain, are referred to as stop words (Rahman and Dey, 2018). Noteworthy Bangla stop words include ‘‘ ’’(rather), ‘‘ ’’(but), ‘‘ ’’(or), ‘‘ ’’(if), ‘‘ ’’(you), ‘‘ ’’(therefore) and ‘‘ ’’(now). We excluded all stop words from our developed dataset using a Bangla stop words list.¹ The Bangla comment ‘‘ ’’ (don't understand why the people of Bangladesh watch Indian serials when there are so many great dramas) becomes ‘‘ ’’ (don't understand the people of Bangladesh Indian serials many great dramas) after removing the stop words: the stop words ‘‘ ’’(so), ‘‘ ’’(when there are), ‘‘ ’’(why) and ‘‘ ’’(watch) have been removed from the comment.

¹ https://www.ranks.nl/stopwords/bengali
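The fundamental SMOTE procedure of Section 3.4.1 can be sketched in a few lines of plain Python. This is a toy illustration on 2-D points, not the implementation used in the paper (which would operate on the extracted feature vectors):

```python
import random

def smote_samples(minority, n_new, p_neighbors=3, seed=42):
    # Fundamental SMOTE: for each synthetic point, pick a minority example T_i,
    # one of its P nearest neighbors T_j, and interpolate
    # T_new = T_i + b * (T_j - T_i) with b drawn uniformly from [0, 1].
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        ti = rng.choice(minority)
        # P nearest neighbors of T_i by squared Euclidean distance (excluding T_i)
        neighbors = sorted((x for x in minority if x is not ti),
                           key=lambda x: sum((a - b) ** 2 for a, b in zip(ti, x)))
        tj = rng.choice(neighbors[:p_neighbors])
        b = rng.random()
        synthetic.append(tuple(a + b * (c - a) for a, c in zip(ti, tj)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.8), (2.0, 2.1), (0.9, 1.1)]
new_points = smote_samples(minority, n_new=4)
```

Because each synthetic point lies on a segment between two minority examples, every generated coordinate stays inside the bounding box of the minority class.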


Table 5
Top 10 frequent n-grams without removing stop words of the dataset.

3.5.4. Stemming

Stemming is the process of reducing a word to its base or root form (Chowdhury and Chowdhury, 2014). Stemming algorithms aim to remove suffixes from words so that they can be matched with other words that share the same root. For example, the words ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ and ‘‘ ’’ all have the same root word ‘‘ ’’. By stemming these words, we can match them with other words that have the same root, such as ‘‘ ’’ or ‘‘ ’’. The Bangla comment ‘‘ ’’ (yesterday i was saddened to see the picture of the motherless child on facebook) becomes ‘‘ ’’ after stemming. We applied stemming to our dataset as the final preprocessing step.

3.6. Statistical dataset visualization and analysis

Statistical analysis is very important in almost every subject for drawing general observations from experiments. In this section, we analyze various statistical language phenomena of the dataset: the usage of n-grams, average word length, character-level analysis, frequency of characters, Zipf's law and type-token information. These analyses give a good understanding of the dataset and its linguistic usefulness.

3.6.1. Unigram

In probability and computational linguistics, a unigram is an n-gram made up of just one element from a particular sample of text or speech (Akther et al., 2022). At this level, non-stemmed unigrams for the entire dataset have been extracted from the preprocessed data. Table 5 lists the top 10 unigrams, bigrams and trigrams in the dataset. Any language's function words, or stop words, are always its most common words, and they are crucial for determining a dataset's quality. We found that 32.84% of the total tokens in our dataset are function words. We removed the stop words from one version of the dataset in order to obtain the content words; Table 6 lists the top 10 frequently occurring unigrams, bigrams and trigrams after the stop words have been eliminated. Additionally, we recorded the total number of unique unigrams present in the dataset, providing insight into its diversity and vocabulary richness. Without removing the stop words, ‘‘ ’’(no) is the first-ranked word in the dataset with an occurrence frequency of 31,973; after removing the stop words, ‘‘ ’’(good) is the first-ranked word with an occurrence frequency of 13,461.

3.6.2. Bigram

A bigram is a pair of adjacent words from a specific passage of text; for n = 2, an n-gram is a bigram. In many applications, such as computational linguistics and NLP-based work, the frequency distribution of bigrams in a text is employed for quick statistical analysis (Akther et al., 2022). At this level, we extracted bigrams from non-stemmed words. The numbers of bigrams under four threshold frequencies are displayed in Table 7. The total number of times two consecutive words appear together in the dataset is referred to as the frequency, or count, of that bigram. We used 4 threshold frequencies, 20, 50, 100 and 200, to observe the behavior of bigrams in the dataset. The term threshold frequency describes the cutoff for accepting or rejecting a bigram: with a threshold frequency of 20, a bigram's count must be at least 20 for it to be accepted; otherwise it is rejected. The proposed dataset contains 8463 unique bigrams out of a total of 494,682 bigrams at threshold frequency 20. When the threshold frequency is increased to 200, the total number of bigrams and the number of unique bigrams decrease to 131,758 and 321 respectively. The most frequent bigram of the proposed dataset is ‘‘ ’’(very nice), which appears 4083 times in the dataset.

3.6.3. Trigram

A trigram is a grouping of three adjacent words or phrases from a text or speech sample; for n = 3, an n-gram is a trigram (Akther et al., 2022). Like bigrams, the frequency distribution of trigrams can be useful for straightforward statistical analysis of text. The same threshold frequencies have been used to extract trigrams from the dataset; Table 8 displays their effect. The proposed dataset contains 1875 unique trigrams out of a total of 85,208 trigrams at threshold frequency 20. Interestingly, when the threshold frequency is increased to 200, the total number of


Table 6
Top 10 frequent n-grams after removing stop words of the dataset.
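The preprocessing pipeline of Section 3.5 — tokenization, punctuation removal, stop word removal and suffix stemming — might be sketched as follows. The stop word set and suffix list here are romanized, hypothetical stand-ins for the actual Bangla resources (e.g. the ranks.nl list) used in the paper:

```python
import re

# Assumed toy resources (romanized stand-ins, NOT the paper's real word lists)
STOP_WORDS = {"keno", "jokhon", "eto", "dekhe"}      # hypothetical stop words
SUFFIXES = ["gulo", "ta", "der", "er"]               # hypothetical suffixes

def preprocess(comment):
    # 1) strip punctuation / symbols, then tokenize on whitespace
    comment = re.sub(r"[!?#@&,.\-]+", " ", comment)
    tokens = comment.lower().split()
    # 2) stop word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3) naive longest-suffix stemming (a stand-in for a real Bangla stemmer)
    stemmed = []
    for t in tokens:
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("natokgulo keno eto bhalo!!"))   # → ['natok', 'bhalo']
```

The ordering mirrors the paper's pipeline: noise removal first, stop word filtering next, and stemming as the final step.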

Table 7
Threshold frequency wise Bigram.

Threshold frequency   No. of all bigrams   No. of all unique bigrams
20                    494,682              8463
50                    317,844              2442
100                   214,827              926
200                   131,758              321

Table 8
Threshold frequency wise Trigram.

Threshold frequency   No. of all trigrams   No. of all unique trigrams
20                    85,208                1875
50                    39,936                358
100                   23,355                106
200                   13,425                36

Fig. 3. Usage of words vs. word length of the dataset.

trigrams is 13,425 and the number of unique trigrams is only 36. The top 2 most frequent trigrams of the proposed dataset are ‘‘ ’’(plz plz plz) and ‘‘ ’’(feel very good), which appear 1433 and 602 times respectively in the dataset.
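The threshold-frequency filtering used to build Tables 7 and 8 can be reproduced with a few lines of Python; the tokenized toy comments below are romanized stand-ins:

```python
from collections import Counter

def ngram_counts(tokenized_comments, n):
    # Count n-grams within each comment (no n-grams across comment boundaries).
    counts = Counter()
    for tokens in tokenized_comments:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def apply_threshold(counts, threshold):
    # Keep only n-grams whose count is at least the threshold frequency;
    # return (total occurrences kept, number of unique n-grams kept),
    # the two quantities tabulated in Tables 7 and 8.
    kept = {g: c for g, c in counts.items() if c >= threshold}
    return sum(kept.values()), len(kept)

comments = [["khub", "bhalo"], ["khub", "bhalo", "laglo"], ["bhalo", "laglo"]]
bigrams = ngram_counts(comments, 2)
total, unique = apply_threshold(bigrams, threshold=2)
```

Running the same two functions with n = 3 and thresholds 20/50/100/200 over the full tokenized dataset yields the trigram counts of Table 8.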

3.6.4. Average word length

The dataset contains a total of 11,660,683 characters (excluding punctuation and spaces), with an average of 5.69 letters per word. In everyday English text, an average of 4.5 letters is used per word (Dash, 2005). Bangla words are longer than English words because the Bangla alphabet comprises 11 vowels, 39 consonants and 20 allographs, whereas English has only 5 vowels and 21 consonants (Akther et al., 2022). Fig. 3 plots the percentage of words used in our dataset against word length; the top 2 occurrence frequencies are recorded for 4- and 5-letter words, at 17.25% and 16.69% respectively. We found that 32.84% of the total tokens in our dataset are function words, and Fig. 4 is affected by their frequent use. The average number of letters per word among the unique words increases to 7.84 (see Fig. 4), where the top 2 occurrence frequencies are recorded for 7- and 8-letter words, at 14.67% and 14.89% respectively. Table 9 describes the relationship between word length and the frequency of n-letter (n = 1 to 7) words, showing the top 10 n-letter words in our dataset.

Fig. 4. Usage of unique words vs. word length of the dataset.
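The word-length statistics behind Figs. 3 and 4 reduce to a small frequency computation; a minimal sketch on hypothetical romanized tokens:

```python
from collections import Counter

def length_distribution(tokens):
    # Percentage of tokens having each word length, plus the average length
    # (the quantities plotted in Figs. 3 and 4).
    lengths = Counter(len(t) for t in tokens)
    total = len(tokens)
    pct = {n: 100.0 * c / total for n, c in sorted(lengths.items())}
    avg = sum(len(t) for t in tokens) / total
    return pct, avg

tokens = ["ami", "bhalo", "achi", "khub", "bhalo"]
pct, avg = length_distribution(tokens)
```

Applying the same function to the set of unique tokens (rather than all tokens) gives the unique-word curve of Fig. 4.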


Table 9
Top 10 frequent N-letter (n = 1 to 7) words of the dataset.

Table 10
Percentage of occurrence of each letter of the dataset.

3.6.5. Character level analysis

Instead of considering words as the basic unit of analysis, character-level analysis focuses on the individual characters that make up the text. By examining individual characters and their context, different errors or inconsistencies can be identified and appropriate corrections can be suggested (Akther et al., 2022). Based on our dataset, we determined the percentage of times each Bangla character occurs. The top 5 most frequently used characters are ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ and the least frequently used characters are ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ respectively. Table 10 demonstrates the percentage of times each Bangla character is used in our dataset. A statistical study of the dataset at the character level reveals that 5.36% of the characters are vowels, 45.01% are consonants and 48.79% are allographs. The top 2 most frequently used characters are both allographs, ‘‘ ’’(Aa-kar) and ‘‘ ’’(e-kar), covering 10.05% and 6.90% of the characters of the dataset. Another character-level statistical study was conducted on the vocabulary (the unique words of the dataset) to observe how it behaves. The percentages of occurrence of the initial letter of the unique words of the dataset are listed in Table 11. We noticed that the letters ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ are more likely to occur as the first letter than the other letters; the words that start with the letter ‘‘ ’’ cover 11.47% of the words in the dataset.

3.6.6. Zipf's law

Human inspection cannot be expected to guarantee the quality of a dataset with millions of words. So, the Zipf distribution of the dataset is examined to check whether it reflects natural vocabulary usage. According to Zipf's law, there is a correlation between a word's frequency (d) and its rank (p) in the list obtained by ordering all the words of a large dataset by their frequency of occurrence (Manning and Schutze, 1999):

d ∝ 1 / p

For example, this means that the 50th most frequent word should appear with twice the frequency of the 100th most frequent word, and so on. Fig. 5 depicts the Zipf curve of the dataset, where the x and y axes represent the logarithmic rank and frequency of words respectively. The curve is roughly linear (see Fig. 5), which shows that our dataset approximately follows Zipf's law.

3.6.7. Hapax legomena and vocabulary growth

Table 12 shows the type-token information of the dataset. The total number of word types is 165,319, and the number of word types occurring once is 108,599. Words that appear only once in a dataset are referred to as hapax legomena, and they typically make up about half of the word types (Akther et al., 2022); in our dataset, more than half of the word types are hapax legomena. The vocabulary growth rate is measured as:

G = W(1) / N    (5)

Here the parameters G, W(1) and N represent the vocabulary growth rate, the number of word types occurring once, and the total number of words in the dataset respectively. For our dataset, the vocabulary growth rate is 108,599 / 2,046,150 = 0.053, which is reasonable.
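Eq. (5) and the Zipf-curve points can both be computed from a single frequency table; a minimal sketch on a toy token list:

```python
from collections import Counter
import math

def corpus_stats(tokens):
    freq = Counter(tokens)
    ranked = [c for _, c in freq.most_common()]        # frequencies ordered by rank
    hapax = sum(1 for c in freq.values() if c == 1)    # word types occurring once, W(1)
    growth = hapax / len(tokens)                       # G = W(1) / N  (Eq. (5))
    # log-log (rank, frequency) points for a Zipf plot like Fig. 5
    zipf = [(math.log(r), math.log(c)) for r, c in enumerate(ranked, start=1)]
    return freq, hapax, growth, zipf

tokens = ["bhalo"] * 4 + ["khub"] * 2 + ["natok", "gaan"]
freq, hapax, growth, zipf = corpus_stats(tokens)
```

Plotting the `zipf` points for the full dataset and checking that they fall on a roughly straight line is exactly the visual test applied in Fig. 5.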


Table 11
Percentage of occurrence of initial letter of unique words of the dataset.

Table 13
Feature vectors for BOW.
[3, 1, 1, 2, 0, 0, 0, 0, 0]
[3, 0, 0, 0, 1, 1, 0, 0, 0]
[3, 0, 0, 2, 0, 0, 1, 1, 1]

3.7. Feature extraction

Extracting relevant features from textual data is an important aspect of NLP (Rahman and Dey, 2018). ML-based models cannot deal with data in textual form directly, so an intermediate process must bridge the raw Bangla text data and the ML model by transforming the raw textual data into some strategic numerical form, i.e. by extracting relevant features from the texts, a process broadly known as feature extraction (Islam and Alam, 2023a; Sharmin and Chakma, 2021). These features can then be used to train ML models to perform various classification tasks such as document classification, sentiment analysis and many others (Tabassum and Khan, 2019). There are several well-known metrics for extracting features from texts, such as BOW, TF-IDF and word embeddings (Prottasha et al., 2022).

3.7.1. Bag of words (BOW)

BOW, which works on the basis of term frequencies, is frequently used in NLP (Kabir et al., 2023) and is one of the simplest ways of extracting features from texts. Consider a Bangla sample dataset of three comments, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ (i do not feel good. i am upset. do not mess with me). Counting the bag of words gives the features ‘‘ ’’(my): 3, ‘‘ ’’(good): 1, ‘‘ ’’(feel): 1, ‘‘ ’’(no): 2, ‘‘ ’’(am): 1, ‘‘ ’’(upset): 1, ‘‘ ’’(with): 1, ‘‘ ’’(mess): 1, ‘‘ ’’(do): 1; for simplicity we do not show the preprocessing steps here. The dataset thus produces the feature vectors for the three comments shown in Table 13.

Fig. 5. Zipf's curve of the dataset.

Table 12
Type token information of the dataset.

Words frequency count                          No. of words
Total number of word types:                    165,319
Word-types occurring once:                     108,599
Word-types occurring twice:                    41,186
Word-types occurring 3–50 times:               306,382
Word-types occurring 51–100 times:             123,121
Word-types occurring 100–1000 times:           546,685
Word-types occurring 1000–5000 times:          508,160
Word-types occurring 5000–10000 times:         205,424
Word-types occurring more than 10000 times:    206,593

3.7.2. TF-IDF

TF-IDF is the most widely used feature extraction metric for NLP-based classification tasks (Rahman and Dey, 2018); it is calculated according to Eq. (8). The number of times a word occurs in a document is counted as the Term Frequency, TF (Eq. (6)), and how important a word is in a document is measured by the Inverse Document Frequency, IDF (Eq. (7)) (Hassan et al., 2022). In order to extract features from Bangla comments, we employed the well-known TF-IDF (Term Frequency – Inverse Document Frequency) metric.

TF = T / K    (6)

IDF = log_e(D / L)    (7)

TF-IDF = TF × IDF    (8)

Eq. (6) defines TF, where T is the frequency of a word in a comment and K is the total number of words in that comment; Eq. (7) defines IDF, where D denotes the total number of comments for SA and L is the number of comments that contain the concerned word. TF-IDF is therefore determined using Eq. (8).
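The BOW vectors of Section 3.7.1 and the TF-IDF of Eqs. (6)–(8) can be sketched directly; the romanized three-comment toy corpus below is a hypothetical stand-in for the Bangla example:

```python
import math
from collections import Counter

def bow_vectors(docs):
    # Bag-of-words count vectors over a shared, alphabetically ordered vocabulary
    # (the same construction that yields the vectors in Table 13).
    vocab = sorted(set(w for d in docs for w in d))
    return vocab, [[Counter(d)[w] for w in vocab] for d in docs]

def tf_idf(word, doc, docs):
    # TF = T / K (Eq. (6)), IDF = ln(D / L) (Eq. (7)), TF-IDF = TF * IDF (Eq. (8)).
    tf = doc.count(word) / len(doc)
    containing = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / containing)
    return tf * idf

docs = [["amar", "bhalo", "lage", "na", "na"],
        ["amar", "mon", "kharap"],
        ["amar", "sathe", "lage", "na"]]
vocab, vectors = bow_vectors(docs)
score = tf_idf("na", docs[0], docs)
```

Note that a word appearing in every comment (here "amar") gets IDF = ln(1) = 0, so TF-IDF suppresses corpus-wide function words that plain BOW counts would keep.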


Fig. 6. Word embedding using CBOW.
Fig. 7. Word embedding using Skipgram.

3.7.3. TF-IDF-ICF

The term ICF stands for Inverse Class Frequency, introduced by Wang and Zhang (2010), and is calculated according to Eq. (9):

ICF = log_e(1 + P / Q)    (9)

TF-IDF-ICF = TF × IDF × ICF    (10)

Here P denotes the total number of categories and Q is the number of categories that contain the concerned word. TF × IDF × ICF is therefore measured by Eq. (10).

3.7.4. Word embeddings

Word embedding is a vector-based feature extraction metric used in NLP where each word is converted into a fixed-size vector of real numbers (Sumit et al., 2018). Words are represented in a high-dimensional space in which similar words lie very close together; in fact, similar words form clusters. Word embeddings can be implemented using the Word2Vec, FastText or GloVe methods, based on either the CBOW or the Skipgram mechanism (Sumit et al., 2018). Table 14 describes the generation process of the training samples for word embeddings. The CBOW model learns through context words (Camacho-Collados and Pilehvar, 2018) and tries to predict the target word (Fig. 6), whereas the Skipgram model tries to predict the neighbors (context words) of the current word (Fig. 7). The word embedding models are based on shallow neural networks with an input layer, a projection (hidden) layer and an output layer with a softmax activation. Here D is the number of words in the vocabulary and P is the size of the embedding vector (see Figs. 6 and 7).

A common problem with the Word2Vec model, which deals with whole words, is that it cannot handle out-of-vocabulary (OOV) words. An extension of Word2Vec was therefore introduced to solve this issue by exploiting morphological analysis on character-level n-grams; it is known as the FastText model and can also be implemented using CBOW or Skipgram (Rafat et al., 2019). The mechanism is the same, with a slight change in the input layer: instead of whole words, it uses character n-grams to fit the neural network. Table 15 describes the character n-gram generation process of the FastText model. For example, the word ‘‘ ’’(love), after being broken into Bangla characters (vowels, consonants and their short forms, i.e. kar and fola [allographs]), becomes ‘‘ + + + + + + + ’’; with a character n-gram length of 3, the produced n-grams are ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’, ‘‘ ’’ and ‘‘ ’’. Instead of relying on context words like Word2Vec and FastText, another word embedding model called GloVe (Global Vectors) works on global dataset statistics and creates fixed-size vectors using a co-occurrence matrix (Cerqueira et al.). Bangla-BERT (Bangla Bidirectional Encoder Representations from Transformers) is another pre-trained model, built on Google AI's BERT architecture, that works on the basis of transformers and the attention mechanism (Bhattacharjee et al., 2022).

We have utilized different hybrid feature extraction techniques in this work. Fig. 8 briefly describes the feature extraction process using the hybrid skipBangla-BERT method. Bangla-BERT has two encoder representations: Bangla-BERT base (12 encoders) and Bangla-BERT large (24 encoders). We used the pre-trained Bangla-BERT base model² together with the Skipgram shallow neural network model. The pre-trained Bangla-BERT base consists of 12 layers, a hidden size of 768, 12 self-attention heads and 110 million parameters in total (Bhattacharjee et al., 2022), whereas Skipgram consists of a shallow neural network with 1 input layer, 1 projection (hidden) layer and 1 output layer (Rafat et al., 2019). The pre-trained Bangla-BERT base model works on the basis of two mechanisms, namely masked language modeling (MLM) and next sentence prediction (NSP) (Bhattacharjee et al., 2022). The first token of every sequence is represented by a special token ⟨CLS⟩, and another special token ⟨SEP⟩ is used to separate sentences. In Fig. 8, we denote the input masked vectors, input embeddings and embedding layers as BW_i, Bt_i and E_i respectively. Consider two consecutive dummy sentences ‘‘ ’’, ‘‘ ’’ (very nice. this is very serious); they become ⟨CLS⟩ ⟨‘‘ ’’⟩ ⟨‘‘ ’’⟩ ⟨SEP⟩ ⟨‘‘ ’’⟩ ⟨‘‘ ’’⟩ ⟨‘‘ ’’⟩. The masked input representation of the sentences passes through the 12 encoders of the pre-trained model and is built by aggregating the corresponding token embeddings, segment embeddings and position embeddings (Bhattacharjee et al., 2022). The encoders thus generate global padded embeddings for all the tokens. The next sequential layer is the concatenation layer, which contains a feed-forward neural network that generates the final version of the embedded vectors from the end of the Bangla-BERT base model. The output of the concatenation layer is the input of the next sequential Skipgram layer, which projects it further and outputs a 1×768 vector of final embeddings for each comment in the dataset.

3.8. Bangla sentiment analysis model

For performing sentiment analysis on Bangla, we have implemented Bernoulli Naive Bayes (BNB), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) as ML models; Random Forest (RF), XGBoost (XGB) and Gradient Boost (GB) as EL models; Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), Bidirectional Long Short Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) as DL models; and CNN-BiLSTM as a hybrid DL model. Except for the hybrid model, all the algorithms are base algorithms. In this section we describe the hybrid CNN-BiLSTM model. The architecture of the CNN-BiLSTM model is depicted

² https://huggingface.co/sagorsarker/bangla-bert-base


Fig. 8. Feature extraction using skipBangla-BERT mechanism.
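The last stage of the pipeline in Fig. 8 — collapsing per-token 768-d contextual embeddings into a single 1×768 comment vector through a shallow projection — can be illustrated with plain Python. The random token vectors and projection weights below are stand-ins for real Bangla-BERT outputs and trained Skipgram weights, and mean-pooling is used as a simple aggregation; this is a structural sketch, not the authors' implementation:

```python
import random

HIDDEN = 768  # hidden size of Bangla-BERT base

def pool_and_project(token_vectors, weights):
    # Mean-pool the per-token contextual embeddings into one 768-d vector,
    # then apply a single linear projection as a stand-in for the Skipgram stage.
    n = len(token_vectors)
    pooled = [sum(v[i] for v in token_vectors) / n for i in range(HIDDEN)]
    return [sum(pooled[j] * weights[j][i] for j in range(HIDDEN))
            for i in range(HIDDEN)]

rng = random.Random(0)
# 5 fake token embeddings and a random 768x768 projection matrix (stand-ins)
tokens = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(5)]
weights = [[rng.uniform(-0.05, 0.05) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
comment_vec = pool_and_project(tokens, weights)   # one 1x768 vector per comment
```

Whatever the number of tokens in a comment, the output is always a fixed 1×768 vector, which is what lets these features feed the downstream classifiers uniformly.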

Table 14
Training samples generation process for word embeddings using window size 5.

Table 15
Character n-gram generation process of FastText model.
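The training-sample generation of Table 14 and the character n-gram generation of Table 15 can both be sketched in a few lines (the window size is reduced to 2 for brevity — the paper uses 5 — and the romanized words are hypothetical stand-ins):

```python
def skipgram_pairs(tokens, window=2):
    # (center, context) training pairs for the Skipgram model; CBOW uses the
    # same pairs with the roles reversed (context words predict the center).
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def char_ngrams(word, n=3):
    # FastText-style character n-grams of length n, as in Table 15.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

pairs = skipgram_pairs(["ami", "tomake", "bhalobashi"], window=2)
grams = char_ngrams("bhalobasha", n=3)
```

FastText then learns a vector per character n-gram and represents a word as the sum of its n-gram vectors, which is what lets it embed out-of-vocabulary words.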

Fig. 9. Architecture of the CNN-BiLSTM model.

in Fig. 9. It consists of six consecutive layers: the input layer, convolutional layer, max pooling layer, bidirectional LSTM layer, fully connected dense layer and the output layer. The first layer is an embedding layer (the input layer), generally used to convert the integer-encoded word indices to dense vectors. It takes the input sequence and converts each word index to a fixed-size dense vector of 100 (max_length). max_length is the size of the word vectors, and the input_length parameter specifies the length of the input sequences (number of words). The second layer is a one-dimensional convolutional layer with 128 filters and a kernel_size of 5. The activation function used is ReLU (Rectified Linear Unit), which introduces non-linearity to the model. The third layer is the max


pooling layer with a pool_size of 4. It is used to reduce the spatial dimensions of the data and capture the most important features from the convolutional layer. The fourth layer contains a Bidirectional LSTM (Long Short-Term Memory) with 64 units (32 forward memory units and 32 backward memory units). Bidirectional LSTMs process the input sequence in both forward and backward directions, allowing the model to capture contextual information from both sides of the sequence. A dropout value of 0.2 and a recurrent_dropout of 0.2 apply dropout regularization to the LSTM layer to prevent overfitting. The fifth layer is the fully connected layer, where each neuron is connected to every neuron in the previous layer and each connection has its own weight; it is therefore expensive in terms of memory (weights) and computation (connections). This layer flattens the input feature representations into a feature vector, performs high-level reasoning and uses a softmax activation. The output layer is responsible for predicting, for each test instance, one of the 15 predefined sentiment categories; it therefore has 15 channels, each corresponding to a predefined sentiment category. Section 4 briefly describes further numerical details of the different layers of the CNN-BiLSTM model.

Table 16
Split the dataset into training and test sets based on 15 categories.

Category    Total samples  Training set  Test set
Love        56,631         45,305        11,326
Enthusiasm  37,965         30,372        7593
Happy       26,596         21,277        5319
Fun         24,351         19,481        4870
Surprise    3501           2801          700
Relief      765            612           153
Angry       27,054         21,644        5410
Sad         6101           4881          1220
Sexual      5854           4684          1170
Disgust     3888           3111          777
Boring      2494           1996          498
Worry       1810           1448          362
Fear        1442           1154          288
Hate        1083           867           216
Neutral     3928           3143          785
Total       203,463        162,776       40,687

Table 17
Split the dataset into training and test sets based on 3 categories.

Category  Total samples  Training set  Test set
Positive  149,809        119,848       29,961
Negative  49,726         39,745        9981
Neutral   3928           3143          785
Total     203,463        162,776       40,687

4. Experimental results analysis and discussion

As datasets and resources are the primary obstacle for Bangla NLP
work, the dataset is the prime concern when dealing with sentiment
analysis in the Bangla language (Kabir et al., 2023; Nafisa et al., 2023).
So, the principal target of this work is to develop a new comprehensive
sentiment dataset for the Bangla language. Accordingly, we have
developed a large Bangla sentiment dataset of 203,463 comments. We
have examined the proposed work using the Python language (integrated
development environments: Jupyter Notebook, Google Colab) with the
following device specifications: Processor — Intel(R) Core(TM) i5-7200U
CPU @ 2.50 GHz, 2.71 GHz; RAM — 8.00 GB (7.90 GB usable); System
type — 64-bit operating system, x64-based processor; OS — Windows 10
Pro edition. To observe its baseline evaluation we have implemented
different machine learning (ML), ensemble learning (EL) and deep
learning (DL) algorithms. To evaluate the performance of the proposed
Bangla sentiment analysis models, we have used several evaluation
metrics such as Accuracy, Precision, Recall, and F1-score. They are
formulated as follows:

Accuracy = (TP + TN) / (P + N)    (11)

Precision = TP / (TP + FP)    (12)

Recall = TP / (TP + FN)    (13)

F-measure = (2 × Precision × Recall) / (Precision + Recall)    (14)

Where,
TP: correct positive prediction,
FP: incorrect positive prediction,
TN: correct negative prediction,
FN: incorrect negative prediction,
P: TP + FP,
N: TN + FN.

Classification accuracy refers to the ratio of correct predictions to
the total number of predictions made by the built model (Eq. (11)).
Precision is the ratio of true positives to the sum of true positives and
false positives (Eq. (12)). Recall is defined as the ratio of true positives
to the sum of true positives and false negatives (Eq. (13)). F1-score or
F-measure is the balanced measure that expresses the performance in a
single quantity; it is the harmonic mean (Eq. (14)) of precision and
recall (Prottasha et al., 2022). For performing sentiment analysis on
Bangla, we have implemented Bernoulli Naive Bayes (BNB), Decision
Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN) and
Support Vector Machine (SVM) as ML models; Random Forest (RF),
XGBoost (XGB) and Gradient Boost (GB) as EL models; and Recurrent
Neural Network (RNN), Long Short Term Memory (LSTM), Bidirectional
LSTM (Bi-LSTM) and Convolutional Neural Network (CNN) as DL
models, with CNN-BiLSTM as a hybrid DL model. In this section, we
describe the parameter tuning process for the best-fit model, the
experimental results of the different algorithms, their comparisons, and
other findings.

4.1. Split the dataset

The proposed dataset contains a total of 203,463 comments from social
media. We split the dataset into 80% for training and 20% for testing.
The distribution of the training and test sets based on 15 and 3
categories is shown in Tables 16 and 17 respectively. The training set
is used to train the different learning models so that they can predict
unknown instances according to their training; to know how well the
models have been trained, the test set is used to measure performance
through metrics such as precision, recall, f1-score and accuracy.

4.2. CNN-BiLSTM model

The architecture of the CNN-BiLSTM model is depicted in Fig. 9. In this
section we briefly describe the parameter settings, the experimental
behaviour of the different layers, and the final outcomes.

4.2.1. Setting parameters

The input layer contains the word embedding vectors of a fixed size,
so a natural question arises: what should the max length of each
embedding vector be? We have tuned this parameter (max length). The
impact of the max length parameter on the CNN-BiLSTM model is given
in Table 18. For max length tuning, we have taken the values 32, 64,
100, 128 and 256. The experimental results show that the highest
training accuracy (0.95) and validation accuracy (0.93), as well as the
lowest training loss (0.35) and validation loss (0.36), are found when
max length is 100. Therefore, we set 100 as the value for max length.
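The selection rule used above, keeping the max length with the best validation behaviour, can be written out directly. A minimal sketch using the values recorded in Table 18:

```python
# Validation results from Table 18: max_length -> (val_accuracy, val_loss).
tuning_results = {
    32: (0.87, 0.43),
    64: (0.89, 0.39),
    100: (0.93, 0.36),
    128: (0.90, 0.44),
    256: (0.91, 0.46),
}

def best_max_length(results):
    """Highest validation accuracy wins; ties break on lower validation loss."""
    return max(results, key=lambda m: (results[m][0], -results[m][1]))

print(best_max_length(tuning_results))  # -> 100
```

The same pattern applies unchanged to the learning-rate, batch-size and epoch grids in Tables 19-21.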

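Eqs. (11)-(14) above translate directly into code; a small sketch using the same count definitions (TP, FP, TN, FN):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 as defined in Eqs. (11)-(14)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # P + N covers all predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Example counts (illustrative, not from the experiments).
acc, p, r, f1 = classification_metrics(tp=80, fp=10, tn=100, fn=20)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))
```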
Md.S. Islam and K.M. Alam Natural Language Processing Journal 7 (2024) 100069

Table 18
The impact of max length on CNN-BiLSTM model.
Max length   Training accuracy   Validation accuracy   Training loss   Validation loss
32           0.87                0.87                  0.47            0.43
64           0.91                0.89                  0.41            0.39
100          0.95                0.93                  0.35            0.36
128          0.90                0.90                  0.40            0.44
256          0.88                0.91                  0.44            0.46

Table 19
The impact of learning rate on CNN-BiLSTM model.
Learning rate   Training accuracy   Validation accuracy   Training loss   Validation loss
1e−2            0.85                0.85                  0.58            0.51
1.5e−2          0.88                0.84                  0.51            0.54
2e−3            0.92                0.89                  0.47            0.45
2.5e−3          0.96                0.94                  0.40            0.39
3e−2            0.89                0.90                  0.48            0.43
3.5e−3          0.90                0.88                  0.46            0.42

Table 20
The impact of batch size on CNN-BiLSTM model.
Batch size   Training accuracy   Validation accuracy   Training loss   Validation loss
5            0.91                0.90                  0.62            0.65
6            0.92                0.89                  0.65            0.63
7            0.92                0.90                  0.60            0.55
8            0.93                0.92                  0.57            0.58
30           0.95                0.93                  0.57            0.54
32           0.96                0.94                  0.53            0.54

Table 21
The impact of epochs on CNN-BiLSTM model.
Epochs   Training accuracy   Validation accuracy   Training loss   Validation loss
200      0.84                0.81                  0.71            0.68
300      0.87                0.82                  0.69            0.61
400      0.92                0.85                  0.58            0.60
500      0.95                0.93                  0.42            0.45
600      0.95                0.93                  0.46            0.45
700      0.95                0.92                  0.49            0.47

Table 22
Parameters of CNN-BiLSTM model with their optimal measurements.
Parameter                 Measurement/Value
Max length                100
Conv1D (filters)          128
Conv1D (kernel size)      5
MaxPooling1D (pool size)  4
Bi-LSTM memory units      64
Batch size                32
Learning rate             2.5e−3
Dropout                   0.2
Activation function       ReLU
Predictive function       Softmax
Loss function             Sparse categorical cross-entropy
Optimizer                 Adam
Metrics                   Accuracy
Epochs                    500

In a similar fashion, the experimental results show that a learning rate
of 2.5e−3 is the best-fitted value for the CNN-BiLSTM model. The impact
of the learning rate is given in Table 19: it produces a training accuracy
of 0.96 and a validation accuracy of 0.94, which outperform the others.

We have also tuned the batch size parameter for the proposed CNN-BiLSTM
model. The number of training samples processed before the model updates
its weights is referred to as the batch size (Alam et al., 2017). Table 20
briefly describes the impact of batch size on the CNN-BiLSTM model. For
batch size tuning, we have taken the values 5, 6, 7, 8, 30 and 32. For
the small values (5, 6, 7 and 8), the training and validation accuracies
are slightly lower, while for the large values (30 and 32) they are
slightly better. We recorded the best performance with batch size 32 and
set it as the optimal value.

The number of epochs is another hyper-parameter that needs to be tuned;
an epoch (also termed an iteration or cycle) is one complete pass over
the training data. The impact of epochs on the CNN-BiLSTM model is shown
in Table 21. For tuning, we considered 200, 300, 400, 500, 600 and 700
epochs. For small values (200, 300 and 400) the training and validation
accuracies are slightly lower, while for large values (500, 600 and 700)
they are slightly better but essentially the same: the accuracy did not
increase after 500 epochs. So, we set 500 as the optimal number of
epochs.

Table 22 illustrates the parameters of the CNN-BiLSTM model with their
optimal measurements. The one-dimensional convolutional layer contains
128 filters with a kernel size of 5; the pool size of the one-dimensional
max pooling layer is 4; and the bidirectional LSTM layer has 64 memory
units with a dropout of 0.2 to prevent overfitting. The activation and
predictive functions are ReLU (Rectified Linear Unit) and Softmax
respectively. Sparse categorical cross-entropy is the loss function, Adam
is the optimizer, and accuracy is the performance metric for the
CNN-BiLSTM model.

4.2.2. Analysis of different layers in CNN-BiLSTM model

Fig. 9 depicts the architecture of the proposed CNN-BiLSTM model. The
first layer is an embedding layer (input layer), generally used to
convert the integer-encoded word indices to dense vectors. It takes the
input sequence (of max length 100) and converts each word index to a
fixed-size dense vector. The second layer is the convolutional layer with
128 filters and a kernel size of 5. The activation function used is ReLU
(Rectified Linear Unit), which introduces non-linearity into the model.
The kernel weights of the initial 16 filters of the convolutional layer
are shown in Fig. 10. The changes in the kernel weights across the first
16 consecutive filters are random. The weights in output channel 1
change drastically: the weight value starts at 0.038, gradually decreases
to negative values and then increases again to 0.084. In output channel
2 of the convolutional layer, the initial weight value is −0.02; it
decreases further to −0.043 and then increases again to 0.046. The final
weight values are all positive in the initial 7 output channels of the
convolutional layer, while the 8th, 9th, 11th and 16th output channels
have all-negative weights. Fig. 11 shows the kernel weight adaptation of
the final 16 output channels of the convolutional layer. The 113th
output channel starts with a positive weight of 0.043, which finally
adapts to −0.008. The 128th output channel starts with a positive weight
of 0.058, drastically fluctuates to −0.06, rapidly increases again to
0.039, decreases to 0.031 and finally stabilizes at a weight value of
0.069. The third layer is the max pooling layer with a pool size of 4.
It is used to reduce the spatial dimensions of the data and capture the
most important features from the convolutional layer.

The activation status of the initial and final 16 filters of the max
pooling layer in the CNN-BiLSTM model are shown in Figs. 12 and 13
respectively. The activation values for some of the channels are zero
(channels 1, 4, 5, 7, 10, 11, 13, 14, 16, 113, 118, 119, 121, 122, 124,
127 and 128), while 0.0289, 0.001925 and 0.01878 are the activation
values for channels 2, 15 and 126 respectively. So, the spatial
dimensions are reduced to a great extent compared to the convolutional
layer. The fourth layer contains a Bidirectional LSTM (Long Short-Term
Memory) with 64 units (32 forward memory units and 32 backward memory
units). Bidirectional LSTMs process the input sequence in both forward


Fig. 10. Kernel weights of Convolutional layer in CNN-BiLSTM model (initial 16 filters).

Fig. 11. Kernel weights of Convolutional layer in CNN-BiLSTM model (final 16 filters).

and backward directions, allowing the model to capture contextual
information from both sides of the sequence. A dropout value of 0.2 and
a recurrent_dropout of 0.2 are used to apply dropout regularization to
the LSTM layer and prevent overfitting. A brief graphical overview of
the memory units of the bidirectional LSTM layer in the CNN-BiLSTM model
is shown in Fig. 14. The fifth and final layer is the fully connected
dense layer, which is responsible for assigning each test instance to
one of the 15 predefined sentiment categories. So, in the output layer
there are 15 channels, where each channel corresponds to a predefined
sentiment category. Fig. 15 depicts the activation status of the fully
connected dense layer for the proposed CNN-BiLSTM model.

4.3. Experimental results analysis

The proposed work mainly focuses on creating a comprehensive dataset for
Bangla SA and discovering an efficient feature extraction metric, then
applies various ML, EL and DL algorithms and makes


Fig. 12. Activation status of Max pooling layer in CNN-BiLSTM model (initial 16 filters).

Fig. 13. Activation status of Max pooling layer in CNN-BiLSTM model (final 16 filters).

a comparative analysis among them to find the optimal model as well as
the best feature metric. Table 23 describes the proposed dataset at a
glance. In this work, we have developed a total of 21 different hybrid
feature extraction techniques: BOW+2-Gram, BOW+3-Gram, TF-IDF+2-Gram,
TF-IDF+3-Gram, TF-IDF-ICF+2-Gram, TF-IDF-ICF+3-Gram, Word2Vec+CBOW
(gensim), Word2Vec+Skipgram (gensim), Word2Vec+CBOW+Skipgram (gensim),
Word2Vec+CBOW (tensorflow), Word2Vec+Skipgram (tensorflow),
Word2Vec+CBOW+Skipgram (tensorflow), FastText+CBOW, FastText+Skipgram,
FastText+CBOW+Skipgram, GloVe+CBOW, GloVe+Skipgram, GloVe+CBOW+Skipgram,
Bangla-BERT+CBOW, skipBangla-BERT and Bangla-BERT+CBOW+Skipgram, and
implemented ML, EL and DL algorithms to evaluate them. In this work, we
have implemented Bernoulli Naive Bayes (BNB), Decision Tree (DT),
Logistic Regression (LR), K-Nearest Neighbors (KNN) and Support Vector
Machine (SVM) as ML models; Random Forest (RF), XGBoost (XGB) and
Gradient Boost (GB) as EL models; and Recurrent Neural Network (RNN),
Long Short Term Memory (LSTM),


Fig. 14. Memory units of Bidirectional LSTM layer in CNN-BiLSTM model (initial and final 4 memory units for both forward and backward steps).

Bidirectional LSTM (Bi-LSTM) and Convolutional Neural Network (CNN) as
DL models, with CNN-BiLSTM as a hybrid DL model.

A brief overview of the category-wise data collection is given in
Table 4, which shows that the developed Bangla sentiment dataset is
imbalanced, a condition that can produce overfitted results in the
different learning models. So, we have used the synthetic minority
oversampling technique (SMOTE) to balance our dataset and avoid this
overfitting. Table 24 demonstrates the effect on model performance after
balancing the dataset. For both 15 and 3 sentiment categories, BNB has
the worst results compared to the other algorithms, while SMOTE
significantly improves its accuracy from 57.91% to 66.85% (15
categories) and from 75.26% to 83.31% (3 categories). From the ML
domain, SVM outperforms the other four algorithms for both 15 and 3
categories, both with and without SMOTE. The ML results without SMOTE
show that 15 categories provide better results than 3 categories for
DT, LR and SVM, while BNB and KNN provide better results on 3
categories. But after balancing the dataset using SMOTE, all the models
except DT show better results on 3 categories. From the EL domain, RF
outperforms the other two algorithms for both 15 and 3 categories, both
with and without SMOTE. The results of all three EL models with SMOTE
have increased significantly. SMOTE significantly improves the accuracy of

Fig. 15. Activation status of fully connected dense layer in CNN-BiLSTM model.


RF from 80.98% to 85.77% and from 80.13% to 91.16% for 15 and 3
categories respectively. From the DL domain, CNN-BiLSTM outperforms the
other four algorithms for both 15 and 3 categories, both with and
without SMOTE. The hybrid model outperforms all the algorithms from ML,
EL and DL and achieves an accuracy of 95.71% with SMOTE. From the above
analysis, it is observed that there is a relationship between the number
of categories and the model performance. Let the number of categories
and the model performance be c and p respectively. The relationship
between these two parameters is:

p ∝ 1/c

When the number of categories increases to 15, the performance of the
model decreases, and vice versa. So, for 3 categories, the performance
of most models is better than in the 15-categories scenario.

The obtained results of the applied algorithms using 21 different hybrid
feature metrics based on 15 and 3 categories are shown in Tables 25 and
26 respectively. The best results are highlighted in bold. Among the ML
algorithms based on 15 sentiment categories, SVM (classification
accuracy 84.88%) outperforms all other methods using skipBangla-BERT,
while the EL and DL algorithms achieved highest accuracies of 85.77%
(Bangla-BERT+CBOW+Skipgram) and 90.24% (skipBangla-BERT) using RF and
the hybrid CNN-BiLSTM model respectively. The feature metric
skipBangla-BERT performs better than the other feature metrics in most
cases, so we show two separate tables with the best results found from
this feature metric: the results obtained for 15 and 3 categories using
skipBangla-BERT are summarized in Tables 27 and 28 respectively.
Examining the skipBangla-BERT mechanism on 15 sentiment categories, it
is observed that the SVM, RF and CNN-BiLSTM models achieved the best
results of 84.88%, 85.34% and 90.24% accuracy from the ML, EL and DL
domains respectively. For the results based on 3 sentiment categories
(see Tables 26 and 28), SVM, XGB and CNN-BiLSTM acquired the highest
accuracies of 92.37%, 92.55% and 95.71% from ML, EL and DL respectively
using the hybrid feature extraction method skipBangla-BERT. TF-IDF-ICF
also performed well and obtained better results than traditional TF-IDF.
The experimental results show that Skipgram outperforms CBOW (observe
Tables 25 and 26).

To measure the performance of the implemented algorithms, we have
considered the commonly used metrics precision, recall, f1-score and
accuracy. Precision is the ratio of true positives to the sum of true
positives and false positives (Eq. (12)). The precision scores of all the
implemented algorithms are shown in Fig. 16. SVM and CNN-BiLSTM achieved
the highest precision score of 0.96, whereas XGB and CNN acquired the
2nd highest precision score of 0.94 and LR obtained 0.93. These five
models classify unknown test instances more precisely than the others.
The average precision scores for the DL, EL and ML domains are 0.92,
0.90 and 0.85 respectively. Among the implemented algorithms, KNN
produced the worst precision score of 0.76. Recall is the ratio of true
positives to the sum of true positives and false negatives (Eq. (13)).
Fig. 17 depicts the recall scores of all the implemented algorithms.
LSTM got the highest recall value of 0.99; CNN, CNN-BiLSTM and RNN
obtained the 2nd highest recall of 0.98, and RF obtained 0.94. Precision
and recall values close to 1 are desirable, as they indicate the
efficiency and effectiveness of the trained models. The average recall
values for the DL, EL and ML domains are 0.98, 0.93 and 0.86
respectively. As with precision, KNN also produced the worst recall
score, 0.82. F1-score or F-measure is the balanced measure that
expresses the performance in a single quantity; it is the harmonic mean
of precision and recall (Eq. (14)). F1-scores of all the implemented
algorithms are shown in Fig. 18. The hybrid CNN-BiLSTM model obtained
the highest f1-score of 0.96, whereas CNN got the 2nd highest f1-score
of 0.95, and Bi-LSTM and SVM achieved 0.94. The average f1-scores from
the DL, EL and ML domains are 0.94, 0.91 and

Fig. 16. Precision scores of the implemented algorithms.

Table 23
The proposed dataset at a glance.
Feature                        Measurement/Result              Comment
Dataset size                   203,463                         Moderately good
Total no. of words             2,046,150                       Good
Total no. of characters        11,660,683                      Good
Vocabulary size                165,319                         Good
Data annotation quality        94.67%                          Accuracy
Dataset balance?               No                              Not good
No. of categories considered   15 and 3                        Covering more sentiments
Function words                 32.84%                          Good
Top 10 unigrams                All are stopwords               As expected
Average word length            5.69                            N/A
Average unique word length     7.84                            N/A
Zipf's law                     Roughly follows                 Good
Hapax legomena                 More than half of the dataset   As expected
Vocabulary growth rate         0.053                           Good
The obtained results of applied algorithms using 21 different hy- CNN-BiLSTM obtained the highest f1-score of 0.96 whereas CNN got
brid feature metrics based on 15 and 3 categories are shown in Ta- the 2nd highest f1-score of 0.95 and Bi-LSTM and SVM achieved 0.94.
bles 25 and 26 respectively. The best achieved results are highlighted The average f1-scores from DL, EL and ML domains are 0.94, 0.91 and


Table 24
Effect of different model performance after balancing the dataset.
Domain Algorithm Accuracy without Accuracy with Accuracy without Accuracy with
SMOTE for SMOTE for SMOTE for SMOTE for
15-categories 15-categories 3-categories 3-categories
(%) (%) (%) (%)
BNB 57.91 66.85 75.26 83.31
DT 78.15 84.37 76.89 83.21
ML LR 80.12 83.77 78.34 88.74
KNN 59.89 72.69 69.95 81.58
SVM 80.52 84.88 79.58 92.37
RF 80.98 85.77 80.13 91.16
EL XGB 79.47 83.59 79.92 92.55
GB 75.67 84.91 80.03 91.87
RNN 80.19 88.91 82.17 93.14
LSTM 81.11 86.90 83.23 94.37
DL Bi-LSTM 81.73 85.96 82.65 94.68
CNN 80.38 88.33 82.91 94.55
CNN-BiLSTM 82.16 90.24 83.64 95.71
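SMOTE, which produces the gains reported in Table 24, synthesises new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. A minimal NumPy sketch of that core interpolation step (an illustration only, not the full algorithm used in the experiments):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=2, rng=None):
    """Generate n_new synthetic minority samples: pick a random minority
    point, pick one of its k nearest minority neighbours, interpolate."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)  # distances to other samples
        d[i] = np.inf                          # exclude the point itself
        neighbour = X_min[np.argsort(d)[rng.integers(k)]]
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(x + lam * (neighbour - x))
    return np.array(synthetic)

# Three minority points in 2-D; create five synthetic ones between them.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_points = smote_oversample(minority, n_new=5, rng=0)
print(new_points.shape)  # -> (5, 2)
```

Each synthetic point lies on a segment between two existing minority points, which is what distinguishes SMOTE from plain duplication-based oversampling.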

Table 25
Obtained results of applied algorithms using different feature extraction techniques based on 15 sentiment categories.
Feature Accuracy (%)
BNB DT LR KNN SVM RF XGB GB RNN LSTM Bi-LSTM CNN CNN-
BiLSTM
BOW+2-Gram 61.13 73.25 74.51 39.78 77.63 80.31 79.88 80.03 N/A N/A N/A N/A N/A
BOW+3-Gram 61.33 73.29 74.50 39.91 77.67 80.23 79.87 79.98 N/A N/A N/A N/A N/A
TF-IDF+2-Gram 65.72 78.93 77.67 42.96 80.34 82.52 82.31 82.19 N/A N/A N/A N/A N/A
TF-IDF+3-Gram 65.81 78.97 77.98 43.10 80.77 82.63 81.69 82.33 N/A N/A N/A N/A N/A
TF-IDF-ICF+2-Gram 64.33 80.14 78.40 43.22 82.19 83.12 82.13 82.44 N/A N/A N/A N/A N/A
TF-IDF-ICF+3-Gram 64.27 80.31 78.42 43.23 82.76 83.12 82.15 82.53 N/A N/A N/A N/A N/A
Word2Vec+CBOW 28.71 58.16 60.67 57.90 62.61 71.18 70.13 70.69 72.39 73.65 74.68 75.89 82.31
(gensim)
Word2Vec+Skipgram 35.95 57.84 62.94 60.62 67.98 71.79 71.05 69.98 74.23 75.63 76.92 80.39 83.26
(gensim)
Word2Vec+CBOW+ 28.55 57.81 60.43 56.89 64.69 71.19 69.72 68.87 73.13 76.98 75.97 80.14 82.79
Skipgram (gensim)
Word2Vec+CBOW 24.13 55.79 51.07 43.26 60.53 64.67 63.19 62.74 79.54 81.26 82.39 81.37 83.76
(tensorflow)
Word2Vec+Skipgram 25.26 57.81 54.32 44.17 61.29 64.98 63.55 61.87 79.98 82.05 82.31 82.91 82.77
(tensorflow)
Word2Vec+CBOW+ 24.89 55.86 53.68 43.29 60.95 64.83 63.14 61.68 78.81 82.39 80.39 79.68 81.26
Skipgram (tensorflow)
FastText+CBOW 56.82 67.64 71.62 72.69 72.86 76.21 73.42 72.87 77.37 79.81 80.45 80.11 82.38
FastText+Skipgram 66.85 67.78 71.61 72.58 72.59 76.29 74.51 73.27 79.91 81.29 84.25 83.22 84.27
FastText+CBOW+ 59.87 67.74 71.64 72.62 72.89 76.23 75.85 72.96 78.63 80.94 82.71 81.21 83.57
Skipgram
GloVe+CBOW 55.72 66.52 71.23 57.89 72.43 74.89 75.76 70.83 76.39 78.35 79.42 78.98 80.53
GloVe+Skipgram 65.35 67.39 70.54 58.93 71.99 75.61 74.38 69.87 78.13 75.53 81.32 80.32 81.26
GloVe+CBOW+ 61.37 65.74 71.27 64.79 70.99 72.94 71.29 70.89 73.41 78.91 79.47 78.96 80.93
Skipgram
Bangla-BERT+CBOW 63.47 84.26 83.49 67.91 84.67 85.79 83.41 84.62 88.89 86.41 86.49 87.91 89.13
skipBangla-BERT 64.36 83.19 83.77 66.43 84.88 85.34 83.59 84.91 88.91 85.73 85.96 88.33 90.24
Bangla-BERT+CBOW+ 64.53 84.37 83.73 67.05 84.79 85.77 82.97 84.77 88.86 86.90 86.81 88.13 89.91
Skipgram

0.85 respectively. Among all the models, KNN got the worst f1-score of
0.79, as it also obtained the worst precision and recall values. Among
all the implemented algorithms, KNN is the only lazy learner: it has no
training phase, which is the reason for its worst outcomes. The proposed
dataset contains human Bangla comments, so standard language is rarely
used throughout; rather, it can contain local Bangla words, local folk
words, slang etc. So, training is a must to get promising outcomes from
a model. Except for KNN, the other models are eager learners, and they
produced significant results in both the 15- and 3-category versions of
the dataset.

The performance of a learning model depends mostly on the features
extracted from the dataset. Feature extraction is very important in text
mining related works, so it is important to know which feature
extraction method is more efficient for the task at hand. In our work,
we have examined different feature extraction techniques. The average
performance of the different feature extraction methods is summarized in
Fig. 19, where Bangla-BERT outperforms all other methods with an average
accuracy of 92.24%. The FastText model shows an accuracy of 81.09%, the
2nd highest among the feature extraction metrics. The TF-IDF-ICF metric
outperforms the traditional TF-IDF and


Fig. 17. Recall scores of the implemented algorithms.

Fig. 18. F1-scores of the implemented algorithms.

Fig. 19. Average performance of different feature extraction techniques.
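The family-level averages summarised in Fig. 19 are plain means over the per-variant accuracies. A sketch of that aggregation using the CNN-BiLSTM columns of Tables 25 and 26 for two of the families:

```python
# CNN-BiLSTM accuracies (%) per feature variant, from Tables 25 and 26
# (three 15-category rows followed by three 3-category rows per family).
scores = {
    "Bangla-BERT": [89.13, 90.24, 89.91, 94.47, 95.71, 95.27],
    "FastText":    [82.38, 84.27, 83.57, 92.35, 93.54, 93.25],
}

averages = {family: sum(v) / len(v) for family, v in scores.items()}
best = max(averages, key=averages.get)
print(best)  # -> Bangla-BERT
```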


Table 26
Obtained results of applied algorithms using different feature extraction techniques based on 3 sentiment categories.
Feature Accuracy (%)
BNB DT LR KNN SVM RF XGB GB RNN LSTM Bi-LSTM CNN CNN-
BiLSTM
BOW+2-Gram 80.32 82.37 83.04 57.33 82.43 83.78 82.47 82.88 N/A N/A N/A N/A N/A
BOW+3-Gram 80.13 81.69 82.34 59.75 81.66 83.96 82.39 81.89 N/A N/A N/A N/A N/A
TF-IDF+2-Gram 82.69 88.23 88.04 58.39 89.47 90.33 90.45 89.97 N/A N/A N/A N/A N/A
TF-IDF+3-Gram 83.05 88.21 88.19 61.46 89.48 90.41 90.23 90.17 N/A N/A N/A N/A N/A
TF-IDF-ICF+2-Gram 83.18 88.95 88.61 58.20 89.92 91.07 90.86 89.99 N/A N/A N/A N/A N/A
TF-IDF-ICF+3-Gram 83.31 89.02 88.74 59.16 90.01 91.16 89.56 90.45 N/A N/A N/A N/A N/A
Word2Vec+CBOW 65.82 78.83 80.84 81.11 81.86 85.78 84.35 83.26 83.49 83.99 84.74 84.56 89.57
(gensim)
Word2Vec+Skipgram 69.19 79.92 81.46 80.31 83.20 86.44 84.53 84.36 82.43 87.94 86.24 85.41 90.13
(gensim)
Word2Vec+CBOW+ 64.83 78.54 80.91 81.09 81.89 85.99 84.49 85.23 84.58 86.78 89.05 88.17 92.48
Skipgram (gensim)
Word2Vec+CBOW 54.72 73.35 75.66 71.26 78.59 80.46 82.33 81.16 83.45 85.90 87.43 89.99 89.93
(tensorflow)
Word2Vec+Skipgram 56.13 77.91 78.52 74.89 79.94 80.09 84.06 83.34 84.65 86.19 89.72 87.70 92.30
(tensorflow)
Word2Vec+CBOW+ 54.97 77.18 76.99 74.57 78.92 79.99 84.14 82.73 81.94 86.31 90.18 90.02 91.61
Skipgram (tensorflow)
FastText+CBOW 65.82 76.64 80.61 81.11 81.86 85.21 86.49 84.45 85.73 89.90 88.15 91.15 92.35
FastText+Skipgram 75.85 76.78 81.27 81.58 82.34 85.29 87.95 88.21 89.55 90.31 92.19 90.95 93.54
FastText+CBOW+ 75.83 76.63 80.97 81.61 81.93 85.23 85.69 87.89 90.14 90.78 91.44 91.08 93.25
Skipgram
GloVe+CBOW 67.23 75.19 78.47 79.63 80.96 85.35 84.14 86.51 88.90 86.05 87.68 89.96 91.55
GloVe+Skipgram 73.89 75.82 80.94 80.38 81.53 85.67 85.19 88.98 84.93 87.34 89.95 90.42 92.33
GloVe+CBOW+ 76.12 76.07 79.57 80.44 81.29 85.12 85.32 87.93 85.66 86.26 89.92 90.49 92.28
Skipgram
Bangla-BERT+CBOW 80.13 83.16 86.91 78.92 92.14 91.03 92.34 90.68 92.89 94.13 94.45 93.44 94.47
skipBangla-BERT 82.15 83.17 87.72 80.14 92.37 91.09 92.55 91.87 93.03 94.37 94.66 94.26 95.71
Bangla-BERT+CBOW+ 81.99 83.21 87.79 79.54 92.29 91.14 92.47 91.74 93.14 94.26 94.68 94.55 95.27
Skipgram
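skipBangla-BERT couples Bangla-BERT contextual embeddings with Skipgram word vectors. The paper's exact fusion scheme is defined in its methodology section; one simple way to realise such a hybrid feature is concatenation, sketched here with stand-in vectors (the dimensions and the `bert_vec`/`skipgram_vec` inputs are illustrative placeholders, not real model outputs):

```python
import numpy as np

def fuse_features(bert_vec, skipgram_vec):
    """Concatenate a document's Bangla-BERT embedding with its averaged
    Skipgram embedding into a single hybrid feature vector."""
    return np.concatenate([bert_vec, skipgram_vec])

# Assumed dimensions: 768 for Bangla-BERT, 100 for Skipgram (illustrative).
bert_vec = np.zeros(768)
skipgram_vec = np.zeros(100)
hybrid = fuse_features(bert_vec, skipgram_vec)
print(hybrid.shape)  # -> (868,)
```

The resulting vector can then be fed to any of the ML, EL or DL classifiers evaluated above.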

Table 27
Best performance measurements of different domains using skipBangla-BERT based on 15 categories.
Domain   Best model   Precision   Recall   F1-score   Accuracy (%)
ML       SVM          0.81        0.86     0.83       84.88
EL       RF           0.78        0.91     0.84       85.34
DL       CNN-BiLSTM   0.86        0.93     0.89       90.24

Table 28
Best performance measurements of different domains using skipBangla-BERT based on 3 categories.
Domain   Best model   Precision   Recall   F1-score   Accuracy (%)
ML       SVM          0.96        0.91     0.93       92.37
EL       XGB          0.89        0.99     0.94       92.55
DL       CNN-BiLSTM   0.97        0.94     0.95       95.71

BOW metrics. Word2Vec with the gensim and tensorflow libraries shows
accuracies of 65.73% and 62.81% respectively. In our experiment, we have
found that Word2Vec performed worst compared to the other feature
metrics.

The comparison among existing recent works and our proposed work is
illustrated in Table 29, where we have taken the dataset used, the
number of categories, the feature metric, the used model, the f1-score
and the accuracy as comparative features. In 2022, a method was proposed
for Bangla sentiment analysis using an extended lexicon dictionary and
deep learning algorithms (Bhowmik et al., 2022); it used two
aspect-based datasets on Cricket and Restaurant and obtained a highest
accuracy of 84.18% with the hybrid method BERT-LSTM. Another method from
2022 (Prottasha et al., 2022) used six datasets and obtained its highest
f1-score and accuracy, 0.93 and 94.15%, with a CNN-BiLSTM model. The
very recent work of Kabir et al. (2023) used a moderately large dataset
of 158,065 instances and achieved a highest f1-score of 0.93 using a
BERT model. Another new work (Bitto et al., 2023) used a dataset of only
1400 reviews from a food delivery startup and achieved an accuracy of
91.07% and an f1-score of 0.85. Our proposed dataset contains more
instances than the four other compared recent works. To the best of our
knowledge, we have created one of the largest document-level Bangla SA
corpora, with 203,463 comments from social media. Most work on Bangla
sentiment analysis considers only 3 basic sentiment categories
(positive, negative and neutral), but in this work we have examined
sentiment analysis on both 15 and 3 categories and noticed that accuracy
(i.e. results) and the number of categories are inversely proportional
to each other. Moreover, the proposed work also discovers a new hybrid
feature extraction method (skipBangla-BERT) for Bangla textual data
which outperforms 20 other hybrid methods. We have implemented 13
different algorithms from the 3 domains ML, EL and DL; among them the
hybrid method (CNN-BiLSTM) from the DL domain outperforms the others.
The best achieved accuracy is 90.24% for 15 categories and 95.71% for 3
categories. Another graphical performance metric that is very useful for
machine learning algorithms is the receiver operating characteristic
(ROC) curve (Bradley, 1997). We have shown the ROC curves for both the
15 and 3 categories in Figs. 20 and 21 respectively. Individual area
under curve (AUC) values are also given in the ROC curves to observe the
significance of each category (see Table 29).


Table 29
Comparison among existing works and our proposed work.
Name                      Year   Dataset used             No. of categories   Feature metric    Best model   F1-Score   Accuracy (%)
Bhowmik et al. (2022)     2022   2900 & 2600              3                   Word2Vec          BERT-LSTM    N/A        84.18
Prottasha et al. (2022)   2022   2900 & 2600 & 4 others   3                   BERT              CNN-BiLSTM   0.93       94.15
Kabir et al. (2023)       2023   (BanglaBook) 158,065     3                   BOW+N-Gram        BERT         0.93       N/A
Bitto et al. (2023)       2023   1400                     2                   Word2Vec          LSTM         0.85       91.07
Islam and Alam (2023b)    2023   44,491                   3                   TF-IDF            CNN-BiGRU    0.90       90.96
Mia et al. (2024)         2024   6500                     3                   TF-IDF, GloVe     BERT         0.65       65.00
Our Proposed (BangDSA)    2024   (SA_Bangla) 203,463      3                   skipBangla-BERT   CNN-BiLSTM   0.96       95.71
Our Proposed (BangDSA)    2024   (SA_Bangla) 203,463      15                  skipBangla-BERT   CNN-BiLSTM   0.91       90.24
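Section 4.4 applies a Friedman test to per-feature accuracy grids like the ones behind this comparison. The same χr² statistic can be computed offline with `scipy.stats.friedmanchisquare`; the scores below are illustrative placeholders, not measurements from the tables:

```python
from scipy.stats import friedmanchisquare

# Accuracy scores (%) of three models over the same six feature sets
# (illustrative numbers only).
svm = [80.3, 82.1, 62.6, 72.8, 72.4, 84.8]
rf = [82.5, 83.1, 71.1, 76.2, 74.8, 85.3]
cnn_bilstm = [84.1, 85.2, 82.3, 82.3, 80.5, 90.2]

# Each argument is one model's measurements across the shared feature sets.
stat, p_value = friedmanchisquare(svm, rf, cnn_bilstm)
print(round(stat, 2), p_value < 0.05)  # -> 12.0 True
```

With a perfectly consistent ranking across all six feature sets, the statistic reaches its maximum for three groups, and the null hypothesis of equal model performance is rejected at the 0.05 level.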

Fig. 20. ROC curve of CNN-BiLSTM model based on 15 categories.

Fig. 21. ROC curve of CNN-BiLSTM model based on 3 categories.

4.4. Statistical test (Friedman test)

To observe the statistical significance of the obtained results for both the 15 and 3 categories, we applied the non-parametric Friedman test (Liu and Xu, 2022) to the results of the best-performing models from the ML, EL and DL domains. In the case of 15 categories, SVM, RF and CNN-BiLSTM are the best models from the ML, EL and DL domains (see Table 25). Similarly, for the 3 categories, SVM, XGB and CNN-BiLSTM are the best models from the ML, EL and DL domains (see Table 26). The accuracy scores obtained by these three models with the different feature sets are taken as input to the Friedman test. The level of significance is set to 0.05, and an open-source online statistical calculator3 is used to run the test. The calculation summary of the 𝜒𝑟2 statistic is shown in Figs. 22 and 23 for the 15 and 3 categories, respectively.

Based on the 15 categories, the Friedman 𝜒𝑟2 statistic is 31.05 with degrees of freedom (df) = 2 and p-value < 0.00001, so the result is significant at p < 0.05. Based on the 3 categories, the Friedman 𝜒𝑟2 statistic is 20.9333 with df = 2 and p-value = 0.00003, which is likewise significant at p < 0.05. Hence, both test results are significant at p < 0.05.

Fig. 22. Friedman test based on 15 categories.

Fig. 23. Friedman test based on 3 categories.

3 https://www.socscistatistics.com/tests/friedman/default.aspx
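The reported values can also be cross-checked programmatically. The sketch below is illustrative only: it runs SciPy's implementation of the Friedman test on hypothetical accuracy scores (placeholders, not the paper's measurements) for three models compared across the same feature sets.

```python
# Sketch: Friedman test over matched accuracy scores of three models
# (e.g. SVM, RF/XGB, CNN-BiLSTM) evaluated on the same feature sets.
# All score values below are hypothetical placeholders.
from scipy.stats import friedmanchisquare

svm        = [0.78, 0.80, 0.81, 0.83, 0.84, 0.85, 0.86]
ensemble   = [0.80, 0.82, 0.83, 0.85, 0.86, 0.87, 0.88]
cnn_bilstm = [0.84, 0.86, 0.87, 0.89, 0.90, 0.91, 0.92]

# Each list is one "treatment"; list positions are the matched blocks
# (here, the feature sets on which all three models were evaluated).
stat, p = friedmanchisquare(svm, ensemble, cnn_bilstm)
print(f"Friedman chi-square = {stat:.4f}, p-value = {p:.5f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

Fed with the actual per-feature accuracy tables, the same call yields the corresponding 𝜒𝑟2 statistic and p-value.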

Md.S. Islam and K.M. Alam Natural Language Processing Journal 7 (2024) 100069

4.5. Limitation of the proposed work

Though the proposed dataset (BangDSA) is one of the largest document-level sentiment analysis datasets, it is not a balanced one. Some categories, such as surprise, relief, worry, fear, and hate, contain far fewer instances than the other categories, which can lead to overfitting. The performance comparison among the 15 and 3 sentiment categories using skipBangla-BERT is given in Figs. 24 and 25, respectively. The Fear, Hate and Neutral categories did not perform as well as the other sentiment categories because they have fewer training samples; we expect that increasing the training samples of those categories would improve their performance. Furthermore, although understanding feature importance is essential for the digital development of a language, in this work we did not consider any explainable artificial intelligence (AI) techniques. A fundamental problem of AI and learning-related tasks is the lack of interpretability: why a model performs well or poorly, which features are responsible for classifying a document into a certain category, which features are the most important, and so on.

Fig. 24. Performance of 15 sentiment categories using skipBangla-BERT.

Fig. 25. Performance of 3 sentiment categories using skipBangla-BERT.

5. Conclusion

In this modern technologically advanced world, Sentiment Analysis (SA) is a very important topic in every language due to its various trendy applications. However, comparatively few research works have been conducted on SA in the Bangla language, owing to the lack of publicly available datasets and resources. The principal objective of this work is to develop a new comprehensive dataset for Bangla SA, then search for a better hybrid feature metric that can assist different learning models in making efficient predictions, and finally find one or more efficient learning models from machine learning, ensemble learning or deep learning techniques. The proposed dataset contains 203,493 Bangla comments collected from 5 microblogging sites. The comments were annotated into 15 predefined sentiment categories by 5 native annotators, and we validated the annotation process with 40 different native Bangla speakers, obtaining a validation accuracy of 94.67%. The proposed dataset approximately follows Zipf's law, covering 32.84% function words with a vocabulary growth rate of 0.053, and is tagged both on 15 and 3 categories, so it can indeed be considered a good dataset. The proposed work also focuses on examining different hybrid feature extraction techniques. TF-IDF-ICF outperforms the traditional TF-IDF and BOW methods, and 3-gram performs slightly better than 2-gram. Among the word embedding models, FastText performs better than Word2Vec and GloVe, and Skipgram surpasses the traditional CBOW mechanism. This work examined 21 different hybrid feature extraction techniques and found that the novel method (skipBangla-BERT) outperforms all other techniques. For the classification task, we implemented 5 ML algorithms, 3 EL algorithms, 4 DL algorithms and a hybrid DL method, CNN-BiLSTM. Among the implemented algorithms, KNN obtained the worst performance as it is a lazy learner, achieving at best 72.69% and 81.58% accuracy for the 15 and 3 categories, respectively; the rest of the classifiers are eager learners and require a training phase. The hybrid method CNN-BiLSTM along with the feature metric skipBangla-BERT produced the best results on both the 15- and 3-category versions of the dataset. The best accuracy acquired by the CNN-BiLSTM model is 90.24% on 15 categories and 95.71% on 3 categories. It is observed experimentally that the number of categories and a model's performance are inversely proportional. A statistical test (Friedman test) was performed on the obtained results to observe their statistical significance at the 0.05 level of significance; the test shows that the obtained results are significant for both the 15 and 3 categories. In the future, we want to further enrich and balance our dataset, convert it into a representative one for Bangla SA, and explore evolutionary algorithms for extracting features from texts.

CRediT authorship contribution statement

Md. Shymon Islam: Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Resources, Software, Validation. Kazi Masudul Alam: Conceptualization, Supervision.

Declaration of competing interest

The authors declare that they have no conflicts of interest.

Acknowledgments


The authors would like to thank the Computer Science and Engineering Discipline, Khulna University, Bangladesh, for providing resources and time for the design and implementation of this research work, and also thank the students from the Department of Computer Science and Engineering, North Western University, Bangladesh, who were involved in the dataset collection and annotation process of this research work.

References

Akther, A., Islam, M.S., Sultana, H., Rahman, A.R., Saha, S., Alam, K.M., Debnath, R., 2022. Compilation, analysis and application of a comprehensive bangla corpus kumono. IEEE Access 10, 79999–80014. http://dx.doi.org/10.1109/ACCESS.2022.3195236.
Alam, M.H., Rahoman, M.M., Azad, M.A.K., 2017. Sentiment analysis for bangla sentences using convolutional neural network. In: 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–6. http://dx.doi.org/10.1109/ICCITECHN.2017.8281840.
Alvi, N., Talukder, K.H., Uddin, A.H., 2022. Sentiment analysis of bangla text using gated recurrent neural network. In: International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing, vol. 1388, pp. 77–86. http://dx.doi.org/10.1007/978-981-16-2597-8_7.
Amin, A., Hossain, I., Akther, A., Alam, K.M., 2019. Bengali VADER: A sentiment analysis approach using modified VADER. In: International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 1–6. http://dx.doi.org/10.1109/ECACE.2019.8679144.
Azmin, S., Dhar, K., 2019. Emotion detection from bangla text corpus using naive bayes classifier. In: 4th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–6. http://dx.doi.org/10.1109/EICT48899.2019.9068797.
Bhattacharjee, A., Hasan, T., Ahmad, W., Mubasshir, K.S., Islam, M.S., Iqbal, A., Rahman, M.S., Shahriyar, R., 2022. BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in bangla. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 1318–1327. http://dx.doi.org/10.18653/v1/2022.findings-naacl.98.
Bhowmik, N.R., Arifuzzaman, M., Mondal, M.R.H., 2022. Sentiment analysis on bangla text using extended lexicon dictionary and deep learning algorithms. Array 3, 100123. http://dx.doi.org/10.1016/j.array.2021.100123.
Bitto, A.K., Bijoy, M.H.I., Arman, M.S., Mahmud, I., Das, A., Majumder, J., 2023. Sentiment analysis from Bangladeshi food delivery startup based on user reviews using machine learning and deep learning. Bull. Electr. Eng. Inform. 12, 2282–2291. http://dx.doi.org/10.11591/eei.v12i4.4135.
Bradley, A.P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30 (7), 1145–1159. http://dx.doi.org/10.1016/S0031-3203(96)00142-2.
Camacho-Collados, J., Pilehvar, M.T., 2018. From word to sense embeddings: a survey on vector representations of meaning. J. Artificial Intelligence Res. 63 (1), 743–788. http://dx.doi.org/10.1613/jair.1.11259.
Cerqueira, T., Ribeiro, F.M., Pinto, V.H., Lima, J., Gonçalves, G., 2022. Glove prototype for feature extraction applied to learning by demonstration purposes. Appl. Sci. 12 (21). http://dx.doi.org/10.3390/app122110752.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: Synthetic minority over-sampling technique. J. Artificial Intelligence Res. 16, 321–357. http://dx.doi.org/10.1613/jair.953.
Chowdhury, S., Chowdhury, W., 2014. Performing sentiment analysis in bangla microblog posts. In: IEEE International Conference on Informatics, Electronics & Vision (ICIEV), pp. 1–6. http://dx.doi.org/10.1109/ICIEV.2014.6850712.
Dash, N.S., 2005. Corpus Linguistics and Language Technology: With Reference to Indian Languages. Mittal Publications, New Delhi, India.
Habibullah, M., Islam, M.S., Jahura, F.T., Biswas, J., 2023. Bangla document classification based on machine learning and explainable NLP. In: 6th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–6. http://dx.doi.org/10.1109/EICT61409.2023.10427766.
Hassan, A., Amin, M.R., Azad, A.K.A., Mohammed, N., 2016. Sentiment analysis on bangla and romanized bangla text using deep recurrent models. In: International Workshop on Computational Intelligence (IWCI), pp. 51–56. http://dx.doi.org/10.1109/IWCI.2016.7860338.
Hassan, M., Shakil, S., Moon, N.N., Islam, M.M., Hossain, R.A., Mariam, A., Nur, F.N., 2022. Sentiment analysis on bangla conversation using machine learning approach. Int. J. Electr. Comput. Eng. (IJECE) 12, 5562–5572. http://dx.doi.org/10.11591/ijece.v12i5.pp5562-5572.
Islam, M.S., Alam, K.M., 2023a. An empiric study on bangla sentiment analysis using hybrid feature extraction techniques. In: 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), pp. 1–7. http://dx.doi.org/10.1109/ICCCNT56998.2023.10308114.
Islam, M.S., Alam, K.M., 2023b. Sentiment analysis on bangla food reviews using machine learning and explainable NLP. In: 26th International Conference on Computer and Information Technology (ICCIT), pp. 1–6. http://dx.doi.org/10.1109/ICCIT60459.2023.10441309.
Junaid, M.I.H., Hossain, F., Upal, U.S., Tameem, A., Kashim, A., Fahmin, A., 2022. Bangla food review sentimental analysis using machine learning. In: IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0347–0353. http://dx.doi.org/10.1109/CCWC54503.2022.9720761.
Kabir, M., Mahfuz, O.B., Raiyan, S.R., Mahmud, H., Hasan, M.K., 2023. BanglaBook: A large-scale bangla dataset for sentiment analysis from book reviews. Comput. Lang. http://dx.doi.org/10.48550/arXiv.2305.06595.
Liu, J., Xu, Y., 2022. T-friedman test: A new statistical test for multiple comparison with an adjustable conservativeness measure. Int. J. Comput. Intell. Syst. 15, 29. http://dx.doi.org/10.1007/s44196-022-00083-8.
Manning, C., Schütze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.
Mia, M., Das, P., Habib, A., 2024. Verse-based emotion analysis of bengali music from lyrics using machine learning and neural network classifiers. Int. J. Comput. Digital Syst. 15 (1), 359–370. http://dx.doi.org/10.12785/ijcds/150128.
Nafisa, N., Maisha, S.J., Masum, A.K.M., 2023. Document level comparative sentiment analysis of bangla news using deep learning-based approach LSTM and machine learning approaches. Appl. Intell. Ind. 4.0, 198–211.
Prottasha, N.J., Sami, A.A., Kowsher, M., Murad, S.A., Bairagi, A.K., Masud, M., Baz, M., 2022. Transfer learning for sentiment analysis using BERT based supervised fine-tuning. Sensors 22, 4157. http://dx.doi.org/10.3390/s22114157.
Rafat, A.A.A., Salehin, M., Khan, F.R., Hossain, S.A., Abujar, S., 2019. Vector representation of bengali word using various word embedding model. In: 2019 8th International Conference System Modeling and Advancement in Research Trends (SMART), pp. 27–30. http://dx.doi.org/10.1109/SMART46866.2019.9117386.
Rahman, F., 2019. An annotated bangla sentiment analysis corpus. In: International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 1–5. http://dx.doi.org/10.1109/ICBSLP47725.2019.201474.
Rahman, M.A., Dey, E.K., 2018. Datasets for aspect-based sentiment analysis in bangla and its baseline evaluation. Data 3, 15. http://dx.doi.org/10.3390/data3020015.
Rashid, M.R.A., Hasan, K.F., Hasan, R., Das, A., Sultana, M., Hasan, M., 2024. A comprehensive dataset for sentiment and emotion classification from Bangladesh e-commerce reviews. Data Brief 53, 110052. http://dx.doi.org/10.1016/j.dib.2024.110052.
Shafin, M.A., Hasan, M.M., Alam, M.R., Mithu, M.A., Nur, A.U., Faruk, M.O., 2020. Product review sentiment analysis by using NLP and machine learning in bangla language. In: 23rd International Conference on Computer and Information Technology (ICCIT), pp. 1–5. http://dx.doi.org/10.1109/ICCIT51783.2020.9392733.
Sharmin, S., Chakma, D., 2021. Attention-based convolutional neural network for bangla sentiment analysis. AI Soc. 36, 381–396. http://dx.doi.org/10.1007/s00146-020-01011-0.
Sumit, S.H., Hossan, M.Z., Al Muntasir, T., Sourov, T., 2018. Exploring word embedding for bangla sentiment analysis. In: International Conference on Bangla Speech and Language Processing, pp. 1–5. http://dx.doi.org/10.1109/ICBSLP.2018.8554443.
Tabassum, N., Khan, M.I., 2019. Design an empirical framework for sentiment analysis from bangla text using machine learning. In: International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 1–5. http://dx.doi.org/10.1109/ECACE.2019.8679347.
Tuhin, R.A., Paul, B.K., Nawrine, F., Akter, M., Das, A.K., 2019. An automated system of sentiment analysis from bangla text using supervised learning techniques. (ICCCS), pp. 360–364. http://dx.doi.org/10.1109/CCOMS.2019.8821658.
Wang, D., Zhang, H., 2010. Inverse-category-frequency based supervised term weighting scheme for text categorization. arXiv preprint arXiv:1012.2609, p. 15.
