
Bengali Text Classification Distinguishing Saintly and Common Forms Using Machine Learning Model

The document discusses a study presented at the 2024 IEEE International Conference on Computing, Applications and Systems, focusing on classifying Bengali texts into 'saint' (Sadhu bhasa) and 'common' (Cholito bhasa) forms using various machine learning algorithms. The research utilized a dataset of 2948 texts and achieved the highest accuracy of 92.33% with the Support Vector Machine (SVM) algorithm. The study aims to address the linguistic mixing issue in Bengali and contributes to the field of Natural Language Processing (NLP) for better text classification in the language.


2024 IEEE International Conference on Computing, Applications and Systems (COMPAS),

September 25-26, 2024, Bangladesh

Bengali Text Classification: Distinguishing Saintly and Common Forms using Machine Learning Model
2024 IEEE International Conference on Computing, Applications and Systems (COMPAS) | 979-8-3315-2976-5/24/$31.00 ©2024 IEEE | DOI: 10.1109/COMPAS60761.2024.10796448

Umme Ayman, Md. Mohammad Asifur Rahim, Saiham Zaman Mridul, Md. Tanvir Ahmed Akash, Narayan Ranjan Chakraborty, Md. Hasan Imam Bijoy

Department of Computer Science and Engineering, Daffodil International University, Dhaka-1216

Abstract: In the realm of AI advancements, Natural Language Processing (NLP) stands out as the branch dedicated to improving how machines process human language. Text classification, a crucial area within NLP, has significantly strengthened natural language processing tasks. This study focuses on language processing in Bengali, our mother tongue. Notably, there is a dearth of research on Bengali text classification, especially concerning its two linguistic forms, “saint” (Sadhu bhasa) and “common” (Cholito bhasa), and only limited datasets are available. This study therefore aims to classify Bengali texts into these two forms, using around 2948 pieces of data gathered from Bengali literature, blogs, and articles. Several supervised machine learning algorithms are employed: MNB, RF, DT, SVM, KNN, and XGB. Before these algorithms were applied, the dataset was preprocessed with primary cleaning, regular-expression-based removal of unwanted characters, stop word removal, digit removal, null value removal, and tokenization. Among all the applied classifiers, SVM achieved the highest accuracy of 92.33% in predicting the two forms. RF, XGB, MNB, DT, and KNN attained accuracies of 91.87%, 91.20%, 88.04%, 86.68%, and 84.88%, respectively. Ultimately, this study opens up avenues for further research in Bangla language processing, particularly in text classification.

Keywords: Bangla text classification, Shadhu, Cholito, Machine learning, Natural language processing

I. INTRODUCTION

Bengali originated from Magadhi Prakrit, Pali, and Sanskrit and belongs to the Indo-Aryan language family [1]. Almost 100 million people speak Bengali around the world; it is the official language of Bangladesh and ranks sixth worldwide. Bengali has two standard styles: saint (Sadhu bhasa; elegant or genteel) and common (Cholito bhasa; current or colloquial) [2]. The saint form, or Sadhu bhasa, is commonly used in official documents and legal papers, and many prominent writers in the subcontinent have published poems, stories, fiction, dramas, and other works in it. The common form is the most prevalent variety of Bengali. It is used in everyday communication among Bengalis, primarily for casual communication, in both writing and speaking.

In recent times, people have been mixing the two varieties of Bengali, which is considered incorrect. As a result, it has become critical to distinguish the saint (Sadhu) and common (Cholito) forms of the Bengali language. In response to this mixing issue, Dhaka University published a paper on the language situation in Bangladesh [3].

To address such linguistic issues, we need to focus on natural language processing (NLP) tasks in Bangla, our mother tongue. NLP is the process by which computers assess, comprehend, and extract meaning from human language. It allows developers to organize and structure information for tasks such as text categorization or classification, automatic text summarization, translation, named entity recognition, relation extraction, sentiment analysis, speech recognition, and topic segmentation. NLP researchers are increasingly interested in text classification, which is combined with machine learning to improve language processing tasks. Text classification, or categorization, is one of the key areas of NLP research: texts are examined and then assigned to distinct groups or classes.

The main purpose of this study is to classify Bengali texts into saint (Sadhu) and common (Cholito) forms by applying machine learning (ML) classifiers. The proposed work will help avoid mixing these kinds



Authorized licensed use limited to: National University Fast. Downloaded on March 27,2025 at 07:58:05 UTC from IEEE Xplore. Restrictions apply.
of written communication, so that a text can be recognized regardless of whether it is written in Sadhu or Cholito bhasha. In this study, a methodology is proposed that employs various machine learning classifiers to assign Bengali text to one of two classes: saint (Sadhu bhasha) and common (Cholito). NLP offers different text classification settings, such as multi-level classification and binary-level classification. This paper focuses on binary-level classification, that is, classification of text into two levels or classes.

The prime contributions of this work are as follows:

• Gathering approximately 2948 self-collected data items from various historical books, novels, and literary works.
• Preprocessing the data using several techniques to ensure quality and machine feasibility.
• Applying six machine learning algorithms to the preprocessed dataset and evaluating them with accuracy, precision, recall, F1 score, TPR, TNR, FPR, FNR, and error rate.
• Adjusting the parameters of the machine learning models during implementation to improve accuracy.
• Analyzing the results through a comparative analysis of performance metrics, confusion matrices, and each classifier's predictions on input data.

The rest of the paper is structured as follows: Section II presents the literature review, Section III describes the methodology, Section IV presents the experiments and result analysis, and Section V concludes.

II. LITERATURE REVIEW

Classifying and converting saint and common Bengali text with machine learning is a distinctive topic for Natural Language Processing (NLP) research in Bangladesh. Many researchers around the world have worked on related language classification tasks. Some built their own datasets, while others used publicly accessible or merged datasets to train their classifiers. We reviewed those papers to develop our research work.

Bitto et al. [4] examined machine learning techniques for categorizing Bengali social media text as “good” or “bad”. They tested Logistic Regression (LR), Decision Tree Classifiers (DTC), Random Forests (RF), Multinomial Naive Bayes (MNB), and K-Nearest Neighbors (KNN) on a dataset of 1499 Bengali text documents. All achieved high accuracy, with MNB coming out on top at 89.31%.

Hasan et al. [5] investigated emotion recognition in Bengali speech using recurrent neural networks (RNNs). They categorized emotions in Bengali music and movie speech and obtained accuracies between 47.66% and 51.33% for six emotions: joy, sorrow, anger, surprise, fear, and disgust. Rahman et al. [6] proposed a dynamic approach using a Word2Vec model for sentiment classification in Bengali literature. Using a dataset of 11,000 items from Bengali articles and newspapers with continuous bag of words (CBOW), they reached 75% accuracy for sentiment classification into happy, furious, and excited categories.

Rahman et al. [7] likewise introduce a dynamic approach in which sentiment classification of Bengali literature employs a Word2Vec model. Using continuous bag of words, the applied models achieved 75% accuracy on a Bengali dataset of 11,000 items collected from Bengali articles and newspapers.

Wadud et al. [8] employed LSTM networks and achieved 93.38% accuracy on a labelled Bengali dataset of 20,000 social media texts in six categories: race, religion, ethnicity, sexual, gender, and physical disability. Hasan et al. [9] performed sentiment analysis on a Bengali dataset with three methods: RF, DT, and LSTM. The maximum accuracy was 89.42% for positive sentiments and 88.20% for negative sentiments on a collection of 1824 Bangla documents. Using a pretrained multilingual BERT model, Islam et al. [10] investigated sentiment analysis of Bengali news articles. On a dataset of 17,852 Bengali news entries, they categorized articles into three sentiment categories (positive, negative, and neutral) with an accuracy of 71%. Siam et al. [11] examined two prediction models, decision tree and naïve Bayes, for classifying Bangla news, with a success rate of 90%. However, the study does not explain how RNNs are used for news classification and does not specify the dataset size.

Using models such as CRF and BanglaBERT, H.A.Z et al. [12] performed named entity recognition in Bengali on the BanglaCoNER dataset, which contains 16,100 items, and achieved an F1 score of 0.79 on the validation set. However, the reported evaluation measures may not fully capture the improvement in model performance on Bengali text. Bitto et al. [13] applied Logistic Regression, Decision Tree, Random Forest, Multinomial Naïve Bayes, and KNN to a dataset of 1499 Bengali texts to classify them as good or bad, with Multinomial Naïve Bayes achieving an accuracy of 89.31%. There is scope to evaluate the models on a larger volume of data. Mandal et al. [14] explored several machine learning methods, including NB, SVM, DT, KNN, Random Forest (RF), and Logistic Regression (LR), for categorizing Bangla news items from diverse sources into business, sports, health, technology, and education categories. They achieved 89.14% accuracy with a collection of 1000 Bangla news documents. Haque et al. [15] proposed four categories (acceptable, political, sexual, and religious) for Bengali social media comments, using a supervised deep learning model called CLSTM that combines CNN and LSTM. On 42,036 Facebook comments, the model outperformed six baseline models, with an accuracy of 85.8% and an F1 score of 0.86. However, there is scope to improve the overall performance. Bitto et al. [16] studied sentiment analysis for Bangladeshi food delivery businesses using machine learning (ML) and deep learning (DL) approaches. They examined a variety of techniques, including Logistic Regression (LR), Random Forest (RF), Multinomial Naive Bayes (MNB), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Recurrent Neural Networks (RNN). Their best results came from the machine learning techniques, with Random Forest scoring 89.64% for positive

sentiment and 91.07% for negative sentiment. Roy et al. [17] used several machine learning techniques to classify Bengali news items, including Support Vector Machines (SVM), Random Forest (RF), Multinomial Naive Bayes (MNB), Decision Tree (DT), K-Nearest Neighbors (KNN), and Logistic Regression (LR). They attained the highest accuracies of 0.9261 and 0.9521 using SVM and Xo-Bert, respectively, on a dataset of 14,451 Bengali news pieces. Alam et al. [18] evaluated the effectiveness of the BERT and XLM-RoBERTa large language models, together with a beam search decoder, for text summarization of web pages, social media posts, online news portals, emails, online shops, user reviews, and customer care question and answer sessions. They reported 93.8% on a dataset of Bengali and English documents. Sarkar et al. [19] investigated a framework for classifying Bangla news articles and comments with Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks. They obtained an accuracy of 83.77% using a dataset of 13,803 Bengali news items and comments. Naughton et al. [20] used Support Vector Machines (SVMs) to classify text from diverse genres into event categories such as "die" and "attack". They tested two SVM models on English text from news articles and blogs. Yajni et al. [21] used a skip-gram model, a variation of n-grams, to investigate sentiment analysis in Nepali literature, classifying texts as positive or negative. They achieved 89% classification accuracy on a dataset of 4,200 Nepali sentences.

III. RESEARCH METHODOLOGY

Classifying Bengali text into its two forms is a significant and timely topic. Choosing an appropriate methodology is a fundamental step towards effective categorization and prediction, and a proper choice of methods yields a well-specified system. In this part, we present our methodology workflow to show how our model classifies Bengali text into the two forms, Saint and Common, and predicts them appropriately. Fig. 1 shows the stages of the workflow.

Fig. 1: Research methodology diagram of proposed work.

A. Dataset Collection and Description

Datasets play a vital role in research, and machine learning gains traction as more data is gathered to train models. We collected 2948 Bengali texts in two classes, saint and common, covering a variety of subdomains, including agriculture, medicine, crime, education, social media, feelings, food, and finance. Some of the saint sentences were written by us according to those subdomains, while others were taken from historical books, novels, and literary works such as চোখের বালি, চরিত্রহীন, and লালসালু. With this Saint and Common Bengali Language dataset, machine learning (ML) models can be trained from the data, which can improve results in language processing. Table 1 and Table 2 present sample words in saint and common form and sample data, respectively. Fig. 2 shows the data statistics.

TABLE I. SAMPLE BENGALI WORDS IN SAINT AND COMMON FORM

| No. | Saint Form | Common Form |
|---|---|---|
| 01 | দেখিয়াছেন | দেখেছেন |
| 02 | খাইয়াছেন | খেয়েছেন |
| 03 | করিয়াছো | করেছ |
| 04 | তাহারা | তারা |
| 05 | তাহাকে | তাকে |
| 06 | ইহারা | ইহা |

TABLE II. DATA SAMPLE AFTER PREPROCESSING

| No. | Context | Class |
|---|---|---|
| 01 | শিক্ষা প্রতিষ্ঠান সমাজ গঠনে পূর্ণ ভূমিকা পালন করে। | Common |
| 02 | মাটির উর্বরতা বজায় রাখার জন্য ফসলের আবর্তন কৃষিতে একটি সাধারণ অভ্যাস। | Common |
| 03 | যাহা চাহিয়াছিলাম তাহা পাই নাই | Saint |
| 04 | তাহার জন্য আমার কথায় ব্যাঘাত কিছু ঘটিয়াছে | Saint |

Fig. 2: Statistics of data.

B. Data Preprocessing

Data preprocessing is like creating a road map that helps machines cope with the complexity of information. Only machine-accessible data can provide accurate predictions. We preprocess our data in five phases to improve its quality. After data collection, the data were labelled by three annotators who are experts in Bengali grammar and can identify saint and common forms in Bengali texts. Each text was labelled 0 for the saint (Sadhu) form and 1 for the common (Cholito) form. Table 3 illustrates the data labelling, or class mapping.

Fig. 3: Applied data preprocessing techniques

Table 3 shows that the data are labelled in Bangla as Sadhu and Cholito, in English as Saint and Common, and numerically as 0 and 1. This labelling scheme keeps the classes clear, consistent, and well defined. After data labelling, primary cleaning, regular-expression-based character removal, emoji removal, and null value removal are applied sequentially as preprocessing techniques to improve the computational efficiency of the classifiers.

TABLE III. DATA LABELLING

| Bangla | English | Numerical |
|---|---|---|
| Sadhu (সাধু) | Saint | 0 |
| Cholito (চলিত) | Common | 1 |

As part of primary cleaning, we examined whether the sentences had the three qualities of coherence, clarity, and imagery, and whether the words were correctly spelled. Any problematic sentence we found was removed manually. Because we collected a large amount of data, the same sentence may have been scraped multiple times, so we checked for duplicates and removed them from the dataset to avoid overfitting. Textual data may contain emoticons, Unicode symbols, digits, punctuation, special characters, and letters or words from other languages. Removing these unnecessary characters with regular expressions produces smooth, readable text. Table 4 presents sample input and the cleaned data after this step.

TABLE IV. CLEANED DATA AFTER REMOVING REGULAR EXPRESSION

| No. | Sample Input Text | Cleaned Data |
|---|---|---|
| 01 | দেশের রাজনীতি দিনকে দিন পচে যাচ্ছে।!! धोरे नकाएम कोिमकाएं . How unfortunate. সুস্থ থাকা দায় | দেশের রাজনীতি দিনকে দিন পচে যাচ্ছে সুস্থ থাকা দায় |
| 02 | তাহারা আমাদেরকে কষ্ট দিয়াছিল!!! | তাহারা আমাদেরকে কষ্ট দিয়াছিল |

We eliminated null values from the dataset, along with stop words that add complexity without adding information. Stop word removal discards the most frequent words, since they provide little context. Stop words in Bangla texts comprise articles, prepositions, and conjunctions. Eliminating these terms can increase the accuracy and efficiency of natural language processing.

For instance, the stop words in the sentence "আমি এবং সে আরও একটি জয় পেয়েছি" are “আমি”, “এবং”, “সে”, “আরও”, and “একটি”. After stop word removal, the phrase becomes “জয় পেয়েছি”.

C. Feature Extraction

Once the data has been cleaned, we tokenize the text. Tokenizing a sentence means breaking it into individual words. First, CountVectorizer tokenizes our data, splitting each text into single words on whitespace. For example, the text “বিজয়ী বীর অভ্যর্থনা পেল” is transformed into the array of words [“বিজয়ী”, “বীর”, “অভ্যর্থনা”, “পেল”]. Second, CountVectorizer builds a vocabulary of all unique words in the text corpus; on our dataset it returns a vocabulary of 67 unique words. Finally, it converts each text into a numerical vector over that vocabulary and reports a lexicon density of 35.38.

D. Dataset Splitting

After feature extraction, we split the cleaned dataset into training and test data. Of the 2948 preprocessed texts, we trained our models on 85% and used 15% for testing. We used a large share for training to achieve higher accuracy. This gives 2505 training texts and 443 test texts.

E. Model Implementation

Machine learning offers several supervised and unsupervised models for language processing tasks, especially text classification. We applied six supervised classifiers to our Bengali dataset to classify Bengali sentences into the two forms, Saint and Common, with satisfactory accuracy: Multinomial Naive Bayes (MNB), Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and XGBoost (XGB). A short description of each classifier is given below.

Support Vector Machine: Support vector machines are supervised learning methods well suited to difficult text categorization problems. The primary goal of the SVM technique is to find the best hyperplane in an N-dimensional feature space that divides the data points into distinct classes. The separating hyperplane of a linear SVM is:

$$w^{T}x + b = 0 \tag{1}$$

where $w$ is the weight vector, $x$ the feature vector, and $b$ the bias.

Random Forest: Random forest is an ensemble approach that builds multiple decision trees at training time, each on a random subset of the data and features. It reduces overfitting and improves generalization by combining the predictions of these trees, by voting for classification or averaging for regression. It is popular for text classification because it handles high-dimensional data, provides feature importance estimates, and performs robustly when tuned. A random forest with bootstrap aggregating predicts:

$$\hat{y} = \operatorname{mode}\{h_1(x), h_2(x), \ldots, h_B(x)\} \tag{2}$$

where $h_1, \ldots, h_B$ are the individual trees.

XGBoost: XGB is an advanced gradient boosting algorithm, known for its efficiency and results in supervised learning tasks such as classification, regression, and ranking. It sequentially builds a series of models, typically decision trees, each correcting the errors made by the previous ones. The mathematical representation of this classifier is given below:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \tag{3}$$
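The vectorization and splitting steps of Sections C and D can be sketched without any library. The helpers below are hand-written stand-ins, not scikit-learn's CountVectorizer, and their names are hypothetical. Rounding the test share up is also an assumption; the paper does not say how it rounded, but this choice reproduces the reported 2505/443 split of 2948 texts.

```python
import math
from collections import Counter


def build_vocabulary(corpus):
    """Map every unique whitespace token to a column index (sorted order)."""
    tokens = sorted({tok for doc in corpus for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    """Bag-of-words vector: one count per vocabulary entry, in index order."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


corpus = ["বিজয়ী বীর অভ্যর্থনা পেল", "বীর বীর"]
vocab = build_vocabulary(corpus)
vectors = [count_vector(doc, vocab) for doc in corpus]
print(len(vocab), vectors)
print(split_sizes(2948, 0.15))  # (2505, 443)
```

Each vector has one entry per vocabulary word, so a classifier sees every text as a fixed-length list of word counts regardless of sentence length.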

Multinomial Naive Bayes: Multinomial Naive Bayes is a Naive Bayes variant that performs well in classification tasks with discrete features. It is frequently applied to text categorization, where the features are word counts or frequency distributions. The "naive" assumption is that features are independent given the class label, although this may not hold in practice. Despite its simplicity and this oversimplified assumption, Multinomial Naive Bayes often performs remarkably well, especially on large, high-dimensional datasets. The mathematical representation of Naive Bayes is:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{4}$$

Decision Tree: Decision trees are versatile supervised algorithms for classification and regression that partition the data recursively based on information gain or Gini impurity. They consist of root, internal, and leaf nodes connected by branches. The algorithm keeps splitting the data until it meets a stopping criterion such as a depth limit or a sample threshold. Decision trees offer interpretability through visual decision paths and handle both numerical and categorical data efficiently. Overfitting can be managed with pruning, depth limits, or ensemble methods such as Random Forest [1]. For a binary classification problem, the entropy H(D) is defined as:

$$H(D) = -\sum_{i=1}^{n} p_i \times \log_2 p_i \tag{5}$$

where $p_i$ is the proportion of samples in class $i$.

K-Nearest Neighbor: The k-NN algorithm is a supervised machine learning method for regression and classification. It assigns a data point the majority class among its k nearest neighbors, measured here by Euclidean distance. For regression, the output is the average of the target values of the k nearest neighbors. The Euclidean distance used by KNN is:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{6}$$

F. Performance Evaluation

After implementing the models, various performance metrics are calculated to assess their effectiveness in the binary task of identifying the saint and common forms of Bangla sentences. The first step is to generate a confusion matrix, from which the true positive rate, false negative rate, false positive rate, and true negative rate are derived. Accuracy, precision, F1 score, and error rate are then computed to determine the best model for text classification. Their equations (7)-(14) are given below:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{7}$$

$$\text{TPR} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \times 100\% \tag{8}$$

$$\text{FNR} = \frac{\text{False Negative}}{\text{True Positive} + \text{False Negative}} \times 100\% \tag{9}$$

$$\text{FPR} = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}} \times 100\% \tag{10}$$

$$\text{TNR} = \frac{\text{True Negative}}{\text{False Positive} + \text{True Negative}} \times 100\% \tag{11}$$

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \times 100\% \tag{12}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100\% \tag{13}$$

$$\text{Error Rate} = \frac{\text{False Negative} + \text{False Positive}}{\text{Total no. of samples}} \times 100\% \tag{14}$$

IV. RESULT ANALYSIS AND DISCUSSIONS

The six machine learning models make it possible to distinguish between saint and common Bangla sentences. All classifiers work well, although some are better than others. Several performance measures are computed to assess each model's efficacy during the training and testing phases. Among machine learning metrics, the confusion matrix is the most widely used for classification tasks. Table 5 presents its four essential values: true positive, true negative, false positive, and false negative. The confusion matrix for each deployed model is shown in Fig. 4; each shows a large number of true positives and true negatives and few false positives and false negatives.

TABLE V. CONFUSION MATRIX

| | Saint (Predict) | Common (Predict) |
|---|---|---|
| Saint (Actual) | Saint (True) | Common (False) |
| Common (Actual) | Saint (False) | Common (True) |

To find the best model for text classification, accuracy, precision, F1 score, and error rate were computed from the confusion matrix. In addition, each model was evaluated by its true positive rate (TPR), false negative rate (FNR), false positive rate (FPR), and true negative rate (TNR). Performance was also calculated for each class independently. The performance of the six machine learning models is displayed in Table 6.

Table 6 shows that three models (SVM, RF, and XGB) achieve an accuracy of more than 90% and that all six achieve more than 84%, which is quite good for a classification task. The support vector machine (SVM) outperformed the other five classifiers, with the highest accuracy of 92.33%. The random forest model comes a close second, with a precision of 92.16% and an accuracy of 91.87%. The XGBoost model obtained an F1 score of 91.17% and an accuracy of 91.20%. The K-Nearest Neighbor (KNN) model performed the worst, with an accuracy of 84.88% and a slightly uneven class-wise performance of 84.86%.

Following the performance evaluation, several case studies were conducted to assess whether the models accurately detect specific saint and common texts. The six machine learning classifiers were applied independently to Bengali texts in the Saint and Common forms, as shown in Table 7 and Table 8, respectively. An example input was sourced from an online newspaper to test whether the algorithms could correctly predict the form of the input.

Fig. 4: Confusion matrices for applied machine learning models (a-f)

TABLE VI. PERFORMANCE METRICS FOR ALL APPLIED MACHINE LEARNING MODELS (ALL VALUES IN %)

| Model | Accuracy | Error Rate | Class | TPR | FNR | FPR | TNR | Precision | F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| SVM | 92.33 | 7.67 | Saint | 89.77 | 10.23 | 5.26 | 94.74 | 94.15 | 91.90 |
| | | | Common | 94.74 | 5.26 | 10.23 | 89.77 | 90.76 | 92.70 |
| RF | 91.87 | 8.13 | Saint | 87.91 | 12.09 | 4.39 | 95.61 | 94.97 | 91.30 |
| | | | Common | 95.61 | 4.39 | 12.09 | 87.91 | 89.34 | 92.37 |
| XGB | 91.20 | 8.80 | Saint | 88.37 | 11.63 | 6.14 | 93.86 | 93.14 | 90.69 |
| | | | Common | 93.86 | 6.14 | 11.63 | 88.37 | 89.54 | 91.65 |
| MNB | 88.04 | 11.96 | Saint | 80.00 | 20.00 | 4.39 | 95.61 | 94.51 | 86.65 |
| | | | Common | 95.61 | 4.39 | 20.00 | 80.00 | 83.52 | 89.16 |
| DT | 86.68 | 13.32 | Saint | 87.50 | 12.50 | 14.16 | 85.84 | 86.34 | 86.92 |
| | | | Common | 85.84 | 14.16 | 12.50 | 87.50 | 87.04 | 86.44 |
| KNN | 84.88 | 15.12 | Saint | 84.65 | 15.35 | 14.91 | 85.09 | 84.26 | 84.45 |
| | | | Common | 85.09 | 14.91 | 15.35 | 84.65 | 85.46 | 85.27 |
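To make Eqs. (7)-(14) concrete, the sketch below computes them from four confusion-matrix counts. The counts TP = 193, FN = 22, FP = 12, TN = 216 are not reported in the paper; they are back-calculated here from the SVM row of Table VI and the 443-text test set, treating Saint as the positive class, and should be read as an assumption. The function name is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return the metrics of Eqs. (7)-(14) as percentages."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as TPR
    return {
        "accuracy": 100 * (tp + tn) / total,
        "tpr": 100 * recall,
        "fnr": 100 * fn / (tp + fn),
        "fpr": 100 * fp / (fp + tn),
        "tnr": 100 * tn / (fp + tn),
        "precision": 100 * precision,
        "f1": 100 * 2 * precision * recall / (precision + recall),
        "error_rate": 100 * (fn + fp) / total,
    }


# Assumed counts, inferred from Table VI (SVM, Saint as the positive class).
svm_saint = binary_metrics(tp=193, fn=22, fp=12, tn=216)
for name, value in svm_saint.items():
    print(f"{name}: {value:.2f}")
```

With these counts the output agrees with the SVM Saint row to two decimals (for example, accuracy 92.33 and F1 91.90), which suggests the table was derived from a single confusion matrix per model.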

Tables VII and VIII show that all six models correctly predicted the form of the input Bangla text, Saint and Common respectively.

TABLE VII. PREDICTIONS OF SAINT FORM BANGLA TEXT
Original Text: বিজয়ী বীরদের অভ্যর্থনার জন্য চারদিকের মানুষ ব্যতিব্যস্ত হইয়া উঠিয়াছিল। (English: The people all around had become busy with the reception of the victorious heroes.)
Original Prediction: Saint
Input Text: বিজয়ী বীরদের অভ্যর্থনার জন্য চারদিকের মানুষ ব্যতিব্যস্ত হইয়া উঠিয়াছিল
Prediction of Our Algorithms:
XGB: Saint
RF: Saint
SVM: Saint
MNB: Saint
DT: Saint
KNN: Saint

TABLE VIII. PREDICTIONS OF COMMON FORM BANGLA TEXT
Original Text: বাবা অত্যন্ত বিচক্ষণতার সাথে দুই ভাইয়ের ঝগড়া মীমাংসা করে দিয়েছিলেন। (English: Father settled the quarrel between the two brothers with great prudence.)
Original Prediction: Common
Input Text: বাবা অত্যন্ত বিচক্ষণতার সাথে দুই ভাইয়ের ঝগড়া মীমাংসা করে দিয়েছিলেন
Prediction of Our Algorithms:
XGB: Common
RF: Common
SVM: Common
MNB: Common
DT: Common
KNN: Common

Fig. 5: Overall performance of the six machine learning models.

V. CONCLUSION AND FUTURE WORKS

The classification of Bengali text into Saint and Common forms is an emerging and promising research area within Bangla natural language processing. The proposed approach used a dataset of 2948 samples, a quantity suited to testing six classifiers on distinguishing Saint and Common forms in Bengali texts. The results show that SVM outperformed the other methods with an accuracy of 92.33%, while the RF, XGB, MNB, DT, and KNN models achieved accuracies of 91.87%, 91.20%, 88.04%, 86.68%, and 84.88%, respectively. This study aimed to identify whether Bengali sentences are written in the Saint or the Common form. Future work will explore the integration of deep learning methods to develop a new framework for Bengali text classification, will use larger datasets, and will employ explainable AI (XAI) techniques to enhance model performance.
