Spring 2025 - CS619 - 10969
Abstract / Introduction
The extensive use of social media has led to a significant increase in cyber abuse, including
harassment, bullying, and offensive language, particularly in Roman Urdu. The absence of effective
automated detection systems allows such content to persist, negatively impacting online interactions.
Identifying cyber abuse in Roman Urdu presents a unique challenge due to informal language
structure, variations in spelling, and contextual meanings.
This project aims to develop a machine learning-based model capable of detecting and classifying
cyber abuse in Roman Urdu text. The proposed system will utilize natural language processing (NLP)
techniques and will be trained on data collected from social media platforms. Furthermore, a web
interface will be developed to enable users to evaluate the model’s performance in real time.
Functional Requirements:
The Admin (Student) will perform all of the following tasks.
1. Data-Collection
For this project, the student will collect data from any social media platform (such as
YouTube, Facebook, Twitter, or Instagram) to detect cyber abuse. The dataset must contain
at least 5,000 comments focusing on Roman Urdu.
The student is required to create their own dataset, and using pre-existing datasets from
sources like Kaggle or other online repositories will not be accepted. Any attempt to do so
will result in a deduction of marks. A sample dataset is provided in the link below for
reference.
2. Data Preparation
Prepare the dataset by labeling each comment as "Abusive (A)" or "Non-Abusive (NA)."
This step involves manually reviewing the data to assign appropriate labels, ensuring the
dataset is clean, well-structured, and suitable for machine learning.
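Manual labeling tends to leave small inconsistencies such as lowercase labels, stray spaces, or empty cells. The sketch below shows one way to check the labels after annotation. The rows are made-up toy data, and the names VALID_LABELS and check_labels are hypothetical helpers, not part of any required template.

```python
from collections import Counter

# The two labels required by the project brief.
VALID_LABELS = {"A", "NA"}

# Toy (comment, label) rows standing in for the manually labeled dataset.
rows = [
    ("bohat acha video hai", "NA"),
    ("tum pagal ho", "A"),
    ("shukriya bhai", "na "),   # inconsistent casing and a stray space
    ("zabardast kaam", ""),     # label was never filled in
]

def check_labels(rows):
    """Normalize label text and report the row indexes that still need review."""
    cleaned, problems = [], []
    for index, (comment, label) in enumerate(rows):
        label = label.strip().upper()
        if label in VALID_LABELS:
            cleaned.append((comment, label))
        else:
            problems.append(index)
    return cleaned, problems

cleaned, problems = check_labels(rows)
print("Class distribution:", Counter(label for _, label in cleaned))
print("Rows needing review:", problems)
```

Printing the class distribution at this stage also shows early whether the dataset is heavily skewed toward one label.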
3. Data Pre-Processing
As real-world data is often incomplete, noisy, and contains missing values, the student
must apply pre-processing techniques to ensure data quality. The following steps should be
performed systematically:
i. Missing Values
o First, check how many missing values are present and display the output.
o Then, apply an appropriate technique to handle them (e.g., remove or fill with
relevant values).
Additionally, the student must normalize the dataset, remove stop words, and ensure data
is properly structured before feature extraction.
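The pre-processing steps above can be sketched with pandas as follows. This is a minimal illustration under assumptions: the column names "comment" and "label", the toy rows, and the small ROMAN_URDU_STOP_WORDS set are all invented for the example. Roman Urdu has no standard stop-word list, so the student would need to build and justify their own.

```python
import re

import pandas as pd

# Toy stand-in for the collected dataset; column names are assumptions.
df = pd.DataFrame({
    "comment": ["Bohat ACHA video hai!!!", None, "tum   pagal ho???",
                "  ", "Shukriya bhai :)"],
    "label": ["NA", "NA", "A", "NA", "NA"],
})

# Step i: count the missing values in each column and display the result.
print(df.isna().sum())

# Handle them: here rows without comment text are removed.
df = df.dropna(subset=["comment"]).copy()

# Illustrative stop words only; not an authoritative Roman Urdu list.
ROMAN_URDU_STOP_WORDS = {"hai", "ho", "ka", "ki", "ke", "ko", "se", "aur", "ye", "tum"}

def normalize(text):
    """Lowercase, strip URLs and non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in ROMAN_URDU_STOP_WORDS]
    return " ".join(tokens)

df["clean"] = df["comment"].apply(normalize)

# Comments that become empty after cleaning carry no signal, so drop them too.
df = df[df["clean"] != ""].reset_index(drop=True)
print(df[["clean", "label"]])
```

Dropping rows is only one option. Filling missing labels after a second manual review may be preferable when the dataset is close to the 5,000-comment minimum.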
4. Feature Extraction
After the pre-processing step, the student will apply feature extraction techniques to
convert textual data into a structured format suitable for machine learning models. Possible
techniques include Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words
(BoW), N-Gram Models (Uni-Gram, Bi-Gram, Tri-Gram, etc.), and Word Embeddings
(Word2Vec, FastText, GloVe).
The student must have a clear understanding of the working principles, advantages, and
limitations of the chosen feature extraction method. It is essential to justify the selection by
explaining why a particular technique was used and how it contributes to improving the
model's performance.
5. Train & Test Data
The student will split the dataset into 75% training data and 25% testing data to evaluate
the performance of the machine learning models. To ensure reliable results, the student
can apply randomized splitting to avoid bias and maintain data diversity.
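A 75/25 randomized split might look like the sketch below, which uses scikit-learn's train_test_split. The data is a placeholder, and the random_state value and the stratify option are assumptions added for the example: a fixed seed makes the split reproducible, and stratifying keeps the A/NA ratio similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 20 comments with alternating labels.
comments = [f"comment {i}" for i in range(20)]
labels = ["A" if i % 2 == 0 else "NA" for i in range(20)]

X_train, X_test, y_train, y_test = train_test_split(
    comments,
    labels,
    test_size=0.25,     # 25% testing data, 75% training data
    shuffle=True,       # randomized splitting
    random_state=42,    # assumed seed, only for reproducibility
    stratify=labels,    # assumed option: preserve the class balance
)

print(len(X_train), "training samples,", len(X_test), "testing samples")
```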
6. Machine Learning Techniques
The student must use at least three different classifiers/models from distinct machine
learning techniques/algorithms. Possible choices include Naïve Bayes (Multinomial,
Bernoulli), Support Vector Machine (SVM) with different kernels (Poly, RBF), Decision Tree,
Random Forest, Logistic Regression, and Ensemble Methods. The selection should be based
on the suitability of the algorithm for text classification tasks.
Additionally, the student must have a clear understanding of each chosen model, including
its algorithmic working, advantages, limitations, and practical applications. It is essential
that the student can justify their selection by explaining why a particular model was chosen
over others. Furthermore, the student should be proficient in the implementation and
coding of the selected models and be able to analyse their performance effectively.
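The sketch below shows one way to train three classifiers from distinct algorithm families on the same features. It is an illustration under assumptions: the training sentences are invented toy data, the three models are just one valid combination from the list above, and wrapping each model in a scikit-learn pipeline with TF-IDF is a design choice rather than a requirement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy training data; the real project uses the collected dataset.
train_texts = [
    "bohat acha video", "shukriya bhai", "zabardast kaam", "acha laga",
    "pagal insan", "ghatiya admi", "bewakoof larka", "jahil log",
]
train_labels = ["NA", "NA", "NA", "NA", "A", "A", "A", "A"]

# Three classifiers from different algorithm families.
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Each pipeline fits its own TF-IDF vocabulary, then the classifier.
fitted = {}
for name, classifier in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(train_texts, train_labels)
    fitted[name] = pipeline

for name, pipeline in fitted.items():
    print(name, "->", pipeline.predict(["acha kaam bhai"])[0])
```

Keeping the vectorizer inside the pipeline means the same object can later accept raw text, which simplifies linking the model to a web interface.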
7. Confusion Matrix
The student must generate a confusion matrix for each classification model to evaluate its
performance. The confusion matrix should report the counts of True Positives (TP),
True Negatives (TN), False Positives (FP), and False Negatives (FN) to assess the model’s
accuracy. A separate confusion matrix must be created for each selected machine learning
model, and the results should be analyzed to compare their effectiveness in detecting cyber
abuse.
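A confusion matrix for one model can be produced as sketched below with scikit-learn's confusion_matrix. The true and predicted labels are made-up values for illustration. Treating "A" (Abusive) as the positive class is an assumption of this example, and the labels argument fixes the row and column order so the four counts can be unpacked reliably.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for a single model.
y_true = ["A", "A", "A", "NA", "NA", "NA", "NA", "A"]
y_pred = ["A", "NA", "A", "NA", "A", "NA", "NA", "A"]

# Rows are true classes, columns are predicted classes, in the order NA, A.
cm = confusion_matrix(y_true, y_pred, labels=["NA", "A"])
print(cm)

# With "A" as the positive class, the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = (int(value) for value in cm.ravel())
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

The same call would be repeated once per trained model, using that model's predictions on the test set.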
8. Accuracy Evaluation
The student must find the accuracy of all selected machine learning techniques and
compare their performance.
This project will also determine which machine learning technique is most effective for
detecting cyber abuse.
In addition to accuracy, the student should evaluate precision, recall, and F1-score for a
more comprehensive analysis.
The student must visually represent accuracy comparisons using graphs, bar charts, or
other suitable visualizations to highlight differences between models.
A final analysis should be conducted to explain which model performed best and why,
based on the evaluation metrics.
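The evaluation and comparison described above might be sketched as follows. All predictions here are hypothetical placeholders, not real results, and the model names, the output file name, and the use of F1-score to pick the best model are assumptions made for the example. The Agg backend is selected so the chart is written to a file without needing a display.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file; no display window required
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical test labels and per-model predictions (placeholders only).
y_true = ["A", "A", "A", "A", "NA", "NA", "NA", "NA"]
predictions = {
    "Naive Bayes": ["A", "A", "A", "NA", "NA", "NA", "NA", "A"],
    "Linear SVM": ["A", "A", "A", "A", "NA", "NA", "NA", "A"],
    "Logistic Regression": ["A", "A", "NA", "NA", "NA", "NA", "NA", "NA"],
}

# Compute accuracy, precision, recall, and F1 with "A" as the positive class.
results = {}
for name, y_pred in predictions.items():
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label="A"
    )
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, results[name])

# Bar chart comparing accuracy across the models.
fig, ax = plt.subplots()
ax.bar(list(results), [scores["accuracy"] for scores in results.values()])
ax.set_ylim(0, 1)
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy comparison of classifiers")
fig.savefig("accuracy_comparison.png")

# One possible way to name a best model: highest F1-score.
best = max(results, key=lambda name: results[name]["f1"])
print("Best model by F1-score:", best)
```

Two models can share the same accuracy while differing sharply in precision and recall, which is why the final analysis should not rest on accuracy alone.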
9. Web Interface Integration
After developing the model, the student will integrate a web interface to allow users to test
the model’s performance using real-time comments.
The interface should provide a text input field where users can enter a comment, and the
system will classify it as Abusive (A) or Non-Abusive (NA).
The web interface will be developed using Flask or Django, with a simple HTML/CSS
frontend for user interaction.
The student should ensure that the interface is fully functional, correctly linked to the
trained model, and capable of making real-time predictions.
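A minimal Flask version of such an interface is sketched below. It is an illustration under assumptions: classify_comment is a hypothetical keyword-based placeholder standing in for the trained model, and the inline PAGE template and route layout are choices made for the example. In the real project the placeholder would be replaced by a call to the saved model's predict method.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Simple HTML page with a text input field and a slot for the result.
PAGE = """
<!doctype html>
<title>Roman Urdu Cyber Abuse Detection</title>
<h1>Roman Urdu Cyber Abuse Detection</h1>
<form method="post">
  <input type="text" name="comment" placeholder="Enter a comment" required>
  <button type="submit">Classify</button>
</form>
{% if label %}<p>Prediction: <strong>{{ label }}</strong></p>{% endif %}
"""

def classify_comment(text):
    """Hypothetical placeholder for the trained model; returns "A" or "NA"."""
    flagged_words = {"pagal", "bewakoof"}  # illustrative keywords only
    return "A" if flagged_words & set(text.lower().split()) else "NA"

@app.route("/", methods=["GET", "POST"])
def index():
    label = None
    if request.method == "POST":
        comment = request.form.get("comment", "").strip()
        if comment:
            code = classify_comment(comment)
            label = "Abusive (A)" if code == "A" else "Non-Abusive (NA)"
    return render_template_string(PAGE, label=label)
```

Assuming the file is saved as app.py, the development server can be started with the command flask --app app run, after which the form is available in a browser on the local machine.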
Tools/Techniques:
Anaconda: Python distribution platform for development.
Jupyter Notebook: For implementing machine learning models.
Python: Programming language used for data pre-processing, model training, and feature
extraction.
Machine Learning Algorithms: For training and testing the cyber abuse detection models.
Web Interface: Basic HTML/CSS, Flask, or Django.
Prerequisite:
Knowledge of Artificial Intelligence, Machine Learning, and Natural Language Processing
concepts is required. Students will complete a short course covering these concepts alongside
the initial SRS and Design documentation, or may consult the links below.
Helping Material:
Python:
https://www.python.org/
https://www.w3schools.com/python/
https://www.tutorialspoint.com/python/index.htm
Dataset:
https://drive.google.com/file/d/1l8Mo22kVQzrucbo2LCwnP74sRZ4Eztb_/view?usp=sharing
Supervisor:
Name: Tayyab Waqar
Email ID: [email protected]
Skype ID: maliktayyab786_1