Spring 2025 - CS619 - 10969

The project focuses on developing a machine learning model to detect cyber abuse in Roman Urdu text on social media platforms. It involves data collection, preparation, pre-processing, feature extraction, and the application of various machine learning techniques, culminating in a web interface for real-time testing. The project aims to evaluate model performance through metrics like accuracy, precision, recall, and F1-score, ultimately determining the most effective machine learning technique for this task.

Uploaded by

z GOD

Cyber Abuse Detection using Machine Learning for Roman Urdu

Project Domain / Category


Data Science / Machine Learning / Natural Language Processing (NLP)

Abstract / Introduction
The extensive use of social media has led to a significant increase in cyber abuse, including
harassment, bullying, and offensive language, particularly in Roman Urdu. The absence of effective
automated detection systems allows such content to persist, negatively impacting online interactions.
Identifying cyber abuse in Roman Urdu presents a unique challenge due to informal language
structure, variations in spelling, and contextual meanings.
This project aims to develop a machine learning-based model capable of detecting and classifying
cyber abuse in Roman Urdu text. The proposed system will utilize natural language processing (NLP)
techniques and will be trained on data collected from social media platforms. Furthermore, a web
interface will be developed to enable users to evaluate the model’s performance in real time.

Functional Requirements:
The admin (student) will perform all of the following functional requirement tasks.
1. Data Collection
• For this project, the student will collect data from any social media platform (such as
YouTube, Facebook, Twitter, or Instagram) to detect cyber abuse. The dataset must contain
at least 5,000 comments focusing on Roman Urdu.
• The student must create their own dataset; pre-existing datasets from sources such as
Kaggle or other online repositories will not be accepted, and any attempt to use one
will result in a deduction of marks. A sample dataset is provided in the Dataset link below
for reference.
2. Data Preparation
• Prepare the dataset by labeling each comment as "Abusive (A)" or "Non-Abusive (NA)."
This step involves manually reviewing the data to assign appropriate labels, ensuring the
dataset is clean, well-structured, and suitable for machine learning.
3. Data Pre-Processing
• Real-world data is often incomplete and noisy and may contain missing values, so the
student must apply pre-processing techniques to ensure data quality. The following steps
should be performed systematically:

i. Missing Values
o First, check how many missing values are present and display the output.
o Then, apply an appropriate technique to handle them (e.g., remove or fill with
relevant values).

ii. Duplicate Values
o First, check the number of duplicate entries and display the output.
o Then, remove the duplicates to maintain data quality.

iii. Noise & Outliers
o First, identify noisy or extreme values and display the output.
o Then, clean or handle them to improve dataset reliability.

• Additionally, the student must normalize the dataset, remove stop words, and ensure the
data is properly structured before feature extraction.
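The pre-processing steps above could be sketched roughly as follows, using only the Python standard library. The sample rows, the stop-word list, and the definition of "noise" (comments with no letters left after normalization) are all assumptions made for illustration; the actual dataset and rules are up to the student.

```python
import re

# Hypothetical sample rows as (comment, label); None marks a missing comment.
rows = [
    ("Tum bohat ache ho!!!", "NA"),
    (None, "NA"),
    ("tum bohat ache ho", "NA"),
    ("Ye   video BAKWAS hai", "A"),
    ("", "A"),
    ("12345 ???", "NA"),
]

# Illustrative stop words only (an assumption); a real list must be compiled.
STOP_WORDS = {"hai", "ho", "ka", "ki", "ke", "ye", "tum"}


def normalize(text):
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(text):
    """Drop tokens that appear in STOP_WORDS."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


# i. Missing values: count them, display the count, then drop them.
missing = sum(1 for text, _ in rows if text is None or not text.strip())
print("Missing comments:", missing)
rows = [(text, label) for text, label in rows if text is not None and text.strip()]

# Normalize first so that case and punctuation variants compare equal.
normalized = [(normalize(text), label) for text, label in rows]

# ii. Duplicates: count them, display the count, then keep the first copy.
seen, unique = set(), []
for text, label in normalized:
    if text not in seen:
        seen.add(text)
        unique.append((text, label))
duplicates = len(normalized) - len(unique)
print("Duplicate comments:", duplicates)

# iii. Noise: here, comments with no letters left after normalization.
noisy = sum(1 for text, _ in unique if not text)
print("Noisy comments:", noisy)
unique = [(text, label) for text, label in unique if text]

# Stop-word removal as the last step before feature extraction.
cleaned = [(remove_stop_words(text), label) for text, label in unique]
print(cleaned)
```

Normalizing before de-duplication is a design choice in this sketch: it lets "Tum bohat ache ho!!!" and "tum bohat ache ho" count as the same comment.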
4. Feature Extraction
• After the pre-processing step, the student will apply feature extraction techniques to
convert textual data into a structured format suitable for machine learning models. Possible
techniques include Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words
(BoW), and N-Gram Models (Uni-Gram, Bi-Gram, Tri-Gram, etc.); Word Embeddings
(Word2Vec, FastText, GloVe) can also be applied.
• The student must have a clear understanding of the working principles, advantages, and
limitations of the chosen feature extraction method. It is essential to justify the selection by
explaining why a particular technique was used and how it contributes to improving the
model's performance.
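To make the working principle of one of these options concrete, here is a minimal from-scratch sketch of TF-IDF over uni-grams and bi-grams. It uses the plain textbook weighting (term frequency times log(N / document frequency)); library implementations often apply smoothed variants, so their numbers will differ. The function names and sample comments are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, max_n=2):
    """Map each document to a dict of {term: tf-idf weight}."""
    # Collect uni-grams up to max_n-grams for every document.
    term_lists = []
    for doc in docs:
        tokens = doc.split()
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        term_lists.append(terms)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))

    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical pre-processed comments.
docs = ["bohat acha video", "bakwas video"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In this toy corpus "video" occurs in every document, so its weight is zero, while terms unique to one comment receive a positive weight. That is the property that makes TF-IDF useful for separating classes.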
5. Train & Test Data
• The student will split the dataset into 75% training data and 25% testing data to evaluate
the performance of the machine learning models. To ensure reliable results, the student
can apply randomized splitting to avoid bias and maintain data diversity.
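A randomized 75/25 split might look like the sketch below. The helper name, the fixed seed (used so the split is reproducible), and the generated sample data are assumptions for illustration.

```python
import random


def train_test_split(samples, test_ratio=0.25, seed=42):
    """Shuffle a copy of the samples and split off test_ratio for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled comments.
data = [(f"comment {i}", "A" if i % 2 else "NA") for i in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))  # 15 5
```

Shuffling before splitting keeps the test set from being dominated by whichever comments happened to be collected last.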
6. Machine Learning Techniques
• The student must use at least three different classifiers/models from distinct machine
learning techniques/algorithms. Possible choices include Naïve Bayes (Multinomial,
Bernoulli), Support Vector Machine (SVM) with different kernels (Poly, RBF), Decision Tree,
Random Forest, Logistic Regression, and Ensemble Methods. The selection should be based
on the suitability of the algorithm for text classification tasks.
• Additionally, the student must have a clear understanding of each chosen model, including
its algorithmic working, advantages, limitations, and practical applications. It is essential
that the student can justify their selection by explaining why a particular model was chosen
over others. Furthermore, the student should be proficient in the implementation and
coding of the selected models and be able to analyze their performance effectively.
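As an illustration of the "algorithmic working" the student is expected to understand, the following is a minimal from-scratch sketch of one of the listed options, Multinomial Naïve Bayes with Laplace (add-one) smoothing. It covers only one model, whereas the project requires at least three, and in practice a library implementation would normally be used. The class name and training comments are hypothetical.

```python
import math
from collections import Counter, defaultdict


class SimpleMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training comments with labels A (abusive) and NA (non-abusive).
texts = ["tum pagal ho", "bakwas insan", "bohat acha video", "shukriya bhai"]
labels = ["A", "A", "NA", "NA"]
model = SimpleMultinomialNB().fit(texts, labels)
print(model.predict("pagal insan"), model.predict("acha video"))  # A NA
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.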
7. Confusion Matrix
• The student must generate a separate confusion matrix for each selected classification
model to evaluate its performance. Each matrix should report the counts of True Positives
(TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), and the results
should be analyzed to compare the models' effectiveness in detecting cyber abuse.
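The four confusion-matrix counts can be tallied as in the sketch below, which treats "A" (abusive) as the positive class. That choice, the function name, and the sample labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive="A"):
    """Count TP, TN, FP, and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # abusive comment correctly flagged
            else:
                fp += 1  # non-abusive comment wrongly flagged
        else:
            if actual == positive:
                fn += 1  # abusive comment missed
            else:
                tn += 1  # non-abusive comment correctly passed
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical true labels and model predictions.
y_true = ["A", "A", "NA", "NA", "A", "NA"]
y_pred = ["A", "NA", "NA", "A", "A", "NA"]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```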
8. Accuracy Evaluation
• The student must find the accuracy of all selected machine learning techniques and
compare their performance.
• This project will also determine which machine learning technique is most effective for
detecting cyber abuse.
• In addition to accuracy, the student should evaluate precision, recall, and F1-score for a
more comprehensive analysis.
• The student must visually represent accuracy comparisons using graphs, bar charts, or
other suitable visualizations to highlight differences between models.
• A final analysis should be conducted to explain which model performed best and why,
based on the evaluation metrics.
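The four metrics follow directly from the confusion-matrix counts, as the sketch below shows. The counts are placeholders invented for illustration, not real results, and the text bars are only a stand-in for a proper chart from a plotting library.

```python
def evaluate(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def text_bar(value, width=40):
    """Render a value between 0 and 1 as a bar of '#' characters."""
    return "#" * round(value * width)


# Placeholder counts for illustration only; these are not real results.
placeholder_counts = {
    "model_1": {"tp": 40, "tn": 45, "fp": 5, "fn": 10},
    "model_2": {"tp": 35, "tn": 40, "fp": 10, "fn": 15},
}

scores = {name: evaluate(**c) for name, c in placeholder_counts.items()}
for name, metrics in scores.items():
    print(f"{name:8s} {text_bar(metrics['accuracy'])} {metrics['accuracy']:.2f}")
```

Guarding each division keeps the function usable when a model never predicts the positive class, a case in which precision would otherwise divide by zero.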
9. Web Interface Integration
• After developing the model, the student will integrate a web interface to allow users to test
the model’s performance using real-time comments.
• The interface should provide a text input field where users can enter a comment, and the
system will classify it as Abusive (A) or Non-Abusive (NA).
• The web interface will be developed using Flask or Django, with a simple HTML/CSS
frontend for user interaction.
• The student should ensure that the interface is fully functional, correctly linked to the
trained model, and capable of making real-time predictions.
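The request flow of such an interface (show a form, read the submitted comment, return A or NA) can be sketched without any dependencies. The example below deliberately uses the standard library's wsgiref as a stand-in; the project itself calls for Flask or Django. The keyword-based classify function is a hypothetical placeholder for the trained model's prediction call.

```python
from html import escape
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

# Hypothetical keyword list standing in for the trained model.
ABUSIVE_WORDS = {"pagal", "bakwas"}

FORM = (
    '<form method="post">'
    '<input name="comment" placeholder="Enter a comment">'
    '<button type="submit">Classify</button>'
    "</form>"
)


def classify(comment):
    """Placeholder for the trained model; returns 'A' or 'NA'."""
    words = set(comment.lower().split())
    return "A" if words & ABUSIVE_WORDS else "NA"


def app(environ, start_response):
    """WSGI application: show the form and classify any posted comment."""
    result = ""
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(size).decode("utf-8")
        comment = parse_qs(body).get("comment", [""])[0]
        # Escape user input before echoing it back into the page.
        result = f"<p>{escape(comment)}: {classify(comment)}</p>"
    page = f"<html><body>{FORM}{result}</body></html>".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(page))),
    ])
    return [page]


def run(host="localhost", port=8000):
    """Serve the app locally until interrupted."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling run() starts a local server on port 8000. In a Flask or Django version, the body of app would become a view function, and classify would load the saved model and vectorizer instead of checking keywords.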

Tools/Techniques:
• Anaconda: Python distribution platform for development.
• Jupyter Notebook: For implementing machine learning models.
• Python: Programming language used for data pre-processing, feature extraction, and
model training.
• Machine Learning Algorithms: For training and testing cyber abuse detection.
• Web Interface: Basic HTML/CSS, Flask, or Django.

Prerequisite:
• Knowledge of Artificial Intelligence, Machine Learning, and Natural Language Processing
concepts is required. Students will cover a short course on these concepts alongside the
initial SRS and design documentation, or may consult the links below.
Helping Material:
Python:
https://www.python.org/
https://www.w3schools.com/python/
https://www.tutorialspoint.com/python/index.htm

Feature Extraction Method:
https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be
https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlp-a-beginners-guide-to-understand-natural-language-processing/
http://uc-r.github.io/creating-text-features
Machine Learning Techniques:
https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0
https://towardsdatascience.com/top-10-algorithms-for-machine-learning-beginners-149374935f3c
https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
https://www.youtube.com/watch?v=fG4e4TUrJ3E
https://www.youtube.com/watch?v=7eh4d6sabA0

Dataset:
https://drive.google.com/file/d/1l8Mo22kVQzrucbo2LCwnP74sRZ4Eztb_/view?usp=sharing

Supervisor:
Name: Tayyab Waqar
Email ID: [email protected]
Skype ID: maliktayyab786_1
