Fraud Detection in Apps via Sentiment Analysis
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
(2021-2025) BY
CERTIFICATE
This is to certify that it is a bonafide record of Mini Project work entitled “FRAUD APP
DETECTION USING SENTIMENT ANALYSIS” done by RAMAVATH SRUJANA
(21241A1252), GANDLA SWATHI (22245A1204), CHENNA CHAITANYA (21241A1213) of
B.Tech in the Department of Information Technology, Gokaraju Rangaraju Institute of
Engineering and Technology during the period 2021-2025 in the partial fulfillment of the
requirements for the award of degree of BACHELOR OF TECHNOLOGY IN INFORMATION
TECHNOLOGY from GRIET, Hyderabad.
(Project External)
ACKNOWLEDGEMENT
We wish to express our gratitude to Dr. Y J Nagendra Kumar, HOD IT, our Project
Coordinators Mr. G.Vijendar Reddy and Ms. A. Pavitra for their constant support
during the project.
We express our sincere thanks to Dr. Jandhyala N Murthy, Director, GRIET, and
Dr. J. Praveen, Principal, GRIET, for providing us the conducive environment for
carrying through our academic schedules and project with ease.
We also take this opportunity to convey our sincere thanks to the teaching and non-teaching
staff of GRIET College, Hyderabad.
Contact No: 7396047501
DECLARATION
This is to certify that the mini-project entitled “FRAUD APP DETECTION USING
SENTIMENT ANALYSIS” is a bonafide work done by us in partial fulfillment of the
requirements for the award of the degree BACHELOR OF TECHNOLOGY IN
INFORMATION TECHNOLOGY from Gokaraju Rangaraju Institute of Engineering
and Technology, Hyderabad.
We also declare that this project is a result of our own effort and has not been copied or
imitated from any source. Citations from any websites, books and paper publications are
mentioned in the Bibliography.
This work was not submitted earlier at any other University or Institute for the award of
any degree.
TABLE OF CONTENTS
Certificates
Contents
Abstract
1 INTRODUCTION
1.1 Introduction to project
1.2 Existing System
1.3 Proposed System
2 REQUIREMENT ENGINEERING
2.1 Hardware Requirements
2.2 Software Requirements
3 LITERATURE SURVEY
4 TECHNOLOGY
5 DESIGN REQUIREMENT ENGINEERING
5.1 UML Diagrams
5.2 Use-Case Diagram
5.3 Class Diagram
5.4 Activity Diagram
5.5 Deployment Diagram
5.6 Architecture
6 IMPLEMENTATION
7 SOFTWARE TESTING
7.1 Unit Testing
7.2 Integration Testing
7.3 Acceptance Testing
7.4 Testing on our system
8 RESULTS
9 CONCLUSION AND FUTURE ENHANCEMENTS
10 BIBLIOGRAPHY
11 LIST OF DIAGRAMS
2 Class Diagram
3 Activity Diagram
4 Deployment Diagram
5 Architecture
ABSTRACT
With the rapid increase in mobile phone users, people have become more and more dependent on
mobile applications, which makes app security a sensitive area. Users frequently cannot spot
fraudulent or harmful apps before installing them, which poses a critical threat. The goal of this
project is to identify fraudulent apps by analysing user reviews with sentiment analysis.
Sentiment analysis estimates the emotional tone of the user's feedback, labelling reviews as
positive, negative, or neutral by means of Natural Language Processing (NLP) techniques such as
tokenization, lemmatization, and stopword removal. The preprocessed reviews are then mapped to
feature vectors by the Term Frequency-Inverse Document Frequency (TF-IDF) approach, and these
vectors are fed into machine learning models.
The system uses two classifiers, Logistic Regression and Random Forest, to learn whether an app is
fraudulent or legitimate from the sentiment of its reviews. Reviews are extracted from the Google
Play Store by means of the Google Play Scraper library, and the models are evaluated with respect
to accuracy and classification performance. The user interface lets the user enter the URL of an
app from the Google Play Store; the system then retrieves the app's reviews, preprocesses them, and
classifies both the individual reviews and the app. This approach helps users make better-informed
decisions before downloading potentially risky applications, thereby improving mobile security.
The objective of the project is to provide users with an effective tool to evaluate apps before
downloading them, thus decreasing the number of downloads of potentially dangerous applications.
Models are assessed for accuracy, and the application lets the user enter the URL of an app's
Google Play Store listing, retrieves the reviews for that listing, and reports the predicted
trustworthiness. By combining sentiment analysis and machine learning, the system offers an
effective way to increase mobile app security and to give users a more secure digital experience.
1. INTRODUCTION
Mobile phones have gradually become essential to users as technology has advanced. A growing
number of mobile applications are being developed for various platforms, including the popular
Android and iOS platforms. The app market has become one of the major areas of interest in the
business intelligence industry because it grows steadily in sales, usage, and development. This
growth also makes the market more competitive. Companies and application developers compete to
prove the quality of their products and put a great deal of work into attracting customers in
order to sustain their future expansion.
Mobile apps are one of the most significant trends in the modern world, since our devices are no
longer accessories but tools needed for everyday functioning. On the other hand, the vast
expansion of the app world has also increased the risk that so-called fraud applications cheat
naïve users. Such apps may engage in deceptions such as data theft, excessive advertisement
display, or functionality that does not match what was promised.
To address these risks, this project applies machine learning algorithms to identify fraudulent
applications on the Google Play Store. Sentiment analysis of the reviews helps identify negative
sentiment and the activity of fake users, which makes the solution scalable for detecting apps
that pose a threat to real users.
This project uses NLP and machine learning to classify application reviews as either
“Fraudulent” or “Legitimate”. With this approach, the system offers a way to help users,
developers, and app store moderators identify apps that may infringe user security or quality
standards.
As the applications of ML and NLP evolve, it becomes possible to develop methods that process
reviews to estimate the likelihood that an app is suspicious. By training models on user reviews
marked “Fraudulent” or “Legitimate,” app quality can be evaluated without human intervention.
User Safety: The system improves user security by helping users avoid apps that carry risks.
Logistic Regression:
Logistic Regression is one of the basic approaches in machine learning, particularly for binary
classification. The name is misleading, because it is in fact a classification model and not a
regression one. It predicts the probability that an input instance belongs to a particular class;
therefore, it is useful where the dependent variable is categorical. Logistic Regression is a
machine learning algorithm derived from statistics that works well on linearly separable data and
is easy to interpret and implement.
The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving an exact value of
0 or 1, the model gives a probability between 0 and 1.
In Logistic Regression, instead of fitting a regression line, we fit an S-shaped (sigmoid) curve
whose output always lies between 0 and 1.
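The S-shaped curve described above can be illustrated with a minimal sketch. The weights, bias, and input values below are hypothetical and chosen only for illustration; they are not taken from the project's trained model.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one input under a linear model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and input, chosen only for illustration.
p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(p, label)
```

The probability is turned into a class label by comparing it with a threshold, commonly 0.5.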
On the basis of the categories of the dependent variable, Logistic Regression can be classified
into three types:
Binomial: In binomial Logistic Regression, the dependent variable can take only two possible
values, such as 0 or 1, Pass or Fail.
Ordinal: In ordinal Logistic Regression, the dependent variable can take three or more ordered
values, such as Low, Medium, or High.
Multinomial: When the dependent variable has more than two unordered categories, multinomial
logistic regression, commonly called softmax regression, is used. In contrast to binary logistic
regression, which handles two-class outcomes, multinomial logistic regression handles several
classes. To estimate the probability of an observation belonging to each class it uses the softmax
function, which normalizes the scores so that the probabilities sum to 1 over all the classes.
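As a small illustration of the softmax normalization mentioned above, the following sketch converts a list of class scores into probabilities. The scores are hypothetical values used only to show the behaviour.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities summing to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))  # index of the most probable class
print(probs, predicted_class)
```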
4
Random Forest:
Random Forest is a versatile and powerful algorithm that is applicable to both classification and
regression tasks. It is an ensemble method based on bagging of decision trees: several decision
trees are constructed and combined to achieve higher accuracy and reduced variance. Unlike a
single decision tree, which easily overfits, Random Forest reduces this risk by averaging the
results of several trees, which makes the model more accurate.
Specifically, Random Forest uses a technique known as Bootstrap Aggregating, or Bagging. Bootstrap
samples are random samples drawn from the original data set with replacement, and a separate
decision tree is grown on each bootstrap sample. In addition, at every node in a tree, only a
random subset of features is considered for the split, which introduces further variation among
the trees.
This random selection also reduces the correlation between the trees and thus the variance,
leading to better performance of the model.
In addition, Random Forest works well on difficult high-dimensional learning problems and
real-world applications with many features. It also provides estimates of feature importance,
which enable analysts to determine which variables contribute most to predictions. Besides, Random
Forest is relatively robust to outliers and noisy data, since the algorithm combines the output of
many trees, so the effect of any individual error is relatively small.
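The two mechanics behind bagging, sampling with replacement and combining tree outputs by voting, can be sketched in a few lines. This is only an illustration of the idea with made-up data and predictions; in practice a library implementation such as scikit-learn's RandomForestClassifier performs these steps internally.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))

def majority_vote(votes):
    """Return the most common label among the individual tree predictions."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))  # stand-in for row indices of a training set
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # one sample per tree

# Hypothetical predictions from three trees for a single review.
final_label = majority_vote(["fraudulent", "legitimate", "fraudulent"])
print(samples, final_label)
```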
Fig 1.3 Random Forest
Existing systems for detecting fraudulent applications rely mainly on manual reviews, user
reports, and basic rule-based mechanisms. These methods are often unproductive, time-consuming,
and prone to human error. Moreover, traditional techniques focus heavily on metadata analysis,
such as download counts, ratings, and permissions, which can be manipulated by fraudsters.
End-user reviews are a rich source of content, but they are often analyzed without advanced
techniques such as machine learning. This results in slow detection and an inability to adapt to
evolving fraudulent schemes. Hence, many fraud apps remain undetected and pose risks to end users
and app ecosystems.
The main current approaches to identifying fraudulent apps include heuristic-based methods,
signature-based detection, machine learning models, and user reporting. Heuristic systems assess
the application against fixed rules, such as excessive privileges or erratic behaviour of the
application, whereas signature-based techniques look for application characteristics similar to
known malware signatures. Machine learning models examine the reviews of the app, how it behaves,
and the permissions it requires to predict fraud, and crowd-sourced feedback or hand-vetted
reviews help identify dubious apps. However, the existing systems have issues such as reliance on
continuous updates, inability to recognize new threats, slow manual review, and lack of
scalability. This project builds on those methods by focusing on machine learning and sentiment
analysis of user reviews, which allows real-time, large-scale fraud detection. The project
provides an efficient, fast sentiment-based classification to detect fraudulent apps without the
limitations of the traditional systems. With models such as Logistic Regression and Random Forest,
the proposed project can give an efficient solution to the problem of detecting fraud apps from
large volumes of data.
The proposed model can be described as an end-to-end machine learning framework that goes beyond a
standard predictive system and can be regarded as a knowledge-based system. It starts with a
thorough treatment of data quality issues, including inconsistencies, missing values, and noise
within the data, so that the data is appropriate for training the models. The underlying
architecture of the system is based on binary classification algorithms such as logistic
regression, which learn the relationship between the input variables and a target variable that is
either 0 or 1. The predictive power of the system can be further enhanced with additional
techniques such as feature selection, hyperparameter optimization, and cross-validation.
Furthermore, the system supports interpretability by showing how important each attribute is to a
model and why predictions were made, which makes the system easier to trust and operate.
2. REQUIREMENT ENGINEERING
2.1 Hardware Requirements
• Memory – 8GB RAM
2.2 Software Requirements
• Windows 10
• Python
• Python Libraries
1. Pandas – For data manipulation and analysis
2. NumPy – For numerical operations
3. Matplotlib – For data visualization
4. Scikit-learn – For implementing machine learning algorithms and preprocessing
5. Google-play-scraper – For fetching Google Play reviews
3. LITERATURE SURVEY
The paper “Fraud App Detection using Sentiment Analysis” by Jyoti Singh, Lakshita Suthar, Diksha
Khabya, Simmi Pachori, Nikita Somani, and Dr. Mayank Patel presents a method for identifying fraud
applications using a support vector machine and sentiment analysis. It is supported by an
architecture diagram that outlines the algorithm and the processes implemented in the project.
Data is collected and stored in a database and is then evaluated with the defined algorithms. The
approach is distinctive in that the evidence is aggregated into a single result. The proposed
framework is scalable and can be extended with evidence generated from other domains for review
fraud detection. The experimental results showed the effectiveness of the proposed system, the
scalability of the detection algorithm, and some regularity in ranking fraud activities.[1]
The paper “Fraud App Detection using Sentiment Analysis” by Mr. Revgade Rohit, Mr. Salunkhe
Pramod, Mr. Waychale Aniket, Mr. Uday Bajirao, and Prof. Mahesh P. Bhandakkar develops a system
for detecting fraud in mobile application ratings. In particular, it first shows that ranking
fraud takes place in leading sessions and offers a way to mine leading sessions from an app's past
ranking data. Then, in order to detect ranking fraud, it identifies evidence based on ranking,
evidence based on rating, and evidence based on review. A distinct advantage of this technique is
that all the evidence can be represented using sentiment analysis, making it simple to extend with
additional evidence derived from domain expertise in order to identify ranking fraud. The efficacy
of the proposed strategy was demonstrated by experimental data. Fraud apps were detected by
identifying evidence in the reviews.[2]
This work explored various methodologies for detecting fraudulent apps and proposed using
sentiment analysis, which is reported to offer greater accuracy and speed through lexicon-based
analysis. The Naive Bayes classifier, especially when the independence assumption holds, can
outperform models like logistic regression and requires less training data. Naive Bayes can be
built using Gaussian, Multinomial, or Bernoulli distributions and is computationally efficient,
making it well suited to text classification tasks.[3]
This work explored various methodologies for detecting fraudulent apps and proposed using
sentiment analysis, which is reported to offer greater accuracy and speed through lexicon-based
analysis. The Naive Bayes classifier, especially when the independence assumption holds, can
outperform models like logistic regression and requires less training data. It is fast and
suitable for real-time predictions, multiclass classification, recommendation systems, and
sentiment analysis. Naive Bayes can be built using Gaussian, Multinomial, or Bernoulli
distributions and is computationally efficient, making it well suited to text classification
tasks.[4]
This study developed an enhanced sentiment classification method for anomaly detection in social
media analysis, demonstrated using tweet data. The method successfully identified and interpreted
anomaly patterns, showing its effectiveness and superiority. Validated by high agreement with
human annotators, the research introduces a robust sentiment analysis technique to detect anomalies
and pattern changes over time. It offers valuable insights for businesses, political figures, and
organizations to improve services, understand polling results, and refine brand strategies.[5]
This study developed a sentiment classification method for detecting anomalies in social media,
particularly tweets, effectively identifying unusual patterns. It provides valuable insights for
businesses and organizations by detecting inconsistencies and aiding in customer behavior
analysis. For fraud app detection using sentiment analysis, the process includes gathering user
feedback, preprocessing data, applying sentiment models, and building a fraud detection algorithm.
Using a Java Spring Boot backend with React enables real-time monitoring of fraud, with
continuous improvement crucial for adapting to new patterns.[6]
The rapid growth of technology has led to an increase in fraudulent apps on Google Play, posing a
threat to user privacy. A model was developed to detect such apps using parameters like review
scores, in-app purchases, and content additions. The decision tree algorithm proved 85% accurate
in detecting fraud. This framework is scalable, can rate fraudulent apps, and helps users make
informed decisions while holding developers accountable. It offers a promising solution for app
security, and further improvements could enhance its role in protecting user privacy.[7]
4. TECHNOLOGY
Python has evolved steadily and has become increasingly well equipped for statistical analysis. It
is a high-level, general-purpose language that emphasizes both efficiency and code readability.
Its design favors readable programs, with a simple syntax that suits beginners and lets
programmers express ideas in few lines of code, notably through the use of indentation to delimit
blocks. Notable features of this high-level language are its dynamic type system and automatic
memory management.
Python is used in many application domains and is present in almost every emerging field. It is
one of the fastest-growing programming languages and can be used to create many types of
applications.
For data science, Python is a flexible, open-source language. It comes with functionality for
mathematical and scientific operations. It is popular for this purpose because its syntax is easy
to learn and it has a large collection of libraries. Writing code in Python can also take less
time than performing the same analysis through a graphical user interface.
4.3.1 PANDAS
It is a library in Python that is used for analyzing, cleaning, exploring, and manipulating data.
In general, a dataset contains both valuable and unnecessary information. Pandas helps make the
data readable and relevant.
4.3.2 NUMPY
It is a library in Python that is used for numerical computation, providing multidimensional
arrays and mathematical functions. It is considered easy to use and is highly optimized, and it
underlies many data science, scientific computing, and machine learning applications and
libraries.
4.3.3 MATPLOTLIB
It is a library used for plotting graphs in Python. It is built on NumPy arrays. We can plot many
kinds of graphs, from basic plots to bar graphs, histograms, scatter plots, and many more.
4.3.4 SCIKIT-LEARN
It is a Python library for machine learning. Machine learning and statistical modeling tasks,
including classification, regression, and clustering, are carried out with Scikit-learn tools.
The Google Play Scraper is a Python library that is instrumental in extracting data from the
Google Play Store. It gives a simple way of getting app descriptions, user reviews, ratings, and
other details from the Google Play Store, without requiring access to any official programming
interface offered by the Play Store. This capability makes it efficient to acquire data for
research or analysis and to build models based on app reviews or other app data.
4.3.5 CSV
It is a kind of file that stores tabular records, like a spreadsheet or a database table. Each
record has one or more fields, which are separated by commas. We use Python's built-in csv module
to work with CSV files.
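A minimal sketch of reading review data with the built-in csv module is shown below. The column names Review and Label follow the dataset described in this report; the sample rows and the load_reviews helper are hypothetical and stand in for the real file.

```python
import csv
import io

# Hypothetical sample data standing in for the project's review file.
SAMPLE_CSV = """Review,Label
"Great app, works as described",positive
"Stole my data, do not install",negative
"""

def load_reviews(handle):
    """Read (review, label) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)  # uses the header row for field names
    return [(row["Review"], row["Label"]) for row in reader]

rows = load_reviews(io.StringIO(SAMPLE_CSV))
print(rows)
```

With a real file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8").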
4.4 Workflow:
User Input: The user enters the URL of the app on the Google Play Store.
Fetch Reviews: The system fetches the reviews of the app by using Google Play Scraper.
Data Preprocessing: The reviews are cleaned and preprocessed (tokenization, stop word removal,
lemmatization).
Feature Extraction: Reviews are converted into numerical feature vectors using TF-IDF.
Model Prediction: The trained Logistic Regression and Random Forest models predict whether each
review is legitimate or fraudulent.
Result Display: The result is displayed to the user, consisting of the prediction for each
individual review and a summary of the classification results.
Concept of UML
UML stands for Unified Modeling Language. It is a general-purpose, developmental modeling language
used for the analysis, design, and implementation of software. UML serves a dual need: to offer a
simple and consistent notation, and to represent the inherent architectural features of a software
system. Over the years, UML has become a standard framework for building object-oriented software
in the design phase. UML blueprints are employed by business users, developers, and anybody else
who requires modeling. UML should be clearly distinguished from a development method or a
programming language; it is neither.
UML DIAGRAMS
UML diagrams are schematics that support use cases, the design of both static and dynamic
structures, and documentation of a variety of aspects of the system. There are structural and
behavioral diagrams. Class, object, component, deployment, composite structure, package, and
profile diagrams are examples of structural (static view) diagrams, which help describe the
architecture and dependencies of the system. Behavioral diagrams, namely use case, sequence,
activity, state machine, communication, interaction overview, and timing diagrams, depict the
dynamics and interactions of the system over time.
Combined, these two groups of diagrams offer a comprehensive depiction of the system in the
design, analysis, and communication phases of development, which helps clarify any hazy areas and
supports a smooth development process. UML is linked with object-oriented design and analysis.
UML uses elements and forms associations between them to build diagrams. Diagrams in UML can be
broadly classified as:
Fig 5.1 Concepts of UML
5.1 Use case Diagram
In UML, a use case diagram illustrates the functional requirements of a system from the point of
view of its users, known as actors. It includes actors, the persons or other systems interacting
with the system, and use cases, the concrete actions or services the system offers. Actors engage
with the system through these use cases, and the associations represent the relations and
dependencies that exist between them. Use case diagrams are especially useful for defining the
system requirements, reviewing them with stakeholders, and guiding the implementation of software
systems.
5.2 Class Diagram
A class diagram in UML expresses the static structure of a system by showing its classes,
attributes, methods, and relationships. Classes are represented as boxes with three compartments:
the class name, its attributes, and its methods. Class diagrams help in modeling the data and
behavior of a system by offering a clear representation of its layout and the interconnections
between its classes during the implementation and documentation of the software.
5.3 Activity diagram
An activity diagram in UML shows the flow of work or actions in a system. It illustrates the
sequence of actions, decisions, and concurrent actions contained in a certain process or
operation. Activities are represented by nodes, and the movement from one activity to another,
shown by arrows, is associated with some condition or event. Activity diagrams are particularly
useful where business processes, software capabilities, and procedural flows need to be depicted,
since they systematically outline and explain the functioning of a system or process.
5.4 Deployment Diagram
5.5 Architecture
The architecture of this project is designed to fetch app reviews from the Google Play Store,
process them, and use machine learning models to classify apps as fraudulent or legitimate based
on sentiment analysis of user reviews. Below is a breakdown of the architecture components.
The Fraud App Detection project is designed with a Data Collection Layer as its starting point.
This layer employs the Google Play Scraper to fetch app reviews and details from the Google Play
Store. Users simply input an app URL, and the scraper fetches the associated data. This includes
review text, ratings, information about reviewers, and other details such as the name and type of
the app. The reviews() or reviews_all() functions get user reviews, while the app() function can
get app-specific details if needed. This layer forms the base for processing and examining app
reviews. It makes later steps possible, such as determining how people feel about the app and
spotting fraud.
The Data Preprocessing Layer cleans up the raw application review text so that it can be fed into
machine learning models. It begins with raw review text and cleans it by removing non-alphanumeric
characters and converting all text to lowercase. The text is then tokenized into words, common
stopwords are removed while critical sentiment-bearing words such as "not" are retained, and
lemmatization reduces words to their base forms. Overall, preprocessing ensures that the data is
clean and well structured for analysis. For these purposes, the layer uses libraries like NLTK for
tokenization, stopword removal, and lemmatization, while the remaining custom cleaning and
preprocessing is done with Python scripts.
Feature Extraction Layer:
The Feature Extraction Layer maps the preprocessed review text to numerical representations ready
for use by machine learning models. By means of Term Frequency-Inverse Document Frequency (TF-IDF)
vectorization, each review is represented as a row of a sparse matrix of numerical features that
weight the importance of words and phrases relative to the whole corpus. The ngram_range=(1, 3)
parameter ensures that the model uses unigrams, bigrams, and trigrams, and hence is capable of
learning from single words as well as words in context. This step is carried out using
TfidfVectorizer from the scikit-learn library in order to obtain feature vectors that capture both
word-level and local phrase-level information in the reviews.
The Model Training Layer trains machine learning models to classify application reviews as
legitimate or fraudulent. The inputs to this layer are the preprocessed reviews and their labels,
which are split into training and test sets using the train_test_split function. Both a Logistic
Regression model and a Random Forest Classifier are trained on the training set, as both are well
suited to classification tasks. The performance of each model is then evaluated on the test set
using an accuracy score and a classification report. The models in this layer are built and tested
with the Logistic Regression and Random Forest implementations in the scikit-learn library.
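The training and evaluation steps can be sketched as follows. The toy reviews, the 75/25 split, the random_state values, and the hyperparameters are all assumptions made for illustration; the project's actual dataset and settings are not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the project's dataset is not reproduced here.
texts = [
    "great app love it", "works well very useful", "excellent and fast",
    "nice design easy to use", "helpful app recommend it", "good features works fine",
    "scam app stole money", "fake app not working", "fraud do not install",
    "steals data terrible", "not safe full of ads", "waste of money fake",
]
labels = ["legitimate"] * 6 + ["fraudulent"] * 6

X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit TF-IDF on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = accuracy_score(y_test, predictions)
    print(name, scores[name])
    print(classification_report(y_test, predictions, zero_division=0))
```

Fitting the vectorizer on the training split alone keeps vocabulary and IDF statistics from the test reviews out of training, so the reported accuracy is a fairer estimate.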
Prediction Layer:
The Prediction Layer is a pipeline that classifies new or unseen app reviews as fraudulent or
legitimate. The reviews for a given new app URL are retrieved from the Google Play Store,
preprocessed, and vectorized with the fitted TfidfVectorizer. These transformed reviews are then
fed into the trained Logistic Regression and Random Forest models to predict their class. Results
are summarized in a structured DataFrame containing each review and the corresponding label
predicted by each of the two models. This layer is intended to make predictions efficiently and
accurately, enabling near real-time appraisal of app reviews.
Output Layer:
The Output Layer presents the results of the Fraud App Detection system in the form of predicted
labels (fraudulent/legitimate) for each review of a particular app. Based on the labels predicted
by the trained models, this layer offers a convenient interface that displays each review along
with its classification by the Logistic Regression and Random Forest models respectively. Further,
it computes summary statistics that show the fraction of fraudulent versus legitimate reviews,
giving the user an idea of the review authenticity of the app. In this layer, the output is made
interpretable and informative, which supports decision making based on the predictions.
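The summary statistics described above amount to counting predicted labels. A minimal sketch is given below; the summarize helper and the prediction lists are hypothetical and serve only to show the computation.

```python
from collections import Counter

def summarize(predictions):
    """Return the fraction of reviews assigned to each predicted label."""
    counts = Counter(predictions)
    total = len(predictions)
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

# Hypothetical per-review predictions from the two models.
lr_predictions = ["legitimate", "fraudulent", "legitimate", "legitimate"]
rf_predictions = ["legitimate", "fraudulent", "fraudulent", "legitimate"]

lr_summary = summarize(lr_predictions)
rf_summary = summarize(rf_predictions)
print(lr_summary)
print(rf_summary)
```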
Fig 5.6 Architecture
6. IMPLEMENTATION
The dataset consists of 5,017 entries with two columns: Review and Label. The Review column
contains text-based user feedback from mobile app users, both positive and negative (critical).
The Label column assigns each review to one of two categories, positive (favorable sentiment) or
negative. The data is clean and has no missing values, so it is ready for analysis.
Model Construction
Importing Libraries:
In this step, the required libraries are imported to perform a wide range of functions including
web scraping, natural language processing (NLP), data manipulation, and machine learning. The
google-play-scraper library is used to download reviews from the Google Play Store, and nltk is
used for text preprocessing operations (e.g., tokenization, lemmatization, stopword removal).
Besides these, the pandas, numpy, and scikit-learn libraries are imported for data preprocessing,
feature extraction, model training, and evaluation. With this configuration the environment is
prepared to handle the entire workflow, from data extraction and cleaning to model building and
testing.
Preprocessing Function:
The preprocessing function prepares the raw text for further analysis. It carries out several
text-cleaning operations, including tokenization of each review, normalization to lowercase, and
removal of non-alphanumeric characters. It also removes stopwords (such as "the" and "and") but
retains the word "not" so that negations are preserved for sentiment analysis. Lemmatization is then
applied to reduce words to their base form (e.g., "running" becomes "run"). This stage ensures that
the input data is homogeneous and normalized, which is key to effective feature extraction and
classification.
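The steps described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual function: the stopword set is a small hand-picked sample (the report uses NLTK's list), and the `lemmatize` argument is a hypothetical hook, identity by default, where a real lemmatizer such as NLTK's `WordNetLemmatizer().lemmatize` could be passed in.

```python
import re

# Small illustrative stopword sample (assumption); the report uses NLTK's list.
# "not" is deliberately absent so that negations survive.
STOPWORDS = {"the", "and", "a", "an", "is", "it", "this", "to", "of", "in", "was"}


def preprocess(text, lemmatize=lambda word: word):
    """Lowercase, strip non-alphanumerics, tokenize, drop stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove non-alphanumeric characters
    tokens = text.split()                      # simple whitespace tokenization
    kept = [t for t in tokens if t not in STOPWORDS]
    return " ".join(lemmatize(t) for t in kept)


print(preprocess("This app is NOT working, and it crashes!"))
# app not working crashes
```

Keeping "not" in the output matters because "not working" and "working" carry opposite sentiment.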
Fetching Reviews:
At this stage, the fetch_reviews function uses the google-play-scraper library to download app
reviews from the Google Play Store. Given a URL, the function extracts the app ID and retrieves the
latest 100 reviews. When no reviews are found, an empty DataFrame is returned. The purpose of
this function is to enable the collection of real-time data, which is essential for evaluating the
quality of service and for classifying reviews as positive or negative.
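A minimal sketch of this fetching step is given below, under several assumptions. The app ID is taken from the `id` query parameter of the URL; the `fetch` parameter is a hypothetical injection point added here so the function can be exercised without network access; and the sketch returns a plain list of review texts, which could be wrapped as `pandas.DataFrame({"Review": texts})` to match the DataFrame described above. The package name `com.example.app` is a placeholder.

```python
from urllib.parse import parse_qs, urlparse


def extract_app_id(url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    ids = parse_qs(urlparse(url).query).get("id")
    return ids[0] if ids else None


def default_fetch(app_id, count):
    """Fetch the newest reviews with google-play-scraper (imported lazily)."""
    from google_play_scraper import Sort, reviews

    result, _token = reviews(app_id, lang="en", country="us",
                             sort=Sort.NEWEST, count=count)
    return result


def fetch_reviews(url, count=100, fetch=default_fetch):
    """Return a list of review texts; empty if the URL or app yields nothing."""
    app_id = extract_app_id(url)
    if app_id is None:
        return []
    try:
        raw = fetch(app_id, count)
    except Exception:
        # Assumption: any scraper failure (e.g. unknown app) means "no reviews".
        return []
    return [r["content"] for r in raw if r.get("content")]


# Offline usage with a stub in place of the real scraper call.
stub = lambda app_id, count: [{"content": "Great app"}, {"content": None}]
url = "https://play.google.com/store/apps/details?id=com.example.app"
print(fetch_reviews(url, fetch=stub))
```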
Fig 6.4 Fetching Reviews
Loading Dataset:
The dataset of app reviews and their labels (e.g., legitimate or fraudulent) is loaded into the
system. The dataset is read from a CSV file, and the preprocessing function defined earlier
(tokenization, lemmatization, stopword elimination) is applied to each review, so that the input
text is cleaned and consistently prepared for model training.
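The loading step might look like the sketch below. Because the report does not give the CSV path, an in-memory string stands in for the file, and `clean` is only a lowercase placeholder for the full preprocessing function; the `Cleaned` column name is likewise an assumption.

```python
import io

import pandas as pd

# In-memory stand-in for the project's CSV file (its real path is not given).
csv_text = "Review,Label\nGreat app!,positive\nTotal SCAM,negative\n"


def clean(text):
    # Placeholder for the full preprocessing function: lowercase only.
    return text.lower()


df = pd.read_csv(io.StringIO(csv_text))
df = df.dropna(subset=["Review", "Label"])  # guard against missing values
df["Cleaned"] = df["Review"].apply(clean)   # preprocess every review
print(df)
```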
Fig 6.5 Loading Dataset
Feature Extraction:
Feature extraction transforms raw text into a numerical format that machine learning algorithms can
process. In this work, scikit-learn's TfidfVectorizer converts the preprocessed reviews into term
frequency-inverse document frequency (TF-IDF) vectors. This technique weights each word by its
importance relative to the whole corpus of reviews and also builds n-grams (word sequences) to
capture context and relationships between words. The resulting feature vectors are then passed to the
machine learning models.
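As an illustration of this step, the sketch below vectorizes three toy reviews. The `ngram_range=(1, 2)` setting (unigrams plus bigrams) is an assumption, since the report does not state which n-gram range was used, and the sample reviews are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-preprocessed reviews (invented for illustration).
reviews = ["app not working", "great app love it", "not great waste of time"]

# ngram_range=(1, 2) is an assumed setting: single words plus word pairs.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)  # sparse matrix, one row per review

print(X.shape[0])                               # number of reviews
print("not working" in vectorizer.vocabulary_)  # bigram captured as a feature
```

Including bigrams lets the models see "not working" as a single feature instead of two unrelated words.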
Train-Test Split:
To assess the performance of the machine learning models, the dataset is split into a training set
and a testing set. The training set is used to build the models, and the testing set is used to assess
their performance on unseen data. The split is performed with scikit-learn's train_test_split
function, with 90% of the data used for training and 10% for testing. Evaluating on held-out data in
this way makes overfitting visible and gives an estimate of the models' generalization capability.
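The 90/10 split can be sketched as below on placeholder data of the same size as the dataset (5,017 rows). The `random_state` and `stratify` arguments are assumptions added for reproducibility and balanced classes; the report does not say whether they were used.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (5,017).
reviews = [f"review {i}" for i in range(5017)]
labels = ["positive" if i % 2 == 0 else "negative" for i in range(5017)]

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels,
    test_size=0.1,      # 10% held out for testing, as described above
    random_state=42,    # assumed seed, for a reproducible split
    stratify=labels,    # assumed: keep class proportions equal in both sets
)
print(len(X_train), len(X_test))  # 4515 502
```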
Training Models:
This phase involves training two machine learning models: Logistic Regression and Random
Forest. Logistic Regression is trained with a maximum of 1000 iterations, while the Random Forest
classifier is trained with 100 trees to improve accuracy and reduce overfitting. Both models are fitted
on the training set so that they can capture the patterns that distinguish fraudulent from bona fide
reviews. These models are then used to predict the labels of the reviews in the test set.
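A minimal sketch of this training step follows, using the two hyperparameters named above (`max_iter=1000` and `n_estimators=100`). The six training reviews are invented toy data, and `random_state=42` is an assumption added so the forest is reproducible.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (invented for illustration).
train_texts = ["great app love it", "works well very useful", "excellent and fast",
               "scam app steals money", "fake app not working", "fraud do not install"]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

log_reg = LogisticRegression(max_iter=1000)  # up to 1000 solver iterations
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees
log_reg.fit(X_train, train_labels)
random_forest.fit(X_train, train_labels)

# New text must go through the same fitted vectorizer before prediction.
X_new = vectorizer.transform(["fake scam app"])
print(log_reg.predict(X_new)[0], random_forest.predict(X_new)[0])
```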
Predictions and Evaluation:
After the models are trained, predictions are made on the test set. The accuracy of each model is
measured with the accuracy_score function, and a classification report is produced to examine
precision, recall, and F1-score. This stage shows whether the models are working correctly and
where they can be refined, for example by reducing false positives or false negatives.
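The evaluation calls can be illustrated on a small hand-made example. The label lists below are invented, not results from the project; the confusion matrix is an addition here (not mentioned in the report) to show where false positives and false negatives appear.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented ground truth and predictions for eight reviews.
y_true = ["negative", "negative", "negative", "positive",
          "positive", "positive", "positive", "positive"]
y_pred = ["negative", "negative", "positive", "positive",
          "positive", "positive", "positive", "negative"]

print(accuracy_score(y_true, y_pred))         # 0.75 (6 of 8 correct)
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["negative", "positive"]))
# [[2 1]
#  [1 4]]
```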
At this stage, the predict_app_reviews function is defined to predict whether the reviews of an app
on the Google Play Store are genuine or fake. The function first validates the app URL, then
downloads the user reviews and preprocesses them. Next, the Logistic Regression and Random
Forest models classify the sentiment of each review. The predictions are converted into
readable labels (Fraudulent or Legitimate) and returned in a DataFrame. This function makes it
possible to carry out real-time sentiment analysis of app reviews pulled from the Google Play Store.
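The classification and labelling part of such a function might be sketched as below. This is not the project's code: the signature is hypothetical, URL validation and scraping are left out, the models are fitted on invented toy data, and the mapping of the negative class to "Fraudulent" and the positive class to "Legitimate" is an assumption.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed mapping from sentiment class to display label.
READABLE = {"negative": "Fraudulent", "positive": "Legitimate"}


def predict_app_reviews(review_texts, vectorizer, models):
    """Classify already-preprocessed reviews with each fitted model."""
    if not review_texts:
        return pd.DataFrame(columns=["Review", *models])
    X = vectorizer.transform(review_texts)
    table = {"Review": list(review_texts)}
    for name, model in models.items():
        table[name] = [READABLE[p] for p in model.predict(X)]
    return pd.DataFrame(table)


# Toy models so the sketch runs end to end (invented data).
texts = ["great app love it", "works well", "scam app steals money", "fake not working"]
labels = ["positive", "positive", "negative", "negative"]
vec = TfidfVectorizer().fit(texts)
X_fit = vec.transform(texts)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_fit, labels),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, labels),
}

result = predict_app_reviews(["great app", "fake scam"], vec, models)
print(result)
```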
Input and Displaying Results:
The last stage is user interaction: the user is asked to enter the Google Play Store URL of an app.
The app's reviews are fetched and passed to the trained models. The output, which includes the
predicted sentiment for each review, is presented as easily scannable text. A summary of the
reviews (Fraudulent vs. Legitimate) is also shown for both the Logistic Regression and Random
Forest models. This stage gives users an overall picture of the sentiment analysis, enabling them to
judge the trustworthiness of an app's reviews quickly.
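The per-model summary described above could be computed as in this sketch. The helper name `summarize` and the prediction lists are hypothetical; only the two label names come from the report.

```python
from collections import Counter


def summarize(labels):
    """Return {label: (count, fraction)} for one model's predictions."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: (n, n / total) for label, n in counts.items()}


# Invented predictions for four reviews of one app.
predictions = {
    "Logistic Regression": ["Legitimate", "Legitimate", "Fraudulent", "Legitimate"],
    "Random Forest": ["Legitimate", "Fraudulent", "Fraudulent", "Legitimate"],
}

for model_name, labels in predictions.items():
    for label, (n, frac) in sorted(summarize(labels).items()):
        print(f"{model_name}: {label} = {n} ({frac:.0%})")
```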
Test Cases:
Test Case 1: Valid Google Play Store URL
This test checks whether the system can properly parse and process a valid Google Play Store URL.
It verifies that the reviews are collected correctly, that they are preprocessed according to the
designed workflow, and that the trained machine learning models classify them as either genuine or
fake. The expected result is that the system assigns the reviews to the correct classes across a range
of test inputs.
Fig 6.14 Valid Google Play Store URL
Test Case 2: Invalid Google Play Store URL (Improper Structure)
This test assesses the behaviour of the system when a Google Play Store URL with an improper
structure is supplied. It checks whether the system rejects such URLs cleanly rather than raising
unhandled errors or crashing. The expected output is an error message informing the user that the
URL entered is not valid and explaining how to correct it.
Test Case 3: Invalid Google Play Store URL (Non-existent App)
In this test, the system is given a well-formatted Google Play Store link that points to a
nonexistent app. The purpose is to check that the system recognizes that no reviews or app details
are available and responds with a reasonable message. This makes the system more robust in cases
where no data exists for the application.
Test Case 4: Malformed URL (Missing id)
This test focuses on URLs that lack crucial components such as the app ID. It examines how
effectively the system identifies such erroneous URLs and informs the user about the error. The
expected outcome is a message telling the user that the URL is improperly formatted and how it
should be formatted.
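The URL checks exercised by the test cases above can be sketched offline as follows. The function name `check_play_url`, the exact rules, and the message wording are assumptions, and `com.example.app` is a placeholder package name. A non-existent app (Test Case 3) passes these structural checks and can only be detected when the scraper is actually called.

```python
from urllib.parse import parse_qs, urlparse


def check_play_url(url):
    """Return (is_valid, message) for a Google Play app-details URL."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or parts.netloc != "play.google.com":
        return False, "Not a Google Play Store URL."
    if not parts.path.startswith("/store/apps/details"):
        return False, "URL does not point to an app details page."
    if not parse_qs(parts.query).get("id"):
        return False, "URL is missing the app id (expected ...details?id=<package>)."
    return True, "OK"


cases = [
    "https://play.google.com/store/apps/details?id=com.example.app",  # valid
    "https://example.com/app",                                        # wrong site
    "https://play.google.com/store/apps/details",                     # missing id
]
for case in cases:
    print(case, "->", check_play_url(case))
```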
Test Case 5: Text Preprocessing
This test focuses on the initial steps of text mining, namely text cleaning, tokenization,
lemmatization, and stopword removal. The goal is to make sure preprocessing works properly with
different review formats and prepares them for feature extraction. The expected result is cleaned
data ready to be fed to the machine learning models.
7. SOFTWARE TESTING
Software testing in the Fraud App Detection project checks that all components behave correctly
and produce reliable outputs. The process includes testing of individual components (unit testing),
such as review preprocessing and model predictions, as well as integration testing to confirm
smooth data flow between components. In addition, performance testing checks whether the system
can handle large datasets, and validation testing checks the quality of the machine learning models'
predictions. Testing is essential for finding and fixing bugs, improving system stability, and making
sure the project achieves its objective of detecting fraudulent mobile apps.
Unit testing is a fundamental element of the Fraud App Detection project, since it checks that the
separate parts of the system work properly in isolation before they are merged into the final system.
In this work, unit tests can be implemented for the major functions, e.g., text preprocessing, review
scraping, and the prediction models. For example, unit tests can check whether the preprocessing
functions behave correctly, i.e., that reviews are tokenized, stopwords are removed, and words are
lemmatized into their root forms. Testing these components individually allows likely errors to be
detected and corrected early, so that every stage of the data pipeline performs as intended before
broader testing begins.
Unit tests of the Google Play Scraper module also check that reviews are extracted correctly for
valid app URLs. Tests can be designed to see whether the scraper fails gracefully for cases such as
apps that have no reviews, invalid URLs, or incomplete data. For example, a test may supply a URL
for an app that does not exist and observe whether the scraper returns a useful error, thereby
protecting the system from crashing during real-time execution.
Unit tests also apply to the machine learning models in the project. For the Logistic Regression
and Random Forest models, unit tests can check that the models train correctly and that their
predictions on test data are accurate. This covers verifying that the models output predictions in
the appropriate format (for example, "Fraudulent" or "Legitimate") and evaluating model
performance by comparing predictions to known outcomes from test datasets. Applying thorough
unit testing to all of these functions gives confidence that each independent building block works
correctly and helps build a more stable and trustworthy system as a whole.
Integration testing in the Fraud App Detection (FAD) project checks that the components of the
system work together to deliver the intended results. Verifying the review fetching and
preprocessing pipeline is a crucial part of integration testing. For example, once the Google Play
Scraper has extracted app reviews, the next step is to process them. Integration tests check whether
the reviews are cleaned and tokenized, stopwords are removed, and words are lemmatized correctly.
These tests confirm that data passed between stages (scraping, preprocessing, and feature
extraction) is not corrupted and that it is ready to be used as model input. By simulating real-world
scenarios (e.g., different review lengths and formats), integration testing also checks the robustness
of the whole preprocessing pipeline.
Next, integration tests check the connection between the feature extraction step and the machine
learning models (Logistic Regression and Random Forest). Using TF-IDF-vectorized reviews, they
confirm that these vectors are correctly passed as input to the models for prediction. In particular,
the tests examine whether the models can handle different inputs (reviews downloaded from
different applications) and return sensible predictions without failing. They also confirm that the
prediction results are returned in a consistent format and used properly by the application.
Integration testing of these interdependent parts helps ensure that the system works as a whole and
that functionality does not fail because of incorrect or broken interactions between modules.
Acceptance testing in the Fraud App Detection project checks that the system meets the specified
requirements and performs correctly from the end user's point of view. At this stage, the whole
system is tested to confirm that it delivers the intended outcomes in real-world settings. For
example, the system should correctly retrieve reviews from the Google Play Store, preprocess them
properly, and deliver appropriate predictions (fraudulent/legitimate) for each review. Acceptance
testing also covers the usability of the user interface (if applicable) and the system's ability to
handle diverse inputs, i.e., different app URLs and review set sizes. The system's capacity to
generate useful summaries of fraudulent vs. legitimate reviews and present them to the user is
assessed as well. The aim is to confirm that the work meets its requirements by delivering accurate
fraud detection for Play Store applications and an intuitive user experience.
Validating the Fraud App Detection system is essential for establishing that it behaves correctly,
reliably, and accurately. In this project, component testing is used to validate the functionality of
each component (i.e., the Google Play Scraper, the preprocessing functions, and the machine
learning models). The Google Play Scraper is tested to verify that it extracts reviews from valid app
URLs and handles edge cases robustly, such as missing or invalid reviews. The preprocessing
functions are tested to make sure that text is tokenized, stopwords are filtered, and words are
lemmatized correctly, so that the models receive clean input. The machine learning models,
Logistic Regression and Random Forest, are tested to ensure they are trained correctly and make
accurate predictions. In addition, integration testing checks smooth data transfer from one stage to
the next, from review loading to the final prediction, and performance testing assesses how the
system behaves with large datasets or many concurrent requests.
Overall, thorough testing gives confidence that the system can reach an accurate decision on
whether an app's reviews are fraudulent or legitimate, while maintaining good performance and
stability.
8. RESULTS
Legitimate App:
Fig 8.1 Legitimate App
Fraud App:
Fig 8.2 Accuracy and Classification report of Logistic Regression Model
9. CONCLUSION AND FUTURE ENHANCEMENTS
Conclusion
This project effectively classified mobile applications as legitimate or fraudulent based on user
reviews using sentiment analysis and machine learning models. User reviews were fetched using the
Google Play Scraper, then preprocessed with NLP techniques such as tokenization, lemmatization,
and TF-IDF vectorization. These processed reviews were subsequently classified using two machine
learning models: Logistic Regression and Random Forest. The models demonstrated strong
performance, achieving accuracy rates of 90% for Logistic Regression and 92% for Random Forest.
The final classification of the app was determined by the majority decision from both models'
predictions. Based on the results, the reviews for a given app were classified into either "Fraudulent"
or "Legitimate" categories, and an overall app classification was derived. This project highlights the
effectiveness of combining machine learning and sentiment analysis for detecting fraudulent
applications in online marketplaces.
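The majority decision mentioned above is not spelled out in detail. One possible reading, sketched here as an assumption, is to pool the review-level labels from both models and take the more frequent one; the function name `classify_app` and the "Undetermined" result for a tie are hypothetical choices made for this illustration.

```python
from collections import Counter


def classify_app(lr_labels, rf_labels):
    """Pool both models' review labels and return the majority label."""
    votes = Counter(lr_labels) + Counter(rf_labels)
    fraud, legit = votes["Fraudulent"], votes["Legitimate"]
    if fraud == legit:
        return "Undetermined"  # assumed tie handling
    return "Fraudulent" if fraud > legit else "Legitimate"


# Invented predictions for three reviews of one app.
lr = ["Fraudulent", "Fraudulent", "Legitimate"]
rf = ["Fraudulent", "Legitimate", "Fraudulent"]
print(classify_app(lr, rf))  # Fraudulent (4 votes against 2)
```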
Future enhancement
Future enhancements for this fraud detection project can focus on improving accuracy, scalability, and
user experience. Advanced NLP techniques, such as using transformer models like BERT or GPT, can
refine sentiment analysis and review classification. The integration of additional app metadata (e.g.,
ratings, downloads, and developer history) and multi-language support can make the system more
comprehensive and versatile. Scalability can be achieved by deploying the solution as a cloud-based
API or web application with dynamic review scraping for real-time updates. Explainable AI methods,
such as SHAP, can provide transparency in predictions, while mechanisms to detect fake or bot-
generated reviews will enhance reliability. These improvements will help transform the project into a
robust, real-time app fraud detection and sentiment analysis platform.
10. BIBLIOGRAPHY
[1] JoMingyu. "Google Play Scraper: A Python Library for Scraping Data from Google Play
Store." GitHub Repository. Accessed November 22, 2024. Available at:
https://github.com/JoMingyu/google-play-scraper.
[2] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &
Vanderplas, J. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine
Learning Research, 12, 2825-2830. Available at: https://scikit-learn.org/.
[3] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly
Media. NLTK Documentation. Available at: https://www.nltk.org/.
[4] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau,
D., ... & Oliphant, T. E. (2020). "Array Programming with NumPy." Nature, 585(7825), 357-
362. Available at: https://numpy.org/.
[5] McKinney, W. (2010). "Data Structures for Statistical Computing in Python." Proceedings of
the 9th Python in Science Conference, 51-56. Pandas Documentation. Available at:
https://pandas.pydata.org/.
[7] Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32. Available at:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
[8] Analytics Vidhya Team. "Text Preprocessing in NLP with Python." Analytics Vidhya Blog.
Accessed November 22, 2024. Available at:
https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python/.
[9] Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization." arXiv
preprint arXiv:1412.6980. Accessed November 22, 2024. Available at:
https://arxiv.org/abs/1412.6980.
[10] Google LLC. "Google Play Store." Source for App Reviews and Metadata. Accessed
November 22, 2024. Available at: https://play.google.com/store.
[11] https://www.irjmets.com/uploadedfiles/paper/volume2/
http://www.irjmets.com/ (2020)
[12] https://www.ijraset.com/research-paper/fraud-app-detection-of-google-play-store-apps