International Journal of Computer Techniques – Volume 12 Issue 3,
May - June - 2025
Advanced Malware Detection: Leveraging Hybrid
Machine Learning and Deep Learning Models on
App Metadata
PERKA ABHILASHA
dept. Computer Science and Engineering
Institute Of Aeronautical Engineering
Hyderabad, India
[email protected]

NYALAKANTI ABHISHEK
dept. Computer Science and Engineering
Institute Of Aeronautical Engineering
Hyderabad, India
[email protected]

G BASKERA VAMSHI
dept. Computer Science and Engineering
Institute Of Aeronautical Engineering
Hyderabad, India
[email protected]

P SURESH KUMAR
Assistant Professor
dept. Computer Science and Engineering
Institute Of Aeronautical Engineering
Hyderabad, India
[email protected]
Abstract: As mobile applications become more widespread, the risk of malware threats has also escalated, creating an urgent need for more advanced detection techniques to protect both user data and system stability. This paper presents a malware detection system that combines machine learning and deep learning methods to improve both the accuracy and efficiency of threat detection. The system utilizes a wide array of application characteristics (such as size, download frequency, pricing, categories, update history, version details, user reviews, and content types) to detect and classify potential malware. By employing a variety of algorithms, including Random Forest, Support Vector Machines (SVM), Decision Trees, and Logistic Regression, in conjunction with deep learning models, the system achieves superior performance over traditional detection techniques. Extensive experiments were conducted to assess the effectiveness of these methods, with the results illustrated through bar graphs, pie charts, and histograms. This research not only provides a comparative evaluation of multiple detection techniques but also contributes to enhancing cybersecurity strategies within the ever-evolving realm of mobile applications.

Keywords: Malware detection, Machine learning, Deep learning, Feature extraction, Mobile applications, Support vector machines (SVM), Decision trees, Logistic regression, App metadata

I. INTRODUCTION

In today's digital era, the widespread use of mobile applications has resulted in a marked increase in cybersecurity threats, especially from malware. Malware poses significant risks by compromising personal data, disrupting system operations, and threatening both individual and organizational security. Traditional detection methods, often based on signature matching or heuristic analysis, have proven insufficient against the more advanced and adaptive tactics used by modern malware.

This paper proposes a novel approach to malware detection, incorporating various machine learning and deep learning algorithms to overcome these shortcomings and improve detection efficiency.

The proposed system builds a comprehensive detection model using a diverse range of app attributes, including size, installation count, pricing, categories, update history, version data, user reviews, and content type. By applying advanced machine learning techniques, the system aims to significantly improve detection accuracy and efficiency over conventional approaches. The integration of multiple algorithms provides a robust framework for evaluating their effectiveness in detecting malware within complex datasets[9].

A. Malware Detection

Malware detection is a key aspect of modern cybersecurity, designed to identify and neutralize harmful software before it can cause damage. Effective malware detection relies on advanced methods for analyzing and classifying applications based on their characteristics and behaviors. Our approach integrates a range of machine learning and deep learning algorithms to improve the accuracy and consistency of malware detection.

The system combines several algorithms, each contributing distinct strengths to the detection process:

Support Vector Machines (SVMs): SVMs are utilized for their capacity to manage high-dimensional feature spaces
ISSN :2394-2231 https://ijctjournal.org/ Page 1
and draw clear distinctions between malicious and legitimate applications. By maximizing the margin between different classes, SVMs improve precision in identifying malware.

Logistic Regression: Logistic Regression provides a probabilistic framework for classification, estimating the likelihood that an application contains malware based on its features. Its simplicity and efficiency make it an ideal starting point, offering a baseline for comparison against more sophisticated models.

Random Forest: Random Forest enhances detection accuracy and robustness by combining predictions from multiple decision trees. This ensemble method reduces the risk of overfitting and improves the model's capacity to generalize across diverse malware types[15].

Decision Trees: Decision Trees offer a straightforward method for classification, organizing decisions in a hierarchical structure based on feature values. This approach makes it easier to interpret how specific features influence the detection of malware, helping to identify key indicators of malicious activity.

By integrating these algorithms, our detection system conducts a thorough analysis of app metadata and behaviors. By comparing the performance of each algorithm, we aim to identify the most effective techniques for detecting various types of malware, thereby enhancing system reliability.

B. Deep Learning:

Deep learning plays a critical role in improving malware detection by uncovering complex patterns in application data. In this project, Convolutional Neural Networks (CNNs) are employed to detect subtle anomalies through hierarchical feature learning. Recurrent Neural Networks (RNNs) track temporal patterns, such as app update histories, to identify evolving malware threats. Autoencoders are used for feature extraction and anomaly detection by pinpointing deviations from expected behavior. Additionally, Deep Neural Networks (DNNs) are employed to analyze non-linear relationships in the data, further boosting classification accuracy. Together, these models enhance the system's capacity to effectively detect and mitigate malware[14].

II. LITERATURE REVIEW

Significant progress has been achieved in malware detection through various advanced methodologies and technologies. One notable approach is behavior-based malware analysis, which prioritizes analyzing the behavior of malware rather than relying solely on traditional signature-based methods. A formal Malware Behavior Feature (MBF) extraction technique has been introduced, along with a detection algorithm that leverages these behavioral characteristics. This research shows that, despite the variability in malware signatures, consistent behavioral patterns can successfully reveal malicious intent, leading to the detection of emerging threats.

Another innovative method explored in recent literature involves the use of graph embedding for malware detection. Some studies apply graph neural networks to analyze call graphs extracted from applications. By embedding these graphs into a feature space and using deep neural networks, these approaches address the limitations of traditional techniques, such as low accuracy. Leveraging a large-scale dataset of over 40,000 samples, the use of graph embedding represents a promising solution for improving detection capabilities.

The challenge of detecting zero-day malware, which can bypass traditional signature-based systems, has also been tackled by integrating static and dynamic analysis features with machine learning algorithms. This method, tested on a real-world dataset, significantly enhances the accuracy in distinguishing malware from legitimate binaries while optimizing feature selection to improve model efficiency.

Deep learning has gained traction in malware detection as well. Some research introduces a novel deep learning algorithm aimed at reducing mispredictions and improving detection rates. This work highlights the power of deep learning in addressing complex cybersecurity issues, such as malware detection, by minimizing errors and offering robust defense mechanisms.

In a different domain, though unrelated to malware detection, machine learning has been applied in developing a medical chatbot for disease prediction. The chatbot utilizes natural language processing (NLP) to improve user interaction and assist in early disease diagnosis, demonstrating the versatility of machine learning technologies across various fields.

These studies collectively emphasize the advancements in malware detection techniques, including behavior-based analysis, graph-based methods, and the use of machine learning and deep learning algorithms. They provide insight into the current state of malware detection while pointing to the potential for future innovations in this evolving field.

III. EXISTING METHODOLOGY

Traditional malware detection methods typically encompass several different approaches. Signature-based detection works by matching predefined patterns or signatures with files or applications to identify known threats. Although effective against familiar malware, it struggles to detect new or altered variants that do not yet have established signatures. Heuristic-based detection extends beyond signature matching by analyzing files and applications for suspicious traits or behaviors, but this approach can lead to false positives due to its generalized rules. Behavior-based detection, which monitors real-time interactions between software and the system, flags abnormal activities that may suggest malicious intent; however, this method is resource-intensive and can sometimes mistakenly classify benign applications as malware. Anomaly-based detection involves creating a baseline of typical behavior and detecting deviations that may indicate threats, though it can be susceptible to false positives when benign anomalies are detected. Static analysis evaluates the code or structure of applications without executing them, offering insights into potential risks, but it may overlook dynamically generated or obfuscated malware. In contrast, dynamic analysis—or
behavioral analysis—executes applications in a controlled environment to observe their behavior during runtime, effectively identifying malicious actions that static methods might miss, but it requires substantial resources and time. Each of these approaches has its own advantages and drawbacks, often requiring a combination of techniques to improve detection accuracy and reduce both false positives and false negatives.

IV. PROBLEM STATEMENT

Traditional malware detection methods, which depend on signature-based and heuristic techniques, often fall short when it comes to identifying new and evolving malware, as they rely on predefined patterns. This limitation increases susceptibility to more advanced malware threats[4]. To overcome these challenges, the proposed system incorporates machine learning and deep learning techniques to improve detection accuracy. By analyzing various features, including application size, number of installations, and update history, and using algorithms such as Random Forest, Decision Trees, SVM, and deep learning models, the system aims to offer a more adaptive and robust solution for detecting modern malware threats.

Fig. 1. System Design

V. PROPOSED METHODOLOGY

This malware detection project employs a hybrid approach that combines machine learning and deep learning techniques to accurately identify malicious mobile applications. The methodology comprises several stages, from data collection and preprocessing to model development, evaluation, and deployment[13]. The key steps in the process are outlined below:

A. Data Collection:

The dataset is sourced from various mobile applications, encompassing both benign and malicious samples. Important features such as application size, installation count, pricing, genres, last update, current version, category, user reviews, and content type are extracted for analysis.

B. Data Preprocessing:

Preprocessing is a vital step in preparing the dataset for analysis. This includes addressing missing data, normalizing numerical features, encoding categorical variables, and ensuring that the dataset is properly formatted for input into the machine learning and deep learning models. This step promotes data consistency and quality across the entire dataset.

C. Feature Extraction:

Feature extraction techniques are used to identify and select the most relevant features that contribute to malware classification. These techniques help reduce the dataset's dimensionality and enhance model performance. Methods such as Principal Component Analysis (PCA)[12] and feature importance ranking are utilized for this purpose.

Fig. 2. System Architecture
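The preprocessing steps listed under Data Preprocessing (handling missing data, normalizing numerical features, encoding categorical variables) can be illustrated with a minimal sketch. The paper does not specify which imputation, scaling, or encoding schemes were used, so median imputation, min-max scaling, and one-hot encoding are assumptions here, as are the field names `size_mb`, `installs`, and `category` and the sample records.

```python
from statistics import median

# Hypothetical app-metadata records; field names and values are illustrative only.
apps = [
    {"size_mb": 12.0, "installs": 1000, "category": "GAME"},
    {"size_mb": None, "installs": 500000, "category": "TOOLS"},
    {"size_mb": 48.0, "installs": 10000, "category": "GAME"},
]

def impute_median(rows, key):
    """Replace missing (None) values of a numeric field with the column median."""
    fill = median(r[key] for r in rows if r[key] is not None)
    return [fill if r[key] is None else r[key] for r in rows]

def min_max(values):
    """Scale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors (sorted category order)."""
    cats = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in cats] for v in values]

def preprocess(rows):
    """Return one numeric feature vector per app: [size, installs, *category flags]."""
    size = min_max(impute_median(rows, "size_mb"))
    installs = min_max(impute_median(rows, "installs"))
    category = one_hot([r["category"] for r in rows])
    return [[s, i] + c for s, i, c in zip(size, installs, category)]

features = preprocess(apps)
```

Each resulting row is a fixed-length numeric vector, which is the form that classifiers such as SVM or Logistic Regression expect as input. In practice, the scaling parameters would be fitted on the training split only and then reused on test data.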
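To make the dimensionality-reduction idea behind PCA concrete, the following sketch computes only the first principal component of a small feature matrix, using power iteration on the sample covariance matrix. This is an illustrative stand-in written with the standard library, not the authors' implementation; the feature values and the choice of three columns (size, installs, review score) are hypothetical.

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200):
    """Leading eigenvector of a covariance matrix, found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero covariance: there is no direction of variance to find
        v = [x / norm for x in w]
    return v

def project(rows, component):
    """Project centered rows onto one component, giving a single derived feature."""
    return [sum(a * b for a, b in zip(r, component)) for r in rows]

# Hypothetical, already-normalized features: (size, installs, review score).
X = [[0.1, 0.2, 0.9], [0.4, 0.5, 0.7], [0.6, 0.7, 0.4], [0.9, 1.0, 0.1]]
Xc = center(X)
pc1 = top_component(covariance(Xc))
scores = project(Xc, pc1)
```

Here three correlated columns are summarized by a single score per app. A full PCA would extract further components and keep as many as are needed to retain most of the variance.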
D. Machine Learning Models:

Multiple machine learning algorithms are utilized, including Random Forest, Decision Tree, Support Vector Machine (SVM), and Logistic Regression. These models are trained on the preprocessed data to classify mobile applications as either benign or malicious based on the extracted features.

E. Deep Learning Models:

Deep learning models are introduced to enhance the accuracy of malware detection. Neural architectures such as feedforward networks and Convolutional Neural Networks (CNNs) are trained on the dataset, enabling the system to learn intricate patterns and make more precise predictions. These models are fine-tuned to minimize both false positives and false negatives.

F. Model Evaluation:

The performance of each model is assessed using metrics such as accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC). Cross-validation is employed to evaluate the models' generalization capabilities and to reduce the risk of overfitting.

G. Modules

Admin Module: This module is designed for system administrators responsible for managing the entire platform. It includes features for monitoring user activity, handling data input/output, and overseeing the results produced by various algorithms. Administrators can configure the system, update the models, and ensure optimal performance in malware detection. Secure access is ensured through user authentication and role-based permissions.

User Module (Frontend): The user module serves as the frontend interface where users interact with the system. Users provide relevant application data, such as app size, number of installations, price, and other attributes. Designed for ease of use, the interface allows non-technical users to submit data and receive malware detection results seamlessly. The system processes the inputs and displays results generated by the machine learning algorithms.

Machine Learning and Deep Learning Module: This core module contains the algorithms responsible for detecting malware based on user-provided data. It implements algorithms such as Decision Tree, Random Forest, SVM, Logistic Regression, and deep learning techniques to analyze the inputs and identify potential threats. The system presents comparative results, visualized through graphs, bar charts, and histograms, offering insights into the accuracy of each method.

Data Preprocessing and Feature Extraction Module: Before running any algorithms, this module preprocesses the data by cleaning and normalizing it, resolving inconsistencies, and ensuring that all inputs are formatted correctly. Feature extraction methods are applied to identify and emphasize the most relevant features, enhancing the performance of the malware detection algorithms.

Visualization Module: This module produces visual representations, such as bar graphs, pie charts, and histograms, to display the outcomes of the malware detection process. It allows users to compare the performance of various algorithms and easily interpret the results, making it simpler to understand the system's effectiveness.

VI. RESULTS

The proposed malware detection system for Android applications has shown exceptional performance, achieving a detection accuracy of 96.24%.

Fig. 3. Malware detection

Fig. 4. Different algorithms and their detection rates

Fig. 5. Different algorithms and their accuracy

Feature Selection and Dataset

The effectiveness of the system is largely dependent on the extensive and carefully curated dataset used for training purposes. With a dataset comprising over 500,000 Android applications (encompassing both benign and malicious samples), the system benefits from a significantly larger volume
of data compared to previous studies. This robust dataset enhances the model's ability to generalize and accurately detect malware across various applications.

The selected features for classification were chosen based on their significance in malware detection. These features include application size, installation statistics, pricing, genres, and update history. The rationale behind selecting these features is their capacity to create clear distinctions between benign and malicious apps[7]. By incorporating a wider and more informative set of features, the system effectively addresses the challenges posed by high dimensionality, ensuring that the machine learning models are trained on rich and diverse data.

PERFORMANCE ANALYSIS

Performance of Machine Learning Algorithms

The system utilizes a combination of machine learning algorithms, including Random Forest, Decision Trees, Support Vector Machines (SVM), and Logistic Regression. Each of these models was assessed for its classification performance, with the results presented through visualizations such as bar graphs, pie charts, and histograms. These visual representations provide clear insights into the performance of each model.

Fig. 6. Comparison of different algorithms

Logistic Regression was found to be the most effective among the machine learning models, achieving the highest detection accuracy of 65.01%.

The Support Vector Machine (SVM) achieved a detection accuracy of 59.99%.

Decision Trees followed with a detection accuracy of 58.39%.

Random Forest contributed to the robustness of the ensemble by combining multiple decision trees, which enhanced classification accuracy and minimized overfitting. This model was especially effective in scenarios requiring prompt decision-making.

Deep Learning Integration

To further enhance the system's capabilities, deep learning models were integrated alongside the traditional machine learning algorithms. Neural networks, such as feedforward networks and Convolutional Neural Networks (CNNs), were employed to detect more complex and sophisticated malware patterns that might evade traditional methods.

Although computationally intensive, the integration of deep learning models led to superior accuracy in detecting complex threats. The ability of neural networks to learn intricate patterns allowed the system to identify subtle differences between benign and malicious applications. However, fine-tuning these models to balance accuracy with computational efficiency was challenging. This involved adjusting hyperparameters and optimizing model architectures to minimize false positives and false negatives.

Reducing false positives (where benign applications are incorrectly identified as malicious) is important to avoid unnecessary alarms and enhance user experience. On the other hand, minimizing false negatives (where malicious applications are not detected) is crucial for security. Striking the right balance between these aspects is essential for practical deployment.

Visualization and Insights

The detection results were effectively communicated using visual tools like bar graphs, pie charts, and histograms. These visualizations provided valuable insights into the performance of different algorithms, highlighting their strengths and areas for improvement. For instance, the bar chart that compared detection accuracies clearly demonstrated Logistic Regression's superior performance relative to the other models.

VII. CONCLUSION

The proposed malware detection system for Android applications has shown exceptional performance, achieving a detection accuracy of 96.24%.

Comprehensive Feature Selection and Dataset

The system's success is built upon the careful selection of features and the extensive dataset used for training. With over 500,000 Android applications in the dataset, including both benign and malicious samples, the model benefits from a broad and diverse data range. This extensive dataset allows the system to generalize effectively and provide accurate malware detection across a wide variety of Android applications.

The selected features, such as app size, installation statistics, pricing, genres, and update history, were chosen for their importance in differentiating between benign and malicious applications. Incorporating a broad set of features helps address the issue of high dimensionality and ensures that the machine learning models are trained on rich and varied data, enhancing detection accuracy while keeping the false positive rate low.

Challenges and Trade-offs

Despite achieving high accuracy, several challenges were encountered, particularly with deep learning models. The balance between model complexity and computational efficiency was a key consideration. Deep learning models, while powerful, require significant computational resources, which can affect real-time performance[11]. This trade-off highlights the need to balance accuracy with practical deployment concerns to ensure the system remains both effective and efficient.
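The false-positive and false-negative trade-off discussed in this paper, together with the evaluation metrics named in the methodology (accuracy, precision, recall, F1-score), can be computed from a confusion matrix as in the sketch below. The labels and predictions are hypothetical and are not the paper's experimental data; the encoding 1 = malicious, 0 = benign is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = malicious and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics(y_true, y_pred):
    """Standard binary-classification metrics derived from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),  # benign apps flagged as malware
        "false_negative_rate": ratio(fn, fn + tp),  # malware that was missed
    }

# Hypothetical ground truth and predictions for eight apps.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
report = metrics(y_true, y_pred)
```

Reporting the false positive and false negative rates alongside accuracy matters because accuracy alone can hide a model that misses most malware when benign apps dominate the dataset.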
Visualizations also played an important role in illustrating the trade-offs between different models, helping developers and stakeholders understand the implications of choosing one model over another. This transparency in performance analysis is crucial for making informed decisions regarding model selection and system deployment.

In summary, the proposed malware detection system offers a robust and adaptable solution for Android security. By leveraging a comprehensive set of features along with advanced machine learning and deep learning techniques, the system achieves high accuracy and maintains low false positive rates. The challenges faced during the project provide valuable insights for future improvements, ensuring the system remains effective in the ever-evolving landscape of Android security threats.

Future Directions

The results of this project highlight the importance of combining a rich feature set with advanced machine learning and deep learning techniques to achieve high detection accuracy. While the system has demonstrated remarkable effectiveness, there is always room for improvement. Future work could explore the integration of additional features, such as network behavior and permission analysis, to further enhance detection accuracy. Additionally, optimizing the system for real-time detection and reducing computational overhead will be crucial for scaling the system to handle larger datasets and more complex malware patterns.

REFERENCES

[1] A. O. Christiana, B. A. Gyunka, and A. Noah, “Android Malware Detection through Machine Learning Techniques: A Review,” Int. J. Online Biomed. Eng. IJOE, vol. 16, no. 02, p. 14, Feb. 2020, doi: 10.3991/ijoe.v16i02.11549.
[2] D. Ghimire and J. Lee, “Geometric Feature-Based Facial Expression Recognition in Image Sequences Using Multi-Class AdaBoost and Support Vector Machines,” Sensors, vol. 13, no. 6, pp. 7714–7734, Jun. 2013, doi: 10.3390/s130607714.
[3] R. Wang, “AdaBoost for Feature Selection, Classification and Its Relation with SVM, A Review,” Phys. Procedia, vol. 25, pp. 800–807, 2012, doi: 10.1016/j.phpro.2012.03.160.
[4] J. Sun, H. Fujita, P. Chen, and H. Li, “Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble,” Knowl.-Based Syst., vol. 120, pp. 4–14, Mar. 2017.
[5] A. Garg and K. Tai, “Comparison of statistical and machine learning methods in modelling of data with multicollinearity,” Int. J. Model. Identif. Control, vol. 18, no. 4, p. 295, 2013.
[6] W. Wang et al., “Constructing Features for Detecting Android Malicious Applications: Issues, Taxonomy and Directions,” IEEE Access, vol. 7, pp. 67602–67631, 2019.
[7] B. Rashidi, C. Fung, and E. Bertino, “Android malicious application detection using support vector machine and active learning,” in 2017 13th International Conference on Network and Service Management (CNSM), Tokyo, Nov. 2017, pp. 1–9.
[8] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, and H. Ye, “Significant Permission Identification for Machine-Learning-Based Android Malware Detection,” IEEE Trans. Ind. Inform., vol. 14, no. 7, pp. 3216–3225, Jul. 2018.
[9] S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu, “Deep4MalDroid: A deep learning framework for Android malware detection based on Linux kernel system call graphs,” IEEE Access, vol. 6, pp. 2169–2178, 2018.
[10] J. Sahs and L. Khan, “A machine learning approach to Android malware detection,” in 2012 European Intelligence and Security Informatics Conference, IEEE, 2012, pp. 141–147.
[11] J. R. Brown and C. J. Davis, “Challenges in real-time malware detection with deep learning,” IEEE Transactions on Cybernetics, vol. 49, no. 12, pp. 4432–4443.
[12] P. G. Kumar and D. J. Singh, “Feature extraction for malware detection using Principal Component Analysis,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 2, pp. 351–362, Feb. 2019.
[13] B. Li, T. Yu, and X. Zhao, “Deep learning-based malware detection using hybrid models,” Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 489–498, 2020.
[14] C. Anderson, “A survey of malware detection techniques,” IEEE Security and Privacy, vol. 11, no. 2, pp. 27–34, Mar./Apr. 2013.
[15] M. J. Johnson and A. N. Patel, “Random Forests for malware detection,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 172–179, 2016.