Data Science for Malware Prediction

The document summarizes a final project on predicting malware infection on Windows machines. It describes cleaning and preprocessing a dataset of over 60,000 computers and 80 properties from Microsoft. Several models were built including logistic regression, LASSO logistic regression, gradient boosting decision trees, and random forest. The logistic regression and LASSO models produced confusion matrices and highlighted important predictive features. The GBDT model also generated precision, recall, and a list of the top 18 contributing features. Recommendations include collecting time series data, using the same data for all models, and obtaining full access to the real data for improved analysis.

Uploaded by

vikram k

FINAL PROJECT:

Microsoft Malware Prediction


Jingyan Qiao
Jiayi Wang
Quoc Tuong Dong
Ye Chen
Background
Our computers are always exposed to an unsafe network environment.

• Browsing a website.
• Clicking on a link.
• Closing an advertisement.

Malware that infects personal, enterprise and national computers is likely to lead to criminal activity.
Introduction
• Data Description
Our data came from Microsoft and describes 60,000 computers, each with more than 80 properties.
The response is whether malware was detected on each computer, so the response variable is binary.
• Problem
Our team encountered problems when analyzing the dataset because it lacks clarification and transparency.
• Goals
Our goal is to predict a Windows machine's probability of getting infected by malware and to investigate the significance of each predictor.
Data Cleansing

• Split all features into three groups: numeric, binary and categorical.

• Fill in the blank cells and format the data.

• Delete the features with too many missing values or highly unbalanced distributions.
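The three cleansing steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the thresholds (`max_missing`, `max_dominant`), the choice of median and most-frequent-value fills, the `MostlyEmpty` column, and all cell values are assumptions made up for the example.

```python
import statistics
from collections import Counter

MISSING = (None, "")  # assumption: blank cells appear as None or ""

def present_values(values):
    return [v for v in values if v not in MISSING]

def missing_fraction(values):
    return sum(v in MISSING for v in values) / len(values)

def dominant_share(values):
    """Share of the most frequent non-missing value (1.0 if nothing is present)."""
    present = present_values(values)
    if not present:
        return 1.0
    return Counter(present).most_common(1)[0][1] / len(present)

def is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def feature_group(values):
    """Assign a feature to one of the three groups: binary, numeric, category."""
    present = present_values(values)
    if set(present) <= {0, 1, "0", "1"}:
        return "binary"
    if all(is_number(v) for v in present):
        return "numeric"
    return "category"

def cleanse(rows, max_missing=0.5, max_dominant=0.95):
    """Drop sparse or highly unbalanced features, then fill and format the rest."""
    cleaned = [dict() for _ in rows]
    groups = {}
    for name in rows[0]:
        values = [row.get(name) for row in rows]
        if missing_fraction(values) > max_missing:
            continue  # too many missing values
        if dominant_share(values) > max_dominant:
            continue  # one value dominates the feature
        group = feature_group(values)
        present = present_values(values)
        if group == "numeric":
            fill = statistics.median(float(v) for v in present)
        else:
            fill = Counter(present).most_common(1)[0][0]
        groups[name] = group
        for out, v in zip(cleaned, values):
            if v in MISSING:
                out[name] = fill
            else:
                out[name] = float(v) if group == "numeric" else v
    return cleaned, groups

# Toy rows; feature names follow the slides, values are invented.
rows = [
    {"OsBuild": "17134", "IsProtected": 1, "Platform": "windows10", "MostlyEmpty": ""},
    {"OsBuild": "", "IsProtected": 1, "Platform": "windows10", "MostlyEmpty": ""},
    {"OsBuild": "16299", "IsProtected": 0, "Platform": "windows8", "MostlyEmpty": "x"},
    {"OsBuild": "17134", "IsProtected": None, "Platform": "", "MostlyEmpty": ""},
]
cleaned, groups = cleanse(rows)
print(groups)
```

Checking the missing-value and imbalance filters before grouping keeps empty columns from being misclassified; on real data both thresholds would need tuning.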
Methods
• Logistic Regression Model

• LASSO Logistic Regression Model

• Gradient Boosting Decision Trees (GBDT) Model

• Random Forest Model
Analysis
Logistic Regression Model.
Confusion Matrix.

Features with high contribution (Logistic)
LocaleIdentifier
Platform
SkuEdition
IsProtected
IsGamer
AppVersion
Logistic Regression Model.
Predicted Probability Plot.
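As a rough illustration of where predicted probabilities come from, the sketch below fits a logistic regression by plain gradient descent and evaluates the logistic function on two machines. It is an assumption-laden toy: the two binary features stand in for IsProtected and IsGamer, the labels are invented, and the learning rate and epoch count are arbitrary; the slides do not say how the project's model was fitted.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Predicted probability that the response is 1 for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimise the average log-loss with batch gradient descent."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = predict_proba(weights, bias, x) - target
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

# Toy data: columns are [IsProtected, IsGamer]; labels are invented.
X = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 1], [1, 1], [0, 0], [0, 0]]
y = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]

weights, bias = fit_logistic(X, y)
p_protected = predict_proba(weights, bias, [1, 0])
p_unprotected = predict_proba(weights, bias, [0, 1])
print(p_protected, p_unprotected)
```

A predicted probability plot would then show these probabilities for every machine, typically split by the true label.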
LASSO Logistic Regression Model.

[Link]=0.00265
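For reference, the standard LASSO-penalised logistic regression objective is written out below. The label in front of 0.00265 was lost in extraction; if that number is the selected penalty strength, it would play the role of λ here, but that reading is an assumption. The symbols n (machines), d (predictors), and β are notation introduced for this sketch only.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr]
  + \lambda \sum_{j=1}^{d} \lvert \beta_j \rvert,
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}
```

The absolute-value penalty shrinks some coefficients exactly to zero, which is why a LASSO fit yields a list of retained features.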
LASSO Logistic Regression Model.
Confusion Matrix.

Features with high contribution (LASSO Logistic)

AvSigVersion        Processor
EngineVersion       OsBuild
ProductName         OsSuite
CityIdentifier      IsProtected
LocaleIdentifier    OsPlatform
IsGamer             AppVersion
Processor           GeoNameIdentifier
Platform            SmartScreen
GBDT Model.
Confusion Matrix.

Precision & Recall Rate.
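The confusion matrix and the precision and recall rates reported for each model are related as in the sketch below. The labels, the predicted probabilities, and the 0.5 cut-off are all invented for illustration; the slides do not state which threshold was used.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = malware detected)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts

def precision_recall(cm):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    precision = cm["tp"] / predicted_pos if predicted_pos else 0.0
    recall = cm["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall

# Toy labels and predicted probabilities (invented values).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probabilities = [0.9, 0.6, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3]
threshold = 0.5  # assumed cut-off
y_pred = [1 if p >= threshold else 0 for p in probabilities]

cm = confusion_matrix(y_true, y_pred)
precision, recall = precision_recall(cm)
print(cm, precision, recall)  # tp=3, fp=1, tn=3, fn=1; precision 0.75, recall 0.75
```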
GBDT Model.
The top 18 features with highest contribution.
Random Forest Model.
Recommendation
• Collect data over a period of time to support a time series analysis.

• Use the same set of data for the GBDT analysis as for the other two, to see whether our results change.

• Get access to the real data without any confidential information and analyze it to see whether we can build better prediction models.
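One way to act on the second recommendation is to fix a single seeded train/test split and reuse it for every model, so that differences in results come from the models rather than from the rows they saw. The function name, the 20% test fraction, and the seed below are hypothetical choices for this sketch.

```python
import random

def train_test_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train, test) row indices from one reproducible shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # private generator, global state untouched
    n_test = int(round(n_rows * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

# Every model would be trained and evaluated on these same index lists.
train_idx, test_idx = train_test_indices(10)
print(train_idx, test_idx)
```

Storing the index lists alongside the results makes it possible to confirm later that each model was evaluated on identical rows.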