FINAL PROJECT:
Microsoft Malware Prediction
Jingyan Qiao
Jiayi Wang
Quoc Tuong Dong
Ye Chen
Background
Our computers are always exposed to
an unsafe network environment.
Browsing a website.
Clicking on a link.
Turning off an advertisement.
Malware that infects personal, enterprise
and national computers is likely to lead
to criminal activity.
Introduction
Data Description
Our data came from Microsoft and describes 60,000 computers using more than
80 properties.
The response indicates whether malware was detected on each computer,
so the response variable is binary.
Problem
Our team encountered problems when analyzing the dataset because of its
lack of clarity and transparency.
Goals
Our goal is to predict a Windows machine’s probability of getting infected by
malware and investigate the significance of each predictor.
Data Cleansing
Split all features into three groups: numeric, binary, and categorical.
Fill in the blank cells and format the data.
Delete the features with too many missing values or highly
unbalanced dimensions.
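The cleansing steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used for the project: the thresholds `max_missing` and `max_dominant`, the helper name `clean`, the toy rows, and the columns `Sparse` and `Constant` are all hypothetical, and "highly unbalanced" is interpreted here as a feature dominated by a single value.

```python
from collections import Counter

MISSING = (None, "")  # assumption: blank cells appear as None or empty strings


def clean(rows, max_missing=0.5, max_dominant=0.95):
    """Drop sparse or unbalanced features, group the rest, and fill blanks.

    Returns {feature: (group, filled_values)}.
    Both thresholds are hypothetical choices for illustration.
    """
    n = len(rows)
    kept = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        present = [v for v in values if v not in MISSING]
        if not present:
            continue  # feature is entirely missing
        if 1 - len(present) / n > max_missing:
            continue  # too many missing values
        top_value, top_count = Counter(present).most_common(1)[0]
        if top_count / len(present) > max_dominant:
            continue  # one value dominates: highly unbalanced
        # Assign the feature to one of the three groups.
        if set(present) <= {0, 1}:
            group = "binary"
        elif all(isinstance(v, (int, float)) for v in present):
            group = "numeric"
        else:
            group = "categorical"
        # Fill blanks: median for numeric features, most frequent value otherwise.
        if group == "numeric":
            fill = sorted(present)[len(present) // 2]
        else:
            fill = top_value
        kept[col] = (group, [fill if v in MISSING else v for v in values])
    return kept


# Toy rows; the values are invented for illustration.
rows = [
    {"IsProtected": 1, "OsBuild": 17134, "Platform": "windows10", "Sparse": None, "Constant": "x"},
    {"IsProtected": 0, "OsBuild": None, "Platform": "windows10", "Sparse": None, "Constant": "x"},
    {"IsProtected": None, "OsBuild": 16299, "Platform": "windows8", "Sparse": 3, "Constant": "x"},
    {"IsProtected": 1, "OsBuild": 17134, "Platform": "", "Sparse": None, "Constant": "x"},
]
cleaned = clean(rows)
for name, (group, values) in cleaned.items():
    print(name, group, values)
```

In this toy input, `Sparse` is dropped for having too many blanks and `Constant` for being dominated by one value; the remaining features are grouped and filled.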
Methods
Logistic Regression Model
LASSO Logistic Regression Model
Gradient Boosting Decision Trees (GBDT) Model
Random Forest Model
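For reference, the first two methods can be written in their standard textbook form. The notation is an assumption introduced here, not taken from the slides: $y_i \in \{0, 1\}$ marks whether malware was detected on computer $i$, $x_i$ is its feature vector, and $\lambda \ge 0$ is the LASSO penalty weight. The tree-based models (GBDT and random forest) have no comparable closed form and are omitted.

```latex
% Logistic regression: modeled probability of infection for computer i
p_i = P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + x_i^{\top}\beta)\bigr)}

% LASSO logistic regression: penalized negative log-likelihood
\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr]
  + \lambda \lVert \beta \rVert_1
```

Setting $\lambda = 0$ recovers ordinary logistic regression; larger values of $\lambda$ shrink more coefficients to exactly zero, which is why the LASSO fit also acts as a feature selector.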
Analysis
Logistic Regression Model.
Confusion Matrix.
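As a reminder of how a confusion matrix is obtained from a model that outputs probabilities, here is a minimal sketch. The 0.5 cutoff, the function name, and the toy labels and probabilities are assumptions for illustration only; they are not the project's results.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count TP, FP, FN, TN after thresholding predicted probabilities."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["TP"] += 1  # malware predicted and detected
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1  # malware predicted but not detected
        elif predicted == 0 and actual == 1:
            counts["FN"] += 1  # malware missed
        else:
            counts["TN"] += 1  # clean machine correctly predicted
    return counts


# Invented labels (1 = malware detected) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.3, 0.7, 0.6, 0.1]
matrix = confusion_matrix(y_true, y_prob)
print(matrix)
```

Lowering the threshold trades false negatives for false positives, so the matrix should always be read together with the cutoff that produced it.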
Features with high contribution (Logistic)
LocaleIdentifier
Platform
SkuEdition
IsProtected
IsGamer
AppVersion
Logistic Regression Model.
Predicted Probability Plot.
LASSO Logistic Regression Model.
[Link]=0.00265
LASSO Logistic Regression Model.
Confusion Matrix.
Features with high contribution (LASSO Logistic)
AvSigVersion
EngineVersion
ProductName
CityIdentifier
LocaleIdentifier
IsGamer
Processor
Platform
OsBuild
OsSuite
IsProtected
OsPlatform
AppVersion
GeoNameIdentifier
SmartScreen
GBDT Model.
Confusion Matrix.
Precision & Recall Rate.
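Precision and recall follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the counts are made up for illustration and are not the GBDT results.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall = precision_recall(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the machines flagged as infected, how many really were?", while recall answers "of the infected machines, how many were flagged?".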
GBDT Model.
The top 18 features with highest contribution.
Random Forest Model.
Recommendation
Collect data over a period of time to support a time series
analysis.
Use the same set of data for the GBDT analysis as for the other
models to see whether our results change.
Obtain access to the real data without any confidential
information and repeat the analysis to see whether we can build
better prediction models.