Data Science for Malware Prediction

The document summarizes a final project on predicting malware infection on Windows machines. It describes cleaning and preprocessing a dataset of over 60,000 computers and 80 properties from Microsoft. Several models were built including logistic regression, LASSO logistic regression, gradient boosting decision trees, and random forest. The logistic regression and LASSO models produced confusion matrices and highlighted important predictive features. The GBDT model also generated precision, recall, and a list of the top 18 contributing features. Recommendations include collecting time series data, using the same data for all models, and obtaining full access to the real data for improved analysis.

Uploaded by

vikram k

FINAL PROJECT:

Microsoft Malware Prediction

Jingyan Qiao
Jiayi Wang
Quoc Tuong Dong
Ye Chen
Background
Our computers are constantly exposed to an unsafe network environment:

• Browsing a website.
• Clicking on a link.
• Closing an advertisement.

Malware that infects personal, enterprise, and national computers is likely to lead to criminal activity.
Introduction
• Data Description
Our data came from Microsoft and describes about 60,000 computers, each with more than 80 properties.
The response is whether malware was detected on each computer, so the response variable is binary.
• Problem
Our team encountered problems when analyzing the dataset because of its lack of clarity and transparency.
• Goals
Our goal is to predict a Windows machine's probability of getting infected by malware and to investigate the significance of each predictor.
Data Cleansing

• Split all features into three groups: numeric, binary, and categorical.

• Fill in the blank cells and format the data.

• Delete the features with too many missing values or highly unbalanced dimensions.
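The three cleansing steps can be sketched in plain Python as follows. This is only an illustration, not the team's actual code: the thresholds `max_missing` and `max_dominant`, the fill rules (median for numeric, most frequent value otherwise), and the columns `MostlyBlank` and `IsBeta` are assumptions made up for the example. `IsProtected`, `OsBuild`, and `SmartScreen` are feature names that appear in the slides, but the values shown are invented.

```python
from collections import Counter
from statistics import median

def clean(rows, max_missing=0.5, max_dominant=0.95):
    """Drop sparse or near-constant columns, group the rest by type, fill blanks."""
    n = len(rows)
    groups = {"numeric": [], "binary": [], "category": []}
    fills = {}  # column -> value used for blank cells
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] is not None]
        # Too many missing values: drop the feature.
        if not present or 1 - len(present) / n > max_missing:
            continue
        top_value, top_count = Counter(present).most_common(1)[0]
        # One value dominates (highly unbalanced): drop the feature.
        if top_count / len(present) > max_dominant:
            continue
        if set(present) <= {0, 1}:
            groups["binary"].append(col)
            fills[col] = top_value
        elif all(isinstance(v, (int, float)) for v in present):
            groups["numeric"].append(col)
            fills[col] = median(present)
        else:
            groups["category"].append(col)
            fills[col] = top_value
    cleaned = [
        {c: (r[c] if r[c] is not None else fills[c]) for c in fills}
        for r in rows
    ]
    return cleaned, groups

# Invented sample rows; None marks a blank cell.
rows = [
    {"IsProtected": 1, "OsBuild": 17134, "SmartScreen": "RequireAdmin", "MostlyBlank": None, "IsBeta": 0},
    {"IsProtected": None, "OsBuild": 16299, "SmartScreen": None, "MostlyBlank": None, "IsBeta": 0},
    {"IsProtected": 0, "OsBuild": None, "SmartScreen": "Off", "MostlyBlank": 3, "IsBeta": 0},
    {"IsProtected": 1, "OsBuild": 17134, "SmartScreen": "RequireAdmin", "MostlyBlank": None, "IsBeta": 0},
]
cleaned, groups = clean(rows)
print(groups)
print(cleaned[1])
```

Here `MostlyBlank` is dropped for having too many blanks and `IsBeta` for being constant; the remaining three columns land in one group each.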
Methods
• Logistic Regression Model

• LASSO Logistic Regression Model

• Gradient Boosting Decision Trees (GBDT) Model

• Random Forest Model

Analysis
Logistic Regression Model.
Confusion Matrix.

Features with high contribution (Logistic):
LocaleIdentifier
Platform
SkuEdition
IsProtected
IsGamer
AppVersion
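As a rough illustration of how a logistic regression fit leads to a confusion matrix, the sketch below trains a tiny model by batch gradient descent and tabulates its predictions at a 0.5 cutoff. It is not the project's code: the learning rate, epoch count, cutoff, and the eight toy rows are assumptions, and the labels are invented so that the classes separate cleanly. Only the feature names `IsProtected` and `IsGamer` come from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    cm = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        key = ("T" if t == p else "F") + ("P" if p == 1 else "N")
        cm[key] += 1
    return cm

# Toy rows [IsProtected, IsGamer]; labels are invented (1 = malware detected).
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [0, 1]]
y = [0, 0, 1, 1, 0, 0, 1, 1]

w, b = fit_logistic(X, y)
probs = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
preds = [1 if p >= 0.5 else 0 for p in probs]
cm = confusion_matrix(y, preds)
print(cm)
```

On this toy data the fitted weight for `IsProtected` is negative, which is the kind of sign and magnitude one would inspect when judging a feature's contribution.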
Logistic Regression Model.
Predicted Probability Plot.
LASSO Logistic Regression Model.

[Link]=0.00265
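For reference, L1-penalized (LASSO) logistic regression is usually written as the objective below. This is the textbook form and may differ in scaling from whatever software produced the slide. The symbols $x_i$, $y_i$, $\beta_0$, $\beta$, and $n$ are notation introduced here, and the value 0.00265 above is presumably the chosen penalty weight $\lambda$.

```latex
\[
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}
    \left\{
      -\frac{1}{n} \sum_{i=1}^{n}
        \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
      + \lambda \sum_{j} \lvert \beta_j \rvert
    \right\},
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top} \beta)}}
\]
```

A larger $\lambda$ pushes more coefficients to exactly zero, which is why the features that survive the penalty can be read as the ones with high contribution.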
LASSO Logistic Regression Model.
Confusion Matrix.

Features with high contribution (LASSO Logistic):

AvSigVersion        Processor
EngineVersion       OsBuild
ProductName         OsSuite
CityIdentifier      IsProtected
LocaleIdentifier    OsPlatform
IsGamer             AppVersion
Processor           GeoNameIdentifier
Platform            SmartScreen
GBDT Model.
Confusion Matrix.

Precision & Recall Rate.
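Precision and recall follow directly from the four confusion-matrix counts. The short sketch below shows the arithmetic; the counts are invented for illustration and are not the GBDT results from the slides.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, not taken from the project.
tp, fp, fn, tn = 40, 10, 20, 30
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Precision answers "of the machines flagged as infected, how many really were?", while recall answers "of the infected machines, how many were flagged?".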
GBDT Model.
The top 18 features with highest contribution.
Random Forest Model.
Recommendation
• Collect data over a period of time to enable a time series analysis.

• Run the GBDT analysis on the same set of data as the other models to see whether our results change.

• Get access to the real data without any confidential information and analyze it to see if we can build better prediction models.
