0% found this document useful (0 votes)

137 views16 pages

Microsoft Malware Analysis

This document provides an assignment front sheet and final project report for a course on intermediate analytics. The project aims to predict the probability of a machine being infected with malware using logistic regression, LASSO regularization, random forest, and gradient boosting decision tree models. The dataset from Microsoft contains information on over 50,000 machines. Feature selection and various classification algorithms are used to determine the most significant factors affecting malware infections and improve Microsoft's security systems. Results are evaluated to understand these machine learning methods and the relationship between predictors and categorical responses.

Uploaded by

vikram k

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

137 views16 pages

Microsoft Malware Analysis

Uploaded by

vikram k

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

ASSIGNMENT FRONT SHEET

Course Name: ALY6015 20904 Intermediate Analytics

Professor Name: ChuanLi Jiang

Student Class: Fall 2019 CPS Term: A. 2020

Final Project Report

Completion Date: February 13th Due Time:12:00am

Statement of Authorship

I confirm that this work is my own. Additionally, I confirm that no part of this coursework, except where clearly quoted and
referenced, has been copied from material belonging to any other person e.g. from a book, handout, another student. I am
aware that it is a breach of Northeastern University’s regulations to copy the work of another without clear acknowledgement
and that attempting to do so renders me liable to disciplinary procedures. To this effect, I have uploaded my work onto Turnitin
and have ensured that I have made any relevant corrections to my work prior to submission.

Tick here to confirm that your paper version is identical to the version submitted through Turnitin

1
Final Project Report: Microsoft Malware Detections

Group Member & Student ID:

Jingyan Qiao, 001368423

Jiayi Wang, 001082307

Quoc Tuong Dong, 001089611

Ye Chen, 001300776

Introduction:

With the development of information industry and related technologies, computer and other machines

become increasingly important in production of any industry, keeping machine in a safe condition and not getting

infected by any type of malware is a top challenge for security department. The malware industry continues to be a

well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is

infected by malware, criminals can hurt consumers and enterprises in many ways. To protect the enterprise and

consumer customers’ computer and reduce the risk of personal data leaking in advance, Microsoft takes this

problem seriously and the factors of getting a machine infected was investigated in order to improve their security

system. The dataset conducted from Microsoft contains various properties of each machine and the actual infection

status of each machine generated by Microsoft’s endpoint protection solution, Windows Defender, which was

designed to meet certain business constraints, both in regards to user privacy as well as the time period during

which the machine was running.

Logistic regression model is a classic machine learning model classification which makes it an appropriate

regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses,

the logistic regression is a predictive analysis which is used to describe data and to explain the relationship between

one dependent binary variable and one or more independent variables. The dependent variable (response) in the

2
given dataset is whether the malware was detected on the machine or not, therefore the logistic regression model is

the fundamental model in this analysis. When pursuing the higher accuracy of the prediction for high-dimensional

datasets, the trade-off between bias and variance appears all the time. The given dataset contains 83 columns and

over 50k rows. Regularization regression is a method to avoid the risk of overfitting and reduce the variance of the

models without substantial increase in bias when fitting a regression model on a high-dimensional dataset. Least

Absolute Shrinkage and Selection Operator (LASSO) Regularization method is suitable for models with multiple

highly correlated predictors to remove the redundant features in order to facilitate model interpretation to make

decisions.

Decision tree is a model used to classify observations by splitting data based on the features, but it has

weaknesses especially in big data when researchers are facing with various types of features or a large amount of

data. The complexity of the decision trees severely reduces the efficiency and accuracy of the prediction model.

Random Forest is a machine learning method in classification analysis which consists of a large number of

individual decisions trees, functioning as an ensemble. Researchers harvest class predictions from each of the tree

and decide which is the most important class base on the most votes method. To optimize predictions, researchers

usually use bootstrapping methods. Another machine learning algorithm method well-suited for the classification

field based on decision trees is Gradient Boosting Decision Trees (GBDT) model. It is an ensembled model using

boosting technique, essentially combining weak classifiers to form a stronger one. Specifically in this case, the

weak classifiers are different decision trees, therefore the result of GBDT model could be considered as a

combination of these decision trees. The boosting process is sequential, which connects the subsequent trees to the

previous ones with errors in the predictions generated in the past in order to reduce the prediction error gradually

during the combination process. Besides, the GBDT model will not consider the result from one single tree as the

final result. Thus, it can be used to solve the overfitting problem as well.

The goal of this study is to predict the probability of the machine being infected by malware, and determine

the most significant factors affecting the infections for Microsoft Developers to take actions to improve their

systems in the future. Also, it is an opportunity to gain a better understanding of the theory and performance of
3
various classification methods such as simple logistic regression, logistic regression model with Lasso

Regularization, Random forest and Gradient Boosting Decision Trees method, and explore more on the

relationship between large amount of predictors and categorical response variable, and further make prediction

based on the findings.

Data & Methodology:

Data description:

The data was retrieved from Microsoft Competition in Kaggle and the data cleaning process was executed.

The raw dataset contains 8,921,483 observations with 82 various features of the machine and one response

variable, that is whether the machine was infected by the malware (Having Infections: coded as 1, Not Having

Infections: coded as 0). After investigating the frequency of the missing values and the distribution of dimensions

for each feature, those features with higher level of missing values (percentage of the missing values over 75%) or

highly imbalanced dimensions were deleted. For instance, none of the machines appears to have a Beta software,

therefore the feature IsBeta (whether the Beta software appears on each machine) was deleted and 50 variables

were studied in this analysis.

The features assumed to be significant in this study were:

EngineVersion: the version of the engine.

AvSigVersion: the version of the anti-virus signature database.

AvProductStateIdentifier: the status of the anti-virus software used on the machine.

Wdft_IsGamer: whether the machine was a gamer machine.

CityIdentifier: where the machine was located in.

CountryIdentifier: the country the machine was located in.

OsVer: the version of the current operating system.

OsPlatform: the platform used for the operating system.

4
IsProtected: whether there is any anti-virus products on the machine.

Firewall: whether the firewall was built on the machine.

HasTpm: whether the machine has a trusted platform module.

SmartScreen: smart screen enabled on the machine.

Census_SystemVolumeTotalCapacity: the size of the system volume installed on the machine.

Census_OsBuildRevision: the latest version of the operating system on the machine.

Platform: the platform installed on the machine.

Census_TotalPhysicalRAM: the number of the Random-Access Memory.

Census_HasOpticalDiskDrive: whether the machine has an optical disk drive.

Census_OsEdition: the edition of the current operating system.

Census_IsTouchEnabled: whether the machine is a touch device.

Methodology:

⚫ Logistic Regression Model.

a. Summarize the logistic model and select the predictors with significant contribution based on the p value

at a 95% confidence interval.

b. Plot the predicted probability of the dependent variable.

⚫ LASSO Regularization Method: Logistic Regression Model.

a. Plot the cross validation curve with all selected predictors.

b. Determine the value of Lambda that minimize the mean cross validation error and the value of lambda

within one standard error of the minimal mean-squared error to identify the optimal logistic regression

model with lasso regularization method.

⚫ Random Forest Model.

a. Encode the target features as factors.

b. Split the dataset into 6 tree groups and split each group into training and testing sets.
5
c. Fit the random forest classification to the training set for 6 trees. X is the dependent variables, y is the

results, ntree indicates the number of trees to grow resulting with 500 trees.

d. Visualize the results with prediction in binary format, generate the matrix and compare 6 different groups

of data to determine the optimal decision tree.

⚫ Gradient Boosting Decision Trees (GBDT) Model.

a. Labels Encoding: Label Encoding transform values to become numbers between 0 and n-1,

where n is the number of different labels.

b. Frequency Encoding: Frequency Encoding is a special case of Label Encoding. Feature values are

encoded based on the frequency. Transformed values are numbers between 0 and m, where m is

the number of values with a frequency greater or equal than 2.

Analysis & Discussion:

Logistic Regression Analysis:

The predicted formula can be constructed at a 95% confidence interval by the logistic regression to multiple

predictors as follows:

Predicted logit of (HasDetections) = 0.7084 Platformwindows8 + 0.9071SkuEditionEduation

+ 0.9809*SkuEditionEnterprise + 1.366*SkuEditionHome

+ 2.108SkuEditionInvalid + 0.933SkuEditionPro + 0.3086*IsProtected

+ 0.1395Wdft_IsGamer + 0.7202AppVersion. (1)

According to Equation 1, the log of the odds of the machine having infections was related to 9 features (p

value < 0.05). All features have a coefficient greater than 0, which suggests that these features enhance the

probability of the machine getting infected. The feature IsProtected (whether each machine has any active anti-

virus products) and Wdft_IsGamer (whether the user of each machine is a gamer) have the lowest p value,

therefore these two features are most critical factors increasing the probability of the machine having infections.

6
The goodness-of-fit statistics assess the fit of a logistic model against actual outcomes (i.e., whether a machine has

infections). Adjusted R-squared value is 0.028, suggesting around 2.8% of the observations can be explained by the

model. The large AIC value (80695) and Residuals (59748) also show the inefficacy of the model.

Table 1: Confusion Matrix (Logistic Regression Model).

Predicted
Logistic
0 1 Total
0 15144 14388 29532
Actual
1 10688 19586 30274
Success Rate 58.07%
*1: the machine that was infected by the malware.
*0: the machine that was not infected by the malware.

A confusion matrix was adopted to evaluate the performance of the logistic regression model. Table 1

suggests that the around half of the uninfected machine was predicted to be uninfected and about two-third of those

infected was correctly classified by the logistic regression model. The overall success rate of the prediction was

58.07%. The Type I error (where the uninfected machines were falsely predicted) is relatively higher than Type II

error (where the infected machines were falsely classified as uninfected) which defends the model in some ways,

since it is riskier to classify an infected machine as uninfected than the other way around.

Figure 1: Predicted Probability of the Outcome variable, HasDetections

(1 = Having Infections, 0 = Not Having Infections).

7
The probability prediction plot shows the predicted probability of each machine having infected by malware

along with the actual malware detection status. The Y-axis is the probability of having detections, and the X-axis is

the probability of infection from the lowest to the highest value of 60,000 machines. The curve is relatively flat

around the probability of 0.5, indicating that the prediction of the model is not efficient, which is consistent to the

small R-squared value above.

Logistic Regression Analysis with LASSO Regularization Method:

Figure 2: Mean Squared Error Versus the Logit of (lambda).

From the Plot 3, the value of the mean squared error has increased as lambda value increases, and first dash

lines chose two lambda values for the regularization method. The smaller lambda leads to the minimum mean squared

error, and the lambda within one standard deviation that leads to the model with fewest features. The Lasso

regularization process was applied with a lambda of 0.00265 to remove the redundant variables. The predicted

formula was adjusted by adapting Lasso regularization method to the logistic regression model as follows:

Predicted logit of (HasDetections) = 0.1017 Platformwindows8 + 0.0843IsProtected

+ 0.0169*Wdft_IsGamer (2)

8
Several features were shrunken to zero from Equation 1 to Equation 2. It was noticed that only three features

show significance after Lasso regularization: the Platformwindows8 (the Windows 8 platform), IsProtected (whether

any anti-virus products was on each machine) and Wdft_IsGamer (whether the user of each machine is a gamer). All

these three features increase the risk of the machine getting infected by malware with a positive coefficient.

Table 2: Confusion Matrix (Logistic Regression Model with Lasso Regularization method).

Predicted
LASSO
0 1 Total
0 4411 4446 8857
Actual
1 3082 6061 9143
Success Rate 58.18%
*1: the machine that was infected by the malware.
*0: the machine that was not infected by the malware.

After performing the Lasso regularization method, the accuracy rate was improved by around 0.1%

compared with the result from the simple logistic regression model shown in table 1. Whereas, the maximal R-

squared value was generated to be around 0.039 which indicates the inefficacy of the model.

Random Forest Model:

9
Note: 0, 1 on the left columns of the tables in Figure 3 represent the actual detection status in the test set and the
upper ones represent the predicted result.
Figure 3: Confusion Matrices (Random Forest Model).

In the Random Forest, the dataset was split up into 6 trees with slightly different number of inputs (500-

1000 rows) in each tree to test whether size of the inputs actually affects the prediction rate. From Figure 3, it can

be observed that Tree 1 with the lowest number of rows (4883 inputs) has the highest success rates at 57.04%. The

higher the number of observations is, the lower the success rate is with Tree 6 (6265 inputs) which can only

predicted 55.91% of the test set successfully. As a result, the number of Type I error cases (wrongly classified the

machine as infected one when it is actually not) are greater than Type II error (wrongly classified the infected

machine) cases across all the trees, which reduces the severity of wrong predictions.

Note: 0, 1 on the left columns of the tables in Figure 4 represent the actual detection status in the test set and the
upper ones represent the predicted result.
Figure 4: Confusion Matrices with Various Thresholds.

The confidence level was set lower to 0.45 and 0.475 and also it was set to be 0.55 in order to compare the

result of the success rate. Using Tree 1 as the base for the optimization, success rates fluctuates as the Confidence

10
level of area for malfunction changes. Nevertheless, the most optimized success rate (highest) is still the original

Tree 1 with 57.04% success rate. Any confidence level above or below 0.5 decreases the success rate of the

Random Forest prediction model. Thus, the confidence level was normally distributed between 0 and 1 with a

mean of 0.5. However, the Type II error was reduced from the original Tree 1 (15.83%) to the Optimized 3 of Tree

1 (10.84%) with a threshold of 0.55. Therefore, the Optimized 3 case with confidence level at 0.55 would be

considered as a more useful decision tree with a 0.3% success rate reduction.

Gradient Boosting Decision Trees (GBDT) Model:

The dataset was randomly split into train set (70%) and test set (30%), with each set containing approximately

equal number of infected machines and uninfected machines. The train set was used to build the model so that the

probability of test machine getting infected can be predicted (1 = Having Infections, 0 = Not Having Infections). A

probability larger than 50% means the machine was predicted to be infected, and not infected with a probability

lower than 50%.

Table 3: Confusion matrix (GBDT Model with 60,000 observations).

Predicted
GBDT
0 1 Total
0 4555 4286 8841
Actual
1 3609 5356 8965
Success Rate 55.66%
*1: the machine that was infected by the malware.
*0: the machine that was not infected by the malware.

Table 4: Precision, Recall Rate and F1 Score of GBDT Model (with 60,000 observations).

Precision Recall F1
0 51.52% 55.79% 53.57%
1 59.73% 55.55% 57.56%
*1: the machine that was infected by the malware.
*0: the machine that was not infected by the malware.

11
According to Table 3, 4555 out of 8841uninfected machine were correctly classified and, 3609 out of 8965

infected machine were correctly classified, the predict accuracy was 55.66%. The precision and recall rate measure

the performance of the prediction model with higher proportion representing better performance and higher accuracy

rate. Theoretically, the modification of the model aims to increase both the precision and recall rate, but increasing

the precision rate always leads to a lower recall rate, and vice versa. Thus, the combination of precision and recall

rate, F1 score, was embraced to give a better evaluation on the model. From Table 4, the F1 score is less than 60%

in each group of infected or uninfected machines, which indicates the model poorly explained the observations.

Table 5: Confusion Matrix (GBDT Model with 100,000 observations).

Predicted
GBDT
0 1 Total
0 8685 6290 14975
Actual
1 6508 8517 15025
Success Rate 57.34%
*1: the machine that was infected by the malware.
*0: the machine that was not infected by the malware.

Table 6: Precision, Recall Rate and F1 Score of GBDT Model (with 100,000 observations).

Precision Recall F1
0 58.00% 57.16% 57.58%
1 56.69% 57.52% 57.10%
*1: the machine that was infected by the malware.
*0: the machine that was not infected by the malware.

The GBDT model was adopted on a larger dataset with 100,000 observations and the accuracy rate was

improved by approximately 2.68% shown in table 5. Also, F1 Score was higher than 55% for both infected and

uninfected cases which confirms the better efficacy of the model with larger amount of data.

12
Feature Importance
0.25

0.2

0.15

0.1

0.05

Figure 5: The importance of the features shown significant contribution (with 60,000 observations).

Figure 6: The importance of the features shown significant contribution (with 100,000 observations).

13
The Gradient Boosting Decision Trees (GBDT) model generated the frequency of each feature used in the

model fitting process and a feature with a higher frequency was considered to have higher importance. The bar charts

were drawn for two GBDT models to visualize the impacts of those features on the probability of the machine getting

infected. The top three features with highest importance from Figure 5 were CountryIdentifier (the country each

machine was located in), LocaleEnglishNameIdentifier (the language code of the machine defined by Microsoft, i.e.

English = 1033.) and GeoNameIdentifier (the geographic region the machine was located in) which all related to the

location of the machine. Census_SystemVolumeTotalCapacity (the size of the system volume installed on the

machine), CountryIdentifier (the country each machine was located in) and Census_OsBuildRevision (the latest

version of the operating system on the machine, i.e. the latest version of Windows 10 =1909) appear to be the top

three features with higher contribution. The importance of each feature has changed from Figure 5 to Figure 6 with

different amount of observations. Whereas, there is no enough evidence to claim that those top three features in

Figure 6 would give a precise prediction with a relatively low success rate.

Conclusion & Recommendations:

From the results discussed above, it can be concluded that the Lasso logistic regression model is the best

model of prediction for this dataset with the highest success rate of 58.18% while the random forest which believed

to be one of the best models for classification machine learning, has the lowest success rate among all three models

both before and after optimization. Also, the Gradient Boosting Decision Trees (GBDT) model was not a proper

approach to predict the probability of each machine getting infected with a smaller accuracy rate (57.34%). R

programming language was found to have limited performance handling with variables containing over 1000 unique

values. In addition, the same process was done in Python for the GBDT model with more high-dimensional features

which results in a higher accuracy rate of 62.02%.

Error probability during extraction. The observations used in this study was only around 6.7% of the whole

dataset due to the limitation of the computers and R programming language, which makes the results hardly
14
representative to the entire dataset. A larger sample was selected to optimize the GBDT model and it did improve

the accuracy rate by 2.68%, which confirms the significance of the sample size. Although we extracted the dataset

randomly with equal amounts of the infected and uninfected machines, there might still be other factors might explain

the final results, for instance, some properties of those machine might be recorded in a specific order.

Lack of transparency, confidentiality and unexplainable features. Out of 83 original columns, only around

half of them are chosen for this analysis. It is due to the fact that a lot of features are not clearly explained in detail,

encoded or even missing for all the observations. One of the examples is the Geographic features. All of the countries

are encoded in a confidential way and the probability of the machine getting infected by the malware in each country

cannot be studied as a result. Those Geographic features were excluded from the Random Forest prediction because

they have more than 200 unique values for each variable and RandomForest function can only assign 53.

For further study, researchers should execute these models using Python on computers with generous memory

to utilize the whole dataset and focus on the features appearing to be significant in this analysis. It could be an

alternative way to study those machines located in the same place or fixed one of other features with gargantuan

number of categories to gain a more precision prediction result and different sets of observations could be selected

to verify the significance of each property of the machines.

15
References

1. Bronshtein, A. (2019, February 27). Train/Test Split and Cross Validation in Python. Retrieved from
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

2. Computing Classification Evaluation Metrics in R. (n.d.). Retrieved from

https://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html

3. Microsoft Malware Prediction. (n.d.). Retrieved from https://www.kaggle.com/c/microsoft-malware-

prediction/data

4. Microsoft Malware Prediction. (n.d.). Retrieved from https://www.kaggle.com/c/microsoft-malware-

prediction/discussion/84065

5. Openspecs-Office. (n.d.). [MS-OE376]: Part 4 Section 7.6.2.39, LCID (Locale ID). Retrieved from
https://docs.microsoft.com/en-us/openspecs/office_standards/ms-oe376/6c085406-a698-4e12-9d4d-
c3b0ee3dbc4a

6. Person. (2019, January 8). Rstudio, is it useable for large data sets (9gb )? Retrieved from
https://community.rstudio.com/t/rstudio-is-it-useable-for-large-data-sets-9gb/21138/7

7. Yurtoğlu, N. (2018). http://www.historystudies.net/dergi//birinci-dunya-savasinda-bir-asayis-sorunu-

sebinkarahisar-ermeni-isyani20181092a4a8f.pdf. History Studies International Journal of History, 10(7), 241–
264. doi: 10.9737/hist.2018.658

Data Science for Malware Prediction
100% (1)
Data Science for Malware Prediction
16 pages
FRP Design
No ratings yet
FRP Design
20 pages
Malware Detection with ML Techniques
No ratings yet
Malware Detection with ML Techniques
8 pages
Survey Paper of Group 7
No ratings yet
Survey Paper of Group 7
9 pages
Malware Detection: Rahul R S (1BM17IS066) Vikram K (1BM17IS089) Rithvik M (1BM17IS068)
No ratings yet
Malware Detection: Rahul R S (1BM17IS066) Vikram K (1BM17IS089) Rithvik M (1BM17IS068)
17 pages
Malware Detection and Evasion With Machine Learning Techniques: A Survey
No ratings yet
Malware Detection and Evasion With Machine Learning Techniques: A Survey
9 pages
Malware Detection Using Machine Leaning
No ratings yet
Malware Detection Using Machine Leaning
9 pages
Research Paper
No ratings yet
Research Paper
8 pages
Dynamic Malware Detection via Deep Learning
No ratings yet
Dynamic Malware Detection via Deep Learning
16 pages
Major Project
No ratings yet
Major Project
10 pages
Unifying Traditional and Machine Learning Approaches For Robust Malware Classification
No ratings yet
Unifying Traditional and Machine Learning Approaches For Robust Malware Classification
6 pages
Research Paper
No ratings yet
Research Paper
8 pages
Thesis
No ratings yet
Thesis
76 pages
Malware Detection Using Machine Learning
No ratings yet
Malware Detection Using Machine Learning
4 pages
Windows Malware Detection
No ratings yet
Windows Malware Detection
14 pages
Development of Malware Detection and Analysis Mode
No ratings yet
Development of Malware Detection and Analysis Mode
50 pages
Malware Detection for CS Students
No ratings yet
Malware Detection for CS Students
30 pages
Malware Detection Using Machine Learning
No ratings yet
Malware Detection Using Machine Learning
5 pages
Akron 1311042709
No ratings yet
Akron 1311042709
104 pages
Machine Learning for Malware Detection
No ratings yet
Machine Learning for Malware Detection
16 pages
IET Information Security - 2020 - Ghouti - Malware Classification Using Compact Image Features and Multiclass Support
No ratings yet
IET Information Security - 2020 - Ghouti - Malware Classification Using Compact Image Features and Multiclass Support
11 pages
Symmetry 14 02304
No ratings yet
Symmetry 14 02304
11 pages
Malware Analysis Using Python and Kaggle Dataset
No ratings yet
Malware Analysis Using Python and Kaggle Dataset
4 pages
Malware Analysis Using Machine Learning (Paper Presented)
No ratings yet
Malware Analysis Using Machine Learning (Paper Presented)
69 pages
Salifyanji & Bethsaida Kmu
No ratings yet
Salifyanji & Bethsaida Kmu
12 pages
Malware Detection in Health Sensors Using ML
No ratings yet
Malware Detection in Health Sensors Using ML
74 pages
Android Malware Detection
No ratings yet
Android Malware Detection
17 pages
Applsci 12 08604 v2
No ratings yet
Applsci 12 08604 v2
21 pages
Group 7
No ratings yet
Group 7
25 pages
Thesis Final PDF
No ratings yet
Thesis Final PDF
93 pages
Malware - Detection - Research - Paper - Updated Soheb6
No ratings yet
Malware - Detection - Research - Paper - Updated Soheb6
8 pages
Malware Application Detection Using Machine Learning
No ratings yet
Malware Application Detection Using Machine Learning
7 pages
Amutenda r206668v Technical Paper
No ratings yet
Amutenda r206668v Technical Paper
5 pages
6 Thsemminiproject
No ratings yet
6 Thsemminiproject
12 pages
MCA Thesis 21MCA1088 Vikku Kumar
No ratings yet
MCA Thesis 21MCA1088 Vikku Kumar
72 pages
Document 14
No ratings yet
Document 14
18 pages
Mirai Botnet Detection: A Machine Learning Approach: Journal of Nonlinear Analysis and Optimization January 2025
No ratings yet
Mirai Botnet Detection: A Machine Learning Approach: Journal of Nonlinear Analysis and Optimization January 2025
6 pages
Detecting Obfuscated Malware with AI
No ratings yet
Detecting Obfuscated Malware with AI
6 pages
Fin Irjmets1679988564
No ratings yet
Fin Irjmets1679988564
6 pages
REPORT
No ratings yet
REPORT
14 pages
A Survey of Machine Learning Methods and Challenges For Windows Malware Classification
No ratings yet
A Survey of Machine Learning Methods and Challenges For Windows Malware Classification
52 pages
Jijo Renj
No ratings yet
Jijo Renj
4 pages
Ensemble Classifier Design and Performance Evaluation For Intrusion Detection Using UNSW-NB15 Dataset
No ratings yet
Ensemble Classifier Design and Performance Evaluation For Intrusion Detection Using UNSW-NB15 Dataset
131 pages
Malware Application Detection Using Machine Learning
No ratings yet
Malware Application Detection Using Machine Learning
8 pages
BlackBook-Report FY-ML MalwareDetection1
No ratings yet
BlackBook-Report FY-ML MalwareDetection1
48 pages
Tuning The K Value in K-Nearest Neighbors For Malware Detection
No ratings yet
Tuning The K Value in K-Nearest Neighbors For Malware Detection
8 pages
Deep Learning for Malware Detection
No ratings yet
Deep Learning for Malware Detection
28 pages
08 Rohit Final Malware Research Paper
No ratings yet
08 Rohit Final Malware Research Paper
13 pages
Compusoft, 3 (10), 1116-1123 PDF
No ratings yet
Compusoft, 3 (10), 1116-1123 PDF
8 pages
A Survey On Intrusion Detection System Using Machine Learning Techniques
No ratings yet
A Survey On Intrusion Detection System Using Machine Learning Techniques
7 pages
Malware Detection with Machine Learning
No ratings yet
Malware Detection with Machine Learning
29 pages
The State-of-the-Art in AI-Based Malware Detection Techniques: A Review
No ratings yet
The State-of-the-Art in AI-Based Malware Detection Techniques: A Review
18 pages
Caravan Insurance Prediction Analysis
No ratings yet
Caravan Insurance Prediction Analysis
8 pages
A Comprehensive Survey On Identification of Malware Types and Malware Classification Using Machine Learning Techniques
No ratings yet
A Comprehensive Survey On Identification of Malware Types and Malware Classification Using Machine Learning Techniques
8 pages
Ijcna 2021 o 56
No ratings yet
Ijcna 2021 o 56
18 pages
Configuration Manual for Cyber Security Project
No ratings yet
Configuration Manual for Cyber Security Project
19 pages
Ahmadi Et Al. - 2016 - Novel Feature Extraction, Selection and Fusion For Effective Malware Family Classification
No ratings yet
Ahmadi Et Al. - 2016 - Novel Feature Extraction, Selection and Fusion For Effective Malware Family Classification
7 pages
Malware
No ratings yet
Malware
10 pages
A Survey of The Recent Trends in Deep Le
No ratings yet
A Survey of The Recent Trends in Deep Le
30 pages
NTG6 Debug and Testing Overview
No ratings yet
NTG6 Debug and Testing Overview
1 page
Generating 256-Bit Stream with USB3.0 Hub
No ratings yet
Generating 256-Bit Stream with USB3.0 Hub
1 page
Graph Theory and Its Applications 3rd 6fe3
100% (5)
Graph Theory and Its Applications 3rd 6fe3
593 pages
ML
No ratings yet
ML
2 pages
Book Proposal For IGI Global
No ratings yet
Book Proposal For IGI Global
3 pages
Eckmar Marketplace Script Installation Guide
No ratings yet
Eckmar Marketplace Script Installation Guide
5 pages
J59: Returns and Complaints: Sap Ag
No ratings yet
J59: Returns and Complaints: Sap Ag
6 pages
Future PDF
100% (5)
Future PDF
72 pages
Mechanism Design Visual and Programmable Approaches 1st Russell PDF Download
No ratings yet
Mechanism Design Visual and Programmable Approaches 1st Russell PDF Download
322 pages
System and Network Adminstration Final Exam Feb, 2023
94% (16)
System and Network Adminstration Final Exam Feb, 2023
2 pages
The Lord of The Rings Return To Moria Download and Buy Today - Epic Games Store
No ratings yet
The Lord of The Rings Return To Moria Download and Buy Today - Epic Games Store
1 page
Electronic Communication and Summons Under BNSS
No ratings yet
Electronic Communication and Summons Under BNSS
3 pages
SQL Commands for Student Table Operations
No ratings yet
SQL Commands for Student Table Operations
45 pages
TikTok's Impact on College Motivation
No ratings yet
TikTok's Impact on College Motivation
17 pages
Brochure CE Construction Project Management September 2020
No ratings yet
Brochure CE Construction Project Management September 2020
15 pages
Week 1 - Lecture 3 - WDLC
No ratings yet
Week 1 - Lecture 3 - WDLC
21 pages
23PVPCSC02A Mobile Computing
No ratings yet
23PVPCSC02A Mobile Computing
2 pages
Q-Exactive-GC Preinstallation Guide
No ratings yet
Q-Exactive-GC Preinstallation Guide
67 pages
Testbank For Business Analytics 2nd Edition Jaggia
No ratings yet
Testbank For Business Analytics 2nd Edition Jaggia
17 pages
Infographics TTL
No ratings yet
Infographics TTL
2 pages
Soal Test Calon POLICE 2
No ratings yet
Soal Test Calon POLICE 2
6 pages
B2 First For Schools: Technology Poster Lesson Plan and Activities
No ratings yet
B2 First For Schools: Technology Poster Lesson Plan and Activities
8 pages
Lecture 7 - DC Transformer Model and Effect of Losses
No ratings yet
Lecture 7 - DC Transformer Model and Effect of Losses
36 pages
John Carlo AlejandroBSMT A1F
No ratings yet
John Carlo AlejandroBSMT A1F
4 pages
Steam Turbine Insulation Design
No ratings yet
Steam Turbine Insulation Design
29 pages
The Output of Information
No ratings yet
The Output of Information
14 pages
Component Testing Using An Oscilloscope With Integrated Waveform Generator
No ratings yet
Component Testing Using An Oscilloscope With Integrated Waveform Generator
10 pages
6D Series Tractors Tier 3 6100D 6110D 6115D 6125D 6130D and 6140D Filter Overview With Service Intervals and Capacities
No ratings yet
6D Series Tractors Tier 3 6100D 6110D 6115D 6125D 6130D and 6140D Filter Overview With Service Intervals and Capacities
2 pages
Huawei Training for IT Professionals
No ratings yet
Huawei Training for IT Professionals
27 pages
Forencis Assignment Contemps Note
No ratings yet
Forencis Assignment Contemps Note
24 pages
Configuração Digit - Optitex
No ratings yet
Configuração Digit - Optitex
1 page
Ge C++
No ratings yet
Ge C++
4 pages
Startup Success: Vision Over Money
No ratings yet
Startup Success: Vision Over Money
14 pages
Apple vs Samsung: B2B Marketing Analysis
No ratings yet
Apple vs Samsung: B2B Marketing Analysis
11 pages

Microsoft Malware Analysis

Uploaded by

Microsoft Malware Analysis

Uploaded by

ASSIGNMENT FRONT SHEET

Course Name: ALY6015 20904 Intermediate Analytics

Professor Name: ChuanLi Jiang

Student Class: Fall 2019 CPS Term: A. 2020

Final Project Report

Completion Date: February 13th Due Time:12:00am

Group Member & Student ID:

Jingyan Qiao, 001368423

Jiayi Wang, 001082307

Quoc Tuong Dong, 001089611

which the machine was running.

based on the findings.

Data & Methodology:

were studied in this analysis.

The features assumed to be significant in this study were:

EngineVersion: the version of the engine.

AvSigVersion: the version of the anti-virus signature database.

AvProductStateIdentifier: the status of the anti-virus software used on the machine.

Wdft_IsGamer: whether the machine was a gamer machine.

CityIdentifier: where the machine was located in.

CountryIdentifier: the country the machine was located in.

OsVer: the version of the current operating system.

OsPlatform: the platform used for the operating system.

Firewall: whether the firewall was built on the machine.

HasTpm: whether the machine has a trusted platform module.

SmartScreen: smart screen enabled on the machine.

Census_SystemVolumeTotalCapacity: the size of the system volume installed on the machine.

Census_OsBuildRevision: the latest version of the operating system on the machine.

Platform: the platform installed on the machine.

Census_TotalPhysicalRAM: the number of the Random-Access Memory.

Census_HasOpticalDiskDrive: whether the machine has an optical disk drive.

Census_OsEdition: the edition of the current operating system.

Census_IsTouchEnabled: whether the machine is a touch device.

⚫ Logistic Regression Model.

at a 95% confidence interval.

b. Plot the predicted probability of the dependent variable.

⚫ LASSO Regularization Method: Logistic Regression Model.

a. Plot the cross validation curve with all selected predictors.

model with lasso regularization method.

⚫ Random Forest Model.

a. Encode the target features as factors.

of data to determine the optimal decision tree.

⚫ Gradient Boosting Decision Trees (GBDT) Model.

where n is the number of different labels.

the number of values with a frequency greater or equal than 2.

Analysis & Discussion:

Logistic Regression Analysis:

Predicted logit of (HasDetections) = 0.7084 *Platformwindows8 + 0.9071*SkuEditionEduation

+ 2.108*SkuEditionInvalid + 0.933*SkuEditionPro + 0.3086*IsProtected

+ 0.1395*Wdft_IsGamer + 0.7202*AppVersion. (1)

Table 1: Confusion Matrix (Logistic Regression Model).

Figure 1: Predicted Probability of the Outcome variable, HasDetections

small R-squared value above.

Logistic Regression Analysis with LASSO Regularization Method:

Figure 2: Mean Squared Error Versus the Logit of (lambda).

Predicted logit of (HasDetections) = 0.1017 *Platformwindows8 + 0.0843*IsProtected

Random Forest Model:

Gradient Boosting Decision Trees (GBDT) Model:

lower than 50%.

Table 3: Confusion matrix (GBDT Model with 60,000 observations).

Table 5: Confusion Matrix (GBDT Model with 100,000 observations).

Conclusion & Recommendations:

which results in a higher accuracy rate of 62.02%.

to verify the significance of each property of the machines.

2. Computing Classification Evaluation Metrics in R. (n.d.). Retrieved from

3. Microsoft Malware Prediction. (n.d.). Retrieved from https://www.kaggle.com/c/microsoft-malware-

4. Microsoft Malware Prediction. (n.d.). Retrieved from https://www.kaggle.com/c/microsoft-malware-

7. Yurtoğlu, N. (2018). http://www.historystudies.net/dergi//birinci-dunya-savasinda-bir-asayis-sorunu-

You might also like

Predicted logit of (HasDetections) = 0.7084 Platformwindows8 + 0.9071SkuEditionEduation

+ 2.108SkuEditionInvalid + 0.933SkuEditionPro + 0.3086*IsProtected

+ 0.1395Wdft_IsGamer + 0.7202AppVersion. (1)

Predicted logit of (HasDetections) = 0.1017 Platformwindows8 + 0.0843IsProtected