Malware Analysis Lab Dr Benabderrezak
Lab: Malware Analysis Using Python and a Kaggle Dataset
Objective
The objective of this lab is to analyze malware using Python by exploring a Kaggle dataset, performing feature
extraction, and applying machine learning techniques for malware classification.
Prerequisites
- Python: basic understanding of Python programming
- Pandas & NumPy: data manipulation and numerical operations
- Matplotlib & Seaborn: visualization libraries for data analysis
- Scikit-learn: machine learning tasks such as data preprocessing, model training, and evaluation
- Joblib: saving and loading trained models
- Kaggle account: required to download datasets
- Jupyter Notebook or Python IDE: recommended for running the lab
Step 1: Install Required Libraries
pip install pandas numpy scikit-learn matplotlib seaborn joblib
Step 2: Download the Malware Dataset from Kaggle
- Visit Kaggle and search for a malware dataset (e.g., "Microsoft Malware Classification")
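- Download the dataset and place it in your working directory
- The code in this lab assumes a single CSV file with one row per sample and a target column named 'label'; adjust the file and column names to match your dataset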
Step 3: Load the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset (adjust filename as needed)
df = pd.read_csv('malware_dataset.csv')
# Display basic info (column types, non-null counts) and the first rows
df.info()
print(df.head())
Step 4: Data Exploration and Preprocessing
1. Checking for Missing Values
- Before proceeding with data analysis, it is essential to check if there are any missing values in the dataset.
- Missing data can impact the accuracy of machine learning models.
# Check for missing values
print("Missing values:")
print(df.isnull().sum())
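If any missing values are found, handle them before training, for example by filling them with zeros or by imputing a
statistic such as the column median. Filling with zeros is the simplest option, but it can distort features for which
zero is a meaningful value.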
# Handle missing values (if any)
df.fillna(0, inplace=True)
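As one alternative to zero-filling, the sketch below imputes each column's median with scikit-learn's SimpleImputer. The toy DataFrame and its column names (file_size, num_sections) are hypothetical stand-ins for the lab's df, not columns from any specific Kaggle dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the lab's df; column names are hypothetical.
toy = pd.DataFrame({
    "file_size": [1200.0, np.nan, 5400.0, 800.0],
    "num_sections": [4.0, 6.0, np.nan, 3.0],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

In a full pipeline, the imputer would normally be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into training.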
2. Understanding Class Distribution
- Class distribution analysis helps in understanding if the dataset is imbalanced.
- In malware classification, an imbalanced dataset can lead to biased model predictions.
# Check class distribution
sns.countplot(x='label', data=df)
plt.title("Class Distribution")
plt.show()
If the dataset is highly imbalanced, techniques such as oversampling the minority class (e.g., SMOTE), undersampling
the majority class, or using class-weighted algorithms can be applied.
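A minimal sketch of random oversampling, using only scikit-learn's resample utility rather than SMOTE (which lives in the separate imbalanced-learn package). The toy data, the feature_a column, and the 'benign'/'malware' class names are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 8 benign samples, 2 malware samples (hypothetical).
toy = pd.DataFrame({
    "feature_a": range(10),
    "label": ["benign"] * 8 + ["malware"] * 2,
})

majority = toy[toy["label"] == "benign"]
minority = toy[toy["label"] == "malware"]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
print(balanced["label"].value_counts())
```

Resampling is usually applied to the training split only; oversampling before the train/test split would place copies of the same sample on both sides and inflate the test score.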
Step 5: Feature Engineering
from sklearn.preprocessing import LabelEncoder
# Encode the target labels as integers
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])
# Use every remaining column as a feature (drop non-numeric columns such as IDs or hashes first, if any)
features = df.drop(columns=['label'])
labels = df['label']
Step 6: Split Dataset into Training and Testing Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
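The split above is purely random, so with an imbalanced dataset the test set may end up with very few minority-class samples. The sketch below shows the stratify parameter of train_test_split, which keeps the class ratio the same in both splits. The synthetic arrays are an assumption for illustration; in the lab the equivalent would be passing stratify=labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 90 samples of class 0 and 10 samples of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Minority share in train:", y_tr.mean())
print("Minority share in test:", y_te.mean())
```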
Step 7: Train a Machine Learning Model
Commonly used machine learning algorithms for malware detection:
- Random Forest - Ensemble of decision trees that classifies by majority vote.
- Support Vector Machine (SVM) - Effective in high-dimensional feature spaces.
- Gradient Boosting (XGBoost, LightGBM) - Builds trees sequentially, each one correcting the errors of the previous ones.
- Neural Networks (Deep Learning) - Can learn more complex patterns, usually at the cost of more data and compute.
This lab uses a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
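To compare the algorithm families listed in this step, one option is to score each with cross-validation on the same data. The sketch below is illustrative only: it uses synthetic data from make_classification instead of the lab's dataset, and scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost or LightGBM, which are separate packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in for features/labels).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    # SVMs are sensitive to feature scale, so standardize first.
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

On an imbalanced malware dataset, a metric such as scoring="f1_macro" may be more informative than plain accuracy.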
Step 8: Feature Importance Analysis
feature_importances = pd.Series(model.feature_importances_, index=features.columns)
feature_importances.nlargest(10).plot(kind='barh')
plt.title("Top 10 Important Features")
plt.show()
Step 9: Save the Model
import joblib
joblib.dump(model, "malware_classifier.pkl")
Step 10: Detect Malware on New Data
# Load saved model
model = joblib.load("malware_classifier.pkl")
# Predict on new data (adjust filename accordingly).
# The file must contain the same feature columns as the training data and no 'label' column.
new_data = pd.read_csv('new_malware_sample.csv')
new_pred = model.predict(new_data)
print("Prediction:", new_pred)
Conclusion
In this lab, we explored a malware dataset, performed feature engineering, trained a machine learning model, and
evaluated its performance. This approach can be expanded with deep learning techniques and additional feature
extraction methods for better malware detection.