
Before Feature Selection

This document summarizes the analysis of a dataset containing network flow data before performing feature selection. It shows that the dataset contains over 2.7 million rows and 50 columns, including non-numeric columns like flow IDs, IP addresses, and categories. Summary statistics are presented on the non-numeric columns, showing the number of unique values and percentage breakdown of the most frequent values in each column. This analysis provides an overview of the dataset content and distribution prior to feature selection.

Uploaded by

Rifqi Zumadi

KBJ_before_feature_selection http://localhost:8888/nbconvert/html/KBJ_before_feature_selection.ipy...

Overview Data
In [1]: import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]: import pandas as pd

data = pd.read_csv('C://Users/ditama/Downloads/Unicauca-dataset-April-June-2019-Network-flows
data.head()

Out[2]: flow_key src_ip_numeric src_ip src_port dst_ip dst_port

0 3acee4f4ea001cd5e6d9584d4036b53d 3232266497 192.168.121.1 67 172.16.255.185

1 974ec5991b439c9a7176b88be0c90df0 3232266497 192.168.121.1 67 172.16.255.186

2 3acee4f4ea001cd5e6d9584d4036b53d 3232266497 192.168.121.1 67 172.16.255.185

3 974ec5991b439c9a7176b88be0c90df0 3232266497 192.168.121.1 67 172.16.255.186

4 cfa7c2740072befaa89c202499729e08 3232266497 192.168.121.1 0 10.130.1.166

5 rows × 50 columns

In [3]: data.shape

Out[3]: (2704839, 50)

In [4]: data.columns

Out[4]: Index(['flow_key', 'src_ip_numeric', 'src_ip', 'src_port', 'dst_ip',


'dst_port', 'proto', 'pktTotalCount', 'octetTotalCount', 'min_ps',
'max_ps', 'avg_ps', 'std_dev_ps', 'flowStart', 'flowEnd',
'flowDuration', 'min_piat', 'max_piat', 'avg_piat', 'std_dev_piat',
'f_pktTotalCount', 'f_octetTotalCount', 'f_min_ps', 'f_max_ps',
'f_avg_ps', 'f_std_dev_ps', 'f_flowStart', 'f_flowEnd',
'f_flowDuration', 'f_min_piat', 'f_max_piat', 'f_avg_piat',
'f_std_dev_piat', 'b_pktTotalCount', 'b_octetTotalCount', 'b_min_ps',
'b_max_ps', 'b_avg_ps', 'b_std_dev_ps', 'b_flowStart', 'b_flowEnd',
'b_flowDuration', 'b_min_piat', 'b_max_piat', 'b_avg_piat',
'b_std_dev_piat', 'flowEndReason', 'category', 'application_protocol',
'web_service'],
dtype='object')

In [5]: data.info()

1 of 23 29/11/2022 12.16



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2704839 entries, 0 to 2704838
Data columns (total 50 columns):
# Column Dtype
--- ------ -----
0 flow_key object
1 src_ip_numeric int64
2 src_ip object
3 src_port int64
4 dst_ip object
5 dst_port int64
6 proto int64
7 pktTotalCount int64
8 octetTotalCount int64
9 min_ps int64
10 max_ps int64
11 avg_ps float64
12 std_dev_ps float64
13 flowStart float64
14 flowEnd float64
15 flowDuration float64
16 min_piat float64
17 max_piat float64
18 avg_piat float64
19 std_dev_piat float64
20 f_pktTotalCount int64
21 f_octetTotalCount int64
22 f_min_ps int64
23 f_max_ps int64
24 f_avg_ps float64
25 f_std_dev_ps float64
26 f_flowStart float64
27 f_flowEnd float64
28 f_flowDuration float64
29 f_min_piat float64
30 f_max_piat float64
31 f_avg_piat float64
32 f_std_dev_piat float64
33 b_pktTotalCount int64
34 b_octetTotalCount int64
35 b_min_ps int64
36 b_max_ps int64
37 b_avg_ps float64
38 b_std_dev_ps float64
39 b_flowStart float64
40 b_flowEnd float64
41 b_flowDuration float64
42 b_min_piat float64
43 b_max_piat float64
44 b_avg_piat float64
45 b_std_dev_piat float64
46 flowEndReason int64
47 category object
48 application_protocol object
49 web_service object
dtypes: float64(27), int64(17), object(6)
memory usage: 1.0+ GB


Let's take a look at all the non-numeric columns


In [6]: non_num_cols = [col for col in data.columns if data[col].dtype == 'O']
non_num_data = data[non_num_cols]
non_num_data

Out[6]: flow_key src_ip dst_ip category application_protocol

0 3acee4f4ea001cd5e6d9584d4036b53d 192.168.121.1 172.16.255.185 Network

1 974ec5991b439c9a7176b88be0c90df0 192.168.121.1 172.16.255.186 Network

2 3acee4f4ea001cd5e6d9584d4036b53d 192.168.121.1 172.16.255.185 Network

3 974ec5991b439c9a7176b88be0c90df0 192.168.121.1 172.16.255.186 Network

4 cfa7c2740072befaa89c202499729e08 192.168.121.1 10.130.1.166 Network

... ... ... ... ...

2704834 695ea899a18c6d2f90c8b2f6c9b70bdf 192.168.128.252 172.16.255.186 System

2704835 f8188e4364129e635fe032a3bda206ea 192.168.128.252 172.16.255.185 System

2704836 4deda0130e2054781655cb4bd4cb580d 192.168.128.252 172.16.255.186 System

2704837 8c07a45c0c48648ff56341d7a065b855 192.168.128.252 108.177.11.188 Web

2704838 a61c7ab8213996e502ac7f54fc97fb34 192.168.128.252 172.217.15.196 Web

2704839 rows × 6 columns

Number of unique values in each non-numeric column


In [7]: [(col, non_num_data[col].nunique()) for col in non_num_cols]

Out[7]: [('flow_key', 2344534),


('src_ip', 716),
('dst_ip', 104463),
('category', 24),
('application_protocol', 23),
('web_service', 141)]

In [8]: def summarize_cat(col_name):
    # value_counts() already returns counts sorted in descending order;
    # .items() replaces .iteritems(), which was removed in pandas 2.0
    sorted_values = non_num_data[col_name].value_counts().items()
    remaining_per = 100
    for (value, count) in sorted_values:
        per = count / len(non_num_data) * 100
        if per >= 1:
            print(f'{value} : {per:.2f}%')
        else:
            print(f'Others : {remaining_per:.2f}%')
            break
        remaining_per = remaining_per - per


In [9]: for col in non_num_cols:
    print(f"Summary of {col} column : ")
    summarize_cat(col)
    print('\n')


Summary of flow_key column :


Others : 100.00%

Summary of src_ip column :


192.168.128.3 : 5.59%
192.168.122.52 : 1.75%
192.168.125.17 : 1.58%
192.168.121.62 : 1.30%
192.168.127.13 : 1.26%
192.168.128.87 : 1.14%
Others : 87.38%

Summary of dst_ip column :


172.16.255.200 : 27.43%
172.16.255.183 : 5.46%
172.16.141.250 : 5.01%
Others : 62.10%

Summary of category column :


Web : 52.36%
Network : 16.39%
Unspecified : 9.21%
SocialNetwork : 5.58%
Chat : 2.79%
Download-FileTransfer-FileSharing : 2.62%
Media : 2.36%
Cloud : 1.87%
VoIP : 1.74%
Collaborative : 1.44%
System : 1.37%
Others : 2.27%

Summary of application_protocol column :


Unknown : 48.37%
TLS : 25.58%
DNS : 18.10%
HTTP : 4.75%
QUIC : 2.62%
Others : 0.59%

Summary of web_service column :


Google : 21.07%
DNS : 15.52%
TLS : 9.60%
Unknown : 9.21%
Microsoft : 6.37%
HTTP : 5.65%
Facebook : 4.47%
Amazon : 3.24%
GoogleServices : 3.23%
BitTorrent : 2.62%
YouTube : 2.06%
Messenger : 1.67%
HTTP_Proxy : 1.25%


Others : 14.04%
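The percentage breakdown printed above can also be computed without an explicit loop. The following is a minimal sketch, not the notebook's own code: the function name `summarize_shares`, the `min_pct` parameter, and the toy series are assumptions made for illustration. It relies on `value_counts(normalize=True)`, which returns relative frequencies sorted in descending order.

```python
import pandas as pd


def summarize_shares(series: pd.Series, min_pct: float = 1.0) -> dict:
    """Return {value: percent} for values at or above min_pct, plus an 'Others' bucket."""
    # Relative frequency of every value, expressed as a percentage
    shares = series.value_counts(normalize=True) * 100
    major = shares[shares >= min_pct]
    summary = {str(k): round(float(v), 2) for k, v in major.items()}
    # Everything below the threshold is lumped together
    others = 100.0 - float(major.sum())
    if others > 1e-9:
        summary["Others"] = round(others, 2)
    return summary


# Toy stand-in for a column such as 'category' (hypothetical values)
toy = pd.Series(["Web"] * 120 + ["Network"] * 79 + ["Chat"] * 1)
summary = summarize_shares(toy)
print(summary)
```

Returning a dictionary rather than printing makes the result easy to reuse, for example to build a bar chart per column.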

Exploratory Analysis for numeric columns


In [10]: num_cols = list(set(data.columns) - set(non_num_cols))
num_cols

Out[10]: ['min_ps',
'b_max_ps',
'f_std_dev_piat',
'f_avg_piat',
'b_flowDuration',
'flowDuration',
'b_pktTotalCount',
'f_flowEnd',
'b_flowEnd',
'f_flowStart',
'b_avg_piat',
'f_flowDuration',
'b_max_piat',
'octetTotalCount',
'b_min_piat',
'avg_ps',
'src_ip_numeric',
'b_std_dev_piat',
'flowStart',
'f_octetTotalCount',
'pktTotalCount',
'b_avg_ps',
'src_port',
'b_min_ps',
'f_avg_ps',
'b_std_dev_ps',
'max_piat',
'min_piat',
'std_dev_ps',
'flowEnd',
'std_dev_piat',
'f_max_piat',
'f_min_ps',
'f_max_ps',
'avg_piat',
'f_std_dev_ps',
'flowEndReason',
'dst_port',
'f_pktTotalCount',
'proto',
'max_ps',
'b_flowStart',
'b_octetTotalCount',
'f_min_piat']

In [11]: data[num_cols].describe()


Out[11]: min_ps b_max_ps f_std_dev_piat f_avg_piat b_flowDuration flowDuration b_pktTotalCou

count 2.704839e+06 2.704839e+06 2.704839e+06 2.704839e+06 2.704839e+06 2.704839e+06

mean 5.822987e+01 1.142665e+03 5.803303e+00 7.073287e+00 7.078825e+11 5.361398e+01

std 6.023631e+01 2.913506e+03 2.440810e+01 5.183097e+01 7.746749e+11 1.821647e+02

min 2.800000e+01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

25% 4.000000e+01 5.200000e+01 0.000000e+00 0.000000e+00 0.000000e+00 6.389618e-04

50% 5.200000e+01 1.360000e+02 0.000000e+00 1.208380e-02 0.000000e+00 1.756411e-01

75% 6.500000e+01 1.378000e+03 1.501036e+00 1.561180e+00 1.554586e+12 1.112728e+01

max 1.162400e+04 2.632000e+04 8.624964e+02 1.780822e+03 1.558212e+12 1.800202e+03

8 rows × 44 columns

In [12]: [col for col in num_cols if data[col].isnull().any()]

Out[12]: []

In [13]: print("range and no. of unique values in numeric columns")
for col in num_cols:
    print(f'{col}\tRange : {max(data[col]) - min(data[col])}, No. of unique values :


range and no. of unique values in numeric columns


min_ps Range : 11596, No. of unique values : 711
b_max_ps Range : 26320, No. of unique values : 14233
f_std_dev_piat Range : 862.496393918991, No. of unique values : 1301011
f_avg_piat Range : 1780.82152986526, No. of unique values : 1344234
b_flowDuration Range : 1558211564721.97, No. of unique values : 1189909
flowDuration Range : 1800.20165610313, No. of unique values : 1430642
b_pktTotalCount Range : 1017780, No. of unique values : 7015
f_flowEnd Range : 1558215380814.8389, No. of unique values : 2397265
b_flowEnd Range : 1559771334.33116, No. of unique values : 2282692
f_flowStart Range : 3817819.2345199585, No. of unique values : 2645081
b_avg_piat Range : 1780.82148790359, No. of unique values : 1179890
f_flowDuration Range : 1800.20165610313, No. of unique values : 1288078
b_max_piat Range : 1780.82148790359, No. of unique values : 967839
octetTotalCount Range : 2981111667, No. of unique values : 154581
b_min_piat Range : 1780.82148790359, No. of unique values : 155794
avg_ps Range : 11596.0, No. of unique values : 410222
src_ip_numeric Range : 2044, No. of unique values : 716
b_std_dev_piat Range : 839.900081515312, No. of unique values : 1148891
flowStart Range : 3817819.2345199585, No. of unique values : 2645081
f_octetTotalCount Range : 2955382240, No. of unique values : 56947
pktTotalCount Range : 2292424, No. of unique values : 8984
b_avg_ps Range : 15836.0, No. of unique values : 309031
src_port Range : 65535, No. of unique values : 61314
b_min_ps Range : 15836, No. of unique values : 841
f_avg_ps Range : 11596.0, No. of unique values : 210778
b_std_dev_ps Range : 11680.0, No. of unique values : 655659
max_piat Range : 1780.82109594345, No. of unique values : 1268402
min_piat Range : 1763.94893193245, No. of unique values : 310582
std_dev_ps Range : 9370.13341149918, No. of unique values : 833730
flowEnd Range : 3817819.12740016, No. of unique values : 2621434
std_dev_piat Range : 865.191153526306, No. of unique values : 1394435
f_max_piat Range : 1780.82152986526, No. of unique values : 1174066
f_min_ps Range : 11596, No. of unique values : 1034
f_max_ps Range : 26292, No. of unique values : 6980
avg_piat Range : 1763.94893193245, No. of unique values : 1554611
f_std_dev_ps Range : 7908.10729941073, No. of unique values : 634260
flowEndReason Range : 3, No. of unique values : 4
dst_port Range : 65535, No. of unique values : 33753
f_pktTotalCount Range : 2156204, No. of unique values : 5623
proto Range : 16, No. of unique values : 3
max_ps Range : 26292, No. of unique values : 14548
b_flowStart Range : 1559771334.33015, No. of unique values : 2288361
b_octetTotalCount Range : 2971893160, No. of unique values : 141155
f_min_piat Range : 1780.82152986526, No. of unique values : 243202
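Calling Python's built-in `max()` and `min()` on each column of a 2.7-million-row frame iterates element by element. A vectorized alternative is sketched below on a small toy frame; the column values are invented for illustration, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for data[num_cols] (values are hypothetical)
toy = pd.DataFrame({
    "proto": [1, 6, 17, 6, 17],
    "src_port": [67, 67, 443, 0, 53],
})

# One pass per statistic over all columns, then one row per column
stats = toy.agg(["min", "max", "nunique"]).T
stats["range"] = stats["max"] - stats["min"]
print(stats[["range", "nunique"]])
```

The resulting table has one row per column, so it can be sorted by `nunique` to separate low-cardinality columns from continuous ones.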

For columns with at most 50 unique values we plot histograms; for the others we only list the distribution of the most frequent values, as was done for the categorical columns.
In [14]: cols_for_hist = [col for col in num_cols if data[col].nunique() <= 50]
cols_for_hist, len(cols_for_hist)

Out[14]: (['flowEndReason', 'proto'], 2)


In [15]: cols_for_desc = [col for col in num_cols if data[col].nunique() > 50]


cols_for_desc

Out[15]: ['min_ps',
'b_max_ps',
'f_std_dev_piat',
'f_avg_piat',
'b_flowDuration',
'flowDuration',
'b_pktTotalCount',
'f_flowEnd',
'b_flowEnd',
'f_flowStart',
'b_avg_piat',
'f_flowDuration',
'b_max_piat',
'octetTotalCount',
'b_min_piat',
'avg_ps',
'src_ip_numeric',
'b_std_dev_piat',
'flowStart',
'f_octetTotalCount',
'pktTotalCount',
'b_avg_ps',
'src_port',
'b_min_ps',
'f_avg_ps',
'b_std_dev_ps',
'max_piat',
'min_piat',
'std_dev_ps',
'flowEnd',
'std_dev_piat',
'f_max_piat',
'f_min_ps',
'f_max_ps',
'avg_piat',
'f_std_dev_ps',
'dst_port',
'f_pktTotalCount',
'max_ps',
'b_flowStart',
'b_octetTotalCount',
'f_min_piat']

In [16]: data[cols_for_hist].hist(layout = (7,3), figsize = (12, 20))


plt.tight_layout()


Correlation Matrix
In [17]: corr = data[num_cols].corr()

In [18]: f = plt.figure(figsize = (25,25))


plt.matshow(corr, fignum=f.number)
plt.title('Correlation Matrix of Numeric columns in the dataset', fontsize = 20)
plt.xticks(range(len(num_cols)), num_cols, fontsize = 14, rotation = 90)
plt.yticks(range(len(num_cols)), num_cols, fontsize = 14)
plt.gca().xaxis.set_ticks_position('bottom')
cb = plt.colorbar(fraction = 0.0466, pad = 0.02)
cb.ax.tick_params(labelsize=10)
plt.show()
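A 44 x 44 heatmap is hard to read by eye, so it may help to list the strongly correlated column pairs explicitly before selecting features. The sketch below is an illustration only: the helper name `high_corr_pairs`, the 0.95 threshold, and the toy values are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.95) -> pd.Series:
    """List column pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Toy frame (hypothetical values): the first two columns are perfectly correlated
toy = pd.DataFrame({
    "pktTotalCount": [1, 2, 3, 4, 5],
    "octetTotalCount": [10, 20, 30, 40, 50],
    "min_ps": [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data the same call would be `high_corr_pairs(data[num_cols])`; one column from each reported pair is a candidate for removal.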


Preprocessing
In [19]: #check null
data.isnull().sum()


Out[19]: flow_key 0
src_ip_numeric 0
src_ip 0
src_port 0
dst_ip 0
dst_port 0
proto 0
pktTotalCount 0
octetTotalCount 0
min_ps 0
max_ps 0
avg_ps 0
std_dev_ps 0
flowStart 0
flowEnd 0
flowDuration 0
min_piat 0
max_piat 0
avg_piat 0
std_dev_piat 0
f_pktTotalCount 0
f_octetTotalCount 0
f_min_ps 0
f_max_ps 0
f_avg_ps 0
f_std_dev_ps 0
f_flowStart 0
f_flowEnd 0
f_flowDuration 0
f_min_piat 0
f_max_piat 0
f_avg_piat 0
f_std_dev_piat 0
b_pktTotalCount 0
b_octetTotalCount 0
b_min_ps 0
b_max_ps 0
b_avg_ps 0
b_std_dev_ps 0
b_flowStart 0
b_flowEnd 0
b_flowDuration 0
b_min_piat 0
b_max_piat 0
b_avg_piat 0
b_std_dev_piat 0
flowEndReason 0
category 0
application_protocol 0
web_service 0
dtype: int64

In [20]: #check duplicate


dups = data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 10


In [21]: #remove duplicate


print('Number of rows before discarding duplicates = %d' % (data.shape[0]))
data = data.drop_duplicates()
print('Number of rows after discarding duplicates = %d' % (data.shape[0]))

Number of rows before discarding duplicates = 2704839


Number of rows after discarding duplicates = 2704829

Feature Selection

Drop the identifier and categorical label columns, then check for columns with a single unique value

In [22]: ipdata = data.copy()

In [23]: ipdata.drop(['flow_key','src_ip_numeric','src_ip','dst_ip','category','application_protocol'

In [24]: single_unique_cols = [col for col in ipdata.columns if ipdata[col].nunique() == 1]


single_unique_cols

Out[24]: []

Final Features
In [25]: df = ipdata.copy()

In [26]: df.head()

Out[26]: src_port dst_port proto pktTotalCount octetTotalCount min_ps max_ps avg_ps std_dev_ps

0 67 67 17 22 7620 328 394 346.363636 25.010081

1 67 67 17 17 5670 328 354 333.529412 9.140200

2 67 67 17 43 15124 328 394 351.720930 26.098495

3 67 67 17 30 10086 328 352 336.200000 10.057833

4 0 0 1 1 56 56 56 56.000000 0.000000

5 rows × 43 columns

In [27]: df.shape

Out[27]: (2704829, 43)

Classification of the Web Service Label: DT, Naive Bayes, KNN, MLP, RF


In [28]: # training and test data
X = ipdata
Y = data['web_service']

In [29]: #splitting
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state

In [30]: X_train.shape

Out[30]: (1352414, 43)

In [31]: X_test.shape

Out[31]: (1352415, 43)

In [32]: # normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

In [33]: X_train_scaled

Out[33]: array([[5.92904555e-01, 6.75974670e-03, 3.12500000e-01, ...,


0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[7.65606165e-01, 6.75974670e-03, 3.12500000e-01, ...,
1.65325647e-04, 7.98713134e-04, 3.33333333e-01],
[8.28641184e-01, 1.22072175e-03, 3.12500000e-01, ...,
6.71648851e-04, 1.05979673e-04, 0.00000000e+00],
...,
[8.41199359e-01, 8.08728161e-04, 1.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[8.24292363e-01, 9.85503929e-01, 1.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[6.58701457e-01, 6.75974670e-03, 3.12500000e-01, ...,
2.75905144e-06, 1.03205274e-05, 0.00000000e+00]])

In [34]: X_test_scaled

Out[34]: array([[7.78744182e-01, 6.75984985e-03, 3.12500000e-01, ...,


8.03836539e-04, 3.88515652e-03, 3.33333333e-01],
[8.35736629e-01, 1.22074038e-03, 3.12500000e-01, ...,
7.80053657e-05, 1.38839226e-04, 1.00000000e+00],
[9.75173571e-01, 6.75984985e-03, 3.12500000e-01, ...,
2.85750517e-04, 1.62385749e-03, 1.00000000e+00],
...,
[7.09758145e-01, 8.08740501e-04, 1.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[7.61501488e-01, 1.22074038e-03, 3.12500000e-01, ...,
1.27463829e-03, 2.44122729e-03, 6.66666667e-01],
[7.99252308e-01, 1.35624256e-01, 1.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
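In [32] calls `fit_transform` on both halves, so the test set is rescaled with its own minimum and maximum rather than with the values learned from the training set. The usual pattern is to fit the scaler once on the training data and only call `transform` on the test data. The toy arrays below are assumptions chosen to make the difference visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (hypothetical values)
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.0], [4.0]])

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train_toy)  # learns min=0, max=10
X_test_s = scaler.transform(X_test_toy)        # reuses the training min/max

# Refitting on the test set stretches it to [0, 1] on its own scale
X_test_refit = MinMaxScaler().fit_transform(X_test_toy)

print(X_test_s.ravel())      # scaled with training statistics
print(X_test_refit.ravel())  # scaled with test statistics
```

With `transform`, a given raw value maps to the same scaled value in both sets, which is what a model trained on `X_train_s` expects.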


In [35]: from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
Y_train_encode = label_encoder.fit_transform(y_train)

In [36]: label_encoder2 = LabelEncoder()


Y_test_encode = label_encoder2.fit_transform(y_test)

In [37]: Y_train_encode

Out[37]: array([ 25, 1, 3, ..., 130, 90, 122])

In [38]: Y_test_encode

Out[38]: array([105, 31, 105, ..., 33, 39, 45])

In [39]: label_encoder.classes_

Out[39]: array(['AJP', 'Amazon', 'AmazonVideo', 'Apple', 'ApplePush', 'AppleStore',


'AppleiCloud', 'AppleiTunes', 'BJNP', 'BitTorrent', 'CNN',
'CiscoSkinny', 'CiscoVPN', 'Citrix', 'Cloudflare', 'DHCP', 'DNP3',
'DNS', 'DNSoverHTTPS', 'DataSaver', 'Deezer',
'Direct_Download_Link', 'Dropbox', 'FTP_CONTROL', 'FTP_DATA',
'Facebook', 'GMail', 'Github', 'Google', 'GoogleDocs',
'GoogleDrive', 'GoogleHangoutDuo', 'GoogleMaps', 'GooglePlus',
'GoogleServices', 'H323', 'HTTP', 'HTTP_Proxy', 'HotspotShield',
'IAX', 'ICMP', 'IMAPS', 'IMO', 'IPsec', 'IRC', 'Instagram', 'LDAP',
'LinkedIn', 'LotusNotes', 'MQTT', 'MSN', 'MS_OneDrive',
'Messenger', 'Microsoft', 'Mining', 'MsSQL-TDS', 'NFS', 'NTP',
'NestLogSink', 'NetBIOS', 'NetFlix', 'Office365', 'Ookla',
'OpenDNS', 'OpenVPN', 'Oracle', 'POP3', 'PS_VUE',
'Pando_Media_Booster', 'PlayStore', 'Playstation', 'PostgreSQL',
'QQ', 'QUIC', 'RDP', 'RTMP', 'RTP', 'RTSP', 'RX', 'Radius', 'SAP',
'SIP', 'SMBv1', 'SMBv23', 'SMTP', 'SNMP', 'SOCKS', 'SOMEIP',
'SSDP', 'SSH', 'STUN', 'Signal', 'Sina(Weibo)', 'Skype',
'SkypeCall', 'Slack', 'Snapchat', 'SoundCloud', 'Spotify',
'Starcraft', 'Steam', 'Syslog', 'TLS', 'Targus Dataspeed',
'TeamViewer', 'Telegram', 'Teredo', 'TikTok', 'Tor', 'Tuenti',
'Twitch', 'Twitter', 'UBNTAC2', 'UPnP', 'UbuntuONE',
'Unencrypted_Jabber', 'Unknown', 'VNC', 'Viber', 'Waze', 'WeChat',
'Webex', 'WhatsApp', 'WhatsAppCall', 'WhatsAppFiles', 'Whois-DAS',
'Wikipedia', 'WindowsUpdate', 'Xbox', 'Yahoo', 'YouTube', 'Zoom',
'eBay', 'eDonkey', 'sFlow'], dtype=object)

In [40]: label_encoder2.classes_


Out[40]: array(['104', 'AJP', 'Amazon', 'AmazonVideo', 'Apple', 'ApplePush',


'AppleStore', 'AppleiCloud', 'AppleiTunes', 'BGP', 'BJNP',
'BitTorrent', 'CNN', 'CiscoSkinny', 'CiscoVPN', 'Citrix',
'Cloudflare', 'DHCP', 'DNP3', 'DNS', 'DNSoverHTTPS', 'DataSaver',
'Deezer', 'Direct_Download_Link', 'Dropbox', 'FTP_CONTROL',
'FTP_DATA', 'Facebook', 'GMail', 'GTP', 'Github', 'Google',
'GoogleDocs', 'GoogleDrive', 'GoogleHangoutDuo', 'GoogleMaps',
'GooglePlus', 'GoogleServices', 'H323', 'HTTP', 'HTTP_Proxy',
'HotspotShield', 'IAX', 'ICMP', 'IMAPS', 'IMO', 'IPsec', 'IRC',
'Instagram', 'LDAP', 'LinkedIn', 'LotusNotes', 'MDNS', 'MQTT',
'MSN', 'MS_OneDrive', 'Messenger', 'Microsoft', 'Mining',
'MsSQL-TDS', 'MySQL', 'NFS', 'NTP', 'NestLogSink', 'NetBIOS',
'NetFlix', 'Office365', 'Ookla', 'OpenDNS', 'OpenVPN', 'Oracle',
'PS_VUE', 'Pando_Media_Booster', 'PlayStore', 'Playstation',
'PostgreSQL', 'QQ', 'QUIC', 'RDP', 'RTMP', 'RTP', 'RTSP', 'RX',
'Radius', 'SIP', 'SMBv1', 'SMBv23', 'SMTP', 'SMTPS', 'SNMP',
'SOCKS', 'SSDP', 'SSH', 'STUN', 'Signal', 'Sina(Weibo)', 'Skype',
'SkypeCall', 'Slack', 'Snapchat', 'SoundCloud', 'Spotify',
'Starcraft', 'Steam', 'Syslog', 'TLS', 'Targus Dataspeed',
'TeamViewer', 'Telegram', 'Teredo', 'TikTok', 'Tor', 'Tuenti',
'Twitch', 'Twitter', 'UBNTAC2', 'UbuntuONE', 'Unencrypted_Jabber',
'Unknown', 'VNC', 'Viber', 'Waze', 'WeChat', 'Webex', 'WhatsApp',
'WhatsAppCall', 'WhatsAppFiles', 'Whois-DAS', 'Wikipedia',
'WindowsUpdate', 'Xbox', 'Yahoo', 'YouTube', 'eBay', 'sFlow'],
dtype=object)
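The two class lists differ: Out[40] starts with '104' and contains labels such as 'BGP' and 'MDNS' that are absent from Out[39]. Because `LabelEncoder` numbers classes in sorted order, the same service name receives a different integer under each encoder ('AJP' is 0 in the first and 1 in the second), so predictions encoded with `label_encoder` cannot be compared with targets encoded with `label_encoder2`. A minimal sketch of the mismatch and of one possible remedy, a single encoder fitted on all labels, follows. The toy label lists and variable names are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy label sets (hypothetical): the test split contains a label unseen in train
y_train_toy = ["AJP", "Amazon", "DNS"]
y_test_toy = ["104", "AJP", "DNS"]

# Two independent encoders assign different integers to the same string
enc_train = LabelEncoder().fit(y_train_toy)
enc_test = LabelEncoder().fit(y_test_toy)
print(enc_train.transform(["AJP"])[0], enc_test.transform(["AJP"])[0])

# One encoder fitted on every label keeps the mapping consistent
shared = LabelEncoder().fit(y_train_toy + y_test_toy)
y_train_ids = shared.transform(y_train_toy)
y_test_ids = shared.transform(y_test_toy)
print(list(shared.classes_), list(y_train_ids), list(y_test_ids))
```

In the notebook this would amount to encoding `Y` once before `train_test_split`, so that `Y_train_encode` and `Y_test_encode` share one mapping.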

Decision Tree Model


In [41]: from sklearn import tree
clf_gini = tree.DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state

# fit the model


clf_gini.fit(X_train_scaled, Y_train_encode)
clf_gini.score(X_train_scaled,Y_train_encode)

Out[41]: 0.716117993454667

In [42]: y_pred_gini = clf_gini.predict(X_test_scaled )

In [43]: from sklearn.metrics import accuracy_score

tree_train_accuracy = clf_gini.score(X_train_scaled,Y_train_encode)
tree_accuracy = clf_gini.score(X_test_scaled,Y_test_encode)

print("Training score: {:.3f}".format(clf_gini.score(X_train_scaled, Y_train_encode


print("Test score: {:.3f}".format(clf_gini.score(X_test_scaled, Y_test_encode)))

Training score: 0.716


Test score: 0.002


In [44]: # Let's split the data into 5 folds.
# We will use this 'kf' (StratifiedKFold splitting strategy) object as input to cross_val_sco
# The folds are made by preserving the percentage of samples for each class.
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cnt = 1
# The split() method generates indices to split data into training and test sets.
for train_index, test_index in kf.split(X_train_scaled, Y_train_encode):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1

# Note that:
# cross_val_score() parameter 'cv' will by default use the StratifiedKFold splitting strategy if
# So you can bypass the above step and just specify cv=5 in the cross_val_score() function

C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684:
UserWarning: The least populated class in y has only 1 members, which is less than
n_splits=5.
warnings.warn(
Fold:1, Train set: 1081931, Test set:270483
Fold:2, Train set: 1081931, Test set:270483
Fold:3, Train set: 1081931, Test set:270483
Fold:4, Train set: 1081931, Test set:270483
Fold:5, Train set: 1081932, Test set:270482

In [45]: from sklearn.model_selection import cross_val_score


score = cross_val_score(tree.DecisionTreeClassifier(criterion='entropy', max_depth=
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')

C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684:
UserWarning: The least populated class in y has only 1 members, which is less than
n_splits=5.
warnings.warn(
Scores for each fold are: [0.82997822 0.83020375 0.829627 0.83093947 0.82995172]
Average score: 0.83

In [46]: # Let's split the data into 5 folds.
# We will use this 'kf' (StratifiedKFold splitting strategy) object as input to cross_val_sco
# The folds are made by preserving the percentage of samples for each class.
from sklearn.model_selection import StratifiedKFold
kf2 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cnt = 1
# The split() method generates indices to split data into training and test sets.
for train_index, test_index in kf2.split(X_test_scaled, Y_test_encode):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1

# Note that:
# cross_val_score() parameter 'cv' will by default use the StratifiedKFold splitting strategy if
# So you can bypass the above step and just specify cv=5 in the cross_val_score() function

C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684:
UserWarning: The least populated class in y has only 1 members, which is less than
n_splits=5.
warnings.warn(


Fold:1, Train set: 1081932, Test set:270483


Fold:2, Train set: 1081932, Test set:270483
Fold:3, Train set: 1081932, Test set:270483
Fold:4, Train set: 1081932, Test set:270483
Fold:5, Train set: 1081932, Test set:270483

In [47]: from sklearn.model_selection import cross_val_score


score = cross_val_score(tree.DecisionTreeClassifier(criterion='entropy', max_depth=
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')

C:\Users\ditama\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:684:
UserWarning: The least populated class in y has only 1 members, which is less than
n_splits=5.
warnings.warn(
Scores for each fold are: [0.82922402 0.82960852 0.82986361 0.83021484 0.82977488]
Average score: 0.83
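Both cross-validation runs operate on arrays that were scaled before the folds were created, and the second run cross-validates within the test half. One common way to keep preprocessing inside each fold is to wrap the scaler and the classifier in a pipeline and pass that to `cross_val_score`. The sketch below uses synthetic data from `make_classification`. The tree settings reuse the `criterion='entropy'` and `max_depth=10` visible in In [41], and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flow features and web_service labels
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# The scaler is refitted on the training part of every fold
pipe = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier(criterion="entropy", max_depth=10, random_state=42),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv)

print(f"Scores for each fold are: {scores}")
print(f"Average score: {scores.mean():.2f}")
```

Because the pipeline is cloned and refitted per fold, no statistics from a fold's validation part leak into its training part.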

Naive Bayes
In [48]: # train a Gaussian Naive Bayes classifier on the training set
from sklearn.naive_bayes import GaussianNB
# instantiate the model
gnb = GaussianNB()
# fit the model
gnb.fit(X_train_scaled, Y_train_encode)

Out[48]: ▾ GaussianNB

GaussianNB()

In [49]: print("Training accuracy = ",gnb.score(X_train_scaled,Y_train_encode))


#Print Test Accuracy
gnb_accuracy = gnb.score(X_test_scaled,Y_test_encode)
print("Testing accuracy = ",gnb.score(X_test_scaled,Y_test_encode))

Training accuracy = 0.04063696471642559


Testing accuracy = 0.002592399522335969

KNN Model
In [50]: #Model Classification KNN using n_neighbors = 3
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train_scaled, Y_train_encode)
# store the predicted response values

Out[50]: ▾ KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3)

In [51]: print("Training score: {:.3f}".format(neigh.score(X_train_scaled, Y_train_encode)))


print("Test score: {:.3f}".format(neigh.score(X_test_scaled, Y_test_encode)))


Training score: 0.798


Test score: 0.002

Multi Layer Perceptron


In [52]: from sklearn.neural_network import MLPClassifier

In [53]: mlp = MLPClassifier(hidden_layer_sizes=(3,2),activation='relu')


mlp

Out[53]: ▾ MLPClassifier
MLPClassifier(hidden_layer_sizes=(3, 2))

In [54]: mlp.fit(X_train_scaled,Y_train_encode)

Out[54]: ▾ MLPClassifier
MLPClassifier(hidden_layer_sizes=(3, 2))

In [55]: print("Training accuracy = ",mlp.score(X_train_scaled,Y_train_encode))


#Print Test Accuracy
print("Testing accuracy = ",mlp.score(X_test_scaled,Y_test_encode))

Training accuracy = 0.21055756595243763


Testing accuracy = 0.007640406236251446

Random Forest
In [56]: from sklearn.ensemble import RandomForestClassifier
# Use the Random Forest Classifier ensemble algorithm from scikit-learn
modelRF = RandomForestClassifier(n_estimators=1)

In [57]: modelRF.fit(X_train_scaled,Y_train_encode)

Out[57]: ▾ RandomForestClassifier
RandomForestClassifier(n_estimators=1)

In [58]: print("Training accuracy = ",modelRF.score(X_train_scaled,Y_train_encode))


#Print Test Accuracy
print("Testing accuracy = ",modelRF.score(X_test_scaled,Y_test_encode))

Training accuracy = 0.9069345629370887


Testing accuracy = 0.0038013479590214543

Evaluation With DT


In [59]: y_pred = clf_gini.predict(X_train_scaled )


from sklearn import metrics

tree_cm = metrics.confusion_matrix(Y_train_encode, y_pred)


plt.figure(figsize=(10,10))
sns.heatmap(tree_cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r'
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Confusion Matrix - score:'+str(metrics.accuracy_score(Y_train_encode
plt.title(all_sample_title, size = 15);
plt.show()
print(metrics.classification_report(Y_train_encode,y_pred))

C:\Users\ditama\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:133
4: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to
0.0 in labels with no predicted samples. Use `zero_division` parameter to control t
his behavior.
_warn_prf(average, modifier, msg_start, len(result))
C:\Users\ditama\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:133
4: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to
0.0 in labels with no predicted samples. Use `zero_division` parameter to control t
his behavior.
_warn_prf(average, modifier, msg_start, len(result))


precision recall f1-score support

0 0.00 0.00 0.00 20


1 0.55 0.15 0.24 43650
2 0.00 0.00 0.00 2505
3 0.53 0.10 0.16 4150
4 1.00 0.72 0.83 2453
5 0.00 0.00 0.00 146
6 0.00 0.00 0.00 1523
7 0.76 0.06 0.12 347
8 0.96 1.00 0.98 275
9 0.97 0.98 0.97 35599
10 0.00 0.00 0.00 23
11 0.00 0.00 0.00 14
12 0.90 0.70 0.79 219
13 0.00 0.00 0.00 11
14 0.39 0.03 0.06 8491
15 1.00 0.95 0.97 1028
16 0.00 0.00 0.00 5
17 0.73 0.86 0.79 209794
18 0.00 0.00 0.00 32
19 0.00 0.00 0.00 2633
20 0.00 0.00 0.00 9
21 0.00 0.00 0.00 4
22 0.37 0.07 0.11 9656
23 0.86 1.00 0.92 6
24 0.00 0.00 0.00 21
25 0.71 0.71 0.71 60588
26 0.91 0.43 0.59 10395
27 0.00 0.00 0.00 3473
28 0.70 0.76 0.73 284761
29 0.60 0.00 0.00 2591
30 0.96 0.12 0.21 3857
31 0.86 0.89 0.88 3320
32 0.00 0.00 0.00 174
33 0.00 0.00 0.00 196
34 0.54 0.33 0.41 43527
35 0.00 0.00 0.00 3
36 0.72 0.94 0.82 76281
37 0.98 0.90 0.94 16882
38 0.00 0.00 0.00 5
39 1.00 0.88 0.93 8
40 1.00 1.00 1.00 3748
41 0.93 0.77 0.84 576
42 1.00 1.00 1.00 11444
43 1.00 0.56 0.72 39
44 0.00 0.00 0.00 8
45 0.54 0.19 0.28 7843
46 1.00 0.99 1.00 2479
47 0.00 0.00 0.00 1412
48 0.00 0.00 0.00 2
49 1.00 0.58 0.74 215
50 0.90 0.21 0.35 7134
51 0.00 0.00 0.00 1699
52 0.93 0.90 0.91 22510
53 0.69 0.61 0.65 86418
54 1.00 0.96 0.98 83
55 1.00 0.75 0.85 284
56 0.00 0.00 0.00 2


57 1.00 1.00 1.00 3065


58 0.00 0.00 0.00 2
59 1.00 0.99 1.00 11917
60 0.00 0.00 0.00 1026
61 0.72 0.12 0.20 12193
62 0.00 0.00 0.00 209
63 0.00 0.00 0.00 20
64 1.00 0.50 0.67 2
65 0.80 0.88 0.84 93
66 0.00 0.00 0.00 1
67 0.00 0.00 0.00 91
68 0.53 0.89 0.66 76
69 0.83 0.16 0.26 2501
70 1.00 0.01 0.01 133
71 0.33 0.50 0.40 2
72 0.00 0.00 0.00 530
73 0.49 0.11 0.18 193
74 0.86 0.89 0.88 1004
75 1.00 0.10 0.18 10
76 1.00 0.46 0.63 13
77 1.00 0.50 0.67 2
78 0.98 1.00 0.99 869
79 0.99 0.98 0.99 112
80 0.00 0.00 0.00 2
81 0.97 0.25 0.40 110
82 0.67 0.86 0.75 7
83 0.98 0.99 0.98 305
84 0.82 0.99 0.89 413
85 1.00 1.00 1.00 6777
86 1.00 0.20 0.33 5
87 0.00 0.00 0.00 1
88 1.00 0.67 0.80 18
89 0.83 0.22 0.35 157
90 0.69 0.95 0.80 2173
91 0.69 0.11 0.19 2255
92 0.00 0.00 0.00 3
93 0.00 0.00 0.00 7908
94 0.62 0.27 0.37 179
95 0.00 0.00 0.00 711
96 0.00 0.00 0.00 50
97 0.00 0.00 0.00 58
98 0.51 0.02 0.04 1687
99 0.77 0.77 0.77 343
100 0.99 0.46 0.63 391
101 1.00 0.99 1.00 306
102 0.47 0.79 0.59 130260
103 0.00 0.00 0.00 98
104 0.94 0.89 0.91 519
105 0.60 0.01 0.01 423
106 0.98 1.00 0.99 562
107 0.00 0.00 0.00 275
108 0.00 0.00 0.00 10
109 0.00 0.00 0.00 3
110 0.00 0.00 0.00 251
111 0.43 0.08 0.13 5397
112 0.00 0.00 0.00 7
113 0.00 0.00 0.00 1
114 0.81 0.84 0.82 7598
115 0.53 0.86 0.66 183


116 0.98 0.97 0.98 124411


117 0.00 0.00 0.00 22
118 0.00 0.00 0.00 5
119 0.00 0.00 0.00 15
120 0.76 0.35 0.48 46
121 0.00 0.00 0.00 8
122 0.56 0.39 0.46 11961
123 0.80 0.65 0.72 507
124 0.00 0.00 0.00 3
125 0.00 0.00 0.00 1
126 0.56 0.01 0.01 835
127 0.78 0.28 0.42 5618
128 0.00 0.00 0.00 561
129 0.00 0.00 0.00 2239
130 0.69 0.39 0.50 27954
131 0.00 0.00 0.00 1
132 0.00 0.00 0.00 122
133 0.00 0.00 0.00 1
134 1.00 0.61 0.76 23

accuracy 0.72 1352414


macro avg 0.48 0.35 0.37 1352414
weighted avg 0.71 0.72 0.69 1352414

C:\Users\ditama\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:133
4: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to
0.0 in labels with no predicted samples. Use `zero_division` parameter to control t
his behavior.
_warn_prf(average, modifier, msg_start, len(result))
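The `UndefinedMetricWarning` is raised because some classes are never predicted, which leaves their precision undefined. Those rows appear as 0.00 in the report and pull the macro average (0.37 F1) well below the weighted average (0.69 F1), since the macro average counts every class equally regardless of support. The `zero_division` parameter named in the warning makes the choice explicit and silences the message. The toy labels below are assumptions for illustration.

```python
from sklearn.metrics import classification_report

# Toy labels (hypothetical): class 2 occurs in y_true but is never predicted
y_true_toy = [0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 0, 0]

# zero_division=0 reports undefined precision/F1 as 0.0 without a warning
report = classification_report(
    y_true_toy, y_pred_toy, zero_division=0, output_dict=True
)

print(report["2"])          # per-class metrics for the unpredicted class
print(report["macro avg"])  # unweighted mean over classes
```

Passing `output_dict=True` returns the report as nested dictionaries, which is convenient for picking out the classes with zero recall.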

In [60]: y = label_encoder.inverse_transform([28,17,102,116])
y

Out[60]: array(['Google', 'DNS', 'TLS', 'Unknown'], dtype=object)
