
2IA02 Fauzan Ramadhan

The document summarizes the steps taken to prepare a dataset for machine learning modeling. It includes: 1) Importing and reviewing the adult census dataset, which contains demographic and financial information. 2) Dropping rows with missing values and encoding categorical variables as numeric using OrdinalEncoder. 3) Separating numeric and categorical features, target variable, and assigning IDs. 4) Normalizing numeric features using StandardScaler for preprocessing. The dataset is prepared for building machine learning models by handling missing data, encoding variables, and normalizing features.

Uploaded by

2IA02Faris Hidayat Arrahman


Fauzan Ramadhan

2IA02
50421497

Import Dataset

Import the Pandas and NumPy libraries, then load the dataset, which contains demographic and financial information.

# Import libraries
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('https://gitlab.com/andreass.bayu/file-directory/-/raw/main/adult.csv')

# Show the first 5 rows
data.head(5)

   age  workclass  fnlwgt  education     educational-num  marital-status      occupation         relationship  race   gender  ...
0  25   Private    226802  11th          7                Never-married       Machine-op-inspct  Own-child     Black  Male    ...
1  38   Private    89814   HS-grad       9                Married-civ-spouse  Farming-fishing    Husband       White  Male    ...
2  28   Local-gov  336951  Assoc-acdm    12               Married-civ-spouse  Protective-serv    Husband       White  Male    ...
3  44   Private    160323  Some-college  10               Married-civ-spouse  Machine-op-inspct  Husband       Black  Male    ...

Review Dataset

# Show summary statistics for each numeric column
data.describe()

age fnlwgt educational-num capital-gain capital-loss hours-per-week

count 48842.000000 4.884200e+04 48842.000000 48842.000000 48842.000000 48842.000000

mean 38.643585 1.896641e+05 10.078089 1079.067626 87.502314 40.422382

std 13.710510 1.056040e+05 2.570973 7452.019058 403.004552 12.391444

min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000

25% 28.000000 1.175505e+05 9.000000 0.000000 0.000000 40.000000

50% 37.000000 1.781445e+05 10.000000 0.000000 0.000000 40.000000

75% 48.000000 2.376420e+05 12.000000 0.000000 0.000000 45.000000

max 90.000000 1.490400e+06 16.000000 99999.000000 4356.000000 99.000000

# Show the data type of each column
data.dtypes

age int64
workclass object
fnlwgt int64
education object
educational-num int64
marital-status object
occupation object
relationship object
race object
gender object
capital-gain int64
capital-loss int64
hours-per-week int64
native-country object
income object
dtype: object

# Show the dataset dimensions (number of rows, number of attributes)
data.shape

(48842, 15)

# Count the rows per category for every categorical (object) column
for col in data.columns:
  if data[col].dtype == "object":
    print('Attribute name:',col)
    print('-------------------')
    print(data[col].value_counts())
    print('-------------------')

Attribute name: workclass

-------------------
Private 33906
Self-emp-not-inc 3862
Local-gov 3136
? 2799
State-gov 1981
Self-emp-inc 1695
Federal-gov 1432
Without-pay 21
Never-worked 10
Name: workclass, dtype: int64
-------------------
Attribute name: education
-------------------
HS-grad 15784
Some-college 10878
Bachelors 8025
Masters 2657
Assoc-voc 2061
11th 1812
Assoc-acdm 1601
10th 1389
7th-8th 955
Prof-school 834
9th 756
12th 657
Doctorate 594
5th-6th 509
1st-4th 247
Preschool 83
Name: education, dtype: int64
-------------------
Attribute name: marital-status
-------------------
Married-civ-spouse 22379
Never-married 16117
Divorced 6633
Separated 1530
Widowed 1518
Married-spouse-absent 628
Married-AF-spouse 37
Name: marital-status, dtype: int64
-------------------
Attribute name: occupation
-------------------
Prof-specialty 6172
Craft-repair 6112
Exec-managerial 6086
Adm-clerical 5611
Sales 5504
Other-service 4923
Machine-op-inspct 3022
? 2809
Transport-moving 2355
Handlers-cleaners 2072
Farming-fishing 1490
Tech-support 1446

# Import seaborn and matplotlib for visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Create a figure for the class-distribution plot
plt.figure(figsize=(8,5))

# Count the rows in each income class
sns.countplot(x="income", data=data)

<Axes: xlabel='income', ylabel='count'>

The dataset has an imbalanced class distribution, so a technique for handling imbalanced data should be applied.
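One common way to account for class imbalance when splitting data is to use stratified folds, so that every fold keeps the same class ratio as the full dataset. The sketch below illustrates the idea on a synthetic label vector, not on the census data; the 75/25 split and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels (assumption: 75 rows of class 1, 25 of class 0)
y = np.array([1] * 75 + [0] * 25)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Class proportions in the full label vector
classes, counts = np.unique(y, return_counts=True)
proportions = dict(zip(classes.tolist(), (counts / len(y)).tolist()))
print(proportions)

# Stratified folds preserve the class ratio in every test split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_ratios = []
for train_idx, test_idx in skf.split(X, y):
    fold_ratios.append(float(y[test_idx].mean()))
print(fold_ratios)
```

Each entry of `fold_ratios` is the share of class 1 in one test fold, so the list shows directly whether the folds mirror the overall ratio.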

Dataset preparation

# Make a copy of the dataframe
df = data.copy(deep = True)

# Convert "?" values to NaN so they can be handled as missing values
for col in data.columns:
    df[[col]] = data[[col]].replace('?', np.nan)

# Select the feature columns (every column except the last, 'income')
null_data = df.iloc[:,:-1]
# Count the null values in each attribute
null_data.isnull().sum()

age 0
workclass 2799
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 2809
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 857
dtype: int64

# Drop every row that contains a null value
df = df.dropna()

# Select the feature columns from the dataset
null_data = df.iloc[:,:-1]

# Re-check the null counts
null_data.isnull().sum()

age 0
workclass 0
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 0
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
dtype: int64
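The missing-value handling above can be tried on a small frame. This is a minimal sketch with made-up rows (the column values are hypothetical, not taken from the census file) showing how replacing the "?" marker with NaN changes the null counts and what dropna() then removes.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) using "?" as the missing-value marker
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", "?", "Local-gov", "Private"],
    "occupation": ["Sales", "?", "?", "Tech-support"],
})

# Replace the marker with NaN so pandas treats it as missing
toy_nan = toy.replace("?", np.nan)
null_counts = toy_nan.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value
toy_clean = toy_nan.dropna()
print(toy_clean.shape)
```

Only the rows without any "?" survive, which is the same reduction the notebook sees when the dataset goes from 48842 to 45222 rows.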

StandardScaler is a scikit-learn class that standardizes each numeric feature to zero mean and unit variance, so that features measured on very different scales do not deviate widely from one another.
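As a small check of what the scaler computes, the sketch below compares StandardScaler with the formula z = (x - mean) / std applied by hand. The four values are hypothetical and stand in for one numeric column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric column (hypothetical ages)
x = np.array([[25.0], [38.0], [28.0], [44.0]])

# StandardScaler computes z = (x - mean) / std, column by column
scaled = StandardScaler().fit_transform(x)

# Same computation by hand; np.std defaults to the population std (ddof=0),
# which is what StandardScaler uses
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```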

# Import the StandardScaler class

from sklearn.preprocessing import StandardScaler

# Collect the names of the int64 (numeric) columns
colname= []
for col in df.columns:
    if df[col].dtype == "int64":
        colname.append(col)

# Make copies of the dataset for data preparation
df_copy = df.copy(deep = True)
df_fe = df.copy()

# Build the dataframe of categorical features: drop the target and the numeric columns
df_fe.drop('income',axis='columns', inplace=True)
df_fe.drop(colname,axis='columns', inplace=True)

# Build the dataframe for the target class: drop every column except the last ('income')
df_cl = df.copy()
df_cl.drop(df_copy.columns[:-1],axis='columns', inplace=True)

# Create the scaler object
std_scaler = StandardScaler()
std_scaler

# Standardize the numeric attributes and store them in a new dataframe
df_norm = pd.DataFrame(std_scaler.fit_transform(df_copy[colname]), columns=colname)

# Import OrdinalEncoder from sklearn.preprocessing
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()

# Encode the categorical features as numeric features
for col in df_fe.columns[:]:
  if df_fe[col].dtype == "object":
    df_fe[col] = ord_enc.fit_transform(df_fe[[col]])

# Encode the categorical label as a binary label: 0 for ">50K", 1 for "<=50K"
df_cl["income"] = np.where(df_cl["income"].str.contains(">50K"), 0, 1)
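To make the two encodings concrete, here is a minimal sketch on four made-up rows (the values are hypothetical). It shows that OrdinalEncoder numbers the categories in sorted order, which imposes an arbitrary ordering on them, and that the np.where rule used in the notebook maps ">50K" to 0 and everything else to 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column and labels (hypothetical rows)
toy = pd.DataFrame({
    "workclass": ["Private", "Local-gov", "Private", "State-gov"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# OrdinalEncoder assigns integer codes following the sorted category order
enc = OrdinalEncoder()
codes = enc.fit_transform(toy[["workclass"]]).ravel()
print(enc.categories_[0])
print(codes)

# Same rule as the notebook: rows containing ">50K" become 0, all others 1
labels = np.where(toy["income"].str.contains(">50K"), 0, 1)
print(labels)
```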

# Add an id column to each of the three dataframes
df_norm.insert(0, 'id', range(0, 0 + len(df_norm)))
df_fe.insert(0, 'id', range(0, 0 + len(df_fe)))
df_cl.insert(0, 'id', range(0, 0 + len(df_cl)))

# Show the shapes of the processed dataframes
print(df_norm.shape)
print(df_fe.shape)
print(df_cl.shape)
(45222, 7)
(45222, 9)
(45222, 2)

# Merge all the dataframes on the id column
df_feature = pd.merge(df_norm,df_fe, on=["id"])
df_final = pd.merge(df_feature,df_cl, on=["id"])

# Drop the id column from the merged dataset
df_final.drop('id',axis='columns', inplace=True)

# Show the first 5 rows of the merged dataset
df_final.head(5)

        age     fnlwgt  educational-num  capital-gain  capital-loss  hours-per-week  workclass  education  marital-status  ...
0 -1.024983   0.350889        -1.221559     -0.146733      -0.21878       -0.078120        2.0        1.0             4.0  ...
1 -0.041455  -0.945878        -0.438122     -0.146733      -0.21878        0.754701        2.0       11.0             2.0  ...
2 -0.798015   1.393592         0.737034     -0.146733      -0.21878       -0.078120        1.0        7.0             2.0  ...
3  0.412481  -0.278420        -0.046403      0.877467      -0.21878       -0.078120        2.0       15.0             2.0  ...
4 -0.344079   0.084802        -1.613277     -0.146733      -0.21878       -0.910942        2.0        0.0             4.0  ...

Visualization

p = df_final.hist(figsize = (20,20))

A scatter matrix plot draws a grid of scatter plots, one for each pair of variables. It is especially useful for analyzing the shape of the relationships between variables, and it works best on data that is not too large. To use it, call the scatter_matrix function from pandas.plotting.

from pandas.plotting import scatter_matrix

p=scatter_matrix(df_final,figsize=(25, 25))
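Because a scatter matrix draws one panel per pair of columns and one point per row, it can become slow and unreadable on large frames. One possible workaround, sketched here on synthetic data (the frame, its column names, and the sample size of 500 are assumptions), is to plot a random sample of the rows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for a large dataframe (assumption: 10,000 rows, 3 columns)
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(10_000, 3)), columns=["a", "b", "c"])

# Plot only a random sample so the grid stays readable and quick to draw
sample = frame.sample(n=500, random_state=0)
axes = scatter_matrix(sample, figsize=(6, 6), diagonal="hist")
print(axes.shape)
```

The function returns a square array of axes, one per pair of columns, so three columns give a 3 x 3 grid.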

# Visualize the correlations in the data with a heatmap
import seaborn as sns
import matplotlib.pyplot as plt

# plot heatmap
plt.figure(figsize=(12,10))
p=sns.heatmap(df_final.corr(), annot=True,cmap ='RdYlGn')

Modelling with KNN

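import numpy as np
from sklearn.model_selection import KFold

# Use every column except the last as features. Selecting iloc[:, :] would also
# place the 'income' target in X and leak the label into the features; the
# recorded outputs below show that extra last column.
X=df_final.iloc[:,:-1].to_numpy()
y=df_final.iloc[:,-1:].to_numpy()
kf = KFold(n_splits=5)
kf.get_n_splits(X)

print(kf)

# After the loop, x_train/x_test and y_train/y_test hold the split of the last fold
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]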

KFold(n_splits=5, random_state=None, shuffle=False)

TRAIN: [ 9045 9046 9047 ... 45219 45220 45221] TEST: [ 0 1 2 ... 9042 9043 9044]
TRAIN: [ 0 1 2 ... 45219 45220 45221] TEST: [ 9045 9046 9047 ... 18087 18088 18089]
TRAIN: [ 0 1 2 ... 45219 45220 45221] TEST: [18090 18091 18092 ... 27131 27132 27133]
TRAIN: [ 0 1 2 ... 45219 45220 45221] TEST: [27134 27135 27136 ... 36175 36176 36177]
TRAIN: [ 0 1 2 ... 36175 36176 36177] TEST: [36178 36179 36180 ... 45219 45220 45221]
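The index ranges above are consecutive blocks because KFold does not shuffle by default. The sketch below shows the same behaviour on ten toy samples and, as an optional variant, what changes with shuffle=True; the toy array and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (hypothetical), indexed 0..9
X_toy = np.arange(10).reshape(-1, 1)

# Without shuffling, each test fold is a consecutive block of rows
kf_plain = KFold(n_splits=5)
plain_tests = [test.tolist() for _, test in kf_plain.split(X_toy)]
print(plain_tests)

# With shuffling, rows are assigned to folds at random (reproducibly, given a seed)
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_tests = [test.tolist() for _, test in kf_shuffled.split(X_toy)]
print(shuffled_tests)
```

In both cases every row appears in exactly one test fold; shuffling only changes which rows end up together, which matters when the file is sorted in some way.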

print('-------- x axis test ----------')
print(x_test)
print('-------- x axis train ---------')
print(x_train)
print('-------- y axis test ----------')
print(y_test)
print('-------- y axis train ---------')
print(y_train)
print('*******************************')

-------- x axis test ----------

[[ 0.63944887 1.45787715 -0.43812161 ... 1. 38.
1. ]
[ 1.24469679 -1.49349376 -0.046403 ... 1. 38.
1. ]
[-0.26842301 -0.14603391 -0.046403 ... 1. 38.
1. ]
...
[ 1.47166476 -0.35805983 -0.43812161 ... 0. 38.
1. ]
[-1.25195088 0.11127873 -0.43812161 ... 1. 38.
1. ]
[ 1.01772882 0.92951628 -0.43812161 ... 0. 38.
0. ]]
-------- x axis train ---------
[[-1.02498291 0.35088942 -1.22155881 ... 1. 38.
1. ]
[-0.04145504 -0.94587846 -0.43812161 ... 1. 38.
1. ]
[-0.79801494 1.39359159 0.73703421 ... 1. 38.
0. ]
...
[-0.344079 -0.77568407 -0.43812161 ... 1. 10.
0. ]
[ 0.86641684 0.04524191 1.52047141 ... 1. 37.
1. ]
[-0.64670296 0.84029651 -0.43812161 ... 0. 38.
1. ]]
-------- y axis test ----------
[[1]
[1]
[1]
...
[1]
[1]
[0]]
-------- y axis train ---------
[[1]
[1]
[0]
...
[0]
[1]
[1]]
*******************************

# Import the KNN model class KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Empty list to store the KNN accuracy scores
scores=[]
# Try different values of k for KNN (from k=1 to k=19)
lrange=list(range(1,20))
# KNN loop
for k in lrange:
  # Set k; the distance metric is left at the scikit-learn default
  knn=KNeighborsClassifier(n_neighbors=k)
  # Fit KNN on the training data
  knn.fit(x_train,y_train.ravel())
  # Predict on the test data
  y_pred=knn.predict(x_test)
  # Record the accuracy for this k
  scores.append(metrics.accuracy_score(y_test,y_pred))
plt.figure(2,figsize=(15,5))

optimal_k = lrange[scores.index(max(scores))]
print("The optimal k for KNN is %d" % optimal_k)
print("The best score is %.2f" % max(scores))

# Plot the results
plt.plot(lrange, scores,ls='dashed')
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy scores for each k in k-Nearest Neighbors')
plt.show()

The optimal k for KNN is 3

The best score is 0.95

The program preprocesses and prepares a dataset for machine learning, then builds and tests a K-Nearest Neighbors (KNN) classifier on splits produced by k-fold cross-validation. The steps in the script are:

1. Import the required libraries (Pandas, NumPy, Seaborn, Matplotlib, StandardScaler, OrdinalEncoder, KFold, and KNeighborsClassifier).
2. Load the dataset from a CSV file with Pandas.
3. Print the first 5 rows, the column data types, the dimensions, and the value counts of every categorical column.
4. Visualize the class distribution with a Seaborn countplot.
5. Replace missing values, represented as "?", with NaN.
6. Drop the rows that contain NaN.
7. Standardize the numeric features with StandardScaler.
8. Encode the categorical features with OrdinalEncoder.
9. Encode the class label as binary: 0 (income >50K) and 1 (income <=50K), following the np.where call.
10. Merge the standardized numeric features, the encoded categorical features, and the class label into a single dataset.
11. Visualize the correlations between features with a heatmap.
12. Split the dataset into 5 folds for k-fold cross-validation.
13. Train a KNN model for each k from 1 to 19 on the training split of the last fold and evaluate its accuracy on that fold's test split.
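The accuracy reported above comes from a single train/test split, the last of the five folds. A possible variant, sketched here on synthetic data generated with make_classification (a stand-in, not the census features), averages the accuracy over all folds for each k and keeps the scaler inside a pipeline so it is fitted on the training part of each fold only. The sample size, feature count, and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features (X_demo) and binary label (y_demo)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_scores = {}
for k in range(1, 20):
    # Scaling lives inside the pipeline so it is re-fit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean accuracy across the five folds
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

Averaging over folds gives a steadier estimate than one split, at the cost of fitting five models per value of k.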

Confusion Matrix Evaluation

# knn is the last model fitted in the loop above (k=19), not the model with the optimal k
y_pred = knn.predict(x_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap='PuRd' ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Text(0.5, 14.049999999999999, 'Predicted label')

The code above displays the confusion matrix of the k-NN model using the scikit-learn and seaborn libraries in Python. First, the k-NN model predicts the test data. Next, the confusion matrix is computed by comparing the actual target values of the test data with the predictions. The matrix is then drawn as a heatmap with seaborn. Finally, the title and axis labels of the heatmap are set with matplotlib.
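To show how the matrix is laid out, here is a minimal sketch on eight hypothetical test rows. In scikit-learn, rows correspond to the actual classes and columns to the predicted classes, in sorted label order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight test rows
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# Rows are actual classes, columns are predicted classes (label order 0, 1)
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Unpack the four cells for the binary case, treating class 1 as positive
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

The diagonal cells count correct predictions; the off-diagonal cells count the two kinds of error.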

#import classification_report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.91      0.79      0.85      2303
           1       0.93      0.97      0.95      6741

    accuracy                           0.93      9044
   macro avg       0.92      0.88      0.90      9044
weighted avg       0.93      0.93      0.92      9044

The code above prints a detailed classification report for the k-Nearest Neighbors (k-NN) model using scikit-learn in Python. The report shows the performance metrics (precision, recall, f1-score, and support) for each class in the test data. It is useful for evaluating the k-NN model and for judging whether the model predicts the test data well enough.
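The metrics in the report can be reproduced by hand from the prediction counts. The sketch below does this for class 1 on eight hypothetical labels and compares the result with scikit-learn's precision_recall_fscore_support; the labels are made up for the example.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels for eight test rows
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# Counts for class 1 treated as the positive class
tp = int(np.sum((y_true == 1) & (y_hat == 1)))
fp = int(np.sum((y_true == 0) & (y_hat == 1)))
fn = int(np.sum((y_true == 1) & (y_hat == 0)))

precision = tp / (tp + fp)  # of the rows predicted 1, the share that are 1
recall = tp / (tp + fn)     # of the rows that are 1, the share that were found
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn's per-class values; index 1 is class 1
p, r, f, support = precision_recall_fscore_support(y_true, y_hat)
print(precision, recall, f1)
print(p[1], r[1], f[1], support[1])
```

Support is simply the number of true rows in each class, which is why it sums to the size of the test set in the report.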

BIG DATA PRACTICUM – M7
FEATURE SELECTION AND MODEL TRAINING

ASSIGNMENT 2

Data is a collection of information or facts expressed in words, sentences, symbols, numbers, and other forms. It is obtained through searching and careful observation of particular sources.
The simplest example is student data (name, student ID number (NPM), address).

Today, data has grown into big data and is connected to the IoT (Internet of Things).
The Internet of Things is a concept in which objects are embedded with technologies such as sensors and software so that they can communicate, control, connect, and exchange data with other devices for as long as they are connected to the internet.
Examples of IoT include Smart City and Smart Home systems connected to the user's smartphone.

A cloud storage system is a service that lets users store and access data on remote servers over the internet. The data can be files, photos, videos, documents, and other types of data. Users can keep their data online and access it anytime and anywhere without storing it on local hardware such as a hard disk or flash drive.
Popular examples of cloud storage systems are:
1. Google Drive: Google's cloud storage service, which lets users store and access files from various devices.
2. Dropbox: a cloud storage service that lets users store files and share them with others.
3. Microsoft OneDrive: Microsoft's cloud storage service, integrated with the Windows operating system and Microsoft Office applications.
4. Amazon S3: the cloud storage service from Amazon Web Services, designed for storing large amounts of data and for cloud-based applications.
5. iCloud: Apple's cloud storage service, integrated with Apple devices such as the iPhone, iPad, and Mac.

The complex 3V architecture (Variety, Velocity, Volume), the 4V architecture (adding Veracity), and the 5V architecture (adding Value) refer to data architectures for systems that process and manage data in large quantities. They are designed to manage data that is very large, heterogeneous, and varied in format and type, and that is collected and analyzed at high speed.

1. Complex 3V architecture:
• Variety: data is produced in various formats: structured, semi-structured, and unstructured.
• Velocity: data is generated and processed in real time.
• Volume: data is generated and processed in large quantities.
2. Complex 4V architecture, which adds:
• Veracity: the data produced varies in quality. Inaccurate or irrelevant data must be removed before processing.
3. Complex 5V architecture, which adds:
• Value: the data produced and processed must have business value. Processing should be carried out with the expected business benefit in mind.

These architectures are used mainly in Big Data systems and offer a flexible, scalable approach that can handle very large and varied volumes of data. As technology continues to develop, the 5V architecture is expected to also take business value into account.
