Knn

This document provides a comprehensive guide to implementing a KNN classifier in Python using libraries like pandas and scikit-learn. It covers data reading, feature preparation, model training, prediction, and various evaluation metrics such as accuracy, confusion matrix, precision, recall, F1-score, and ROC curve. The guide also includes code snippets and explanations for each step to facilitate understanding and implementation.

Uploaded by

henokgetnet0909
Copyright
© © All Rights Reserved
Python code for KNN classifier

1. Initial Message
#python

 Import the libraries you need before you use them; each section below imports its own.

 A line that starts with # does not affect the Python code, because it is a comment (which means Python ignores it).

2. Reading the Data


#python
import pandas as pd

 Import the pandas library, which helps you easily work with tables (rows and
columns).

#python
df = pd.read_csv(r"C:/Users/Mewded/Documents/Machine learning/01-data.csv")

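 Read the CSV file (your dataset) into a pandas table called df.
 The r prefix makes this a "raw string", so backslashes are not treated as escape characters. Windows paths written with backslashes need it; the forward slashes used here work with or without it.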
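As a minimal sketch of what pd.read_csv returns, the example below reads a small in-memory CSV instead of a file on disk. The column names and values are hypothetical (the contents of 01-data.csv are not shown in this guide), and the two paths at the end are made up only to illustrate the r prefix.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; not the real 01-data.csv.
csv_text = """id,gender,city,age,salary,purchased
1,M,A,25,30000,0
2,F,B,40,65000,1
3,F,A,33,48000,0
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the first CSV line

# A raw string keeps backslashes literal; forward slashes need no special handling.
raw_path = r"C:\data\new\01-data.csv"   # hypothetical path
plain_path = "C:/data/new/01-data.csv"  # hypothetical path
print(raw_path)
```

Without the r prefix, the "\n" inside the first path would be read as a newline character, which is why raw strings are recommended for backslash paths.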

3. Preparing the Features (x) and Labels (y)


#python
x = df.iloc[:, 3:5].values

 df is the pandas DataFrame read in the previous step.

 Take the 4th and 5th columns from your data. Their index positions are 3 and 4, because Python starts counting from 0.

These columns are your input features (independent variables).

iloc stands for "integer location": it selects rows and columns by index position, not by name.

: → means select all rows.

3:5 → means select the columns at positions 3 and 4 (Python slicing is start inclusive, end exclusive).

.values → extracts the values as a NumPy array (not as a pandas DataFrame or Series).

Thus, x is a NumPy array containing the values of the 4th and 5th columns for all rows.

#python
y = df.iloc[:, -1].values

 Take the last column (-1); this is your target (dependent variable).

: → all rows.

-1 → last column (negative indices count from the end).

.values → again, get a NumPy array.

Thus, y is a NumPy array containing the values of the last column for all rows — usually
used as the target variable (label) in machine learning.
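The slicing rules above can be checked on a tiny table. This is only a sketch: the six column names and the values are hypothetical, chosen so that positions 3 and 4 hold the features and the last column holds the label.

```python
import pandas as pd

# Hypothetical 6-column table; names and values are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["M", "F", "F"],
    "city": ["A", "B", "A"],
    "age": [25, 40, 33],
    "salary": [30000, 65000, 48000],
    "purchased": [0, 1, 0],
})

x = df.iloc[:, 3:5].values   # positions 3 and 4 -> "age" and "salary"
y = df.iloc[:, -1].values    # last column -> "purchased"

print(list(df.columns[3:5]))  # the names behind positions 3 and 4
print(x.shape)                # (3, 2): all rows, two feature columns
print(y.shape)                # (3,): one label per row
```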

Why define x and y this way?

In machine learning:

x = input features (what you use to predict)

y = target/output labels (what you are predicting)

Here, the two columns at positions 3 and 4 are picked as the features, and the last column is picked as the target.

4. Splitting into Train and Test Sets


#python
from sklearn.model_selection import train_test_split

 Import a tool that can split your data into training and testing parts.

#python
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=100, random_state=0)

 Split the data:


o x_train, y_train → used to teach the model.
o x_test, y_test → used to test how well the model learned.
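 test_size=100 means exactly 100 samples will be used for testing, because an integer is read as a number of samples (a float such as 0.25 would be read as a fraction of the data).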
 random_state=0 keeps the split the same every time you run it (for reproducibility).
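To see how test_size and random_state behave, here is a small sketch on made-up data. The 400-sample array is an assumption for illustration; the guide does not state how many rows the real dataset has.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 400 samples with 2 features and a binary label.
x = np.arange(800).reshape(400, 2)
y = np.arange(400) % 2

# An integer test_size is an absolute number of test samples.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=100, random_state=0)
print(x_train.shape, x_test.shape)  # (300, 2) (100, 2)

# A float test_size is a fraction of the data; 0.25 of 400 is also 100 samples.
_, x_test_frac, _, _ = train_test_split(x, y, test_size=0.25, random_state=0)
print(x_test_frac.shape)

# Using the same random_state again reproduces the same split.
_, x_test_again, _, _ = train_test_split(x, y, test_size=100, random_state=0)
print(np.array_equal(x_test, x_test_again))  # True
```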

5. Scaling (Standardizing) the Features


#python
from sklearn.preprocessing import StandardScaler

 Import StandardScaler, a tool that standardizes (scales) the features.


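 Why? KNN decides by measuring distances between points, so a feature with a large numeric range would dominate the distance. Standardizing puts every feature on a similar scale (0 mean, unit variance).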
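The effect of scale on distances can be sketched with three made-up points. The feature meanings (age and salary) and the numbers are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age in years and salary in currency units.
x = np.array([
    [25.0, 30000.0],   # reference point
    [60.0, 30500.0],   # very different age, similar salary
    [26.0, 40000.0],   # similar age, different salary
])

def distances_from_first(a):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(a[1:] - a[0], axis=1)

raw = distances_from_first(x)
scaled = distances_from_first(StandardScaler().fit_transform(x))

print(raw)     # the salary gap dominates: the second distance is far larger
print(scaled)  # after scaling, the two distances are of similar size
```

In raw units the 35-year age gap barely registers next to the salary gap. After standardization both features contribute on comparable terms.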

#python
scaler = StandardScaler()
 Create a scaler object.

#python
x_train = scaler.fit_transform(x_train)

 Fit the scaler on x_train (learn mean and std), then transform x_train.

#python
x_tests = scaler.transform(x_test)

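 Use the same scaler (with the mean and std learned from x_train) to transform x_test. The result is stored in x_tests, the name used in the prediction steps.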
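A small sketch of the fit-then-transform pattern, using made-up one-column data, shows that the test set is scaled with the training mean and std rather than its own. The _scaled variable names are used here only for clarity and are not the names used in the guide.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test features (one column for brevity).
x_train = np.array([[10.0], [20.0], [30.0]])
x_test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns mean=20, std about 8.165 from x_train
x_test_scaled = scaler.transform(x_test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(x_train_scaled.ravel())
print(x_test_scaled.ravel())

# The same result written out by hand: (value - training mean) / training std.
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)
print(np.allclose(x_test_scaled, manual))  # True
```

Calling fit_transform on the test set instead would let test statistics leak into preprocessing, which is why only transform is used there.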

6. Use the KNN classifier


#python
from sklearn.neighbors import KNeighborsClassifier

 Import the K-Nearest Neighbors (KNN) classifier from scikit-learn.

#python
model = KNeighborsClassifier(n_neighbors=7)

 Create a KNN model where the model will look at the 7 nearest neighbors to make
a decision.

#python
model.fit(x_train, y_train)

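 "Train" (fit) the model on the training data.

 KNN is a lazy learner: it does not build a generalizing model, so no real "training" happens.
 During .fit(), KNN mainly stores the training data (x_train, y_train); scikit-learn may also organize it in a search structure to speed up neighbor lookups.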
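To make the "store, then vote" idea concrete, here is a hand-written sketch of the KNN decision rule next to scikit-learn's classifier. The knn_predict helper and the data are hypothetical; the sketch assumes Euclidean distance and an unweighted majority vote, which match scikit-learn's defaults, and it is not scikit-learn's actual implementation.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, already-scaled training data with two well-separated classes.
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x_train, y_train, query, k):
    """Classify one point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(x_train - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

query = np.array([0.8, 0.9])
manual = knn_predict(x_train, y_train, query, k=3)

model = KNeighborsClassifier(n_neighbors=3).fit(x_train, y_train)
print(manual, model.predict([query])[0])  # both pick class 1 for this query
```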

7. Making Predictions
#python
y_pred = model.predict(x_tests)

 Use the stored (training) data to predict the labels for x_tests (your scaled test data).
 Prediction is the expensive step and happens at runtime: for every test sample, KNN searches the saved data for the nearest neighbors and takes a majority vote among them.
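The neighbor search behind a prediction can be inspected with the classifier's kneighbors method. The one-feature data below is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-feature training data: class 0 near 0-2, class 1 near 10-12.
x_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = KNeighborsClassifier(n_neighbors=3).fit(x_train, y_train)

x_new = np.array([[9.0]])
distances, indices = model.kneighbors(x_new)

print(distances)          # distances to the 3 nearest stored points
print(indices)            # their row positions in x_train
print(y_train[indices])   # the labels that take part in the vote
print(model.predict(x_new))
```

Here the three nearest stored points are 10, 11 and 12, all of class 1, so the vote is unanimous.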

8. Evaluating the Model


Accuracy
#python
from sklearn.metrics import accuracy_score

 Import a function to calculate accuracy (how many predictions were correct).

#python
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc}")

 Calculate and print the accuracy (the fraction of correct predictions, a value between 0 and 1).
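As a quick check of what accuracy_score computes, the sketch below compares it with a by-hand count on ten made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for 10 test samples.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])

acc = accuracy_score(y_test, y_pred)
manual = np.mean(y_test == y_pred)   # correct predictions / total predictions

print(acc)            # 0.8
print(f"{acc:.0%}")   # 80%
```

Eight of the ten predictions match, so the score is 0.8; multiply by 100 (or use the % format) to show it as a percentage.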

Confusion Matrix
#python
from sklearn.metrics import confusion_matrix

 Import a function to make a confusion matrix.


 A confusion matrix shows where the model made right and wrong predictions.

#python
cm = confusion_matrix(y_test, y_pred, labels=[0,1])
print("Confusion Matrix:")
print(cm)

 Build the confusion matrix, assuming your labels are 0 and 1.


 Then print it.
 Example:

[[40 10]
 [ 5 45]]

o 40 = true 0 predicted as 0
o 10 = true 0 predicted as 1 (mistake)
o 5 = true 1 predicted as 0 (mistake)
o 45 = true 1 predicted as 1
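The four cells can also be pulled out by name. This sketch uses ten made-up labels rather than the 100-sample example above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for 10 test samples.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Rows are actual classes, columns are predicted classes,
# so flattening a 2x2 matrix gives TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 1 1 4
```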

Normalized Confusion Matrix


#python
import numpy as np

 Import NumPy (helps with mathematical operations).

#python
print("Normalized Confusion Matrix:")
cm_normalized = np.round(cm/np.sum(cm, axis=1).reshape(-1,1), 2)
print(cm_normalized)

 Normalize the confusion matrix row by row:


o Each row will sum to 1 (or 100%).
 np.sum(cm, axis=1).reshape(-1,1) sums each row and turns the sums into a column, so every row is divided by its own total.
 np.round(..., 2) rounds to 2 decimal places for better display.
 Example:

[[0.8 0.2]
 [0.1 0.9]]

o 80% of class 0 was correctly predicted.


o 90% of class 1 was correctly predicted.
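The row-wise normalization can be verified on the example counts [[40, 10], [5, 45]]. The sketch also rebuilds label arrays that produce those counts (an assumption for illustration) to show that confusion_matrix can return the same proportions directly through its normalize="true" option.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

cm = np.array([[40, 10],
               [5, 45]])

# keepdims=True keeps the row sums as a column, like reshape(-1, 1).
row_sums = cm.sum(axis=1, keepdims=True)
cm_normalized = np.round(cm / row_sums, 2)
print(cm_normalized)

# Hypothetical labels rebuilt to match the counts above.
y_true = np.array([0] * 50 + [1] * 50)
y_pred = np.array([0] * 40 + [1] * 10 + [0] * 5 + [1] * 45)
cm_direct = confusion_matrix(y_true, y_pred, labels=[0, 1], normalize="true")
print(cm_direct)
```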

Heatmap for Confusion Matrix


#python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(cm, cmap="Greens", annot=True,
            cbar_kws={"orientation": "vertical", "label": "color bar"},
            xticklabels=[0, 1], yticklabels=[0, 1])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

What's happening here?

 cm is your confusion matrix (computed in the Confusion Matrix step).

 sns.heatmap plots a color-coded matrix where:
o Each cell's value (the count) is annotated (annot=True).
o The color intensity shows the magnitude (the color map is set by cmap="Greens").
o cbar_kws customizes the color bar.
o xticklabels and yticklabels are set to [0, 1], the two classes of this binary classification.

Goal: Quickly visualize how many true positives, true negatives, false positives, and false
negatives you have.

Normalized Confusion Matrix


#python
import numpy as np

print("Normalized Confusion Matrix:")
cm_normalized = np.round(cm/np.sum(cm, axis=1).reshape(-1, 1), 2)
print(cm_normalized)

What's happening here?

 You normalize the confusion matrix:


o Divide each row by its row sum (np.sum(cm, axis=1) → sum across
predictions for each true class).
o Reshape to align the dimensions.
 np.round(..., 2) rounds to 2 decimal places for cleaner numbers.

Meaning:

 Instead of raw counts, you now see proportions.


 Each row adds to ~1.0 → makes it easier to interpret per-class performance.

Heatmap for Normalized Confusion Matrix


#python
sns.heatmap(cm_normalized, cmap="Greens", annot=True,
            cbar_kws={"orientation": "vertical", "label": "color bar"},
            xticklabels=[0, 1], yticklabels=[0, 1])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Normalized Confusion Matrix")
plt.show()

What's happening?

 Same as the first heatmap — but with the normalized values.


 Very useful to see if, say, class 0 is predicted correctly 95% of the time but class 1
only 70%.

Additional Evaluation: Precision, Recall, F1-Score


#python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

What's happening?

 Precision: Of all predicted positives, how many were correct?


 Recall: Of all actual positives, how many were captured?
 F1 Score: Harmonic mean of precision and recall — balances the two.

Why calculate them?

 They are especially important for imbalanced datasets (e.g., detecting rare events), where accuracy alone can look high even if the rare class is mostly missed.
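The three formulas can be worked through by hand on the example confusion matrix [[40, 10], [5, 45]], treating class 1 as the positive class. The label arrays below are rebuilt to match those counts and are an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels rebuilt to match the example matrix [[40, 10], [5, 45]].
y_test = np.array([0] * 50 + [1] * 50)
y_pred = np.array([0] * 40 + [1] * 10 + [0] * 5 + [1] * 45)

tp, fp, fn = 45, 10, 5
precision = tp / (tp + fp)                           # 45 / 55
recall = tp / (tp + fn)                              # 45 / 50
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.818 0.9 0.857

# The scikit-learn functions give the same values.
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))
```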

ROC Curve and AUC


#python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_scores = model.predict_proba(x_tests)[:, 1] # probability for class 1

fpr, tpr, thresholds = roc_curve(y_test, y_scores)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='blue', label='KNN (ROC curve)')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid()
plt.show()

auc = roc_auc_score(y_test, y_scores)
print(f"AUC: {auc}")

What's happening here?

 ROC Curve: Graph of TPR (True Positive Rate) vs. FPR (False Positive Rate) at
various threshold settings.
 AUC (Area Under Curve): A single value summarizing the ROC curve.
o AUC = 1.0 → perfect classifier
o AUC = 0.5 → random guessing

Why important?

 ROC-AUC is threshold-independent: it evaluates model quality across all possible thresholds.
 Helpful especially when dealing with imbalanced data.
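To see the numbers behind a ROC curve without plotting, the sketch below uses four made-up samples with made-up scores. It also computes the area under the (FPR, TPR) points with the trapezoid rule to show where the AUC value comes from.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for class 1.
y_test = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc = roc_auc_score(y_test, y_scores)

print(fpr)  # false positive rate at each threshold
print(tpr)  # true positive rate at each threshold
print(auc)  # 0.75

# Area under the (fpr, tpr) points by the trapezoid rule.
manual_auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(manual_auc)
```

With KNN and its default unweighted vote, predict_proba returns the share of neighbors in each class, so with n_neighbors=7 the scores can take only a few distinct values and the curve has only a few steps.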

Part                      Purpose
Heatmap                   Visualize raw confusion matrix
Normalize + Heatmap       Easier interpretation of model performance per class
Precision, Recall, F1     Key classification metrics, especially for imbalance
ROC Curve & AUC           Threshold-independent model evaluation

In this code, students learned:

 How to read data using pandas
 How to split data into training and testing sets
 Feature scaling (standardization with StandardScaler)
 Training a basic KNN model with scikit-learn
 Making predictions
 Evaluating model performance by using:
o Accuracy
o Confusion Matrix
o Normalized Confusion Matrix
o Heatmaps (color-coded plots) of the confusion matrix
o Precision, Recall, F1-score
o ROC Curve and AUC
