KNN
1. Importing pandas and Loading the Data
#python
import pandas as pd
Import the pandas library, which makes it easy to work with tables (rows and columns).
#python
df = pd.read_csv(r"C:/Users/Mewded/Documents/Machine learning/01-data.csv")
Read the CSV file (your dataset) into a pandas table called df.
r"..." means "raw string" (Windows file paths need it).
2. Selecting Features (x) and Target (y)
#python
x = df.iloc[:, 3:5].values
iloc stands for "integer location": it selects rows and columns by index position, not by name.
: → all rows.
3:5 → select columns 3 and 4 (Python slicing is start inclusive, end exclusive).
.values → extracts the values as a NumPy array (not as a pandas DataFrame or Series).
Thus, x is a NumPy array containing the values of the 4th and 5th columns (since counting starts from 0) for all rows.
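To see the start-inclusive, end-exclusive slicing rule on a tiny table (column names invented for illustration):
#python
import pandas as pd
demo = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=['a', 'b', 'c', 'd', 'e', 'f'])
print(demo.iloc[:, 3:5])   # selects columns 'd' and 'e' (positions 3 and 4)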
#python
y = df.iloc[:, -1].values
: → all rows.
-1 → the last column, counted from the end; this is your target (dependent variable).
Thus, y is a NumPy array containing the values of the last column for all rows, usually used as the target variable (label) in machine learning.
In machine learning, x holds the features (inputs) and y holds the target (labels) the model learns to predict.
3. Splitting into Training and Test Sets
#python
from sklearn.model_selection import train_test_split
Import a tool that can split your data into training and testing parts.
#python
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=100, random_state=0)
Split x and y into training and test sets. An integer test_size is an absolute count, so exactly 100 rows go to the test set; random_state=0 makes the split reproducible.
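A quick sanity check on the split sizes (the row count here is hypothetical):
#python
print(x_train.shape, x_test.shape)   # e.g. (300, 2) (100, 2) for a 400-row dataset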
4. Feature Scaling
#python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Create a scaler object. StandardScaler standardizes each feature to mean 0 and standard deviation 1, which matters for KNN because it relies on distances between points.
#python
x_train = scaler.fit_transform(x_train)
Fit the scaler on x_train (learn mean and std), then transform x_train.
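After fitting, the learned statistics are available as standard attributes on the scaler:
#python
print(scaler.mean_)    # per-column means learned from x_train
print(scaler.scale_)   # per-column standard deviations used for scaling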
#python
x_tests = scaler.transform(x_test)
Transform x_test using the mean and std already learned from x_train. The scaler is never fit on the test data; that would leak information from the test set.
5. Creating the KNN Model
#python
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=7)
Create a KNN model that will look at the 7 nearest neighbors to make a decision (a majority vote among them).
6. Training the Model
#python
model.fit(x_train, y_train)
"Train" the model. KNN is a lazy learner: fit simply stores the training data, so almost no computation happens here.
7. Making Predictions
#python
y_pred = model.predict(x_tests)
Use the stored (training) data to predict labels for x_tests (your scaled test data).
Prediction is the expensive step and happens at runtime, because KNN searches for the nearest neighbors among the saved training data.
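To classify a single new sample later (the feature values below are hypothetical), scale it with the same fitted scaler before predicting:
#python
new_point = scaler.transform([[35, 60000]])   # hypothetical raw feature values
print(model.predict(new_point))               # predicted class label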
Accuracy
#python
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc}")
Accuracy is the fraction of test samples that were predicted correctly.
Confusion Matrix
#python
from sklearn.metrics import confusion_matrix
#python
cm = confusion_matrix(y_test, y_pred, labels=[0,1])
print("Confusion Matrix:")
print(cm)
Example output:
[[40 10]
 [ 5 45]]
Rows are true classes, columns are predicted classes:
- 40 = true 0 predicted as 0
- 10 = true 0 predicted as 1 (a mistake)
- 5 = true 1 predicted as 0 (a mistake)
- 45 = true 1 predicted as 1
#python
import numpy as np

print("Normalized Confusion Matrix:")
cm_normalized = np.round(cm / np.sum(cm, axis=1).reshape(-1, 1), 2)   # divide each row by its row total
print(cm_normalized)
[[0.8 0.2]
[0.1 0.9]]
Each row now sums to 1, so per-class performance can be read directly: 80% of true 0s and 90% of true 1s were classified correctly.
Goal: quickly visualize how many true positives, true negatives, false positives, and false negatives you have.
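A heatmap makes the raw matrix easy to scan at a glance; a minimal sketch using seaborn (seaborn is an assumption here, any heatmap tool works):
#python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')   # annot=True writes each count in its cell
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.show()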
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
What's happening?
ROC Curve & AUC
#python
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

y_prob = model.predict_proba(x_tests)[:, 1]   # probability score for the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='blue', label='KNN (ROC curve)')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid()
plt.show()
ROC Curve: Graph of TPR (True Positive Rate) vs. FPR (False Positive Rate) at
various threshold settings.
AUC (Area Under Curve): A single value summarizing the ROC curve.
- AUC = 1.0 → perfect classifier
- AUC = 0.5 → random guessing
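To get the AUC value itself, scikit-learn's roc_auc_score can be applied to the same probability scores used for the curve (y_prob from the sketch above):
#python
from sklearn.metrics import roc_auc_score
print(f"AUC: {roc_auc_score(y_test, y_prob)}")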
Why important? Accuracy is measured at a single decision threshold, while the ROC curve and AUC show how well the model separates the two classes across all thresholds.
Part                      Purpose
Heatmap                   Visualize raw confusion matrix
Normalize + Heatmap       Easier interpretation of model performance per class
Precision, Recall, F1     Key classification metrics, especially for imbalance
ROC Curve & AUC           Threshold-independent model evaluation