Welcome back, ML peeps!
Hugo Bowne-Anderson
Data Scientist at DataCamp
Big ideas from last session?
How good is your model?
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman
Core Curriculum Manager, DataCamp
Classification metrics
• Measuring model performance with accuracy:
o Fraction of correctly classified samples
o Not always a useful metric
Class imbalance
[Figure: a classifier that predicts "legitimate" for every transaction scores 99% accuracy on data that is 99% legitimate and 1% fraudulent, but never catches any fraud]
Class imbalance
• Classification for predicting fraudulent bank transactions
o 99% of transactions are legitimate; 1% are fraudulent
• Could build a classifier that predicts NONE of the transactions are fraudulent
o 99% accurate!
o But terrible at actually predicting fraudulent transactions
o Fails at its original purpose (see the sketch below)
• Class imbalance: uneven frequency of classes
• Need a different way to assess performance
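As an illustration, here is a minimal sketch on synthetic data (not the course dataset, so the counts are made up): a classifier that always predicts the majority class reaches 99% accuracy while detecting no fraud at all.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 990 legitimate (0) and 10 fraudulent (1) transactions
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant for this illustration

clf = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
clf.fit(X, y)
y_pred = clf.predict(X)

print(accuracy_score(y, y_pred))  # 0.99, yet no fraudulent transaction is ever flagged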
Confusion matrix for assessing classification performance
Confusion matrix

                      Predicted: Legitimate   Predicted: Fraudulent
Actual: Legitimate    True Negative           False Positive
Actual: Fraudulent    False Negative          True Positive

NOTE:
Fraud – positive event
Legitimate – negative event
Assessing classification performance
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
• Precision = TP / (TP + FP)
• High precision = lower false positive rate
• High precision: not many legitimate transactions are predicted to be fraudulent
Recall
• Recall = TP / (TP + FN)
• High recall = lower false negative rate
• High recall: predicted most fraudulent transactions correctly
F1 score
• F1 score = 2 * (precision * recall) / (precision + recall)
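As a reference, a minimal sketch computing these three metrics by hand, using the true positive, false positive, and false negative counts from the confusion matrix shown later in this lesson:

# TP, FP, FN taken from the KNN confusion matrix [[1106, 11], [183, 34]]
tp, fp, fn = 34, 11, 183

precision = tp / (tp + fp)                          # 0.76
recall = tp / (tp + fn)                             # 0.16
f1 = 2 * precision * recall / (precision + recall)  # 0.26

print(precision, recall, f1)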
Confusion matrix in scikit-learn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.4,
                                                    random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
Confusion matrix in scikit-learn
print(confusion_matrix(y_test, y_pred))
[[1106 11]
[ 183 34]]
Classification report in scikit-learn
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1117
           1       0.76      0.16      0.26       217

    accuracy                           0.85      1334
   macro avg       0.81      0.57      0.59      1334
weighted avg       0.84      0.85      0.81      1334

Note: 0 – negative class, 1 – positive class
Questions?
Logistic regression and the ROC curve
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman
Core Curriculum Manager, DataCamp
Logistic regression for binary classification
• Logistic regression is used for classification problems
• Logistic regression outputs probabilities (0 to 1)
• If the probability, p>=0.5:
o The data is labeled 1
• If the probability, p<0.5:
o The data is labeled 0
Linear decision boundary
Logistic regression in scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
Predicting probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[0])
[0.08961376]
Probability thresholds
• By default, logistic regression threshold = 0.5
• What happens if we vary the threshold?
• Not specific to logistic regression
o KNN classifiers also have thresholds
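As a sketch (assuming y_pred_probs from the predict_proba slide above), varying the threshold just means comparing the predicted probabilities against a different cut-off:

# Label as fraudulent only when the predicted probability exceeds a custom threshold
threshold = 0.3  # example value; 0.5 is the default behaviour of .predict()
y_pred_custom = (y_pred_probs >= threshold).astype(int)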
The ROC curve (Receiver Operating Characteristic)
• TPR = TP / (TP + FN)   (Sensitivity / Recall)
• FPR = FP / (FP + TN)   (Fall-out)
• In most contexts, we want a low false positive rate (FPR) and a high true positive rate (TPR).
• These goals are in tension: increasing the TPR usually comes at the cost of increasing the FPR, and the trade-off is governed by the classification threshold.
• Choosing and managing this threshold is therefore up to you.
• This trade-off is precisely what the ROC curve illustrates: as you move along the curve, increasing sensitivity (TPR) usually comes at the expense of a higher FPR.
The ROC curve
• A threshold of 0 predicts every observation as the positive class: TPR = 1 and FPR = 1.
• A threshold of 1 predicts every observation as the negative class: TPR = 0 and FPR = 0.
If we vary the threshold, we get a series of different false positive and true positive rates.
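A minimal sketch (reusing y_test and y_pred_probs from the earlier slides) of how different thresholds produce different (FPR, TPR) pairs:

from sklearn.metrics import confusion_matrix

for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    y_pred_t = (y_pred_probs >= threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_t, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    print(threshold, fpr, tpr)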
Plotting the ROC curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
Plotting the ROC curve
[ROC curve plot: TPR vs. FPR, with the dashed diagonal representing a chance-level model]
ROC AUC
• ROC AUC: the area under the ROC curve
• 1.0 = perfect model; 0.5 = no better than random guessing
ROC AUC in scikit-learn
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

0.6700964152663693
The precision-recall (PR) curve
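A minimal sketch (reusing y_test and y_pred_probs) of plotting a precision-recall curve with scikit-learn's precision_recall_curve:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# precision and recall at every possible threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_probs)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()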
Other methods
Note: there are various methods other than the ROC and PR curves for evaluating the performance of classification models (see the sketch after this list):
▪ AUC-PR (Area Under the Precision-Recall Curve)
▪ Cost-Benefit Analysis
▪ Balanced Accuracy
▪ Log-Loss
▪ Gini Coefficient
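A minimal sketch (reusing y_test, y_pred, and y_pred_probs) of a few of these alternatives as implemented in scikit-learn; note that average_precision_score is a common summary of the area under the PR curve, and one common definition of the Gini coefficient for classifiers is 2 * AUC - 1:

from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             log_loss, roc_auc_score)

print(average_precision_score(y_test, y_pred_probs))  # average precision (AUC-PR summary)
print(balanced_accuracy_score(y_test, y_pred))        # balanced accuracy
print(log_loss(y_test, y_pred_probs))                 # log-loss
print(2 * roc_auc_score(y_test, y_pred_probs) - 1)    # Gini coefficient = 2 * AUC - 1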
Questions?
Hyperparameter tuning
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman
Core Curriculum Manager
Hyperparameter tuning
• Ridge/lasso regression: Choosing alpha
• KNN: Choosing n_neighbors
• Hyperparameters: Parameters we specify before training the model
o Like alpha and n_neighbors
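For example (a minimal sketch), hyperparameters are passed to the estimator's constructor before fitting:

from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsClassifier

ridge = Ridge(alpha=0.1)                    # alpha chosen before fitting
knn = KNeighborsClassifier(n_neighbors=15)  # n_neighbors chosen before fitting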
Choosing the correct hyperparameters
1. Try lots of different hyperparameter values
2. Fit all of them separately
3. See how well they perform
4. Choose the best-performing values
• This is called hyperparameter tuning
• It is essential to use cross-validation to avoid overfitting to the test set
• We can still split the data and perform cross-validation on the training set
• We withhold the test set for final evaluation
Choosing the correct hyperparameters
Note: we can also perform cross-validation using separate training and validation sets (see the sketch below)
https://medium.com/@rahulchavan4894/understanding-train-test-and-validation-dataset-split-in-simple-quick-terms-5a8630fe58c8
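A minimal sketch (with hypothetical variable names) of carving a validation set out of the data in addition to the held-out test set:

from sklearn.model_selection import train_test_split

# First split off the final test set, then split the remainder into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25,
                                                  random_state=42)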
Grid search cross-validation
GridSearchCV in scikit-learn
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge

kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.linspace(0.0001, 1, 10),
              "solver": ["sag", "lsqr"]}
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

{'alpha': 0.0001, 'solver': 'sag'}
0.7529912278705785
GridSearchCV in scikit-learn
How to Know the Parameters You Can Tune for Every Model?
• The get_params() method: most scikit-learn estimators have a get_params() method that returns a dictionary of all the parameters for the estimator, along with their current values.
• You can use this method not only to see what parameters are available but also to check their current settings.
• For example:
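(A minimal sketch; the exact dictionary keys depend on your scikit-learn version.)

from sklearn.linear_model import Ridge

ridge = Ridge()
# Returns a dict of every tunable parameter and its current value,
# e.g. 'alpha', 'fit_intercept', 'solver', ...
print(ridge.get_params())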
Limitations and an alternative approach
• 3-fold cross-validation, 1 hyperparameter, 10 total values = 30 fits
• 10-fold cross-validation, 3 hyperparameters, 30 total values = 900 fits
RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'alpha': np.linspace(0.0001, 1, 10),
              "solver": ['sag', 'lsqr']}
ridge = Ridge()
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

{'solver': 'sag', 'alpha': 0.0001}
0.7529912278705785
Evaluating on the test set
test_score = ridge_cv.score(X_test, y_test)
print(test_score)
0.7564731534089224
Questions?
Let's practice!
SUPERVISED LEARNING WITH SCIKIT-LEARN