Machine learning with scikit-learn
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman
Core Curriculum Manager, DataCamp
What is machine learning?
Machine learning is the process whereby:
Computers are given the ability to learn to make decisions from data
without being explicitly programmed!
Examples of machine learning
Unsupervised learning
Uncovering hidden patterns from unlabeled data
Example:
Grouping customers into distinct categories (Clustering)
Supervised learning
The target values are known for the training data (labeled data)
Aim: Predict the target values of unseen data, given the features
Types of supervised learning
Classification: Target variable consists of categories
Regression: Target variable is continuous
Naming conventions
Feature = predictor variable = independent variable
Target variable = dependent variable = response variable
Before you use supervised learning
Requirements:
No missing values
Data in numeric format
Data stored in pandas DataFrame or NumPy array
Perform Exploratory Data Analysis (EDA) first
scikit-learn syntax
from sklearn.module import Model
model = Model()
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)
[0 0 0 0 1 0]
We import a Model, a type of algorithm suited to our supervised learning problem, from an sklearn module. For
example, the k-Nearest Neighbors model uses the distance between observations to predict labels or values. We
instantiate the Model and assign it to a variable named model. The model is then fit to the data, where it learns
patterns relating the features to the target variable: we pass X, an array of our features, and y, an array of our
target values. Finally, we call the model's .predict() method on six new observations, X_new. For example, given
features from six emails, a spam classification model returns an array of six values, where 1 means the email is
predicted to be spam and 0 means it is predicted to be not spam.
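To make the import, instantiate, fit, predict pattern concrete, here is a minimal sketch using KNeighborsClassifier. The feature values and labels are invented for illustration and do not come from the course data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: two features per observation, binary target
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)  # instantiate the model
model.fit(X, y)                              # learn from the labeled data

X_new = np.array([[1.1, 1.0], [5.1, 5.0]])   # two unseen observations
predictions = model.predict(X_new)
print(predictions)
```

Each new observation sits inside one of the two made-up clusters, so the prediction is the label of that cluster.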
The classification challenge
Classifying labels of unseen data
1. Build a model
2. Model learns from the labeled data we pass to it
3. Pass unlabeled data to the model as input
4. Model predicts the labels of the unseen data
Labeled data = training data
k-Nearest Neighbors
Predict the label of a data point by
Looking at the k closest labeled data points
Taking a majority vote
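The two steps above (find the k closest labeled points, then take a majority vote) can be sketched from scratch with NumPy. This is an illustrative implementation under simple assumptions (Euclidean distance, made-up data, hypothetical function name knn_predict), not how scikit-learn implements it internally.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count labels among the neighbors
    return votes.most_common(1)[0][0]

# Made-up labeled data: three points near the origin (class 0), two far away (class 1)
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1])

label_near_origin = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
label_far = knn_predict(X_train, y_train, np.array([5.5, 5.0]), k=3)
print(label_near_origin, label_far)
```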
KNN Intuition
Using scikit-learn to fit a classifier
from sklearn.neighbors import KNeighborsClassifier
X = churn_df[["total_day_charge", "total_eve_charge"]].values
y = churn_df["churn"].values
print(X.shape, y.shape)
(3333, 2) (3333,)
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)
To fit a KNN model using scikit-learn, we import KNeighborsClassifier from sklearn.neighbors. We split our data into X, a 2D array of our
features, and y, a 1D array of the target values (in this case, churn status). scikit-learn requires the features to be in an array where each
column is a feature and each row is a different observation. Similarly, the target needs to be a single column with the same number of
observations as the feature data. We use the .values attribute to convert X and y to NumPy arrays. Printing the shapes of X and y, we see
there are 3333 observations of two features and 3333 observations of the target variable. We then instantiate KNeighborsClassifier, setting
n_neighbors to 15, and assign it to the variable knn. Finally, we fit this classifier to our labeled data by calling its .fit() method
with two arguments: the feature values, X, and the target values, y.
Predicting on unlabeled data
X_new = np.array([[56.8, 17.5],
                  [24.4, 24.1],
                  [50.1, 10.9]])
print(X_new.shape)
(3, 2)
predictions = knn.predict(X_new)
print('Predictions: {}'.format(predictions))
Predictions: [1 0 0]
Here we have a set of new observations, X_new. Checking the shape of X_new, we see it has three rows and two columns, that is, three
observations and two features. We use the classifier's .predict() method and pass it the unseen data as a 2D NumPy array containing
features in columns and observations in rows. Printing the predictions returns a binary value for each observation, or row, in X_new.
It predicts 1, which corresponds to 'churn', for the first observation, and 0, which corresponds to 'no churn', for the second and
third observations.
Measuring model performance
In classification, accuracy is a commonly used metric
Accuracy: $\text{accuracy} = \frac{\text{correct predictions}}{\text{total observations}}$
How do we measure accuracy?
Could compute accuracy on the data used to fit the classifier
NOT indicative of ability to generalize
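A hedged illustration of why training accuracy is misleading: with k = 1, every training point is its own nearest neighbor, so accuracy on the training data is perfect even when the labels carry no signal at all. The random data below is made up for this sketch, and the simple slice-based split is an assumption used to keep the example short.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # random features
y = rng.integers(0, 2, size=200)    # labels unrelated to the features

# Simple hold-out split: first 150 rows for fitting, last 50 for evaluation
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)  # each point is its own nearest neighbor
test_acc = knn.score(X_test, y_test)     # held-out data the model has not seen
print(train_acc, test_acc)
```

Because the labels are random, the held-out accuracy should land near chance, while the training accuracy says nothing about that.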
Computing accuracy
Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
0.8800599700149925
Model complexity
Larger k = less complex model = can cause underfitting
Smaller k = more complex model = can lead to overfitting
Model complexity and over/underfitting
import numpy as np
train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)
# Fit one KNN model per value of k and record its training and test accuracy
for neighbor in neighbors:
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)
Plotting our results
plt.figure(figsize=(8, 6))
plt.title("KNN: Varying Number of Neighbors")
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()
Model complexity curve
Introduction to regression
Predicting blood glucose levels
import pandas as pd
diabetes_df = pd.read_csv("diabetes.csv")
print(diabetes_df.head())
   pregnancies  glucose  triceps  insulin   bmi  age  diabetes
0            6      148       35        0  33.6   50         1
1            1       85       29        0  26.6   31         0
2            8      183        0        0  23.3   32         1
3            1       89       23       94  28.1   21         0
4            0      137       35      168  43.1   33         1
Creating feature and target arrays
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
print(type(X), type(y))
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
Making predictions from a single feature
X_bmi = X[:, 3]
print(y.shape, X_bmi.shape)
(752,) (752,)
X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape)
(752, 1)
Plotting glucose vs. body mass index
import matplotlib.pyplot as plt
plt.scatter(X_bmi, y)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()
Fitting a regression model
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_bmi, y)
predictions = reg.predict(X_bmi)
plt.scatter(X_bmi, y)
plt.plot(X_bmi, predictions)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()
The basics of linear regression
Regression mechanics
y = ax + b
Simple linear regression uses one feature
y = target
x = single feature
a, b = parameters/coefficients of the model - slope, intercept
How do we choose a and b?
Define an error function for any given line
Choose the line that minimizes the error function
Error function = loss function = cost function
The loss function
Ordinary Least Squares
$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Ordinary Least Squares (OLS): minimize RSS
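As a small sketch of what minimizing RSS means, the code below fits a line to five made-up points with np.polyfit, which computes a least-squares fit, and compares its RSS with that of two other candidate lines. The data and the helper name rss are assumptions for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])   # made-up values, roughly y = 2x + 1

def rss(a, b):
    """Residual sum of squares for the line y = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

a_ols, b_ols = np.polyfit(x, y, deg=1)    # least-squares slope and intercept
print(a_ols, b_ols, rss(a_ols, b_ols))
print(rss(2.0, 1.0), rss(2.5, 0.0))       # other candidate lines have a larger RSS
```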
Linear regression in higher dimensions
$y = a_1 x_1 + a_2 x_2 + b$
To fit a linear regression model here:
Need to specify 3 parameters: $a_1$, $a_2$, $b$
In higher dimensions:
Known as multiple regression
Must specify a coefficient for each feature, plus the intercept $b$
$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n + b$
scikit-learn works exactly the same way:
Pass two arrays: features and target
Linear regression using all features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
R-squared
$R^2$: quantifies the proportion of variance in the target values explained by the features
Values typically range from 0 to 1, but can be negative on unseen data when the model does worse than always predicting the mean
R-squared in scikit-learn
reg_all.score(X_test, y_test)
0.356302876407827
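For intuition about what this score measures, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers and a hypothetical helper r_squared; it also shows that a sufficiently poor model yields a negative value.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, where TSS is the squared error of always predicting the mean."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - rss / tss

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8])      # close to the targets
bad_pred = np.array([10.0, 10.0, 10.0, 10.0])   # far worse than predicting the mean

r2_good = r_squared(y_true, good_pred)
r2_bad = r_squared(y_true, bad_pred)
print(r2_good, r2_bad)
```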
Mean squared error and root mean squared error
$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
MSE is measured in target units, squared
$\text{RMSE} = \sqrt{\text{MSE}}$
RMSE is measured in the same units as the target variable
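A short worked example of both quantities with NumPy, using three made-up target values and predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)
```

The MSE here is 5/3 in squared target units, and the RMSE is its square root, back in the units of the target.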
RMSE in scikit-learn
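from sklearn.metrics import mean_squared_error
# squared=False is deprecated in recent scikit-learn releases; the square root of the MSE works in any version
np.sqrt(mean_squared_error(y_test, y_pred))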
24.028109426907236
Cross-validation
Cross-validation motivation
Model performance is dependent on the way we split up the data
Not representative of the model's ability to generalize to unseen data
Solution: Cross-validation!
Cross-validation basics
Cross-validation and model performance
5 folds = 5-fold CV
10 folds = 10-fold CV
k folds = k-fold CV
More folds = More computationally expensive
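The mechanics of k-fold splitting can be sketched in a few lines of NumPy: shuffle the row indices, cut them into k folds, and let each fold serve once as the held-out set while the rest is used for fitting. The function name k_fold_indices and the seed are assumptions for this illustration; scikit-learn's KFold provides the same idea as a ready-made class.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With k folds the model is fit k times, which is why more folds cost more computation.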
Cross-validation in scikit-learn
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)
Evaluating cross-validation performance
print(cv_results)
[0.70262578, 0.7659624, 0.75188205, 0.76914482, 0.72551151, 0.73608277]
print(np.mean(cv_results), np.std(cv_results))
0.7418682216666667 0.023330243960652888
print(np.quantile(cv_results, [0.025, 0.975]))
array([0.7054865, 0.76874702])
Regularized regression
Why regularize?
Recall: Linear regression minimizes a loss function
It chooses a coefficient, a, for each feature variable, plus b
Large coefficients can lead to overfitting
Regularization: Penalize large coefficients
Ridge regression
Loss function = OLS loss function + $\alpha \sum_{i=1}^{n} a_i^2$
Ridge penalizes large positive or negative coefficients
α: parameter we need to choose
Picking α is similar to picking k in KNN
Hyperparameter: a value chosen before fitting that controls how the model's parameters are learned
α controls model complexity
α = 0 = OLS (Can lead to overfitting)
Very high α: Can lead to underfitting
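To see how α controls the size of the coefficients, the sketch below solves the ridge problem in closed form with NumPy for several values of α. It assumes made-up data and, for brevity, a model with no intercept term; the helper name ridge_coefficients is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

def ridge_coefficients(X, y, alpha):
    """Closed-form minimizer of RSS + alpha * sum(a_i^2), with no intercept term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    coefs = ridge_coefficients(X, y, alpha)
    norms.append(np.linalg.norm(coefs))   # overall size of the coefficient vector
    print(alpha, np.round(coefs, 3))
```

With α = 0 this is the OLS solution; as α grows the coefficients are pulled toward zero, which is the source of underfitting at very high α.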
Ridge regression in scikit-learn
from sklearn.linear_model import Ridge
scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)
    scores.append(ridge.score(X_test, y_test))
print(scores)
[0.2828466623222221, 0.28320633574804777, 0.2853000732200006,
0.26423984812668133, 0.19292424694100963]
Lasso regression
Loss function = OLS loss function + $\alpha \sum_{i=1}^{n} |a_i|$
Lasso regression in scikit-learn
from sklearn.linear_model import Lasso
scores = []
for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    lasso_pred = lasso.predict(X_test)
    scores.append(lasso.score(X_test, y_test))
print(scores)
[0.99991649071123, 0.99961700284223, 0.93882227671069, 0.74855318676232, -0.05741034640016]
Lasso regression for feature selection
Lasso can select important features of a dataset
Shrinks the coefficients of less important features to zero
Features not shrunk to zero are selected by lasso
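A small sketch of this selection effect on synthetic data: only the first of three features influences the target, and with a moderately large alpha the lasso should set the other two coefficients to exactly zero. The data, the seed and the value alpha=0.5 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first feature influences the target; the other two are pure noise
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print(lasso.coef_)

selected = np.flatnonzero(lasso.coef_)  # indices of features with nonzero coefficients
print(selected)
```

Note that the surviving coefficient is also shrunk below its true value of 3, which is the price of the penalty.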
Lasso for feature selection in scikit-learn
from sklearn.linear_model import Lasso
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
names = diabetes_df.drop("glucose", axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_
plt.bar(names, lasso_coef)
plt.xticks(rotation=45)
plt.show()
How good is your model?
Classification metrics
Measuring model performance with accuracy:
Fraction of correctly classified samples
Not always a useful metric
Class imbalance
Classification for predicting fraudulent bank transactions
99% of transactions are legitimate; 1% are fraudulent
Could build a classifier that predicts NONE of the transactions are fraudulent
99% accurate!
But terrible at actually predicting fraudulent transactions
Fails at its original purpose
Class imbalance: Uneven frequency of classes
Need a different way to assess performance
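The fraud scenario above is easy to reproduce with made-up numbers: 1,000 transactions, 1% of them fraudulent, and a "classifier" that always predicts the majority class.

```python
import numpy as np

# 1,000 transactions: 990 legitimate (0) and 10 fraudulent (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)   # always predict "legitimate"

accuracy = np.mean(y_pred == y_true)
frauds_caught = int(np.sum((y_pred == 1) & (y_true == 1)))
print(accuracy, frauds_caught)
```

The accuracy is 99%, yet not a single fraudulent transaction is detected.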
Confusion matrix for assessing classification performance
Accuracy: $\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$
where $tp$, $tn$, $fp$ and $fn$ are the counts of true positives, true negatives, false positives and false negatives
Precision
$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$
High precision = lower false positive rate
High precision: Not many legitimate transactions are predicted to be fraudulent
Recall
$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$
High recall = lower false negative rate
High recall: Predicted most fraudulent transactions correctly
F1 score
$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
Confusion matrix in scikit-learn
from sklearn.metrics import classification_report, confusion_matrix
knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
Confusion matrix in scikit-learn
print(confusion_matrix(y_test, y_pred))
[[1106 11]
[ 183 34]]
Classification report in scikit-learn
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1117
           1       0.76      0.16      0.26       217

    accuracy                           0.85      1334
   macro avg       0.81      0.57      0.59      1334
weighted avg       0.84      0.85      0.81      1334
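As a cross-check, the class 1 row and the overall accuracy in this report can be recomputed by hand from the confusion matrix printed earlier. scikit-learn's confusion_matrix puts actual classes in rows and predicted classes in columns, so the four counts map as shown in the comments.

```python
# Counts taken from the confusion matrix above:
# rows are actual classes, columns are predicted classes
tn, fp = 1106, 11    # actual 0: predicted 0, predicted 1
fn, tp = 183, 34     # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

The low recall shows that most churners are missed, even though accuracy looks reasonable.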
Logistic regression and the ROC curve
Logistic regression for binary classification
Logistic regression is used for classification problems
Logistic regression outputs probabilities
If the probability, p > 0.5:
The data is labeled 1
If the probability, p < 0.5:
The data is labeled 0
Linear decision boundary
Logistic regression in scikit-learn
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
Predicting probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[0])
0.08961376
Probability thresholds
By default, logistic regression threshold = 0.5
Not specific to logistic regression
KNN classifiers also have thresholds
What happens if we vary the threshold?
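One way to explore that question is to compute the true positive rate and false positive rate at a few thresholds. The sketch below uses four made-up observations and predicted probabilities; the helper name rates_at is hypothetical.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_probs = np.array([0.10, 0.40, 0.35, 0.80])  # made-up predicted probabilities of class 1

def rates_at(threshold):
    """Return (true positive rate, false positive rate) for a given threshold."""
    y_pred = (y_probs >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(tpr), float(fpr)

for threshold in [0.5, 0.3, 0.0]:
    print(threshold, rates_at(threshold))
```

Lowering the threshold catches more positives but also raises more false alarms; at a threshold of 0 everything is predicted positive, so both rates equal 1.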
The ROC curve
Plotting the ROC curve
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
ROC AUC
ROC AUC in scikit-learn
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))
0.6700964152663693
Hyperparameter tuning
Ridge/lasso regression: Choosing alpha
KNN: Choosing n_neighbors
Hyperparameters: Parameters we specify before fitting the model
Like alpha and n_neighbors
Choosing the correct hyperparameters
1. Try lots of different hyperparameter values
2. Fit all of them separately
3. See how well they perform
4. Choose the best performing values
This is called hyperparameter tuning
It is essential to use cross-validation to avoid overfitting to the test set
We can still split the data and perform cross-validation on the training set
We withhold the test set for final evaluation
Grid search cross-validation
GridSearchCV in scikit-learn
from sklearn.model_selection import GridSearchCV
kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.linspace(0.0001, 1, 10),
              "solver": ["sag", "lsqr"]}
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)
{'alpha': 0.0001, 'solver': 'sag'}
0.7529912278705785
Limitations and an alternative approach
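3-fold cross-validation, 1 hyperparameter, 10 total values = 30 fits
10-fold cross-validation, 3 hyperparameters, 30 total values = 900 fits
The number of fits grows with both the number of folds and the number of hyperparameter values, so grid search scales poorly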
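For an exhaustive grid, the number of fits is the number of hyperparameter combinations multiplied by the number of folds. The sketch below counts them with itertools.product; the helper name count_fits and the candidate values are assumptions for illustration, and any final refit on the full training set is not counted.

```python
from itertools import product

def count_fits(param_grid, n_folds):
    """Number of model fits in an exhaustive grid search with k-fold cross-validation."""
    combinations = list(product(*param_grid.values()))
    return len(combinations) * n_folds

# One hyperparameter with 10 candidate values, 3 folds
one_param = {"alpha": [0.1 * i for i in range(1, 11)]}
print(count_fits(one_param, n_folds=3))

# Two hyperparameters (10 alphas x 2 solvers), 5 folds
two_params = {"alpha": [0.1 * i for i in range(1, 11)], "solver": ["sag", "lsqr"]}
print(count_fits(two_params, n_folds=5))
```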
RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'alpha': np.linspace(0.0001, 1, 10),
              "solver": ['sag', 'lsqr']}
ridge = Ridge()
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)
{'solver': 'sag', 'alpha': 0.0001}
0.7529912278705785
Evaluating on the test set
test_score = ridge_cv.score(X_test, y_test)
print(test_score)
0.7564731534089224
Preprocessing data
scikit-learn requirements
Numeric data
No missing values
With real-world data:
This is rarely the case
We will often need to preprocess our data first
Dealing with categorical features
scikit-learn will not accept categorical features by default
Need to convert categorical features into numeric values
Convert to binary features called dummy variables
0: Observation was NOT that category
1: Observation was that category
Dummy variables
Dealing with categorical features in Python
scikit-learn: OneHotEncoder()
pandas: get_dummies()
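A minimal sketch of the pandas route on a four-row, made-up genre column. With drop_first=True the first category in sorted order is dropped, because it is implied whenever all remaining dummy columns are 0.

```python
import pandas as pd

# Made-up rows standing in for the genre column of a music dataset
df = pd.DataFrame({"genre": ["Jazz", "Rap", "Rock", "Rap"]})

# drop_first=True drops the first category ("Jazz" here), which is implied
# when every remaining dummy column is 0
dummies = pd.get_dummies(df["genre"], drop_first=True)
print(dummies)
```

Depending on the pandas version, the dummy columns may be displayed as 0/1 integers or as True/False booleans; both encode the same information.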
Music dataset
popularity : Target variable
genre : Categorical feature
print(music.head())
   popularity  acousticness  danceability  ...       tempo  valence       genre
0        41.0        0.6440         0.823  ...  102.619000    0.649        Jazz
1        62.0        0.0855         0.686  ...  173.915000    0.636         Rap
2        42.0        0.2390         0.669  ...  145.061000    0.494  Electronic
3        64.0        0.0125         0.522  ...  120.406497    0.595        Rock
4        60.0        0.1210         0.780  ...   96.056000    0.312         Rap
EDA w/ categorical feature
Encoding dummy variables
import pandas as pd
music_df = pd.read_csv('music.csv')
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)
print(music_dummies.head())
   Anime  Blues  Classical  Country  Electronic  Hip-Hop  Jazz  Rap  Rock
0      0      0          0        0           0        0     1    0     0
1      0      0          0        0           0        0     0    1     0
2      0      0          0        0           1        0     0    0     0
3      0      0          0        0           0        0     0    0     1
4      0      0          0        0           0        0     0    1     0
music_dummies = pd.concat([music_df, music_dummies], axis=1)
music_dummies = music_dummies.drop("genre", axis=1)
Encoding dummy variables
music_dummies = pd.get_dummies(music_df, drop_first=True)
print(music_dummies.columns)
Index(['popularity', 'acousticness', 'danceability', 'duration_ms', 'energy',
'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo',
'valence', 'genre_Anime', 'genre_Blues', 'genre_Classical',
'genre_Country', 'genre_Electronic', 'genre_Hip-Hop', 'genre_Jazz',
'genre_Rap', 'genre_Rock'],
dtype='object')
Linear regression with dummy variables
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.linear_model import LinearRegression
X = music_dummies.drop("popularity", axis=1).values
y = music_dummies["popularity"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
linreg = LinearRegression()
linreg_cv = cross_val_score(linreg, X_train, y_train, cv=kf,
scoring="neg_mean_squared_error")
print(np.sqrt(-linreg_cv))
[8.15792932, 8.63117538, 7.52275279, 8.6205778, 7.91329988]
Handling missing data
Missing data
No value for a feature in a particular row
This can occur because:
There may have been no observation
The data might be corrupt
We need to deal with missing data
Music dataset
print(music_df.isna().sum().sort_values())
genre 8
popularity 31
loudness 44
liveness 46
tempo 46
speechiness 59
duration_ms 91
instrumentalness 91
danceability 143
valence 143
acousticness 200
energy 200
dtype: int64
Dropping missing data
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
print(music_df.isna().sum().sort_values())
popularity 0
liveness 0
loudness 0
tempo 0
genre 0
duration_ms 29
instrumentalness 29
speechiness 53
danceability 127
valence 127
acousticness 178
energy 178
dtype: int64
Imputing values
Imputation - use subject-matter expertise to replace missing data with educated guesses
Common to use the mean
Can also use the median, or another value
For categorical values, we typically use the most frequent value - the mode
Must split our data first, to avoid data leakage
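The point about splitting first can be sketched with NumPy: the fill value is learned from the training set only and then reused for the test set, so nothing about the test data influences the imputation. The tiny arrays are made up for illustration.

```python
import numpy as np

X_train = np.array([[1.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [10.0]])

# Learn the fill value from the training set only, so no test information leaks in
train_mean = np.nanmean(X_train, axis=0)

X_train_imputed = np.where(np.isnan(X_train), train_mean, X_train)
X_test_imputed = np.where(np.isnan(X_test), train_mean, X_test)
print(X_train_imputed.ravel(), X_test_imputed.ravel())
```

The missing test value is filled with 2.0, the training mean, rather than with a mean that includes the test value 10.0.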
Imputation with scikit-learn
from sklearn.impute import SimpleImputer
X_cat = music_df["genre"].values.reshape(-1, 1)
X_num = music_df.drop(["genre", "popularity"], axis=1).values
y = music_df["popularity"].values
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.2,
random_state=12)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.2,
random_state=12)
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat = imp_cat.fit_transform(X_train_cat)
X_test_cat = imp_cat.transform(X_test_cat)
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)
Imputers are known as transformers
Imputing within a pipeline
from sklearn.pipeline import Pipeline
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
steps = [("imputation", SimpleImputer()),
("logistic_regression", LogisticRegression())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
0.7593582887700535
Centering and scaling
Why scale our data?
print(music_df[["duration_ms", "loudness", "speechiness"]].describe())
duration_ms loudness speechiness
count 1.000000e+03 1000.000000 1000.000000
mean 2.176493e+05 -8.284354 0.078642
std 1.137703e+05 5.065447 0.088291
min -1.000000e+00 -38.718000 0.023400
25% 1.831070e+05 -9.658500 0.033700
50% 2.176493e+05 -7.033500 0.045000
75% 2.564468e+05 -5.034000 0.078642
max 1.617333e+06 -0.883000 0.710000
Many models use some form of distance to inform them
Features on larger scales can disproportionately influence the model
Example: KNN uses distance explicitly when making predictions
We want features to be on a similar scale
Normalizing or standardizing (scaling and centering)
How to scale our data
Subtract the mean and divide by the standard deviation
All features are centered around zero and have a variance of one
This is called standardization
Can also subtract the minimum and divide by the range
Minimum zero and maximum one
Can also normalize so the data ranges from -1 to +1
See scikit-learn docs for further details
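Both rescaling schemes are one-liners in NumPy. The sketch below applies them to a made-up feature; standardization here subtracts the mean and divides by the (population) standard deviation, and min-max scaling subtracts the minimum and divides by the range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

standardized = (x - x.mean()) / x.std()          # mean 0, standard deviation 1
min_max = (x - x.min()) / (x.max() - x.min())    # minimum 0, maximum 1
print(standardized)
print(min_max)
```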
Scaling in scikit-learn
from sklearn.preprocessing import StandardScaler
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))
19801.42536120538, 71343.52910125865
2.260817795600319e-17, 1.0
Scaling in a pipeline
steps = [('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test))
0.81
Comparing performance using unscaled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=21)
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
print(knn_unscaled.score(X_test, y_test))
0.53
CV and scaling in a pipeline
from sklearn.model_selection import GridSearchCV
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
# Pipeline hyperparameters are named <step name>__<parameter name>
parameters = {"knn__n_neighbors": np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
Checking model parameters
print(cv.best_score_)
0.8199999999999999
print(cv.best_params_)
{'knn__n_neighbors': 12}
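For a version of the grid search that runs without the music dataset, here is a sketch that swaps in scikit-learn's built-in iris data (an assumption made purely for illustration) and a smaller grid. It also shows that the best setting ends up on the `knn` step of the refitted pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (not the music data used on the slides)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=21, stratify=y)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])

# "knn__n_neighbors" targets the n_neighbors parameter of the step named "knn"
parameters = {"knn__n_neighbors": np.arange(1, 20)}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)
cv.fit(X_train, y_train)

print(cv.best_params_)
print(cv.best_score_)            # mean cross-validated accuracy of the best setting
print(cv.score(X_test, y_test))  # accuracy of the refitted best pipeline on the test set

best_knn = cv.best_estimator_.named_steps["knn"]
print(best_knn.n_neighbors)
```

Because the scaler sits inside the pipeline, it is refitted on the training portion of every fold, so each validation fold is scaled with statistics it did not contribute to.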
Let's practice!
Evaluating multiple models
Different models for different problems
Some guiding principles
Size of the dataset
Fewer features = simpler model, faster training time
Some models require large amounts of data to perform well
Interpretability
Some models are easier to explain, which can be important for stakeholders
Linear regression has high interpretability, as we can understand the coefficients
Flexibility
Flexible models may improve accuracy by making fewer assumptions about the data
KNN is a more flexible model: it doesn't assume any linear relationships
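The interpretability point can be made concrete with a small sketch. The data below are made up and noise-free (a hypothetical target built as 3 times the first feature minus 2 times the second, plus 5), so the fitted coefficients can be read straight back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 3*x0 - 2*x1 + 5
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

reg = LinearRegression().fit(X, y)

# Each coefficient is the change in the prediction per unit change in that feature
print(reg.coef_)
print(reg.intercept_)
```

A KNN model fitted to the same data has no comparable set of numbers to report, which is the trade-off between flexibility and ease of explanation.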
It's all in the metrics
Regression model performance:
RMSE
R-squared
Classification model performance:
Accuracy
Confusion matrix
Precision, recall, F1-score
ROC AUC
Train several models and evaluate their performance out of the box, without hyperparameter tuning
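Each metric listed above has a counterpart in `sklearn.metrics`. The sketch below computes them on tiny hand-made arrays (hypothetical values chosen so the results are easy to check by hand); RMSE is taken as the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, precision_score, r2_score,
                             recall_score, roc_auc_score)

# Regression: hypothetical true and predicted values
y_true_reg = np.array([3.0, 5.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 9.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
r2 = r2_score(y_true_reg, y_pred_reg)
print("RMSE:", rmse, "R-squared:", r2)

# Classification: hypothetical labels, hard predictions and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.9, 0.4, 0.2])

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)    # uses scores or probabilities, not hard labels

print("Accuracy:", acc)
print(cm)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
print("ROC AUC:", auc)
```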
A note on scaling
Models affected by scaling:
KNN
Linear Regression (plus Ridge, Lasso)
Logistic Regression
Artificial Neural Network
Best to scale our data before evaluating models
Evaluating classification models
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
X = music.drop("genre", axis=1).values
y = music["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(),
          "Decision Tree": DecisionTreeClassifier()}
results = []
for model in models.values():
    # Use the same shuffled 6-fold split for every model
    kf = KFold(n_splits=6, random_state=42, shuffle=True)
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(cv_results)
# One box per model; Matplotlib 3.9+ prefers tick_labels= over labels=
plt.boxplot(results, labels=list(models.keys()))
plt.show()
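The same comparison can be summarized as numbers instead of a box plot. This sketch is a variant rather than the slide's exact method: it uses synthetic data from `make_classification` (an assumption for illustration) and wraps each model in a pipeline so the scaler is refitted inside every fold, instead of reusing pre-scaled arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the music data
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)

models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Decision Tree": DecisionTreeClassifier(random_state=42)}

# The same shuffled 6-fold split is used for every model
kf = KFold(n_splits=6, random_state=42, shuffle=True)

summary = {}
for name, model in models.items():
    # Scaling happens inside each fold because it is part of the pipeline
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=kf)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean gives a headline number for each model, while the standard deviation plays the role of the box height in the plot: a wide spread suggests the score depends heavily on which fold is held out.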
Visualizing results
[Figure: box plot of the cross-validation accuracy of each model]
Test set performance
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {}".format(name, test_score))
Logistic Regression Test Set Accuracy: 0.844
KNN Test Set Accuracy: 0.82
Decision Tree Test Set Accuracy: 0.832
Let's practice!
Congratulations
What you've covered
Using supervised learning techniques to build predictive models
For both regression and classification problems
Underfitting and overfitting
How to split data
Cross-validation
Data preprocessing techniques
Model selection
Hyperparameter tuning
Model performance evaluation
Using pipelines
Where to go from here?
Machine Learning with Tree-Based Models in Python
Preprocessing for Machine Learning in Python
Model Validation in Python
Feature Engineering for Machine Learning in Python
Unsupervised Learning in Python
Machine Learning Projects
Thank you!