Supervised Learning With Scikit-Learn

Uploaded by

Lee Louise


Machine learning with scikit-learn
SUPERVISED LEARNING WITH SCIKIT-LEARN

George Boorman
Core Curriculum Manager, DataCamp
What is machine learning?
Machine learning is the process whereby:
Computers are given the ability to learn to make decisions from data

without being explicitly programmed!

Examples of machine learning

Unsupervised learning
Uncovering hidden patterns from unlabeled data
Example:
Grouping customers into distinct categories (Clustering)

Supervised learning
The values to be predicted (the targets) are already known for the training data

Aim: Predict the target values of unseen data, given the features

Types of supervised learning
Classification: Target variable consists of categories
Regression: Target variable is continuous

Naming conventions
Feature = predictor variable = independent variable
Target variable = dependent variable = response variable

Before you use supervised learning
Requirements:
No missing values

Data in numeric format

Data stored in pandas DataFrame or NumPy array

Perform Exploratory Data Analysis (EDA) first
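
The requirements above can be checked quickly with pandas before any model is fit. The following is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not from the course data): it counts missing values per column and lists the columns that are not yet numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one non-numeric column.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "plan": ["basic", "premium", "basic", "premium"],
    "churn": [0, 1, 0, 1],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# List the columns that are not numeric and would need encoding first.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Any column with a non-zero missing count, or any name in non_numeric, would need handling (imputation, dropping, or encoding) before being passed to a scikit-learn model.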

scikit-learn syntax
from sklearn.module import Model
model = Model()
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)

[0 0 0 0 1 0]

We import a Model, which is a type of algorithm for our supervised learning problem, from an sklearn module. For
example, the k-Nearest Neighbors model uses the distance between observations to predict labels or values. We
create a variable named model and instantiate the Model. The model is then fit to the data, where it learns patterns
relating the features to the target variable. We fit the model to X, an array of our features, and y, an array of our
target variable values. We then call the model's .predict() method, passing six new observations, X_new. For
example, if we feed features from six emails to a spam classification model, an array of six values is returned. A
one indicates the model predicts that the email is spam, and a zero represents a prediction of not spam.
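
To make the generic Model pattern concrete, here is a small sketch that substitutes KNeighborsClassifier for Model. The toy arrays are invented for illustration only: six observations with two features each, and a binary target.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two features per observation, binary target.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# Same three steps as on the slide: instantiate, fit, predict.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

X_new = np.array([[0.05, 0.05], [1.0, 0.95]])
predictions = model.predict(X_new)
print(predictions)
```

Each new observation sits next to one of the two clusters, so the model returns one label per row of X_new.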

The classification challenge
Classifying labels of unseen data
1. Build a model
2. Model learns from the labeled data we pass to it

3. Pass unlabeled data to the model as input

4. Model predicts the labels of the unseen data

Labeled data = training data

k-Nearest Neighbors
Predict the label of a data point by
Looking at the k closest labeled data points

Taking a majority vote
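
The two steps above can be sketched directly in NumPy. This is an illustrative from-scratch version, not scikit-learn's implementation; the function name knn_predict and the toy data are assumptions made for the example, and Euclidean distance is assumed as the distance measure.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([1.2, 1.1]), k=3)
label_b = knn_predict(X_train, y_train, np.array([6.4, 6.2]), k=3)
print(label_a, label_b)
```

With an even k a vote can tie; this sketch then returns whichever tied label was counted first, which is one reason an odd k is often chosen for binary problems.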

KNN Intuition

Using scikit-learn to fit a classifier
from sklearn.neighbors import KNeighborsClassifier
X = churn_df[["total_day_charge", "total_eve_charge"]].values
y = churn_df["churn"].values
print(X.shape, y.shape)

(3333, 2) (3333,)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)
To fit a KNN model using scikit-learn, we import KNeighborsClassifier from sklearn.neighbors. We split our data into X, a 2D array of our
features, and y, a 1D array of the target values, in this case churn status. scikit-learn requires that the features are in an array where each
column is a feature and each row a different observation. Similarly, the target needs to be a single column with the same number of
observations as the feature data. We use the .values attribute to convert X and y to NumPy arrays. Printing the shape of X and y, we see
there are 3333 observations of two features, and 3333 observations of the target variable. We then instantiate our KNeighborsClassifier, setting
n_neighbors equal to 15, and assign it to the variable knn. Then we can fit this classifier to our labeled data by calling the classifier's .fit()
method and passing two arguments: the feature values, X, and the target values, y.

Predicting on unlabeled data
X_new = np.array([[56.8, 17.5],
                  [24.4, 24.1],
                  [50.1, 10.9]])
print(X_new.shape)

(3, 2)

predictions = knn.predict(X_new)
print('Predictions: {}'.format(predictions))

Predictions: [1 0 0]

Here we have a set of new observations, X_new. Checking the shape of X_new, we see it has three rows and two columns, that is, three
observations and two features. We use the classifier's .predict() method and pass it the unseen data as a 2D NumPy array containing
features in columns and observations in rows. Printing the predictions returns a binary value for each observation or row in X_new.
It predicts 1, which corresponds to 'churn', for the first observation, and 0, which corresponds to 'no churn', for the second and
third observations.

Measuring model performance
Measuring model performance
In classification, accuracy is a commonly used metric
Accuracy: number of correct predictions / total number of observations

How do we measure accuracy?
Could compute accuracy on the data used to fit the classifier

NOT indicative of ability to generalize
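
A small experiment illustrates why accuracy on the fitting data says little about generalization. In this sketch (random, made-up data where the labels are unrelated to the features), a 1-nearest-neighbor classifier still scores perfectly on the data it was fit to, because each point is its own nearest neighbor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical data: random features with labels unrelated to them.
X = rng.normal(size=(200, 2))
y = rng.integers(0, 2, size=200)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# Accuracy = correct predictions / total observations.
train_predictions = knn.predict(X)
train_accuracy = np.mean(train_predictions == y)
print(train_accuracy)
```

The perfect score here reflects memorization rather than learning, which is the motivation for holding out a separate test set.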

Computing accuracy

Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

0.8800599700149925
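
The stratify=y argument keeps the class proportions the same in the training and test sets. The sketch below uses an invented target with 20% positives (not the churn data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20 positives out of 100 observations.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Fraction of positives in each split.
print(y_train.mean(), y_test.mean())
```

Both splits keep the 20% share of positives, so the test score is computed on data that mirrors the class balance of the full dataset.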

Model complexity
Larger k = less complex model = can cause underfitting
Smaller k = more complex model = can lead to overfitting

Model complexity and over/underfitting
train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)
for neighbor in neighbors:
    # Fit one KNN model per value of k and record both accuracies
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)

Plotting our results
plt.figure(figsize=(8, 6))
plt.title("KNN: Varying Number of Neighbors")
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()

Model complexity curve

Introduction to regression
Predicting blood glucose levels
import pandas as pd
diabetes_df = pd.read_csv("diabetes.csv")
print(diabetes_df.head())

   pregnancies  glucose  triceps  insulin   bmi  age  diabetes
0            6      148       35        0  33.6   50         1
1            1       85       29        0  26.6   31         0
2            8      183        0        0  23.3   32         1
3            1       89       23       94  28.1   21         0
4            0      137       35      168  43.1   33         1

Creating feature and target arrays
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
print(type(X), type(y))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>

Making predictions from a single feature
X_bmi = X[:, 3]
print(y.shape, X_bmi.shape)

(752,) (752,)

X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape)

(752, 1)

Plotting glucose vs. body mass index
import matplotlib.pyplot as plt
plt.scatter(X_bmi, y)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

Fitting a regression model
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_bmi, y)
predictions = reg.predict(X_bmi)
plt.scatter(X_bmi, y)
plt.plot(X_bmi, predictions)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

The basics of linear regression
Regression mechanics
y = ax + b
Simple linear regression uses one feature
y = target
x = single feature
a, b = parameters/coefficients of the model - slope, intercept
How do we choose a and b?
Define an error function for any given line

Choose the line that minimizes the error function

Error function = loss function = cost function

The loss function

Ordinary Least Squares
$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Ordinary Least Squares (OLS): minimize RSS
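
For a single feature, the line that minimizes RSS can be written in closed form. The sketch below uses made-up points lying near y = 2x + 1 and the standard closed-form OLS estimates for slope and intercept; the helper name rss is an assumption for the example. It then compares the RSS of the fitted line with that of a hand-picked line.

```python
import numpy as np

# Hypothetical data lying close to the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

def rss(a, b):
    """Residual sum of squares for the line y = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

# Closed-form OLS estimates for simple linear regression.
a_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_ols = y.mean() - a_ols * x.mean()

print(a_ols, b_ols)
print(rss(a_ols, b_ols), rss(2.0, 1.0))
```

No other choice of a and b gives a smaller RSS on these points than the OLS pair, which is what "minimize RSS" means in practice.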

Linear regression in higher dimensions
$y = a_1 x_1 + a_2 x_2 + b$

To fit a linear regression model here:

Need to specify 3 variables: $a_1$, $a_2$, $b$
In higher dimensions:
Known as multiple regression

Must specify coefficients for each feature and the variable b

$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n + b$

scikit-learn works exactly the same way:

Pass two arrays: features and target
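
After fitting, the learned coefficients and intercept can be read from the model. As a sketch, assume noiseless synthetic data generated from y = 3*x1 - 2*x2 + 5 (invented for this example); LinearRegression then recovers those values through its coef_ and intercept_ attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
# Hypothetical noiseless data generated from y = 3*x1 - 2*x2 + 5.
X = rng.normal(size=(50, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

reg = LinearRegression()
reg.fit(X, y)

# One coefficient per feature (a1, a2) plus the intercept (b).
print(reg.coef_, reg.intercept_)
```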

Linear regression using all features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)

R-squared
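$R^2$: quantifies the proportion of variance in the target values
explained by the features
Values typically range from 0 to 1; on held-out data $R^2$ can be negative when the model fits worse than always predicting the mean

[Figures: example fit with high $R^2$; example fit with low $R^2$]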
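
R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses four invented target values and predictions, and checks the result against sklearn.metrics.r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

rss = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
tss = np.sum((y_true - y_true.mean()) ** 2)   # total variation in the target
r_squared = 1 - rss / tss
print(r_squared)
```

Here rss is 0.75 and tss is 20, giving an R-squared of 0.9625, the same value r2_score returns.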

R-squared in scikit-learn
reg_all.score(X_test, y_test)

0.356302876407827

Mean squared error and root mean squared error
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

MSE is measured in target units, squared

$RMSE = \sqrt{MSE}$

RMSE is measured in the same units as the target variable
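
Both quantities are short NumPy expressions. The values below are invented glucose-like numbers used only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical true values and predictions (same units as the target).
y_true = np.array([100.0, 120.0, 90.0, 110.0])
y_pred = np.array([110.0, 115.0, 80.0, 105.0])

# Mean of the squared residuals, then its square root.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)
```

The squared errors are 100, 25, 100 and 25, so the MSE is 62.5 (in squared units) and the RMSE is roughly 7.9 in the original units.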

RMSE in scikit-learn
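from sklearn.metrics import mean_squared_error
# squared=False returns the RMSE; newer scikit-learn releases deprecate this
# argument in favor of sklearn.metrics.root_mean_squared_error
mean_squared_error(y_test, y_pred, squared=False)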

24.028109426907236

Cross-validation
Cross-validation motivation
Model performance is dependent on the way we split up the data
Not representative of the model's ability to generalize to unseen data

Solution: Cross-validation!

Cross-validation basics

Cross-validation and model performance
5 folds = 5-fold CV

10 folds = 10-fold CV

k folds = k-fold CV

More folds = More computationally expensive
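
To see what a fold is, the sketch below splits ten made-up observations into five folds with KFold and prints the row indices used for training and testing in each round.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical dataset of ten observations.
X = np.arange(10).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

test_folds = []
for train_idx, test_idx in kf.split(X):
    # Each round holds out a different fold as the test set.
    test_folds.append(test_idx)
    print("train:", train_idx, "test:", test_idx)

# Every observation is used for testing exactly once.
all_test = np.sort(np.concatenate(test_folds))
```

Each round requires fitting a new model, which is why more folds cost more computation.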

Cross-validation in scikit-learn
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)

Evaluating cross-validation performance
print(cv_results)

[0.70262578, 0.7659624, 0.75188205, 0.76914482, 0.72551151, 0.73608277]

print(np.mean(cv_results), np.std(cv_results))

0.7418682216666667 0.023330243960652888

print(np.quantile(cv_results, [0.025, 0.975]))

array([0.7054865, 0.76874702])

Regularized regression
Why regularize?
Recall: Linear regression minimizes a loss function
It chooses a coefficient, a, for each feature variable, plus b

Large coefficients can lead to overfitting

Regularization: Penalize large coefficients

Ridge regression
Loss function = OLS loss function + $\alpha \sum_{i=1}^{n} a_i^2$

Ridge penalizes large positive or negative coefficients

α: parameter we need to choose

Picking α is similar to picking k in KNN

Hyperparameter: variable chosen before fitting that controls how the model parameters are learned

α controls model complexity

α = 0 = OLS (Can lead to overfitting)
Very high α: Can lead to underfitting
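
The effect of α on the coefficients can be observed directly. This sketch fits Ridge at three values of alpha on synthetic data (generated here purely for illustration) and records the overall size of the coefficient vector each time.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Hypothetical data: three features with true coefficients 4, -3 and 2.
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=100)

coef_norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    # Euclidean length of the coefficient vector at this alpha.
    coef_norms[alpha] = np.linalg.norm(ridge.coef_)
print(coef_norms)
```

The coefficient vector shrinks as alpha grows, which is the mechanism by which a very high α can push the model toward underfitting.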

Ridge regression in scikit-learn
from sklearn.linear_model import Ridge
scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)
    scores.append(ridge.score(X_test, y_test))
print(scores)

[0.2828466623222221, 0.28320633574804777, 0.2853000732200006,
 0.26423984812668133, 0.19292424694100963]

Lasso regression
Loss function = OLS loss function + $\alpha \sum_{i=1}^{n} |a_i|$

Lasso regression in scikit-learn
from sklearn.linear_model import Lasso
scores = []
for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    lasso_pred = lasso.predict(X_test)
    scores.append(lasso.score(X_test, y_test))
print(scores)

[0.99991649071123, 0.99961700284223, 0.93882227671069, 0.74855318676232, -0.05741034640016]

Lasso regression for feature selection
Lasso can select important features of a dataset

Shrinks the coefficients of less important features to zero

Features not shrunk to zero are selected by lasso
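
A synthetic example makes the selection behavior visible. In this sketch (data invented for illustration), only the first two of four features influence the target, and the fitted lasso coefficients for the other two end up at exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
# Hypothetical target: only the first two features matter.
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

# Indices of features whose coefficients were not shrunk to zero.
selected = np.flatnonzero(lasso.coef_)
print(lasso.coef_, selected)
```

The surviving coefficients are also pulled toward zero, so lasso selects features and shrinks them at the same time.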

Lasso for feature selection in scikit-learn
from sklearn.linear_model import Lasso
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
names = diabetes_df.drop("glucose", axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_
plt.bar(names, lasso_coef)
plt.xticks(rotation=45)
plt.show()

How good is your model?
Classification metrics
Measuring model performance with accuracy:
Fraction of correctly classified samples

Not always a useful metric

Class imbalance
Classification for predicting fraudulent bank transactions
99% of transactions are legitimate; 1% are fraudulent

Could build a classifier that predicts NONE of the transactions are fraudulent
99% accurate!
But terrible at actually predicting fraudulent transactions

Fails at its original purpose

Class imbalance: Uneven frequency of classes
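
The fraud scenario above can be reproduced in a few lines. The labels here are invented to match the 99% / 1% split described on the slide, and the "classifier" simply predicts legitimate for every transaction.

```python
import numpy as np

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = np.array([0] * 990 + [1] * 10)
# A "classifier" that always predicts legitimate.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
fraud_caught = np.sum((y_true == 1) & (y_pred == 1))
print(accuracy, fraud_caught)
```

Accuracy comes out at 0.99 while not a single fraudulent transaction is detected, which is why accuracy alone is a poor guide when classes are imbalanced.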