
Designing ML Workflows in Python

Compare clf_full and clf_sliding predictions on the test set. If they differ, there is dataset shift, and clf_sliding is the more robust of the two.


From workflows to pipelines

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

Dr. Chris Anagnostopoulos

Honorary Associate Professor
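Revisiting our workflow
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Tune max_depth by cross-validation on the training split
grid_search = GridSearchCV(rf(), param_grid={'max_depth': [2, 5, 10]})
grid_search.fit(X_train, y_train)
depth = grid_search.best_params_['max_depth']

# Keep the 3 highest-scoring features
vt = SelectKBest(f_classif, k=3).fit(X_train, y_train)

# Refit with the tuned depth on the selected features, then score on the test split
clf = rf(max_depth=depth).fit(vt.transform(X_train), y_train)
accuracy_score(y_test, clf.predict(vt.transform(X_test)))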

The power of grid search
Optimize max_depth:

pg = {'max_depth': [2, 5, 10]}
gs = GridSearchCV(rf(), param_grid=pg)
gs.fit(X_train, y_train)
depth = gs.best_params_['max_depth']

The power of grid search
Then optimize n_estimators:

pg = {'n_estimators': [10, 20, 30]}
gs = GridSearchCV(rf(max_depth=depth), param_grid=pg)
gs.fit(X_train, y_train)
n_est = gs.best_params_['n_estimators']

The power of grid search
Jointly optimize max_depth and n_estimators:

pg = {
    'max_depth': [2, 5, 10],
    'n_estimators': [10, 20, 30]
}
gs = GridSearchCV(rf(), param_grid=pg)
gs.fit(X_train, y_train)
print(gs.best_params_)

{'max_depth': 10, 'n_estimators': 20}
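
The gap between tuning one parameter at a time and tuning both jointly can be shown without fitting any model. The sketch below uses a made-up table of cross-validation scores (the numbers are assumptions chosen for illustration, not results from the course) and assumes the sequential search starts with n_estimators fixed at 10.

```python
from itertools import product

# Hypothetical cross-validation scores for each (max_depth, n_estimators) pair.
# These numbers are made up for illustration; they are not from the course data.
cv_score = {
    (2, 10): 0.70, (2, 20): 0.71, (2, 30): 0.70,
    (5, 10): 0.72, (5, 20): 0.73, (5, 30): 0.74,
    (10, 10): 0.71, (10, 20): 0.78, (10, 30): 0.75,
}
depths = [2, 5, 10]
n_estimators = [10, 20, 30]

# Sequential search: tune max_depth at a fixed n_estimators, then tune
# n_estimators with max_depth frozen at the value just found.
start_n = 10  # assumed starting value for n_estimators
seq_depth = max(depths, key=lambda d: cv_score[(d, start_n)])
seq_n = max(n_estimators, key=lambda n: cv_score[(seq_depth, n)])
sequential_best = (seq_depth, seq_n)

# Joint search: score every combination, as a two-key param_grid would.
joint_best = max(product(depths, n_estimators), key=cv_score.get)

print("sequential:", sequential_best, cv_score[sequential_best])
print("joint:     ", joint_best, cv_score[joint_best])
```

With these scores the sequential search settles on a pair that the joint search beats, because the best max_depth depends on which n_estimators it is paired with.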

Pipelines
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', RandomForestClassifier())
])

params = dict(
    feature_selection__k=[2, 3, 4],
    classifier__max_depth=[5, 10, 20]
)

grid_search = GridSearchCV(pipe, param_grid=params)

gs = grid_search.fit(X_train, y_train).best_params_

{'classifier__max_depth': 20, 'feature_selection__k': 4}
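
The grid keys follow a step-name, double underscore, parameter-name pattern. As a rough illustration of how such a key maps onto a pipeline step, the sketch below splits each key at its first double underscore. The helper split_param is hypothetical and is not scikit-learn's own implementation.

```python
# Grid keys copied from the slide above.
params = {
    "feature_selection__k": [2, 3, 4],
    "classifier__max_depth": [5, 10, 20],
}

def split_param(name):
    """Split 'step__param' into (step, param) at the first double underscore."""
    step, _, param = name.partition("__")
    return step, param

# Group the candidate values by the pipeline step they belong to.
by_step = {}
for name, values in params.items():
    step, param = split_param(name)
    by_step.setdefault(step, {})[param] = values

print(by_step)
```

The part before the double underscore has to match a step name given when the Pipeline was built; otherwise the grid search cannot route the parameter to an estimator.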


Customizing your pipeline
from sklearn.metrics import roc_auc_score, make_scorer
auc_scorer = make_scorer(roc_auc_score)

grid_search = GridSearchCV(pipe, param_grid=params, scoring=auc_scorer)
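
One detail worth checking when wrapping roc_auc_score this way: with its default settings, make_scorer scores the output of predict, so the AUC is computed on hard 0/1 labels rather than on probabilities. The plain-Python sketch below (made-up labels and probabilities, and a hand-written auc helper rather than the scikit-learn function) shows that AUC on hard labels reduces to the mean of the true positive and true negative rates, and can differ from AUC on the underlying probabilities.

```python
def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.45, 0.1, 0.2, 0.6, 0.55]  # hypothetical probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]       # hard labels, as predict() would give

# True positive rate and true negative rate of the hard labels.
tpr = sum(p == 1 for y, p in zip(y_true, y_pred) if y == 1) / y_true.count(1)
tnr = sum(p == 0 for y, p in zip(y_true, y_pred) if y == 0) / y_true.count(0)

auc_labels = auc(y_true, y_pred)
auc_probs = auc(y_true, y_prob)
print(auc_labels, (tpr + tnr) / 2, auc_probs)
```

If a ranking-based AUC is what you want, the built-in scoring string 'roc_auc' is one option to consider, since it scores continuous outputs instead of hard labels.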

Don't overdo it
params = dict(
    feature_selection__k=[2, 3, 4],
    classifier__max_depth=[5, 10, 20],
    classifier__n_estimators=[10, 20, 30]
)
grid_search = GridSearchCV(pipe, params, cv=10)

3 x 3 x 3 x 10 = 270 classifier fits!
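
The fit count can be checked before launching a search by expanding the grid with itertools.product. This is a small sketch of the arithmetic only; the step names in the keys are assumed to match the pipeline defined earlier.

```python
from itertools import product

# Grid sizes and fold count from the slide above.
params = {
    "feature_selection__k": [2, 3, 4],
    "classifier__max_depth": [5, 10, 20],
    "classifier__n_estimators": [10, 20, 30],
}
cv = 10

# Every combination of values is one candidate; each candidate is fit once per fold.
candidates = list(product(*params.values()))
n_fits = len(candidates) * cv
print(len(candidates), "candidates x", cv, "folds =", n_fits, "fits")
```

With GridSearchCV's default refit=True there is one further fit of the best candidate on the full training set after the search finishes.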

Supercharged workflows

Model deployment
Serializing your model
Store a classifier to file:

import pickle
clf = RandomForestClassifier().fit(X_train, y_train)
with open('model.pkl', 'wb') as file:
    pickle.dump(clf, file)

Load it again from file:

with open('model.pkl', 'rb') as file:
    clf2 = pickle.load(file)
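
The same dump/load round trip can be tried without touching the disk or training anything. In the sketch below a plain dict stands in for the fitted classifier and an in-memory buffer stands in for model.pkl; both substitutions are assumptions made to keep the example self-contained.

```python
import io
import pickle

# Stand-in for a fitted model: any picklable Python object round-trips the same way.
# (The dict and its keys are hypothetical, not a scikit-learn object.)
model = {"kind": "toy-classifier", "max_depth": 10, "classes": [0, 1]}

buffer = io.BytesIO()          # in-memory substitute for open('model.pkl', 'wb')
pickle.dump(model, buffer)

buffer.seek(0)                 # rewind before reading, like reopening with 'rb'
restored = pickle.load(buffer)

# The restored object is equal to the original but is a separate copy.
print(restored == model, restored is model)
```

Note that unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.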

Serializing your pipeline
Development environment:

vt = SelectKBest(f_classif).fit(X_train, y_train)
clf = RandomForestClassifier().fit(vt.transform(X_train), y_train)
with open('vt.pkl', 'wb') as file:
    pickle.dump(vt, file)
with open('clf.pkl', 'wb') as file:
    pickle.dump(clf, file)

Serializing your pipeline
Production environment:

with open('vt.pkl', 'rb') as file:
    vt = pickle.load(file)
with open('clf.pkl', 'rb') as file:
    clf = pickle.load(file)
clf.predict(vt.transform(X_new))

Serializing your pipeline
Development environment:

pipe = Pipeline([
    ('fs', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier())
])
params = dict(fs__k=[2, 3, 4],
              clf__max_depth=[5, 10, 20])
gs = GridSearchCV(pipe, params)
gs = gs.fit(X_train, y_train)

with open('pipe.pkl', 'wb') as file:
    pickle.dump(gs, file)

Serializing your pipeline
Production environment:

with open('pipe.pkl', 'rb') as file:
    gs = pickle.load(file)
gs.predict(X_test)

Custom feature transformations
   checking_status  duration  ...  own_telephone  foreign_worker
0                1         6  ...              1               1
1                0        48  ...              0               1

from sklearn.preprocessing import FunctionTransformer

def negate_second_column(X):
    # Work on a copy so the caller's array is not modified
    Z = X.copy()
    Z[:, 1] = -Z[:, 1]
    return Z

pipe = Pipeline([('ft', FunctionTransformer(negate_second_column)),
                 ('clf', RandomForestClassifier())])
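
The reason for copying before negating is easiest to see on a tiny input. The sketch below is a plain-Python analogue that works on nested lists instead of a NumPy array (an assumption made to avoid extra dependencies) and uses only the first two columns of the two rows shown above.

```python
def negate_second_column(rows):
    """Return a copy of `rows` with the second column negated; the input is left untouched."""
    return [[row[0], -row[1], *row[2:]] for row in rows]

# First two columns of the two rows shown on the slide (checking_status, duration).
X = [[1, 6], [0, 48]]
Z = negate_second_column(X)

print(Z)   # transformed copy
print(X)   # original data, unchanged
```

Leaving the input untouched matters inside a pipeline, because the same training data may be passed through the transformer many times during cross-validation.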

Production ready!

Iterating without overfitting
Cross-validation results
import pandas as pd

grid_search = GridSearchCV(pipe, params, cv=3, return_train_score=True)
gs = grid_search.fit(X_train, y_train)
results = pd.DataFrame(gs.cv_results_)

results[['mean_train_score', 'std_train_score',
         'mean_test_score', 'std_test_score']]

   mean_train_score  std_train_score  mean_test_score  std_test_score
0             0.829            0.006            0.735           0.009
1             0.829            0.006            0.725           0.009
2             0.961            0.008            0.716           0.019
3             0.981            0.005            0.749           0.024
...

Cross-validation results
   mean_train_score  std_train_score  mean_test_score  std_test_score
0             0.829            0.006            0.735           0.009
1             0.829            0.006            0.725           0.009
2             0.961            0.008            0.716           0.019
3             0.981            0.005            0.749           0.024
4             0.986            0.003            0.728           0.009
5             0.995            0.002            0.751           0.008

Observations:

Training score much higher than test score.

The standard deviation of the test score is large.

Detecting overfitting
CV Training Score >> CV Test Score

overfitting in model fitting stage

reduce complexity of classifier

get more training data

increase cv number

CV Test Score >> Validation Score

overfitting in model tuning stage

decrease cv number

decrease size of parameter grid
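
The two comparisons above can be turned into a quick check on a set of scores. This is only a sketch: the function name diagnose and the 0.05 gap used to stand in for ">>" are assumptions, and a sensible threshold depends on the metric and on the fold-to-fold standard deviation.

```python
def diagnose(train_score, cv_test_score, validation_score=None, gap=0.05):
    """Flag the two overfitting patterns; `gap` is an assumed threshold for '>>'."""
    findings = []
    if train_score - cv_test_score > gap:
        findings.append("fitting stage: CV training score >> CV test score")
    if validation_score is not None and cv_test_score - validation_score > gap:
        findings.append("tuning stage: CV test score >> validation score")
    return findings

# Row 5 of the cross-validation table shown earlier: train 0.995, test 0.751.
print(diagnose(0.995, 0.751))

# Hypothetical scores with a held-out validation set, for illustration only.
print(diagnose(0.80, 0.78, validation_score=0.65))
```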

"Expert in CV" in your CV!

Dataset shift
What is dataset shift?
elec dataset:

2 years' worth of data.

class=1 means the price went up relative to the last 24 hours; class=0 means it went down.

   day    period  nswprice  ...  vicdemand  transfer  class
0    2  0.000000  0.056443  ...   0.422915  0.414912      1
1    2  0.553191  0.042482  ...   0.422915  0.414912      0
2    2  0.574468  0.044374  ...   0.422915  0.414912      1

[3 rows x 8 columns]

What is shifting exactly?

Windows
Sliding window:

sliding_window = elec.loc[(t_now - window_size + 1):t_now]

Expanding window:

expanding_window = elec.loc[0:t_now]
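
The two window definitions differ only in where they start. The sketch below reproduces them on a plain list of row labels (a stand-in for the DataFrame index, with made-up values for t_now and window_size). One thing to keep in mind is that .loc label slices include both endpoints, whereas Python list slices exclude the stop index.

```python
rows = list(range(12))   # stand-in for the row labels of a time-ordered DataFrame
t_now = 9
window_size = 4

# .loc slices by label and includes both endpoints, so the sliding window
# covers labels t_now - window_size + 1 .. t_now. Python list slices exclude
# the stop index, hence the + 1 below.
sliding_window = rows[t_now - window_size + 1 : t_now + 1]
expanding_window = rows[0 : t_now + 1]

print(sliding_window)     # the most recent window_size rows
print(expanding_window)   # every row from the start up to t_now
```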

Dataset shift detection
# t_now = 40000, window_size = 20000
clf_full = RandomForestClassifier().fit(X, y)
clf_sliding = RandomForestClassifier().fit(sliding_X, sliding_y)

# Use future data as test
test = elec.loc[t_now:elec.shape[0]]
test_X = test.drop(columns='class')
test_y = test['class']

roc_auc_score(test_y, clf_full.predict(test_X))
roc_auc_score(test_y, clf_sliding.predict(test_X))

0.775
0.780
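
The logic of the comparison (train once on everything, once on a recent window, then score both on later data) can be illustrated with a deliberately simple stand-in. The sketch below swaps the random forest for a majority-class predictor and AUC for accuracy, and uses a made-up label stream whose class balance flips; all of these are assumptions for illustration.

```python
from collections import Counter

def majority_label(labels):
    """Toy 'classifier': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(y_true, prediction):
    """Fraction of labels matched by a constant prediction."""
    return sum(y == prediction for y in y_true) / len(y_true)

# Made-up label stream whose class balance flips partway through.
history = [0] * 60 + [1] * 30          # everything observed up to t_now
future = [1] * 8 + [0] * 2             # data arriving after t_now
window_size = 20

pred_full = majority_label(history)                    # trained on all history
pred_sliding = majority_label(history[-window_size:])  # trained on the recent window

acc_full = accuracy(future, pred_full)
acc_sliding = accuracy(future, pred_sliding)
print(acc_full, acc_sliding)
```

When the two models score very differently on future data, as they do here, that is a hint that the older part of the history no longer resembles what the model will see.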

Window size
for w_size in range(10, 100, 10):
    sliding = arrh.loc[(t_now - w_size + 1):t_now]
    X = sliding.drop(columns='class')
    y = sliding['class']
    clf = GaussianNB()
    clf.fit(X, y)
    preds = clf.predict(test_X)
    print(w_size, roc_auc_score(test_y, preds))

Domain shift
arrhythmia dataset:

   age  sex  height  ...  chV6_TwaveAmp  chV6_QRSA  chV6_QRSTA  class
0   75    0     190  ...            2.9       23.3        49.4      0
1   56    1     165  ...            2.1       20.4        38.8      0
2   54    0     172  ...            3.4       12.3        49.0      0
3   55    0     175  ...            2.6       34.6        61.6      1
4   75    0     190  ...            3.9       25.4        62.8      0

[5 rows x 280 columns]

More data is not always better!
