A Practical and Technical Introduction To Machine Learning
[Figure: Traditional programming takes rules and data and produces output; machine learning takes data and output and produces rules.]
What is machine learning?
● Machine learning is a subset of Artificial Intelligence that enables
computers to learn (progressively improve performance on tasks) from
data (examples, experience) without explicit (rule-based) programming and
make predictions or decisions.
● Key ideas: learning autonomously from examples; recognizing patterns and extracting insights from data; a training (learning) phase followed by a test (prediction, evaluation) phase.
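The contrast between a hand-written rule and a rule learned from data can be sketched as follows. This is a minimal illustration, not a realistic spam filter: the feature (number of links), the hand-picked threshold, and the toy dataset are all hypothetical.

```python
# Traditional programming: a human writes the rule.
def rule_based_is_spam(num_links):
    return num_links > 3  # hand-picked threshold (hypothetical)


# Machine learning: the rule (here, a single threshold) is derived from
# labeled examples instead of being written by hand.
def learn_threshold(examples):
    """Return the candidate threshold with the fewest training errors."""
    best_threshold, best_errors = None, None
    for candidate in sorted({x for x, _ in examples}):
        errors = sum((x > candidate) != label for x, label in examples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Toy labeled data: (number of links, is_spam). Purely illustrative.
training_data = [(0, False), (1, False), (2, False),
                 (5, True), (7, True), (9, True)]

# Training time: learn the rule from examples.
threshold = learn_threshold(training_data)


# Test time: apply the learned rule to a new input.
def learned_is_spam(num_links):
    return num_links > threshold


print(threshold, learned_is_spam(4))
```

The learned threshold depends only on the examples supplied, so changing the data changes the rule without touching the code.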
[Figure: the model maps data to a prediction (model output); an objective compares the prediction with the label (ground truth).]
[Figure: machine learning project lifecycle: problem framing, data collection, data wrangling, data analysis, model training/evaluation, model deployment.]
❏ Data transformation (convert non-numeric features to numeric, resize input to fixed size)
- Numeric: Normalization (range scaling, log-scaling, clipping, z-score or
standardization), binning (equally-spaced, quantile-based)
- Categorical: one-hot encoding, tokenization
❏ Transform within a pipeline (beware data leakage)
❏ Data cleaning (missing values, imputation)
❏ Feature engineering (determining which features are important for training and creating
them from raw data)
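The wrangling steps above can be sketched in plain Python. The point of the sketch is the leakage warning: every statistic (imputation value, mean, standard deviation, category vocabulary) is computed from the training split only and then reused on the test split. The column names, values, and helper functions are hypothetical and chosen only for illustration; a library pipeline would normally handle this.

```python
from statistics import mean, median, pstdev


def fit_imputer(values):
    """Learn the fill value (median of the observed entries)."""
    return median(v for v in values if v is not None)


def impute(values, fill):
    return [fill if v is None else v for v in values]


def fit_standardizer(values):
    """Learn mean and standard deviation for z-score scaling."""
    sigma = pstdev(values)
    return mean(values), sigma if sigma > 0 else 1.0


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def fit_one_hot(categories):
    """Learn the category vocabulary."""
    return sorted(set(categories))


def one_hot(value, vocab):
    # Categories unseen during fitting map to the all-zero vector.
    return [1 if value == c else 0 for c in vocab]


# Hypothetical raw data, already split into train and test.
train_age = [20.0, None, 40.0, 30.0]
test_age = [50.0, None]
train_color = ["red", "green", "red"]

# Fit on the training split only (avoids data leakage).
fill = fit_imputer(train_age)
train_imputed = impute(train_age, fill)
mu, sigma = fit_standardizer(train_imputed)
vocab = fit_one_hot(train_color)

# Apply the fitted transformations to both splits.
train_scaled = standardize(train_imputed, mu, sigma)
test_scaled = standardize(impute(test_age, fill), mu, sigma)

print(train_scaled, test_scaled, one_hot("red", vocab))
```

Fitting the imputer or scaler on the full dataset before splitting would let test-set statistics influence training, which is the leakage the slide warns about.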
Machine learning project lifecycle
1. Problem framing
Express the problem within the business context and emphasize its value
Decide whether the problem is solvable without ML; assess cost-benefit, feasibility, and data requirements
Define the problem technically and choose a performance measure
Prepare the environment
2. Data collection
Make sure data is representative of production use cases
Reduce sampling bias
Data annotation strategy if required
Split the data for evaluation
3. Data analysis
EDA (summary statistics, visualisations, identify outliers)
Extract insights from data
4. Data preparation
Data cleaning and formatting (imputation, encoding, standardization)
Feature engineering
5. Model training and evaluation
Build an end-to-end pipeline that can be tested
Start with simple models and find strong baselines
Model selection and hyperparameter tuning
Error analysis
6. Model deployment and maintenance
Pipeline integration
Monitoring and regression testing
7. Presentation
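Two of the steps above, splitting the data for evaluation and starting from a simple baseline, can be sketched as follows. This is a minimal sketch under stated assumptions: the toy dataset, the 80/20 split, the fixed seed, and the majority-class baseline are illustrative choices, not prescribed by the lifecycle.

```python
import random
from collections import Counter


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_baseline(train):
    """Simplest possible model: always predict the most common label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def accuracy(model, data):
    """Performance measure: fraction of examples predicted correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical dataset: (feature, label) pairs, one third positive.
data = [(i, i % 3 == 0) for i in range(30)]

train, test = train_test_split(data)
baseline = majority_baseline(train)

print(len(train), len(test), accuracy(baseline, test))
```

Any later model should beat the baseline's score on the held-out split; if it does not, the added complexity is not yet paying for itself.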
Terminology
Data
❖ Features
❖ Examples
❖ Labels
❖ Dataset
● Supervised learning: Examples are labeled. The goal is to find a model that
predicts y from x.
❏ Classification: Label is a category.
❏ Regression: Label is a real number.
● Unsupervised learning: Examples are unlabeled.
● Reinforcement learning
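The difference between classification and regression lies only in the type of the label. As a rough illustration, the same predictor can serve both; here a 1-nearest-neighbor rule on a single numeric feature is assumed, and the datasets are made up for the example.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training example whose feature is closest to x."""
    _, label = min(train, key=lambda example: abs(example[0] - x))
    return label


# Labeled examples (x, y). Classification: y is a category.
classification_data = [(1.0, "cat"), (2.0, "cat"), (8.0, "dog"), (9.0, "dog")]

# Labeled examples (x, y). Regression: y is a real number.
regression_data = [(1.0, 10.5), (2.0, 12.0), (8.0, 30.0), (9.0, 33.5)]

predicted_class = nearest_neighbor_predict(classification_data, 7.5)
predicted_value = nearest_neighbor_predict(regression_data, 1.2)

print(predicted_class, predicted_value)
```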
Supervised learning
Data
❖ Features
❖ Examples (in sample space): $x \in \mathcal{X}$
❖ Labels (in label space): $y \in \mathcal{Y}$
❖ Labeled dataset: $D = \{(x_i, y_i)\}_{i=1}^{n}$
● The goal is to find f with a small expected loss or risk (generalization error):
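In standard notation, the risk and its empirical counterpart on n labeled examples can be written as below. The loss function $\ell$, the data distribution $P$, and the symbols $R$ and $\hat{R}_n$ are assumptions introduced here for illustration; they are not taken from the slides.

```latex
R(f) = \mathbb{E}_{(x, y) \sim P}\bigl[\ell(f(x), y)\bigr],
\qquad
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
```

The risk cannot be computed because $P$ is unknown, so training minimizes the empirical risk and evaluation on held-out data estimates the risk.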
[Figure: the excess risk over the Bayes error splits into approximation error and estimation error.]
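The decomposition can be written out as follows, as a sketch in standard notation. The symbols are assumptions introduced here: $R(f)$ is the risk (expected loss) of $f$, $\hat{f}$ is the function returned by the learning algorithm, $F$ is the hypothesis set, and $R^{*}$ is the Bayes error, the smallest risk achievable by any function.

```latex
R(\hat{f}) - R^{*}
= \underbrace{R(\hat{f}) - \inf_{f \in F} R(f)}_{\text{estimation error}}
+ \underbrace{\inf_{f \in F} R(f) - R^{*}}_{\text{approximation error}}
```

A richer hypothesis set $F$ typically lowers the approximation error but raises the estimation error, since more data is needed to pick a good $f$ from a larger set.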
● Infinite F (Vapnik-Chervonenkis)
Let F be a hypothesis set with finite VC dimension dVC. Then, for all δ>0, with probability at least 1-δ, the risk of every f in F is bounded by its empirical risk plus a complexity term that grows with dVC and shrinks with the number of examples.
VC theory measures the “effective” size of the class, that is, the size of the projection of the class onto finite observations.
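One commonly cited form of such a bound, for binary classification with the 0-1 loss, is shown below. It is a textbook variant whose constants differ between sources, so it should be read as an illustration rather than the exact statement from the slides. The symbols $R$ (risk), $\hat{R}_n$ (empirical risk on $n$ examples), and $d_{VC}$ (VC dimension of $F$) are assumed notation.

```latex
\text{With probability at least } 1 - \delta, \text{ for all } f \in F:
\quad
R(f) \le \hat{R}_n(f)
+ \sqrt{\frac{2\, d_{VC} \log\frac{e n}{d_{VC}}}{n}}
+ \sqrt{\frac{\log\frac{1}{\delta}}{2n}}
```

The bound holds uniformly over $F$, and the gap between risk and empirical risk shrinks at a rate of roughly $\sqrt{d_{VC} \log n / n}$.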
Regularization
Complexity, capacity, richness, expressivity