IU 3.6.4 – Machine Learning 101
RISE 2.0
SEP 2022
13 hours
Contents
Agenda for Today
• Where Are We in the Journey & Learning Objectives 5 mins
• Course Intro + Machine Learning Techniques 25 mins
• Chapter 1 – Linear Regression 3.5 hrs.
• Chapter 2 – Logistic Regression 3 hrs.
• Chapter 3 – Clustering 3 hrs.
• Chapter 4 – Recommender Systems 3 hrs.
1
Where are we in the learning journey?
[Figure: learning journey timeline. Unit labels shown: IU 1.0, IU 2.0, IU 3.0]
• Orientation (1 week)
• Business Essentials (1 week)
• Digital Essentials (1 week)
• Business and Data Analytics core (14 weeks)
• Capstone (4 weeks)
IU 3.6.4 – Machine Learning 101
Career Development Journey: 5 Career Spotlights + 4 Career Buddies
Leadership and Personal Development Journey: Networking + Enrichment Sessions
2
Re-Cap
In the previous session, we covered the following topics:
• Why Machine Learning matters
• Introduction to machine learning techniques
• Supervised and unsupervised learning
Note: The following slides repeat the previous session. The trainer can move directly to the Jupyter
notebooks and come back for the project de-briefing
3
Learning objectives
• Understand the importance & applications of machine learning
• Overview of different machine learning techniques & associated steps
• Practice the use of machine learning for varied business applications
4
Course Introduction
5
Machine Learning
Machine learning is a data analytics technique that teaches computers to do what comes naturally to humans and animals: learn from experience
Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model
The algorithms adaptively improve their performance as the number of samples available for learning increases
* More reference on the external link library
6
Why does machine learning matter?
With the rise in big data, machine learning has become a key technique for solving problems in areas such as:
Retail & CPG
Understanding the future potential demand and sales for products is a key task for any retailer: it helps to plan inventory better, cut down on production of unnecessary products, and decide pricing strategy.
• Refer here for an example of price forecasting
Manufacturing
The monitoring of manufacturing equipment is vital to any industrial process. Sometimes it is critical that equipment be monitored in real time for faults and anomalies, to prevent damage and to correlate equipment faults with production line issues. Fault detection is the precursor to predictive maintenance.
• Refer here for more details
Natural language processing
Natural language processing (NLP) is about developing applications and services that are able to understand human languages.
• Refer here for more details on NLP
7
Why does machine learning matter?
With the rise in big data, machine learning has become a key technique for solving problems in areas such as:
Computational finance
Computational finance is also sometimes referred to as "financial engineering," "financial mathematics," "mathematical finance," or "quantitative finance." It uses the tools of mathematics, statistics, and computing to solve problems in finance like credit scoring and algorithmic trading.
• Refer here for basic understanding about credit risk models
• Refer here for more about trading
Image processing & computer vision
Image processing & computer vision performs operations on an image in order to get an enhanced image or to extract some useful information.
• Refer here for basic understanding about facial recognition and a quick tutorial
• Refer here to understand more about motion detection
Computational biology
Can be used for tumor detection, drug discovery, and DNA sequencing.
• Refer here for more on tumor detection or try GitHub
• Refer here to understand more about DNA sequencing
8
Course Outline
9
Course Outline (I/II)
Chapter 1 (3.5 hours)
• Types of Regression
• Linear Regression model
• Model Training, Evaluation and Validation
Chapter 2 (3 hours)
• Logistic Regression model
• Model Training, Iteration and Validation
• Model Fit Statistic
• Class Imbalance
10
Course Outline (II/II)
Chapter 3 (3 hours)
• Supervised vs Unsupervised
• Clustering model
• Common Methods: K-means
Chapter 4 (3 hours)
• Recommender Systems
• Common methods: Association rules learning, Market Basket Analysis, Content-based recommendation
11
Introduction to machine
learning techniques
12
When should we use machine learning?
We consider using machine learning algorithms when we have a complex task or problem involving a large amount of
data and lots of variables, but no existing formula or equation.
For example, machine learning is a good option if you need to handle situations like these:
13
Three conditions must be met to apply machine learning to a problem
A pattern must exist in the input data that would help to arrive at a conclusion
• For instance, if we concluded the product reviews are random and do not offer any meaning, then it would be difficult to arrive at a decision by using them
• To solve a problem with machine learning, the machine learning algorithm must have a pattern to infer from
There must exist an ample amount of data (examples, samples) to apply machine learning to a problem
• For instance, if there are no product reviews for the webcam, it will be difficult to arrive at a decision whether or not to buy the product
• Handling these situations requires simplifying the hypotheses & models (use non-parametric approaches). *
The behavior in the problem can be formulated as a mathematical expression
• Machine learning is used to derive meaning from the data and perform “structured learning” to arrive at a mathematical approximation to describe the behavior of the problem
* More reference on the external link library
14
How Does Machine Learning Work?
Process Flow of Machine Learning
15
How Does a Machine Learning Algorithm Work?
A machine learning algorithm performs a learning task where it either:
Understands relationships between input & an output
• Given input data x & an output Y, the machine learning algorithm tries to find a relationship between x & Y, which can be represented as: Y = f(x)
• The goal of the machine learning algorithm is to learn the properties of this target function f, based on the given data x
Identifies intrinsic patterns in input data
• The machine learning algorithm tries to find underlying structure or distributions in the data x
• Since there is no output Y defined, there are no perfect answers
* More reference on the external link library
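As a minimal sketch of learning a target function f from data, the example below fits a straight line to a handful of (x, Y) pairs with ordinary least squares. The data values and the names fit_line and f_hat are illustrative assumptions, not part of the course material, and a straight line is only one possible form of f.

```python
# Hypothetical data generated from Y = 2x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Estimate slope and intercept of Y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)


def f_hat(x):
    """Learned approximation of the target function f."""
    return slope * x + intercept


# Use the learned function on an input that was not in the data.
print(slope, intercept, f_hat(5.0))
```

Here the algorithm never sees the rule Y = 2x + 1; it recovers the slope and intercept from the examples alone.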
16
Overview of different techniques
There are five main categories of machine learning techniques used in industry
17
Supervised & unsupervised
learning
18
What is supervised learning?
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an
algorithm to learn the mapping function from the input to the output: Y = f(x)
The goal is to approximate the mapping function so well that when you have new input data (x), you
can predict the output variable (Y) for that data.
It is called supervised learning because the process of an algorithm learning from the training dataset can
be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm
iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when
the algorithm achieves an acceptable level of performance.
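The teacher analogy can be sketched as a loop: predict, compare with the known answers, correct, and stop once the error is acceptable. The sketch below uses gradient descent on a one-parameter model Y = w * x. Gradient descent, the data, the learning rate and the tolerance are all assumptions chosen for illustration; the slides do not prescribe a particular algorithm.

```python
# Hypothetical training set where the known "correct answers" follow Y = 3x.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 6.0, 9.0, 12.0]


def train(xs, ys, learning_rate=0.01, tolerance=1e-6, max_epochs=10_000):
    """Fit Y = w * x by gradient descent, stopping when the error is acceptable."""
    w = 0.0
    for epoch in range(max_epochs):
        # Predict on the training data and compare with the correct answers.
        errors = [w * x - y for x, y in zip(xs, ys)]
        mse = sum(e * e for e in errors) / len(xs)
        if mse < tolerance:  # acceptable level of performance reached
            break
        # The "correction": move w against the gradient of the mean squared error.
        gradient = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        w -= learning_rate * gradient
    return w, epoch, mse


w, epochs, mse = train(train_x, train_y)
print(w, epochs, mse)
```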
19
Supervised learning problems
Supervised learning problems can be further grouped into regression and classification problems:
Regression Vs Classification
• Classification: A classification problem is when the output variable is a discrete category, such as “will
a customer default on a loan payment or not?” or “was a transaction anomalous or not?” or "is the growth
on the brain shown in an MRI scan a tumor or not?"
• Regression: A regression problem is when the output variable is a continuous real value, such as “estimating future
demand of a product” or “predicting revenue based on advertising spend”.
20
Regression Vs Classification algorithms
21
What is unsupervised learning?
In unsupervised learning, we only have input data (x) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order
to learn more about the data.
It is called unsupervised learning because, unlike supervised learning above, there are no correct
answers and there is no teacher. Algorithms are left to their own devices to discover and present the
interesting structure in the data.
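One common way to discover structure without labels is clustering. The sketch below runs k-means (Lloyd's algorithm) on one-dimensional data. The spend values, the starting centroids and the function name kmeans_1d are hypothetical, and real data would normally have several features.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer; no labels are provided.
spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = kmeans_1d(spend, centroids=[10.0, 50.0])
print(centroids, clusters)
```

No output variable is supplied; the two groups emerge from the distances between the values alone.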
22
Unsupervised learning problems
Unsupervised learning problems can be further grouped into clustering and association mining problems:
Clustering Vs Association
• Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as "grouping
customers by purchasing behavior & demographic features"
• Association Mining: An association rule learning problem is where you want to discover rules that describe large
portions of your data, such as "if a customer bought milk, which other products would he/she likely buy?"
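The milk question above can be made concrete with the two standard association-rule measures, support and confidence. The baskets below are invented for illustration; the helper names support and confidence are this sketch's own.

```python
# Hypothetical shopping baskets (illustrative data only).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Rule "milk -> bread": how often do milk buyers also buy bread?
milk_support = support({"milk"}, baskets)
milk_to_bread = confidence({"milk"}, {"bread"}, baskets)
print(milk_support, milk_to_bread)
```

In this toy data, milk appears in four of the five baskets and three of those four also contain bread, so the rule has a confidence of about 0.75.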
23
Clustering Vs Association algorithms
24
Choosing between supervised & unsupervised ML
25
The basic steps in using any machine learning technique
Step 1 - Identify if we have a target variable
Step 2 - Identify if the target variable is continuous or categorical (not valid if there is no target variable)
Step 3 - Identify the independent features which can explain the target variable
Step 4 - Make necessary transformations of data
Step 5 - Perform modeling based on data characteristics
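Steps 1 and 2 amount to a small decision rule, sketched below. The function name suggest_technique and the cut-off of 10 distinct values for treating a numeric target as continuous are assumptions made for this example; in practice the decision comes from understanding what the target represents.

```python
def suggest_technique(target):
    """Map steps 1 and 2 to a family of techniques.

    target: list of target values, or None when there is no target variable.
    """
    # Step 1: no target variable means unsupervised learning.
    if target is None:
        return "unsupervised learning"
    # Step 2: decide whether the target looks continuous or categorical.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in target
    )
    many_distinct = len(set(target)) > 10  # arbitrary cut-off, an assumption
    if all_numeric and many_distinct:
        return "regression"
    return "classification"


print(suggest_technique(None))
print(suggest_technique(["yes", "no", "yes"]))
print(suggest_technique([i * 1.5 for i in range(20)]))
```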
26
Example 1 - Regression
Objective: Predict sales for every product-store combination for the next quarter
27
Example 1 - Regression
Objective: Predict sales for every product-store combination for the next quarter
Step 1 - Identify if we have a target variable
• From the data, we can observe that we have a target variable – sales
Step 2 - Identify if the target variable is continuous or categorical (not valid if there is no target variable)
• The target variable sales is continuous. This means we should go with regression
Step 3 - Identify the independent features which can explain the target variable
• Discount, visitor count, store area, holiday status can influence sales
Step 4 - Make necessary transformations of data
• For example, the holiday status can be changed from Yes / No to 1 / 0 so the algorithm can understand it
Step 5 - Perform modeling based on data characteristics
• After the data is cleaned & pre-processed, we can choose a regression algorithm based on how the data is structured
• If we observe a linear relationship between the response (sales) and the other independent features, we can choose
linear regression
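Steps 4 and 5 of this example can be sketched as follows: encode holiday status as 1 / 0, then check how linear the relationship between one feature and sales is using the Pearson correlation coefficient. The rows are made-up values, and using a correlation check as the linearity test is an assumption of this sketch, not a method named in the slides.

```python
from math import sqrt

# Hypothetical rows: (discount in %, holiday status, sales).
rows = [
    (0, "No", 100.0),
    (5, "No", 120.0),
    (10, "Yes", 150.0),
    (15, "No", 160.0),
    (20, "Yes", 190.0),
]

# Step 4: change holiday status from Yes / No to 1 / 0.
holiday = [1 if status == "Yes" else 0 for _, status, _ in rows]
discount = [float(d) for d, _, _ in rows]
sales = [s for _, _, s in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Step 5: a coefficient close to 1 or -1 suggests a linear relationship.
r = pearson(discount, sales)
print(holiday, round(r, 3))
```

A value of r near 1 for discount against sales would support choosing linear regression; a value near 0 would suggest trying other features or another algorithm.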
28
Example 2 - Classification
Objective: Predict if a customer will default on loan payment in the next year
29
Example 2 - Classification
Objective: Predict if a customer will default on loan payment in the next year
Step 1 - Identify if we have a target variable
• From the data, we can observe that we have a target variable - default status
Step 2 - Identify if the target variable is continuous or categorical (not valid if there is no target variable)
• The target variable default status is categorical (yes/no). This means we should go with classification
Step 3 - Identify the independent features which can explain the target variable
• Sex, Education, Income, previous default indicator, state of origin can influence the default behavior
Step 4 - Make necessary transformations of data
• For example, categorical data columns like sex, education, state can be changed to numerical values so the
algorithm can understand them better
Step 5 - Perform modeling based on data characteristics
• After the data is cleaned & pre-processed, we can choose a classification algorithm
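One way to carry out step 4 for a categorical column is one-hot encoding, sketched below with the standard library only. The education values and the function name one_hot are hypothetical; one-hot encoding is one of several possible encodings and is an assumption here.

```python
def one_hot(values):
    """Return (categories, rows), where each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical education column for four customers.
education = ["graduate", "school", "graduate", "postgraduate"]
categories, encoded = one_hot(education)

print(categories)
for original, row in zip(education, encoded):
    print(original, row)
```

Each category becomes its own 0/1 column, so the algorithm does not read an artificial order into labels such as state of origin.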
30
Course Deep Dive
31
Chapter 1 – Linear Regression
Exit to Demo Workbook
02
Chapter 2 – Logistic Regression
Exit to Demo Workbook
03
Chapter 3 - Clustering
Exit to Demo Workbook
04
Chapter 4 – Recommender Systems
Exit to Demo Workbook
Project Details post
Machine Learning
36
Project De-Brief
Identify the level of income qualification needed for the families in Latin America.
Points to note:
• Many social programs have a hard time ensuring that the right people are given enough aid.
• The client believes that new ML methods, beyond traditional econometrics, might help improve the model for this problem.
• The project involves these main tasks:
EDA: Identify the output variable.
EDA: Understand the type of data.
EDA: Check if there are any biases in your dataset.
EDA: Check whether all members of the house have the same poverty level.
EDA: Check if there is a house without a family head.
EDA: Set poverty level of the members and the head of the house within a family.
EDA: Count the null values in each column.
Data Cleaning: Remove null value rows of the target variable.
Modeling: Evaluate prediction accuracy using a random forest classifier and 2 other algorithms.
Modeling: Discuss parameter tuning and find the optimal parameter for each algorithm.
Modeling: Check the accuracy with cross validation.
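Three of the tasks above (counting nulls, removing rows with a null target, and splitting data for cross validation) can be sketched with plain Python. The records and the column names rooms, rent and target are hypothetical and are not taken from the project dataset; in the notebooks a dataframe library would typically do this work.

```python
# Hypothetical records; None marks a null value.
records = [
    {"rooms": 3, "rent": 120.0, "target": 4},
    {"rooms": None, "rent": 80.0, "target": 2},
    {"rooms": 2, "rent": None, "target": None},
    {"rooms": 4, "rent": None, "target": 1},
]


def count_nulls(records):
    """Count null (None) values per column."""
    counts = {}
    for row in records:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


null_counts = count_nulls(records)
# Data cleaning: remove rows where the target variable is null.
cleaned = [row for row in records if row["target"] is not None]
folds = list(k_fold_indices(len(cleaned), k=3))
print(null_counts, len(cleaned), folds)
```

Each fold serves once as the validation set while the remaining rows are used for training; averaging the accuracy over the folds gives the cross-validated estimate.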
37