Machine Learning using Matlab
Lecture 2: Linear regression
Concerned questions
● Collaboration
○ The maximum group size is 3.
○ Submit a technical report individually; it should describe the overall framework of your project and
your own contribution to it.
○ Your score will combine the quality of the whole project and your individual work.
● Project
○ A good technical report is composed of three parts: a novel idea, good writing, and code (good
code ≠ high score)
○ Project flowchart (next slide)
● Submit your group list and project proposal (up to one page) by the end of this
week.
Flowchart of an ML project
(Diagram: data collection and annotation → split into training data (60%), validation data (20%), and test data (20%) → machine learning model)
Linear regression with one variable
Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Goal: minimize J(θ_0, θ_1) over θ_0, θ_1
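A minimal MATLAB sketch of the hypothesis and cost function above; the toy data x, y and the candidate theta are illustrative assumptions, not values from the lecture:

```matlab
% One-variable linear regression: h_theta(x) = theta0 + theta1*x
x = [1; 2; 3; 4];            % m x 1 input feature (toy data, assumed)
y = [2.1; 3.9; 6.2; 8.1];    % m x 1 targets (toy data, assumed)
m = length(y);

theta = [0.5; 1.8];          % candidate parameters [theta0; theta1]

h = theta(1) + theta(2) * x;             % hypothesis on all m examples
J = (1 / (2 * m)) * sum((h - y) .^ 2);   % squared-error cost J(theta0, theta1)
```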
Gradient descent
● Given a function f(θ), our objective is to find the θ that minimizes f(θ)
● Repeat until convergence {
      θ_j := θ_j − α ∂/∂θ_j f(θ)
  }
Concern
● If α is too small, gradient descent can be slow; more iterations are needed
● If α is too large, gradient descent can overshoot the minimum; it may fail to
converge, or even diverge.
● May converge to a local minimum if the cost function is non-convex
● When we approach a local minimum, gradient descent will automatically take
smaller steps. So, no need to decrease α over time.
● “Batch”: all the training examples are used in each step of gradient descent
Gradient descent for linear regression
(one variable)
Repeat until convergence {
    θ_j := θ_j − α ∂/∂θ_j J(θ_0, θ_1)   for j = 0 and j = 1
}
Question: which algorithm is correct?
Repeat until convergence{
}
Repeat until convergence{
}
Correct
Gradient descent for linear regression
(one variable)
Repeat until convergence {
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})
    θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x^{(i)}
}   (update θ_0 and θ_1 simultaneously)
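A sketch of this update loop in MATLAB, reusing the same kind of toy data as above; the learning rate and iteration count are arbitrary illustrative choices:

```matlab
% Batch gradient descent for one-variable linear regression (toy data assumed)
x = [1; 2; 3; 4];  y = [2.1; 3.9; 6.2; 8.1];  m = length(y);
theta = zeros(2, 1);   % [theta0; theta1], initialized to zero
alpha = 0.01;          % learning rate (assumed value)

for iter = 1:1500
    h = theta(1) + theta(2) * x;          % predictions with the current theta
    grad0 = (1 / m) * sum(h - y);         % dJ/dtheta0
    grad1 = (1 / m) * sum((h - y) .* x);  % dJ/dtheta1
    theta(1) = theta(1) - alpha * grad0;  % simultaneous update: both gradients
    theta(2) = theta(2) - alpha * grad1;  % were computed with the old theta
end
```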
Linear regression with multiple variables
● n : number of features
● x^{(i)} : input features of the i-th training example
● x_j^{(i)} : value of feature j in the i-th training example
Area of site (1000 square feet) | Size of living place (1000 square feet) | Number of rooms | Age in years | Selling price
3.472 | 0.998 | 7 | 42 | 25.9
3.531 | 1.500 | 7 | 62 | 29.5
2.275 | 1.175 | 6 | 40 | 27.9
4.050 | 1.232 | 6 | 54 | 25.9
... | ... | ... | ... | ...
Hypothesis
● Hypothesis for multiple features: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n
● If we define x_0 = 1, then we have h_θ(x) = θ^T x
Linear regression with multiple variables
Hypothesis: h_θ(x) = θ^T x = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n, with x_0 = 1
Parameters: θ = [θ_0; θ_1; …; θ_n], an (n+1)-dimensional vector
Cost function: J(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Gradient descent:
● Repeat until convergence {
      θ_j := θ_j − α ∂/∂θ_j J(θ)   (simultaneously for every j = 0, …, n)
  }
Gradient descent for linear regression
(multiple variables)
Repeat {
    θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}   (simultaneously for j = 0, …, n)
}
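A vectorized MATLAB sketch of this update, using the rows of the housing table above as a tiny example; the learning rate and iteration count are assumptions (and in practice the features should first be normalized, as discussed below):

```matlab
% Vectorized batch gradient descent with multiple features
% X has a leading column of ones (x0 = 1); rows come from the housing table
X = [1 3.472 0.998 7 42;
     1 3.531 1.500 7 62;
     1 2.275 1.175 6 40;
     1 4.050 1.232 6 54];
y = [25.9; 29.5; 27.9; 25.9];
m = length(y);
theta = zeros(size(X, 2), 1);
alpha = 0.01;                                % learning rate (assumed)

for iter = 1:400
    grad  = (1 / m) * X' * (X * theta - y);  % all partial derivatives at once
    theta = theta - alpha * grad;            % simultaneous update of every theta_j
end
```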
Suggestions on gradient descent
● How to make sure gradient descent is working correctly?
● How to choose learning rate?
Make sure gradient descent is working correctly
● Ideal cost curve: the cost decreases sharply at first, then levels off as the iterations continue
● Declare convergence when the decrease in cost between two iterations is smaller than a
threshold (see the sketch below)
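One possible way to implement this check in MATLAB, reusing X, y, m, and alpha from the sketch above; the threshold value is an assumption:

```matlab
tol = 1e-6;                       % convergence threshold (assumed)
J_history = zeros(1000, 1);       % cost recorded at every iteration
theta = zeros(size(X, 2), 1);
for iter = 1:1000
    h = X * theta;
    J_history(iter) = (1 / (2 * m)) * sum((h - y) .^ 2);
    theta = theta - alpha * (1 / m) * X' * (h - y);
    if iter > 1 && (J_history(iter - 1) - J_history(iter)) < tol
        break;                    % cost barely decreased: declare convergence
    end
end
plot(1:iter, J_history(1:iter));  % should drop sharply, then flatten
xlabel('iteration'); ylabel('J(\theta)');
```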
Choosing the learning rate
● If α is too small, many iterations are needed; if α is too large, gradient descent may not converge
● To choose α, try several values spaced roughly by factors of 3–10 (e.g., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1) and compare the cost curves
(Plot: cost vs. number of iterations for an ideal α, an α that is too small, and an α that is too large)
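A short sketch that compares several candidate learning rates on the same X, y data from above; the grid of values is an assumed example:

```matlab
alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];   % candidate learning rates (assumed grid)
for a = alphas
    theta = zeros(size(X, 2), 1);
    J = zeros(100, 1);
    for iter = 1:100
        h = X * theta;
        J(iter) = (1 / (2 * m)) * sum((h - y) .^ 2);
        theta = theta - a * (1 / m) * X' * (h - y);
    end
    % A cost that blows up (Inf or NaN) signals that this alpha is too large
    fprintf('alpha = %.3f, final cost = %.4f\n', a, J(end));
end
```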
Feature normalization - intuition
● Each feature has a different scale, which can produce elongated, oval-shaped cost contours
● Result: more iterations to converge
● Solution: feature normalization
Feature normalization
● Feature scaling: x_j := (x_j − min_j) / (max_j − min_j), so every feature lies in a similar range
● Standard score: x_j := (x_j − μ_j) / σ_j, where μ_j and σ_j are the mean and standard deviation of feature j
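A MATLAB sketch of both normalization schemes applied to the feature columns of the housing table; the variable names are illustrative:

```matlab
% Normalize each feature column (without the x0 = 1 column)
X_raw = [3.472 0.998 7 42;
         3.531 1.500 7 62;
         2.275 1.175 6 40;
         4.050 1.232 6 54];

% Standard score (z-score): mean 0, standard deviation 1 per feature
mu    = mean(X_raw);
sigma = std(X_raw);
X_norm = (X_raw - mu) ./ sigma;        % implicit expansion (R2016b+); use bsxfun on older Matlab

% Feature scaling to the range [0, 1]
X_scaled = (X_raw - min(X_raw)) ./ (max(X_raw) - min(X_raw));
```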
Normal equation for linear regression
Before adding x_0:

Area of site (1000 square feet) | Size of living place (1000 square feet) | Number of rooms | Age in years | Selling price
3.472 | 0.998 | 7 | 42 | 25.9
3.531 | 1.500 | 7 | 62 | 29.5
2.275 | 1.175 | 6 | 40 | 27.9
4.050 | 1.232 | 6 | 54 | 25.9

After adding the x_0 = 1 column:

x_0 | Area of site (1000 square feet) | Size of living place (1000 square feet) | Number of rooms | Age in years | Selling price
1 | 3.472 | 0.998 | 7 | 42 | 25.9
1 | 3.531 | 1.500 | 7 | 62 | 29.5
1 | 2.275 | 1.175 | 6 | 40 | 27.9
1 | 4.050 | 1.232 | 6 | 54 | 25.9
Normal equation for linear regression
● Instead of gradient descent, we can obtain the optimal parameters in closed form:
  θ = (X^T X)^{−1} X^T y
● Matlab one-line code: pinv(X'*X)*X'*y
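Expanding that one-liner into a minimal, self-contained MATLAB sketch using the table above (the design matrix X includes the leading column of ones):

```matlab
% Closed-form solution via the normal equation (data from the table above)
X = [1 3.472 0.998 7 42;
     1 3.531 1.500 7 62;
     1 2.275 1.175 6 40;
     1 4.050 1.232 6 54];
y = [25.9; 29.5; 27.9; 25.9];

theta = pinv(X' * X) * X' * y;   % no learning rate, no iterations
% pinv returns a minimum-norm solution even when X'*X is non-invertible
```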
Gradient descent vs normal equation
Gradient descent
● Need to choose the learning rate α
● Needs many iterations
● Works well even when the number of features
n is very large
Normal equation
● No learning rate to choose
● No iterations
● Needs to compute (X^T X)^{−1}, which is slow when
n is very large, and X^T X may be non-invertible
Congratulations!
You have learnt your first ML model!
Questions?
Classification: logistic regression
Binary classification
● Examples:
○ Email: spam/not spam?
○ Tumor: malignant/benign?
○ Object: car/not car?
● y ∈ {0, 1}, where 1 is the positive class and 0 is the negative class, e.g., spam (1) vs.
not spam (0).
● Intuitively, the negative class conveys the absence of something
Hypothesis
● If h_θ(x) > 0.5, predict "y = 1"
● If h_θ(x) < 0.5, predict "y = 0"
● For classification we want the hypothesis output to satisfy 0 ≤ h_θ(x) ≤ 1
Logistic regression model
● h_θ(x) = g(θ^T x)
● Sigmoid/logistic function: g(z) = 1 / (1 + e^{−z})
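A minimal MATLAB sketch of the sigmoid and the resulting hypothesis; the parameter and feature values are illustrative assumptions:

```matlab
% Sigmoid / logistic function, applied element-wise
g = @(z) 1 ./ (1 + exp(-z));

% Logistic regression hypothesis h_theta(x) = g(theta' * x)
theta = [-3; 1];       % illustrative parameters (assumed)
x = [1; 4.2];          % x0 = 1 plus one feature value (assumed)
h = g(theta' * x);     % a number in (0, 1), read as P(y = 1 | x; theta)
```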
Interpretation of hypothesis output
● h_θ(x) is the estimated probability that y = 1, given x, "parameterized" by θ: h_θ(x) = P(y = 1 | x; θ)
● Example: given the tumor size, if we have h_θ(x) = 0.7, tell the patient
that there is a 70% chance of the tumor being malignant.
● As we only have two classes, we have: P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)
Logistic regression
● When h_θ(x) > 0.5, predict "y = 1"
● When h_θ(x) < 0.5, predict "y = 0"
Decision boundary
● h_θ(x) = g(θ^T x) ≥ 0.5 exactly when θ^T x ≥ 0
● Decision boundary: the set of points where θ^T x = 0, separating the region
predicted as y = 1 from the region predicted as y = 0
● Note: different ML models generate different decision boundaries
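For a linear boundary with two features, θ^T x = 0 can be solved for x_2 and plotted; the parameter values below are illustrative assumptions:

```matlab
% Linear decision boundary with two features: theta0 + theta1*x1 + theta2*x2 = 0
theta = [-6; 1; 1];                           % illustrative parameters (assumed)
x1 = linspace(0, 10, 100);
x2 = -(theta(1) + theta(2) * x1) / theta(3);  % solve theta'*x = 0 for x2
plot(x1, x2);                                 % points on one side are predicted y = 1
xlabel('x_1'); ylabel('x_2');
```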
Cost function
● A new representation of the cost function: J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^{(i)}), y^{(i)})
● In linear regression, it is given by: Cost(h_θ(x), y) = (1/2)(h_θ(x) − y)²
Cost function
● Logistic regression cost function:
  Cost(h_θ(x), y) = −log(h_θ(x))        if y = 1
  Cost(h_θ(x), y) = −log(1 − h_θ(x))    if y = 0
● Intuition: if h_θ(x) = 0 but y = 1, the learning algorithm is penalized by a
very large cost.
● We can compact the two cases into one equation:
  Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
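A vectorized MATLAB sketch of this compact cost on a toy data set; the data and parameters are illustrative assumptions:

```matlab
% Logistic regression cost over the whole training set (toy data assumed)
g = @(z) 1 ./ (1 + exp(-z));
X = [1 0.5; 1 2.3; 1 2.9; 1 0.1];   % m x 2 design matrix with x0 = 1
y = [0; 1; 1; 0];                    % labels in {0, 1}
theta = [-2; 1.5];                   % illustrative parameters (assumed)
m = length(y);

h = g(X * theta);
J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));   % compact two-case cost
```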