
Linear/Logistic Regression
Varol Kayhan, PhD
Agenda
• Regression recap
• Gradient Descent
• Stochastic gradient descent
• Batch/mini-batch gradient descent
• Polynomial regression
• Regularization
• L1
• L2
• ElasticNet
• Logistic Regression
Recall: Regression

y = β0 + β1x1 + β2x2 + … + βnxn

β0: Intercept
β1, …, βn: Beta coefficients
x1, …, xn: dimensions, features, variables, predictors
Recall: Regression
• Example: House prices with one variable
• x: Sqft
• y: Price

Sqft     Price
1,000    110,000
1,500    150,000
…        …

• Fit the best line through observations (by minimizing error)

[Plot: Price vs. Sqft — the fitted line has slope a/b = β1 and intercept β0]
Recall: Regression
• Unfortunately, life is not that simple
• We usually have more than one variable
• It is a "multi-dimensional" space
• (Impossible to visualize it)

y = β0 + β1x1 + β2x2 + β3x3 + … + βnxn
Regression

How can we find the BEST set of beta coefficients?


(This is also called "training the model" in machine learning)

1) The Normal Equation (closed-form solution): β̂ = (XᵀX)⁻¹Xᵀy

X: Training set
y: vector of output values
Computationally very costly (if there are lots of features, or lots of data)
2) Approximate solution using an "optimizer": Gradient Descent
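As a rough illustration (not from the slides), the normal equation can be computed directly with NumPy on a small synthetic data set:

import numpy as np

# Synthetic illustration: 100 houses with one feature (Sqft) and a noisy linear price
rng = np.random.default_rng(42)
X = rng.uniform(500, 3000, size=(100, 1))
y = 50_000 + 60 * X[:, 0] + rng.normal(0, 5_000, size=100)

X_b = np.c_[np.ones((100, 1)), X]                   # add a column of 1s for the intercept
beta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # (X^T X)^-1 X^T y
print(beta_hat)                                     # [beta_0, beta_1], roughly [50000, 60]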
Gradient Descent
• A generic optimization algorithm
• Tweak the parameters iteratively to minimize the cost function (SSE,
MSE, RMSE, etc.)

• Analogy:
• Lost in a mountain
• How do you get to the bottom? (i.e., "minimize")
• Feel the slope below your feet!
• Then, go in the direction of steepest (descending) slope
Gradient Descent

[Plot: cost as a function of a beta value — gradient descent steps downhill toward the minimum]
Gradient Descent - Mechanics
• Initialize the beta coefficients with random values
• Calculate the "gradient" of each beta coefficient
• Based on the gradient and "learning rate", change the beta coefficients.
• New value = Previous value – learning rate x "gradient value"
• If gradient is zero, minimum achieved!
Gradient Descent - Example

Price = β0 + β1·Age + β2·Sqft

Age    SqFt     Price
10     2,500    110,000
5      1,700    120,000
…      …        …

Iteration #1: Assign random values to each beta coefficient (β0, β1, β2), calculate the "cost"
Iteration #2: Adjust the coefficients, recalculate the cost
Iteration #3 and so forth: Repeat
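A minimal sketch of these iterations in Python (synthetic data; the learning rate and iteration count are illustrative choices, not values from the slides):

import numpy as np

# Synthetic, standardized features standing in for Age and Sqft
rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, size=m)

X_b = np.c_[np.ones((m, 1)), X]        # prepend 1s for the intercept (beta_0)
beta = rng.normal(size=3)              # iteration #1: random coefficients
learning_rate = 0.1

for _ in range(1000):                                  # iterations #2, #3, ...
    gradients = (2 / m) * X_b.T @ (X_b @ beta - y)     # gradient of MSE w.r.t. each beta
    beta = beta - learning_rate * gradients            # new value = old value - lr * gradient

print(beta)                            # approaches [3.0, 2.0, -1.5]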
Learning Rate
• Determines how fast to move in each iteration
• If too small: too many steps to converge
• Might not converge or take too long to converge
• If too large: jumps around and may never reach the minimum
Gradient Descent
• Two challenges:

[Plot: a cost curve with a local minimum on the left and a plateau on the right]
• If you start on the left, you will get stuck in the local minimum
• If you start on the right, you will get stuck on the plateau
Gradient Descent
• MSE is always shaped like a bowl (for linear regression)
• Only one global minimum
• No local minimum
• Slope doesn't change abruptly
• It is guaranteed to approach the global minimum
• However, always standardize your numeric variables
• Non-standardized variables change the shape of the bowl; you might end up with "plateaus" (see the sketch below)
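For example (a small sketch, assuming scikit-learn's StandardScaler; the numbers are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Sqft live on very different scales
X = np.array([[10, 2500.0],
              [ 5, 1700.0],
              [30,  900.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)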
Batch Gradient Descent
• Uses the “entire” training set to calculate MSE at every step
• Calculate MSE
• Calculate the gradients
• Update the beta coefficients (based on gradients)
• Repeat
• Can be very slow if you have a large data set
Stochastic Gradient Descent (SGD)
• Default algorithm in many libraries
• Works really well
• Stochastic means "random"
• Pick a random instance at every step
• Calculate the gradients for that instance only
• (Gradient decreases on average)
• Much faster than batch gradient descent
• Problem: it jumps around (even near the global minimum)
• Each update is instance-specific
• Might require more iterations to find the optimal solution
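A hedged sketch using scikit-learn's SGDRegressor on synthetic data (hyperparameter values are illustrative, not recommendations):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.uniform(500, 3000, size=(500, 1))
y = 50_000 + 60 * X[:, 0] + rng.normal(0, 5_000, size=500)

X_scaled = StandardScaler().fit_transform(X)   # standardize first, as discussed earlier
sgd = SGDRegressor(max_iter=1000, eta0=0.01, penalty=None, random_state=42)
sgd.fit(X_scaled, y)                           # updates the betas one instance at a time
print(sgd.intercept_, sgd.coef_)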
Stochastic Gradient Descent (SGD)
• Problem with SGD (for the example discussed earlier)

Iteration #1: Assign random values to each beta coefficient (β0, β1, β2), calculate the "cost" for ONE random observation
Iteration #2: Adjust the coefficients, recalculate the cost for another random observation
Iteration #3 and so forth: Repeat
(Because each step uses a single observation, a coefficient can move in the wrong direction on any given iteration)
Stochastic Gradient Descent (SGD)
• Works even if the cost function is not like a “bowl”
• It can get out of local minima
• Uses a "learning schedule" to gradually decrease the learning rate
• Increases the chances of converging on the optimal solution

[Plot: the learning schedule starts with a high learning rate and gradually lowers it over the iterations]
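A toy sketch of a learning schedule (the t0/t1 constants are made up for illustration):

def learning_schedule(t, t0=5, t1=50):
    """Return a learning rate that shrinks as iteration t grows."""
    return t0 / (t + t1)

for t in [0, 10, 100, 1000]:
    print(t, round(learning_schedule(t), 4))   # decays from 0.1 toward 0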


Mini-Batch Gradient Descent
• Calculates the gradient on a small subsample (i.e., mini-batch)
• Batch uses the entire data set
• Stochastic uses one instance at a time
• Mini-batch uses a subsample at a time
• Less erratic than stochastic gradient descent
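A minimal mini-batch variant of the earlier gradient descent loop (batch size and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(2)
m, batch_size = 1000, 32
X = rng.normal(size=(m, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, size=m)
X_b = np.c_[np.ones((m, 1)), X]

beta = rng.normal(size=3)
for _ in range(2000):
    idx = rng.choice(m, size=batch_size, replace=False)              # draw one mini-batch
    grad = (2 / batch_size) * X_b[idx].T @ (X_b[idx] @ beta - y[idx])
    beta -= 0.05 * grad                                              # same update rule as before

print(beta)   # close to [1.0, 2.0, -3.0]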
Polynomial Regression
• Regression with polynomial terms
• Used when you think the regression line is "curved"
Polynomial Regression
• Examples:
• One variable, first-degree polynomial (i.e., no polynomial term): y = β0 + β1x
• One variable, second-degree polynomial: y = β0 + β1x + β2x²
Polynomial Regression
• Example: Second-degree polynomial
• One variable (3 beta coefficients): y = β0 + β1x + β2x²
• Two variables (6 beta coefficients): y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2


Polynomial Regression
• Example: Third-degree polynomial
• One variable (4 beta coefficients): y = β0 + β1x + β2x² + β3x³
• Two variables (10 beta coefficients): the intercept plus every term up to degree 3 (x1, x2, x1², x1x2, x2², x1³, x1²x2, x1x2², x2³)

• Higher degrees generate lots of terms


• Models become difficult to train
• Models become more susceptible to overfit
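A small sketch of polynomial regression with scikit-learn: expand the features, then fit a plain linear regression on the expanded terms (data and degree are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.5, size=200)   # a "curved" relationship

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)              # columns: x, x^2
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)        # roughly 2 and [1, 0.5]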
Learning Curves
• Overfitting: Model performs well on training, but not on test
• Underfitting: Model performs badly on both training and test
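One way to see this in practice (a sketch using scikit-learn's learning_curve; the data and settings are illustrative): compare training and validation error as the training set grows.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, size=300)   # curved data, fit with a straight line

sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_mean_squared_error")

print(-train_scores.mean(axis=1))   # training error at each training-set size
print(-valid_scores.mean(axis=1))   # validation error; both high here -> underfitting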
Regularization:
• A technique to reduce overfitting
• A technique to "penalize the model complexity"
(so you don't learn too much)
• "Constrains" the weights (i.e., betas) of the model
• Two types:
• L2 Regularization (i.e., Ridge Regression)
• L1 Regularization (i.e., Lasso Regression)
• Both are performed by adding a new term to the cost function (during
training)
L2 Regularization (Ridge Regression)
• Goal: model simplicity
• Forces the algorithm to keep the beta coefficients as small as possible

• Cost function = MSE + α Σ βi² (the sum of squared beta coefficients)


L2 Regularization (Ridge Regression)
• Cost function = MSE + α Σ βi²
• The term alpha (α) controls the magnitude.
• If zero, then it is regular regression
• If too large, then all weights are very close to zero and you end up with the intercept only
  (i.e., a straight line through the mean)

[Diagram: alpha value axis — a very low alpha gives a complex model (overfitting); a very high alpha gives a very simple model (underfitting)]
L1 Regularization (Lasso Regression)
• Goal: model sparsity
• Cost function = MSE + α Σ |βi| (the sum of absolute beta coefficients)
• It eliminates the least important features/variables
(by setting their betas to zero)
• Automatically performs feature selection
Elastic Net
• Mix of L2 and L1
• Cost function = MSE + r·α Σ |βi| + (1 − r)·α Σ βi²
• Control the mix ratio using the term "r" in the cost function:
• 0 <= r <= 1
• r = 0 , then L2
• r = 1 , then L1
• Same as before: α (alpha) controls the magnitude of "regularization"
• α = 0, then no regularization (i.e., a regular regression model)
• Higher values mean more regularization
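A compact sketch of the three regularized models in scikit-learn (alpha and l1_ratio values are illustrative, not tuned):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=200)   # only 2 useful features

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks all betas toward zero
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: zeroes out unimportant betas
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 (r = l1_ratio)

print(ridge.coef_)
print(lasso.coef_)   # note the (near-)zero coefficients on the irrelevant features
print(enet.coef_)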
Early Stopping
• Another regularization technique: stop when validation error is
minimum

Is it baked into the algorithms? Yes and no
(sometimes, you have to write your own)
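A sketch of early stopping using SGDRegressor's built-in option (the parameter values shown are illustrative):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.2, size=500)

X_scaled = StandardScaler().fit_transform(X)
sgd = SGDRegressor(early_stopping=True,        # hold out part of the data as a validation set
                   validation_fraction=0.2,    # ...this fraction of it
                   n_iter_no_change=5,         # stop after 5 epochs without improvement
                   max_iter=1000, random_state=42)
sgd.fit(X_scaled, y)
print(sgd.n_iter_)    # epochs actually run before stopping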
Logistic Regression
• Regression for binary outcomes
• Works just like regular regression
• Output is probability of belonging to class 1 vs. 0
• No known closed-form solution for the beta coefficients
• Uses gradient descent
• Cost function is convex (shaped like a bowl), so a global minimum is guaranteed
• Can be regularized using both L1 and L2
Logistic Regression
• The output value is constrained between 0 and 1
• Logistic function: σ(t) = 1 / (1 + e^(−t))
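A minimal logistic regression sketch on a built-in binary data set (settings are illustrative; scikit-learn applies L2 regularization by default):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000)                  # L2-regularized by default
clf.fit(scaler.transform(X_train), y_train)

print(clf.predict_proba(scaler.transform(X_test[:3])))   # P(class 0) vs. P(class 1) per row
print(clf.score(scaler.transform(X_test), y_test))       # accuracy on the test split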
Softmax Regression
• Also known as "Multinomial Logistic Regression"
• Used for multi-class classification
• Finds probabilities of each class using the softmax function
• Class is assigned using the highest estimated probability
• Uses the cross-entropy cost function
• If classes = 2, reverts back to logistic regression
• Uses gradient descent for optimization
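A hedged sketch of softmax regression on the iris data set; recent scikit-learn versions apply the multinomial (softmax) formulation by default for multi-class problems with the lbfgs solver, and C is the regularization hyperparameter:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                    # 3 classes
softmax_clf = LogisticRegression(C=10, max_iter=1000)
softmax_clf.fit(X, y)

print(softmax_clf.predict_proba(X[:1]))   # one probability per class, summing to 1
print(softmax_clf.predict(X[:1]))         # the class with the highest probability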
Python Cheatsheet
• eta0: learning rate in gradient descent algorithms
• alpha (α): regularization hyperparameter (for both L2 and L1)
• l1_ratio: the mix ratio r in Elastic Net
• C: regularization hyperparameter for logistic/softmax regression (the inverse of the regularization strength)
Conclusion
• Regression recap
• Gradient Descent
• Stochastic gradient descent
• Batch/mini-batch gradient descent
• Polynomial regression
• Regularization
• L1
• L2
• ElasticNet
• Logistic Regression
