Lecture Slides - Linear Regression
Regression
Varol Kayhan, PhD
Agenda
• Regression recap
• Gradient Descent
• Stochastic gradient descent
• Batch/mini-batch gradient descent
• Polynomial regression
• Regularization
• L1
• L2
• ElasticNet
• Logistic Regression
Recall: Regression
y = β₀ + β₁x₁ + … + βₙxₙ
β₀: Intercept
β₁, …, βₙ: Beta coefficients
x₁, …, xₙ: dimensions, features, variables, predictors
Recall: Regression
• Example: House prices with one variable
• x: Sqft
• y: Price

Sqft      Price
1,000     110,000
1,500     150,000
…         …

[Figure: Price vs. Sqft scatter plot with a fitted line; slope = β₁ (rise a over run b), intercept = β₀]
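Below is a minimal Python sketch of this one-variable example. The first two rows come from the table above; the remaining rows (and therefore the printed numbers) are made-up illustrations.

import numpy as np

# Toy data: first two rows from the slide, last two rows invented for illustration
sqft = np.array([1000, 1500, 2000, 2500], dtype=float)
price = np.array([110000, 150000, 195000, 240000], dtype=float)

# np.polyfit with degree 1 returns [slope, intercept]
beta1, beta0 = np.polyfit(sqft, price, deg=1)
print(f"Intercept (beta0): {beta0:,.0f}")
print(f"Slope (beta1): {beta1:,.2f}")

# Prediction: Price = beta0 + beta1 * Sqft
print(f"Predicted price for 1,200 sqft: {beta0 + beta1 * 1200:,.0f}")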
Recall: Regression
• Unfortunately, life is not that simple
• We usually have more than one variable
• It is a "multi-dimensional" space
• (Impossible to visualize it)
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + … + βₙxₙ
Regression
X: Training set
y: vector of output values
1) Exact solution using the "Normal Equation": β = (XᵀX)⁻¹Xᵀy
• Computationally very costly (if there are lots of features, or lots of data)
2) Approximate solution using an "optimizer": Gradient Descent
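A short Python sketch of the exact (closed-form) solution, assuming made-up toy data; it also hints at why the optimizer route is preferred when the problem gets large.

import numpy as np

# Toy training set (hypothetical data): 100 observations, 2 features
rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = 4 + 3 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.1, 100)

# Exact solution with the Normal Equation: beta = (X'X)^-1 X'y
X_b = np.c_[np.ones((100, 1)), X]              # add a column of 1s for the intercept
beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(beta)                                    # roughly [4, 3, 5] -> beta0, beta1, beta2

# Inverting X'X gets expensive as the number of features grows,
# which is why an iterative optimizer (Gradient Descent) is used instead.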
Gradient Descent
• A generic optimization algorithm
• Tweak the parameters iteratively to minimize the cost function (SSE, MSE, RMSE, etc.)
• Analogy:
• Lost in a mountain
• How do you get to the bottom? (i.e., "minimize")
• Feel the slope below your feet!
• Then, go in the direction of steepest (descending) slope
Gradient Descent
[Figure: cost plotted against a beta value, with the minimum at the bottom of the curve]
Gradient Descent - Mechanics
• Initialize the beta coefficients with random values
• Calculate the "gradient" of each beta coefficient
• Based on the gradient and "learning rate", change the beta coefficients.
• New value = Previous value – learning rate x "gradient value"
• If gradient is zero, minimum achieved!
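A minimal batch gradient descent sketch for linear regression with an MSE cost, using toy data and a hypothetical learning rate; it implements the update rule above.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.1, 100)

X_b = np.c_[np.ones((100, 1)), X]      # add intercept column
m = len(X_b)

learning_rate = 0.1                    # hypothetical value
n_iterations = 1000

beta = rng.random(2)                   # step 1: random initial coefficients
for _ in range(n_iterations):
    # step 2: gradient of the MSE cost with respect to each beta
    gradients = (2 / m) * X_b.T @ (X_b @ beta - y)
    # step 3: new value = previous value - learning rate * gradient
    beta = beta - learning_rate * gradients

print(beta)                            # roughly [4, 3]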
Gradient Descent - Example
Iteration #1: Assign random values for each beta coefficient, calculate the "cost" of each
Iteration #2: Adjust the coefficients, recalculate the cost
Iteration #3 and so forth: Repeat
Learning Rate
• Determines how fast to move in each iteration
• If too small: too many steps to converge
• Might not converge or take too long to converge
• If too large: jumps around and may never reach the minimum
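A self-contained illustration of both failure modes, using a hypothetical one-dimensional cost J(β) = β² whose gradient is 2β.

# Minimize J(beta) = beta**2 (gradient = 2 * beta) with different learning rates
def run(learning_rate, steps=20):
    beta = 5.0
    for _ in range(steps):
        beta = beta - learning_rate * 2 * beta   # new value = old value - lr * gradient
    return beta

print(run(0.01))   # too small: after 20 steps beta is still far from the minimum at 0
print(run(1.10))   # too large: each update overshoots and |beta| keeps growing (diverges)
print(run(0.30))   # reasonable: beta ends up very close to 0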
Gradient Descent
• Two challenges:
• Every step uses the entire training set, which is slow when there is a lot of data
• The cost function may not be a regular "bowl", so the algorithm can get stuck in a local minimum
Stochastic Gradient Descent - Example
Iteration #1: Assign random values for each beta coefficient, calculate the "cost" using ONE random observation
Iteration #2: Adjust the coefficients, recalculate the cost for another random observation
Iteration #3 and so forth: Repeat
• Note: an individual step sometimes is going in the wrong direction (the cost temporarily increases), because each update is based on a single random observation
Stochastic Gradient Descent (SGD)
• Works even if the cost function is not like a “bowl”
• It can get out of local minima
• Uses a "learning schedule" to gradually decrease the learning rate
• Increases the chances of converging on the optimal solution
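A sketch of stochastic gradient descent on toy data; the learning-schedule constants t0 and t1 are hypothetical choices, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.1, 100)
X_b = np.c_[np.ones((100, 1)), X]      # add intercept column
m = len(X_b)

n_epochs = 50
t0, t1 = 5, 50                          # hypothetical learning-schedule constants

def learning_schedule(t):
    # gradually decrease the learning rate as training progresses
    return t0 / (t + t1)

beta = rng.random(2)                    # random initial coefficients
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)           # ONE random observation
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ beta - yi)
        eta = learning_schedule(epoch * m + i)
        beta = beta - eta * gradients

print(beta)                             # roughly [4, 3]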
Regularization
[Figure: effect of the alpha value; a very low alpha gives a complex model (overfitting), a very high alpha gives a very simple model (underfitting)]
L1 Regularization (Lasso Regression)
• Goal: model sparsity
• Cost function = MSE + α Σ |βᵢ|
• It eliminates the least important features/variables (by setting their betas to zero)
• Automatically performs feature selection
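A brief scikit-learn sketch on made-up data (the alpha value is a hypothetical choice) showing Lasso zeroing out coefficients of unhelpful features.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((200, 5))
# only the first two features actually matter
y = 4 + 3 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.1, 200)

lasso = Lasso(alpha=0.1)        # alpha controls the regularization strength
lasso.fit(X, y)
# the three irrelevant features are typically driven to exactly 0;
# the useful ones are kept, though shrunk toward 0
print(lasso.coef_)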
Elastic Net
• Mix of L2 and L1
• Cost function = MSE + r α Σ |βᵢ| + (1 − r) α Σ βᵢ²
• Control the mix ratio using the term "r" in the cost function:
• 0 <= r <= 1
• r = 0 , then L2
• r = 1 , then L1
• Same as before: α (alpha) controls the magnitude of "regularization"
• α = 0, then no regularization (i.e., a regular regression model)
• Higher values mean more regularization
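A similar sketch with scikit-learn's ElasticNet; note that scikit-learn calls the mix ratio r "l1_ratio" (the alpha and l1_ratio values here are hypothetical).

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 4 + 3 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.1, 200)

# l1_ratio is the mix ratio r: 0 -> pure L2 (Ridge), 1 -> pure L1 (Lasso)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)               # irrelevant features are shrunk heavily, typically to 0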
Early Stopping
• Another regularization technique: stop training when the validation error reaches its minimum
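One possible sketch of early stopping, assuming a scikit-learn SGDRegressor trained one epoch at a time with warm_start and a made-up train/validation split; the copy of the model with the lowest validation error is kept.

import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, 200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# warm_start=True: each call to fit() continues from the previous coefficients
sgd = SGDRegressor(max_iter=1, warm_start=True,
                   learning_rate="constant", eta0=0.01, tol=None)

best_val_error = float("inf")
best_model = None
for epoch in range(500):
    sgd.fit(X_train, y_train)                         # train one more epoch
    val_error = mean_squared_error(y_val, sgd.predict(X_val))
    if val_error < best_val_error:                    # validation error still decreasing
        best_val_error = val_error
        best_model = deepcopy(sgd)                    # remember the best model so far

print(best_val_error)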