Vectorization Of Gradient Descent
Last Updated: 24 Oct, 2020
In Machine Learning, regression problems can be solved in the following ways:
1. Using optimization algorithms - Gradient Descent
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Other advanced optimization algorithms (e.g., Conjugate Gradient)
2. Using the Normal Equation:
- A closed-form solution based on Linear Algebra (the standard formula is shown just below).
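For reference, the Normal Equation yields the least-squares parameters in closed form (the standard formula, stated here without derivation, where X is the design matrix with a leading column of ones and y is the target vector):
\theta=(X^{T}X)^{-1}X^{T}y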
Let's consider the case for Batch Gradient Descent for Univariate Linear Regression Problem.
The cost function for this regression problem is:
J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{i})-y^{i}\right)^{2}
Goal:
\underset{\theta_{0},\theta_{1}}{\text{minimize}}\ \ J(\theta)
To solve this problem, we can take either a vectorized approach (using Linear Algebra, i.e., matrix and vector operations) or an unvectorized approach (using for loops).
1. Unvectorized Approach:
Here, in order to evaluate the mathematical expressions below, we use for loops.
The following expression is the summation term of the cost function:
\sum_{i=1}^{m}\left(h_{\theta}(x^{i})-y^{i}\right)^{2}
The following expression is the hypothesis:
h_{\theta}=\theta_{0}x_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\dots+\theta_{n}x_{n}
where h_{\theta} is the hypothesis.
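For this univariate case, the two quantities the inner for loops accumulate (called Grad0 and Grad1 in the code below) are the sums
Grad_{0}=\sum_{i=1}^{m}\left(h_{\theta}(x^{i})-y^{i}\right)\qquad Grad_{1}=\sum_{i=1}^{m}\left(h_{\theta}(x^{i})-y^{i}\right)x^{i}
and the update step multiplies each by α/m; these sums are the partial derivatives of J(θ) with respect to θ0 and θ1, scaled by m.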
Code: Python implementation of the unvectorized Gradient Descent approach
python3
# Import required modules.
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
import time

# Create and plot the data set.
x, y = make_regression(n_samples=100, n_features=1,
                       n_informative=1, noise=10, random_state=42)

plt.scatter(x, y, c='red')
plt.xlabel('Feature')
plt.ylabel('Target_Variable')
plt.title('Training Data')
plt.show()

# Convert y from a 1d to a 2d array.
y = y.reshape(100, 1)

# Number of iterations for Gradient Descent.
num_iter = 1000

# Learning rate.
alpha = 0.01

# Number of training samples.
m = len(x)

# Initializing theta.
theta = np.zeros((2, 1), dtype=float)

# Temporary variables for the parameter updates and the accumulated gradients.
t0 = t1 = 0
Grad0 = Grad1 = 0

# Batch Gradient Descent.
start_time = time.time()
for i in range(num_iter):

    # Accumulate Gradient 0 over all training samples.
    for j in range(m):
        Grad0 = Grad0 + (theta[0] + theta[1] * x[j]) - y[j]

    # Accumulate Gradient 1 over all training samples.
    for k in range(m):
        Grad1 = Grad1 + ((theta[0] + theta[1] * x[k]) - y[k]) * x[k]

    # Simultaneously update both parameters.
    t0 = theta[0] - (alpha * (1 / m) * Grad0)
    t1 = theta[1] - (alpha * (1 / m) * Grad1)
    theta[0] = t0
    theta[1] = t1

    # Reset the accumulated gradients for the next iteration.
    Grad0 = Grad1 = 0

# Print the model parameters.
print('model parameters:', theta, sep='\n')

# Print the time taken for Gradient Descent to run.
print('Time Taken For Gradient Descent in Sec:', time.time() - start_time)

# Prediction on the same training set.
h = []
for i in range(m):
    h.append(theta[0] + theta[1] * x[i])

# Plot the output.
plt.plot(x, h)
plt.scatter(x, y, c='red')
plt.xlabel('Feature')
plt.ylabel('Target_Variable')
plt.title('Output')
Output:
model parameters:
[[ 1.15857049]
[44.42210912]]
Time Taken For Gradient Descent in Sec: 2.482538938522339
2. Vectorized Approach:
Here, in order to evaluate the same mathematical expressions, we use matrices and vectors (Linear Algebra).
The following expression is the summation term of the cost function:
\sum_{i=1}^{m}\left(h_{\theta}(x^{i})-y^{i}\right)^{2}
The following expression is the hypothesis:
h_{\theta}=\theta^{T}X\\ where\\ h_{\theta}=\text{hypothesis},\quad \theta=\begin{bmatrix}\theta_{0}\\ \theta_{1}\\ \theta_{2}\\ \theta_{3}\\ \vdots\\ \theta_{n}\end{bmatrix}\quad X=\begin{bmatrix}x_{0}\\ x_{1}\\ x_{2}\\ x_{3}\\ \vdots\\ x_{n}\end{bmatrix}
Batch Gradient Descent:
\text{Loop until convergence}\ \{\\ \quad \theta_{j}:=\theta_{j}-\frac{1}{m}\cdot\alpha\cdot Gradients_{j}\\ \}\\ \text{where } Gradients_{j}=\sum_{i=1}^{m}\left(h_{\theta}(x^{i})-y^{i}\right)x_{j}^{i}\ \left(=m\cdot\frac{\partial J(\theta)}{\partial \theta_{j}}\right)
Concept To Find Gradients Using Matrix Operations:
X\_New= \begin{bmatrix} {x_{0}^1} & {x_{1}^1} \\ {x_{0}^2} & {x_{1}^2}\\ {x_{0}^3} & {x_{1}^3}\\ {x_{0}^4} & {x_{1}^4}\\ \vdots & \vdots\\ {x_{0}^m} & {x_{1}^m} \end{bmatrix}_{m\times 2}\quad \theta= \begin{bmatrix} \theta_{0} \\ \theta_{1} \end{bmatrix}_{2\times 1}\quad \text{where } x_{0}^{i}=1
H(\theta)=X\_New\cdot\theta= \begin{bmatrix} \theta_{0}x_{0}^1+\theta_{1}x_{1}^1 \\ \theta_{0}x_{0}^2+\theta_{1}x_{1}^2\\ \theta_{0}x_{0}^3+\theta_{1}x_{1}^3\\ \theta_{0}x_{0}^4+\theta_{1}x_{1}^4\\ \vdots\\ \theta_{0}x_{0}^m+\theta_{1}x_{1}^m \end{bmatrix}_{m\times 1}\quad \text{and}\quad Y= \begin{bmatrix} y^1\\ y^2\\ y^3\\ \vdots\\ y^m \end{bmatrix}_{m\times 1}
H(\theta)-Y= \begin{bmatrix} \theta_{0}x_{0}^1+\theta_{1}x_{1}^1-y^1\\ \theta_{0}x_{0}^2+\theta_{1}x_{1}^2-y^2\\ \theta_{0}x_{0}^3+\theta_{1}x_{1}^3-y^3\\ \theta_{0}x_{0}^4+\theta_{1}x_{1}^4-y^4\\ \vdots\\ \theta_{0}x_{0}^m+\theta_{1}x_{1}^m-y^m \end{bmatrix}_{m\times 1}
X\_New^T= \begin{bmatrix} x_{0}^1\ x_{0}^2\ x_{0}^3\ \dots\ x_{0}^m\\ x_{1}^1\ x_{1}^2\ x_{1}^3\ \dots\ x_{1}^m \end{bmatrix}_{2\times m}
Gradients=X\_New^T\cdot(H(\theta)-Y)= \begin{bmatrix} x_{0}^1(\theta_{0}x_{0}^1+\theta_{1}x_{1}^1-y^1)+x_{0}^2(\theta_{0}x_{0}^2+\theta_{1}x_{1}^2-y^2)+x_{0}^3(\theta_{0}x_{0}^3+\theta_{1}x_{1}^3-y^3)+\dots\\ x_{1}^1(\theta_{0}x_{0}^1+\theta_{1}x_{1}^1-y^1)+x_{1}^2(\theta_{0}x_{0}^2+\theta_{1}x_{1}^2-y^2)+x_{1}^3(\theta_{0}x_{0}^3+\theta_{1}x_{1}^3-y^3)+\dots \end{bmatrix}_{2\times 1}
\text{Finally, we can say:}\quad Gradients=X\_New^T\cdot(X\_New\cdot\theta-Y)
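Substituting this expression into the update rule gives the single vectorized step that the code below repeats on every iteration:
\theta:=\theta-\frac{\alpha}{m}\cdot X\_New^{T}\cdot(X\_New\cdot\theta-Y)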
Code: Python implementation of the vectorized Gradient Descent approach
python3
# Import required modules.
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
import time
# Create and plot the data set.
x, y = make_regression(n_samples = 100, n_features = 1,
n_informative = 1, noise = 10, random_state = 42)
plt.scatter(x, y, c = 'red')
plt.xlabel('Feature')
plt.ylabel('Target_Variable')
plt.title('Training Data')
plt.show()
# Adding x0=1 column to x array.
X_New = np.array([np.ones(len(x)), x.flatten()]).T
# Convert y from 1d to 2d array.
y = y.reshape(100, 1)
# Number of Iterations for Gradient Descent
num_iter = 1000
# Learning Rate
alpha = 0.01
# Number of Training samples.
m = len(x)
# Initializing Theta.
theta = np.zeros((2, 1),dtype = float)
# Batch-Gradient Descent.
start_time = time.time()
for i in range(num_iter):
    # Vectorized gradients: X_New^T (X_New . theta - y).
    gradients = X_New.T.dot(X_New.dot(theta) - y)
    # Update all parameters in one step.
    theta = theta - (1 / m) * alpha * gradients
# Print the model parameters.
print('model parameters:',theta,sep = '\n')
# Print Time Take for Gradient Descent to Run.
print('Time Taken For Gradient Descent in Sec:',time.time() - start_time)
# Hypothesis.
h = X_New.dot(theta) # Prediction on training data itself.
# Plot the Output.
plt.scatter(x, y, c = 'red')
plt.plot(x ,h)
plt.xlabel('Feature')
plt.ylabel('Target_Variable')
plt.title('Output')
Output:
model parameters:
[[ 1.15857049]
[44.42210912]]
Time Taken For Gradient Descent in Sec: 0.019551515579223633
Observations:
- The vectorized implementation runs much faster than the loop-based one (here roughly 0.02 s vs. 2.5 s for the same 1000 iterations), i.e., more efficient code; a small verification sketch follows below.
- It is also shorter and easier to debug.
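As a quick, minimal sketch (not part of the original listings; variable names chosen here only for illustration), one way to check the first observation is to compute the gradients once with both methods, confirm they match, and time them:
python3
# Minimal sketch: verify that the loop-based and vectorized gradient
# computations agree for one pass over the data, and time them.
import time
import numpy as np
from sklearn.datasets import make_regression

x, y = make_regression(n_samples=100, n_features=1,
                       n_informative=1, noise=10, random_state=42)
X_New = np.array([np.ones(len(x)), x.flatten()]).T   # add the x0 = 1 column
y = y.reshape(-1, 1)
theta = np.zeros((2, 1))
m = len(x)

# Loop-based gradients (same accumulation as the unvectorized code).
t_start = time.time()
Grad0 = Grad1 = 0.0
for j in range(m):
    err = (theta[0] + theta[1] * x[j]) - y[j]
    Grad0 = Grad0 + err
    Grad1 = Grad1 + err * x[j]
loop_time = time.time() - t_start

# Vectorized gradients: X_New^T (X_New . theta - y).
t_start = time.time()
gradients = X_New.T.dot(X_New.dot(theta) - y)
vec_time = time.time() - t_start

# Both approaches should produce the same numbers.
print(np.allclose([Grad0.item(), Grad1.item()], gradients.ravel()))
print('loop time:', loop_time, 'vectorized time:', vec_time)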