What is Gradient Descent?


Understanding the Gradient Descent Algorithm in Machine Learning

Machine learning is all about creating models and algorithms that can learn from data and make predictions. One of the most widely used optimization algorithms in machine learning is Gradient Descent, a method for minimizing the cost function of a model.

The purpose of this article is to explain Gradient Descent and how it works in machine learning. We will discuss the different types of Gradient Descent algorithms, how to choose the learning rate, and how to implement Gradient Descent in Python.

What is Gradient Descent?

Gradient Descent is a popular optimization algorithm that is used to find the minimum of a function. The function that we are trying to minimize is usually called a cost function or a loss function. In machine learning, we use the cost function to measure the difference between the predicted output and the actual output.

The goal of Gradient Descent is to find the set of parameters (also called weights) that minimize the cost function. The parameters are the values that the machine learning model uses to make a prediction. For example, in a linear regression model, the parameters are the slope and intercept of the line that best fits the data.
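
For example, a one-feature linear regression model makes predictions of the form y_hat = theta_0 + theta_1 * x, and Gradient Descent searches for the values of the intercept theta_0 and the slope theta_1 that make those predictions as close as possible to the observed targets.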

How does Gradient Descent work?

The basic idea of Gradient Descent is to iteratively update the parameters in the direction of the negative gradient of the cost function. The negative gradient of the cost function tells us in which direction the function is decreasing the fastest. By moving in this direction, we can get closer to the minimum of the function.

The algorithm starts with some initial values for the parameters and computes the gradient of the cost function with respect to each parameter. The gradient tells us in which direction we need to adjust the parameters to reduce the cost function. We then update each parameter by subtracting a fraction of the gradient from its current value:

  1. Initialize the parameters to some random values.
  2. Calculate the cost function.
  3. Calculate the gradient of the cost function with respect to each parameter.
  4. Update each parameter by subtracting the learning rate times the gradient.
  5. Repeat steps 2-4 until the cost function converges (i.e., stops changing).
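
Written out, the update applied to each parameter in step 4 is

theta_j := theta_j - alpha * dJ/dtheta_j

where alpha is the learning rate and dJ/dtheta_j is the partial derivative of the cost function J with respect to the parameter theta_j.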

The learning rate is a hyperparameter that controls how much we adjust the parameters in each iteration. If the learning rate is too high, we may overshoot the minimum and never converge. If the learning rate is too low, the algorithm may take too long to converge or get stuck in a local minimum.

Types of Gradient Descent algorithms

There are three types of Gradient Descent algorithms: batch, stochastic, and mini-batch.

  • Batch Gradient Descent: In batch gradient descent, we use the entire training set to compute the gradient of the cost function. This means that we compute the gradient using all the training examples at once. Batch Gradient Descent is very accurate but can be very slow for large datasets.
  • Stochastic Gradient Descent: In stochastic gradient descent, we use only one training example at a time to compute the gradient. This means that we compute the gradient using a randomly selected training example. Stochastic Gradient Descent is much faster per update than Batch Gradient Descent, but its gradient estimates are noisier because each one is based on a single example.
  • Mini-batch Gradient Descent: In mini-batch gradient descent, we use a small batch of training examples to compute the gradient. This means that we compute the gradient using a randomly selected batch of training examples. Mini-batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. It is faster than Batch Gradient Descent and more accurate than Stochastic Gradient Descent.

The choice of the Gradient Descent algorithm depends on the size of the dataset and the computational resources available. If the dataset is small, Batch Gradient Descent may be the best option. If the dataset is large, Stochastic Gradient Descent or Mini-batch Gradient Descent may be better.
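
To make the difference concrete, here is a minimal standalone sketch, assuming X is a NumPy feature matrix, y a column vector of targets, and theta the current parameter vector. The toy data below is only for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # toy feature matrix
y = rng.normal(size=(100, 1))      # toy targets
theta = np.zeros((2, 1))           # current parameters

def gradient_mse(X_batch, y_batch, theta):
    # Gradient of the mean squared error cost for whatever batch is passed in.
    m = len(y_batch)
    return X_batch.T.dot(X_batch.dot(theta) - y_batch) / m

# Batch Gradient Descent: every training example contributes to each update.
grad_batch = gradient_mse(X, y, theta)

# Stochastic Gradient Descent: one randomly chosen example per update.
i = rng.integers(len(y))
grad_stochastic = gradient_mse(X[i:i+1], y[i:i+1], theta)

# Mini-batch Gradient Descent: a small random subset (here 16 examples) per update.
idx = rng.choice(len(y), size=16, replace=False)
grad_minibatch = gradient_mse(X[idx], y[idx], theta)

In all three cases the update itself is the same (theta = theta - learning_rate * gradient); only the amount of data behind each gradient estimate changes.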

How to choose the learning rate?

As noted earlier, the learning rate controls how much the parameters change in each iteration: a rate that is too high can overshoot the minimum and never converge, while a rate that is too low makes convergence painfully slow. Choosing a good value is therefore an important part of using Gradient Descent.

One way to choose the learning rate is to perform a grid search over a range of values. We can start with a small value and gradually increase it until the algorithm stops improving. Another way is to use a learning rate schedule that decreases the learning rate over time.
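
Here is a minimal sketch of both ideas. The candidate values and the decay rule are arbitrary choices for illustration, and gradient_descent refers to the function implemented later in this article, which is why the grid-search lines are left as comments:

candidate_rates = [0.001, 0.003, 0.01, 0.03, 0.1]

# Grid search: train once per candidate rate and keep the one with the lowest final cost.
# final_costs = {rate: gradient_descent(X, y, starting_theta, rate, num_iterations)[1][-1]
#                for rate in candidate_rates}
# best_rate = min(final_costs, key=final_costs.get)

def decayed_learning_rate(initial_rate, iteration, decay=0.01):
    # Inverse-time decay schedule: the step size shrinks as the iteration count grows.
    return initial_rate / (1.0 + decay * iteration)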

How to implement Gradient Descent in Python?

Now that we have a good understanding of Gradient Descent, let's see how to implement it in Python. We will use the Scikit-Learn library to generate a synthetic dataset and then train a linear regression model with a Gradient Descent routine written in NumPy.

First, we need to import the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

Next, we generate a synthetic dataset using the make_regression function:

X, y = make_regression(n_features=1, noise=5, random_state=42)

X is a matrix of shape (n_samples, n_features) that contains the input features, and y is a vector of shape (n_samples,) that contains the target values. With make_regression's default of 100 samples and the single feature we requested, X has shape (100, 1) and y has shape (100,).
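
Optionally, a quick scatter plot (reusing the matplotlib import above) gives a feel for the data before fitting anything:

plt.scatter(X, y)
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Synthetic regression data')
plt.show()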

Next, we define a function to calculate the cost function:

def cost_function(X, y, theta):
    # Mean squared error cost: J(theta) = 1/(2m) * sum((X.theta - y)^2)
    m = len(y)
    predictions = X.dot(theta)
    errors = np.subtract(predictions, y)
    sqr_errors = np.square(errors)
    mean_squared_error = 1 / (2 * m) * np.sum(sqr_errors)
    return mean_squared_error

X is the matrix of input features, y is the target vector, and theta is the vector of parameters. The function computes the predicted output from the input features and the current parameters, computes the difference between the predicted output and the actual output, computes the squared error, and returns the mean squared error.
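
In other words, cost_function implements the mean squared error cost

J(theta) = 1 / (2m) * sum over i of (x_i . theta - y_i)^2

where m is the number of training examples and x_i . theta is the model's prediction for example i. The extra factor of 1/2 is a common convention that cancels neatly when the gradient is taken.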

Next, we define a function to compute the gradient of the cost function:

def compute_gradient(X, y, theta):
    # Gradient of the cost: (1/m) * X^T (X.theta - y)
    m = len(y)
    predictions = X.dot(theta)
    errors = np.subtract(predictions, y)
    gradient = 1 / m * X.transpose().dot(errors)
    return gradient

X is the matrix of input features, y is the target vector, and theta is the vector of parameters. The function computes the predicted output from the input features and the current parameters, computes the difference between the predicted output and the actual output, computes the gradient of the cost function, and returns the gradient.
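
In matrix form, the gradient computed here is

grad J(theta) = (1/m) * X^T (X theta - y)

which is exactly what you get by differentiating the cost function above with respect to theta.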

Next, we define a function to perform Gradient Descent:

def gradient_descent(X, y, starting_theta, learning_rate, num_iterations):
    theta = starting_theta
    costs = []
    for i in range(num_iterations):
        # Step in the direction of the negative gradient.
        gradient = compute_gradient(X, y, theta)
        theta = theta - learning_rate * gradient
        # Track the cost so we can check convergence afterwards.
        cost = cost_function(X, y, theta)
        costs.append(cost)
        print("Iteration %d | Cost: %f" % (i, cost))
    return theta, costs

X is the matrix of input features, y is the target vector, starting_theta is the vector of initial parameters, learning_rate is the hyperparameter that controls how much we adjust the parameters in each iteration, and num_iterations is the number of iterations to run the algorithm. The function implements the Gradient Descent algorithm, updates the parameters in the direction of the negative gradient of the cost function, and records the cost function at each iteration.

Next, we run the Gradient Descent algorithm:

m = len(y)
X = np.hstack((np.ones((m, 1)), X))
y = y.reshape(-1, 1)
starting_theta = np.zeros((2, 1))
learning_rate = 0.01
num_iterations = 100
theta, costs = gradient_descent(X, y, starting_theta, learning_rate, num_iterations)

We add a column of ones to the input features to account for the intercept term. We also reshape the target vector to a matrix of shape (n_samples, 1) to make it compatible with the input features. We set the initial parameters to zero and the learning rate to 0.01. We run the Gradient Descent algorithm for 100 iterations and record the cost function at each iteration.
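
At this point it is worth printing the learned parameters; the exact numbers depend on the random seed and the noise in the data, and with only 100 iterations they may not have fully converged, but they should be heading toward the intercept and slope of the line that best fits the data:

print("Learned intercept: %.3f" % theta[0, 0])
print("Learned slope: %.3f" % theta[1, 0])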

Finally, we plot the cost function:

plt.plot(costs)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Gradient Descent: Cost vs Iterations')
plt.show()

The plot shows the cost decreasing with each iteration, which tells us that Gradient Descent is working as intended. As the curve flattens out, the parameters are approaching the values that minimize the cost; running more iterations, or tuning the learning rate, brings them closer still.
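
As a final sanity check, we can plot the line defined by the learned parameters on top of the data. This is an optional sketch reusing the variables defined above; the second column of X holds the original feature because the first column is the added column of ones:

plt.scatter(X[:, 1], y, label='Data')
plt.plot(X[:, 1], X.dot(theta), color='red', label='Line learned by Gradient Descent')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Fit learned by Gradient Descent')
plt.show()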

Conclusion

Gradient Descent is a popular optimization algorithm used to find the minimum of a function. In machine learning, we use it to find the set of parameters that minimize the cost function of a model, by iteratively updating those parameters in the direction of the negative gradient. There are three main variants: batch, stochastic, and mini-batch, and the best choice depends on the size of the dataset and the computational resources available. The learning rate controls how much the parameters change in each iteration: if it is too low the algorithm converges very slowly, and if it is too high it may overshoot the minimum and never converge. A grid search or a learning rate schedule can help choose a good value. Finally, as the example above shows, Gradient Descent is straightforward to implement in Python with NumPy, with Scikit-Learn used only to generate the example dataset.
