A Gentle Introduction to Gradient Descent

Confused about gradient descent in machine learning? Here’s what you need to know…

Introduction

In machine learning and optimization, gradient descent is one of the most important and widely used algorithms. It’s a key technique for training models and fine-tuning parameters to make predictions as accurate as possible. But what exactly is gradient descent, and how does it work?

In this blog post, we will explore gradient descent in simple terms, use a basic example to demonstrate its functionality, dive into the technical details, and provide some code to help you get a better understanding.

What is Gradient Descent? In Simple Terms…

Gradient descent is an optimization algorithm that minimizes the cost function (also called the loss function) of a machine learning model. The goal is to adjust the model’s parameters (such as the weights in a neural network) so as to reduce the error in its predictions and improve its performance. In other words, the algorithm repeatedly takes small steps in the direction of the steepest decrease of the cost function.

To help you visualize gradient descent, let’s consider a simple example.

Imagine you’re standing on a smooth hill, and your goal is to reach the lowest point. However, it’s a moonless night with no lights around, so you can’t see anything, but you can feel the slope beneath your feet. You decide to take a small step in the direction of the steepest downward slope, and then reassess your position. You repeat this process: take a step, check the slope, take another step, and so on, each time getting closer to the lowest point.

In the context of gradient descent:

  • The hill represents the cost function that needs to be minimized.
  • The lowest point represents the global minimum (the point where the cost function is as small as possible). On bumpier surfaces, gradient descent can instead settle in a local minimum: a dip that is lower than its surroundings but not the lowest point overall.
  • Your steps are the updates to the model’s parameters, moving you closer to the optimal solution.

Gradient Descent in Technical Terms

Let’s break it down into more technical language. In machine learning, you have a model that tries to make predictions. The cost function measures how far the model’s predictions are from the actual results. The objective of gradient descent is to find the model’s parameters (weights, biases, etc.) that minimize this cost function.

Here’s how gradient descent works mathematically:

  1. Start with initial parameters: These could be random values for the model’s weights.
  2. Compute the gradient of the cost function: This is the derivative (slope) of the cost function with respect to each parameter.
  3. Update the parameters: Subtract a fraction of the gradient from the current parameters. The fraction is called the learning rate, which controls the size of the steps taken.
  4. Repeat: Continue this process for many iterations (or until convergence) until the cost function is minimized.

The update rule looks like this:

θ = θ − α · ∇J(θ)

Where:

  • θ is a parameter of the model (such as a weight or bias)
  • α is the learning rate
  • ∇J(θ) is the gradient of the cost function with respect to θ (the slope)
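
To see the update rule in action, here is a tiny one-dimensional sketch. The cost function J(θ) = θ² and the starting values are assumptions chosen purely for illustration; its gradient is ∇J(θ) = 2θ, and its minimum sits at θ = 0.

# A toy one-dimensional example, assuming J(θ) = θ² (so ∇J(θ) = 2θ)
theta = 3.0  # assumed starting parameter
alpha = 0.1  # assumed learning rate
for step in range(3):
    gradient = 2 * theta              # ∇J(θ) for our toy cost function
    theta = theta - alpha * gradient  # the update rule: θ = θ − α · ∇J(θ)
    print(theta)                      # ≈ 2.4, then ≈ 1.92, then ≈ 1.54

Each step moves θ a fraction of the slope closer to the minimum, exactly like the downhill steps in the hill analogy.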

Gradient Descent Example Code

Let’s implement gradient descent for a simple linear regression problem using Python. In this case, we want to fit a line to some data points. Our cost function will be the Mean Squared Error (MSE), which measures how far the predicted points are from the actual data points.
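
In symbols, with m data points, predictions ŷᵢ, and targets yᵢ, the cost the code below implements is:

J(θ) = (1 / (2m)) · Σᵢ (ŷᵢ − yᵢ)²

The extra factor of 1/2 is a common convention: it cancels against the exponent when differentiating, leaving a cleaner gradient.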

Let’s start by importing the necessary libraries and generating some data.

import numpy as np
import matplotlib.pyplot as plt

# Generate some data points (y = 2x + 1)
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 2 * X + 1 + np.random.randn(100, 1)  # Adding some noise

Now, let’s define the cost function and its gradient.

# Cost function (Mean Squared Error)
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

# Gradient of the cost function
def compute_gradient(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    gradient = (1 / m) * X.T.dot(predictions - y)
    return gradient
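
This is just the derivative of the MSE cost above: differentiating J(θ) = (1 / (2m)) · Σᵢ (ŷᵢ − yᵢ)² with respect to θ gives ∇J(θ) = (1/m) · Xᵀ(Xθ − y), which is exactly what compute_gradient returns; the factor of 1/2 cancels against the exponent.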

We can now implement the gradient descent function that will iteratively update our parameters θ.

# Gradient Descent Function
def gradient_descent(X, y, theta, learning_rate=0.01, iterations=1000):
    cost_history = []
    for i in range(iterations):
        gradient = compute_gradient(X, y, theta)
        theta = theta - learning_rate * gradient
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
    return theta, cost_history

Next, we will initialize our parameters θ and start the gradient descent process.

# Adding a column of ones to X for the bias term (intercept)
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Initializing parameters (random values)
theta_init = np.random.randn(2, 1)

# Running gradient descent
theta_final, cost_history = gradient_descent(X_b, y, theta_init, learning_rate=0.1, iterations=2000)

print(f"Optimal parameters (theta): {theta_final}")

Finally, let’s plot the cost history to see how the cost function decreases over time.

# Plotting the cost history
plt.plot(range(len(cost_history)), cost_history)
plt.xlabel("Iterations")
plt.ylabel("Cost")
plt.title("Cost Function vs Iterations")
plt.show()

This plot should show a steady decrease in the cost as the gradient descent algorithm updates the parameters and moves toward the minimum.
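
How quickly the cost falls depends heavily on the learning rate. Here is a small experiment you can try; the specific rates (0.01, 0.05, and 0.1) are arbitrary choices for illustration:

# Comparing a few learning rates (the values are illustrative choices)
for lr in (0.01, 0.05, 0.1):
    _, history = gradient_descent(X_b, y, theta_init, learning_rate=lr, iterations=500)
    plt.plot(history, label=f"learning rate = {lr}")
plt.xlabel("Iterations")
plt.ylabel("Cost")
plt.title("Effect of the Learning Rate")
plt.legend()
plt.show()

Smaller rates descend more slowly, while a rate that is too large can overshoot the minimum and cause the cost to oscillate or even grow.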

Types of Gradient Descent

There are several variants of gradient descent, each with its own characteristics:

  1. Batch Gradient Descent: Uses the entire dataset to compute the gradient at each step. This is what we’ve used in our example. It is computationally expensive for large datasets.
  2. Stochastic Gradient Descent (SGD): Uses a single data point to compute the gradient at each step. It can update parameters more frequently but may be noisier.
  3. Mini-Batch Gradient Descent: Uses a small subset (mini-batch) of the dataset at each step. It balances the computational efficiency of batch gradient descent with the faster updates of stochastic gradient descent.

Thus, we see that the different types of gradient descent differ only in how much data they use at each step to update the parameters; the sketch below shows how mini-batch updates look in practice.
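
To make the contrast concrete, here is a minimal mini-batch sketch that reuses compute_gradient and compute_cost from earlier; the batch_size value and the per-epoch shuffling are choices made for this illustration. Setting batch_size=1 recovers stochastic gradient descent, while batch_size=len(y) recovers batch gradient descent.

# Mini-Batch Gradient Descent (a sketch reusing the functions above)
def mini_batch_gradient_descent(X, y, theta, learning_rate=0.01, iterations=50, batch_size=16):
    m = len(y)
    cost_history = []
    for _ in range(iterations):
        # Shuffle once per pass so each epoch sees different mini-batches
        indices = np.random.permutation(m)
        X_shuffled, y_shuffled = X[indices], y[indices]
        for start in range(0, m, batch_size):
            X_batch = X_shuffled[start:start + batch_size]
            y_batch = y_shuffled[start:start + batch_size]
            gradient = compute_gradient(X_batch, y_batch, theta)
            theta = theta - learning_rate * gradient
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history

Because each update sees only a slice of the data, the cost curve is noisier than in the batch version, but the parameters are updated far more often per pass over the data.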

Conclusion

In summary, gradient descent is a foundational algorithm in machine learning that optimizes a model’s parameters to minimize its error. Whether for simple linear regression or complex deep learning models, understanding how gradient descent works is essential for designing and training effective models. By tuning the learning rate and choosing the right variant of gradient descent, we can help the algorithm converge quickly and reliably to a good solution.

With the help of gradient descent, machine learning models become smarter and more efficient, empowering us to make predictions and solve problems in countless applications. Whether you’re working with small datasets or building large-scale systems, mastering gradient descent is a crucial skill for any data scientist or machine learning practitioner.