Gradient Descent: The Workhorse of Machine Learning
Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent. It's the foundation of training neural networks and many other machine learning models.
The Intuition
Imagine you're blindfolded on a hilly terrain and want to reach the lowest point. You would feel the slope under your feet and step in the direction that goes downhill. That's gradient descent!
The Mathematics
For a function $f(\theta)$, gradient descent updates the parameters $\theta$ iteratively:
$$\theta_{new} = \theta_{old} - \alpha \nabla f(\theta_{old})$$
Where:
- $\alpha$ is the learning rate (step size)
- $\nabla f(\theta)$ is the gradient (direction of steepest ascent)
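The update rule above can be sketched in a few lines. This is a minimal, self-contained illustration on $f(\theta) = \theta^2$, whose gradient is $2\theta$ and whose minimum is at $\theta = 0$ (the function and parameter names are illustrative):

```python
# Minimal illustration of theta_new = theta_old - alpha * grad(theta_old)
# on f(theta) = theta^2, which has gradient 2*theta and minimum at 0.
def gradient_descent_1d(grad, theta, alpha, iterations):
    for _ in range(iterations):
        theta = theta - alpha * grad(theta)  # step against the gradient
    return theta

theta_min = gradient_descent_1d(lambda t: 2 * t, theta=5.0, alpha=0.1, iterations=100)
# theta_min ends up very close to the minimum at 0
```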
Types of Gradient Descent
1. Batch Gradient Descent
Uses the entire dataset to compute the gradient:
$$\theta = \theta - \alpha \frac{1}{n}\sum_{i=1}^{n} \nabla L(f(x_i), y_i)$$
import numpy as np

def batch_gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)  # number of training examples
    for _ in range(iterations):
        # Average gradient of the squared error over the full dataset
        gradient = (1 / m) * X.T @ (X @ theta - y)
        theta = theta - alpha * gradient
    return theta
2. Stochastic Gradient Descent (SGD)
Updates parameters for each training example:
$$\theta = \theta - \alpha \nabla L(f(x_i), y_i)$$
- Pros: faster updates; the noise can help escape shallow local minima
- Cons: noisy updates; may not converge smoothly
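A minimal sketch of SGD for linear regression, performing one parameter update per training example (the function and variable names here are illustrative, not from a specific library):

```python
import numpy as np

# SGD sketch: one update per example, shuffling the data each epoch.
def sgd(X, y, theta, alpha, epochs, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        for i in rng.permutation(n):              # visit examples in random order
            error = X[i] @ theta - y[i]           # scalar residual for one example
            theta = theta - alpha * error * X[i]  # gradient of 0.5 * error^2
    return theta

# Usage on noiseless synthetic data y = 1 + 2x (bias column + one feature):
X = np.c_[np.ones(50), np.linspace(0.0, 1.0, 50)]
y = 1.0 + 2.0 * X[:, 1]
theta = sgd(X, y, np.zeros(2), alpha=0.1, epochs=200)
# theta approaches [1, 2]
```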
3. Mini-batch Gradient Descent
Uses a subset of data (batch) for each update:
$$\theta = \theta - \alpha \frac{1}{b}\sum_{i=1}^{b} \nabla L(f(x_i), y_i)$$
This is the most commonly used approach in practice.
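The mini-batch update can be sketched by combining the two previous variants: shuffle the data, then average the gradient over `b` examples per step (names are illustrative):

```python
import numpy as np

# Mini-batch gradient descent sketch for linear regression:
# each step averages the gradient over a small random batch.
def minibatch_gd(X, y, theta, alpha, epochs, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            gradient = Xb.T @ (Xb @ theta - yb) / len(batch)  # batch average
            theta = theta - alpha * gradient
    return theta

# Usage on noiseless synthetic data y = 1 + 2x:
X = np.c_[np.ones(100), np.linspace(0.0, 1.0, 100)]
y = 1.0 + 2.0 * X[:, 1]
theta = minibatch_gd(X, y, np.zeros(2), alpha=0.1, epochs=500, batch_size=25)
```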
Choosing the Learning Rate
The learning rate $\alpha$ is crucial:
- Too large: May overshoot and diverge
- Too small: Slow convergence
- Just right: Steady convergence to minimum
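The three regimes above are easy to see on $f(\theta) = \theta^2$, where each step multiplies $\theta$ by $(1 - 2\alpha)$, so stability requires $|1 - 2\alpha| < 1$:

```python
# Effect of the learning rate on f(theta) = theta^2 (gradient 2*theta).
# Each step: theta <- theta * (1 - 2*alpha).
def run(alpha, theta=5.0, steps=50):
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

too_large = run(1.1)    # |1 - 2.2| = 1.2 > 1: iterates blow up
too_small = run(0.001)  # |1 - 0.002| ~ 0.998: barely moves in 50 steps
just_right = run(0.4)   # |1 - 0.8| = 0.2: rapid convergence to 0
```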
Learning Rate Schedules
- Step decay: Reduce $\alpha$ by factor every N epochs
- Exponential decay: $\alpha_t = \alpha_0 e^{-kt}$
- Adaptive methods (e.g., Adam): adjust the learning rate per parameter rather than following a fixed schedule
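The first two schedules can be written directly from their definitions (the function names and default constants here are illustrative):

```python
import math

# Step decay: multiply the base rate by `drop` once every `every` epochs.
def step_decay(alpha0, epoch, drop=0.5, every=10):
    return alpha0 * drop ** (epoch // every)

# Exponential decay: alpha_t = alpha_0 * exp(-k * t).
def exponential_decay(alpha0, t, k=0.05):
    return alpha0 * math.exp(-k * t)
```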
Advanced Optimizers
Momentum
Adds a "velocity" term to accelerate convergence:
$$v_t = \beta v_{t-1} + \nabla f(\theta)$$
$$\theta = \theta - \alpha v_t$$
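A one-dimensional sketch of the momentum update above, again on $f(\theta) = \theta^2$ (names are illustrative):

```python
# Momentum sketch: v accumulates gradients, beta is the momentum coefficient.
def momentum_gd(grad, theta, alpha, beta, iterations):
    v = 0.0
    for _ in range(iterations):
        v = beta * v + grad(theta)  # v_t = beta * v_{t-1} + grad f(theta)
        theta = theta - alpha * v   # theta = theta - alpha * v_t
    return theta

theta_min = momentum_gd(lambda t: 2 * t, theta=5.0, alpha=0.05, beta=0.9, iterations=200)
# theta_min ends up close to the minimum at 0
```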
Adam (Adaptive Moment Estimation)
Combines momentum with adaptive learning rates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla f(\theta))^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta = \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The $\hat{m}_t$ and $\hat{v}_t$ terms correct the bias toward zero in the early iterations, since $m_0 = v_0 = 0$.
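A scalar-parameter sketch of Adam, using the standard default constants $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ (the function name and test problem are illustrative):

```python
import math

# Adam sketch for a single scalar parameter, with bias correction.
def adam(grad, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iterations=1000):
    m, v = 0.0, 0.0
    for t in range(1, iterations + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (squared gradients)
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Usage: minimize (theta - 3)^2, whose gradient is 2*(theta - 3).
theta_min = adam(lambda t: 2 * (t - 3.0), theta=0.0, alpha=0.05, iterations=2000)
# theta_min settles near the minimum at 3
```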
Practical Tips
- Normalize your features
- Start with a reasonable learning rate (e.g., 0.001, Adam's default)
- Use mini-batches (32-256 samples)
- Monitor the loss curve for debugging
- Consider using Adam for most applications
Conclusion
Gradient descent is elegantly simple yet incredibly powerful. Understanding its variants and tuning strategies is essential for anyone working in machine learning.