Gradient Descent: The Workhorse of Machine Learning
Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent. It's the foundation of training neural networks and many other machine learning models.
The Intuition
Imagine you're blindfolded on a hilly terrain and want to reach the lowest point. You would feel the slope under your feet and step in the direction that goes downhill. That's gradient descent!
The Mathematics
For a function $f(\theta)$, gradient descent updates the parameters $\theta$ iteratively:
$$\theta_{new} = \theta_{old} - \alpha \nabla f(\theta_{old})$$
Where:
- $\alpha$ is the learning rate (step size)
- $\nabla f(\theta)$ is the gradient (direction of steepest ascent)
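The update rule above can be sketched in a few lines. This is a minimal, self-contained illustration on $f(\theta) = \theta^2$, whose gradient is $2\theta$ and whose minimum is at $\theta = 0$ (the function and parameter names are illustrative):

```python
# Minimal illustration of theta_new = theta_old - alpha * grad(theta_old)
# on f(theta) = theta^2, which has gradient 2*theta and minimum at 0.
def gradient_descent_1d(grad, theta, alpha, iterations):
    for _ in range(iterations):
        theta = theta - alpha * grad(theta)  # step against the gradient
    return theta

theta_min = gradient_descent_1d(lambda t: 2 * t, theta=5.0, alpha=0.1, iterations=100)
# theta_min ends up very close to the minimum at 0
```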
Types of Gradient Descent
1. Batch Gradient Descent
Uses the entire dataset to compute the gradient:
$$\theta = \theta - \alpha \frac{1}{n}\sum_{i=1}^{n} \nabla L(f(x_i), y_i)$$
import numpy as np

def batch_gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)  # number of training examples
    for _ in range(iterations):
        # Average gradient of the squared error over the full dataset
        gradient = (1 / m) * X.T @ (X @ theta - y)
        theta = theta - alpha * gradient
    return theta
2. Stochastic Gradient Descent (SGD)
Updates parameters for each training example:
$$\theta = \theta - \alpha \nabla L(f(x_i), y_i)$$
- Pros: faster updates; the noise can help escape shallow local minima
- Cons: noisy updates; may not converge smoothly
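A minimal sketch of SGD for linear regression, performing one parameter update per training example (the function and variable names here are illustrative, not from a specific library):

```python
import numpy as np

# SGD sketch: one update per example, shuffling the data each epoch.
def sgd(X, y, theta, alpha, epochs, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        for i in rng.permutation(n):              # visit examples in random order
            error = X[i] @ theta - y[i]           # scalar residual for one example
            theta = theta - alpha * error * X[i]  # gradient of 0.5 * error^2
    return theta

# Usage on noiseless synthetic data y = 1 + 2x (bias column + one feature):
X = np.c_[np.ones(50), np.linspace(0.0, 1.0, 50)]
y = 1.0 + 2.0 * X[:, 1]
theta = sgd(X, y, np.zeros(2), alpha=0.1, epochs=200)
# theta approaches [1, 2]
```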
3. Mini-batch Gradient Descent
Uses a subset of data (batch) for each update:
$$\theta = \theta - \alpha \frac{1}{b}\sum_{i=1}^{b} \nabla L(f(x_i), y_i)$$
This is the most commonly used approach in practice.
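The mini-batch update can be sketched by combining the two previous variants: shuffle the data, then average the gradient over `b` examples per step (names are illustrative):

```python
import numpy as np

# Mini-batch gradient descent sketch for linear regression:
# each step averages the gradient over a small random batch.
def minibatch_gd(X, y, theta, alpha, epochs, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            gradient = Xb.T @ (Xb @ theta - yb) / len(batch)  # batch average
            theta = theta - alpha * gradient
    return theta

# Usage on noiseless synthetic data y = 1 + 2x:
X = np.c_[np.ones(100), np.linspace(0.0, 1.0, 100)]
y = 1.0 + 2.0 * X[:, 1]
theta = minibatch_gd(X, y, np.zeros(2), alpha=0.1, epochs=500, batch_size=25)
```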
Choosing the Learning Rate
The learning rate $\alpha$ is crucial:
- Too large: May overshoot and diverge
- Too small: Slow convergence
- Just right: Steady convergence to minimum
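The three regimes above are easy to see on $f(\theta) = \theta^2$, where each step multiplies $\theta$ by $(1 - 2\alpha)$, so stability requires $|1 - 2\alpha| < 1$:

```python
# Effect of the learning rate on f(theta) = theta^2 (gradient 2*theta).
# Each step: theta <- theta * (1 - 2*alpha).
def run(alpha, theta=5.0, steps=50):
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

too_large = run(1.1)    # |1 - 2.2| = 1.2 > 1: iterates blow up
too_small = run(0.001)  # |1 - 0.002| ~ 0.998: barely moves in 50 steps
just_right = run(0.4)   # |1 - 0.8| = 0.2: rapid convergence to 0
```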
Learning Rate Schedules
- Step decay: Reduce $\alpha$ by factor every N epochs
- Exponential decay: $\alpha_t = \alpha_0 e^{-kt}$
- Adaptive methods (e.g., Adam): adjust the learning rate per parameter rather than following a fixed schedule
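The first two schedules can be written directly from their definitions (the function names and default constants here are illustrative):

```python
import math

# Step decay: multiply the base rate by `drop` once every `every` epochs.
def step_decay(alpha0, epoch, drop=0.5, every=10):
    return alpha0 * drop ** (epoch // every)

# Exponential decay: alpha_t = alpha_0 * exp(-k * t).
def exponential_decay(alpha0, t, k=0.05):
    return alpha0 * math.exp(-k * t)
```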
Advanced Optimizers
Momentum
Adds a "velocity" term to accelerate convergence:
$$v_t = \beta v_{t-1} + \nabla f(\theta)$$
$$\theta = \theta - \alpha v_t$$
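A one-dimensional sketch of the momentum update above, again on $f(\theta) = \theta^2$ (names are illustrative):

```python
# Momentum sketch: v accumulates gradients, beta is the momentum coefficient.
def momentum_gd(grad, theta, alpha, beta, iterations):
    v = 0.0
    for _ in range(iterations):
        v = beta * v + grad(theta)  # v_t = beta * v_{t-1} + grad f(theta)
        theta = theta - alpha * v   # theta = theta - alpha * v_t
    return theta

theta_min = momentum_gd(lambda t: 2 * t, theta=5.0, alpha=0.05, beta=0.9, iterations=200)
# theta_min ends up close to the minimum at 0
```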
Adam (Adaptive Moment Estimation)
Combines momentum with adaptive learning rates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla f(\theta))^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta = \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The $\hat{m}_t$ and $\hat{v}_t$ terms correct the bias toward zero in the early iterations, since $m_0 = v_0 = 0$.
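A scalar-parameter sketch of Adam, using the standard default constants $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ (the function name and test problem are illustrative):

```python
import math

# Adam sketch for a single scalar parameter, with bias correction.
def adam(grad, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iterations=1000):
    m, v = 0.0, 0.0
    for t in range(1, iterations + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (squared gradients)
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Usage: minimize (theta - 3)^2, whose gradient is 2*(theta - 3).
theta_min = adam(lambda t: 2 * (t - 3.0), theta=0.0, alpha=0.05, iterations=2000)
# theta_min settles near the minimum at 3
```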
Practical Tips
- Normalize your features
- Start with a reasonable learning rate (e.g., 0.001, Adam's default)
- Use mini-batches (32-256 samples)
- Monitor the loss curve for debugging
- Consider using Adam for most applications
Conclusion
Gradient descent is elegantly simple yet incredibly powerful. Understanding its variants and tuning strategies is essential for anyone working in machine learning.