Understanding Gradient Descent via Local Quadratic Models
Why Does the Gradient Descent Update Make Sense?
📋 Gradient Descent Algorithm
xₛ₊₁ = xₛ - ηₛ ∇f(xₛ)
Why does this update work? Let's understand it through local quadratic models!
🔍 The Core Idea: Local Quadratic Approximation
Imagine this scenario: you're lost on a mountain and want to reach the lowest point.
The problem: The entire terrain is too complex to see at once.
The solution: Just look at the small region under your feet! In this small region, the terrain can be approximated by a simple paraboloid (quadratic function).
📐 Local Quadratic Model mₛ(x)
mₛ(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩ + (1/(2η)) ‖x - xₛ‖²
[Interactive demo: step through gradient descent and watch the local quadratic model mₛ(x) at each iterate — Step / Auto Run / Reset]
💡 Key Observations:
The red parabola (local model) closely matches the blue curve (true function) near the current point
The green point (xₛ₊₁) is the minimizer of the red parabola
Each step solves a simplified local optimization problem
Step size η controls the strength of the quadratic term, affecting the "trust region" size
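The first observation can be checked numerically. A minimal sketch, assuming the toy function f(x) = sin(x), the point xₛ = 1, and η = 0.5 (none of which comes from the demo): the model matches f very closely near xₛ, and the gap grows with distance.

```python
import numpy as np

# Local quadratic model around x_s for f(x) = sin(x) (hypothetical example).
f = lambda x: np.sin(x)
grad = lambda x: np.cos(x)
x_s, eta = 1.0, 0.5

# m_s(x) = f(x_s) + <grad f(x_s), x - x_s> + (1/(2*eta)) * (x - x_s)^2
model = lambda x: f(x_s) + grad(x_s) * (x - x_s) + (x - x_s) ** 2 / (2 * eta)

# The model tracks f closely near x_s; the gap grows with distance.
gaps = [abs(f(x_s + d) - model(x_s + d)) for d in (0.01, 0.1, 1.0)]
print(gaps)
```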
📊 Why Add the Quadratic Term?
Without quadratic term: Only first-order Taylor expansion
m(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩
Problem: This is a linear function with no minimum (unless the gradient is zero) — it decreases without bound along the direction -∇f(xₛ).
With quadratic term:
+ (1/(2η)) ‖x - xₛ‖²
✅ Now there's a unique minimizer!
✅ Quadratic term acts as a "trust region" preventing steps that are too large
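To see the difference concretely, here is a small sketch with an assumed example f(x) = x², xₛ = 2, η = 0.1: the linear model can be pushed arbitrarily low, while the regularized model has a unique minimizer at xₛ - η∇f(xₛ).

```python
import numpy as np

# Assumed toy example: f(x) = x^2 at x_s = 2, so grad f(x_s) = 4.
f = lambda x: x ** 2
grad = lambda x: 2 * x
x_s, eta = 2.0, 0.1

# First-order model alone: linear, hence unbounded below.
linear = lambda x: f(x_s) + grad(x_s) * (x - x_s)
print(linear(x_s - 100.0))            # arbitrarily negative: no minimum

# Adding the quadratic penalty makes the model strictly convex.
model = lambda x: linear(x) + (x - x_s) ** 2 / (2 * eta)
grid = np.linspace(-5.0, 5.0, 100001)
x_min = grid[np.argmin(model(grid))]
print(x_min)                          # ~ x_s - eta * grad(x_s) = 1.6
```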
🎯 Deriving the Gradient Descent Formula
Goal: Minimize the local model
minₓ mₛ(x)
Take the gradient and set it to zero:
∇mₛ(x) = ∇f(xₛ) + (1/η)(x - xₛ) = 0
Solve for x:
x = xₛ - η∇f(xₛ)
✅ This is the gradient descent update formula!
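A minimal sketch of the resulting algorithm; the least-squares test function, matrix, and step size below are assumptions for illustration, not part of the derivation.

```python
import numpy as np

# Gradient descent on f(x) = ||Ax - b||^2 (assumed test problem).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: 2 * A.T @ (A @ x - b)

x, eta = np.zeros(2), 0.05
for s in range(500):
    x = x - eta * grad(x)       # x_{s+1} = x_s - eta * grad f(x_s)

x_star = np.linalg.solve(A, b)  # exact minimizer: A invertible, so Ax* = b
print(x, x_star)
```

With this step size the iterates converge to the exact solution to machine precision.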
🎓 Deep Understanding
1. The Essence of Gradient Descent:
Not directly minimizing f(x) (too hard)
Instead, at each step minimizing a local quadratic approximation ms (x)
This local model is easy to solve (closed-form solution)
2. The Role of Step Size η:
Small η: Strong quadratic penalty, small trust region, conservative steps
Large η: Weak quadratic penalty, large trust region, aggressive steps
Need to balance: too small → slow convergence, too large → instability
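These regimes are easy to demonstrate. A sketch on the assumed function f(x) = x², whose gradient is 2x, so the update is x ← (1 - 2η)x and diverges once η > 1:

```python
# f(x) = x^2: the update x <- x - eta * 2x contracts iff |1 - 2*eta| < 1.
def run(eta, steps=50, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= eta * 2 * x
    return abs(x)

print(run(0.01))   # too small: after 50 steps still far from 0
print(run(0.45))   # well chosen: essentially at the minimum
print(run(1.10))   # too large: the iterates blow up
```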
3. Connection to Newton's Method:
Gradient descent model: m(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩ + (1/(2η))‖x - xₛ‖²
Newton's method model: m(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩ + ½⟨x - xₛ, H(x - xₛ)⟩, where H = ∇²f(xₛ) is the Hessian
Gradient descent approximates the Hessian H with (1/η)I
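On a quadratic, Newton's model is exact, so a single Newton step lands on the minimizer, while gradient descent's isotropic (1/η)I approximation must iterate. A sketch, where the curvature matrix H, the vector c, and the step size are assumed for illustration:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T H x - c^T x with an ill-conditioned (assumed) Hessian.
H = np.array([[10.0, 0.0], [0.0, 1.0]])
c = np.array([1.0, 1.0])
grad = lambda x: H @ x - c

x0 = np.zeros(2)
newton = x0 - np.linalg.solve(H, grad(x0))   # one step with the true Hessian
print(newton)                                # lands on H^{-1} c exactly

x, eta = x0, 0.05          # stable since eta < 2 / lambda_max(H) = 0.2
for _ in range(200):
    x = x - eta * grad(x)  # Hessian replaced by (1/eta) * I
print(x)                   # approaches the same point, step by step
```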
4. Why Does This Work?
The local model is a good approximation near the current point
Minimizing the local model is much simpler than minimizing the original function
Each step guarantees local descent (with appropriate step size)
Many steps accumulate, converging to a local minimum — which is the global minimum when f is convex
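The descent guarantee can be observed directly. A sketch on an assumed nonconvex function with a step size chosen small relative to the local curvature:

```python
# Nonconvex example (assumed): f(x) = x^4 - 3x^2 + x.
f = lambda x: x ** 4 - 3 * x ** 2 + x
grad = lambda x: 4 * x ** 3 - 6 * x + 1

x, eta = 2.0, 0.01          # conservative step for this curvature
values = [f(x)]
for _ in range(100):
    x -= eta * grad(x)
    values.append(f(x))

# With this step size, every step decreases f (monotone descent).
print(all(a >= b for a, b in zip(values, values[1:])))
```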
🔧 Practical Applications
This "local model" idea appears in many optimization algorithms:
Trust Region Methods: Explicitly control trust region size
Proximal Methods: Add regularization terms as penalties
Quasi-Newton Methods: Use better quadratic models
Adaptive Learning Rates: Dynamically adjust η to adapt to local curvature
📚 Mathematical Formulation
The gradient descent step solves:
xₛ₊₁ = argminₓ∈ℝᵈ { ⟨∇f(xₛ), x - xₛ⟩ + (1/(2ηₛ)) ‖x - xₛ‖₂² }
Proof: The objective is strictly convex in x. Its gradient with respect to x is:
∇f(xₛ) + (1/ηₛ)(x - xₛ)
Setting this equal to zero gives the first-order optimality condition:
∇f(xₛ) + (1/ηₛ)(x - xₛ) = 0 ⟺ x = xₛ - ηₛ ∇f(xₛ)
which is exactly the gradient descent update! ■
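The first-order condition can also be verified numerically in any dimension. A sketch with a random (assumed) vector g standing in for ∇f(xₛ): at the gradient descent point, the model's gradient vanishes up to rounding.

```python
import numpy as np

# At x = x_s - eta * g, the model's gradient g + (x - x_s)/eta vanishes.
rng = np.random.default_rng(0)
x_s = rng.standard_normal(3)
g = rng.standard_normal(3)       # stands in for grad f(x_s)
eta = 0.3

model_grad = lambda x: g + (x - x_s) / eta
x_next = x_s - eta * g           # gradient descent update
print(np.linalg.norm(model_grad(x_next)))   # zero up to rounding
```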