Understanding Gradient Descent via Local Quadratic Models
Why Does the Gradient Descent Update Make Sense?
📋 Gradient Descent Algorithm
xₛ₊₁ = xₛ - ηₛ ∇f(xₛ)
Why does this update work? Let's understand it through local quadratic models!
🔍 The Core Idea: Local Quadratic Approximation
Imagine this scenario: you're lost on a mountain and want to reach the lowest point.
The problem: The entire terrain is too complex to see at once.
The solution: Just look at the small region under your feet! In this small region, the terrain can be approximated by a simple paraboloid (quadratic function).
📐 Local Quadratic Model mₛ(x)
mₛ(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩ + (1/(2η)) ‖x - xₛ‖²
[Interactive demo: step through gradient descent and watch the local quadratic model mₛ(x) at each iterate — Step / Auto Run / Reset]
💡 Key Observations:
The red parabola (local model) closely matches the blue curve (true function) near the current point
The green point (xₛ₊₁) is the minimizer of the red parabola
Each step solves a simplified local optimization problem
Step size η controls the strength of the quadratic term, affecting the "trust region" size
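The first observation can be checked numerically. A minimal sketch, assuming the toy function f(x) = sin(x), the point xₛ = 1, and η = 0.5 (none of which comes from the demo): the model matches f very closely near xₛ, and the gap grows with distance.

```python
import numpy as np

# Local quadratic model around x_s for f(x) = sin(x) (hypothetical example).
f = lambda x: np.sin(x)
grad = lambda x: np.cos(x)
x_s, eta = 1.0, 0.5

# m_s(x) = f(x_s) + <grad f(x_s), x - x_s> + (1/(2*eta)) * (x - x_s)^2
model = lambda x: f(x_s) + grad(x_s) * (x - x_s) + (x - x_s) ** 2 / (2 * eta)

# The model tracks f closely near x_s; the gap grows with distance.
gaps = [abs(f(x_s + d) - model(x_s + d)) for d in (0.01, 0.1, 1.0)]
print(gaps)
```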
📊 Why Add the Quadratic Term?
Without quadratic term: Only first-order Taylor expansion
m(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩
Problem: This is a linear function with no minimum (unless the gradient is zero) — it decreases without bound along the direction -∇f(xₛ).
With quadratic term:
+ (1/(2η)) ‖x - xₛ‖²
✅ Now there's a unique minimizer!
✅ Quadratic term acts as a "trust region" preventing steps that are too large
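To see the difference concretely, here is a small sketch with an assumed example f(x) = x², xₛ = 2, η = 0.1: the linear model can be pushed arbitrarily low, while the regularized model has a unique minimizer at xₛ - η∇f(xₛ).

```python
import numpy as np

# Assumed toy example: f(x) = x^2 at x_s = 2, so grad f(x_s) = 4.
f = lambda x: x ** 2
grad = lambda x: 2 * x
x_s, eta = 2.0, 0.1

# First-order model alone: linear, hence unbounded below.
linear = lambda x: f(x_s) + grad(x_s) * (x - x_s)
print(linear(x_s - 100.0))            # arbitrarily negative: no minimum

# Adding the quadratic penalty makes the model strictly convex.
model = lambda x: linear(x) + (x - x_s) ** 2 / (2 * eta)
grid = np.linspace(-5.0, 5.0, 100001)
x_min = grid[np.argmin(model(grid))]
print(x_min)                          # ~ x_s - eta * grad(x_s) = 1.6
```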
🎯 Deriving the Gradient Descent Formula
Goal: Minimize the local model
minₓ mₛ(x)
Take the gradient and set it to zero:
∇mₛ(x) = ∇f(xₛ) + (1/η)(x - xₛ) = 0
Solve for x:
x = xₛ - η∇f(xₛ)
✅ This is the gradient descent update formula!
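A minimal sketch of the resulting algorithm; the least-squares test function, matrix, and step size below are assumptions for illustration, not part of the derivation.

```python
import numpy as np

# Gradient descent on f(x) = ||Ax - b||^2 (assumed test problem).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: 2 * A.T @ (A @ x - b)

x, eta = np.zeros(2), 0.05
for s in range(500):
    x = x - eta * grad(x)       # x_{s+1} = x_s - eta * grad f(x_s)

x_star = np.linalg.solve(A, b)  # exact minimizer: A invertible, so Ax* = b
print(x, x_star)
```

With this step size the iterates converge to the exact solution to machine precision.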
🎓 Deep Understanding
1. The Essence of Gradient Descent:
Not directly minimizing f(x) (too hard)
Instead, at each step minimizing a local quadratic approximation ms (x)
This local model is easy to solve (closed-form solution)
2. The Role of Step Size η:
Small η: Strong quadratic penalty, small trust region, conservative steps
Large η: Weak quadratic penalty, large trust region, aggressive steps
Need to balance: too small → slow convergence, too large → instability
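These regimes are easy to demonstrate. A sketch on the assumed function f(x) = x², whose gradient is 2x, so the update is x ← (1 - 2η)x and diverges once η > 1:

```python
# f(x) = x^2: the update x <- x - eta * 2x contracts iff |1 - 2*eta| < 1.
def run(eta, steps=50, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= eta * 2 * x
    return abs(x)

print(run(0.01))   # too small: after 50 steps still far from 0
print(run(0.45))   # well chosen: essentially at the minimum
print(run(1.10))   # too large: the iterates blow up
```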
3. Connection to Newton's Method:
Gradient descent model: m(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩ + (1/(2η))‖x - xₛ‖²
Newton's method model: m(x) = f(xₛ) + ⟨∇f(xₛ), x - xₛ⟩ + ½⟨x - xₛ, H(x - xₛ)⟩, where H = ∇²f(xₛ) is the Hessian
Gradient descent approximates the Hessian H with (1/η)I
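On a quadratic, Newton's model is exact, so a single Newton step lands on the minimizer, while gradient descent's isotropic (1/η)I approximation must iterate. A sketch, where the curvature matrix H, the vector c, and the step size are assumed for illustration:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T H x - c^T x with an ill-conditioned (assumed) Hessian.
H = np.array([[10.0, 0.0], [0.0, 1.0]])
c = np.array([1.0, 1.0])
grad = lambda x: H @ x - c

x0 = np.zeros(2)
newton = x0 - np.linalg.solve(H, grad(x0))   # one step with the true Hessian
print(newton)                                # lands on H^{-1} c exactly

x, eta = x0, 0.05          # stable since eta < 2 / lambda_max(H) = 0.2
for _ in range(200):
    x = x - eta * grad(x)  # Hessian replaced by (1/eta) * I
print(x)                   # approaches the same point, step by step
```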
4. Why Does This Work?
The local model is a good approximation near the current point
Minimizing the local model is much simpler than minimizing the original function
Each step guarantees local descent (with appropriate step size)
Many steps accumulate, converging to a local minimum — which is the global minimum when f is convex
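The descent guarantee can be observed directly. A sketch on an assumed nonconvex function with a step size chosen small relative to the local curvature:

```python
# Nonconvex example (assumed): f(x) = x^4 - 3x^2 + x.
f = lambda x: x ** 4 - 3 * x ** 2 + x
grad = lambda x: 4 * x ** 3 - 6 * x + 1

x, eta = 2.0, 0.01          # conservative step for this curvature
values = [f(x)]
for _ in range(100):
    x -= eta * grad(x)
    values.append(f(x))

# With this step size, every step decreases f (monotone descent).
print(all(a >= b for a, b in zip(values, values[1:])))
```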
🔧 Practical Applications
This "local model" idea appears in many optimization algorithms:
Trust Region Methods: Explicitly control trust region size
Proximal Methods: Add regularization terms as penalties
Quasi-Newton Methods: Use better quadratic models
Adaptive Learning Rates: Dynamically adjust η to adapt to local curvature
📚 Mathematical Formulation
The gradient descent step solves:
xₛ₊₁ = argminₓ∈ℝᵈ { ⟨∇f(xₛ), x - xₛ⟩ + (1/(2ηₛ)) ‖x - xₛ‖₂² }
Proof: The objective is strictly convex in x. Its gradient with respect to x is:
∇f(xₛ) + (1/ηₛ)(x - xₛ)
Setting this equal to zero gives the first-order optimality condition:
∇f(xₛ) + (1/ηₛ)(x - xₛ) = 0 ⟺ x = xₛ - ηₛ ∇f(xₛ)
which is exactly the gradient descent update! ■
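The first-order condition can also be verified numerically in any dimension. A sketch with a random (assumed) vector g standing in for ∇f(xₛ): at the gradient descent point, the model's gradient vanishes up to rounding.

```python
import numpy as np

# At x = x_s - eta * g, the model's gradient g + (x - x_s)/eta vanishes.
rng = np.random.default_rng(0)
x_s = rng.standard_normal(3)
g = rng.standard_normal(3)       # stands in for grad f(x_s)
eta = 0.3

model_grad = lambda x: g + (x - x_s) / eta
x_next = x_s - eta * g           # gradient descent update
print(np.linalg.norm(model_grad(x_next)))   # zero up to rounding
```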