📊 随机梯度下降 (SGD) 可视化

Stochastic Gradient Descent: From Noise to Convergence

💡 核心动机 | Motivation

计算挑战 | Computational Challenge:

在机器学习中,目标函数通常具有有限和结构:\(f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)\),其中 \(n\) 可能非常大(如 \(10^6\) 个数据点)。计算完整梯度 \(\nabla f(x) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x)\) 需要 \(O(n)\) 的计算量。
In machine learning, the objective typically has the finite-sum structure \(f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)\), where \(n\) can be very large (e.g., \(10^6\) data points); computing the full gradient \(\nabla f(x)\) then costs \(O(n)\) per iteration.

随机梯度思想 | Stochastic Gradient Idea:
随机采样一个分量 \(i \sim \text{Uniform}(\{1,\ldots,n\})\),使用 \(\nabla f_i(x)\) 作为梯度的无偏估计:
Sample a component \(i \sim \text{Uniform}(\{1,\ldots,n\})\) and use \(\nabla f_i(x)\) as an unbiased estimate of the gradient:
\[\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)\]
每次迭代成本从 \(O(n)\) 降至 \(O(1)\)!
Per-iteration cost drops from \(O(n)\) to \(O(1)\)!
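下面用一个合成的最小二乘问题做数值检验 | As a quick sanity check, the sketch below uses a synthetic least-squares finite sum (all names such as `A`, `b`, `n` are illustrative, not from the text) to confirm that uniform component sampling gives an unbiased gradient estimate:

```python
import numpy as np

# Sketch: verify E_i[∇f_i(x)] = ∇f(x) for a synthetic least-squares sum,
# where f_i(x) = 0.5·(a_i·x - b_i)² and ∇f_i(x) = (a_i·x - b_i)·a_i.
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

full_grad = A.T @ (A @ x - b) / n                 # ∇f(x): O(n) cost

# Monte Carlo estimate of E_i[∇f_i(x)] via uniform sampling of i
idx = rng.integers(0, n, size=200_000)
sampled = A[idx] * (A[idx] @ x - b[idx])[:, None]  # each row is one ∇f_i(x)
mc_grad = sampled.mean(axis=0)

print(np.linalg.norm(mc_grad - full_grad))        # small: estimator is unbiased
```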

核心权衡 | Key Tradeoff: 用梯度的噪声估计换取计算效率,允许更多次迭代。
Trade noisy gradient estimates for computational efficiency, enabling many more iterations.

🎯 完整梯度下降
Full GD

完整梯度 | Full Gradient

\(x_{t+1} = x_t - \eta \nabla f(x_t)\)

• 每步使用精确梯度 | Exact gradient at every step
• 确定性路径 | Deterministic path
• \(O(n)\) 计算成本 | \(O(n)\) cost per iteration
• 光滑函数:\(O(1/T)\) 收敛 | \(O(1/T)\) convergence on smooth functions
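上面的更新规则可用几行代码示意 | The update rule above can be sketched in a few lines (synthetic data; the constants are illustrative assumptions):

```python
import numpy as np

# Minimal full-gradient-descent sketch on a least-squares finite sum
# f(x) = (1/2n)·||Ax - b||². All data and constants are illustrative.
rng = np.random.default_rng(1)
n, d = 200, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad(x):                      # exact gradient: touches all n points
    return A.T @ (A @ x - b) / n

eta = 0.1                         # fixed step size (assumed < 1/smoothness)
x = np.zeros(d)
for _ in range(500):
    x = x - eta * grad(x)         # deterministic: same path every run

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star))           # suboptimality, shrinks toward 0
```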

🎲 随机梯度下降
SGD


随机梯度 | Stochastic Gradient

\(x_{t+1} = x_t - \eta \tilde{g}_t\)

• \(\mathbb{E}[\tilde{g}_t \mid x_t] = \nabla f(x_t)\)
• 噪声路径,需平均化 | Noisy path, requires averaging
• \(O(1)\) 计算成本 | \(O(1)\) cost per iteration
• \(O(1/\sqrt{T})\) 收敛 | \(O(1/\sqrt{T})\) convergence
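带平均化的 SGD 最小实现草图(合成数据,仅作示意)| A minimal SGD-with-averaging sketch on synthetic data (all names assumed, not from the text):

```python
import numpy as np

# SGD sketch: one uniformly sampled component gradient per step,
# plus iterate averaging. Problem data are synthetic/illustrative.
rng = np.random.default_rng(2)
n, d = 500, 3
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)     # noisy linear targets

def stoch_grad(x):
    i = rng.integers(n)                        # i ~ Uniform{0,...,n-1}
    return (A[i] @ x - b[i]) * A[i]            # ∇f_i(x): cost independent of n

T = 20_000
eta = 0.5 / np.sqrt(T)                         # η = O(1/√T), as in the theory
x = np.zeros(d)
avg = np.zeros(d)
for _ in range(T):
    x = x - eta * stoch_grad(x)
    avg += x / T                               # running average x̄_T

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda z: 0.5 * np.mean((A @ z - b) ** 2)
print(f(avg) - f(x_star))                      # small: x̄_T approaches x*
```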

📦 Mini-batch SGD


小批量梯度 | Mini-batch Gradient

\(\bar{g}_t = \frac{1}{m}\sum_{i=1}^m \tilde{g}_i(x_t)\)

• 方差 = \(\sigma^2/m\) | Variance = \(\sigma^2/m\)
• 更平滑的路径 | Smoother path
• \(O(m)\) 计算成本 | \(O(m)\) cost per iteration
• \(O(1/\sqrt{T})\),常数更优 | \(O(1/\sqrt{T})\) with better constants
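小批量更新的示意实现 | An illustrative mini-batch update (synthetic data; batch size `m` and all constants are assumptions for the sketch):

```python
import numpy as np

# Mini-batch SGD sketch: average m component gradients per step.
rng = np.random.default_rng(3)
n, d, m = 500, 3, 32
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def minibatch_grad(x):
    idx = rng.integers(0, n, size=m)                 # sample m indices
    return A[idx].T @ (A[idx] @ x - b[idx]) / m      # ḡ_t, variance σ²/m

eta, T = 0.05, 2000
x = np.zeros(d)
for _ in range(T):
    x = x - eta * minibatch_grad(x)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))                    # near the optimum
```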

(交互图例 | Plot legend: 等高线 contours · 优化路径 optimization path · 当前位置 current position · 最优点 optimum · 平均点 averaged iterate)

📐 收敛理论 | Convergence Theory

定理 1:Lipschitz 凸函数的 SGD 收敛
Theorem 1: SGD Convergence for Lipschitz Convex Functions

假设 | Assumptions:

• \(f: \mathcal{X} \to \mathbb{R}\) 凸且 \(L\)-Lipschitz 连续 | convex and \(L\)-Lipschitz:\(\|\nabla f(x)\|_2 \le L\)
• 无偏随机梯度 | Unbiased stochastic gradients:\(\mathbb{E}[\tilde{g}_t \mid x_t] = \nabla f(x_t)\)
• 有界方差 | Bounded variance:\(\mathbb{E}[\|\tilde{g}_t - \nabla f(x_t)\|_2^2 \mid x_t] \le \sigma^2\)
• 到初始点的最大距离 | Radius from the initial point:\(R = \sup_{x \in \mathcal{X}} \|x - x_1\|_2\)

算法 | Algorithm:

\[x_{t+1} = \Pi_{\mathcal{X}}[x_t - \eta_t \tilde{g}_t], \quad \bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t\]

最优步长选择 | Optimal Step Size:

\[\eta = \frac{R}{\sqrt{L^2 + \sigma^2} \cdot \sqrt{T}}\]

收敛率 | Convergence Rate:

\[\mathbb{E}[f(\bar{x}_T)] - f(x^*) \le \frac{R\sqrt{L^2 + \sigma^2}}{\sqrt{T}} = O\left(\frac{1}{\sqrt{T}}\right)\]
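定理 1 的算法可以如下示意实现 | Theorem 1's scheme can be sketched as below, on an assumed toy instance (a noisy linear objective over a Euclidean ball; every constant and name here is an illustrative assumption, not an example from the text):

```python
import numpy as np

# Sketch of Theorem 1: projected SGD with η = R/(√(L²+σ²)·√T) and
# iterate averaging, for f(x) = g·x over the ball {||x|| ≤ R}.
rng = np.random.default_rng(4)
d, R, T = 2, 1.0, 50_000
g_star = np.array([0.6, -0.8])             # ∇f is constant; ||g_star|| = L

def proj_ball(x):                          # Π_X: project onto {||x|| ≤ R}
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

sigma = 1.0
L = np.linalg.norm(g_star)
eta = R / (np.sqrt(L**2 + sigma**2) * np.sqrt(T))

x = np.zeros(d)
avg = np.zeros(d)
for _ in range(T):
    noise = sigma * rng.normal(size=d) / np.sqrt(d)   # E||noise||² = σ²
    x = proj_ball(x - eta * (g_star + noise))         # unbiased g̃_t
    avg += x / T                                      # x̄_T

x_star = -R * g_star / L                   # minimizer of g·x on the ball
gap = g_star @ avg - g_star @ x_star
print(gap)                                 # small, on the order of R√(L²+σ²)/√T
```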

🔑 关键洞察 | Key Insights

1. 次线性收敛:\(O(1/\sqrt{T})\) 比光滑函数上 GD 的 \(O(1/T)\) 慢,这是使用噪声梯度的代价
Sublinear convergence: Slower than GD's \(O(1/T)\) for smooth functions, the price of noisy gradients

2. 方差影响:收敛率依赖于 \(\sqrt{L^2 + \sigma^2}\),当 \(\sigma \gg L\) 时方差主导
Variance matters: Rate depends on \(\sqrt{L^2 + \sigma^2}\), variance dominates when \(\sigma \gg L\)

3. 平均化至关重要:必须返回平均迭代点 \(\bar{x}_T\) 而非最后一次迭代 \(x_T\)。最后迭代点 \(x_T\) 会因噪声而持续振荡,不收敛到确定性极限。
Averaging is crucial: Must return averaged iterate \(\bar{x}_T\), not last iterate \(x_T\). The last iterate oscillates due to noise and doesn't converge to a deterministic limit.

4. 步长必须随 \(T\) 衰减:步长 \(\eta = O(1/\sqrt{T})\) 保证收敛。固定步长会导致围绕最优点的持续振荡。
Step size must decay with \(T\): Step size \(\eta = O(1/\sqrt{T})\) ensures convergence. Constant step size leads to persistent oscillation around optimum.

定理 2:光滑凸函数的 SGD 收敛
Theorem 2: SGD Convergence for Smooth Convex Functions

额外假设 | Additional Assumption: \(f\) 是 \(\ell\)-光滑(Lipschitz 梯度):
\(\|\nabla f(x) - \nabla f(y)\|_2 \le \ell\|x - y\|_2\)
\(f\) is \(\ell\)-smooth (Lipschitz gradients)

步长选择 | Step Size:

\[\eta_t = \min\left\{\frac{1}{\ell}, \frac{R}{\sigma\sqrt{2t}}\right\}\]

收敛率 | Convergence Rate:

\[\mathbb{E}[f(\bar{x}_T)] - f(x^*) \le \frac{R^2\ell}{2T} + \frac{R\sigma\sqrt{2}}{\sqrt{T}}\]
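该两段式步长可以直接写出 | The two-regime schedule is easy to write down; the values of \(\ell\), \(R\), \(\sigma\) below are illustrative assumptions:

```python
import numpy as np

# The Theorem-2 schedule η_t = min{1/ℓ, R/(σ√(2t))}: constant early
# (smoothness regime), then decaying like 1/√t (noise regime).
ell, R, sigma = 10.0, 1.0, 0.5

def eta(t):                          # t = 1, 2, 3, ...
    return min(1.0 / ell, R / (sigma * np.sqrt(2 * t)))

# The switch happens once R/(σ√(2t)) drops below 1/ℓ, i.e. t > ℓ²R²/(2σ²)
t_switch = ell**2 * R**2 / (2 * sigma**2)
print(t_switch)                      # 200.0 for these constants
print(eta(1), eta(10_000))           # 0.1 early; ~R/(σ√(2t)) late
```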

🔑 两个阶段 | Two Regimes

早期阶段(小 \(T\)):第一项 \(\frac{R^2\ell}{2T} = O(1/T)\) 主导,反映光滑性
Early stage (small \(T\)): First term \(O(1/T)\) dominates, reflects smoothness

后期阶段(大 \(T\)):第二项 \(\frac{R\sigma\sqrt{2}}{\sqrt{T}} = O(1/\sqrt{T})\) 主导,噪声成为瓶颈
Later stage (large \(T\)): Second term \(O(1/\sqrt{T})\) dominates, noise becomes bottleneck

关键观察:随着 \(T \to \infty\),随机噪声占主导,标准 SGD 无法达到优于 \(O(1/\sqrt{T})\) 的速率。这激发了方差降低技术的研究。
Key observation: As \(T \to \infty\), stochastic noise dominates, and standard SGD cannot achieve better than \(O(1/\sqrt{T})\) rate. This motivates variance reduction techniques.

命题:Mini-batch 的方差降低
Proposition: Variance Reduction via Mini-batching

如果随机梯度 \(\tilde{g}_i(x)\) 独立同分布、无偏,且方差有界为 \(\sigma^2\),则小批量梯度满足:
If the stochastic gradients \(\tilde{g}_i(x)\) are i.i.d., unbiased, and have variance bounded by \(\sigma^2\), the mini-batch gradient satisfies:

\[\text{Var}(\bar{g}_t \mid x_t) = \mathbb{E}\left[\left\|\frac{1}{m}\sum_{i=1}^m \tilde{g}_i(x_t) - \nabla f(x_t)\right\|_2^2 \,\Big|\, x_t\right] \le \frac{\sigma^2}{m}\]

证明:由于样本独立,
\(\text{Var}\left(\frac{1}{m}\sum_{i=1}^m \tilde{g}_i\right) = \frac{1}{m^2} \sum_{i=1}^m \text{Var}(\tilde{g}_i) \le \frac{m \cdot \sigma^2}{m^2} = \frac{\sigma^2}{m}\)。
Proof: By independence, the variance of the average is \(\frac{1}{m^2}\sum_{i=1}^m \text{Var}(\tilde{g}_i) \le \frac{\sigma^2}{m}\).
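这个 \(\sigma^2/m\) 的缩放可以用合成噪声做经验验证 | The \(\sigma^2/m\) scaling can be checked empirically with synthetic noise (the "gradients" below are an assumed toy model: a fixed vector plus Gaussian noise):

```python
import numpy as np

# Empirical check of Var(ḡ) = σ²/m for i.i.d. gradient estimates.
rng = np.random.default_rng(5)
d, sigma, trials = 4, 2.0, 50_000
g = rng.normal(size=d)                       # stand-in for ∇f(x)

def batch_variance(m):
    noise = sigma * rng.normal(size=(trials, m, d))
    g_bar = g + noise.mean(axis=1)           # ḡ = (1/m)·Σ g̃_i
    return np.mean(np.sum((g_bar - g) ** 2, axis=1))   # E||ḡ - ∇f||²

v1, v16 = batch_variance(1), batch_variance(16)
print(v1 / v16)                              # ≈ 16: variance shrinks as σ²/m
```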

🔑 批大小权衡 | Batch Size Tradeoff

方差降低:批大小 \(m\) 使方差降低 \(1/m\) 倍,改善收敛常数
Variance reduction: Batch size \(m\) reduces variance by factor \(1/m\), improving convergence constant

计算成本:每次迭代计算成本增加 \(m\) 倍
Computational cost: Each iteration costs \(m\) times more

并行效率:在 GPU 上,由于并行化,计算 \(m\) 个梯度几乎与计算 1 个一样快(直到内存限制)
Parallel efficiency: On GPUs, computing \(m\) gradients is nearly as fast as 1 (up to memory limits) due to parallelization

实践建议:选择批大小平衡方差降低与硬件效率。典型值:\(m = 32, 64, 128, 256\)
Practical advice: Choose batch size to balance variance reduction with hardware efficiency. Typical values: \(m = 32, 64, 128, 256\)

⚠️ 常见误解 | Common Misconceptions

误解 1:"SGD 最终会收敛到精确的最优点"
Misconception 1: "SGD eventually converges to the exact optimum"
事实:使用固定步长的 SGD 会在最优点附近振荡,不会收敛。必须使用衰减步长 \(\eta_t \to 0\) 或返回平均点。
Fact: SGD with constant step size oscillates around the optimum and doesn't converge. Must use decaying step sizes \(\eta_t \to 0\) or return averaged iterate.
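用一维带噪二次函数可以直观看到这一点 | A 1-D noisy quadratic makes this concrete (all constants are illustrative assumptions):

```python
import numpy as np

# Constant-step SGD on f(x) = x²/2 with additive gradient noise never
# settles at x* = 0: the squared error hovers at a noise floor ~η·σ²,
# while the *average* of the iterates sits near the optimum.
rng = np.random.default_rng(6)
sigma, eta, T = 1.0, 0.1, 100_000

x = 5.0
tail = []
for t in range(T):
    x -= eta * (x + sigma * rng.normal())   # g̃ = ∇f(x) + noise
    if t >= T - 10_000:
        tail.append(x)                      # record the last 10k iterates

tail = np.array(tail)
print(np.mean(tail**2))    # stays near η·σ²/(2-η) ≈ 0.053, not 0
print(np.mean(tail))       # the averaged iterate is near the optimum 0
```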

误解 2:"增大批大小 \(m\) 总是更好"
Misconception 2: "A larger batch size \(m\) is always better"
事实:对于固定的计算预算,过大的批大小意味着更少的参数更新次数,可能导致更慢的收敛(总时间)。存在最优批大小。
Fact: For fixed computational budget, overly large batch size means fewer parameter updates, potentially leading to slower convergence in wall-clock time. Optimal batch size exists.

📊 方法对比总结

| 方法 Method | 每步成本 Cost/Iter | 收敛率 Rate | 优势 Advantages | 劣势 Disadvantages |
| --- | --- | --- | --- | --- |
| Full GD | \(O(n)\) | \(O(1/T)\) 光滑 smooth;\(O(1/\sqrt{T})\) 非光滑 nonsmooth | 确定性,稳定路径,精确梯度 Deterministic, stable path, exact gradient | 大数据集时计算昂贵 Expensive on large datasets |
| SGD | \(O(1)\) | \(O(1/\sqrt{T})\) | 每步极快,可扩展,内存高效 Very cheap per step, scalable, memory-efficient | 噪声大,需要平均化,需调节步长 Noisy; needs averaging and step-size tuning |
| Mini-batch SGD | \(O(m)\) | \(O(1/\sqrt{T})\),常数更优 better constants | 平衡方差和效率,GPU 友好,稳定 Balances variance and efficiency; GPU-friendly; stable | 需要调节批大小和步长 Requires tuning batch size and step size |

💡 实践指南 | Practical Guidance

何时使用 Full GD:小数据集(\(n < 10^4\)),需要精确优化,凸函数
When to use Full GD: Small datasets (\(n < 10^4\)), need exact optimization, convex functions

何时使用 SGD:大数据集(\(n > 10^6\)),在线学习,内存受限,流数据
When to use SGD: Large datasets (\(n > 10^6\)), online learning, memory constraints, streaming data

何时使用 Mini-batch:大多数深度学习应用,有 GPU/TPU 加速,需要稳定训练
When to use Mini-batch: Most deep learning applications, with GPU/TPU acceleration, need stable training

批大小建议 | Batch size recommendations:
• 小批量 (\(m = 32, 64\)): 更好的泛化,适合小模型 | Small batch: better generalization, suitable for small models
• 中批量 (\(m = 128, 256\)): 平衡性能和泛化,最常用 | Medium batch: balances performance and generalization, most common
• 大批量 (\(m = 512, 1024\)): 训练稳定,可能需要特殊技巧(如学习率缩放) | Large batch: stable training, may need special techniques (e.g., learning rate scaling)