📊 随机梯度下降 (SGD) 可视化

Stochastic Gradient Descent: From Noise to Convergence

💡 核心动机 | Motivation

计算挑战 | Computational Challenge:

在机器学习中,目标函数通常具有有限和结构:\(f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)\),其中 \(n\) 可能非常大(如 \(10^6\) 个数据点)。计算完整梯度 \(\nabla f(x) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x)\) 需要 \(O(n)\) 的计算量。
In machine learning, the objective typically has the finite-sum structure \(f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)\), where \(n\) can be very large (e.g., \(10^6\) data points); computing the full gradient \(\nabla f(x)\) then costs \(O(n)\) per iteration.

随机梯度思想 | Stochastic Gradient Idea:
随机采样一个分量 \(i \sim \text{Uniform}(\{1,\ldots,n\})\),使用 \(\nabla f_i(x)\) 作为梯度的无偏估计:
Sample a component \(i \sim \text{Uniform}(\{1,\ldots,n\})\) and use \(\nabla f_i(x)\) as an unbiased estimate of the gradient:
\[\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)\]
每次迭代成本从 \(O(n)\) 降至 \(O(1)\)!
Per-iteration cost drops from \(O(n)\) to \(O(1)\)!
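下面用一个合成的最小二乘问题做数值检验 | As a quick sanity check, the sketch below uses a synthetic least-squares finite sum (all names such as `A`, `b`, `n` are illustrative, not from the text) to confirm that uniform component sampling gives an unbiased gradient estimate:

```python
import numpy as np

# Sketch: verify E_i[∇f_i(x)] = ∇f(x) for a synthetic least-squares sum,
# where f_i(x) = 0.5·(a_i·x - b_i)² and ∇f_i(x) = (a_i·x - b_i)·a_i.
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

full_grad = A.T @ (A @ x - b) / n                 # ∇f(x): O(n) cost

# Monte Carlo estimate of E_i[∇f_i(x)] via uniform sampling of i
idx = rng.integers(0, n, size=200_000)
sampled = A[idx] * (A[idx] @ x - b[idx])[:, None]  # each row is one ∇f_i(x)
mc_grad = sampled.mean(axis=0)

print(np.linalg.norm(mc_grad - full_grad))        # small: estimator is unbiased
```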

核心权衡 | Key Tradeoff: 用梯度的噪声估计换取计算效率,允许更多次迭代。
Trade noisy gradient estimates for computational efficiency, enabling many more iterations.

🎯 完整梯度下降
Full GD

完整梯度 | Full Gradient

\(x_{t+1} = x_t - \eta \nabla f(x_t)\)

• 每步使用精确梯度 | Exact gradient at every step
• 确定性路径 | Deterministic path
• \(O(n)\) 计算成本 | \(O(n)\) cost per iteration
• 光滑函数:\(O(1/T)\) 收敛 | \(O(1/T)\) convergence on smooth functions
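上面的更新规则可用几行代码示意 | The update rule above can be sketched in a few lines (synthetic data; the constants are illustrative assumptions):

```python
import numpy as np

# Minimal full-gradient-descent sketch on a least-squares finite sum
# f(x) = (1/2n)·||Ax - b||². All data and constants are illustrative.
rng = np.random.default_rng(1)
n, d = 200, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad(x):                      # exact gradient: touches all n points
    return A.T @ (A @ x - b) / n

eta = 0.1                         # fixed step size (assumed < 1/smoothness)
x = np.zeros(d)
for _ in range(500):
    x = x - eta * grad(x)         # deterministic: same path every run

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star))           # suboptimality, shrinks toward 0
```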

🎲 随机梯度下降
SGD


随机梯度 | Stochastic Gradient

\(x_{t+1} = x_t - \eta \tilde{g}_t\)

• \(\mathbb{E}[\tilde{g}_t \mid x_t] = \nabla f(x_t)\)
• 噪声路径,需平均化 | Noisy path, requires averaging
• \(O(1)\) 计算成本 | \(O(1)\) cost per iteration
• \(O(1/\sqrt{T})\) 收敛 | \(O(1/\sqrt{T})\) convergence
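带平均化的 SGD 最小实现草图(合成数据,仅作示意)| A minimal SGD-with-averaging sketch on synthetic data (all names assumed, not from the text):

```python
import numpy as np

# SGD sketch: one uniformly sampled component gradient per step,
# plus iterate averaging. Problem data are synthetic/illustrative.
rng = np.random.default_rng(2)
n, d = 500, 3
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)     # noisy linear targets

def stoch_grad(x):
    i = rng.integers(n)                        # i ~ Uniform{0,...,n-1}
    return (A[i] @ x - b[i]) * A[i]            # ∇f_i(x): cost independent of n

T = 20_000
eta = 0.5 / np.sqrt(T)                         # η = O(1/√T), as in the theory
x = np.zeros(d)
avg = np.zeros(d)
for _ in range(T):
    x = x - eta * stoch_grad(x)
    avg += x / T                               # running average x̄_T

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda z: 0.5 * np.mean((A @ z - b) ** 2)
print(f(avg) - f(x_star))                      # small: x̄_T approaches x*
```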

📦 Mini-batch SGD


小批量梯度 | Mini-batch Gradient

\(\bar{g}_t = \frac{1}{m}\sum_{i=1}^m \tilde{g}_i(x_t)\)

• 方差 = \(\sigma^2/m\) | Variance = \(\sigma^2/m\)
• 更平滑的路径 | Smoother path
• \(O(m)\) 计算成本 | \(O(m)\) cost per iteration
• \(O(1/\sqrt{T})\),常数更优 | \(O(1/\sqrt{T})\) with better constants
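小批量更新的示意实现 | An illustrative mini-batch update (synthetic data; batch size `m` and all constants are assumptions for the sketch):

```python
import numpy as np

# Mini-batch SGD sketch: average m component gradients per step.
rng = np.random.default_rng(3)
n, d, m = 500, 3, 32
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def minibatch_grad(x):
    idx = rng.integers(0, n, size=m)                 # sample m indices
    return A[idx].T @ (A[idx] @ x - b[idx]) / m      # ḡ_t, variance σ²/m

eta, T = 0.05, 2000
x = np.zeros(d)
for _ in range(T):
    x = x - eta * minibatch_grad(x)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))                    # near the optimum
```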

(交互图例 | Plot legend: 等高线 contours · 优化路径 optimization path · 当前位置 current position · 最优点 optimum · 平均点 averaged iterate)

📐 收敛理论 | Convergence Theory

定理 1:Lipschitz 凸函数的 SGD 收敛
Theorem 1: SGD Convergence for Lipschitz Convex Functions

假设 | Assumptions:

• \(f: \mathcal{X} \to \mathbb{R}\) 凸且 \(L\)-Lipschitz 连续 | convex and \(L\)-Lipschitz:\(\|\nabla f(x)\|_2 \le L\)
• 无偏随机梯度 | Unbiased stochastic gradients:\(\mathbb{E}[\tilde{g}_t \mid x_t] = \nabla f(x_t)\)
• 有界方差 | Bounded variance:\(\mathbb{E}[\|\tilde{g}_t - \nabla f(x_t)\|_2^2 \mid x_t] \le \sigma^2\)
• 到初始点的最大距离 | Radius from the initial point:\(R = \sup_{x \in \mathcal{X}} \|x - x_1\|_2\)

算法 | Algorithm:

\[x_{t+1} = \Pi_{\mathcal{X}}[x_t - \eta_t \tilde{g}_t], \quad \bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t\]

最优步长选择 | Optimal Step Size:

\[\eta = \frac{R}{\sqrt{L^2 + \sigma^2} \cdot \sqrt{T}}\]

收敛率 | Convergence Rate:

\[\mathbb{E}[f(\bar{x}_T)] - f(x^*) \le \frac{R\sqrt{L^2 + \sigma^2}}{\sqrt{T}} = O\left(\frac{1}{\sqrt{T}}\right)\]
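定理 1 的算法可以如下示意实现 | Theorem 1's scheme can be sketched as below, on an assumed toy instance (a noisy linear objective over a Euclidean ball; every constant and name here is an illustrative assumption, not an example from the text):

```python
import numpy as np

# Sketch of Theorem 1: projected SGD with η = R/(√(L²+σ²)·√T) and
# iterate averaging, for f(x) = g·x over the ball {||x|| ≤ R}.
rng = np.random.default_rng(4)
d, R, T = 2, 1.0, 50_000
g_star = np.array([0.6, -0.8])             # ∇f is constant; ||g_star|| = L

def proj_ball(x):                          # Π_X: project onto {||x|| ≤ R}
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

sigma = 1.0
L = np.linalg.norm(g_star)
eta = R / (np.sqrt(L**2 + sigma**2) * np.sqrt(T))

x = np.zeros(d)
avg = np.zeros(d)
for _ in range(T):
    noise = sigma * rng.normal(size=d) / np.sqrt(d)   # E||noise||² = σ²
    x = proj_ball(x - eta * (g_star + noise))         # unbiased g̃_t
    avg += x / T                                      # x̄_T

x_star = -R * g_star / L                   # minimizer of g·x on the ball
gap = g_star @ avg - g_star @ x_star
print(gap)                                 # small, on the order of R√(L²+σ²)/√T
```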

🔑 关键洞察 | Key Insights

1. 次线性收敛:\(O(1/\sqrt{T})\) 比光滑函数上 GD 的 \(O(1/T)\) 慢,这是使用噪声梯度的代价
Sublinear convergence: Slower than GD's \(O(1/T)\) for smooth functions, the price of noisy gradients

2. 方差影响:收敛率依赖于 \(\sqrt{L^2 + \sigma^2}\),当 \(\sigma \gg L\) 时方差主导
Variance matters: Rate depends on \(\sqrt{L^2 + \sigma^2}\), variance dominates when \(\sigma \gg L\)

3. 平均化至关重要:必须返回平均迭代点 \(\bar{x}_T\) 而非最后一次迭代 \(x_T\)。最后迭代点 \(x_T\) 会因噪声而持续振荡,不收敛到确定性极限。
Averaging is crucial: Must return averaged iterate \(\bar{x}_T\), not last iterate \(x_T\). The last iterate oscillates due to noise and doesn't converge to a deterministic limit.

4. 步长必须随 \(T\) 衰减:步长 \(\eta = O(1/\sqrt{T})\) 保证收敛。固定步长会导致围绕最优点的持续振荡。
Step size must decay with \(T\): Step size \(\eta = O(1/\sqrt{T})\) ensures convergence. Constant step size leads to persistent oscillation around optimum.

定理 2:光滑凸函数的 SGD 收敛
Theorem 2: SGD Convergence for Smooth Convex Functions

额外假设 | Additional Assumption: \(f\) 是 \(\ell\)-光滑(Lipschitz 梯度):
\(\|\nabla f(x) - \nabla f(y)\|_2 \le \ell\|x - y\|_2\)
\(f\) is \(\ell\)-smooth (Lipschitz gradients)

步长选择 | Step Size:

\[\eta_t = \min\left\{\frac{1}{\ell}, \frac{R}{\sigma\sqrt{2t}}\right\}\]

收敛率 | Convergence Rate:

\[\mathbb{E}[f(\bar{x}_T)] - f(x^*) \le \frac{R^2\ell}{2T} + \frac{R\sigma\sqrt{2}}{\sqrt{T}}\]
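该两段式步长可以直接写出 | The two-regime schedule is easy to write down; the values of \(\ell\), \(R\), \(\sigma\) below are illustrative assumptions:

```python
import numpy as np

# The Theorem-2 schedule η_t = min{1/ℓ, R/(σ√(2t))}: constant early
# (smoothness regime), then decaying like 1/√t (noise regime).
ell, R, sigma = 10.0, 1.0, 0.5

def eta(t):                          # t = 1, 2, 3, ...
    return min(1.0 / ell, R / (sigma * np.sqrt(2 * t)))

# The switch happens once R/(σ√(2t)) drops below 1/ℓ, i.e. t > ℓ²R²/(2σ²)
t_switch = ell**2 * R**2 / (2 * sigma**2)
print(t_switch)                      # 200.0 for these constants
print(eta(1), eta(10_000))           # 0.1 early; ~R/(σ√(2t)) late
```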

🔑 两个阶段 | Two Regimes

早期阶段(小 \(T\)):第一项 \(\frac{R^2\ell}{2T} = O(1/T)\) 主导,反映光滑性
Early stage (small \(T\)): First term \(O(1/T)\) dominates, reflects smoothness

后期阶段(大 \(T\)):第二项 \(\frac{R\sigma\sqrt{2}}{\sqrt{T}} = O(1/\sqrt{T})\) 主导,噪声成为瓶颈
Later stage (large \(T\)): Second term \(O(1/\sqrt{T})\) dominates, noise becomes bottleneck

关键观察:随着 \(T \to \infty\),随机噪声占主导,标准 SGD 无法达到优于 \(O(1/\sqrt{T})\) 的速率。这激发了方差降低技术的研究。
Key observation: As \(T \to \infty\), stochastic noise dominates, and standard SGD cannot achieve better than \(O(1/\sqrt{T})\) rate. This motivates variance reduction techniques.

命题:Mini-batch 的方差降低
Proposition: Variance Reduction via Mini-batching

如果随机梯度 \(\tilde{g}_i(x)\) 独立同分布、无偏,且方差有界为 \(\sigma^2\),则小批量梯度满足:
If the stochastic gradients \(\tilde{g}_i(x)\) are i.i.d., unbiased, and have variance bounded by \(\sigma^2\), the mini-batch gradient satisfies:

\[\text{Var}(\bar{g}_t \mid x_t) = \mathbb{E}\left[\left\|\frac{1}{m}\sum_{i=1}^m \tilde{g}_i(x_t) - \nabla f(x_t)\right\|_2^2 \,\Big|\, x_t\right] \le \frac{\sigma^2}{m}\]

证明:由于样本独立,
\(\text{Var}\left(\frac{1}{m}\sum_{i=1}^m \tilde{g}_i\right) = \frac{1}{m^2} \sum_{i=1}^m \text{Var}(\tilde{g}_i) \le \frac{m \cdot \sigma^2}{m^2} = \frac{\sigma^2}{m}\)。
Proof: By independence, the variance of the average is \(\frac{1}{m^2}\sum_{i=1}^m \text{Var}(\tilde{g}_i) \le \frac{\sigma^2}{m}\).
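这个 \(\sigma^2/m\) 的缩放可以用合成噪声做经验验证 | The \(\sigma^2/m\) scaling can be checked empirically with synthetic noise (the "gradients" below are an assumed toy model: a fixed vector plus Gaussian noise):

```python
import numpy as np

# Empirical check of Var(ḡ) = σ²/m for i.i.d. gradient estimates.
rng = np.random.default_rng(5)
d, sigma, trials = 4, 2.0, 50_000
g = rng.normal(size=d)                       # stand-in for ∇f(x)

def batch_variance(m):
    noise = sigma * rng.normal(size=(trials, m, d))
    g_bar = g + noise.mean(axis=1)           # ḡ = (1/m)·Σ g̃_i
    return np.mean(np.sum((g_bar - g) ** 2, axis=1))   # E||ḡ - ∇f||²

v1, v16 = batch_variance(1), batch_variance(16)
print(v1 / v16)                              # ≈ 16: variance shrinks as σ²/m
```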

🔑 批大小权衡 | Batch Size Tradeoff

方差降低:批大小 \(m\) 使方差降低 \(1/m\) 倍,改善收敛常数
Variance reduction: Batch size \(m\) reduces variance by factor \(1/m\), improving convergence constant

计算成本:每次迭代计算成本增加 \(m\) 倍
Computational cost: Each iteration costs \(m\) times more

并行效率:在 GPU 上,由于并行化,计算 \(m\) 个梯度几乎与计算 1 个一样快(直到内存限制)
Parallel efficiency: On GPUs, computing \(m\) gradients is nearly as fast as 1 (up to memory limits) due to parallelization

实践建议:选择批大小平衡方差降低与硬件效率。典型值:\(m = 32, 64, 128, 256\)
Practical advice: Choose batch size to balance variance reduction with hardware efficiency. Typical values: \(m = 32, 64, 128, 256\)

⚠️ 常见误解 | Common Misconceptions

误解 1:"SGD 最终会收敛到精确的最优点"
Misconception 1: "SGD eventually converges to the exact optimum"
事实:使用固定步长的 SGD 会在最优点附近振荡,不会收敛。必须使用衰减步长 \(\eta_t \to 0\) 或返回平均点。
Fact: SGD with constant step size oscillates around the optimum and doesn't converge. Must use decaying step sizes \(\eta_t \to 0\) or return averaged iterate.
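用一维带噪二次函数可以直观看到这一点 | A 1-D noisy quadratic makes this concrete (all constants are illustrative assumptions):

```python
import numpy as np

# Constant-step SGD on f(x) = x²/2 with additive gradient noise never
# settles at x* = 0: the squared error hovers at a noise floor ~η·σ²,
# while the *average* of the iterates sits near the optimum.
rng = np.random.default_rng(6)
sigma, eta, T = 1.0, 0.1, 100_000

x = 5.0
tail = []
for t in range(T):
    x -= eta * (x + sigma * rng.normal())   # g̃ = ∇f(x) + noise
    if t >= T - 10_000:
        tail.append(x)                      # record the last 10k iterates

tail = np.array(tail)
print(np.mean(tail**2))    # stays near η·σ²/(2-η) ≈ 0.053, not 0
print(np.mean(tail))       # the averaged iterate is near the optimum 0
```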

误解 2:"增大批大小 \(m\) 总是更好"
Misconception 2: "A larger batch size \(m\) is always better"
事实:对于固定的计算预算,过大的批大小意味着更少的参数更新次数,可能导致更慢的收敛(总时间)。存在最优批大小。
Fact: For fixed computational budget, overly large batch size means fewer parameter updates, potentially leading to slower convergence in wall-clock time. Optimal batch size exists.

📊 方法对比总结

| 方法 Method | 每步成本 Cost/Iter | 收敛率 Rate | 优势 Advantages | 劣势 Disadvantages |
| --- | --- | --- | --- | --- |
| Full GD | \(O(n)\) | \(O(1/T)\) 光滑 smooth;\(O(1/\sqrt{T})\) 非光滑 nonsmooth | 确定性,稳定路径,精确梯度 Deterministic, stable path, exact gradient | 大数据集时计算昂贵 Expensive on large datasets |
| SGD | \(O(1)\) | \(O(1/\sqrt{T})\) | 每步极快,可扩展,内存高效 Very cheap per step, scalable, memory-efficient | 噪声大,需要平均化,需调节步长 Noisy; needs averaging and step-size tuning |
| Mini-batch SGD | \(O(m)\) | \(O(1/\sqrt{T})\),常数更优 better constants | 平衡方差和效率,GPU 友好,稳定 Balances variance and efficiency; GPU-friendly; stable | 需要调节批大小和步长 Requires tuning batch size and step size |

💡 实践指南 | Practical Guidance

何时使用 Full GD:小数据集(\(n < 10^4\)),需要精确优化,凸函数
When to use Full GD: Small datasets (\(n < 10^4\)), need exact optimization, convex functions

何时使用 SGD:大数据集(\(n > 10^6\)),在线学习,内存受限,流数据
When to use SGD: Large datasets (\(n > 10^6\)), online learning, memory constraints, streaming data

何时使用 Mini-batch:大多数深度学习应用,有 GPU/TPU 加速,需要稳定训练
When to use Mini-batch: Most deep learning applications, with GPU/TPU acceleration, need stable training

批大小建议 | Batch size recommendations:
• 小批量 (\(m = 32, 64\)): 更好的泛化,适合小模型 | Small batch: better generalization, suitable for small models
• 中批量 (\(m = 128, 256\)): 平衡性能和泛化,最常用 | Medium batch: balances performance and generalization, most common
• 大批量 (\(m = 512, 1024\)): 训练稳定,可能需要特殊技巧(如学习率缩放) | Large batch: stable training, may need special techniques (e.g., learning rate scaling)