🌍 欧式几何 vs 非欧几何

Euclidean vs Non-Euclidean Geometry: Understanding Distance in Different Spaces

🎯 从几何到优化 | From Geometry to Optimization

核心问题 | Key Question: 什么是"距离"？What is "distance"?

在不同的空间中，"距离"和"最短路径"的定义是不同的。理解这一点对于设计优化算法至关重要。
In different spaces, the definitions of "distance" and "shortest path" differ fundamentally. Understanding this is crucial for designing optimization algorithms.

欧式空间 \((\mathbb{R}^n)\): \(d(x,y) = \|x - y\|_2 = \sqrt{\sum_{i}(x_i - y_i)^2}\)

Riemannian 流形: \(d(x,y) = \inf\{L(\gamma) : \gamma \text{ is a curve from } x \text{ to } y\}\)

优化中的启示 | Insight for Optimization:

优化算法本质上是在参数空间中寻找"最短路径"到达最优点。选择与空间几何结构相匹配的度量，能够获得更优的收敛性质和算法性能。
Optimization algorithms fundamentally seek the "shortest path" to the optimum in parameter space. Choosing a metric that matches the geometric structure of the space yields better convergence properties and algorithm performance.

Mirror Descent 的动机: 当优化约束域具有非欧几何结构（如概率单纯形 \(\Delta^n = \{x \in \mathbb{R}^n_+ : \sum_i x_i = 1\}\)），使用 Bregman 散度而非欧式距离，算法能够自然地保持约束并获得更好的收敛率。
Motivation for Mirror Descent: When the constraint set has non-Euclidean structure (e.g., probability simplex), using Bregman divergence instead of Euclidean distance allows the algorithm to naturally preserve constraints and achieve better convergence rates.

📐 欧式几何 | Euclidean Geometry

平面 \(\mathbb{R}^2\) 中的直线 | Straight Lines in \(\mathbb{R}^2\)

度量结构 | Metric Structure: \((\mathbb{R}^n, \|\cdot\|_2)\) 配备欧式范数 / Equipped with Euclidean norm

距离公式 | Distance Formula: \(d(x,y) = \|x - y\|_2 = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}\)

测地线 | Geodesics: 两点间的直线段（唯一最短路径）/ Straight line segments (unique shortest path)

几何性质 | Geometric Properties:

• 平行公设：通过直线外一点有唯一平行线
Parallel postulate: Unique parallel through external point
• 三角形内角和 = \(\pi\)
Triangle angle sum = \(\pi\)
• 曲率 \(K = 0\)（平坦空间）
Curvature \(K = 0\) (flat space)

直线路径

起点

终点

🌐 非欧几何 | Non-Euclidean Geometry

球面 \(S^2\) 中的大圆 | Great Circles on \(S^2\)

度量结构 | Metric Structure: \((S^2, g)\) Riemannian 流形，其中 \(g\) 是球面度量 / Riemannian manifold with spherical metric

距离公式 | Distance Formula (intrinsic):
\(d(x,y) = R\cdot\arccos(\langle x,y\rangle)\) 其中 \(\langle \cdot,\cdot\rangle\) 是内积
where \(\langle \cdot,\cdot\rangle\) is the inner product in \(\mathbb{R}^3\)

测地线 | Geodesics: 大圆弧（球面上的"直线"）/ Great circle arcs (the "straight lines" on sphere)

几何性质 | Geometric Properties:

• 不存在平行线：所有大圆最终相交
No parallel lines: All great circles intersect
• 三角形内角和 \(> \pi\)（多余角 = \(K \cdot\)面积）
Triangle angle sum \(> \pi\) (excess = \(K\cdot\)area)
• 正曲率 \(K = 1/R^2\)（弯曲空间）
Positive curvature \(K = 1/R^2\) (curved space)

大圆路径

起点

终点

💡 从几何到优化算法设计 | From Geometry to Algorithm Design

🛫 直观例子：飞机航线 | Intuitive Example: Flight Routes

从北京（39.9°N, 116.4°E）飞往纽约（40.7°N, -74.0°W）：
Flying from Beijing (39.9°N, 116.4°E) to New York (40.7°N, -74.0°W):

• 在平面地图（Mercator投影）上：看似弯曲的路线
On flat map (Mercator projection): Appears as curved route
• 在地球表面（S²）上：沿大圆的测地线，实际最短！
On Earth's surface (S²): Great circle geodesic, actually shortest!
• 关键：内蕴几何（intrinsic geometry）决定最短路径
Key: The intrinsic geometry determines the shortest path

如果飞行员按照平面地图的"直线"飞行，实际上走的是球面上的弦（chord），但弦不在地球表面上，因此不是可行路径。
If pilots fly along the "straight line" on a flat map, they follow a chord through the sphere—but this is not on Earth's surface, hence not a feasible path.

📊 数学类比：优化中的几何结构 | Mathematical Analogy: Geometry in Optimization

问题设定 | Problem Setting: \(\min_{x\in\mathcal{X}} f(x)\)，其中 \(\mathcal{X} \subset \mathbb{R}^n\) 是约束集
where \(\mathcal{X} \subset \mathbb{R}^n\) is the constraint set

1. 欧式梯度下降 (Euclidean GD):

\[x_{t+1} = \Pi_{\mathcal{X}}[x_t - \eta\nabla f(x_t)]\]

• 度量：欧式距离 \(D(x,y) = \frac{1}{2}\|x-y\|_2^2\)
Metric: Euclidean distance
• 适用：无约束或简单凸约束（如 \(\ell_2\) 球）
Suitable for: Unconstrained or simple convex constraints
• 需要投影算子 \(\Pi_{\mathcal{X}}\) 保持约束
Requires projection operator \(\Pi_{\mathcal{X}}\) to maintain constraints

2. Mirror Descent (Bregman GD):

\[x_{t+1} = \nabla\psi^*[\nabla\psi(x_t) - \eta\nabla f(x_t)]\]

• 度量：Bregman 散度 \(D_{\psi}(x,y) = \psi(x) - \psi(y) - \langle\nabla\psi(y), x-y\rangle\)
Metric: Bregman divergence
• \(\psi\) 是镜像映射（mirror map），强凸且可微
\(\psi\) is the mirror map, strongly convex and differentiable
• 适用：具有特殊几何结构的约束集
Suitable for: Constraint sets with special geometric structure

🎯 关键例子：概率单纯形 | Key Example: Probability Simplex

约束集: \(\Delta^n = \{x \in \mathbb{R}^n_+ : \sum_i x_i = 1\}\)（所有离散概率分布的空间）
Constraint set: Space of all discrete probability distributions

为什么需要 Mirror Descent？| Why Mirror Descent?

欧式几何的问题：

• 欧式投影 \(\Pi_{\Delta^n}\) 计算复杂（需求解 QP）
Euclidean projection \(\Pi_{\Delta^n}\) is computationally expensive (requires solving QP)
• 不尊重概率分布的自然几何结构
Does not respect the natural geometry of probability distributions

信息几何的优势：

• 选择 \(\psi(x) = \sum_i x_i \log x_i\)（负熵）
Choose \(\psi(x) = \sum_i x_i \log x_i\) (negative entropy)
• Bregman 散度 = KL 散度：\(D_{\text{KL}}(x\|y) = \sum_i x_i \log\frac{x_i}{y_i}\)
Bregman divergence = KL divergence
• 更新规则自动保持非负性和归一化！
Update rule automatically preserves non-negativity and normalization!

\[x_{t+1,i} \propto x_{t,i} \exp\left(-\eta\cdot\frac{\partial f}{\partial x_i}\right)\] 乘性更新 | Multiplicative update

📈 理论优势 | Theoretical Advantages

收敛率 (Convex):

• GD: \(O(1/\sqrt{T})\) 当使用最优步长
• MD: \(O(1/\sqrt{T})\) 但常数因子更优，特别是当初始点距离最优解较远时
MD: \(O(1/\sqrt{T})\) but with better constants, especially when starting far from optimum

适应性 (Adaptivity):

• GD: 对所有维度一视同仁
GD: Treats all dimensions equally
• MD: 根据 \(\psi\) 的 Hessian 自适应地缩放不同维度
MD: Adaptively scales different dimensions via Hessian of \(\psi\)

稀疏性 (Sparsity):

• 负熵镜像映射鼓励稀疏解（许多分量趋向 0）
Negative entropy mirror map encourages sparse solutions

🔗 统一框架 | Unified Framework

Mirror Descent 是一个统一的优化框架：
Mirror Descent provides a unified optimization framework:

• 选择 \(\psi(x) = \frac{1}{2}\|x\|_2^2\) → MD 退化为标准 GD
Choose \(\psi(x) = \frac{1}{2}\|x\|_2^2\) → MD reduces to standard GD

• 选择 \(\psi(x) = \sum_i x_i \log x_i\) → 得到 Exponentiated Gradient
Choose \(\psi(x) = \sum_i x_i \log x_i\) → Obtain Exponentiated Gradient

• 选择 \(\psi(X) = -\log \det(X)\) → 得到正定矩阵空间上的优化
Choose \(\psi(X) = -\log \det(X)\) → Optimization on positive definite matrices

核心思想：让算法的几何与问题的几何相匹配！
Core Idea: Match the algorithm's geometry to the problem's geometry!