Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Introduction

Large language models (LLMs) have achieved remarkable performance as their scale increases, but their pre-training is extremely expensive in time, compute, and cost. For example, training models like PaLM can take months and cost millions of dollars. Pre-training cost is therefore a critical bottleneck in scaling LLMs, and faster optimization algorithms are one of the most direct ways to reduce it.

Currently, Adam and its variants dominate LLM training due to their robustness and effectiveness. However, designing faster optimizers is challenging for two main reasons:

  • The theoretical understanding of Adam's preconditioning mechanism is still limited.
  • More advanced second-order methods (Hessian-based) are typically too computationally expensive for large-scale models.

To address this, the paper introduces Sophia (Second-order Clipped Stochastic Optimization), a scalable second-order optimizer with the following key ideas:

  • It uses a lightweight diagonal Hessian estimate as a preconditioner.
  • It applies element-wise clipping to control update magnitudes and improve stability.
  • It computes Hessian estimates infrequently (every k steps) to reduce overhead.

Empirically, Sophia achieves significant improvements:

  • 2x speedup over Adam in terms of training steps, total compute, and wall-clock time.
  • Shows better scaling behavior as model size increases.

Conceptually, Sophia improves optimization by:

  • Better adapting to heterogeneous curvature across parameters (sharp vs flat directions).
  • Penalizing updates more in sharp directions and less in flat ones.
  • Avoiding instability from non-convexity via clipping.

Theoretically, the authors show that Sophia can achieve convergence rates independent of the condition number, highlighting its advantage over traditional first-order methods.

Method

We first instantiate gradient descent (GD) and Adam on a simplified 2D problem to motivate the use of second-order information and per-coordinate clipping. We then present Sophia in detail, with pseudo-code in Algorithm 3, and introduce the two choices of diagonal Hessian estimators used in Sophia.

Motivation

Heterogeneous curvatures. The loss functions of modern deep learning problems often have very different curvatures across parameter dimensions. For example, on a 125M-parameter GPT-2 model, the distribution of positive diagonal entries of the Hessian is widely dispersed.

We demonstrate the limitations of Adam and GD on heterogeneous landscapes by considering a two-dimensional loss function $L(\theta_1, \theta_2) = L_1(\theta_1) + L_2(\theta_2)$, where $L_1$ is much sharper than $L_2$.[^1]

[^1]: Concretely, $L_1(\theta_1) = 8(\theta_1 - 1)^2 (1.3 \theta_1^2 + 2 \theta_1 + 1)$ and $L_2(\theta_2) = \frac{1}{2}(\theta_2 - 4)^2$.
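
To make the curvature gap concrete, here is a minimal Python sketch (not from the paper) that evaluates the second derivatives of these two losses at their minima, $\theta_1 = 1$ and $\theta_2 = 4$, by finite differences:

```python
# Toy objective from the footnote: L(theta1, theta2) = L1(theta1) + L2(theta2).
L1 = lambda t1: 8 * (t1 - 1) ** 2 * (1.3 * t1 ** 2 + 2 * t1 + 1)
L2 = lambda t2: 0.5 * (t2 - 4) ** 2

def second_derivative(f, x, eps=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2

h1 = second_derivative(L1, 1.0)   # curvature of L1 at its minimum theta1 = 1
h2 = second_derivative(L2, 4.0)   # curvature of L2 at its minimum theta2 = 4
print(h1, h2)                     # roughly 68.8 vs 1.0, so h1 >> h2
```

The sharp coordinate's curvature comes out roughly 70 times larger than the flat one's, which is exactly the heterogeneity the updates below have to cope with.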

Figure: the 2D loss landscape.

Recall that GD's update in this setting is:

$$\theta_1 \leftarrow \theta_1 - \eta L_1'(\theta_1) \quad \text{and} \quad \theta_2 \leftarrow \theta_2 - \eta L_2'(\theta_2)$$

A common simplification of Adam that is more amenable to analysis is SignGD, where Adam's update simplifies to $\eta \nabla L(\theta) / |\nabla L(\theta)| = \eta \cdot \operatorname{sign}(\nabla L(\theta))$ (where all operations are entry-wise). Applying this update rule to our setting gives:

$$\theta_1 \leftarrow \theta_1 - \eta \operatorname{sign}(L_1'(\theta_1)) \quad \text{and} \quad \theta_2 \leftarrow \theta_2 - \eta \operatorname{sign}(L_2'(\theta_2))$$

Limitation of GD and SignGD (Adam)

The optimal learning rate should be inversely proportional to the curvature. Let $h_1$ and $h_2$ be the curvatures of $L_1$ and $L_2$ at the optimum.

If $h_1 \gg h_2$ (sharp vs. flat dimension):

  • The optimal learning rate for $\theta_1$ is $\approx 1/h_1$ (small).
  • The optimal learning rate for $\theta_2$ is $\approx 1/h_2$ (much larger).

GD must choose a small step size to remain stable in sharp directions. This leads to very slow convergence in flat dimensions.

The update size of SignGD is the learning rate $\eta$ in all dimensions.

This causes two problems:

  • Flat directions: progress is slow because each step reduces the loss only slightly.
  • Sharp directions: updates may overshoot and cause oscillations around the minimum.

To stabilize the sharp direction, the learning rate must decay toward zero, which further slows convergence in flat directions. The trajectory of Adam in this example is indeed similar to SignGD.
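
As a rough illustration (not from the paper), the sketch below runs the two update rules above on the toy losses with finite-difference gradients; the starting point, learning rate, and step count are arbitrary:

```python
import numpy as np

# Toy losses from the footnote: L1 is sharp near its minimum, L2 is flat.
L1 = lambda t: 8 * (t - 1) ** 2 * (1.3 * t ** 2 + 2 * t + 1)
L2 = lambda t: 0.5 * (t - 4) ** 2

def grad(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def run(update, steps=200, lr=0.01, start=(1.5, 0.0)):
    t1, t2 = start
    for _ in range(steps):
        t1 = update(t1, grad(L1, t1), lr)
        t2 = update(t2, grad(L2, t2), lr)
    return t1, t2

gd      = lambda t, g, lr: t - lr * g             # gradient descent
sign_gd = lambda t, g, lr: t - lr * np.sign(g)    # SignGD (simplified Adam)

# With lr kept small enough for stability in the sharp coordinate, GD still
# needs many steps to cover the distance in the flat coordinate; SignGD moves
# by exactly lr per step in every coordinate, so it is also slow in the flat
# coordinate and ends up oscillating around theta1 = 1 in the sharp one.
print("GD:    ", run(gd))        # theta1 near 1, theta2 has not yet reached 4
print("SignGD:", run(sign_gd))   # theta1 oscillates near 1, theta2 around 2
```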

The behavior of SignGD and Adam above indicates that a more aggressive pre-conditioning is needed — sharp dimensions should have relatively smaller updates than flat dimensions so that the decrease of loss is equalized in all dimensions. As suggested by well-established literature on second-order optimization (Boyd & Vandenberghe, 2004) for convex functions, the optimal preconditioner should be the Hessian, which captures the curvature on each dimension; as in Newton's method, the update is the gradient divided by the Hessian in each dimension:

$$\theta_1 \leftarrow \theta_1 - \eta L_1'(\theta_1) / h_1 \quad \text{and} \quad \theta_2 \leftarrow \theta_2 - \eta L_2'(\theta_2) / h_2$$

Limitations of Newton's Method

For non-convex functions, vanilla Newton's method could converge to a global maximum when the local curvature is negative, and in some cases it quickly converges to a saddle point instead of a local minimum. The curvature might also change rapidly along the trajectory, making the second-order information unreliable.

To address these limitations, we propose considering only pre-conditioners that capture positive curvature, and introduce a per-coordinate clipping mechanism to mitigate the rapid change of the Hessian (more detail in Section 2.2). Applying our algorithm to the toy case results in the following update:

$$\theta_1 \leftarrow \theta_1 - \eta \operatorname{clip}(L_1'(\theta_1) / \max\{h_1, \epsilon\}, \rho), \qquad \theta_2 \leftarrow \theta_2 - \eta \operatorname{clip}(L_2'(\theta_2) / \max\{h_2, \epsilon\}, \rho)$$

where $\rho$ is a constant controlling the worst-case update size, and $\epsilon$ is a small constant that prevents division by zero. When the curvature of some dimension is rapidly changing or negative, so that the second-order information is misleading and would lead to a huge update before clipping, the clipping mechanism kicks in and the optimizer defaults to SignGD.

As shown by the results, the update starts off similarly to SignGD because the clipping mechanism is active in the non-convex region, so it keeps descending instead of converging to the local maximum. Then, in the convex valley, it converges to the global minimum within a few steps.

Compared with SignGD and Adam, it makes much faster progress in the flat dimension $\theta_2$ (because the update is bigger in dimension $\theta_2$), while avoiding bouncing in the sharp dimension $\theta_1$ (because the update is significantly shrunk in the sharp dimension $\theta_1$).
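
To illustrate (again a sketch, not the paper's implementation), the following code applies this clipped, curvature-preconditioned update to the toy problem with finite-difference derivatives; the learning rate, threshold $\rho$, and starting point are arbitrary:

```python
import numpy as np

# Toy losses from the footnote.
L1 = lambda t: 8 * (t - 1) ** 2 * (1.3 * t ** 2 + 2 * t + 1)
L2 = lambda t: 0.5 * (t - 4) ** 2

def d1(f, x, eps=1e-6):   # first derivative via central differences
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def d2(f, x, eps=1e-4):   # second derivative via central differences
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2

def clipped_step(f, t, lr=0.1, rho=1.0, eps=1e-12):
    """theta <- theta - lr * clip(f'(theta) / max(f''(theta), eps), rho).
    Flooring the curvature at eps means that in negative-curvature regions the
    preconditioned gradient becomes huge and the clip reduces the move to a
    SignGD-like step of size lr * rho."""
    precond = d1(f, t) / max(d2(f, t), eps)
    return t - lr * float(np.clip(precond, -rho, rho))

t1, t2 = 1.5, 0.0
for _ in range(100):
    t1 = clipped_step(L1, t1)
    t2 = clipped_step(L2, t2)
print(t1, t2)   # both coordinates end up close to their minima (1 and 4)
```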

Sophia: Second-order Clipped Stochastic Optimization

The motivation section demonstrates that Adam does not sufficiently adapt to the heterogeneous curvatures. On the other hand, vanilla Newton's method has a pre-conditioner optimal for convex functions, but is vulnerable to negative curvature and rapid change of Hessian. With these insights, we design a new optimizer, Sophia, which is more adaptive to heterogeneous curvatures than Adam, more resistant to non-convexity and rapid change of Hessian than Newton's method, and also uses a low-cost pre-conditioner.

We use $\theta_t$ to denote the parameter at time step $t$. At each step, we sample a mini-batch from the data distribution and calculate the mini-batch loss, denoted by $L_t(\theta_t)$. We denote by $g_t$ the gradient of $L_t(\theta_t)$, i.e. $g_t = \nabla L_t(\theta_t)$. Let $m_t$ be the EMA of gradients, $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$, which is the numerator of the update.

EMA of diagonal Hessian estimates. Similar to the gradient of the mini-batch loss function, the estimated diagonal Hessian can also have large noise. Inspired by the EMA of moments of gradients in Adam, we also denoise the diagonal Hessian estimates with an EMA across iterations. We update the EMA every $k$ steps, resulting in the following update rule for the diagonal Hessian estimate:

$$h_t = \beta_2 h_{t-k} + (1 - \beta_2) \hat{h}_t \quad \text{if } t \bmod k = 1, \qquad \text{else } h_t = h_{t-1}$$
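
In code, this schedule might look like the sketch below, where the fresh estimate $\hat{h}_t$ stands in for whichever diagonal Hessian estimator is used (the estimators themselves are not detailed in this excerpt):

```python
import numpy as np

def update_hessian_ema(h_prev, h_hat, t, k=10, beta2=0.99):
    """EMA of diagonal Hessian estimates, refreshed only every k steps.
    Because the (relatively expensive) estimate h_hat is needed only when
    t % k == 1, its cost is amortized over k optimizer steps."""
    if t % k == 1:
        return beta2 * h_prev + (1 - beta2) * h_hat
    return h_prev   # between refreshes, keep the previous EMA

# Illustration with a dummy estimator: random positive values stand in for a
# real diagonal Hessian estimate.
h = np.zeros(4)
for t in range(1, 31):
    h_hat = np.abs(np.random.randn(4)) if t % 10 == 1 else None
    h = update_hessian_ema(h, h_hat, t)
```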

Per-coordinate clipping. As discussed in Section 2.1, on nonconvex functions, vanilla Newton's method, which uses the Hessian as the pre-conditioner, may converge to local maxima instead of local minima. In addition, the inaccuracy of Hessian estimates and the change of the Hessian along the trajectory can make the second-order information unreliable. To this end, we (1) only consider the positive entries of the diagonal Hessian and (2) introduce per-coordinate clipping to the update. For a clipping threshold $\rho > 0$, let the clipping function be $\operatorname{clip}(z, \rho) = \max\{\min\{z, \rho\}, -\rho\}$, where all operations are applied entry-wise. The update rule is as follows:

$$\theta_{t+1} \leftarrow \theta_t - \eta_t \cdot \operatorname{clip}(m_t / \max\{\gamma \cdot h_t, \epsilon\}, 1)$$
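
Putting the pieces together, here is a compact NumPy sketch of a single Sophia-style step following the formulas above; the hyperparameter values are illustrative, weight decay is omitted, and the Hessian estimator is again left abstract:

```python
import numpy as np

def sophia_step(theta, m, h, grad, h_hat, t, lr=1e-4,
                beta1=0.96, beta2=0.99, k=10, gamma=0.01, eps=1e-12):
    """One Sophia-style update as described above (weight decay omitted).
    `grad` is the mini-batch gradient; `h_hat` is a fresh diagonal Hessian
    estimate, only needed when t % k == 1."""
    m = beta1 * m + (1 - beta1) * grad                     # EMA of gradients
    if t % k == 1:                                         # infrequent Hessian refresh
        h = beta2 * h + (1 - beta2) * h_hat
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    theta = theta - lr * update                            # per-coordinate clipped step
    return theta, m, h

# Example call (values are arbitrary; at t = 1 the fresh estimate h_hat is used):
theta, m, h = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, h = sophia_step(theta, m, h,
                          grad=np.array([0.5, -2.0, 0.1]),
                          h_hat=np.array([1.0, 4.0, 0.2]), t=1)
```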
