Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Introduction

Large language models (LLMs) have achieved remarkable performance as their scale increases, but their pre-training is extremely expensive in time, compute, and cost. For example, training models like PaLM can take months and cost millions of dollars. Pre-training cost is therefore a critical bottleneck in scaling LLMs, and faster optimization algorithms are one of the most direct ways to reduce it.

Currently, Adam and its variants dominate LLM training due to their robustness and effectiveness. However, designing faster optimizers is challenging for two main reasons:

  • The theoretical understanding of Adam's preconditioning mechanism is still limited.
  • More advanced second-order methods (Hessian-based) are typically too computationally expensive for large-scale models.

To address this, the paper introduces Sophia (Second-order Clipped Stochastic Optimization), a scalable second-order optimizer with the following key ideas:

  • It uses a lightweight diagonal Hessian estimate as a preconditioner.
  • It applies element-wise clipping to control update magnitudes and improve stability.
  • It computes Hessian estimates infrequently (every k steps) to reduce overhead.

Empirically, Sophia achieves significant improvements:

  • 2x speedup over Adam in terms of training steps, total compute, and wall-clock time.
  • Shows better scaling behavior as model size increases.

Conceptually, Sophia improves optimization by:

  • Better adapting to heterogeneous curvature across parameters (sharp vs flat directions).
  • Penalizing updates more in sharp directions and less in flat ones.
  • Avoiding instability from non-convexity via clipping.

Theoretically, the authors show that Sophia can achieve convergence rates independent of the condition number, highlighting its advantage over traditional first-order methods.

Method

We first instantiate gradient descent (GD) and Adam on a simplified 2D problem to motivate the use of second-order information and per-coordinate clipping. We then present Sophia in detail, with pseudo-code in Algorithm 3, and introduce the two choices of diagonal Hessian estimators used in Sophia.

Motivation

Heterogeneous curvatures. The loss functions of modern deep learning problems often have very different curvatures across parameter dimensions. For example, on a 125M-parameter GPT-2 model, the distribution of positive diagonal entries of the Hessian is widely dispersed.

We demonstrate the limitations of Adam and GD on heterogeneous landscapes by considering a two-dimensional loss function $L(\theta_1, \theta_2) = L_1(\theta_1) + L_2(\theta_2)$, where $L_1$ is much sharper than $L_2$.[^1]

[^1]: Concretely, $L_1(\theta_1) = 8(\theta_1 - 1)^2 (1.3 \theta_1^2 + 2 \theta_1 + 1)$ and $L_2(\theta_2) = \frac{1}{2}(\theta_2 - 4)^2$.
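
To make the curvature gap concrete, here is a minimal Python sketch (not from the paper) that evaluates the second derivatives of these two losses at their minima, $\theta_1 = 1$ and $\theta_2 = 4$, by finite differences:

```python
# Toy objective from the footnote: L(theta1, theta2) = L1(theta1) + L2(theta2).
L1 = lambda t1: 8 * (t1 - 1) ** 2 * (1.3 * t1 ** 2 + 2 * t1 + 1)
L2 = lambda t2: 0.5 * (t2 - 4) ** 2

def second_derivative(f, x, eps=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2

h1 = second_derivative(L1, 1.0)   # curvature of L1 at its minimum theta1 = 1
h2 = second_derivative(L2, 4.0)   # curvature of L2 at its minimum theta2 = 4
print(h1, h2)                     # roughly 68.8 vs 1.0, so h1 >> h2
```

The sharp coordinate's curvature comes out roughly 70 times larger than the flat one's, which is exactly the heterogeneity the updates below have to cope with.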

Figure: the 2D loss landscape.

Recall that GD's update in this setting is:

$$\theta_1 \leftarrow \theta_1 - \eta L_1'(\theta_1) \quad \text{and} \quad \theta_2 \leftarrow \theta_2 - \eta L_2'(\theta_2)$$

A common simplification of Adam that is more amenable to analysis is SignGD, where Adam's update simplifies to $\eta \nabla L(\theta) / |\nabla L(\theta)| = \eta \cdot \operatorname{sign}(\nabla L(\theta))$ (where all operations are entry-wise). Applying this update rule to our setting gives:

$$\theta_1 \leftarrow \theta_1 - \eta \operatorname{sign}(L_1'(\theta_1)) \quad \text{and} \quad \theta_2 \leftarrow \theta_2 - \eta \operatorname{sign}(L_2'(\theta_2))$$

Limitation of GD and SignGD (Adam)

The optimal learning rate should be inversely proportional to the curvature. Let $h_1$ and $h_2$ be the curvatures of $L_1$ and $L_2$ at the optimum.

If $h_1 \gg h_2$ (sharp vs. flat dimension):

  • The optimal learning rate for $\theta_1$ is $\approx 1/h_1$ (small).
  • The optimal learning rate for $\theta_2$ is $\approx 1/h_2$ (much larger).

GD must choose a small step size to remain stable in sharp directions. This leads to very slow convergence in flat dimensions.

The update size of SignGD is the learning rate $\eta$ in all dimensions.

This causes two problems:

  • Flat directions: progress is slow because each step reduces the loss only slightly.
  • Sharp directions: updates may overshoot and cause oscillations around the minimum.

To stabilize the sharp direction, the learning rate must decay toward zero, which further slows convergence in flat directions. The trajectory of Adam in this example is indeed similar to SignGD.
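
As a rough illustration (not from the paper), the sketch below runs the two update rules above on the toy losses with finite-difference gradients; the starting point, learning rate, and step count are arbitrary:

```python
import numpy as np

# Toy losses from the footnote: L1 is sharp near its minimum, L2 is flat.
L1 = lambda t: 8 * (t - 1) ** 2 * (1.3 * t ** 2 + 2 * t + 1)
L2 = lambda t: 0.5 * (t - 4) ** 2

def grad(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def run(update, steps=200, lr=0.01, start=(1.5, 0.0)):
    t1, t2 = start
    for _ in range(steps):
        t1 = update(t1, grad(L1, t1), lr)
        t2 = update(t2, grad(L2, t2), lr)
    return t1, t2

gd      = lambda t, g, lr: t - lr * g             # gradient descent
sign_gd = lambda t, g, lr: t - lr * np.sign(g)    # SignGD (simplified Adam)

# With lr kept small enough for stability in the sharp coordinate, GD still
# needs many steps to cover the distance in the flat coordinate; SignGD moves
# by exactly lr per step in every coordinate, so it is also slow in the flat
# coordinate and ends up oscillating around theta1 = 1 in the sharp one.
print("GD:    ", run(gd))        # theta1 near 1, theta2 has not yet reached 4
print("SignGD:", run(sign_gd))   # theta1 oscillates near 1, theta2 around 2
```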

The behavior of SignGD and Adam above indicates that a more aggressive pre-conditioning is needed — sharp dimensions should have relatively smaller updates than flat dimensions so that the decrease of loss is equalized in all dimensions. As suggested by well-established literature on second-order optimization (Boyd & Vandenberghe, 2004) for convex functions, the optimal preconditioner should be the Hessian, which captures the curvature on each dimension; as in Newton's method, the update is the gradient divided by the Hessian in each dimension:

$$\theta_1 \leftarrow \theta_1 - \eta L_1'(\theta_1) / h_1 \quad \text{and} \quad \theta_2 \leftarrow \theta_2 - \eta L_2'(\theta_2) / h_2$$

Limitations of Newton's Method

For non-convex functions, vanilla Newton's method could converge to a global maximum when the local curvature is negative, and in some cases it quickly converges to a saddle point instead of a local minimum. The curvature might also change rapidly along the trajectory, making the second-order information unreliable.

To address these limitations, we propose considering only pre-conditioners that capture positive curvature, and introduce a per-coordinate clipping mechanism to mitigate the rapid change of the Hessian (more detail in Section 2.2). Applying our algorithm to the toy case results in the following update:

$$\theta_1 \leftarrow \theta_1 - \eta \operatorname{clip}(L_1'(\theta_1) / \max\{h_1, \epsilon\}, \rho), \qquad \theta_2 \leftarrow \theta_2 - \eta \operatorname{clip}(L_2'(\theta_2) / \max\{h_2, \epsilon\}, \rho)$$

where $\rho$ is a constant controlling the worst-case update size, and $\epsilon$ is a small constant that prevents division by zero. When the curvature of some dimension is rapidly changing or negative, so that the second-order information is misleading and would lead to a huge update before clipping, the clipping mechanism kicks in and the optimizer defaults to SignGD.

As shown by the results, the update starts off similarly to SignGD because the clipping mechanism is active in the non-convex region, so it keeps descending instead of converging to the local maximum. Then, in the convex valley, it converges to the global minimum within a few steps.

Compared with SignGD and Adam, it makes much faster progress in the flat dimension $\theta_2$ (because the update is bigger in dimension $\theta_2$), while avoiding bouncing in the sharp dimension $\theta_1$ (because the update is significantly shrunk in the sharp dimension $\theta_1$).
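
To illustrate (again a sketch, not the paper's implementation), the following code applies this clipped, curvature-preconditioned update to the toy problem with finite-difference derivatives; the learning rate, threshold $\rho$, and starting point are arbitrary:

```python
import numpy as np

# Toy losses from the footnote.
L1 = lambda t: 8 * (t - 1) ** 2 * (1.3 * t ** 2 + 2 * t + 1)
L2 = lambda t: 0.5 * (t - 4) ** 2

def d1(f, x, eps=1e-6):   # first derivative via central differences
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def d2(f, x, eps=1e-4):   # second derivative via central differences
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2

def clipped_step(f, t, lr=0.1, rho=1.0, eps=1e-12):
    """theta <- theta - lr * clip(f'(theta) / max(f''(theta), eps), rho).
    Flooring the curvature at eps means that in negative-curvature regions the
    preconditioned gradient becomes huge and the clip reduces the move to a
    SignGD-like step of size lr * rho."""
    precond = d1(f, t) / max(d2(f, t), eps)
    return t - lr * float(np.clip(precond, -rho, rho))

t1, t2 = 1.5, 0.0
for _ in range(100):
    t1 = clipped_step(L1, t1)
    t2 = clipped_step(L2, t2)
print(t1, t2)   # both coordinates end up close to their minima (1 and 4)
```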

Sophia: Second-order Clipped Stochastic Optimization

The motivation section demonstrates that Adam does not sufficiently adapt to the heterogeneous curvatures. On the other hand, vanilla Newton's method has a pre-conditioner optimal for convex functions, but is vulnerable to negative curvature and rapid change of Hessian. With these insights, we design a new optimizer, Sophia, which is more adaptive to heterogeneous curvatures than Adam, more resistant to non-convexity and rapid change of Hessian than Newton's method, and also uses a low-cost pre-conditioner.

We use $\theta_t$ to denote the parameter at time step $t$. At each step, we sample a mini-batch from the data distribution and calculate the mini-batch loss, denoted by $L_t(\theta_t)$. We denote by $g_t$ the gradient of $L_t(\theta_t)$, i.e. $g_t = \nabla L_t(\theta_t)$. Let $m_t$ be the EMA of gradients, $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$, which is the numerator of the update.

EMA of diagonal Hessian estimates. Similar to the gradient of the mini-batch loss function, the estimated diagonal Hessian can also have large noise. Inspired by the EMA of moments of gradients in Adam, we also denoise the diagonal Hessian estimates with an EMA across iterations. We update the EMA every $k$ steps, resulting in the following update rule for the diagonal Hessian estimate:

$$h_t = \beta_2 h_{t-k} + (1 - \beta_2) \hat{h}_t \quad \text{if } t \bmod k = 1, \qquad \text{else } h_t = h_{t-1}$$
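
In code, this schedule might look like the sketch below, where the fresh estimate $\hat{h}_t$ stands in for whichever diagonal Hessian estimator is used (the estimators themselves are not detailed in this excerpt):

```python
import numpy as np

def update_hessian_ema(h_prev, h_hat, t, k=10, beta2=0.99):
    """EMA of diagonal Hessian estimates, refreshed only every k steps.
    Because the (relatively expensive) estimate h_hat is needed only when
    t % k == 1, its cost is amortized over k optimizer steps."""
    if t % k == 1:
        return beta2 * h_prev + (1 - beta2) * h_hat
    return h_prev   # between refreshes, keep the previous EMA

# Illustration with a dummy estimator: random positive values stand in for a
# real diagonal Hessian estimate.
h = np.zeros(4)
for t in range(1, 31):
    h_hat = np.abs(np.random.randn(4)) if t % 10 == 1 else None
    h = update_hessian_ema(h, h_hat, t)
```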

Per-coordinate clipping. As discussed in Section 2.1, on nonconvex functions, vanilla Newton's method, which uses the Hessian as the pre-conditioner, may converge to local maxima instead of local minima. In addition, the inaccuracy of Hessian estimates and the change of the Hessian along the trajectory can make the second-order information unreliable. To this end, we (1) only consider the positive entries of the diagonal Hessian and (2) introduce per-coordinate clipping to the update. For a clipping threshold $\rho > 0$, let the clipping function be $\operatorname{clip}(z, \rho) = \max\{\min\{z, \rho\}, -\rho\}$, where all operations are applied entry-wise. The update rule is as follows:

$$\theta_{t+1} \leftarrow \theta_t - \eta_t \cdot \operatorname{clip}(m_t / \max\{\gamma \cdot h_t, \epsilon\}, 1)$$
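
Putting the pieces together, here is a compact NumPy sketch of a single Sophia-style step following the formulas above; the hyperparameter values are illustrative, weight decay is omitted, and the Hessian estimator is again left abstract:

```python
import numpy as np

def sophia_step(theta, m, h, grad, h_hat, t, lr=1e-4,
                beta1=0.96, beta2=0.99, k=10, gamma=0.01, eps=1e-12):
    """One Sophia-style update as described above (weight decay omitted).
    `grad` is the mini-batch gradient; `h_hat` is a fresh diagonal Hessian
    estimate, only needed when t % k == 1."""
    m = beta1 * m + (1 - beta1) * grad                     # EMA of gradients
    if t % k == 1:                                         # infrequent Hessian refresh
        h = beta2 * h + (1 - beta2) * h_hat
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    theta = theta - lr * update                            # per-coordinate clipped step
    return theta, m, h

# Example call (values are arbitrary; at t = 1 the fresh estimate h_hat is used):
theta, m, h = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, h = sophia_step(theta, m, h,
                          grad=np.array([0.5, -2.0, 0.1]),
                          h_hat=np.array([1.0, 4.0, 0.2]), t=1)
```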
