Shampoo: Preconditioned Stochastic Tensor Optimization

1 Introduction

Shampoo is closely related to AdaGrad. The diagonal (i.e., element-wise) version of AdaGrad is extremely popular in practice and is applied to tasks ranging from learning linear models over sparse features to training large deep-learning models. In contrast, the full-matrix version of AdaGrad analyzed in prior work is rarely used in practice, due to the prohibitive memory and runtime costs of maintaining a full preconditioner. Shampoo can be viewed as an efficient, practical, and provable apparatus for approximately and implicitly using the full AdaGrad preconditioner, without falling back on diagonal matrices.

Algorithm 1. Shampoo for matrices

  • Initialize $W_1 = \mathbf{0}_{m \times n}$; $L_0 = \epsilon I_m$; $R_0 = \epsilon I_n$;

  • for $t = 1, \dots, T$ do

    • Receive loss function $f_t: \mathbb{R}^{m \times n} \to \mathbb{R}$;
    • Compute gradient $G_t = \nabla f_t(W_t)$ (where $G_t \in \mathbb{R}^{m \times n}$);
    • Update preconditioners:
      • $L_t = L_{t-1} + G_t G_t^T$;
      • $R_t = R_{t-1} + G_t^T G_t$;
    • Update parameters:
      • $W_{t+1} = W_t - \eta\, L_t^{-1/4} G_t R_t^{-1/4}$;
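
A minimal NumPy sketch of Algorithm 1 may help make the update concrete. The gradient oracle `loss_grad`, the helper `matrix_power`, and the hyperparameter values are illustrative assumptions rather than part of the paper's pseudocode; the inverse fourth roots are computed by eigendecomposition for clarity, not efficiency.

```python
import numpy as np

def matrix_power(M, p):
    """Raise a symmetric positive-definite matrix to the (real) power p
    via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**p) @ V.T

def shampoo_matrices(loss_grad, m, n, T, eta=0.1, eps=1e-4):
    """Sketch of Shampoo for an m x n parameter matrix.
    `loss_grad` is an assumed oracle returning grad f_t(W_t)."""
    W = np.zeros((m, n))   # W_1 = 0_{m x n}
    L = eps * np.eye(m)    # L_0 = eps * I_m
    R = eps * np.eye(n)    # R_0 = eps * I_n
    for _ in range(T):
        G = loss_grad(W)   # G_t = grad f_t(W_t), an m x n matrix
        L = L + G @ G.T    # L_t = L_{t-1} + G_t G_t^T
        R = R + G.T @ G    # R_t = R_{t-1} + G_t^T G_t
        # W_{t+1} = W_t - eta * L_t^{-1/4} G_t R_t^{-1/4}
        W = W - eta * matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W

# Example usage on a hypothetical toy problem: the quadratic loss
# f(W) = 0.5 * ||W - W_star||_F^2, whose gradient is W - W_star.
W_star = np.ones((3, 2))
W_hat = shampoo_matrices(lambda W: W - W_star, m=3, n=2, T=100)
```

Note the memory footprint: the two preconditioners $L_t$ and $R_t$ are only $m \times m$ and $n \times n$, as opposed to the single $mn \times mn$ matrix that full-matrix AdaGrad would maintain over the same parameters.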

Algorithm 2. AdaGrad with full matrices

  1. Input: $\eta > 0$, $\delta \geq 0$
  2. Variables: $S_t \in \mathbb{R}^{d \times d}$, $H_t \in \mathbb{R}^{d \times d}$, $G_t \in \mathbb{R}^{d \times d}$
  3. Initialize: $x_1 = 0$, $S_0 = 0$, $H_0 = 0$, $G_0 = 0$
  4. for $t = 1$ to $T$ do
    • suffer loss $f_t(x_t)$
    • receive subgradient $g_t \in \partial f_t(x_t)$
    • update $G_t = G_{t-1} + g_t g_t^T$, $S_t = G_t^{1/2}$
    • set $H_t = \delta I + S_t$, $\Psi_t(x) = \frac{1}{2} \langle x, H_t x \rangle$
    • composite mirror-descent update: $x_{t+1} = \arg\min_{x \in \mathcal{X}} \{\, \eta \langle g_t, x \rangle + \eta \phi(x) + B_{\Psi_t}(x, x_t) \,\}$

Setting $\phi = 0$ and $\delta = 0$ in Algorithm 2 gives the explicit update:

$$x_{t+1} = x_t - \eta\, G_t^{-1/2} g_t$$
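
For comparison, here is a minimal NumPy sketch of this simplified full-matrix AdaGrad update. The gradient oracle `grad` is an assumed placeholder, and the small `ridge` eigenvalue floor is added purely to keep the inverse square root numerically well defined; neither appears in the algorithm above.

```python
import numpy as np

def full_matrix_adagrad(grad, d, T, eta=0.1, ridge=1e-10):
    """Sketch of the update x_{t+1} = x_t - eta * G_t^{-1/2} g_t
    (Algorithm 2 with phi = 0 and delta = 0)."""
    x = np.zeros(d)             # x_1 = 0
    G = np.zeros((d, d))        # G_0 = 0
    for _ in range(T):
        g = grad(x)             # g_t, a subgradient of f_t at x_t
        G = G + np.outer(g, g)  # G_t = G_{t-1} + g_t g_t^T
        w, V = np.linalg.eigh(G)
        w = np.maximum(w, ridge)  # floor eigenvalues so w**-0.5 is finite
        # x_{t+1} = x_t - eta * G_t^{-1/2} g_t,
        # with G_t^{-1/2} = V diag(w^{-1/2}) V^T
        x = x - eta * (V * w**-0.5) @ (V.T @ g)
    return x
```

Each step requires an eigendecomposition of the $d \times d$ matrix $G_t$, i.e. $O(d^3)$ time and $O(d^2)$ memory. This is the prohibitive cost mentioned in the introduction, and the one Shampoo sidesteps with its two small preconditioners $L_t$ and $R_t$.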

2 Background and technical tools
