Shampoo: Preconditioned Stochastic Tensor Optimization

1 Introduction

Shampoo is closely related to AdaGrad. The diagonal (i.e., element-wise) version of AdaGrad is extremely popular in practice and is applied to tasks ranging from learning linear models over sparse features to training large deep-learning models. In contrast, the full-matrix version of AdaGrad analyzed in prior work is rarely used in practice, due to the prohibitive memory and runtime costs of maintaining a full preconditioner. Shampoo can be viewed as an efficient, practical, and provable apparatus for approximately and implicitly using the full AdaGrad preconditioner, without falling back on diagonal matrices.

Algorithm 1. Shampoo for matrices

  • Initialize $W_1 = \mathbf{0}_{m \times n}$; $L_0 = \epsilon I_m$; $R_0 = \epsilon I_n$;

  • for $t = 1, \dots, T$ do

    • Receive loss function $f_t: \mathbb{R}^{m \times n} \to \mathbb{R}$;
    • Compute gradient $G_t = \nabla f_t(W_t)$ (where $G_t \in \mathbb{R}^{m \times n}$);
    • Update preconditioners:
      • $L_t = L_{t-1} + G_t G_t^T$;
      • $R_t = R_{t-1} + G_t^T G_t$;
    • Update parameters:
      • $W_{t+1} = W_t - \eta\, L_t^{-1/4} G_t R_t^{-1/4}$;
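
A minimal NumPy sketch of Algorithm 1 may help make the update concrete. The gradient oracle `loss_grad`, the helper `matrix_power`, and the hyperparameter values are illustrative assumptions rather than part of the paper's pseudocode; the inverse fourth roots are computed by eigendecomposition for clarity, not efficiency.

```python
import numpy as np

def matrix_power(M, p):
    """Raise a symmetric positive-definite matrix to the (real) power p
    via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**p) @ V.T

def shampoo_matrices(loss_grad, m, n, T, eta=0.1, eps=1e-4):
    """Sketch of Shampoo for an m x n parameter matrix.
    `loss_grad` is an assumed oracle returning grad f_t(W_t)."""
    W = np.zeros((m, n))   # W_1 = 0_{m x n}
    L = eps * np.eye(m)    # L_0 = eps * I_m
    R = eps * np.eye(n)    # R_0 = eps * I_n
    for _ in range(T):
        G = loss_grad(W)   # G_t = grad f_t(W_t), an m x n matrix
        L = L + G @ G.T    # L_t = L_{t-1} + G_t G_t^T
        R = R + G.T @ G    # R_t = R_{t-1} + G_t^T G_t
        # W_{t+1} = W_t - eta * L_t^{-1/4} G_t R_t^{-1/4}
        W = W - eta * matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W

# Example usage on a hypothetical toy problem: the quadratic loss
# f(W) = 0.5 * ||W - W_star||_F^2, whose gradient is W - W_star.
W_star = np.ones((3, 2))
W_hat = shampoo_matrices(lambda W: W - W_star, m=3, n=2, T=100)
```

Note the memory footprint: the two preconditioners $L_t$ and $R_t$ are only $m \times m$ and $n \times n$, as opposed to the single $mn \times mn$ matrix that full-matrix AdaGrad would maintain over the same parameters.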

Algorithm 2. AdaGrad with full matrices

  1. Input: $\eta > 0$, $\delta \geq 0$
  2. Variables: $S_t \in \mathbb{R}^{d \times d}$, $H_t \in \mathbb{R}^{d \times d}$, $G_t \in \mathbb{R}^{d \times d}$
  3. Initialize: $x_1 = 0$, $S_0 = 0$, $H_0 = 0$, $G_0 = 0$
  4. for $t = 1$ to $T$ do
    • suffer loss $f_t(x_t)$
    • receive subgradient $g_t \in \partial f_t(x_t)$
    • update $G_t = G_{t-1} + g_t g_t^T$, $S_t = G_t^{1/2}$
    • set $H_t = \delta I + S_t$, $\Psi_t(x) = \frac{1}{2} \langle x, H_t x \rangle$
    • composite mirror-descent update: $x_{t+1} = \arg\min_{x \in \mathcal{X}} \{\, \eta \langle g_t, x \rangle + \eta \phi(x) + B_{\Psi_t}(x, x_t) \,\}$

Setting $\phi = 0$ and $\delta = 0$ in Algorithm 2 gives the explicit update:

$$x_{t+1} = x_t - \eta\, G_t^{-1/2} g_t$$
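
For comparison, here is a minimal NumPy sketch of this simplified full-matrix AdaGrad update. The gradient oracle `grad` is an assumed placeholder, and the small `ridge` eigenvalue floor is added purely to keep the inverse square root numerically well defined; neither appears in the algorithm above.

```python
import numpy as np

def full_matrix_adagrad(grad, d, T, eta=0.1, ridge=1e-10):
    """Sketch of the update x_{t+1} = x_t - eta * G_t^{-1/2} g_t
    (Algorithm 2 with phi = 0 and delta = 0)."""
    x = np.zeros(d)             # x_1 = 0
    G = np.zeros((d, d))        # G_0 = 0
    for _ in range(T):
        g = grad(x)             # g_t, a subgradient of f_t at x_t
        G = G + np.outer(g, g)  # G_t = G_{t-1} + g_t g_t^T
        w, V = np.linalg.eigh(G)
        w = np.maximum(w, ridge)  # floor eigenvalues so w**-0.5 is finite
        # x_{t+1} = x_t - eta * G_t^{-1/2} g_t,
        # with G_t^{-1/2} = V diag(w^{-1/2}) V^T
        x = x - eta * (V * w**-0.5) @ (V.T @ g)
    return x
```

Each step requires an eigendecomposition of the $d \times d$ matrix $G_t$, i.e. $O(d^3)$ time and $O(d^2)$ memory. This is the prohibitive cost mentioned in the introduction, and the one Shampoo sidesteps with its two small preconditioners $L_t$ and $R_t$.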

2 Background and technical tools
