Shampoo: Preconditioned stochastic tensor optimization
1 Introduction
Shampoo is closely related to AdaGrad. The diagonal (i.e., element-wise) version of AdaGrad is extremely popular in practice and is applied to tasks ranging from learning linear models over sparse features to training large deep-learning models. In contrast, the full-matrix version of AdaGrad analyzed in prior work is rarely used in practice, due to the prohibitive memory and runtime cost of maintaining a full preconditioner. Shampoo can be viewed as an efficient, practical, and provable apparatus for approximately and implicitly using the full AdaGrad preconditioner, without falling back to diagonal matrices.
Algorithm 1. Shampoo for matrices
- Initialize $W_1 = 0_{m \times n}$; $L_0 = \epsilon I_m$; $R_0 = \epsilon I_n$
- for $t = 1, \ldots, T$ do
  - Receive loss function $f_t : \mathbb{R}^{m \times n} \to \mathbb{R}$
  - Compute gradient $G_t = \nabla f_t(W_t)$ (note $G_t \in \mathbb{R}^{m \times n}$)
  - Update preconditioners:
    - $L_t = L_{t-1} + G_t G_t^{\mathsf{T}}$
    - $R_t = R_{t-1} + G_t^{\mathsf{T}} G_t$
  - Update parameters:
    - $W_{t+1} = W_t - \eta\, L_t^{-1/4}\, G_t\, R_t^{-1/4}$
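To make the matrix case concrete, here is a minimal NumPy sketch of one Shampoo step under the update rule above (not the authors' implementation). The helper `inv_root` and the function `shampoo_matrix_step` are hypothetical names, and the inverse fourth roots are computed by eigendecomposition, which is only practical for moderately sized $L_t$ and $R_t$.

```python
import numpy as np

def inv_root(mat, p):
    """Return mat^(-1/p) for a symmetric positive-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, 1e-12)  # guard against round-off; L_t, R_t stay PD thanks to eps*I
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T

def shampoo_matrix_step(W, L, R, grad, lr=0.1):
    """One Shampoo update for a matrix parameter W (shape m x n).

    L (m x m) and R (n x n) are the running left and right preconditioners.
    """
    L = L + grad @ grad.T                                  # L_t = L_{t-1} + G_t G_t^T
    R = R + grad.T @ grad                                  # R_t = R_{t-1} + G_t^T G_t
    W = W - lr * inv_root(L, 4) @ grad @ inv_root(R, 4)    # W_{t+1} = W_t - eta * L_t^{-1/4} G_t R_t^{-1/4}
    return W, L, R

# Initialization matching the algorithm: W_1 = 0_{m x n}, L_0 = eps*I_m, R_0 = eps*I_n.
m, n, eps = 8, 5, 1e-4
W, L, R = np.zeros((m, n)), eps * np.eye(m), eps * np.eye(n)
```

Note that only the $m \times m$ and $n \times n$ preconditioners are stored, rather than the $mn \times mn$ matrix that full-matrix AdaGrad would require for the same parameter.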
Algorithm 2. AdaGrad with full matrices
- Input: $\eta > 0$, $\delta \geq 0$
- Variables: $S_t \in \mathbb{R}^{d \times d}$, $H_t \in \mathbb{R}^{d \times d}$, $G_t \in \mathbb{R}^{d \times d}$
- Initialize: $x_1 = 0$, $S_0 = 0$, $H_0 = 0$, $G_0 = 0$
- for $t = 1$ to $T$ do
  - suffer loss $f_t(x_t)$
  - receive subgradient $g_t \in \partial f_t(x_t)$
  - update $G_t = G_{t-1} + g_t g_t^{\mathsf{T}}$, $S_t = G_t^{1/2}$
  - set $H_t = \delta I + S_t$, $\Psi_t(x) = \frac{1}{2} \langle x, H_t x \rangle$
  - Composite mirror descent update:
    $x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta \langle g_t, x \rangle + \eta \phi(x) + B_{\Psi_t}(x, x_t) \right\}$

Setting $\phi = 0$ and $\delta = 0$ in Algorithm 2 yields the update
$x_{t+1} = x_t - \eta\, G_t^{-1/2} g_t.$
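For comparison with the Shampoo sketch above, a minimal NumPy sketch of one full-matrix AdaGrad step (with $\phi = 0$) might look as follows. The function name `adagrad_full_step` is hypothetical, the matrix square root is taken by eigendecomposition, and a small $\delta$ is kept for numerical stability rather than the $\delta = 0$ used in the simplified update above.

```python
import numpy as np

def adagrad_full_step(x, G, grad, lr=0.1, delta=1e-8):
    """One full-matrix AdaGrad step with phi = 0.

    G accumulates gradient outer products; with delta = 0 and G_t invertible
    this reduces to x_{t+1} = x_t - eta * G_t^{-1/2} g_t.
    """
    G = G + np.outer(grad, grad)                                   # G_t = G_{t-1} + g_t g_t^T
    eigvals, eigvecs = np.linalg.eigh(G)
    S = (eigvecs * np.sqrt(np.maximum(eigvals, 0.0))) @ eigvecs.T  # S_t = G_t^{1/2}
    H = delta * np.eye(x.shape[0]) + S                             # H_t = delta*I + S_t
    x = x - lr * np.linalg.solve(H, grad)                          # x_{t+1} = x_t - eta * H_t^{-1} g_t
    return x, G
```

The preconditioner here is a dense $d \times d$ matrix over all parameters, which is exactly the memory and runtime cost that makes full-matrix AdaGrad impractical at scale and that Shampoo's Kronecker-factored preconditioners avoid.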
2 Background and technical tools