SOAP: Improving and Stabilizing Shampoo Using Adam
SOAP runs AdamW in the eigenbasis of Shampoo's preconditioner, outperforming both AdamW and Shampoo on language-model pre-training.
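A minimal NumPy sketch of the core idea for a single matrix-shaped parameter: maintain Shampoo's two Kronecker factors, refresh their eigenbases occasionally, and run Adam-style moments in the rotated coordinates. The class name, hyperparameter defaults, and the omitted re-rotation of the moments when the basis changes are illustrative simplifications, not the paper's exact algorithm.

```python
import numpy as np

class SOAPSketch:
    """Adam-style moments maintained in the eigenbasis of Shampoo's
    Kronecker preconditioner factors, for one 2-D parameter (sketch)."""

    def __init__(self, shape, lr=3e-4, betas=(0.9, 0.999),
                 shampoo_beta=0.95, eps=1e-8, basis_every=10):
        m, n = shape
        self.lr, self.eps, self.basis_every = lr, eps, basis_every
        self.b1, self.b2, self.sb = betas[0], betas[1], shampoo_beta
        self.L = np.zeros((m, m))                 # EMA of G @ G.T (left factor)
        self.R = np.zeros((n, n))                 # EMA of G.T @ G (right factor)
        self.QL, self.QR = np.eye(m), np.eye(n)   # current eigenbases
        self.m1 = np.zeros(shape)                 # Adam first moment (rotated space)
        self.m2 = np.zeros(shape)                 # Adam second moment (rotated space)
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        # Accumulate Shampoo statistics from the raw gradient.
        self.L = self.sb * self.L + (1 - self.sb) * grad @ grad.T
        self.R = self.sb * self.R + (1 - self.sb) * grad.T @ grad
        # Refresh the eigenbases only occasionally to amortize the eigendecomposition.
        if self.t % self.basis_every == 1:
            _, self.QL = np.linalg.eigh(self.L)
            _, self.QR = np.linalg.eigh(self.R)
        # Rotate the gradient into the eigenbasis and take an Adam step there.
        g_rot = self.QL.T @ grad @ self.QR
        self.m1 = self.b1 * self.m1 + (1 - self.b1) * g_rot
        self.m2 = self.b2 * self.m2 + (1 - self.b2) * g_rot ** 2
        m1_hat = self.m1 / (1 - self.b1 ** self.t)
        m2_hat = self.m2 / (1 - self.b2 ** self.t)
        step_rot = m1_hat / (np.sqrt(m2_hat) + self.eps)
        # Rotate the update back to the original coordinates.
        return param - self.lr * (self.QL @ step_rot @ self.QR.T)
```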
Shampoo: Preconditioned Stochastic Tensor Optimization
An efficient Kronecker-product approximation of the full AdaGrad preconditioner for matrix-structured parameters.
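A sketch of one Shampoo update for a matrix-shaped parameter, computing the inverse fourth roots of the two Kronecker factors by eigendecomposition; the function name, damping, and learning rate are illustrative, and the full method adds momentum, blocking, and amortized root computations.

```python
import numpy as np

def shampoo_step(param, grad, L, R, lr=0.1, eps=1e-6):
    """One Shampoo update: accumulate the two Kronecker-factor statistics
    and precondition the gradient with their inverse fourth roots (sketch)."""
    L += grad @ grad.T                      # left statistic  (m x m)
    R += grad.T @ grad                      # right statistic (n x n)

    def inv_fourth_root(M):
        # M^{-1/4} via eigendecomposition, with damping for numerical stability.
        w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return V @ np.diag(np.maximum(w, eps) ** -0.25) @ V.T

    precond_grad = inv_fourth_root(L) @ grad @ inv_fourth_root(R)
    return param - lr * precond_grad, L, R
```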
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
A lightweight diagonal Hessian estimator with per-coordinate update clipping, reporting a roughly 2x speedup over Adam on language-model pre-training.
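A sketch of a Sophia-style step, assuming a Hessian-vector-product callable `hvp` is available so the Hessian diagonal can be estimated with a Hutchinson probe; the hyperparameter names and default values here are illustrative rather than the paper's exact settings.

```python
import numpy as np

def sophia_step(param, grad, m, h, hvp, t, lr=1e-4, b1=0.96, b2=0.99,
                gamma=0.01, eps=1e-12, hess_every=10, rng=None):
    """One Sophia-style update (sketch): EMA of gradients, an infrequent
    diagonal-Hessian estimate, and a preconditioned, per-coordinate-clipped step."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = b1 * m + (1 - b1) * grad                   # EMA of gradients
    if t % hess_every == 0:
        # Hutchinson estimator of the Hessian diagonal: E[z * (H z)].
        z = rng.standard_normal(param.shape)
        h = b2 * h + (1 - b2) * z * hvp(z)
    # Precondition by the (floored) Hessian estimate, then clip each coordinate.
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    return param - lr * update, m, h
```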
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
A block-wise Kronecker-factored approximation to the Fisher information matrix for efficient natural gradient descent.
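A sketch of a K-FAC natural-gradient step for a single fully connected layer, assuming access to the layer's inputs and the back-propagated gradients with respect to its pre-activation outputs; the damping value, learning rate, and function name are illustrative.

```python
import numpy as np

def kfac_layer_update(W, a, g, lr=0.05, damping=1e-2):
    """K-FAC step for one dense layer (sketch). `a` is (batch, in_dim) inputs,
    `g` is (batch, out_dim) gradients w.r.t. the layer's pre-activation outputs;
    the layer's Fisher block is approximated as A (x) S."""
    batch = a.shape[0]
    A = a.T @ a / batch                     # input covariance        (in x in)
    S = g.T @ g / batch                     # output-grad covariance  (out x out)
    G = g.T @ a / batch                     # plain gradient w.r.t. W (out x in)
    # Invert the two small Kronecker factors instead of the full Fisher block.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    S_inv = np.linalg.inv(S + damping * np.eye(S.shape[0]))
    return W - lr * S_inv @ G @ A_inv       # approximate natural-gradient step
```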
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
The AdaGrad algorithm, which adapts per-coordinate learning rates based on accumulated historical gradient information.
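A minimal sketch of the diagonal variant of AdaGrad, the form most commonly used in practice; the function signature and constants are illustrative.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-10):
    """Diagonal AdaGrad: each coordinate's step size shrinks with the square
    root of its accumulated squared gradients, so rarely updated coordinates
    keep larger effective learning rates."""
    accum += grad ** 2
    return param - lr * grad / (np.sqrt(accum) + eps), accum
```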
Stochastic Subgradient Method Converges on Tame Functions
Proves that the stochastic subgradient method converges on Whitney stratifiable and definable functions, a class that includes the loss functions of deep networks.
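A toy instance of the iteration the convergence result covers: the stochastic subgradient method with diminishing steps on the nonsmooth function f(x) = |x| with noisy subgradients. All names and constants are illustrative.

```python
import numpy as np

def stochastic_subgradient(x0, steps=1000, lr0=0.1, rng=None):
    """Stochastic subgradient method on f(x) = |x| with noisy subgradients
    and diminishing step sizes (toy sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = x0
    for t in range(1, steps + 1):
        sub = np.sign(x) if x != 0 else 0.0     # a valid subgradient of |x|
        g = sub + rng.normal(scale=0.1)         # noisy subgradient oracle
        x = x - (lr0 / np.sqrt(t)) * g          # diminishing step sizes
    return x
```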
Optimization for Machine Learning
Gradient descent, accelerated methods, stochastic gradient descent, and adaptive methods (AdaGrad, RMSProp, Adam), with convergence guarantees.
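As one concrete example of the accelerated methods covered here, a minimal sketch of Nesterov-style accelerated gradient descent on an ill-conditioned quadratic; the step size, momentum value, and test problem are illustrative.

```python
import numpy as np

def nesterov_gd(grad_fn, x0, lr=0.009, momentum=0.9, steps=200):
    """Nesterov accelerated gradient descent (sketch): evaluate the gradient
    at a look-ahead point, then take a momentum step."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = grad_fn(x + momentum * v)   # gradient at the look-ahead point
        v = momentum * v - lr * g
        x = x + v
    return x

# Example: minimize the ill-conditioned quadratic 0.5 * x^T diag(1, 100) x.
A = np.diag([1.0, 100.0])
x_min = nesterov_gd(lambda x: A @ x, x0=np.array([1.0, 1.0]))
```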