SOAP: Improving and Stabilizing Shampoo Using Adam
SOAP runs AdamW in the eigenbasis of Shampoo's preconditioner, outperforming both AdamW and Shampoo on language-model pre-training.
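A minimal NumPy sketch of the core idea for a single matrix-shaped parameter: maintain Shampoo's two Kronecker factors, refresh their eigenbases occasionally, and run Adam-style moments in the rotated coordinates. The class name, hyperparameter defaults, and the omitted re-rotation of the moments when the basis changes are illustrative simplifications, not the paper's exact algorithm.

```python
import numpy as np

class SOAPSketch:
    """Adam-style moments maintained in the eigenbasis of Shampoo's
    Kronecker preconditioner factors, for one 2-D parameter (sketch)."""

    def __init__(self, shape, lr=3e-4, betas=(0.9, 0.999),
                 shampoo_beta=0.95, eps=1e-8, basis_every=10):
        m, n = shape
        self.lr, self.eps, self.basis_every = lr, eps, basis_every
        self.b1, self.b2, self.sb = betas[0], betas[1], shampoo_beta
        self.L = np.zeros((m, m))                 # EMA of G @ G.T (left factor)
        self.R = np.zeros((n, n))                 # EMA of G.T @ G (right factor)
        self.QL, self.QR = np.eye(m), np.eye(n)   # current eigenbases
        self.m1 = np.zeros(shape)                 # Adam first moment (rotated space)
        self.m2 = np.zeros(shape)                 # Adam second moment (rotated space)
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        # Accumulate Shampoo statistics from the raw gradient.
        self.L = self.sb * self.L + (1 - self.sb) * grad @ grad.T
        self.R = self.sb * self.R + (1 - self.sb) * grad.T @ grad
        # Refresh the eigenbases only occasionally to amortize the eigendecomposition.
        if self.t % self.basis_every == 1:
            _, self.QL = np.linalg.eigh(self.L)
            _, self.QR = np.linalg.eigh(self.R)
        # Rotate the gradient into the eigenbasis and take an Adam step there.
        g_rot = self.QL.T @ grad @ self.QR
        self.m1 = self.b1 * self.m1 + (1 - self.b1) * g_rot
        self.m2 = self.b2 * self.m2 + (1 - self.b2) * g_rot ** 2
        m1_hat = self.m1 / (1 - self.b1 ** self.t)
        m2_hat = self.m2 / (1 - self.b2 ** self.t)
        step_rot = m1_hat / (np.sqrt(m2_hat) + self.eps)
        # Rotate the update back to the original coordinates.
        return param - self.lr * (self.QL @ step_rot @ self.QR.T)
```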
Shampoo: Preconditioned Stochastic Tensor Optimization
An efficient Kronecker-product approximation of the full AdaGrad preconditioner for matrix-structured parameters.
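A sketch of one Shampoo update for a matrix-shaped parameter, computing the inverse fourth roots of the two Kronecker factors by eigendecomposition; the function name, damping, and learning rate are illustrative, and the full method adds momentum, blocking, and amortized root computations.

```python
import numpy as np

def shampoo_step(param, grad, L, R, lr=0.1, eps=1e-6):
    """One Shampoo update: accumulate the two Kronecker-factor statistics
    and precondition the gradient with their inverse fourth roots (sketch)."""
    L += grad @ grad.T                      # left statistic  (m x m)
    R += grad.T @ grad                      # right statistic (n x n)

    def inv_fourth_root(M):
        # M^{-1/4} via eigendecomposition, with damping for numerical stability.
        w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return V @ np.diag(np.maximum(w, eps) ** -0.25) @ V.T

    precond_grad = inv_fourth_root(L) @ grad @ inv_fourth_root(R)
    return param - lr * precond_grad, L, R
```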
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
A lightweight diagonal Hessian estimator with per-coordinate update clipping, reporting a roughly 2x speedup over Adam on language-model pre-training.
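A sketch of a Sophia-style step, assuming a Hessian-vector-product callable `hvp` is available so the Hessian diagonal can be estimated with a Hutchinson probe; the hyperparameter names and default values here are illustrative rather than the paper's exact settings.

```python
import numpy as np

def sophia_step(param, grad, m, h, hvp, t, lr=1e-4, b1=0.96, b2=0.99,
                gamma=0.01, eps=1e-12, hess_every=10, rng=None):
    """One Sophia-style update (sketch): EMA of gradients, an infrequent
    diagonal-Hessian estimate, and a preconditioned, per-coordinate-clipped step."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = b1 * m + (1 - b1) * grad                   # EMA of gradients
    if t % hess_every == 0:
        # Hutchinson estimator of the Hessian diagonal: E[z * (H z)].
        z = rng.standard_normal(param.shape)
        h = b2 * h + (1 - b2) * z * hvp(z)
    # Precondition by the (floored) Hessian estimate, then clip each coordinate.
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    return param - lr * update, m, h
```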
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
A block-wise Kronecker-factored approximation to the Fisher information matrix for efficient natural gradient descent.
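A sketch of a K-FAC natural-gradient step for a single fully connected layer, assuming access to the layer's inputs and the back-propagated gradients with respect to its pre-activation outputs; the damping value, learning rate, and function name are illustrative.

```python
import numpy as np

def kfac_layer_update(W, a, g, lr=0.05, damping=1e-2):
    """K-FAC step for one dense layer (sketch). `a` is (batch, in_dim) inputs,
    `g` is (batch, out_dim) gradients w.r.t. the layer's pre-activation outputs;
    the layer's Fisher block is approximated as A (x) S."""
    batch = a.shape[0]
    A = a.T @ a / batch                     # input covariance        (in x in)
    S = g.T @ g / batch                     # output-grad covariance  (out x out)
    G = g.T @ a / batch                     # plain gradient w.r.t. W (out x in)
    # Invert the two small Kronecker factors instead of the full Fisher block.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    S_inv = np.linalg.inv(S + damping * np.eye(S.shape[0]))
    return W - lr * S_inv @ G @ A_inv       # approximate natural-gradient step
```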
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
The AdaGrad algorithm, which adapts per-coordinate learning rates based on accumulated historical gradient information.
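A minimal sketch of the diagonal variant of AdaGrad, the form most commonly used in practice; the function signature and constants are illustrative.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-10):
    """Diagonal AdaGrad: each coordinate's step size shrinks with the square
    root of its accumulated squared gradients, so rarely updated coordinates
    keep larger effective learning rates."""
    accum += grad ** 2
    return param - lr * grad / (np.sqrt(accum) + eps), accum
```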
Stochastic Subgradient Method Converges on Tame Functions
Proves that the stochastic subgradient method converges on Whitney stratifiable and definable functions, a class that includes the loss functions of deep networks.
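A toy instance of the iteration the convergence result covers: the stochastic subgradient method with diminishing steps on the nonsmooth function f(x) = |x| with noisy subgradients. All names and constants are illustrative.

```python
import numpy as np

def stochastic_subgradient(x0, steps=1000, lr0=0.1, rng=None):
    """Stochastic subgradient method on f(x) = |x| with noisy subgradients
    and diminishing step sizes (toy sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = x0
    for t in range(1, steps + 1):
        sub = np.sign(x) if x != 0 else 0.0     # a valid subgradient of |x|
        g = sub + rng.normal(scale=0.1)         # noisy subgradient oracle
        x = x - (lr0 / np.sqrt(t)) * g          # diminishing step sizes
    return x
```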
Optimization for Machine Learning
Gradient descent, accelerated methods, stochastic gradient descent, and adaptive methods (AdaGrad, RMSProp, Adam), with convergence guarantees.
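As one concrete example of the accelerated methods covered here, a minimal sketch of Nesterov-style accelerated gradient descent on an ill-conditioned quadratic; the step size, momentum value, and test problem are illustrative.

```python
import numpy as np

def nesterov_gd(grad_fn, x0, lr=0.009, momentum=0.9, steps=200):
    """Nesterov accelerated gradient descent (sketch): evaluate the gradient
    at a look-ahead point, then take a momentum step."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = grad_fn(x + momentum * v)   # gradient at the look-ahead point
        v = momentum * v - lr * g
        x = x + v
    return x

# Example: minimize the ill-conditioned quadratic 0.5 * x^T diag(1, 100) x.
A = np.diag([1.0, 100.0])
x_min = nesterov_gd(lambda x: A @ x, x0=np.array([1.0, 1.0]))
```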