Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Notation
Vectors and scalars are lower case italic letters, such as $x \in \mathcal{X}$.
A sequence of vectors is indexed by subscripts, $x_1, x_2, \ldots$, and entries of each vector are represented with an additional subscript, e.g. $x_{t,j}$.
The subdifferential set of a function $f$ at a point $x$ is denoted as $\partial f(x)$, and a particular vector in the subdifferential set is denoted by $f'(x) \in \partial f(x)$ or $g_t \in \partial f_t(x_t)$. When a function is differentiable, we write $\nabla f(x)$.
$\langle x, y \rangle$ denotes the inner product between vectors $x$ and $y$.
The Bregman divergence associated with a strongly convex and differentiable function $\psi$ is

$$B_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle.$$
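For example (a standard identity, not stated in these notes): taking $\psi(x) = \tfrac{1}{2}\|x\|_2^2$ gives $\nabla\psi(y) = y$, and the divergence reduces to the squared Euclidean distance:

$$B_\psi(x, y) = \tfrac{1}{2}\|x\|_2^2 - \tfrac{1}{2}\|y\|_2^2 - \langle y,\, x - y \rangle = \tfrac{1}{2}\|x - y\|_2^2.$$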
Let $g_{1:t} = [g_1 \cdots g_t]$ denote the matrix obtained by concatenating the subgradient sequence.
We denote the $i$th row of this matrix, which amounts to the concatenation of the $i$th component of each subgradient we observe, by $g_{1:t,i}$.
The outer product matrix is $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$.
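A quick numpy sketch tying these definitions together (the dimensions and random subgradients are made up for illustration): the $i$th diagonal entry of $G_t$ equals $\|g_{1:t,i}\|_2^2$, the squared norm of the $i$th row of $g_{1:t}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 5                                      # illustrative dimensions
grads = [rng.normal(size=d) for _ in range(T)]   # made-up g_1, ..., g_T

# G_t = sum over tau of the outer products g_tau g_tau^T
G = sum(np.outer(g, g) for g in grads)

# g_{1:t} stacks the subgradients as columns; its i-th row is g_{1:t,i}.
g_1_to_t = np.column_stack(grads)
assert np.allclose(np.diag(G), np.sum(g_1_to_t**2, axis=1))
```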
What regret means
In online learning, we imagine a repeated game between a learner and an environment (or adversary). At each time step $t = 1, \ldots, T$:
- The learner picks a decision or prediction vector $x_t \in \mathcal{X}$.
- The environment reveals a loss function $f_t$.
- The learner suffers the loss $f_t(x_t)$.
After $T$ rounds, we compare the learner's total loss to that of the best fixed decision in hindsight, $x^\star \in \arg\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$:

$$R(T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^\star).$$
This quantity is called the regret. It measures how much worse the learner performed than the best static predictor that could have been chosen with full knowledge of the future.
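As a toy illustration (entirely my own construction, not from the paper): for squared losses $f_t(x) = \tfrac{1}{2}\|x - z_t\|_2^2$, the best fixed decision in hindsight is the mean of the targets, so the regret of plain gradient descent can be measured directly.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 200, 4
z = rng.normal(size=(T, d))   # made-up targets: f_t(x) = 0.5 * ||x - z_t||^2
eta = 0.1                     # illustrative step size

x = np.zeros(d)
learner_loss = 0.0
for t in range(T):
    learner_loss += 0.5 * np.sum((x - z[t]) ** 2)
    x -= eta * (x - z[t])     # gradient step: grad f_t(x) = x - z_t

# Best fixed decision in hindsight minimizes the total loss: the mean of z_t.
x_star = z.mean(axis=0)
best_loss = 0.5 * np.sum((z - x_star) ** 2)
print(f"regret after {T} rounds: {learner_loss - best_loss:.3f}")
```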
Standard subgradient algorithms then move the predictor $x_t$ in the opposite direction of a subgradient $g_t \in \partial f_t(x_t)$ while maintaining $x_{t+1} \in \mathcal{X}$ via the projected gradient update (e.g., Zinkevich, 2003)

$$x_{t+1} = \Pi_{\mathcal{X}}\bigl(x_t - \eta g_t\bigr).$$
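A minimal sketch of one such step, assuming for concreteness that $\mathcal{X}$ is a Euclidean ball (my choice; any convex set with a computable projection works):

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Euclidean projection of y onto the ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_subgradient_step(x, g, eta, radius=1.0):
    """x_{t+1} = Pi_X(x_t - eta * g_t), with X the l2 ball."""
    return project_l2_ball(x - eta * g, radius)
```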
In contrast, let $\|x\|_A = \langle x, Ax \rangle^{1/2}$ denote the Mahalanobis norm and let $\Pi_{\mathcal{X}}^{A}(y) = \arg\min_{x \in \mathcal{X}} \|x - y\|_A$ denote the projection of a point $y$ onto $\mathcal{X}$ according to $A$. Using this notation, their generalization of standard gradient descent employs the update

$$x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}}\bigl(x_t - \eta\, G_t^{-1/2} g_t\bigr).$$

The above algorithm is computationally impractical in high dimensions, since it requires computing the square root (and inverse) of the full outer product matrix $G_t$. Thus they specialize the update to the diagonal version

$$x_{t+1} = \Pi_{\mathcal{X}}^{\operatorname{diag}(G_t)^{1/2}}\bigl(x_t - \eta\, \operatorname{diag}(G_t)^{-1/2} g_t\bigr).$$

Both the inverse and the root of $\operatorname{diag}(G_t)$ can be computed in linear time.
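A minimal sketch of the diagonal update in the unconstrained case $\mathcal{X} = \mathbb{R}^d$, where the projection is the identity (the small constant `eps` is my addition to guard against division by zero, not part of the formula above):

```python
import numpy as np

class DiagonalAdagrad:
    """Diagonal AdaGrad: per-coordinate steps scaled by diag(G_t)^{-1/2}."""

    def __init__(self, d, eta=0.1, eps=1e-8):
        self.eta = eta
        self.eps = eps               # numerical guard (my addition)
        self.sum_sq = np.zeros(d)    # running diag(G_t): sum of squared gradients

    def step(self, x, g):
        self.sum_sq += g ** 2
        return x - self.eta * g / (np.sqrt(self.sum_sq) + self.eps)
```

The effect is that coordinates with consistently large gradients take smaller steps, while rarely updated coordinates keep larger ones, which is the adaptivity the title refers to.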
In this paper, they consider several different online learning algorithms and their stochastic convex optimization counterparts.
Setting
They consider the online convex optimization setting where, at each time step $t$, the learner picks a decision or prediction vector $x_t \in \mathcal{X}$.
Instead of a single per-step loss $f_t$, each composite loss is defined as

$$\phi_t(x) = f_t(x) + \varphi(x),$$

where $f_t$ and $\varphi$ are convex functions. A common choice for the fixed regularizer is $\varphi(x) = \lambda \|x\|_1$, which promotes sparse solutions.
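For intuition, here is a hedged sketch of a composite step in the diagonal-AdaGrad style, assuming the unconstrained case and $\varphi(x) = \lambda\|x\|_1$; the $\ell_1$ term then turns the update into a scaled gradient move followed by per-coordinate soft-thresholding. This illustrates the composite idea under those assumptions, not a line-by-line transcription of the paper's algorithms.

```python
import numpy as np

def composite_adagrad_step(x, g, sum_sq, eta=0.1, lam=0.01, eps=1e-8):
    """One composite step with varphi(x) = lam * ||x||_1 (illustrative sketch).

    Take a diagonal-AdaGrad step on f_t, then apply the soft-thresholding
    operator that the l1 regularizer induces, with the threshold scaled by
    the same per-coordinate factors.
    """
    sum_sq = sum_sq + g ** 2
    h = np.sqrt(sum_sq) + eps        # per-coordinate scale, diag(G_t)^{1/2}
    y = x - eta * g / h              # scaled gradient step on f_t
    thresh = eta * lam / h           # per-coordinate l1 threshold
    x_new = np.sign(y) * np.maximum(np.abs(y) - thresh, 0.0)
    return x_new, sum_sq
```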
The regret definition
The regret in this composite setting measures how much worse the algorithm performs (in total) than the best fixed decision that knows the entire sequence ahead of time:

$$R_\phi(T) = \sum_{t=1}^{T} \bigl[f_t(x_t) + \varphi(x_t)\bigr] - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \bigl[f_t(x) + \varphi(x)\bigr].$$