Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Notation

Vectors and scalars are lower-case italic letters, such as x \in \mathcal{X}.

A sequence of vectors is represented as x_t, x_{t+1}, \dots, and entries of each vector are represented as x_{t,j}.

The subdifferential set of a function f at a point x is denoted \partial f(x), and a particular vector in the subdifferential set is denoted by f'(x) \in \partial f(x) or g_t \in \partial f_t(x_t). When a function is differentiable, we write \nabla f(x).

\langle x, y \rangle denotes the inner product between vectors x and y.

The Bregman divergence associated with a strongly convex and differentiable function \Psi is

B_\Psi(x, y) = \Psi(x) - \Psi(y) - \langle \nabla\Psi(y),\, x - y \rangle
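
As a quick check of the definition, taking \Psi(x) = \frac{1}{2}\|x\|_2^2 (so that \nabla\Psi(y) = y) recovers the squared Euclidean distance:

B_\Psi(x, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y,\, x - y \rangle = \frac{1}{2}\|x - y\|_2^2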

g_{1:t} = [g_1\ \cdots\ g_t] denotes the matrix obtained by concatenating the subgradient sequence.

We denote the i-th row of this matrix, which amounts to the concatenation of the i-th component of each subgradient we observe, by g_{1:t,i}.

The outer product matrix is G_t = \sum_{\tau=1}^t g_\tau g_\tau^T.

What regret means

In online learning, we imagine a repeated game between a learner and an environment (or adversary). At each time step t = 1, 2, \dots, T:

  1. The learner picks a decision or prediction vector x_t \in \mathcal{X} \subset \mathbb{R}^d.
  2. The environment reveals a loss function f_t: \mathcal{X} \to \mathbb{R}.
  3. The learner suffers the loss f_t(x_t).

After T rounds, we compare the learner's total loss to that of the best fixed decision in hindsight, x^*:

R(T) = \sum_{t=1}^T f_t(x_t) - \inf_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x)

This quantity R(T) is called the regret. It measures how much worse the learner performed than the best static predictor that could have been chosen with full knowledge of the future.
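
A minimal sketch of this protocol in code, using an illustrative per-round squared loss against random targets; the loss, step size, and closed-form comparator are assumptions of this demo, not details from the paper:

```python
import numpy as np

# Repeated game: the learner predicts x_t, the environment reveals f_t,
# and the learner suffers f_t(x_t). Here f_t(x) = 0.5 * ||x - z_t||^2.
rng = np.random.default_rng(0)
T, d, eta = 200, 3, 0.1
targets = rng.normal(size=(T, d))        # z_1, ..., z_T

x = np.zeros(d)                          # learner's current prediction
total_loss = 0.0
for z in targets:
    total_loss += 0.5 * np.sum((x - z) ** 2)  # suffer f_t(x_t)
    g = x - z                                 # gradient of f_t at x_t
    x = x - eta * g                           # plain gradient step

# For this quadratic loss the best fixed decision in hindsight is the
# mean of the targets, so the comparator term has a closed form.
x_star = targets.mean(axis=0)
best_loss = 0.5 * np.sum((x_star - targets) ** 2)
print(f"R(T) = {total_loss - best_loss:.3f}")
```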

Standard subgradient algorithms then move the predictor x_t in the direction opposite to g_t while maintaining x_{t+1} \in \mathcal{X} via the projected gradient update (e.g., Zinkevich, 2003)

x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta g_t) = \arg\min_{x \in \mathcal{X}} \|x - (x_t - \eta g_t)\|_2^2
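
A small sketch of this update, assuming for concreteness that \mathcal{X} is an \ell_2 ball so the projection has a closed form (the domain choice is an assumption of this demo):

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Euclidean projection onto X = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_subgradient_step(x_t, g_t, eta):
    """One update x_{t+1} = Pi_X(x_t - eta * g_t) on the l2-ball domain."""
    return project_l2_ball(x_t - eta * g_t)
```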

In contrast, let \|\cdot\|_A = \sqrt{\langle \cdot, A\cdot \rangle} denote the Mahalanobis norm, and denote the projection of a point y onto \mathcal{X} according to A by \Pi_{\mathcal{X}}^A(y) = \arg\min_{x \in \mathcal{X}} \|x - y\|_A = \arg\min_{x \in \mathcal{X}} \langle x - y,\, A(x - y) \rangle. Using this notation, their generalization of standard gradient descent employs the update

x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}}\left(x_t - \eta\, G_t^{-1/2} g_t\right)
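
A sketch of one full-matrix step, assuming an unconstrained domain (\mathcal{X} = \mathbb{R}^d, so the projection is the identity) and a small \delta added to the eigenvalues so the root is invertible; both are assumptions of this demo:

```python
import numpy as np

def full_matrix_step(x_t, g_t, G_sum, eta, delta=1e-8):
    """One full-matrix update with X = R^d (projection omitted).

    G_sum accumulates sum_tau g_tau g_tau^T. Forming G_t^{-1/2} via an
    eigendecomposition costs O(d^3) per step, which is why this variant
    is impractical in high dimensions.
    """
    G_sum = G_sum + np.outer(g_t, g_t)
    eigvals, eigvecs = np.linalg.eigh(G_sum)   # G_sum is symmetric PSD
    inv_root = eigvecs @ ((1.0 / np.sqrt(eigvals + delta)) * eigvecs).T
    return x_t - eta * inv_root @ g_t, G_sum
```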

This full-matrix algorithm is computationally impractical in high dimensions, since it requires computing the root (and inverse root) of the outer product matrix G_t. Thus they specialize the update to

x_{t+1} = \Pi_{\mathcal{X}}^{\operatorname{diag}(G_t)^{1/2}}\left(x_t - \eta\, \operatorname{diag}(G_t)^{-1/2} g_t\right).

Both the inverse and the root of \operatorname{diag}(G_t) can be computed in linear time.
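
A minimal sketch of the diagonal update, again assuming an unconstrained domain and a small \delta to avoid division by zero (both assumptions of this demo, and standard stabilizations in practice):

```python
import numpy as np

def diagonal_adagrad_step(x_t, g_t, g_sq_sum, eta, delta=1e-8):
    """One diagonal update with X = R^d (projection omitted).

    g_sq_sum holds diag(G_t), the per-coordinate sums of squared
    subgradients, so the whole update costs O(d) time and memory.
    """
    g_sq_sum = g_sq_sum + g_t ** 2                       # update diag(G_t)
    x_next = x_t - eta * g_t / (np.sqrt(g_sq_sum) + delta)
    return x_next, g_sq_sum
```

The effect is a per-coordinate step size that shrinks fastest for the coordinates whose subgradients have been largest, leaving larger effective learning rates for rarely active coordinates.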

In this paper, they consider several different online learning algorithms and their stochastic convex optimization counterparts.

Setting

They consider the online convex optimization setting where, at each time step t = 1, 2, \dots, T, the learner picks a decision or prediction vector x_t \in \mathcal{X} \subset \mathbb{R}^d.

Instead of a single per-step loss f_t(x_t), the loss at each round is the composite function

\Phi_t(x) = f_t(x) + \phi(x)

where f_t and \phi are convex functions.
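
Here \phi plays the role of a fixed regularizer. For instance (an illustrative choice), f_t could be the hinge loss on an example (a_t, b_t) with label b_t \in \{-1, +1\}, and \phi a sparsity-promoting \ell_1 penalty:

\Phi_t(x) = \max\{0,\ 1 - b_t \langle a_t, x \rangle\} + \lambda \|x\|_1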

The regret definition

The regret measures how much worse the algorithm performs (in total) than the best fixed decision x^* that knows the entire sequence ahead of time.

R_\Phi(T) = \sum_{t=1}^T \left[ \Phi_t(x_t) - \Phi_t(x^*) \right] = \sum_{t=1}^T \left[ f_t(x_t) + \phi(x_t) - f_t(x^*) - \phi(x^*) \right]
