Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Notation

Vectors and scalars are lower-case italic letters, such as x \in \mathcal{X}.

A sequence of vectors is represented as x_t, x_{t+1}, \dots, and entries of each vector are represented as x_{t,j}.

The subdifferential set of a function f at a point x is denoted \partial f(x), and a particular vector in the subdifferential set is denoted by f'(x) \in \partial f(x) or g_t \in \partial f_t(x_t). When a function is differentiable, we write \nabla f(x).

\langle x, y \rangle denotes the inner product between vectors x and y.

The Bregman divergence associated with a strongly convex and differentiable function \Psi is

B_\Psi(x, y) = \Psi(x) - \Psi(y) - \langle \nabla\Psi(y),\, x - y \rangle
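
As a quick check of the definition, taking \Psi(x) = \frac{1}{2}\|x\|_2^2 (so that \nabla\Psi(y) = y) recovers the squared Euclidean distance:

B_\Psi(x, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y,\, x - y \rangle = \frac{1}{2}\|x - y\|_2^2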

g_{1:t} = [g_1\ \cdots\ g_t] denotes the matrix obtained by concatenating the subgradient sequence.

We denote the i-th row of this matrix, which amounts to the concatenation of the i-th component of each subgradient we observe, by g_{1:t,i}.

The outer product matrix is G_t = \sum_{\tau=1}^t g_\tau g_\tau^T.

What regret means

In online learning, we imagine a repeated game between a learner and an environment (or adversary). At each time step t = 1, 2, \dots, T:

  1. The learner picks a decision or prediction vector x_t \in \mathcal{X} \subset \mathbb{R}^d.
  2. The environment reveals a loss function f_t: \mathcal{X} \to \mathbb{R}.
  3. The learner suffers the loss f_t(x_t).

After T rounds, we compare the learner's total loss to that of the best fixed decision in hindsight, x^*:

R(T) = \sum_{t=1}^T f_t(x_t) - \inf_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x)

This quantity R(T) is called the regret. It measures how much worse the learner performed than the best static predictor that could have been chosen with full knowledge of the future.
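
A minimal sketch of this protocol in code, using an illustrative per-round squared loss against random targets; the loss, step size, and closed-form comparator are assumptions of this demo, not details from the paper:

```python
import numpy as np

# Repeated game: the learner predicts x_t, the environment reveals f_t,
# and the learner suffers f_t(x_t). Here f_t(x) = 0.5 * ||x - z_t||^2.
rng = np.random.default_rng(0)
T, d, eta = 200, 3, 0.1
targets = rng.normal(size=(T, d))        # z_1, ..., z_T

x = np.zeros(d)                          # learner's current prediction
total_loss = 0.0
for z in targets:
    total_loss += 0.5 * np.sum((x - z) ** 2)  # suffer f_t(x_t)
    g = x - z                                 # gradient of f_t at x_t
    x = x - eta * g                           # plain gradient step

# For this quadratic loss the best fixed decision in hindsight is the
# mean of the targets, so the comparator term has a closed form.
x_star = targets.mean(axis=0)
best_loss = 0.5 * np.sum((x_star - targets) ** 2)
print(f"R(T) = {total_loss - best_loss:.3f}")
```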

Standard subgradient algorithms then move the predictor x_t in the direction opposite to g_t while maintaining x_{t+1} \in \mathcal{X} via the projected gradient update (e.g., Zinkevich, 2003)

x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta g_t) = \arg\min_{x \in \mathcal{X}} \|x - (x_t - \eta g_t)\|_2^2
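
A small sketch of this update, assuming for concreteness that \mathcal{X} is an \ell_2 ball so the projection has a closed form (the domain choice is an assumption of this demo):

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Euclidean projection onto X = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_subgradient_step(x_t, g_t, eta):
    """One update x_{t+1} = Pi_X(x_t - eta * g_t) on the l2-ball domain."""
    return project_l2_ball(x_t - eta * g_t)
```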

In contrast, let \|\cdot\|_A = \sqrt{\langle \cdot, A\cdot \rangle} denote the Mahalanobis norm, and denote the projection of a point y onto \mathcal{X} according to A by \Pi_{\mathcal{X}}^A(y) = \arg\min_{x \in \mathcal{X}} \|x - y\|_A = \arg\min_{x \in \mathcal{X}} \langle x - y,\, A(x - y) \rangle. Using this notation, their generalization of standard gradient descent employs the update

x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}}\left(x_t - \eta\, G_t^{-1/2} g_t\right)
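
A sketch of one full-matrix step, assuming an unconstrained domain (\mathcal{X} = \mathbb{R}^d, so the projection is the identity) and a small \delta added to the eigenvalues so the root is invertible; both are assumptions of this demo:

```python
import numpy as np

def full_matrix_step(x_t, g_t, G_sum, eta, delta=1e-8):
    """One full-matrix update with X = R^d (projection omitted).

    G_sum accumulates sum_tau g_tau g_tau^T. Forming G_t^{-1/2} via an
    eigendecomposition costs O(d^3) per step, which is why this variant
    is impractical in high dimensions.
    """
    G_sum = G_sum + np.outer(g_t, g_t)
    eigvals, eigvecs = np.linalg.eigh(G_sum)   # G_sum is symmetric PSD
    inv_root = eigvecs @ ((1.0 / np.sqrt(eigvals + delta)) * eigvecs).T
    return x_t - eta * inv_root @ g_t, G_sum
```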

This full-matrix algorithm is computationally impractical in high dimensions, since it requires computing the root (and inverse root) of the outer product matrix G_t. Thus they specialize the update to

x_{t+1} = \Pi_{\mathcal{X}}^{\operatorname{diag}(G_t)^{1/2}}\left(x_t - \eta\, \operatorname{diag}(G_t)^{-1/2} g_t\right).

Both the inverse and the root of \operatorname{diag}(G_t) can be computed in linear time.
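
A minimal sketch of the diagonal update, again assuming an unconstrained domain and a small \delta to avoid division by zero (both assumptions of this demo, and standard stabilizations in practice):

```python
import numpy as np

def diagonal_adagrad_step(x_t, g_t, g_sq_sum, eta, delta=1e-8):
    """One diagonal update with X = R^d (projection omitted).

    g_sq_sum holds diag(G_t), the per-coordinate sums of squared
    subgradients, so the whole update costs O(d) time and memory.
    """
    g_sq_sum = g_sq_sum + g_t ** 2                       # update diag(G_t)
    x_next = x_t - eta * g_t / (np.sqrt(g_sq_sum) + delta)
    return x_next, g_sq_sum
```

The effect is a per-coordinate step size that shrinks fastest for the coordinates whose subgradients have been largest, leaving larger effective learning rates for rarely active coordinates.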

In this paper, they consider several different online learning algorithms and their stochastic convex optimization counterparts.

Setting

They consider the online convex optimization setting where, at each time step t = 1, 2, \dots, T, the learner picks a decision or prediction vector x_t \in \mathcal{X} \subset \mathbb{R}^d.

Instead of a single per-step loss f_t(x_t), the loss at each round is the composite function

\Phi_t(x) = f_t(x) + \phi(x)

where f_t and \phi are convex functions.
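
Here \phi plays the role of a fixed regularizer. For instance (an illustrative choice), f_t could be the hinge loss on an example (a_t, b_t) with label b_t \in \{-1, +1\}, and \phi a sparsity-promoting \ell_1 penalty:

\Phi_t(x) = \max\{0,\ 1 - b_t \langle a_t, x \rangle\} + \lambda \|x\|_1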

The regret definition

The regret measures how much worse the algorithm performs (in total) than the best fixed decision x^* that knows the entire sequence ahead of time.

R_\Phi(T) = \sum_{t=1}^T \left[ \Phi_t(x_t) - \Phi_t(x^*) \right] = \sum_{t=1}^T \left[ f_t(x_t) + \phi(x_t) - f_t(x^*) - \phi(x^*) \right]
