Stochastic Subgradient Method Converges on Tame Functions


Introduction

Motivation

Though variants of the stochastic subgradient method date back to the pioneering 1951 work of Robbins and Monro, their convergence behavior remains poorly understood in nonsmooth, nonconvex settings. In particular, the following question remains open.

Does the (stochastic) subgradient method have any convergence guarantees on locally Lipschitz functions, which may be neither smooth nor convex?

That this question remains unanswered is somewhat concerning as the stochastic subgradient method forms a core numerical subroutine for several widely used solvers.

Main result

This paper provides a positive answer to this question for a wide class of locally Lipschitz functions. Aside from mild technical conditions, the only meaningful assumption we make is that $f$ strictly decreases along any trajectory $x(\cdot)$ of the differential inclusion $\dot{x}(t) \in -\partial f(x(t))$ emanating from a noncritical point. Under this assumption, a standard Lyapunov-type argument shows that every limit point of the stochastic subgradient method is critical for $f$, almost surely.

The outline of this paper is as follows. In Section 2, we fix the notation for the rest of the manuscript. Section 3 provides a self-contained treatment of asymptotic consistency for discrete approximations of differential inclusions. In Section 4, we specialize the results of the previous section to the stochastic subgradient method. In Section 5, we verify the sufficient conditions for subsequential convergence for a broad class of locally Lipschitz functions, including those that are subdifferentially regular or Whitney stratifiable, and specialize our results to deep learning settings. Finally, in Section 6, we extend the results of the previous sections to the proximal setting.

Preliminaries

Absolutely continuous curves

Definition (Uniform convergence): Let $S \subset \mathbb{R}$ be a nonempty subset. A sequence of functions $\{f_n\}$ is said to converge uniformly on $S$ to a function $f$ if for any $\epsilon > 0$, there exists $N \in \mathbb{N}$ such that $|f_n(x) - f(x)| < \epsilon$ for all $n \geq N$ and all $x \in S$.

Definition (Locally Lipschitz functions): A function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be locally Lipschitz if for each bounded subset $B \subset \mathbb{R}^d$, there exists a constant $K > 0$ such that for any $x, y \in B$, we have $|f(x) - f(y)| \leq K \|x - y\|$, where $\|\cdot\|$ denotes the Euclidean norm.

Any continuous function $x: \mathbb{R}_+ \to \mathbb{R}^d$ is called a curve in $\mathbb{R}^d$. All curves in $\mathbb{R}^d$ comprise the set $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$. We will say that a sequence of functions $f_k$ converges to $f$ in $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$ if $f_k$ converges to $f$ uniformly on compact intervals, that is, for all $T > 0$, we have

$$\lim_{k \to \infty} \sup_{t \in [0, T]} \|f_k(t) - f(t)\| = 0.$$

Definition (Absolute continuity): Let $I$ be an interval of the real line $\mathbb{R}$. A function $f: I \to \mathbb{R}$ is absolutely continuous on $I$ if for every $\epsilon > 0$ there exists $\delta > 0$ such that whenever a finite sequence of pairwise disjoint subintervals $(x_k, y_k)$ of $I$ with $x_k < y_k$ satisfies $\sum_{k=1}^n (y_k - x_k) < \delta$, it holds that $\sum_{k=1}^n |f(y_k) - f(x_k)| < \epsilon$.

The following conditions on a real-valued function $f$ on a compact interval $[a, b]$ are equivalent:

  1. $f$ is absolutely continuous on $[a, b]$.
  2. There exists a Lebesgue integrable function $g$ on $[a, b]$ such that for all $x \in [a, b]$,
$$f(x) = f(a) + \int_a^x g(t) \, dt.$$

Moreover, if this is the case, then the equality $g(x) = f'(x)$ holds for a.e. $x \in [a, b]$. Henceforth, for brevity, we will call absolutely continuous curves arcs. We will often use the observation that if $f: \mathbb{R}^d \to \mathbb{R}$ is locally Lipschitz continuous and $x$ is an arc, then the composition $f \circ x$ is absolutely continuous.

Proof sketch: On any compact interval $[0, T]$, the image $x([0, T])$ is compact, and a locally Lipschitz function is Lipschitz on compact sets. The composition of a Lipschitz function with an absolutely continuous function is absolutely continuous.

Set-valued maps and the Clarke subdifferential

A set-valued map $G: \mathbb{R}^d \rightrightarrows \mathbb{R}^m$ is a mapping from a set $\mathcal{X} \subseteq \mathbb{R}^d$ to the power set of $\mathbb{R}^m$. We will use the notation

$$G^{-1}(v) := \{ x \in \mathcal{X} : v \in G(x) \}$$

for the preimage of a vector $v \in \mathbb{R}^m$.

Definition (Outer semicontinuity): The map $G$ is outer-semicontinuous at a point $x \in \mathcal{X}$ if for any sequences $x_i \to x$ and $v_i \in G(x_i)$ converging to some $v \in \mathbb{R}^m$, the inclusion $v \in G(x)$ holds.

Theorem (Rademacher's theorem): If $U$ is an open subset of $\mathbb{R}^n$ and $f: U \to \mathbb{R}^m$ is Lipschitz continuous, then $f$ is differentiable almost everywhere in $U$; that is, the points in $U$ at which $f$ is not differentiable form a set of Lebesgue measure zero.

Consider a locally Lipschitz continuous function $f: \mathbb{R}^d \to \mathbb{R}$. The well-known Rademacher's theorem guarantees that $f$ is differentiable almost everywhere. Taking this into account, the Clarke subdifferential of $f$ at any point $x$ is the set

$$\partial f(x) := \operatorname{conv}\left\{ \lim_{i \to \infty} \nabla f(x_i) : x_i \xrightarrow{\Omega} x \right\},$$

where $\Omega$ is any full-measure subset of $\mathbb{R}^d$ on which $f$ is differentiable. It is known that the Clarke subdifferential $\partial f(x)$ is a nonempty, compact, convex set for all $x \in \mathbb{R}^d$ and that the map $x \mapsto \partial f(x)$ is outer-semicontinuous.

Analogously to the smooth setting, a point $x \in \mathbb{R}^d$ is called (Clarke) critical for $f$ whenever the inclusion $0 \in \partial f(x)$ holds.
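As a concrete illustration (our own sketch, not part of the paper's development), the Clarke subdifferential can be approximated in simple one-dimensional cases by sampling gradients at nearby differentiable points and taking their convex hull. For $f(x) = |x|$, whose gradient away from the origin is $\operatorname{sign}(x)$, this recovers $\partial f(0) = [-1, 1]$, so $0 \in \partial f(0)$ and the origin is critical.

```python
import numpy as np

def clarke_subdiff_interval(f_grad, x, radius=1e-6, samples=1000, seed=0):
    """Estimate the Clarke subdifferential of f: R -> R at x by sampling
    gradients at nearby (almost surely differentiable) points and taking
    their convex hull, which in one dimension is an interval."""
    rng = np.random.default_rng(seed)
    pts = x + rng.uniform(-radius, radius, size=samples)
    grads = f_grad(pts)
    return float(grads.min()), float(grads.max())

# f(x) = |x| has gradient sign(x) away from the origin.
lo, hi = clarke_subdiff_interval(np.sign, 0.0)
# The sampled gradients are -1 and +1, so their convex hull is [-1, 1],
# matching the Clarke subdifferential of |x| at 0; since 0 lies in this
# interval, the origin is (Clarke) critical.
```

This sampling heuristic is only illustrative: it works here because the nearby gradients of $|x|$ take exactly the values $\pm 1$.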

Differential inclusions and discrete approximations

Functional convergence of discrete approximations

Definition (Trajectory): Let $\mathcal{X}$ be a closed set and let $G: \mathcal{X} \rightrightarrows \mathbb{R}^d$ be a set-valued map. An arc $x: \mathbb{R}_+ \to \mathbb{R}^d$ is called a trajectory of $G$ if it satisfies the differential inclusion

$$\dot{x}(t) \in G(x(t)) \quad \text{for a.e. } t \geq 0. \tag{trajectory}$$

In this work, we will primarily focus on iterative algorithms that aim to asymptotically track a trajectory of (trajectory).

Throughout, we will consider the following iteration sequence:

$$x_{k+1} = x_k + \alpha_k (y_k + \xi_k). \tag{iteration}$$

Here, $\alpha_k > 0$ is a sequence of step sizes, $y_k$ should be thought of as an approximate evaluation of $G$ at some point near $x_k$, and $\xi_k$ is a sequence of "errors".

Our goal is to isolate reasonable conditions under which the sequence $\{x_k\}$ asymptotically tracks a trajectory of (trajectory).
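To make (iteration) concrete, the following sketch (our own illustration, not code from the paper) runs the process on $f(x) = |x|$ in one dimension, with $G(x) = -\partial f(x)$, exact subgradient evaluations $y_k = -\operatorname{sign}(x_k)$, zero-mean Gaussian errors $\xi_k$, and step sizes $\alpha_k = 1/k$. The iterates drift toward the unique critical point $0$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 5.0  # initial iterate x_1
for k in range(1, 20001):
    alpha = 1.0 / k                    # nonnegative, square summable, not summable
    y = -np.sign(x)                    # an element of G(x) = -∂f(x) for f(x) = |x|
    xi = 0.1 * rng.standard_normal()   # zero-mean "error" term
    x = x + alpha * (y + xi)

# x now sits close to the unique critical point 0 of f(x) = |x|.
```

Once the iterates reach a neighborhood of $0$, they oscillate with amplitude on the order of the current step size, which shrinks to zero.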

Assumption (Standing assumptions):

  1. All limit points of $\{x_k\}$ lie in $\mathcal{X}$.
  2. The iterates and the directions are bounded, i.e., $\sup_{k \geq 1} \|x_k\| < \infty$ and $\sup_{k \geq 1} \|y_k\| < \infty$.
  3. The step sizes $\{\alpha_k\}$ are nonnegative, square summable, but not summable:
$$\alpha_k \geq 0, \quad \sum_{k=1}^\infty \alpha_k = \infty, \quad \sum_{k=1}^\infty \alpha_k^2 < \infty.$$
  4. The weighted noise sequence converges: $\sum_{k=1}^n \alpha_k \xi_k \to v$ for some $v$ as $n \to \infty$.
  5. For any unbounded increasing sequence $\{k_j\} \subset \mathbb{N}$ such that $x_{k_j}$ converges to some point $\bar{x}$, it holds that
$$\lim_{n \to \infty} \operatorname{dist}\left(\frac{1}{n} \sum_{j=1}^n y_{k_j}, G(\bar{x})\right) = 0.$$

Remark: Some comments are in order. Conditions 1, 2, and 3 are in some sense minimal, though the boundedness conditions must be checked for each particular algorithm. Condition 4 guarantees that the noise sequence $\xi_k$ does not grow too quickly relative to the rate at which the step sizes $\alpha_k$ decrease. The key Condition 5 summarizes the sense in which the values $y_k$ are approximate evaluations of $G$, up to convexification.
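For instance, the classical choice $\alpha_k = 1/k$ satisfies Condition 3: the partial sums grow like $\log n$, while the partial sums of squares converge to $\pi^2/6$. A quick numerical check (our own illustration):

```python
import math

# Partial sums for the step sizes alpha_k = 1/k:
n = 100_000
sum_alpha = sum(1.0 / k for k in range(1, n + 1))        # ~ log(n) + 0.577, diverges
sum_alpha_sq = sum(1.0 / k**2 for k in range(1, n + 1))  # -> pi^2 / 6, converges

# sum_alpha keeps growing without bound as n increases, while
# sum_alpha_sq is already within 1e-4 of its limit pi^2 / 6.
```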

Define the time points $t_1 = 0$ and $t_m = \sum_{k=1}^{m-1} \alpha_k$ for $m \geq 2$, so that $t_m$ is the time associated with the iterate $x_m$. Let $x(\cdot)$ be the linear interpolation of the discrete path:

$$x(t) := x_k + \frac{t - t_k}{t_{k+1} - t_k}\,(x_{k+1} - x_k) \quad \text{for } t \in [t_k, t_{k+1}].$$

For each $\tau \geq 0$, define the time-shifted curve $x^\tau(\cdot) := x(\tau + \cdot)$.
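The interpolation and time shift are straightforward to implement; the sketch below (our own illustration, with made-up iterates and step sizes) evaluates $x(t)$ by piecewise-linear interpolation over the time points $t_m$ and realizes the shift $x^\tau(t) = x(\tau + t)$.

```python
import numpy as np

alphas = np.array([0.5, 0.4, 0.3, 0.2])             # step sizes alpha_1..alpha_4
iterates = np.array([0.0, 1.0, 1.5, 1.2, 1.3])      # iterates x_1..x_5
times = np.concatenate(([0.0], np.cumsum(alphas)))  # time points t_1..t_5

def x_interp(t):
    """Piecewise-linear interpolation x(t) of the discrete path."""
    return np.interp(t, times, iterates)

def x_shift(tau, t):
    """The time-shifted curve x^tau(t) = x(tau + t)."""
    return x_interp(tau + t)
```

By construction, $x(t_m) = x_m$ at every time point, and in between the curve moves linearly from one iterate to the next.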

Theorem (Functional approximation): Suppose the standing assumptions hold. Then for any sequence $\{\tau_k\}_{k=1}^\infty \subseteq \mathbb{R}_+$, the set of functions $\{x^{\tau_k}(\cdot)\}$ is relatively compact in $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$. If in addition $\tau_k \to \infty$ as $k \to \infty$, then every limit point $z(\cdot)$ of $\{x^{\tau_k}(\cdot)\}$ in $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$ is a trajectory of the differential inclusion (trajectory).

Subsequential convergence to equilibrium points

A primary application of the discrete process (iteration) is to solve the inclusion

$$0 \in G(z).$$

Assumption (Lyapunov condition): There exists a continuous function $\phi: \mathbb{R}^d \to \mathbb{R}$, bounded from below, such that the following two properties hold.

  1. (Weak Sard) For a dense set of values $r \in \mathbb{R}$, the intersection $\phi^{-1}(r) \cap G^{-1}(0)$ is empty.
  2. (Descent) Whenever $z: \mathbb{R}_+ \to \mathbb{R}^d$ is a trajectory of the differential inclusion (trajectory) and $0 \notin G(z(0))$, there exists $T > 0$ satisfying
$$\phi(z(T)) < \sup_{t \in [0, T]} \phi(z(t)) \leq \phi(z(0)).$$
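For intuition, consider the smooth special case $G = -\nabla \phi$ with $\phi$ continuously differentiable (this computation is our own illustration, not part of the formal development). Along any trajectory $z(\cdot)$,

$$\frac{d}{dt}\,\phi(z(t)) = \langle \nabla \phi(z(t)), \dot{z}(t) \rangle = -\|\nabla \phi(z(t))\|^2 \leq 0,$$

with strict inequality whenever $\nabla \phi(z(t)) \neq 0$. Hence, if $0 \notin G(z(0))$, the value $\phi(z(t))$ strictly decreases near $t = 0$, and the descent property holds for any sufficiently small $T > 0$.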

Remark: The weak Sard property says that the values taken by $\phi$ on the equilibrium set $G^{-1}(0)$ miss a dense subset of $\mathbb{R}$. The descent property says that $\phi$ eventually strictly decreases along any trajectory of the differential inclusion $\dot{z}(t) \in G(z(t))$ emanating from a non-equilibrium point.

Theorem: Suppose the standing assumptions and the Lyapunov condition hold. Then every limit point of $\{x_k\}_{k \geq 1}$ lies in $G^{-1}(0)$, and the function values $\{\phi(x_k)\}_{k \geq 1}$ converge.
