Stochastic Subgradient Method Converges on Tame Functions


Introduction

Motivation

Though variants of the stochastic subgradient method date back to the pioneering 1951 work of Robbins and Monro, their convergence behavior remains poorly understood in nonsmooth, nonconvex settings. In particular, the following question remains open.

Does the (stochastic) subgradient method have any convergence guarantees on locally Lipschitz functions, which may be neither smooth nor convex?

That this question remains unanswered is somewhat concerning as the stochastic subgradient method forms a core numerical subroutine for several widely used solvers.

Main result

This paper provides a positive answer to this question for a wide class of locally Lipschitz functions. Aside from mild technical conditions, the only meaningful assumption we make is that $f$ strictly decreases along any trajectory $x(\cdot)$ of the differential inclusion $\dot{x}(t) \in -\partial f(x(t))$ emanating from a noncritical point. Under this assumption, a standard Lyapunov-type argument shows that every limit point of the stochastic subgradient method is critical for $f$, almost surely.

The outline of this paper is as follows. In Section 2, we fix the notation for the rest of the manuscript. Section 3 provides a self-contained treatment of asymptotic consistency for discrete approximations of differential inclusions. In Section 4, we specialize the results of the previous section to the stochastic subgradient method. In Section 5, we verify the sufficient conditions for subsequential convergence for a broad class of locally Lipschitz functions, including those that are subdifferentially regular or Whitney stratifiable, and specialize our results to deep learning settings. Finally, in Section 6, we extend the results of the previous sections to the proximal setting.

Preliminaries

Absolutely continuous curves

Definition (Uniform convergence): Let $S \subset \mathbb{R}$ be a nonempty subset. A sequence of functions $\{f_n\}$ is said to converge uniformly on $S$ to a function $f$ if for any $\epsilon > 0$, there exists $N \in \mathbb{N}$ such that $|f_n(x) - f(x)| < \epsilon$ for all $n \geq N$ and all $x \in S$.

Definition (Locally Lipschitz functions): A function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be locally Lipschitz if for each bounded subset $B \subset \mathbb{R}^d$, there exists a constant $K > 0$ such that for any $x, y \in B$, we have $|f(x) - f(y)| \leq K \|x - y\|$, where $\|\cdot\|$ denotes the Euclidean norm.

Any continuous function $x: \mathbb{R}_+ \to \mathbb{R}^d$ is called a curve in $\mathbb{R}^d$. All curves in $\mathbb{R}^d$ comprise the set $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$. We will say that a sequence of functions $f_k$ converges to $f$ in $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$ if $f_k$ converges to $f$ uniformly on compact intervals, that is, for all $T > 0$, we have

$$\lim_{k \to \infty} \sup_{t \in [0, T]} \|f_k(t) - f(t)\| = 0.$$

Definition (Absolute continuity): Let $I$ be an interval of the real line $\mathbb{R}$. A function $f: I \to \mathbb{R}$ is absolutely continuous on $I$ if for every $\epsilon > 0$ there exists $\delta > 0$ such that whenever a finite sequence of pairwise disjoint subintervals $(x_k, y_k)$ of $I$ with $x_k < y_k$ satisfies $\sum_{k=1}^n (y_k - x_k) < \delta$, it holds that $\sum_{k=1}^n |f(y_k) - f(x_k)| < \epsilon$.

The following conditions on a real-valued function $f$ on a compact interval $[a, b]$ are equivalent:

  1. $f$ is absolutely continuous on $[a, b]$.
  2. There exists a Lebesgue integrable function $g$ on $[a, b]$ such that for all $x \in [a, b]$,
$$f(x) = f(a) + \int_a^x g(t) \, dt.$$

Moreover, if this is the case, then the equality $g(x) = f'(x)$ holds for a.e. $x \in [a, b]$. Henceforth, for brevity, we will call absolutely continuous curves arcs. We will often use the observation that if $f: \mathbb{R}^d \to \mathbb{R}$ is locally Lipschitz continuous and $x$ is an arc, then the composition $f \circ x$ is absolutely continuous.

Proof sketch: On any compact interval $[0, T]$, the image $x([0, T])$ is compact, and a locally Lipschitz function is Lipschitz on compact sets. The composition of a Lipschitz function with an absolutely continuous function is absolutely continuous.

Set-valued maps and the Clarke subdifferential

A set-valued map $G: \mathbb{R}^d \rightrightarrows \mathbb{R}^m$ is a mapping from a set $\mathcal{X} \subseteq \mathbb{R}^d$ to the power set of $\mathbb{R}^m$. We will use the notation

$$G^{-1}(v) := \{ x \in \mathcal{X} : v \in G(x) \}$$

for the preimage of a vector $v \in \mathbb{R}^m$.

Definition (Outer semicontinuity): The map $G$ is outer-semicontinuous at a point $x \in \mathcal{X}$ if for any sequences $x_i \to x$ and $v_i \in G(x_i)$ converging to some $v \in \mathbb{R}^m$, the inclusion $v \in G(x)$ holds.

Theorem (Rademacher's theorem): If $U$ is an open subset of $\mathbb{R}^n$ and $f: U \to \mathbb{R}^m$ is Lipschitz continuous, then $f$ is differentiable almost everywhere in $U$; that is, the points in $U$ at which $f$ is not differentiable form a set of Lebesgue measure zero.

Consider a locally Lipschitz continuous function $f: \mathbb{R}^d \to \mathbb{R}$. The well-known Rademacher's theorem guarantees that $f$ is differentiable almost everywhere. Taking this into account, the Clarke subdifferential of $f$ at any point $x$ is the set

$$\partial f(x) := \operatorname{conv}\left\{ \lim_{i \to \infty} \nabla f(x_i) : x_i \xrightarrow{\Omega} x \right\},$$

where $\Omega$ is any full-measure subset of $\mathbb{R}^d$ on which $f$ is differentiable. It is known that the Clarke subdifferential $\partial f(x)$ is a nonempty, compact, convex set for all $x \in \mathbb{R}^d$ and that the map $x \mapsto \partial f(x)$ is outer-semicontinuous.

Analogously to the smooth setting, a point $x \in \mathbb{R}^d$ is called (Clarke) critical for $f$ whenever the inclusion $0 \in \partial f(x)$ holds.
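As a concrete illustration (our own sketch, not part of the paper's development), the Clarke subdifferential can be approximated in simple one-dimensional cases by sampling gradients at nearby differentiable points and taking their convex hull. For $f(x) = |x|$, whose gradient away from the origin is $\operatorname{sign}(x)$, this recovers $\partial f(0) = [-1, 1]$, so $0 \in \partial f(0)$ and the origin is critical.

```python
import numpy as np

def clarke_subdiff_interval(f_grad, x, radius=1e-6, samples=1000, seed=0):
    """Estimate the Clarke subdifferential of f: R -> R at x by sampling
    gradients at nearby (almost surely differentiable) points and taking
    their convex hull, which in one dimension is an interval."""
    rng = np.random.default_rng(seed)
    pts = x + rng.uniform(-radius, radius, size=samples)
    grads = f_grad(pts)
    return float(grads.min()), float(grads.max())

# f(x) = |x| has gradient sign(x) away from the origin.
lo, hi = clarke_subdiff_interval(np.sign, 0.0)
# The sampled gradients are -1 and +1, so their convex hull is [-1, 1],
# matching the Clarke subdifferential of |x| at 0; since 0 lies in this
# interval, the origin is (Clarke) critical.
```

This sampling heuristic is only illustrative: it works here because the nearby gradients of $|x|$ take exactly the values $\pm 1$.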

Differential inclusions and discrete approximations

Functional convergence of discrete approximations

Definition (Trajectory): Let $\mathcal{X}$ be a closed set and let $G: \mathcal{X} \rightrightarrows \mathbb{R}^d$ be a set-valued map. An arc $x: \mathbb{R}_+ \to \mathbb{R}^d$ is called a trajectory of $G$ if it satisfies the differential inclusion

$$\dot{x}(t) \in G(x(t)) \quad \text{for a.e. } t \geq 0. \tag{trajectory}$$

In this work, we will primarily focus on iterative algorithms that aim to asymptotically track a trajectory of (trajectory).

Throughout, we will consider the following iteration sequence:

$$x_{k+1} = x_k + \alpha_k (y_k + \xi_k). \tag{iteration}$$

Here, $\alpha_k > 0$ is a sequence of step sizes, $y_k$ should be thought of as an approximate evaluation of $G$ at some point near $x_k$, and $\xi_k$ is a sequence of "errors".

Our goal is to isolate reasonable conditions under which the sequence $\{x_k\}$ asymptotically tracks a trajectory of (trajectory).
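To make (iteration) concrete, the following sketch (our own illustration, not code from the paper) runs the process on $f(x) = |x|$ in one dimension, with $G(x) = -\partial f(x)$, exact subgradient evaluations $y_k = -\operatorname{sign}(x_k)$, zero-mean Gaussian errors $\xi_k$, and step sizes $\alpha_k = 1/k$. The iterates drift toward the unique critical point $0$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 5.0  # initial iterate x_1
for k in range(1, 20001):
    alpha = 1.0 / k                    # nonnegative, square summable, not summable
    y = -np.sign(x)                    # an element of G(x) = -∂f(x) for f(x) = |x|
    xi = 0.1 * rng.standard_normal()   # zero-mean "error" term
    x = x + alpha * (y + xi)

# x now sits close to the unique critical point 0 of f(x) = |x|.
```

Once the iterates reach a neighborhood of $0$, they oscillate with amplitude on the order of the current step size, which shrinks to zero.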

Assumption (Standing assumptions):

  1. All limit points of $\{x_k\}$ lie in $\mathcal{X}$.
  2. The iterates and the directions are bounded, i.e., $\sup_{k \geq 1} \|x_k\| < \infty$ and $\sup_{k \geq 1} \|y_k\| < \infty$.
  3. The step sizes $\{\alpha_k\}$ are nonnegative, square summable, but not summable:
$$\alpha_k \geq 0, \quad \sum_{k=1}^\infty \alpha_k = \infty, \quad \sum_{k=1}^\infty \alpha_k^2 < \infty.$$
  4. The weighted noise sequence converges: $\sum_{k=1}^n \alpha_k \xi_k \to v$ for some $v$ as $n \to \infty$.
  5. For any unbounded increasing sequence $\{k_j\} \subset \mathbb{N}$ such that $x_{k_j}$ converges to some point $\bar{x}$, it holds that
$$\lim_{n \to \infty} \operatorname{dist}\left(\frac{1}{n} \sum_{j=1}^n y_{k_j}, G(\bar{x})\right) = 0.$$

Remark: Some comments are in order. Conditions 1, 2, and 3 are in some sense minimal, though the boundedness conditions must be checked for each particular algorithm. Condition 4 guarantees that the noise sequence $\xi_k$ does not grow too quickly relative to the rate at which the step sizes $\alpha_k$ decrease. The key Condition 5 summarizes the sense in which the values $y_k$ are approximate evaluations of $G$, up to convexification.
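For instance, the classical choice $\alpha_k = 1/k$ satisfies Condition 3: the partial sums grow like $\log n$, while the partial sums of squares converge to $\pi^2/6$. A quick numerical check (our own illustration):

```python
import math

# Partial sums for the step sizes alpha_k = 1/k:
n = 100_000
sum_alpha = sum(1.0 / k for k in range(1, n + 1))        # ~ log(n) + 0.577, diverges
sum_alpha_sq = sum(1.0 / k**2 for k in range(1, n + 1))  # -> pi^2 / 6, converges

# sum_alpha keeps growing without bound as n increases, while
# sum_alpha_sq is already within 1e-4 of its limit pi^2 / 6.
```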

Define the time points $t_1 = 0$ and $t_m = \sum_{k=1}^{m-1} \alpha_k$ for $m \geq 2$, so that $t_m$ is the time associated with the iterate $x_m$. Let $x(\cdot)$ be the linear interpolation of the discrete path:

$$x(t) := x_k + \frac{t - t_k}{t_{k+1} - t_k}\,(x_{k+1} - x_k) \quad \text{for } t \in [t_k, t_{k+1}].$$

For each $\tau \geq 0$, define the time-shifted curve $x^\tau(\cdot) := x(\tau + \cdot)$.
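The interpolation and time shift are straightforward to implement; the sketch below (our own illustration, with made-up iterates and step sizes) evaluates $x(t)$ by piecewise-linear interpolation over the time points $t_m$ and realizes the shift $x^\tau(t) = x(\tau + t)$.

```python
import numpy as np

alphas = np.array([0.5, 0.4, 0.3, 0.2])             # step sizes alpha_1..alpha_4
iterates = np.array([0.0, 1.0, 1.5, 1.2, 1.3])      # iterates x_1..x_5
times = np.concatenate(([0.0], np.cumsum(alphas)))  # time points t_1..t_5

def x_interp(t):
    """Piecewise-linear interpolation x(t) of the discrete path."""
    return np.interp(t, times, iterates)

def x_shift(tau, t):
    """The time-shifted curve x^tau(t) = x(tau + t)."""
    return x_interp(tau + t)
```

By construction, $x(t_m) = x_m$ at every time point, and in between the curve moves linearly from one iterate to the next.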

Theorem (Functional approximation): Suppose the standing assumptions hold. Then for any sequence $\{\tau_k\}_{k=1}^\infty \subseteq \mathbb{R}_+$, the set of functions $\{x^{\tau_k}(\cdot)\}$ is relatively compact in $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$. If in addition $\tau_k \to \infty$ as $k \to \infty$, then every limit point $z(\cdot)$ of $\{x^{\tau_k}(\cdot)\}$ in $\mathcal{C}(\mathbb{R}_+, \mathbb{R}^d)$ is a trajectory of the differential inclusion (trajectory).

Subsequential convergence to equilibrium points

A primary application of the discrete process (iteration) is to solve the inclusion

$$0 \in G(z).$$

Assumption (Lyapunov condition): There exists a continuous function $\phi: \mathbb{R}^d \to \mathbb{R}$, bounded from below, such that the following two properties hold.

  1. (Weak Sard) For a dense set of values $r \in \mathbb{R}$, the intersection $\phi^{-1}(r) \cap G^{-1}(0)$ is empty.
  2. (Descent) Whenever $z: \mathbb{R}_+ \to \mathbb{R}^d$ is a trajectory of the differential inclusion (trajectory) and $0 \notin G(z(0))$, there exists $T > 0$ satisfying
$$\phi(z(T)) < \sup_{t \in [0, T]} \phi(z(t)) \leq \phi(z(0)).$$
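For intuition, consider the smooth special case $G = -\nabla \phi$ with $\phi$ continuously differentiable (this computation is our own illustration, not part of the formal development). Along any trajectory $z(\cdot)$,

$$\frac{d}{dt}\,\phi(z(t)) = \langle \nabla \phi(z(t)), \dot{z}(t) \rangle = -\|\nabla \phi(z(t))\|^2 \leq 0,$$

with strict inequality whenever $\nabla \phi(z(t)) \neq 0$. Hence, if $0 \notin G(z(0))$, the value $\phi(z(t))$ strictly decreases near $t = 0$, and the descent property holds for any sufficiently small $T > 0$.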

Remark: The weak Sard property says that the values taken by $\phi$ on the equilibrium set $G^{-1}(0)$ miss a dense subset of $\mathbb{R}$. The descent property says that $\phi$ eventually strictly decreases along any trajectory of the differential inclusion $\dot{z}(t) \in G(z(t))$ emanating from a non-equilibrium point.

Theorem: Suppose the standing assumptions and the Lyapunov condition hold. Then every limit point of $\{x_k\}_{k \geq 1}$ lies in $G^{-1}(0)$, and the function values $\{\phi(x_k)\}_{k \geq 1}$ converge.
