Stochastic subgradient method converges on tame functions
Introduction
Motivation
Though variants of the stochastic subgradient method date back to the pioneering 1951 work of Robbins and Monro, their convergence behavior is still largely not understood in nonsmooth and nonconvex settings. In particular, the following question remains open.
Does the (stochastic) subgradient method have any convergence guarantees on locally Lipschitz functions, which may be neither smooth nor convex?
That this question remains unanswered is somewhat concerning as the stochastic subgradient method forms a core numerical subroutine for several widely used solvers.
Main result
This paper provides a positive answer to this question for a wide class of locally Lipschitz functions $f$. Aside from mild technical conditions, the only meaningful assumption we make is that $f$ strictly decreases along any trajectory of the subgradient differential inclusion $\dot{\gamma}(t) \in -\partial f(\gamma(t))$ emanating from a noncritical point. Under this assumption, a standard Lyapunov-type argument shows that every limit point of the stochastic subgradient method is critical for $f$, almost surely.
The outline of this paper is as follows. In Section 2, we fix the notation for the rest of the manuscript. Section 3 provides a self-contained treatment of asymptotic consistency for discrete approximations of differential inclusions. In Section 4, we specialize the results of the previous section to the stochastic subgradient method. In Section 5, we verify the sufficient conditions for subsequential convergence for a broad class of locally Lipschitz functions, including those that are subdifferentially regular and Whitney stratifiable. In particular, we specialize our results to deep learning settings. In the final Section 6, we extend the results of the previous sections to the proximal setting.
Preliminaries
Absolutely continuous curves
Definition (Uniform convergence): Let $K \subseteq \mathbb{R}^d$ be a non-empty subset. A sequence of functions $f_i \colon K \to \mathbb{R}^m$ is said to converge uniformly on $K$ to a function $f \colon K \to \mathbb{R}^m$ if for any $\varepsilon > 0$, there exists $N \in \mathbb{N}$ such that $|f_i(x) - f(x)| < \varepsilon$ for all $i \ge N$ and all $x \in K$.
Definition (Locally Lipschitz functions): A function $f \colon \mathbb{R}^d \to \mathbb{R}^m$ is said to be locally Lipschitz if for each bounded subset $B \subseteq \mathbb{R}^d$, there exists a constant $L \ge 0$ such that for any $x, y \in B$, we have $|f(x) - f(y)| \le L\,|x - y|$. (The vertical bars $|\cdot|$ denote the Euclidean norm.)
Any continuous function $\gamma \colon \mathbb{R}_+ \to \mathbb{R}^d$ is called a curve in $\mathbb{R}^d$. All curves in $\mathbb{R}^d$ comprise the set $C(\mathbb{R}_+, \mathbb{R}^d)$. We will say that a sequence of functions $\gamma_i \in C(\mathbb{R}_+, \mathbb{R}^d)$ converges to $\gamma$ in $C(\mathbb{R}_+, \mathbb{R}^d)$ if $\gamma_i$ converge to $\gamma$ uniformly on compact intervals, that is, for all $T > 0$, we have
$$\sup_{t \in [0, T]} |\gamma_i(t) - \gamma(t)| \to 0 \qquad \text{as } i \to \infty.$$
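To make this mode of convergence concrete, here is a small numerical illustration (the helper `sup_dist_on_interval` is ours, not from the manuscript): the curves $\gamma_i(t) = t/i$ converge to the zero curve in $C(\mathbb{R}_+, \mathbb{R})$, even though their supremum distance over all of $\mathbb{R}_+$ is infinite.

```python
def sup_dist_on_interval(gamma_i, gamma, T, grid=10001):
    """Approximate sup_{t in [0, T]} |gamma_i(t) - gamma(t)| on a uniform grid."""
    return max(abs(gamma_i(j * T / (grid - 1)) - gamma(j * T / (grid - 1)))
               for j in range(grid))

zero_curve = lambda t: 0.0
curves = [lambda t, i=i: t / i for i in range(1, 6)]  # gamma_i(t) = t / i

# On the compact interval [0, 10] the sup-distance of gamma_i from the zero
# curve is 10 / i, which tends to 0 as i grows; over all of R_+ it is infinite,
# so the convergence is uniform on compact intervals only.
dists = [sup_dist_on_interval(c, zero_curve, T=10.0) for c in curves]
```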
Definition (Absolute continuity): Let $I$ be an interval in the real line $\mathbb{R}$. A function $\gamma \colon I \to \mathbb{R}^d$ is absolutely continuous on $I$ if for every positive number $\varepsilon$, there is a positive number $\delta$ such that whenever a finite sequence of pairwise disjoint sub-intervals $(a_k, b_k)$ of $I$ satisfies $\sum_k (b_k - a_k) < \delta$, then $\sum_k |\gamma(b_k) - \gamma(a_k)| < \varepsilon$.
The following conditions on a real-valued function $\gamma$ on a compact interval $[a, b]$ are equivalent:
- $\gamma$ is absolutely continuous on $[a, b]$.
- There exists a Lebesgue integrable function $g$ on $[a, b]$ such that for all $t \in [a, b]$,
$$\gamma(t) = \gamma(a) + \int_a^t g(s)\, ds.$$
Moreover, if this is the case, then the equality $\dot{\gamma}(t) = g(t)$ holds for a.e. $t \in [a, b]$. Henceforth, for brevity, we will call absolutely continuous curves arcs. We will often use the observation that if $f \colon \mathbb{R}^d \to \mathbb{R}$ is locally Lipschitz continuous and $\gamma$ is an arc, then the composition $f \circ \gamma$ is absolutely continuous.
Proof: On any compact interval, the image of the arc $\gamma$ is compact, so the locally Lipschitz function $f$ is Lipschitz there; and the composition of a Lipschitz function with an absolutely continuous map is absolutely continuous.
Set-valued maps and the Clarke subdifferential
A set-valued map $G \colon \mathbb{R}^d \rightrightarrows \mathbb{R}^m$ is a mapping from $\mathbb{R}^d$ to the power set of $\mathbb{R}^m$. We will use the notation
$$G^{-1}(y) := \{x \in \mathbb{R}^d : y \in G(x)\}$$
for the preimage of a vector $y$.
Definition (Outer-semicontinuity): The map $G \colon \mathbb{R}^d \rightrightarrows \mathbb{R}^m$ is outer-semicontinuous at a point $\bar{x} \in \mathbb{R}^d$ if for any sequence $x_i \to \bar{x}$ and any sequence $y_i \in G(x_i)$ converging to some $\bar{y}$, the inclusion $\bar{y} \in G(\bar{x})$ holds.
Theorem (Rademacher's theorem): If $U$ is an open subset of $\mathbb{R}^d$ and $f \colon U \to \mathbb{R}^m$ is Lipschitz continuous, then $f$ is differentiable almost everywhere in $U$; that is, the points in $U$ at which $f$ is not differentiable form a set of Lebesgue measure zero.
Consider a locally Lipschitz continuous function $f \colon \mathbb{R}^d \to \mathbb{R}$. The well-known Rademacher's theorem guarantees that $f$ is differentiable almost everywhere. Taking this into account, the Clarke subdifferential of $f$ at any point $x$ is the set
$$\partial f(x) := \operatorname{conv}\Big\{ \lim_{i \to \infty} \nabla f(x_i) : x_i \to x,\ x_i \in \Omega \Big\},$$
where $\Omega$ is any full-measure subset of $\mathbb{R}^d$ on which $f$ is differentiable (the definition does not depend on the choice of $\Omega$). It is known that the Clarke subdifferential $\partial f(x)$ is a nonempty, compact, convex set for all $x$ and that the map $x \mapsto \partial f(x)$ is outer-semicontinuous.
Analogously to the smooth setting, a point $x$ is called (Clarke) critical for $f$ whenever the inclusion $0 \in \partial f(x)$ holds.
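For a concrete instance of these definitions (a standard textbook example, not taken from this manuscript), one can compute the Clarke subdifferential of the absolute value function:

```latex
% Worked example: f(x) = |x| on \mathbb{R}. Off the origin f is smooth with
% f'(x) = \operatorname{sign}(x), so taking \Omega = \mathbb{R}\setminus\{0\}:
\partial f(x) =
\begin{cases}
  \{\operatorname{sign}(x)\}, & x \neq 0, \\
  \operatorname{conv}\{-1, 1\} = [-1, 1], & x = 0.
\end{cases}
% Since 0 \in \partial f(0) = [-1, 1], the origin is (Clarke) critical,
% matching the fact that it is the global minimizer of f.
```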
Differential inclusions and discrete approximations
Functional convergence of discrete approximations
Definition (Trajectory): Let $K \subseteq \mathbb{R}^d$ be a closed set and let $G \colon K \rightrightarrows \mathbb{R}^d$ be a set-valued map. Then an arc $\gamma \colon \mathbb{R}_+ \to \mathbb{R}^d$ is called a trajectory of $G$ if it satisfies the differential inclusion
$$\dot{\gamma}(t) \in G(\gamma(t)) \quad \text{for a.e. } t \ge 0. \qquad \text{(trajectory)}$$
In this work, we will primarily focus on iterative algorithms that aim to asymptotically track a trajectory of (trajectory).
Throughout, we will consider the following iteration sequence:
$$x_{k+1} = x_k + \alpha_k\,(y_k + \xi_k), \qquad k = 0, 1, 2, \ldots \qquad \text{(iteration)}$$
Here $\{\alpha_k\}$ is a sequence of step-sizes, $y_k$ should be thought of as an approximate evaluation of $G$ at some point near $x_k$, and $\{\xi_k\}$ is a sequence of "errors".
Our goal is to isolate reasonable conditions under which the iterate sequence $\{x_k\}$ asymptotically tracks a trajectory of (trajectory).
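As a minimal illustrative sketch (not taken from the paper; the helper names are ours), here is the recursion $x_{k+1} = x_k + \alpha_k (y_k + \xi_k)$ applied to the nonsmooth function $f(x) = |x|$ with $G = -\partial f$, using steps $\alpha_k = 1/(k+1)$ and Gaussian noise:

```python
import random

def neg_subgrad_abs(x):
    """A selection of G(x) = -(Clarke subdifferential of |x|): -sign(x) away
    from the kink, and an arbitrary element 0 of [-1, 1] at x = 0."""
    if x > 0:
        return -1.0
    if x < 0:
        return 1.0
    return 0.0

def stochastic_subgradient(x0, n_iters=20000, seed=0):
    """Run x_{k+1} = x_k + a_k (y_k + xi_k) with a_k = 1/(k+1), a step-size
    sequence that is nonnegative, square summable, and not summable."""
    rng = random.Random(seed)
    x = x0
    for k in range(n_iters):
        a_k = 1.0 / (k + 1)
        y_k = neg_subgrad_abs(x)    # approximate evaluation of G near x_k
        xi_k = rng.gauss(0.0, 1.0)  # zero-mean error; sum a_k * xi_k converges a.s.
        x = x + a_k * (y_k + xi_k)
    return x

x_final = stochastic_subgradient(x0=5.0)  # drifts toward the critical point 0
```

Because the steps are square summable, the weighted noise sums converge, and the iterates settle near the unique critical point of $|x|$.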
Assumption (Standing assumptions):
- All limit points of the iterate sequence $\{x_k\}$ lie in the closed set $K$ on which $G$ is defined.
- The iterates and the evaluations are bounded, i.e., $\sup_{k \ge 0} |x_k| < \infty$ and $\sup_{k \ge 0} |y_k| < \infty$.
- The step-size sequence $\{\alpha_k\}$ is nonnegative, square summable, but not summable:
$$\alpha_k \ge 0, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty, \qquad \sum_{k=0}^{\infty} \alpha_k = \infty.$$
- The weighted noise sequence is convergent: $\sum_{i=0}^{k} \alpha_i \xi_i \to v$ for some $v \in \mathbb{R}^d$ as $k \to \infty$.
- For any unbounded increasing sequence of indices $\{n_k\} \subseteq \mathbb{N}$ such that $x_{n_k}$ converges to some point $\bar{x}$, it holds:
$$\lim_{N \to \infty} \operatorname{dist}\!\left( \frac{1}{N} \sum_{k=1}^{N} y_{n_k},\; G(\bar{x}) \right) = 0.$$
Remark: Some comments are in order. Conditions 1, 2, and 3 are in some sense minimal, though the boundedness condition must be checked for each particular algorithm. Condition 4 guarantees that the noise sequence $\{\xi_k\}$ does not grow too quickly relative to the rate at which the step-sizes $\alpha_k$ decrease. The key Condition 5 summarizes the sense in which the values $y_k$ are approximate evaluations of $G$ along the iterates, up to convexification.
Define the time points $t_0 = 0$ and $t_k = \sum_{i=0}^{k-1} \alpha_i$ for $k \ge 1$. Let $\bar{x}(\cdot) \colon \mathbb{R}_+ \to \mathbb{R}^d$ be the linear interpolation of the discrete path:
$$\bar{x}(t) = x_k + \frac{t - t_k}{t_{k+1} - t_k}\,(x_{k+1} - x_k) \qquad \text{for } t \in [t_k, t_{k+1}].$$
For each index $k$, define the time-shifted curve $\bar{x}_k(t) := \bar{x}(t_k + t)$.
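The interpolated process and its time shifts can be sketched in a few lines of code (an illustrative helper of our own, mirroring the formulas above):

```python
import bisect

def interpolated_path(xs, alphas):
    """Given iterates x_0, ..., x_n and positive step sizes a_0, ..., a_{n-1},
    return the time points t_k = a_0 + ... + a_{k-1} (with t_0 = 0) and the
    piecewise-linear curve t -> xbar(t) passing through the points (t_k, x_k)."""
    ts = [0.0]
    for a in alphas:
        ts.append(ts[-1] + a)

    def xbar(t):
        # locate the interval [t_k, t_{k+1}] containing t (clamping at the ends,
        # so inputs beyond t_n extrapolate the final segment)
        k = max(0, min(bisect.bisect_right(ts, t) - 1, len(xs) - 2))
        w = (t - ts[k]) / (ts[k + 1] - ts[k])
        return xs[k] + w * (xs[k + 1] - xs[k])

    return ts, xbar

ts, xbar = interpolated_path(xs=[0.0, 1.0, 0.5], alphas=[1.0, 0.5])
xbar_1 = lambda t: xbar(ts[1] + t)  # time-shifted curve: xbar_1(t) = xbar(t_1 + t)
```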
Theorem (Functional approximation): Suppose Assumption A holds. Then for any sequence of times $\{\tau_k\} \subseteq \mathbb{R}_+$, the set of functions $\{\bar{x}(\tau_k + \cdot)\}_{k \in \mathbb{N}}$ is relatively compact in $C(\mathbb{R}_+, \mathbb{R}^d)$. If in addition $\tau_k \to \infty$ as $k \to \infty$, then all the limit points of $\{\bar{x}(\tau_k + \cdot)\}$ in $C(\mathbb{R}_+, \mathbb{R}^d)$ are trajectories of the differential inclusion (trajectory).
Subsequential convergence to equilibrium points
A primary application of the discrete process (iteration) is to solve the inclusion $0 \in G(x)$.
Assumption (Lyapunov condition): There exists a continuous function $f \colon \mathbb{R}^d \to \mathbb{R}$, which is bounded from below, and such that the following two properties hold.
- (Weak Sard) For a dense set of values $r \in \mathbb{R}$, the intersection $f^{-1}(r) \cap G^{-1}(0)$ is empty.
- (Descent) Whenever $z(\cdot)$ is a trajectory of the differential inclusion (trajectory) and $0 \notin G(z(0))$, there exists a real $T > 0$ satisfying
$$f(z(T)) < \sup_{t \in [0, T]} f(z(t)) \le f(z(0)).$$
Remark: The weak Sard property says that the set of noncritical values of $f$ is dense in $\mathbb{R}$. The descent property says that $f$ eventually strictly decreases along the trajectories of the differential inclusion emanating from any non-equilibrium point.
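For intuition, the descent property can be verified directly in the smooth special case $G = -\nabla f$ with $f$ continuously differentiable (a standard computation under our own simplifying assumption, not a claim about the general setting):

```latex
% Along any trajectory of \dot z(t) = -\nabla f(z(t)):
\frac{d}{dt} f(z(t)) = \langle \nabla f(z(t)), \dot z(t) \rangle
                     = -\|\nabla f(z(t))\|^2 \le 0,
% so f is nonincreasing along z. If 0 \notin G(z(0)), i.e. \nabla f(z(0)) \neq 0,
% then by continuity \|\nabla f(z(t))\|^2 stays positive near t = 0, and
% integrating the identity above gives f(z(T)) < f(z(0)) for small T > 0.
```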
Theorem: Suppose Assumptions A and B hold. Then every limit point of the iterates $\{x_k\}$ lies in the set $G^{-1}(0)$, and the function values $f(x_k)$ converge.