Optimization for Machine Learning

Chapter 2: Unconstrained optimization

Definition 1.2.21 (Strongly convex function)

A function $f: \mathbb{R}^d \to \mathbb{R}$ in $C^1$ is $\mu$-strongly convex if for all $(u,v) \in (\mathbb{R}^d)^2$ and $t \in [0,1]$,

$$f(t u + (1-t)v) \leq t f(u) + (1-t) f(v) - \frac{\mu}{2} t (1-t) \|u-v\|^2.$$

Theorem 1.2.10 Let $f: \mathbb{R}^d \to \mathbb{R}$ be an element of $C^1$. Then the function $f$ is $\mu$-strongly convex if and only if

$$\forall u,v \in \mathbb{R}^d, \quad f(v) \geq f(u) + \nabla f(u)^T (v-u) + \frac{\mu}{2} \|v-u\|^2.$$

Proof. ($\Rightarrow$)

Let $t \in (0,1]$ and $w = (1-t) u + t v = u + t (v-u)$. By the definition of strong convexity, we have

$$\begin{aligned} f(w) &\leq (1-t) f(u) + t f(v) - \frac{\mu}{2} t (1-t) \|u-v\|^2\\ f(v) &\geq f(u) + \frac{f(w) - f(u)}{t} + \frac{\mu}{2} (1-t) \|u-v\|^2\\ f(v) &\geq f(u) + \frac{f(u + t(v-u)) - f(u)}{t} + \frac{\mu}{2} (1-t) \|u-v\|^2. \end{aligned}$$

Taking the limit $t \to 0^+$, and noting that $\frac{f(u + t(v-u)) - f(u)}{t} \to \nabla f(u)^T (v-u)$, we obtain

$$f(v) \geq f(u) + \nabla f(u)^T (v-u) + \frac{\mu}{2} \|v-u\|^2.$$

($\Leftarrow$)

Set $w = t u + (1-t) v$. Applying the assumed inequality with the pairs of points $(w,u)$ and $(w,v)$, we have

$$\begin{aligned} f(u) &\geq f(w) + \nabla f(w)^T (u-w) + \frac{\mu}{2} \|u-w\|^2\\ f(v) &\geq f(w) + \nabla f(w)^T (v-w) + \frac{\mu}{2} \|v-w\|^2. \end{aligned}$$

Multiplying the first inequality by $t$ and the second by $(1-t)$, and summing them, we obtain

$$\begin{aligned} t f(u) + (1-t) f(v) &\geq f(w) + \nabla f(w)^T \big(t(u-w) + (1-t)(v-w)\big) + \frac{\mu}{2} \left(t \|u-w\|^2 + (1-t) \|v-w\|^2\right)\\ t f(u) + (1-t) f(v) &\geq f(w) + \frac{\mu}{2} \left(t \|(1-t) (u-v)\|^2 + (1-t) \|t (v-u)\|^2\right)\\ t f(u) + (1-t) f(v) &\geq f(t u + (1-t) v) + \frac{\mu}{2} t (1-t) \|u-v\|^2\\ f(t u + (1-t) v) &\leq t f(u) + (1-t) f(v) - \frac{\mu}{2} t (1-t) \|u-v\|^2. \end{aligned}$$

Here the gradient term vanishes because $t(u-w) + (1-t)(v-w) = 0$, while $u-w = (1-t)(u-v)$ and $v-w = t(v-u)$, which gives the second and third lines.
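For example, consider the quadratic $f(w) = \frac{1}{2} w^T A w$ with $A$ symmetric positive definite. Since $\nabla f(u) = A u$, an exact expansion gives

$$f(v) = f(u) + \nabla f(u)^T (v-u) + \frac{1}{2} (v-u)^T A (v-u) \geq f(u) + \nabla f(u)^T (v-u) + \frac{\lambda_{\min}(A)}{2} \|v-u\|^2,$$

so, by Theorem 1.2.10, $f$ is $\mu$-strongly convex with $\mu = \lambda_{\min}(A)$.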

Theorem 1.2.2 (First order Taylor expansion)

Let $f \in C^{1,1}_L(\mathbb{R}^d)$ with $L > 0$, i.e., $f$ is continuously differentiable and $\nabla f$ is $L$-Lipschitz continuous. For any vectors $w,z \in \mathbb{R}^d$, one has

$$f(z) \leq f(w) + \nabla f(w)^T (z-w) + \frac{L}{2} \|z-w\|^2.$$

Proof. Applying the fundamental theorem of calculus to $t \mapsto f(w + t(z-w))$, observe that

$$f(z) - f(w) - \nabla f(w)^T (z-w) = \int_0^1 \big(\nabla f(w + t(z-w)) - \nabla f(w)\big)^T (z-w) \,dt.$$

Hence

$$\begin{aligned} \left|f(z) - f(w) - \nabla f(w)^T (z-w)\right| &= \left|\int_0^1 \big(\nabla f(w + t(z-w)) - \nabla f(w)\big)^T (z-w) \,dt\right|\\ &\leq \int_0^1 \|\nabla f(w + t(z-w)) - \nabla f(w)\|_2 \|z-w\|_2 \,dt\\ &\leq \int_0^1 L t \|z-w\|_2^2 \,dt\\ &= \frac{L}{2} \|z-w\|_2^2, \end{aligned}$$

where the first inequality uses the Cauchy–Schwarz inequality and the second the $L$-Lipschitz continuity of $\nabla f$.
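As a quick numerical illustration (a minimal sketch with an arbitrarily chosen one-dimensional test function), the bound of Theorem 1.2.2 can be checked on random pairs of points:

```python
import numpy as np

# Illustrative one-dimensional example: f(w) = log(1 + exp(w)).
# Its derivative sigma(w) = 1 / (1 + exp(-w)) is Lipschitz with L = 1/4,
# so Theorem 1.2.2 gives f(z) <= f(w) + f'(w) (z - w) + (1/8) (z - w)^2.
f = lambda w: np.logaddexp(0.0, w)           # numerically stable log(1 + e^w)
fprime = lambda w: 1.0 / (1.0 + np.exp(-w))
L = 0.25

rng = np.random.default_rng(0)
for w, z in rng.uniform(-10, 10, size=(100, 2)):
    assert f(z) <= f(w) + fprime(w) * (z - w) + 0.5 * L * (z - w) ** 2 + 1e-12
print("Upper bound of Theorem 1.2.2 verified on random pairs (w, z).")
```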

Unconstrained optimization

$$\min_{w \in \mathbb{R}^d} f(w)$$

Assumption 2.0.1

The objective function $f$ is

  1. $C^{1,1}_L(\mathbb{R}^d)$ for $L > 0$,
  2. bounded below by $f_{\text{low}} \in \mathbb{R}$ (i.e. $f(w) \geq f_{\text{low}}$ for all $w \in \mathbb{R}^d$).

2.1 Gradient descent

For any $w \in \mathbb{R}^d$, two cases may occur:

  1. $\nabla f(w) = 0$. Then $w$ is possibly a local minimum of $f$.
  2. $\nabla f(w) \neq 0$. Then we can show that $f$ can be decreased by moving $w$ in the direction $-\nabla f(w)$.

2.1.1 Algorithm

Algorithm 1: Gradient descent for minimizing the function $f$.

  1. Initialization: Choose $w_0 \in \mathbb{R}^d$.
  2. For $k = 0,1,2,\dots$ do
    1. Compute the gradient $\nabla f(w_k)$.
    2. Define a step size $\alpha_k > 0$.
    3. Set $w_{k+1} = w_k - \alpha_k \nabla f(w_k)$.
  3. End for

Algorithm 1 actually describes a framework rather than a specific method; there exist numerous variants of the gradient descent paradigm.
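As an illustration, here is a minimal NumPy sketch of this framework (the objective, gradient, step-size rule, and iteration budget below are placeholder choices, not part of the original notes):

```python
import numpy as np

def gradient_descent(grad, w0, step_size, n_iters=100):
    """Algorithm 1: iterate w_{k+1} = w_k - alpha_k * grad(w_k).

    grad      : callable returning the gradient at a point
    w0        : initial iterate
    step_size : callable k -> alpha_k, so constant, decreasing,
                or adaptive choices all fit the same framework
    """
    w = np.asarray(w0, dtype=float)
    for k in range(n_iters):
        g = grad(w)              # step 2.1: compute the gradient
        alpha = step_size(k)     # step 2.2: define a step size
        w = w - alpha * g        # step 2.3: gradient step
    return w

# Example usage on f(w) = 0.5 * ||w||^2, whose gradient is w, with a constant step size.
w_final = gradient_descent(grad=lambda w: w, w0=np.ones(3), step_size=lambda k: 0.5)
```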

Why choose $-\nabla f(w_k)$ as a descent direction?

A vector $v \in \mathbb{R}^d$ is called a direction of descent of a function $f: \mathbb{R}^d \to \mathbb{R}$ at $x \in \mathbb{R}^d$ if there exists $\delta > 0$ such that $f(x + \alpha v) < f(x)$ for all $0 < \alpha < \delta$.

Clearly, if

$$f'(x;v) := \lim_{\alpha \to 0^+} \frac{f(x + \alpha v) - f(x)}{\alpha} < 0,$$

then $v$ is a direction of descent.

Let $f: \mathbb{R}^d \to \mathbb{R}$ be differentiable at $x \in \mathbb{R}^d$ with $\nabla f(x) \neq 0$. Then the optimal solution to the problem

$$\min_{d} \{ f'(x;d) : \|d\| \leq 1 \}$$

is given by $d^* = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$. Thus $-\nabla f(x)$ is a direction of steepest descent of $f$ at $x$.

Proof. From differentiability of $f$ at $x$, we have $f'(x;d) = \lim_{\alpha \to 0^+} \frac{f(x + \alpha d) - f(x)}{\alpha} = \nabla f(x)^T d$.

By the Cauchy–Schwarz inequality, we have $\nabla f(x)^T d \geq -\|\nabla f(x)\| \|d\| \geq -\|\nabla f(x)\|$ for all $\|d\| \leq 1$, with equality holding if and only if $d = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$.
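This is easy to check numerically; the sketch below (with an arbitrary test gradient and random sampling, purely for illustration) verifies that no random unit direction achieves a smaller directional derivative than the normalized negative gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Directional derivative of a differentiable f at x along d is grad_f(x) @ d.
x = rng.standard_normal(5)
grad_x = 2.0 * x                              # e.g. f(x) = ||x||^2, so grad f(x) = 2x
d_star = -grad_x / np.linalg.norm(grad_x)     # steepest-descent direction

best_random = min(
    grad_x @ (d / np.linalg.norm(d))          # directional derivative along a unit vector
    for d in rng.standard_normal((1000, 5))
)
assert grad_x @ d_star <= best_random         # no sampled direction does better
assert np.isclose(grad_x @ d_star, -np.linalg.norm(grad_x))
```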

Stopping criterion

What we can monitor to decide when to stop the algorithm:

  1. whether the method converged to a solution
  2. whether the method is making sufficient progress

Methods:

  1. Norm of the gradient: stop if $\|\nabla f(w_k)\| < \epsilon$.
  2. Variation of the iterates: stop if $\|w_{k+1} - w_k\| < \epsilon$.
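For instance, a gradient descent loop implementing both tests might look like the following sketch (the tolerance values and the way the two tests are combined are illustrative choices):

```python
import numpy as np

def gradient_descent_with_stopping(grad, w0, alpha, tol=1e-6, max_iters=10_000):
    """Gradient descent that stops on a small gradient norm or a small change in the iterate."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:              # criterion 1: norm of the gradient
            break
        w_next = w - alpha * g
        if np.linalg.norm(w_next - w) < tol:     # criterion 2: variation of the iterates
            w = w_next
            break
        w = w_next
    return w
```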

Choose the initial point

  1. Random initialization
  2. Using a suitable initial guess based on prior knowledge of the problem (may be close to a local minimum)

2.1.2 Choose step size

1. Constant step size

Provided $f$ satisfies Assumption 2.0.1, there exists an interval of values that lead to convergence of gradient descent. In particular, the choice

$$\alpha_k = \alpha = \frac{1}{L},$$

where $L$ is the Lipschitz constant for the gradient, is well suited for that problem.
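For a least-squares objective, this constant can be computed explicitly; the sketch below (with random placeholder data) illustrates the choice $\alpha = 1/L$:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)

# f(w) = 0.5 * ||X w - y||^2 has gradient X^T (X w - y),
# which is Lipschitz continuous with constant L = lambda_max(X^T X).
L = np.linalg.eigvalsh(X.T @ X).max()
alpha = 1.0 / L                      # constant step size alpha_k = 1/L

w = np.zeros(20)
for _ in range(500):
    w = w - alpha * (X.T @ (X @ w - y))
```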

2. Decreasing step size

Another classical technique for selecting the step size consists in defining, prior to running the method, a decreasing sequence $\{\alpha_k\}$ such that $\alpha_k \to 0$. This choice can also lead to a convergent method, but it risks producing steps that are unnecessarily small in norm. In fact, a good decreasing strategy should drive $\alpha_k$ to 0 quickly enough for convergence, but slowly enough that the norms of the steps do not approach 0 too rapidly. An example of such a schedule is given below.
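A commonly used family of schedules, shown here purely as an illustration, is

$$\alpha_k = \frac{\alpha_0}{(k+1)^p}, \qquad p \in (0, 1],$$

where a larger exponent $p$ makes the step sizes shrink faster; $p = 1$ and $p = 1/2$ are typical choices.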

3. Adaptive step size using a line search

Algorithm 2: Backtracking line search in direction $d$.

  1. Inputs: $w \in \mathbb{R}^d$, $d \in \mathbb{R}^d$, $\alpha_0 > 0$.
  2. Initialization: Choose $\alpha = \alpha_0$.
  3. While $f(w + \alpha d) > f(w)$
  4. Set $\alpha \leftarrow \alpha/2$.
  5. End while
  6. Output: $\alpha$.
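A direct NumPy translation of Algorithm 2 might look as follows (a sketch; the cap on the number of halvings is an added safeguard, not part of the algorithm as stated):

```python
import numpy as np

def backtracking_line_search(f, w, d, alpha0=1.0, max_halvings=50):
    """Algorithm 2: halve alpha until the step produces a decrease f(w + alpha d) <= f(w)."""
    alpha = alpha0
    for _ in range(max_halvings):          # safeguard so the loop always terminates
        if f(w + alpha * d) <= f(w):       # negation of the while-condition of Algorithm 2
            break
        alpha /= 2.0
    return alpha

# Example usage with f(w) = 0.5 * ||w||^2 and the steepest-descent direction d = -grad f(w).
f = lambda w: 0.5 * float(np.dot(w, w))
w = np.array([3.0, -4.0])
alpha = backtracking_line_search(f, w, d=-w)
```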

2.1.3 Theoretical analysis of gradient descent

Proposition 2.1.1 Consider the $k$-th iteration of Algorithm 1 applied to $f \in C^{1,1}_L(\mathbb{R}^d)$, and assume that $\nabla f(w_k) \neq 0$. Then, if $0 < \alpha_k < \frac{2}{L}$, we have

$$f(w_k - \alpha_k \nabla f(w_k)) < f(w_k).$$

If we choose $\alpha_k = 1/L$, then

$$f(w_{k+1}) \leq f(w_k) - \frac{1}{2L} \|\nabla f(w_k)\|^2.$$

Proof. By Theorem 1.2.2, we have

$$\begin{aligned} f(w_{k+1}) &\leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \frac{L}{2} \|w_{k+1} - w_k\|^2\\ &= f(w_k) - \alpha_k \nabla f(w_k)^T \nabla f(w_k) + \frac{L}{2} \alpha_k^2 \|\nabla f(w_k)\|^2\\ &= f(w_k) - \left(\alpha_k - \frac{L}{2} \alpha_k^2\right) \|\nabla f(w_k)\|^2. \end{aligned}$$

In order to ensure that the function value decreases at each iteration, we need to choose $\alpha_k$ such that $\alpha_k - \frac{L}{2} \alpha_k^2 > 0$, which is equivalent to $0 < \alpha_k < 2/L$. For the particular choice $\alpha_k = 1/L$, the coefficient equals $\frac{1}{L} - \frac{1}{2L} = \frac{1}{2L}$, which gives the second inequality.
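The guaranteed decrease can also be observed numerically; the sketch below (with an arbitrary quadratic test problem) checks the inequality $f(w_{k+1}) \leq f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|^2$ along a run with $\alpha_k = 1/L$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A.T @ A + np.eye(10)            # symmetric positive definite Hessian
b = rng.standard_normal(10)

f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
L = np.linalg.eigvalsh(A).max()     # Lipschitz constant of the gradient

w = rng.standard_normal(10)
for _ in range(100):
    g = grad(w)
    w_next = w - g / L              # gradient step with alpha_k = 1/L
    assert f(w_next) <= f(w) - g @ g / (2 * L) + 1e-9   # Proposition 2.1.1
    w = w_next
```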

The result of Proposition 2.1.1 will be instrumental in obtaining complexity guarantees on Algorithm 1 in three different settings: nonconvex, convex, and strongly convex.

Nonconvex case

In the nonconvex case, we aim at bounding the number of iterations required to drive the gradient norm below some threshold $\epsilon > 0$: this means that we should be able to show that the gradient norm actually goes below this threshold, which is a guarantee of convergence.

Why focus on the number of iterations?

The number of iterations is critical in nonconvex optimization for the following reasons:

1. Limited computational resources: Real-world applications have constraints on time and resources (e.g., GPU, memory). If an algorithm requires too many iterations to reduce the gradient norm below $\epsilon$, it may be impractical due to high costs or delays.

2. Measure of convergence speed: The number of iterations reflects how quickly an algorithm converges. Ideally, we want fewer iterations to achieve $\|\nabla f\| < \epsilon$. Analyzing iteration count helps compare algorithm efficiency.

3. Theoretical guarantee: Proving an upper bound on iterations provides a theoretical assurance of convergence. This guides practical implementation, such as parameter tuning or estimating runtime.

Theorem 2.1.1 (Complexity of gradient descent for nonconvex functions)

Let $f \in C^{1,1}_L(\mathbb{R}^d)$ satisfy Assumption 2.0.1. Suppose that we apply Algorithm 1 with a constant step size $\alpha_k = 1/L$. Then, for any $K \geq 1$, we have

$$\min_{0\leq k\leq K-1} \|\nabla f(w_k)\| \leq \mathcal{O}\left(\frac{1}{\sqrt{K}}\right).$$

Proof. Fix an iteration index $K \geq 1$.

From Proposition 2.1.1, we have, for all $k = 0,\dots,K-1$,

$$\begin{aligned} f(w_{k+1}) &\leq f(w_k) - \frac{1}{2L} \|\nabla f(w_k)\|^2\\ &\leq f(w_k) - \frac{1}{2L} \left(\min_{0\leq j\leq K-1} \|\nabla f(w_j)\|\right)^2. \end{aligned}$$

Summing these inequalities over $k = 0,\dots,K-1$, we obtain

$$\sum_{k=0}^{K-1} f(w_{k+1}) \leq \sum_{k=0}^{K-1} f(w_k) - \frac{K}{2L} \left(\min_{0\leq j\leq K-1} \|\nabla f(w_j)\|\right)^2.$$

Removing identical terms on both sides (the sum telescopes), we get

$$f(w_K) \leq f(w_0) - \frac{K}{2L} \left(\min_{0\leq j\leq K-1} \|\nabla f(w_j)\|\right)^2.$$

Using the assumption that $f(w_K) \geq f_{\text{low}}$, we obtain

$$\min_{0\leq k\leq K-1} \|\nabla f(w_k)\| \leq \left(\frac{2L (f(w_0) - f_{\text{low}})}{K}\right)^{1/2} = \mathcal{O}\left(\frac{1}{\sqrt{K}}\right).$$

Equivalently, we say that the worst-case complexity of gradient descent is $\mathcal{O}(\epsilon^{-2})$: at most on the order of $\epsilon^{-2}$ iterations are needed to drive the gradient norm below $\epsilon$.

If we set the stopping criterion to be $\|\nabla f(w_k)\| < \epsilon$, a similar argument guarantees that the number of iterations $K$ required to meet this criterion satisfies

$$K \leq \frac{2L\,(f(w_0) - f_{\text{low}})}{\epsilon^2},$$

since, as long as $\|\nabla f(w_k)\| \geq \epsilon$ for all $k = 0,\dots,K-1$, the bound above forces $\epsilon \leq \left(\frac{2L (f(w_0) - f_{\text{low}})}{K}\right)^{1/2}$.
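As a sanity check, the sketch below (using an arbitrary smooth nonconvex test function with known $L$ and $f_{\text{low}}$) verifies numerically that the smallest gradient norm observed after $K$ iterations indeed stays below $\sqrt{2L(f(w_0) - f_{\text{low}})/K}$:

```python
import numpy as np

# Nonconvex test function f(w) = sum_i (1 - cos(w_i)), bounded below by f_low = 0,
# with gradient sin(w), which is 1-Lipschitz (so L = 1).
f = lambda w: np.sum(1.0 - np.cos(w))
grad = lambda w: np.sin(w)
L, f_low = 1.0, 0.0

w = np.full(10, 2.0)                 # arbitrary starting point
f0 = f(w)
best_grad_norm = np.inf

for K in range(1, 1001):
    g = grad(w)
    best_grad_norm = min(best_grad_norm, np.linalg.norm(g))
    bound = np.sqrt(2 * L * (f0 - f_low) / K)
    assert best_grad_norm <= bound + 1e-12    # bound of Theorem 2.1.1 after K iterations
    w = w - g / L                             # gradient step with alpha_k = 1/L
```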
