Optimization for Machine Learning

Chapter 2: Unconstrained optimization

Definition 1.2.21 (Strongly convex function)

A function $f: \mathbb{R}^d \to \mathbb{R}$ in $C^1$ is $\mu$-strongly convex if for all $(u,v) \in (\mathbb{R}^d)^2$ and $t \in [0,1]$,

$$f(t u + (1-t)v) \leq t f(u) + (1-t) f(v) - \frac{\mu}{2} t (1-t) \|u-v\|^2.$$

Theorem 1.2.10 Let $f: \mathbb{R}^d \to \mathbb{R}$ be an element of $C^1$. Then the function $f$ is $\mu$-strongly convex if and only if

$$\forall u,v \in \mathbb{R}^d, \quad f(v) \geq f(u) + \nabla f(u)^T (v-u) + \frac{\mu}{2} \|v-u\|^2.$$

Proof. ($\Rightarrow$)

Let $t \in (0,1]$ and $w = (1-t) u + t v = u + t (v-u)$. By the definition of strong convexity, we have

$$\begin{aligned} f(w) &\leq (1-t) f(u) + t f(v) - \frac{\mu}{2} t (1-t) \|u-v\|^2\\ f(v) &\geq f(u) + \frac{f(w) - f(u)}{t} + \frac{\mu}{2} (1-t) \|u-v\|^2\\ f(v) &\geq f(u) + \frac{f(u + t(v-u)) - f(u)}{t} + \frac{\mu}{2} (1-t) \|u-v\|^2. \end{aligned}$$

Taking the limit $t \to 0^+$, and noting that $\frac{f(u + t(v-u)) - f(u)}{t} \to \nabla f(u)^T (v-u)$, we obtain

$$f(v) \geq f(u) + \nabla f(u)^T (v-u) + \frac{\mu}{2} \|v-u\|^2.$$

($\Leftarrow$)

Set $w = t u + (1-t) v$. Applying the assumed inequality with the pairs of points $(w,u)$ and $(w,v)$, we have

$$\begin{aligned} f(u) &\geq f(w) + \nabla f(w)^T (u-w) + \frac{\mu}{2} \|u-w\|^2\\ f(v) &\geq f(w) + \nabla f(w)^T (v-w) + \frac{\mu}{2} \|v-w\|^2. \end{aligned}$$

Multiplying the first inequality by $t$ and the second by $(1-t)$, and summing them, we obtain

$$\begin{aligned} t f(u) + (1-t) f(v) &\geq f(w) + \nabla f(w)^T \big(t(u-w) + (1-t)(v-w)\big) + \frac{\mu}{2} \left(t \|u-w\|^2 + (1-t) \|v-w\|^2\right)\\ t f(u) + (1-t) f(v) &\geq f(w) + \frac{\mu}{2} \left(t \|(1-t) (u-v)\|^2 + (1-t) \|t (v-u)\|^2\right)\\ t f(u) + (1-t) f(v) &\geq f(t u + (1-t) v) + \frac{\mu}{2} t (1-t) \|u-v\|^2\\ f(t u + (1-t) v) &\leq t f(u) + (1-t) f(v) - \frac{\mu}{2} t (1-t) \|u-v\|^2. \end{aligned}$$

Here the gradient term vanishes because $t(u-w) + (1-t)(v-w) = 0$, while $u-w = (1-t)(u-v)$ and $v-w = t(v-u)$, which gives the second and third lines.
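For example, consider the quadratic $f(w) = \frac{1}{2} w^T A w$ with $A$ symmetric positive definite. Since $\nabla f(u) = A u$, an exact expansion gives

$$f(v) = f(u) + \nabla f(u)^T (v-u) + \frac{1}{2} (v-u)^T A (v-u) \geq f(u) + \nabla f(u)^T (v-u) + \frac{\lambda_{\min}(A)}{2} \|v-u\|^2,$$

so, by Theorem 1.2.10, $f$ is $\mu$-strongly convex with $\mu = \lambda_{\min}(A)$.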

Theorem 1.2.2 (First order Taylor expansion)

Let $f \in C^{1,1}_L(\mathbb{R}^d)$ with $L > 0$, i.e., $f$ is continuously differentiable and $\nabla f$ is $L$-Lipschitz continuous. For any vectors $w,z \in \mathbb{R}^d$, one has

$$f(z) \leq f(w) + \nabla f(w)^T (z-w) + \frac{L}{2} \|z-w\|^2.$$

Proof. Applying the fundamental theorem of calculus to $t \mapsto f(w + t(z-w))$, observe that

$$f(z) - f(w) - \nabla f(w)^T (z-w) = \int_0^1 \big(\nabla f(w + t(z-w)) - \nabla f(w)\big)^T (z-w) \,dt.$$

Hence

$$\begin{aligned} \left|f(z) - f(w) - \nabla f(w)^T (z-w)\right| &= \left|\int_0^1 \big(\nabla f(w + t(z-w)) - \nabla f(w)\big)^T (z-w) \,dt\right|\\ &\leq \int_0^1 \|\nabla f(w + t(z-w)) - \nabla f(w)\|_2 \|z-w\|_2 \,dt\\ &\leq \int_0^1 L t \|z-w\|_2^2 \,dt\\ &= \frac{L}{2} \|z-w\|_2^2, \end{aligned}$$

where the first inequality uses the Cauchy–Schwarz inequality and the second the $L$-Lipschitz continuity of $\nabla f$.
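As a quick numerical illustration (a minimal sketch with an arbitrarily chosen one-dimensional test function), the bound of Theorem 1.2.2 can be checked on random pairs of points:

```python
import numpy as np

# Illustrative one-dimensional example: f(w) = log(1 + exp(w)).
# Its derivative sigma(w) = 1 / (1 + exp(-w)) is Lipschitz with L = 1/4,
# so Theorem 1.2.2 gives f(z) <= f(w) + f'(w) (z - w) + (1/8) (z - w)^2.
f = lambda w: np.logaddexp(0.0, w)           # numerically stable log(1 + e^w)
fprime = lambda w: 1.0 / (1.0 + np.exp(-w))
L = 0.25

rng = np.random.default_rng(0)
for w, z in rng.uniform(-10, 10, size=(100, 2)):
    assert f(z) <= f(w) + fprime(w) * (z - w) + 0.5 * L * (z - w) ** 2 + 1e-12
print("Upper bound of Theorem 1.2.2 verified on random pairs (w, z).")
```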

Unconstrained optimization

$$\min_{w \in \mathbb{R}^d} f(w)$$

Assumption 2.0.1

The objective function $f$ is

  1. $C^{1,1}_L(\mathbb{R}^d)$ for $L > 0$,
  2. bounded below by $f_{\text{low}} \in \mathbb{R}$ (i.e. $f(w) \geq f_{\text{low}}$ for all $w \in \mathbb{R}^d$).

2.1 Gradient descent

For any $w \in \mathbb{R}^d$, two cases may occur:

  1. $\nabla f(w) = 0$. Then $w$ is possibly a local minimum of $f$.
  2. $\nabla f(w) \neq 0$. Then we can show that $f$ can be decreased by moving $w$ in the direction $-\nabla f(w)$.

2.1.1 Algorithm

Algorithm 1: Gradient descent for minimizing the function $f$.

  1. Initialization: Choose $w_0 \in \mathbb{R}^d$.
  2. For $k = 0,1,2,\dots$ do
    1. Compute the gradient $\nabla f(w_k)$.
    2. Define a step size $\alpha_k > 0$.
    3. Set $w_{k+1} = w_k - \alpha_k \nabla f(w_k)$.
  3. End for

Algorithm 1 actually describes a framework rather than a specific method; there exist numerous variants of the gradient descent paradigm.
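As an illustration, here is a minimal NumPy sketch of this framework (the objective, gradient, step-size rule, and iteration budget below are placeholder choices, not part of the original notes):

```python
import numpy as np

def gradient_descent(grad, w0, step_size, n_iters=100):
    """Algorithm 1: iterate w_{k+1} = w_k - alpha_k * grad(w_k).

    grad      : callable returning the gradient at a point
    w0        : initial iterate
    step_size : callable k -> alpha_k, so constant, decreasing,
                or adaptive choices all fit the same framework
    """
    w = np.asarray(w0, dtype=float)
    for k in range(n_iters):
        g = grad(w)              # step 2.1: compute the gradient
        alpha = step_size(k)     # step 2.2: define a step size
        w = w - alpha * g        # step 2.3: gradient step
    return w

# Example usage on f(w) = 0.5 * ||w||^2, whose gradient is w, with a constant step size.
w_final = gradient_descent(grad=lambda w: w, w0=np.ones(3), step_size=lambda k: 0.5)
```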

Why choose $-\nabla f(w_k)$ as a descent direction?

A vector $v \in \mathbb{R}^d$ is called a direction of descent of a function $f: \mathbb{R}^d \to \mathbb{R}$ at $x \in \mathbb{R}^d$ if there exists $\delta > 0$ such that $f(x + \alpha v) < f(x)$ for all $0 < \alpha < \delta$.

Clearly, if

$$f'(x;v) := \lim_{\alpha \to 0^+} \frac{f(x + \alpha v) - f(x)}{\alpha} < 0,$$

then $v$ is a direction of descent.

Let $f: \mathbb{R}^d \to \mathbb{R}$ be differentiable at $x \in \mathbb{R}^d$ with $\nabla f(x) \neq 0$. Then the optimal solution to the problem

$$\min_{d} \{ f'(x;d) : \|d\| \leq 1 \}$$

is given by $d^* = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$. Thus $-\nabla f(x)$ is a direction of steepest descent of $f$ at $x$.

Proof. From differentiability of $f$ at $x$, we have $f'(x;d) = \lim_{\alpha \to 0^+} \frac{f(x + \alpha d) - f(x)}{\alpha} = \nabla f(x)^T d$.

By the Cauchy–Schwarz inequality, we have $\nabla f(x)^T d \geq -\|\nabla f(x)\| \|d\| \geq -\|\nabla f(x)\|$ for all $\|d\| \leq 1$, with equality holding if and only if $d = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$.
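This is easy to check numerically; the sketch below (with an arbitrary test gradient and random sampling, purely for illustration) verifies that no random unit direction achieves a smaller directional derivative than the normalized negative gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Directional derivative of a differentiable f at x along d is grad_f(x) @ d.
x = rng.standard_normal(5)
grad_x = 2.0 * x                              # e.g. f(x) = ||x||^2, so grad f(x) = 2x
d_star = -grad_x / np.linalg.norm(grad_x)     # steepest-descent direction

best_random = min(
    grad_x @ (d / np.linalg.norm(d))          # directional derivative along a unit vector
    for d in rng.standard_normal((1000, 5))
)
assert grad_x @ d_star <= best_random         # no sampled direction does better
assert np.isclose(grad_x @ d_star, -np.linalg.norm(grad_x))
```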

Stopping criterion

What we can monitor to decide when to stop the algorithm:

  1. whether the method converged to a solution
  2. whether the method is making sufficient progress

Methods:

  1. Norm of the gradient: stop if $\|\nabla f(w_k)\| < \epsilon$.
  2. Variation of the iterates: stop if $\|w_{k+1} - w_k\| < \epsilon$.
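For instance, a gradient descent loop implementing both tests might look like the following sketch (the tolerance values and the way the two tests are combined are illustrative choices):

```python
import numpy as np

def gradient_descent_with_stopping(grad, w0, alpha, tol=1e-6, max_iters=10_000):
    """Gradient descent that stops on a small gradient norm or a small change in the iterate."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:              # criterion 1: norm of the gradient
            break
        w_next = w - alpha * g
        if np.linalg.norm(w_next - w) < tol:     # criterion 2: variation of the iterates
            w = w_next
            break
        w = w_next
    return w
```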

Choose the initial point

  1. Random initialization
  2. Using a suitable initial guess based on prior knowledge of the problem (may be close to a local minimum)

2.1.2 Choose step size

1. Constant step size

Provided $f$ satisfies Assumption 2.0.1, there exists an interval of values that lead to convergence of gradient descent. In particular, the choice

$$\alpha_k = \alpha = \frac{1}{L},$$

where $L$ is the Lipschitz constant for the gradient, is well suited for that problem.
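For a least-squares objective, this constant can be computed explicitly; the sketch below (with random placeholder data) illustrates the choice $\alpha = 1/L$:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)

# f(w) = 0.5 * ||X w - y||^2 has gradient X^T (X w - y),
# which is Lipschitz continuous with constant L = lambda_max(X^T X).
L = np.linalg.eigvalsh(X.T @ X).max()
alpha = 1.0 / L                      # constant step size alpha_k = 1/L

w = np.zeros(20)
for _ in range(500):
    w = w - alpha * (X.T @ (X @ w - y))
```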

2. Decreasing step size

Another classical technique for selecting the step size consists in defining, prior to running the method, a decreasing sequence $\{\alpha_k\}$ such that $\alpha_k \to 0$. This choice can also lead to a convergent method, but it risks producing steps that are unnecessarily small in norm. In fact, a good decreasing strategy should drive $\alpha_k$ to 0 quickly enough for convergence, but slowly enough that the norms of the steps do not approach 0 too rapidly. An example of such a schedule is given below.
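A commonly used family of schedules, shown here purely as an illustration, is

$$\alpha_k = \frac{\alpha_0}{(k+1)^p}, \qquad p \in (0, 1],$$

where a larger exponent $p$ makes the step sizes shrink faster; $p = 1$ and $p = 1/2$ are typical choices.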

3. Adaptive step size using a line search

Algorithm 2: Backtracking line search in direction $d$.

  1. Inputs: $w \in \mathbb{R}^d$, $d \in \mathbb{R}^d$, $\alpha_0 > 0$.
  2. Initialization: Choose $\alpha = \alpha_0$.
  3. While $f(w + \alpha d) > f(w)$
  4. Set $\alpha \leftarrow \alpha/2$.
  5. End while
  6. Output: $\alpha$.
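A direct NumPy translation of Algorithm 2 might look as follows (a sketch; the cap on the number of halvings is an added safeguard, not part of the algorithm as stated):

```python
import numpy as np

def backtracking_line_search(f, w, d, alpha0=1.0, max_halvings=50):
    """Algorithm 2: halve alpha until the step produces a decrease f(w + alpha d) <= f(w)."""
    alpha = alpha0
    for _ in range(max_halvings):          # safeguard so the loop always terminates
        if f(w + alpha * d) <= f(w):       # negation of the while-condition of Algorithm 2
            break
        alpha /= 2.0
    return alpha

# Example usage with f(w) = 0.5 * ||w||^2 and the steepest-descent direction d = -grad f(w).
f = lambda w: 0.5 * float(np.dot(w, w))
w = np.array([3.0, -4.0])
alpha = backtracking_line_search(f, w, d=-w)
```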

2.1.3 Theoretical analysis of gradient descent

Proposition 2.1.1 Consider the $k$-th iteration of Algorithm 1 applied to $f \in C^{1,1}_L(\mathbb{R}^d)$, and assume that $\nabla f(w_k) \neq 0$. Then, if $0 < \alpha_k < \frac{2}{L}$, we have

$$f(w_k - \alpha_k \nabla f(w_k)) < f(w_k).$$

If we choose $\alpha_k = 1/L$, then

$$f(w_{k+1}) \leq f(w_k) - \frac{1}{2L} \|\nabla f(w_k)\|^2.$$

Proof. By Theorem 1.2.2, we have

$$\begin{aligned} f(w_{k+1}) &\leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \frac{L}{2} \|w_{k+1} - w_k\|^2\\ &= f(w_k) - \alpha_k \nabla f(w_k)^T \nabla f(w_k) + \frac{L}{2} \alpha_k^2 \|\nabla f(w_k)\|^2\\ &= f(w_k) - \left(\alpha_k - \frac{L}{2} \alpha_k^2\right) \|\nabla f(w_k)\|^2. \end{aligned}$$

In order to ensure that the function value decreases at each iteration, we need to choose $\alpha_k$ such that $\alpha_k - \frac{L}{2} \alpha_k^2 > 0$, which is equivalent to $0 < \alpha_k < 2/L$. For the particular choice $\alpha_k = 1/L$, the coefficient equals $\frac{1}{L} - \frac{1}{2L} = \frac{1}{2L}$, which gives the second inequality.
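The guaranteed decrease can also be observed numerically; the sketch below (with an arbitrary quadratic test problem) checks the inequality $f(w_{k+1}) \leq f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|^2$ along a run with $\alpha_k = 1/L$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A.T @ A + np.eye(10)            # symmetric positive definite Hessian
b = rng.standard_normal(10)

f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
L = np.linalg.eigvalsh(A).max()     # Lipschitz constant of the gradient

w = rng.standard_normal(10)
for _ in range(100):
    g = grad(w)
    w_next = w - g / L              # gradient step with alpha_k = 1/L
    assert f(w_next) <= f(w) - g @ g / (2 * L) + 1e-9   # Proposition 2.1.1
    w = w_next
```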

The result of Proposition 2.1.1 will be instrumental in obtaining complexity guarantees on Algorithm 1 in three different settings: nonconvex, convex, and strongly convex.

Nonconvex case

In the nonconvex case, we aim at bounding the number of iterations required to drive the gradient norm below some threshold $\epsilon > 0$: this means that we should be able to show that the gradient norm actually goes below this threshold, which is a guarantee of convergence.

Why focus on the number of iterations?

The number of iterations is critical in nonconvex optimization for the following reasons:

1. Limited computational resources: Real-world applications have constraints on time and resources (e.g., GPU, memory). If an algorithm requires too many iterations to reduce the gradient norm below $\epsilon$, it may be impractical due to high costs or delays.

2. Measure of convergence speed: The number of iterations reflects how quickly an algorithm converges. Ideally, we want fewer iterations to achieve $\|\nabla f\| < \epsilon$. Analyzing iteration count helps compare algorithm efficiency.

3. Theoretical guarantee: Proving an upper bound on iterations provides a theoretical assurance of convergence. This guides practical implementation, such as parameter tuning or estimating runtime.

Theorem 2.1.1 (Complexity of gradient descent for nonconvex functions)

Let $f \in C^{1,1}_L(\mathbb{R}^d)$ satisfy Assumption 2.0.1. Suppose that we apply Algorithm 1 with a constant step size $\alpha_k = 1/L$. Then, for any $K \geq 1$, we have

$$\min_{0\leq k\leq K-1} \|\nabla f(w_k)\| \leq \mathcal{O}\left(\frac{1}{\sqrt{K}}\right).$$

Proof. Fix an iteration index $K \geq 1$.

From Proposition 2.1.1, we have, for all $k = 0,\dots,K-1$,

$$\begin{aligned} f(w_{k+1}) &\leq f(w_k) - \frac{1}{2L} \|\nabla f(w_k)\|^2\\ &\leq f(w_k) - \frac{1}{2L} \left(\min_{0\leq j\leq K-1} \|\nabla f(w_j)\|\right)^2. \end{aligned}$$

Summing these inequalities over $k = 0,\dots,K-1$, we obtain

$$\sum_{k=0}^{K-1} f(w_{k+1}) \leq \sum_{k=0}^{K-1} f(w_k) - \frac{K}{2L} \left(\min_{0\leq j\leq K-1} \|\nabla f(w_j)\|\right)^2.$$

Removing identical terms on both sides (the sum telescopes), we get

$$f(w_K) \leq f(w_0) - \frac{K}{2L} \left(\min_{0\leq j\leq K-1} \|\nabla f(w_j)\|\right)^2.$$

Using the assumption that $f(w_K) \geq f_{\text{low}}$, we obtain

$$\min_{0\leq k\leq K-1} \|\nabla f(w_k)\| \leq \left(\frac{2L (f(w_0) - f_{\text{low}})}{K}\right)^{1/2} = \mathcal{O}\left(\frac{1}{\sqrt{K}}\right).$$

Equivalently, we say that the worst-case complexity of gradient descent is $\mathcal{O}(\epsilon^{-2})$: at most on the order of $\epsilon^{-2}$ iterations are needed to drive the gradient norm below $\epsilon$.

If we set the stopping criterion to be $\|\nabla f(w_k)\| < \epsilon$, a similar argument guarantees that the number of iterations $K$ required to meet this criterion satisfies

$$K \leq \frac{2L\,(f(w_0) - f_{\text{low}})}{\epsilon^2},$$

since, as long as $\|\nabla f(w_k)\| \geq \epsilon$ for all $k = 0,\dots,K-1$, the bound above forces $\epsilon \leq \left(\frac{2L (f(w_0) - f_{\text{low}})}{K}\right)^{1/2}$.
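As a sanity check, the sketch below (using an arbitrary smooth nonconvex test function with known $L$ and $f_{\text{low}}$) verifies numerically that the smallest gradient norm observed after $K$ iterations indeed stays below $\sqrt{2L(f(w_0) - f_{\text{low}})/K}$:

```python
import numpy as np

# Nonconvex test function f(w) = sum_i (1 - cos(w_i)), bounded below by f_low = 0,
# with gradient sin(w), which is 1-Lipschitz (so L = 1).
f = lambda w: np.sum(1.0 - np.cos(w))
grad = lambda w: np.sin(w)
L, f_low = 1.0, 0.0

w = np.full(10, 2.0)                 # arbitrary starting point
f0 = f(w)
best_grad_norm = np.inf

for K in range(1, 1001):
    g = grad(w)
    best_grad_norm = min(best_grad_norm, np.linalg.norm(g))
    bound = np.sqrt(2 * L * (f0 - f_low) / K)
    assert best_grad_norm <= bound + 1e-12    # bound of Theorem 2.1.1 after K iterations
    w = w - g / L                             # gradient step with alpha_k = 1/L
```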
