∇f(w)≠0. Then we can show that f can be decreased by moving w in the direction −∇f(w).
2.1.1 Algorithm
Algorithm 1: Gradient descent for minimizing the function f.
Initialization: Choose w0∈Rd.
For k=0,1,2,… do
Compute the gradient ∇f(wk).
Define a step size αk>0.
Set wk+1=wk−αk∇f(wk).
End for
Algorithm 1 actually describes a framework rather than a specific method: there exist numerous variants of the gradient descent paradigm.
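To make the framework concrete, here is a minimal Python sketch of Algorithm 1. The constant step size, the iteration budget, and the toy quadratic in the example are illustrative choices and not part of the framework itself.

import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, num_iters=100):
    """Algorithm 1 with a user-supplied gradient and a constant step size."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(w)        # compute the gradient at the current iterate wk
        w = w - alpha * g    # step wk+1 = wk - alpha_k * grad f(wk)
    return w

# Toy example: f(w) = 0.5 * ||w||^2, whose gradient is w; the minimizer is the origin.
print(gradient_descent(lambda w: w, w0=[3.0, -2.0]))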
Why choose −∇f(wk) as a descent direction?
A vector v∈Rd is called a direction of descent of a function f:Rd→R at x∈Rd, if there exists δ>0 such that f(x+αv)<f(x) for all 0<α<δ.
Clearly, if
f′(x;v) := lim_{α→0⁺} [f(x+αv) − f(x)] / α < 0,
then v is a direction of descent.
Let f:Rd→R be differentiable at x∈Rd with ∇f(x)≠0. Then the optimal solution to the problem
min_d { f′(x;d) : ∥d∥≤1 }
is given by d∗ = −∇f(x)/∥∇f(x)∥. Thus −∇f(x) is a direction of steepest descent of f at x.
Proof. From differentiability of f at x, we have f′(x;d) = lim_{α→0⁺} [f(x+αd) − f(x)] / α = ∇f(x)ᵀd.
By the Cauchy–Schwarz inequality, we have ∇f(x)ᵀd ≥ −∥∇f(x)∥∥d∥ ≥ −∥∇f(x)∥ for every d with ∥d∥≤1, with equality holding if and only if d = −∇f(x)/∥∇f(x)∥. Hence the minimum is attained at d∗ = −∇f(x)/∥∇f(x)∥.
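A quick numerical sanity check of this result (a sketch: the quadratic below and the random unit directions are arbitrary illustrative choices): no unit-norm direction yields a smaller directional derivative ∇f(x)ᵀd than d∗ = −∇f(x)/∥∇f(x)∥.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative smooth function f(x) = 0.5 * x^T A x, whose gradient is A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x = np.array([1.0, -1.0])
g = A @ x

d_star = -g / np.linalg.norm(g)                                   # steepest-descent direction
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # random unit vectors

print((directions @ g >= g @ d_star).all())   # True: d_star attains the minimum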
Stopping criterion
Things we can monitor to decide when to stop the algorithm:
whether the method has converged to a solution;
whether the method is making sufficient progress.
Methods (both checks are sketched below):
Norm of the gradient: stop if ∥∇f(wk)∥<ϵ.
Relative variation of the iterates: stop if ∥wk+1−wk∥<ϵ.
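A minimal sketch of these two tests in Python (the tolerance values and argument names are placeholders):

import numpy as np

def should_stop(grad_wk, w_next, w_curr, eps_grad=1e-6, eps_step=1e-9):
    """Return True if either stopping test is satisfied."""
    small_gradient = np.linalg.norm(grad_wk) < eps_grad            # ||grad f(wk)|| < eps
    small_variation = np.linalg.norm(w_next - w_curr) < eps_step   # ||wk+1 - wk|| < eps
    return small_gradient or small_variation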
Choosing the initial point
Random initialization
Using a suitable initial guess based on prior knowledge of the problem (may be close to a local minimum)
2.1.2 Choosing the step size
1. Constant step size
Provided f satisfies Assumption 2.0.1, there exists an interval of step-size values that leads to convergence of gradient descent. In particular, the choice
αk = α = 1/L,
where L is the Lipschitz constant of the gradient, is well suited for this problem.
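As an illustration of how such a step size can be computed (a sketch: for a quadratic f(w) = ½ wᵀAw − bᵀw with symmetric A, the gradient Aw − b is Lipschitz continuous with constant equal to the largest eigenvalue of A; the matrix below is an arbitrary example):

import numpy as np

# Quadratic objective f(w) = 0.5 * w^T A w - b^T w, with gradient A w - b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient for this quadratic
alpha = 1.0 / L                   # constant step size alpha_k = 1/L
print(L, alpha)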
2. Decreasing step size
Another classical technique for selecting the step size consists in defining, prior to running the method, a decreasing sequence {αk} such that αk→0. This choice can also lead to a convergent method, but it risks producing steps that are unnecessarily small in norm. In fact, a good decreasing strategy should drive αk to 0 quickly enough to guarantee convergence, but slowly enough that the norms of the steps do not approach 0 too rapidly.
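For instance, a simple schedule of this kind (the particular rule αk = α0/√(k+1) is one common illustrative choice, not one prescribed by these notes):

def decreasing_step(alpha0, k):
    """A decreasing step size: tends to 0, but not too quickly."""
    return alpha0 / (k + 1) ** 0.5

print([round(decreasing_step(1.0, k), 3) for k in range(5)])  # [1.0, 0.707, 0.577, 0.5, 0.447]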
3. Adaptive step size using a line search
Algorithm 2: Backtracking line search in direction d.
Inputs: w∈Rd, d∈Rd, α0>0.
Initialization: Choose α=α0.
While f(w+αd)>f(w)
Set α←α/2.
End while
Output: α.
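A Python rendering of Algorithm 2 (a sketch: the halving rule and the plain decrease test mirror the pseudocode above, while the max_halvings safeguard is an addition to avoid looping forever when d is not a descent direction):

def backtracking_line_search(f, w, d, alpha0=1.0, max_halvings=50):
    """Algorithm 2: halve alpha until f(w + alpha * d) no longer exceeds f(w)."""
    alpha = alpha0
    fw = f(w)
    for _ in range(max_halvings):     # safeguard not present in the pseudocode
        if f(w + alpha * d) <= fw:    # decrease achieved: accept this step size
            return alpha
        alpha /= 2.0                  # otherwise halve the step size and try again
    return alpha

Note that, as in the pseudocode, the loop accepts any α for which f(w+αd) does not exceed f(w); stricter sufficient-decrease (Armijo-type) tests are a common variant.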
2.1.3 Theoretical analysis of gradient descent
Proposition 2.1.1 Consider the k-th iteration of Algorithm 1 applied to f∈C_L^{1,1}(Rd), and assume that ∇f(wk)≠0. Then, if 0<αk<2/L, we have
f(wk+1) ≤ f(wk) − (αk − (L/2)αk²)∥∇f(wk)∥².
In order to ensure that the function value decreases at each iteration, we need to choose αk such that αk − (L/2)αk² > 0, which is equivalent to 0<αk<2/L.
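A sketch of the argument behind this inequality, assuming the descent lemma for C_L^{1,1} functions, f(y) ≤ f(x) + ∇f(x)ᵀ(y−x) + (L/2)∥y−x∥², has been established earlier in the notes: taking x = wk and y = wk+1 = wk − αk∇f(wk) gives
f(wk+1) ≤ f(wk) − αk∥∇f(wk)∥² + (L/2)αk²∥∇f(wk)∥² = f(wk) − (αk − (L/2)αk²)∥∇f(wk)∥².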
The result of Proposition 2.1.1 will be instrumental in obtaining complexity guarantees for Algorithm 1 in three different settings: nonconvex, convex, and strongly convex.
Nonconvex case
In the nonconvex case, we aim at bounding the number of iterations required to drive the gradient norm below some threshold ϵ>0: this means that we should be able to show that the gradient norm actually goes below this threshold, which is a guarantee of convergence.
Why focus on the number of iterations?
The number of iterations is critical in nonconvex optimization for the following reasons:
1. Limited computational resources: Real-world applications have constraints on time and resources (e.g., GPU, memory). If an algorithm requires too many iterations to reduce the gradient norm below ϵ, it may be impractical due to high costs or delays.
2. Measure of convergence speed: The number of iterations reflects how quickly an algorithm converges. Ideally, we want fewer iterations to achieve ∥∇f∥<ϵ. Analyzing iteration count helps compare algorithm efficiency.
3. Theoretical guarantee: Proving an upper bound on iterations provides a theoretical assurance of convergence. This guides practical implementation, such as parameter tuning or estimating runtime.
Theorem 2.1.1 (Complexity of gradient descent for nonconvex functions)
Let f∈C_L^{1,1}(Rd) satisfy Assumption 2.0.1. Suppose that we apply Algorithm 1 with a constant step size αk=1/L. Then, for any K≥1, we have
min_{0≤k≤K−1} ∥∇f(wk)∥ ≤ O(1/√K).
Proof. Let K be an iteration index such that for every k=0,…,K−1, we have ∥∇f(wk)∥>ϵ.
Equivalently, we say that the worst-case complexity of gradient descent is O(ϵ⁻²).
If we set the stopping criterion to be ∥∇f(wk)∥<ϵ, a similar argument shows that the number of iterations required to meet this criterion is at most of order ϵ⁻².
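A sketch of how such a bound is typically obtained, assuming Assumption 2.0.1 provides a lower bound f_low ≤ f(w) for all w (the assumption is not restated here): with αk = 1/L, Proposition 2.1.1 gives
f(wk+1) ≤ f(wk) − (1/(2L))∥∇f(wk)∥².
If ∥∇f(wk)∥ > ϵ for every k = 0,…,K−1, summing these inequalities yields
f_low ≤ f(wK) ≤ f(w0) − (K/(2L))ϵ²,
hence K ≤ 2L(f(w0) − f_low)ϵ⁻², which is the announced O(ϵ⁻²) worst-case complexity.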