Optimization
nonlinear programming
unconstrained minimization
first-order methods
conjugate gradient
- mainly used for solving large, sparse linear systems, viewed from the iterative-methods point of view
second-order methods
Newton's method
linear system
the conjugate gradient method is a Krylov subspace method; by the minimization property, we have \[ ||x^k - x^\star||_A = \min_{p \in \mathcal{P}_k,\, p(0)=1} ||p(A)(x^0 - x^\star)||_A \]
use \[ ||x^k - x^\star||_A \le ||x^0 - x^\star||_A \min_{p \in \mathcal{P}_k,\, p(0)=1} \max_{z \in \sigma(A)} |p(z)| \]
this bound is small after only a few iterations if the spectrum of \( A \) is clustered, since a low-degree polynomial with \( p(0) = 1 \) can be made small on a handful of clusters
- using only the extreme eigenvalues of \( A \), i.e. the condition number \( \kappa = \kappa(A) \), the standard bound is
\[ ||x^k - x^\star||_A \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k ||x^0 - x^\star||_A \]
one can get this result by using Chebyshev polynomials; for details see here
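A minimal numerical sketch of this bound: plain CG on a random symmetric positive definite system, comparing the A-norm error with \( 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k ||x^0 - x^\star||_A \). The test matrix, its spectrum, and the iteration count are illustrative choices, not from these notes.

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_iter):
    """Plain CG for SPD A; returns the iterates x_0, x_1, ..., x_{n_iter}."""
    x = x0.copy()
    r = b - A @ x            # residual r_k = b - A x_k
    p = r.copy()             # initial search direction
    iterates = [x.copy()]
    for _ in range(n_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)  # standard CG update for the next direction
        p = r_new + beta * p
        r = r_new
        iterates.append(x.copy())
    return iterates

rng = np.random.default_rng(0)
n = 200
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthogonal matrix
eigs = np.linspace(1.0, 100.0, n)                  # spectrum in [1, 100], kappa = 100
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

a_norm = lambda v: np.sqrt(v @ A @ v)
kappa = eigs[-1] / eigs[0]
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

xs = conjugate_gradient(A, b, np.zeros(n), 25)
e0 = a_norm(xs[0] - x_star)
for k, xk in enumerate(xs):
    print(f"k={k:2d}  ||x_k - x*||_A = {a_norm(xk - x_star):.3e}  bound = {2 * rho**k * e0:.3e}")
```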
line search
- constant step size rule (no search at all)
- minimization rule, full relaxation (used in theory only) \[ \alpha_k = \mbox{argmin}_{\alpha > 0} f(x^k +\alpha p^k) \]
- Armijo rule (sufficient decrease), with \( c_1 \in (0,1) \); see the sketch after this list \[ f(x^k + \alpha p^k) \le f(x^k) + c_1 \alpha \nabla f(x^k)^T p^k\]
- curvature condition (rules out unacceptably small steps)
\[ \nabla f(x^k + \alpha p^k)^T p^k \ge c_2 \nabla f(x^k)^T p^k\] where \( c_2 \in (c_1, 1) \)
- Wolfe conditions (sufficient decrease + curvature condition)
- Goldstein rule
\[ -c\, \alpha \nabla f(x^k)^T p^k \le f(x^k) - f(x^k + \alpha p^k) \le -(1-c)\, \alpha \nabla f(x^k)^T p^k\] where \( c \in (0, \frac{1}{2})\) (note \( \nabla f(x^k)^T p^k < 0 \) for a descent direction)
see Numerical Optimization for a thorough discussion
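A minimal backtracking sketch enforcing the Armijo (sufficient decrease) condition above; the constants \( c_1 = 10^{-4} \), the initial step, and the halving factor are common illustrative defaults, not prescribed here.

```python
import numpy as np

def armijo_backtracking(f, grad_f, x, p, alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Shrink alpha until the Armijo condition holds along the descent direction p."""
    fx = f(x)
    slope = grad_f(x) @ p          # directional derivative; must be < 0
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha                   # fall back to the last (tiny) trial step

# usage on a simple quadratic f(x) = 1/2 x^T A x - b^T x
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

x = np.array([5.0, 5.0])
p = -grad_f(x)                     # steepest descent direction
alpha = armijo_backtracking(f, grad_f, x, p)
print("accepted step size:", alpha, " new f:", f(x + alpha * p))
```

Enforcing the curvature condition as well (to get the Wolfe conditions) additionally requires evaluating the gradient at each trial point inside the loop.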
gradient method
various search directions
- steepest descent, i.e. negative gradient \( - \nabla f\)
- general descent direction, i.e. \(p^k = -D^k \nabla f(x^k)\) where \( D^k\) is positive definite
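A small illustration of how a positive definite scaling \( D^k \) changes the step \( p^k = -D^k \nabla f(x^k) \): on a badly scaled diagonal quadratic, a diagonal \( D \) (a made-up preconditioner chosen purely for illustration) removes the ill-conditioning that slows plain steepest descent.

```python
import numpy as np

A = np.diag([100.0, 1.0])          # badly scaled quadratic f(x) = 1/2 x^T A x
grad_f = lambda x: A @ x

D = np.diag(1.0 / np.diag(A))      # positive definite diagonal scaling; here D = A^{-1},
                                   # so the scaled step is exact for this quadratic
x_sd = np.array([1.0, 1.0])        # steepest descent iterate
x_scaled = x_sd.copy()             # scaled-direction iterate
alpha = 1.0 / 100.0                # safe constant step (1/L) for the unscaled problem

for k in range(50):
    x_sd = x_sd - alpha * grad_f(x_sd)
    x_scaled = x_scaled - D @ grad_f(x_scaled)   # unit step along -D grad f

print("steepest descent  ||x||:", np.linalg.norm(x_sd))
print("scaled direction  ||x||:", np.linalg.norm(x_scaled))
```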
constrained minimization
- penalty function methods (see the sketch after this list)
- barrier function methods
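A minimal quadratic-penalty sketch for an equality-constrained problem: minimize \( f(x) + \frac{\mu}{2} h(x)^2 \) for an increasing penalty parameter \( \mu \). The objective, constraint, inner solver, and all constants below are illustrative assumptions.

```python
import numpy as np

f = lambda x: x[0] ** 2 + 2 * x[1] ** 2            # objective
grad_f = lambda x: np.array([2 * x[0], 4 * x[1]])
h = lambda x: x[0] + x[1] - 1.0                    # equality constraint h(x) = 0
grad_h = lambda x: np.array([1.0, 1.0])

x = np.zeros(2)
for mu in [1.0, 10.0, 100.0, 1000.0]:
    step = 1.0 / (2.0 * mu + 4.0)   # crude 1/L bound for the penalized objective
    # inner loop: gradient descent on f(x) + (mu/2) h(x)^2, warm-started
    for _ in range(5000):
        g = grad_f(x) + mu * h(x) * grad_h(x)
        x = x - step * g
    print(f"mu = {mu:7.1f}  x = {x}  constraint violation = {h(x):+.5f}")
```

As \( \mu \) grows, the minimizers of the penalized problem approach the constrained solution and the constraint violation shrinks like \( O(1/\mu) \).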
nonlinear system
global convergence
- the standard assumption for global convergence analysis of gradient methods is \[ f \in C^{1,1}_L(\mathbb{R}^n),\] i.e. the gradient of \( f \) is Lipschitz continuous with constant \( L \), which gives the descent lemma
\[ |f(y) -f(x)- \nabla f(x)^T(y-x)| \le \frac{L}{2}||y-x||^2\]
- note that in all cases, we can only get \[ \lim_{k\to\infty}||\nabla f(x^k)||=0 \] or an even weaker result
steepest descent direction
- for the constant step size rule, the minimization rule, and the Goldstein rule, we have (for some constant \( \omega > 0 \) depending on the rule)
\[ \frac{\omega}{L} \displaystyle \sum_{k=0}^{\infty} ||\nabla f(x^k)||^2 \le f(x^0) - f^\star \]
- a rate-of-convergence result can be obtained only for \[ \min_{1\le i\le k}||\nabla f(x^i)||\]
see equation (1.2.14) of Introductory Lectures on Convex Optimization
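A numerical check of this bound for the constant step size \( \alpha = 1/L \) (for which \( \omega = \frac{1}{2} \)): gradient descent on a simple quadratic, where the accumulated \( \frac{1}{2L}\sum_k ||\nabla f(x^k)||^2 \) stays below \( f(x^0) - f^\star \). The test function is an illustrative assumption.

```python
import numpy as np

A = np.diag([10.0, 2.0, 0.5])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
L = 10.0                      # Lipschitz constant of grad f (largest eigenvalue)
f_star = 0.0

x = np.array([1.0, 1.0, 1.0])
f0 = f(x)
sum_sq = 0.0
for k in range(200):
    g = grad_f(x)
    sum_sq += g @ g
    x = x - (1.0 / L) * g     # constant step size 1/L

print("(1/2L) * sum ||grad f||^2 =", sum_sq / (2 * L))
print("f(x^0) - f*               =", f0 - f_star)
print("final gradient norm       =", np.linalg.norm(grad_f(x)))
```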
local convergence
standard assumption for local convergence analysis is
- \( f \in C^{2,2}_M(\mathbb{R}^n)\)
- \( H(x^\star)\) is positive definite
- \(x^0\) is sufficiently close to \(x^\star\)
steepest descent direction
for the constant step size rule, we have a linear convergence rate
\[ ||x^k - x^\star|| = O(\beta^k)\] for some \( \beta \in (0,1) \)
see theorem 1.2.4 of Introductory Lectures on Convex Optimization
general descent direction
for step sizes satisfying the Wolfe conditions (a similar result holds for the Goldstein rule), we have the following Zoutendijk condition
\[ \displaystyle \sum_{k=0}^{\infty} \cos^2 \theta_k\, ||\nabla f(x^k)||^2 < \infty \] where \( \cos \theta_k = \frac{-\nabla f(x^k)^T p^k}{||\nabla f(x^k)||\,||p^k||} \)
see theorem 3.2 of Numerical Optimization
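A small numerical illustration of the Zoutendijk condition using scipy's strong-Wolfe line search and randomly scaled descent directions on the Rosenbrock function; the test problem and the random scaling matrices are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import rosen, rosen_der, line_search

rng = np.random.default_rng(1)
x = np.array([-1.2, 1.0])
partial_sum = 0.0
for k in range(100):
    g = rosen_der(x)
    # a generic descent direction p = -D g with D symmetric positive definite
    M = rng.standard_normal((2, 2))
    D = M @ M.T + np.eye(2)
    p = -D @ g
    cos_theta = -(g @ p) / (np.linalg.norm(g) * np.linalg.norm(p))
    partial_sum += cos_theta**2 * (g @ g)          # term of the Zoutendijk series
    alpha = line_search(rosen, rosen_der, x, p)[0]  # Wolfe line search
    if alpha is None:                               # line search failed; stop the demo
        break
    x = x + alpha * p

print("partial sum of cos^2(theta_k) ||grad f||^2:", partial_sum)
print("current iterate:", x)
```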
local convergence
standard assumption for local convergence analysis is
- \( f \in C^{2,2}_M(\mathbb{R}^n)\)
- \( H(x^\star)\) is positive definite
- \(x^0\) is sufficiently close to \(x^\star\)
standard Newton's method (step size = 1)
we have a quadratic convergence rate
\[ ||x^{k+1} - x^\star|| = O(||x^k - x^\star||^2)\]
see theorem 1.2.5 of Introductory Lectures on Convex Optimization
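A minimal sketch of the pure Newton iteration (unit step) exhibiting quadratic error decay near a minimizer with positive definite Hessian; the test function \( f(x, y) = \frac{x^4}{4} + \frac{y^2}{2} - xy \), whose minimizer \( (1, 1) \) is known exactly, and the starting point are illustrative choices.

```python
import numpy as np

# f(x, y) = x^4/4 + y^2/2 - x*y; its minimizer (1, 1) is known exactly and the
# Hessian there is positive definite, so the local theory applies.
grad = lambda v: np.array([v[0] ** 3 - v[1], v[1] - v[0]])
hess = lambda v: np.array([[3 * v[0] ** 2, -1.0], [-1.0, 1.0]])

x_star = np.array([1.0, 1.0])
x = np.array([1.3, 0.6])                          # "sufficiently close" starting point
for k in range(6):
    print(f"k={k}  ||x^k - x*|| = {np.linalg.norm(x - x_star):.3e}")
    x = x - np.linalg.solve(hess(x), grad(x))     # full Newton step (step size 1)
```

The printed errors roughly square at each iteration until machine precision is reached, matching the \( O(||x^k - x^\star||^2) \) estimate.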
quasi-Newton
convex optimization
- with a rich theory and nice convergence guarantees
convex function
- several useful inequalities are collected in Introductory Lectures on Convex Optimization
- any first-order (black-box) method applied to a smooth convex function has the lower complexity bound \[ f(x^k) - f(x^\star) \ge \Omega\left(\frac{1}{k^2}\right)\] during the initial iterations (while \( k \) is small relative to the dimension \( n \)); see theorem 2.1.7 of Introductory Lectures on Convex Optimization
complexity bound
for the constant step size gradient method applied to \( f \in \mathcal{F}^{1,1}_L\), we have \[ f(x^k) - f(x^\star) \le O\left(\frac{1}{k}\right);\] see theorem 2.1.14 of Introductory Lectures on Convex Optimization
strongly convex function
convergence rate
for the constant step size gradient method applied to \( f \in S^{1,1}_{\mu, L}\), we have a linear convergence rate; note that this is a global convergence result. See theorem 2.1.15 of Introductory Lectures on Convex Optimization
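A numerical illustration contrasting the two complexity results for the constant step size \( 1/L \): the \( O(1/k) \) guarantee (of the form \( \frac{2L||x^0 - x^\star||^2}{k+4} \)) on a smooth convex function with a degenerate minimizer, versus the linear (geometric) rate once the function is strongly convex. Both test functions are illustrative assumptions.

```python
import numpy as np

# --- smooth convex with a degenerate minimizer: f(x) = x^4 / 4 --------------
# grad f(x) = x^3 is Lipschitz with L = 3 on [-1, 1], which contains every
# iterate started from x^0 = 1; here x* = 0.
L1, x0, x_star = 3.0, 1.0, 0.0
f1 = lambda x: 0.25 * x ** 4
x = x0
for k in range(1, 101):
    x -= (1.0 / L1) * x ** 3
    if k in (10, 50, 100):
        bound = 2 * L1 * (x0 - x_star) ** 2 / (k + 4)   # O(1/k) upper bound
        print(f"convex:          k={k:3d}  f - f* = {f1(x):.2e}  2*L*R^2/(k+4) = {bound:.2e}")

# --- strongly convex quadratic: linear (geometric) rate ---------------------
mu, L2 = 1.0, 10.0
A = np.diag([mu, L2])
y = np.array([1.0, 1.0])
for k in range(1, 101):
    y = y - (1.0 / L2) * (A @ y)
    if k in (10, 50, 100):
        print(f"strongly convex: k={k:3d}  ||x^k - x*|| = {np.linalg.norm(y):.2e}"
              f"  (1 - mu/L)^k = {(1 - mu / L2) ** k:.2e}")
```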