6 Algorithms
6.1 Convex optimization in finite dimension
6.1.1 Center of gravity
We consider the following simple iterative algorithm: let \(S_1=\mathcal{X}\), and for \(t\ge 1\) do the following:
Compute
\begin{equation} c_t=\frac{1}{\operatorname {vol}(S_t)}\int _{x\in S_t} x\, dx. \label{eq:2.1} \end{equation}
Query the first order oracle at \(c_t\) and obtain \(w_t\in \partial f(c_t)\). Let
\[ S_{t+1}=S_t\cap \{ x\in \mathbb {R}^n:(x-c_t)^\top w_t\le 0\} . \]
If stopped after \(t\) queries to the first order oracle, then we use \(t\) queries to a zeroth order oracle to output
\[ x_t\in \operatorname*{arg\,min}_{1\le r\le t} f(c_r). \]
This procedure is known as the center of gravity method; it was discovered independently on both sides of the Wall by Levin [1965] and Newman [1965].
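As a concrete illustration, here is a minimal Monte Carlo sketch of the method on the unit square (not from the text): the sets \(S_t\) are represented by the accumulated half-space cuts, and the centroid integral (2.1) is approximated by rejection sampling. The helper name `center_of_gravity`, the toy quadratic objective, and the sample sizes are all illustrative assumptions.

```python
import numpy as np

def center_of_gravity(f, grad, n_iters=10, n_samples=100_000, seed=0):
    # Monte Carlo sketch on X = [0,1]^2: S_t is the square intersected with
    # the half-space cuts collected so far; the centroid c_t is estimated by
    # averaging the uniform samples that survive every cut.
    rng = np.random.default_rng(seed)
    cuts = []                      # pairs (c_t, w_t): keep (x - c_t) @ w_t <= 0
    best = None
    for _ in range(n_iters):
        pts = rng.random((n_samples, 2))
        for c, w in cuts:
            pts = pts[(pts - c) @ w <= 0]
        if len(pts) == 0:          # too few samples survive; stop the sketch
            break
        c_t = pts.mean(axis=0)     # approximate centroid of S_t
        w_t = grad(c_t)            # first order oracle
        if best is None or f(c_t) < f(best):
            best = c_t             # zeroth order oracle keeps the best center
        cuts.append((c_t, w_t))
    return best

# toy convex objective with minimizer (0.3, 0.7), inside the square
target = np.array([0.3, 0.7])
f = lambda x: float(np.sum((x - target) ** 2))
grad = lambda x: 2 * (x - target)
x_best = center_of_gravity(f, grad)
```

Note that the cuts remain valid even though the centroid is only approximate: any subgradient cut through any query point keeps the minimizer in the surviving region.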
The center of gravity method satisfies
Let \(x^*\) be such that \(f(x^*)=\min _{x\in \mathcal{X}}f(x)\). Since \(w_t\in \partial f(c_t)\) one has
\[ f(c_t)-f(x)\le w_t^\top (c_t-x), \quad \forall x\in \mathcal{X}, \]
and thus
\begin{equation} S_t\setminus S_{t+1}\subset \{ x\in \mathcal{X}: f(x){\gt}f(c_t)\} , \label{eq:2.2} \end{equation}
which clearly implies that one can never remove the optimal point from the sets under consideration, that is \(x^*\in S_t\) for any \(t\). Without loss of generality we can assume that we always have \(w_t\neq 0\), for otherwise one would have \(f(c_t)=f(x^*)\) which immediately concludes the proof. Now using that \(w_t\neq 0\) for any \(t\) and Lemma 2.2 one clearly obtains
\[ \operatorname{vol}(S_{t+1})\le \left(1-\frac{1}{e}\right)\operatorname{vol}(S_t). \]
For \(\varepsilon \in [0,1]\), let \(\mathcal{X}_\varepsilon =\{ (1-\varepsilon )x^*+\varepsilon x,\ x\in \mathcal{X}\} \). Note that \(\mathrm{vol}(\mathcal{X}_\varepsilon )=\varepsilon ^n\mathrm{vol}(\mathcal{X})\). These volume computations show that for \(\varepsilon {\gt}\left(1-\frac{1}{e}\right)^{t/n}\) one has \(\mathrm{vol}(\mathcal{X}_\varepsilon ){\gt}\mathrm{vol}(S_{t+1})\). In particular this implies that for \(\varepsilon {\gt}\left(1-\frac{1}{e}\right)^{t/n}\), there must exist a time \(r\in \{ 1,\dots ,t\} \), and \(x_\varepsilon \in \mathcal{X}_\varepsilon \), such that \(x_\varepsilon \in S_r\) and \(x_\varepsilon \notin S_{r+1}\). In particular by (2.2) one has \(f(c_r){\lt}f(x_\varepsilon )\). On the other hand by convexity of \(f\) one clearly has \(f(x_\varepsilon )\le f(x^*)+2\varepsilon B\). This concludes the proof.
(Grünbaum [1960]). Let \(\mathcal{K}\) be a centered convex set, i.e., \(\int _{x\in \mathcal{K}} x\, dx = 0\), then for any \(w\in \mathbb {R}^n\), \(w\neq 0\), one has
\[ \operatorname{vol}\left(\mathcal{K}\cap \{ x\in \mathbb {R}^n: x^\top w\ge 0\} \right)\ge \frac{1}{e}\operatorname{vol}(\mathcal{K}). \]
6.1.2 Ellipsoid method
Let \(\mathcal{E}_0=\{ x\in \mathbb {R}^n:(x-c_0)^\top H_0^{-1}(x-c_0)\le 1\} \). For any \(w\in \mathbb {R}^n\), \(w\neq 0\), there exists an ellipsoid \(\mathcal{E}\) such that
and
Furthermore for \(n\ge 2\) one can take \(\mathcal{E}=\{ x\in \mathbb {R}^n:(x-c)^\top H^{-1}(x-c)\le 1\} \) where
For \(n=1\) the result is obvious; in fact we even have \(\operatorname {vol}(\mathcal{E})\le \tfrac {1}{2}\operatorname {vol}(\mathcal{E}_0)\).
For \(n\ge 2\) one can simply verify that the ellipsoid given by (2.5) and (2.6) satisfies the required properties (2.3) and (2.4). Rather than bluntly doing these computations we will show how to derive (2.5) and (2.6). As a by-product this will also show that the ellipsoid defined by
A quick sketch shows that it makes sense to look for an ellipsoid \(\mathcal E\) centered at \(c=-tw\), with \(t\in [0,1]\) (presumably \(t\) will be small), such that one principal direction is \(w\) (with inverse squared semi-axis \(a{\gt}0\)), and the other principal directions are all orthogonal to \(w\) (with the same inverse squared semi-axes \(b{\gt}0\)). In other words we are looking for \(\mathcal E=\{ x:(x-c)^\top H^{-1}(x-c)\le 1\} \) with
Now we have to express the constraint that \(\mathcal E\) should contain the half Euclidean ball \(\{ x\in \mathcal B:x^\top w\le 0\} \). Since we are also looking for \(\mathcal E\) to be as small as possible, it makes sense to ask for \(\mathcal E\) to "touch" the Euclidean ball, both at \(x=-w\) and at the equator \(\partial \mathcal B\cap w^\perp \). The former condition can be written as:
while the latter is expressed as:
As one can see from the above two equations, we are still free to choose any value for \(t\in [0,1/2)\) (the fact that we need \(t{\lt}1/2\) comes from \(b=1-\left(\dfrac {t}{1-t}\right)^2{\gt}0\)). Quite naturally we take the value that minimizes the volume of the resulting ellipsoid. Note that
where \(f(h)=h^2(2h-h^2)^{\, n-1}\). Elementary computations show that the maximum of \(f\) (on \([1,2]\)) is attained at \(h=1+\frac{1}{n}\) (which corresponds to \(t=\frac{1}{n+1}\)), and the value is
where the lower bound follows again from elementary computations. Thus we showed that, for \(\mathcal{E}_0=\mathcal{B}\), (2.3) and (2.4) are satisfied with the ellipsoid given by the set of points \(x\) satisfying:
We consider now an arbitrary ellipsoid \(\mathcal{E}_0=\{ x\in \mathbb {R}^n:(x-c_0)^{\top }H_0^{-1}(x-c_0)\le 1\} \). Let \(\Phi (x)=c_0+H_0^{1/2}x\), then clearly \(\mathcal{E}_0=\Phi (\mathcal{B})\) and \(\{ x:w^{\top }(x-c_0)\le 0\} =\Phi (\{ x:(H_0^{1/2}w)^{\top }x\le 0\} )\). Thus in this case the image by \(\Phi \) of the ellipsoid given in (2.7) with \(w\) replaced by \(H_0^{1/2}w\) will satisfy (2.3) and (2.4). It is easy to see that this corresponds to an ellipsoid defined by
Applying the Sherman–Morrison formula to (2.8) one can recover (2.6), which concludes the proof.
For \(t \ge 2n^2\log (R/r)\) the ellipsoid method satisfies \(\{ c_1,\dots ,c_t\} \cap \mathcal{X}\neq \varnothing \) and
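The closed-form update (2.6) makes the method easy to sketch numerically. The following is an illustrative implementation (the helper `ellipsoid_step`, the toy quadratic, and the iteration budget are assumptions for the example, not part of the text): each step replaces the current ellipsoid by the minimal-volume ellipsoid containing the half cut out by the subgradient.

```python
import numpy as np

def ellipsoid_step(c, H, w):
    # Minimal-volume ellipsoid containing the half of
    # {x : (x-c)^T H^{-1} (x-c) <= 1} cut by {x : (x-c)^T w <= 0}, per (2.6).
    n = len(c)
    Hw = H @ w
    s2 = float(w @ Hw)                  # w^T H w > 0 since H is positive definite
    c_new = c - Hw / ((n + 1) * np.sqrt(s2))
    H_new = (n ** 2 / (n ** 2 - 1.0)) * (H - (2.0 / (n + 1)) * np.outer(Hw, Hw) / s2)
    return c_new, H_new

# toy run: minimize f(x) = ||x - a||^2 starting from the ball of radius 10
a = np.array([1.0, -2.0, 0.5])
f = lambda x: float(np.sum((x - a) ** 2))
grad = lambda x: 2 * (x - a)
c, H = np.zeros(3), 100.0 * np.eye(3)   # E_0 = Euclidean ball of radius R = 10
best = c.copy()
for _ in range(400):
    w = grad(c)
    if np.allclose(w, 0):
        break
    c, H = ellipsoid_step(c, H, w)
    if f(c) < f(best):
        best = c.copy()                 # keep the best center seen so far
```

As the theorem suggests, the best visited center converges geometrically, at a rate degrading with the dimension \(n\).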
6.2 Dimension-free convex optimization
6.2.1 Subgradient descent
For \(t \ge 1\):
The projected subgradient descent method with \(\eta = \frac{R}{L\sqrt{t}}\) satisfies
Using the definition of subgradients, the definition of the method, and the elementary identity \(2a^\top b=\| a\| ^2+\| b\| ^2-\| a-b\| ^2\), one obtains
Now note that \(\| g_s\| \le L\), and furthermore by Lemma 3.1
Summing the resulting inequality over \(s\), and using that \(\| x_1-x^*\| \le R\), yields
Plugging in the value of \(\eta \) directly gives the statement (recall that by convexity \(f((1/t)\sum _{s=1}^t x_s)\le \tfrac {1}{t}\sum _{s=1}^t f(x_s)\)).
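To make the scheme concrete, here is a minimal numerical sketch (not from the text) of projected subgradient descent with the fixed step \(\eta=\frac{R}{L\sqrt t}\), applied to \(f(x)=\|x-a\|_1\) over a Euclidean ball; the objective, the radius, and the tolerance are illustrative assumptions.

```python
import numpy as np

def projected_subgradient_descent(subgrad, project, x1, eta, t):
    # Fixed-step projected subgradient descent; the theorem bounds the
    # suboptimality of the average iterate (1/t) * sum_s x_s.
    x = np.array(x1, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(t):
        avg += x / t
        x = project(x - eta * subgrad(x))
    return avg

# minimize f(x) = ||x - a||_1 over the Euclidean ball of radius R = 2
a = np.array([1.0, -0.5, 0.3])              # minimizer, inside the ball
subgrad = lambda x: np.sign(x - a)          # a subgradient of f
def project(y):                             # Euclidean projection on the ball
    norm = np.linalg.norm(y)
    return y if norm <= 2.0 else 2.0 * y / norm

R, L, t = 2.0, np.sqrt(3.0), 10_000         # ||sign(.)||_2 <= sqrt(n) = L
x_bar = projected_subgradient_descent(subgrad, project, np.zeros(3), R / (L * np.sqrt(t)), t)
```

Since \(\|x_1-x^*\|\le R\) here, the theorem guarantees \(f(\bar x)-f(x^*)\le RL/\sqrt t\approx 0.035\) for this run.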
The projected subgradient descent algorithm with time-varying step size \((\eta _t)_{t\ge 1}\), that is
Let \(f\) be \(\alpha \)-strongly convex and \(L\)-Lipschitz on \(\mathcal{X}\). Then projected subgradient descent with \(\eta _s=\frac{2}{\alpha (s+1)}\) satisfies
Coming back to our original analysis of projected subgradient descent in Section 3.1 and using the strong convexity assumption one immediately obtains
Multiplying this inequality by \(s\) yields
Now sum the resulting inequality over \(s=1\) to \(s=t\), and apply Jensen’s inequality to obtain the claimed statement.
6.2.2 Gradient descent
Let \(f\) be convex and \(\beta \)-smooth on \(\mathbb {R}^n\). Then gradient descent with \(\eta =\frac{1}{\beta }\) satisfies
Using (3.5) and the definition of the method one has
In particular, denoting \(\delta _s=f(x_s)-f(x^*)\), this shows:
One also has by convexity
We will prove that \(\| x_s-x^*\| \) is decreasing with \(s\), which with the two above displays will imply
Let us see how to use this last inequality to conclude the proof. Let
then
Thus it only remains to show that \(\| x_s-x^*\| \) is decreasing with \(s\). Using Lemma 3.5 one immediately gets
We use this as follows (together with \(\nabla f(x^*)=0\))
which concludes the proof.
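The gradient descent scheme analyzed above is straightforward to run; the following sketch (a least-squares instance chosen for illustration, with \(\beta=\|A\|_2^2\)) checks the \(2\beta\| x_1-x^*\| ^2/(t-1)\) guarantee empirically.

```python
import numpy as np

def gradient_descent(grad, x1, beta, t):
    # Plain gradient descent with step eta = 1/beta; returns the iterate x_t.
    x = np.array(x1, dtype=float)
    for _ in range(t - 1):
        x = x - grad(x) / beta
    return x

# least squares: f(x) = 0.5 * ||Ax - b||^2, which is ||A||_2^2-smooth
rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
beta = np.linalg.norm(A, 2) ** 2            # spectral norm squared
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
x1 = np.zeros(3)
t = 500
x_t = gradient_descent(lambda x: A.T @ (A @ x - b), x1, beta, t)
```
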
Let \(f\) be a \(\beta \)-smooth function on \(\mathbb {R}^n\). Then for any \(x,y\in \mathbb {R}^n\), one has
We represent \(f(x)-f(y)\) as an integral, apply Cauchy–Schwarz and then \(\beta \)-smoothness:
Let \(f\) be such that (3.4) holds true. Then for any \(x,y\in \mathbb {R}^n\), one has
Let \(z=y-\frac{1}{\beta }\big(\nabla f(y)-\nabla f(x)\big)\). Then one has
Let \(x,y\in \mathcal{X}\), \(x^{+}=\Pi _{\mathcal{X}}\bigl(x-\tfrac {1}{\beta }\nabla f(x)\bigr)\), and \(g_{\mathcal{X}}(x)=\beta (x-x^{+})\). Then the following holds true:
We first observe that
Indeed the above inequality is equivalent to
which follows from Lemma 3.1. Now we use this inequality as follows to prove the lemma (we also use (3.4), which still holds true in the constrained case):
Let \(f\) be convex and \(\beta \)-smooth on \(\mathcal{X}\). Then projected gradient descent with \(\eta =\frac{1}{\beta }\) satisfies
Lemma 3.6 immediately gives
and
We will prove that \(\| x_s-x^*\| \) is decreasing with \(s\), which with the two above displays will imply
An easy induction shows that
Thus it only remains to show that \(\| x_s-x^*\| \) is decreasing with \(s\). Using Lemma 3.6 one can see that \(g_{\mathcal X}(x_s)^\top (x_s-x^*)\ge \frac{1}{2\beta }\| g_{\mathcal X}(x_s)\| ^2\) which implies
\(\square \)
Let \(f\) be \(\alpha \)-strongly convex and \(\beta \)-smooth on \(\mathcal X\). Then projected gradient descent with \(\eta =\frac{1}{\beta }\) satisfies for \(t\ge 0\),
Using (3.14) with \(y=x^*\) one directly obtains
which concludes the proof.
Let \(f\) be \(\beta \)-smooth and \(\alpha \)-strongly convex on \(\mathbb {R}^n\). Then for all \(x,y\in \mathbb {R}^n\), one has
Let \(\varphi (x)=f(x)-\frac{\alpha }{2}\| x\| ^2\). By definition of \(\alpha \)-strong convexity one has that \(\varphi \) is convex. Furthermore one can show that \(\varphi \) is \((\beta -\alpha )\)-smooth by proving (3.4) (and using that it implies smoothness). Thus using (3.6) one gets
which gives the claimed result with straightforward computations. (Note that if \(\alpha =\beta \) the smoothness of \(\varphi \) directly implies that \(\nabla f(x)-\nabla f(y)=\alpha (x-y)\) which proves the lemma in this case.)
Let \(f\) be \(\beta \)-smooth and \(\alpha \)-strongly convex on \(\mathbb {R}^n\). Then gradient descent with \(\eta =\frac{2}{\alpha +\beta }\) satisfies
First note that by \(\beta \)-smoothness (since \(\nabla f(x^*)=0\)) one has
Now using Lemma 3.11 one obtains
which concludes the proof.
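On a quadratic the linear convergence with step \(\eta=\frac{2}{\alpha+\beta}\) can be seen directly: each coordinate contracts by a factor at most \(\frac{\beta-\alpha}{\beta+\alpha}\). The diagonal instance below is an illustrative assumption.

```python
import numpy as np

# quadratic f(x) = 0.5 * x^T Q x with alpha = 1, beta = 10
Q = np.diag([1.0, 4.0, 10.0])
alpha, beta = 1.0, 10.0
grad = lambda x: Q @ x
x = np.array([5.0, -3.0, 2.0])
eta = 2.0 / (alpha + beta)
for _ in range(100):
    x = x - eta * grad(x)
# each coordinate is multiplied by (1 - eta * q_i) per step, and
# |1 - eta * q_i| <= (beta - alpha) / (beta + alpha) = 9/11 here
```
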
6.2.3 Nesterov’s accelerated gradient descent
Nesterov’s accelerated gradient descent can be described as follows: Start at an arbitrary initial point \(x_1=y_1\) and then iterate the following equations for \(t\ge 1\),
Let \(f\) be \(\alpha \)-strongly convex and \(\beta \)-smooth, then Nesterov’s accelerated gradient descent satisfies
We define \(\alpha \)-strongly convex quadratic functions \(\Phi _s\), \(s\ge 1\) by
induction as follows:
Intuitively \(\Phi _s\) becomes a finer and finer approximation (from below) to \(f\) in the following sense:
The above inequality can be proved immediately by induction, using the fact that by \(\alpha \)-strong convexity one has
Equation (3.18) by itself does not say much; for it to be useful one needs to understand how far below \(f\) the function \(\Phi _s\) lies. The following inequality answers this question:
The rest of the proof is devoted to showing that (3.19) holds true, but first let us see how to combine (3.18) and (3.19) to obtain the rate given by the theorem (we use that by \(\beta \)-smoothness one has \(f(x)-f(x^*)\le \frac{\beta }{2}\| x-x^*\| ^2\)):
We now prove (3.19) by induction (note that it is true at \(s=1\) since \(x_1=y_1\)). Let \(\Phi _s^*=\min _{x\in \mathbb {R}^n}\Phi _s(x)\). Using the definition of \(y_{s+1}\) (and \(\beta \)-smoothness), convexity, and the induction hypothesis, one gets
Thus we now have to show that
To prove this inequality we have to understand better the functions \(\Phi _s\). First note that \(\nabla ^2\Phi _s(x)=\alpha I_n\) (immediate by induction) and thus \(\Phi _s\) has to be of the following form:
for some \(v_s\in \mathbb {R}^n\). Now observe that by differentiating (3.17) and using the above form of \(\Phi _s\) one obtains
In particular \(\Phi _{s+1}\) is by definition minimized at \(v_{s+1}\) which can now be defined by induction using the above identity, precisely:
Using the form of \(\Phi _s\) and \(\Phi _{s+1}\), as well as the original definition (3.17) one gets the following identity by evaluating \(\Phi _{s+1}\) at \(x_s\):
Note that thanks to (3.21) one has
which combined with (3.22) yields
Finally we show by induction that \(v_s-x_s=\sqrt\kappa (x_s-y_s)\), which concludes the proof of (3.20) and thus also concludes the proof of the theorem:
where the first equality comes from (3.21), the second from the induction hypothesis, the third from the definition of \(y_{s+1}\) and the last one from the definition of \(x_{s+1}\).
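The two-line scheme described at the start of this subsection can be sketched as follows; the ill-conditioned quadratic and the iteration count are illustrative assumptions, not from the text.

```python
import numpy as np

def nesterov_strongly_convex(grad, x1, alpha, beta, t):
    # Accelerated gradient descent for alpha-strongly convex, beta-smooth f:
    # a gradient step from x_t gives y_{t+1}, then a momentum step with
    # coefficient (sqrt(kappa)-1)/(sqrt(kappa)+1) gives x_{t+1}.
    kappa = beta / alpha
    mom = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    x = y = np.array(x1, dtype=float)
    for _ in range(t):
        y_next = x - grad(x) / beta
        x = (1 + mom) * y_next - mom * y
        y = y_next
    return y

# ill-conditioned quadratic: alpha = 1, beta = 100, so kappa = 100
Q = np.diag([1.0, 100.0])
y_t = nesterov_strongly_convex(lambda x: Q @ x, np.array([10.0, 10.0]), 1.0, 100.0, 300)
```

With \(\kappa=100\) the accelerated rate \((1-1/\sqrt\kappa)^t\) drives the error below machine-level scales in a few hundred iterations, where plain gradient descent would need thousands.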
We now describe Nesterov’s accelerated gradient descent for the case \(\alpha = 0\). First we define the following sequences:
(Note that \(\gamma _t \le 0\).) Now the algorithm is simply defined by the following equations, with \(x_1 = y_1\) an arbitrary initial point,
Let \(f\) be a convex and \(\beta \)-smooth function, then Nesterov’s accelerated gradient descent satisfies
Using the unconstrained version of Lemma 3.6 one obtains
Similarly we also get
Now multiplying (3.23) by \((\lambda _s - 1)\) and adding the result to (3.24), one obtains with \(\delta _s = f(y_s) - f(x^*)\),
Multiplying this inequality by \(\lambda _s\) and using that by definition \(\lambda _{s-1}^2 = \lambda _s^2 - \lambda _s\), as well as the elementary identity \(2a^\top b - \| a\| ^2 = \| b\| ^2 - \| b-a\| ^2\),
one obtains
Next remark that, by definition, one has
Putting together (3.25) and (3.26) one gets with \(u_s=\lambda _s x_s-(\lambda _s-1)y_s-x^*\),
Summing these inequalities from \(s=1\) to \(s=t-1\) one obtains:
By induction it is easy to see that \(\lambda _{t-1}\ge \tfrac {t}{2}\) which concludes the proof.
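The \(\lambda\)/\(\gamma\) recursion above translates directly into code; the least-squares test problem below is an illustrative assumption, and the test simply checks the \(2\beta\|x_1-x^*\|^2/t^2\) guarantee of the theorem.

```python
import numpy as np

def nesterov_smooth(grad, x1, beta, t):
    # Accelerated gradient descent for convex, beta-smooth f (alpha = 0),
    # using lambda_t = (1 + sqrt(1 + 4 lambda_{t-1}^2)) / 2 and
    # gamma_t = (1 - lambda_t) / lambda_{t+1} (note gamma_t <= 0).
    lam = 1.0                                # lambda_1 (from lambda_0 = 0)
    x = y = np.array(x1, dtype=float)
    for _ in range(t):
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        gamma = (1 - lam) / lam_next
        y_next = x - grad(x) / beta          # gradient step
        x = (1 - gamma) * y_next + gamma * y # momentum step
        y, lam = y_next, lam_next
    return y

# least squares: f(x) = 0.5 * ||Ax - b||^2
rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
beta = np.linalg.norm(A, 2) ** 2
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
t = 300
y_t = nesterov_smooth(lambda x: A.T @ (A @ x - b), np.zeros(10), beta, t)
```
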
6.2.4 Frank–Wolfe
The conditional gradient descent, introduced in Frank and Wolfe [1956], performs the following update for \(t\ge 1\), where \((\gamma _s)_{s\ge 1}\) is a fixed sequence,
Let \(f\) be a convex and \(\beta \)-smooth function w.r.t. some norm \(\| \cdot \| \), \(R=\sup _{x,y\in \mathcal{X}}\| x-y\| \), and \(\gamma _s=\frac{2}{s+1}\) for \(s\ge 1\). Then for any \(t\ge 2\), one has
The following inequalities hold true, using respectively \(\beta \)-smoothness (it can easily be seen that (3.4) holds true for smoothness in an arbitrary norm), the definition of \(x_{s+1}\), the definition of \(y_s\), and the convexity of \(f\):
Rewriting this inequality in terms of \(\delta _s=f(x_s)-f(x^*)\) one obtains
A simple induction using that \(\gamma _s=\frac{2}{s+1}\) finishes the proof (note that the initialization is done at step \(2\) with the above inequality yielding \(\delta _2\le \frac{\beta }{2}R^2\)).
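A key appeal of conditional gradient descent is that it only needs a linear minimization oracle over \(\mathcal X\), not a projection. The sketch below (an illustrative instance on the probability simplex, where the oracle returns a vertex) is not from the text.

```python
import numpy as np

def frank_wolfe(grad, lmo, x1, t):
    # Conditional gradient descent with gamma_s = 2 / (s + 1):
    # move toward the vertex returned by the linear minimization oracle.
    x = np.array(x1, dtype=float)
    for s in range(1, t + 1):
        y = lmo(grad(x))                 # argmin over X of <grad f(x), y>
        gamma = 2.0 / (s + 1)
        x = (1 - gamma) * x + gamma * y  # stays in X by convexity
    return x

# minimize f(x) = ||x - a||^2 over the probability simplex
a = np.array([0.1, 0.2, 0.7])            # lies in the simplex, so f* = 0
grad = lambda x: 2 * (x - a)
def lmo(g):                              # vertex e_i with smallest gradient coord
    y = np.zeros_like(g)
    y[np.argmin(g)] = 1.0
    return y

x_fw = frank_wolfe(grad, lmo, np.ones(3) / 3, 2000)
```

Note that the iterates are sparse combinations of at most \(t\) vertices, which is often the main reason to prefer this method over projected schemes.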
6.3 Almost dimension-free convex optimization in non-Euclidean spaces
6.3.1 Mirror descent
Let \(\mathcal{D}\subset \mathbb {R}^n\) be a convex open set such that \(\mathcal{X}\) is included in its closure, that is \(\mathcal{X}\subset \overline{\mathcal{D}}\), and \(\mathcal{X}\cap \mathcal{D}\neq \varnothing \). We say that \(\Phi :\mathcal{D}\to \mathbb {R}\) is a mirror map if it satisfies the following properties:
\(\Phi \) is strictly convex and differentiable.
The gradient of \(\Phi \) takes all possible values, that is \(\nabla \Phi (\mathcal{D})=\mathbb {R}^n\).
The gradient of \(\Phi \) diverges on the boundary of \(\mathcal{D}\), that is
\[ \lim _{x\to \partial \mathcal{D}}\| \nabla \Phi (x)\| =+\infty . \]
In mirror descent the projection is done via the Bregman divergence associated to \(\Phi \). Precisely one defines
Let \(x\in \mathcal{X}\cap \mathcal{D}\) and \(y\in \mathcal{D}\), then
which also implies
The proof is an immediate corollary of Proposition 1.3 together with the fact that \(\nabla _x D_{\Phi }(x,y)=\nabla \Phi (x)-\nabla \Phi (y)\).
The mirror descent strategy based on a mirror map \(\Phi \) is as follows: let \(x_1\in \operatorname*{arg\, min}_{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)\). Then for \(t\ge 1\), let \(y_{t+1}\in \mathcal{D}\) such that
and
Let \(\Phi \) be a mirror map \(\rho \)-strongly convex on \(\mathcal{X}\cap \mathcal{D}\) w.r.t. \(\| \cdot \| \). Let \(R^2=\sup _{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)-\Phi (x_1)\), and \(f\) be convex and \(L\)-Lipschitz w.r.t. \(\| \cdot \| \). Then mirror descent with \(\eta =\dfrac {R}{L}\sqrt{\dfrac {2\rho }{t}}\) satisfies
Let \(x\in \mathcal X\cap \mathcal{D}\). The claimed bound will be obtained by taking a limit \(x\to x^*\). Now by convexity of \(f\), the definition of mirror descent, equation (4.1), and Lemma 4.1, one has
The term \(D_\Phi (x,x_s)-D_\Phi (x,x_{s+1})\) will lead to a telescopic sum when summing over \(s=1\) to \(s=t\), and it remains to bound the other term as follows using \(\rho \)-strong convexity of the mirror map and \(az-bz^2\le \tfrac {a^2}{4b},\ \forall z\in \mathbb R\):
We proved
which concludes the proof up to trivial computation.
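A canonical instantiation is the negative-entropy mirror map on the simplex, for which the update becomes multiplicative ("exponentiated gradient"): here \(\rho=1\) w.r.t. \(\|\cdot\|_1\), \(x_1\) is the uniform distribution, and \(R^2=\log n\). The linear objective below is an illustrative assumption.

```python
import numpy as np

def entropic_mirror_descent(grad, n, L, t):
    # Mirror descent with Phi(x) = sum_i x_i log x_i on the simplex;
    # the mirror step plus Bregman projection reduce to a multiplicative
    # update followed by normalization.
    x = np.ones(n) / n                            # argmin of Phi on the simplex
    eta = (np.sqrt(np.log(n)) / L) * np.sqrt(2.0 / t)   # (R/L) sqrt(2 rho / t)
    avg = np.zeros(n)
    for _ in range(t):
        avg += x / t
        x = x * np.exp(-eta * grad(x))
        x /= x.sum()
    return avg

# linear objective f(x) = <c, x>; L = max_i |c_i| w.r.t. the l1 norm
c = np.array([0.3, 0.1, 0.5, 0.2])
x_bar = entropic_mirror_descent(lambda x: c, 4, L=0.5, t=5000)
```

For this run the theorem gives \(f(\bar x)-f(x^*)\le RL\sqrt{2/(\rho t)}\approx 0.012\), essentially dimension-free (only \(\sqrt{\log n}\) enters).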
6.3.2 Dual averaging
Let \(\Phi \) be a mirror map \(\rho \)-strongly convex on \(\mathcal{X}\cap \mathcal{D}\) w.r.t. \(\| \cdot \| \). Let \(R^2=\sup _{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)-\Phi (x_1)\), and \(f\) be convex and \(L\)-Lipschitz w.r.t. \(\| \cdot \| \). Then dual averaging with \(\eta =\dfrac {R}{L}\sqrt{\dfrac {\rho }{2t}}\) satisfies
We define \(\psi _t(x)=\eta \sum _{s=1}^t g_s^\top x+\Phi (x)\), so that \(x_t\in \arg \min _{x\in \mathcal{X}\cap \mathcal{D}}\psi _{t-1}(x)\). Since \(\Phi \) is \(\rho \)-strongly convex one clearly has that \(\psi _t\) is \(\rho \)-strongly convex, and thus
where the second inequality comes from the first order optimality condition for \(x_{t+1}\) (see Proposition 1.3). Next observe that
Putting together the two above displays and using Cauchy-Schwarz (with the assumption \(\| g_t\| _* \le L\)) one obtains
In particular this shows that \(\| x_{t+1}-x_t\| \le \frac{2\eta L}{\rho }\) and thus with the above display
Now we claim that for any \(x\in \mathcal X\cap \mathcal D\),
which would clearly conclude the proof thanks to (4.7) and straightforward computations. Equation (4.8) is equivalent to
and we now prove the latter equation by induction. At \(t=0\) it is true since \(x_1\in \arg \min _{x\in \mathcal X\cap \mathcal D}\Phi (x)\). The following inequalities prove the inductive step, where we use the induction hypothesis at \(x=x_{t+1}\) for the first inequality, and the definition of \(x_{t+1}\) for the second inequality:
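With the entropy mirror map on the simplex, dual averaging ("lazy" mirror descent) simply exponentiates the negative running sum of gradients. The linear objective and tolerances below are illustrative assumptions.

```python
import numpy as np

# dual averaging with the entropy mirror map on the simplex:
# x_{t+1} is proportional to exp(-eta * sum of past gradients)
c = np.array([0.3, 0.1, 0.5, 0.2])        # linear objective <c, x>
n, t, L = 4, 5000, 0.5                    # L = max |c_i| w.r.t. the l1 norm
eta = (np.sqrt(np.log(n)) / L) * np.sqrt(1.0 / (2 * t))   # (R/L) sqrt(rho/(2t))
x = np.ones(n) / n                        # argmin of the entropy on the simplex
g_sum = np.zeros(n)
x_bar = np.zeros(n)
for _ in range(t):
    x_bar += x / t
    g_sum += c                            # gradient of the linear objective
    x = np.exp(-eta * g_sum)              # lazy update from the dual sum
    x /= x.sum()
gap = c @ x_bar - c.min()
```

Unlike mirror descent, the state here is the accumulated dual vector `g_sum`, which is why the method is sometimes preferred in distributed or online settings.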
6.3.3 Mirror prox
Mirror prox is described by the following equations:
Let \(\Phi \) be a mirror map \(\rho \)-strongly convex on \(\mathcal{X}\cap \mathcal{D}\) w.r.t. \(\| \cdot \| \). Let \(R^{2}=\sup _{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)-\Phi (x_{1})\), and \(f\) be convex and \(\beta \)-smooth w.r.t. \(\| \cdot \| \). Then mirror prox with \(\eta =\frac{\rho }{\beta }\) satisfies
Let \(x\in \mathcal{X}\cap \mathcal{D}\). We write
We will now bound separately these three terms. For the first one, using the definition of the method, Lemma 4.1, and equation (4.1), one gets
For the second term, using the same properties as above, one gets
Finally for the last term, using Cauchy–Schwarz, \(\beta \)-smoothness, and \(2ab\le a^2+b^2\) one gets
Thus summing up these three terms and using that \(\eta =\frac{\rho }{\beta }\) one gets
The proof is concluded with straightforward computations.
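With the Euclidean mirror map \(\Phi=\frac12\|\cdot\|^2\), mirror prox reduces to the extragradient method: a projected look-ahead step followed by a correction using the gradient at the look-ahead point. The constrained least-squares instance below is an illustrative assumption; the test checks the \(\beta R^2/t\) rate of the theorem.

```python
import numpy as np

def extragradient(grad, project, x1, eta, t):
    # Euclidean mirror prox: y_{t+1} is a projected gradient step from x_t,
    # and x_{t+1} reuses the gradient evaluated at y_{t+1}.
    x = np.array(x1, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(t):
        y = project(x - eta * grad(x))    # look-ahead point y_{t+1}
        avg += y / t                      # the theorem bounds the average of the y's
        x = project(x - eta * grad(y))    # correction step
    return avg

# f(x) = 0.5 * ||Ax - b||^2 over the unit Euclidean ball, with the
# unconstrained minimizer placed inside the ball so that f* = 0
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))
x0 = np.array([0.3, -0.2, 0.1, 0.2])      # norm < 1
b = A @ x0
beta = np.linalg.norm(A, 2) ** 2
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
def project(y):
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

t = 1000
y_bar = extragradient(lambda x: A.T @ (A @ x - b), project, np.zeros(4), 1.0 / beta, t)
```
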
6.4 Beyond the black-box model
The function \(f\) is convex and \(\beta \)-smooth, and \(g\) is convex. With \(\eta =\frac{1}{\beta }\) one has
Let \(x_1 = y_1\) be an arbitrary initial point, and
The function \(f\) is convex and \(\beta \)-smooth, and \(g\) is convex. The rate of convergence of FISTA on \(f+g\) is similar to the one of Nesterov’s accelerated gradient descent on \(f\), more precisely:
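As a concrete instance, take the lasso: \(f(x)=\frac12\|Ax-b\|^2\) is \(\beta\)-smooth and \(g(x)=\lambda\|x\|_1\) has an explicit proximal operator (soft-thresholding). The sketch below is illustrative (problem data, \(\lambda\), and iteration count are assumptions, not from the text).

```python
import numpy as np

def soft_threshold(v, s):
    # proximal operator of s * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def fista(A, b, lam, t):
    # FISTA sketch for the lasso: prox-gradient step from y, then a
    # momentum step with coefficient (lambda_t - 1) / lambda_{t+1}.
    beta = np.linalg.norm(A, 2) ** 2          # smoothness constant of f
    x = y = np.zeros(A.shape[1])
    lam_t = 1.0
    for _ in range(t):
        x_next = soft_threshold(y - A.T @ (A @ y - b) / beta, lam / beta)
        lam_next = (1 + np.sqrt(1 + 4 * lam_t ** 2)) / 2
        y = x_next + ((lam_t - 1) / lam_next) * (x_next - x)
        x, lam_t = x_next, lam_next
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
lam = 0.5
x_hat = fista(A, b, lam, t=500)
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
```
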
6.4.1 Newton’s method
\(F:\operatorname {int}(\mathcal{X})\to \mathbb {R}\) is a barrier for \(\mathcal{X}\) if
Assume that \(f\) has a Lipschitz Hessian, that is
Let \(x^*\) be a local minimum of \(f\) with strictly positive Hessian, that is \(\nabla ^2 f(x^*)\succeq \mu I_n\), \(\mu {\gt}0\). Suppose that the initial starting point \(x_0\) of Newton’s method is such that
Then Newton’s method is well-defined and converges to \(x^*\) at a quadratic rate:
We use the following simple formula, for \(x,h\in \mathbb {R}^n\),
Now note that \(\nabla f(x^*)=0\), and thus with the above formula one obtains
which allows us to write:
In particular one has
Using the Lipschitz property of the Hessian one immediately obtains that
Using again the Lipschitz property of the Hessian (note that \(\| A-B\| \le s\iff sI_n\succeq A-B\succeq -sI_n\)), the hypothesis on \(x^*\), and an induction hypothesis that \(\| x_k-x^*\| \le \frac{\mu }{2M}\), one has
which concludes the proof.
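The quadratic convergence is easy to observe numerically once the starting point is in the basin described by the theorem. The separable test function below (with Lipschitz Hessian near the optimum) is an illustrative assumption.

```python
import numpy as np

def newton(grad, hess, x0, t):
    # Newton's method: x_{k+1} = x_k - [Hessian f(x_k)]^{-1} grad f(x_k)
    x = np.array(x0, dtype=float)
    for _ in range(t):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# f(x) = sum_i cosh(x_i): gradient sinh(x), Hessian diag(cosh(x)),
# minimizer x* = 0; each Newton step maps x_i to x_i - tanh(x_i)
grad = lambda x: np.sinh(x)
hess = lambda x: np.diag(np.cosh(x))
x_k = newton(grad, hess, np.array([1.0, -0.5]), 6)
```

Starting at distance \(1\), the error drops roughly as \(10^{-1}\), \(10^{-3}\), \(10^{-7}\), \(\dots\): the number of correct digits doubles each step, as the theorem predicts.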
6.4.2 Interior point methods
Let \(\mathcal X\) be a convex set with non-empty interior, and \(f\) a \(C^3\) convex function defined on \(\operatorname {int}(\mathcal X)\). Then \(f\) is self-concordant (with constant \(M\)) if for all \(x\in \operatorname {int}(\mathcal X)\), \(h\in \mathbb R^n\),
We say that \(f\) is standard self-concordant if \(f\) is self-concordant with constant \(M=2\).
Let \(f\) be a standard self-concordant function on \(\mathcal{X}\). For \(x\in \operatorname {int}(\mathcal{X})\), we say that \(\lambda _f(x)=\| \nabla f(x)\| _{x}^*\) is the Newton decrement of \(f\) at \(x\).
An important inequality is that for \(x\) such that \(\lambda _f(x){\lt}1\), and \(x^*=\operatorname*{arg\, min}f(x)\), one has
Let \(f\) be a standard self-concordant function on \(\mathcal{X}\), and \(x\in \operatorname {int}(\mathcal{X})\) such that \(\lambda _f(x)\le 1/4\), then
\(F\) is a \(\nu \)-self-concordant barrier if it is a standard self-concordant function, and it is \(\tfrac {1}{\nu }\)-exp-concave.
Let \(\mathcal X\subset \mathbb R^n\) be a closed convex set with non-empty interior. There exists \(F\) which is a \((c\, n)\)-self-concordant barrier for \(\mathcal X\) (where \(c\) is some universal constant).
A key property of \(\nu \)-self-concordant barriers is the following inequality:
see [Equation (4.2.17), Nesterov (2004a)]. More generally, using the above inequality together with (5.5) one obtains
In the next section we describe a precise algorithm based on the ideas we developed above. As we will see, one cannot ensure that the iterates are exactly on the central path, and thus it is useful to generalize the identity (5.7) to a point \(x\) close to the central path. We do this as follows:
We can now formally describe and analyze the most basic IPM called the path-following scheme. Let \(F\) be a \(\nu \)-self-concordant barrier for \(\mathcal{X}\). Assume that one can find \(x_0\) such that \(\lambda _{F_{t_0}}(x_0)\le 1/4\) for some small value \(t_0{\gt}0\) (we describe a method to find \(x_0\) at the end of this subsection). Then for \(k\ge 0\), let
The path-following scheme described above satisfies
We show that the iterates \((x_k)_{k\ge 0}\) remain close to the central path \((x^*(t_k))_{k\ge 0}\). Precisely one can easily prove by induction that
Indeed using Theorem 5.4 and equation (5.12) one immediately obtains
where we used in the last inequality that \(t_{k+1}/t_k=1+\frac{1}{13\sqrt{\nu }}\) and \(\nu \ge 1\). Thus using (5.11) one obtains
Observe that \(t_k=\left(1+\frac{1}{13\sqrt{\nu }}\right)^k t_0\), which finally yields
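The path-following scheme can be sketched on a toy problem: minimizing \(c^\top x\) over the box \([0,1]^2\) with the logarithmic barrier \(F(x)=-\sum_i\log x_i-\sum_i\log(1-x_i)\) (a \(\nu\)-self-concordant barrier with \(\nu=4\) here). The problem data, the initial \(t_0\), and the iteration count are illustrative assumptions; as in the scheme above, each increase of \(t\) by the factor \(1+\frac{1}{13\sqrt\nu}\) is followed by a single Newton step.

```python
import numpy as np

# path-following sketch: minimize c^T x over the box [0,1]^2
c = np.array([1.0, -1.0])                       # optimum at x* = (0, 1)
F_grad = lambda x: -1.0 / x + 1.0 / (1.0 - x)   # gradient of the log barrier
F_hess = lambda x: np.diag(1.0 / x ** 2 + 1.0 / (1.0 - x) ** 2)

nu = 4.0                                        # one unit per box constraint
x = np.array([0.5, 0.5])                        # analytic center of the box
t = 0.1                                         # small t_0: x_0 is near x*(t_0)
for _ in range(500):
    t *= 1 + 1 / (13 * np.sqrt(nu))             # t_{k+1} = (1 + 1/(13 sqrt(nu))) t_k
    g = t * c + F_grad(x)                       # gradient of F_t(x) = t c^T x + F(x)
    x = x - np.linalg.solve(F_hess(x), g)       # single Newton step per outer step
```

Because \(t\) grows geometrically while the Newton decrement stays small, the iterate tracks the central path and the suboptimality shrinks like \(\nu/t_k\).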