Convex Eval

6 Algorithms

6.1 Convex optimization in finite dimension

6.1.1 Center of gravity

Definition 6.1.1
#

We consider the following simple iterative algorithm: let \(S_1=\mathcal{X}\), and for \(t\ge 1\) do the following:

  1. Compute

    \begin{equation} c_t=\frac{1}{\operatorname {vol}(S_t)}\int _{x\in S_t} x\, dx. \label{eq:2.1} \end{equation}

  2. Query the first order oracle at \(c_t\) and obtain \(w_t\in \partial f(c_t)\). Let

    \[ S_{t+1}=S_t\cap \{ x\in \mathbb {R}^n:(x-c_t)^\top w_t\le 0\} . \]

If stopped after \(t\) queries to the first order oracle then we use \(t\) queries to a zeroth order oracle to output

\[ x_t\in \operatorname*{arg\, min}_{1\le r\le t} f(c_r). \]

This procedure is known as the center of gravity method; it was discovered independently on both sides of the Wall by Levin [1965] and Newman [1965].
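
To make the procedure concrete, here is a minimal Python sketch of the method on a box. Since computing the exact centroid of \(S_t\) is intractable in general, the sketch approximates \(c_t\) by rejection sampling over the initial box; the function names (`center_of_gravity`, `subgrad`) and the example objective are ours, not part of the original method.

```python
import numpy as np

def center_of_gravity(subgrad, lo, hi, T, n_samples=20000, rng=None):
    """Sketch of the center of gravity method on the box [lo, hi].

    The exact centroid of S_t is intractable in general; here it is
    approximated by rejection sampling over the initial box, keeping the
    points that satisfy every halfspace cut collected so far."""
    rng = np.random.default_rng(rng)
    n = len(lo)
    cuts, centers = [], []              # cuts: (w_t, c_t) with (x - c_t)^T w_t <= 0
    for _ in range(T):
        pts = rng.uniform(lo, hi, size=(n_samples, n))
        mask = np.ones(n_samples, dtype=bool)
        for w, c in cuts:
            mask &= (pts - c) @ w <= 0
        if not mask.any():              # too few surviving samples to estimate c_t
            break
        c_t = pts[mask].mean(axis=0)    # approximate centroid of S_t
        w_t = subgrad(c_t)              # first order oracle
        cuts.append((w_t, c_t))
        centers.append(c_t)
    return centers

# Usage: f(x) = ||x - x_star||_1 on the box [-1, 1]^2.
if __name__ == "__main__":
    x_star = np.array([0.3, -0.5])
    f = lambda x: np.abs(x - x_star).sum()
    subgrad = lambda x: np.sign(x - x_star)           # a subgradient of f
    lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
    centers = center_of_gravity(subgrad, lo, hi, T=12, rng=0)
    x_out = min(centers, key=f)                       # zeroth order oracle pass
    print(f(x_out) - f(x_star))
```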

Theorem 6.1.1
#

The center of gravity method satisfies

\[ f(x_t)-\min _{x\in \mathcal{X}} f(x)\le 2B\left(1-\frac{1}{e}\right)^{t/n}. \]
Proof

Let \(x^*\) be such that \(f(x^*)=\min _{x\in \mathcal{X}}f(x)\). Since \(w_t\in \partial f(c_t)\) one has

\[ f(c_t)-f(x)\le w_t^\top (c_t-x), \]

and thus

\[ S_t\setminus S_{t+1}\subset \{ x\in \mathcal{X}:(x-c_t)^\top w_t{\gt}0\} \subset \{ x\in \mathcal{X}:f(x){\gt}f(c_t)\} ,\tag {2.2} \]

which clearly implies that one can never remove the optimal point from the sets under consideration, that is \(x^*\in S_t\) for any \(t\). Without loss of generality we can assume that we always have \(w_t\neq 0\), for otherwise one would have \(f(c_t)=f(x^*)\), which immediately concludes the proof. Now using that \(w_t\neq 0\) for any \(t\) and Lemma 6.1.1 (applied to \(S_r\) recentered at its center of gravity \(c_r\), for each \(r\le t\)) one obtains

\[ \mathrm{vol}(S_{t+1})\le \left(1-\frac{1}{e}\right)^t\mathrm{vol}(\mathcal{X}). \]

For \(\varepsilon \in [0,1]\), let \(\mathcal{X}_\varepsilon =\{ (1-\varepsilon )x^*+\varepsilon x,\ x\in \mathcal{X}\} \). Note that \(\mathrm{vol}(\mathcal{X}_\varepsilon )=\varepsilon ^n\mathrm{vol}(\mathcal{X})\). These volume computations show that for \(\varepsilon {\gt}\left(1-\frac{1}{e}\right)^{t/n}\) one has \(\mathrm{vol}(\mathcal{X}_\varepsilon ){\gt}\mathrm{vol}(S_{t+1})\). In particular this implies that for \(\varepsilon {\gt}\left(1-\frac{1}{e}\right)^{t/n}\), there must exist a time \(r\in \{ 1,\dots ,t\} \), and \(x_\varepsilon \in \mathcal{X}_\varepsilon \), such that \(x_\varepsilon \in S_r\) and \(x_\varepsilon \notin S_{r+1}\). In particular by (2.2) one has \(f(c_r){\lt}f(x_\varepsilon )\). On the other hand by convexity of \(f\) one clearly has \(f(x_\varepsilon )\le f(x^*)+2\varepsilon B\). This concludes the proof.

Lemma 6.1.1
#

(Grünbaum [1960]). Let \(\mathcal{K}\) be a centered convex set, i.e., \(\int _{x\in \mathcal{K}} x\, dx = 0\), then for any \(w\in \mathbb {R}^n\), \(w\neq 0\), one has

\[ \operatorname {Vol}\big(\mathcal{K}\cap \{ x\in \mathbb {R}^n : x^\top w\ge 0\} \big)\ge \frac{1}{e}\operatorname {Vol}(\mathcal{K}). \]

6.1.2 Ellipsoid method

Lemma 6.1.2

Let \(\mathcal{E}_0=\{ x\in \mathbb {R}^n:(x-c_0)^\top H_0^{-1}(x-c_0)\le 1\} \). For any \(w\in \mathbb {R}^n\), \(w\neq 0\), there exists an ellipsoid \(\mathcal{E}\) such that

\[ \mathcal{E}\supset \{ x\in \mathcal{E}_0:w^\top (x-c_0)\le 0\} , \tag {2.3} \]

and

\[ \operatorname {vol}(\mathcal{E})\le \exp \! \left(-\frac{1}{2n}\right)\operatorname {vol}(\mathcal{E}_0). \tag {2.4} \]

Furthermore for \(n\ge 2\) one can take \(\mathcal{E}=\{ x\in \mathbb {R}^n:(x-c)^\top H^{-1}(x-c)\le 1\} \) where

\[ c=c_0-\frac{1}{n+1}\frac{H_0w}{\sqrt{w^\top H_0 w}}, \tag {2.5} \]
\[ H=\frac{n^2}{n^2-1}\left(H_0-\frac{2}{n+1}\frac{H_0 ww^\top H_0}{w^\top H_0 w}\right). \tag {2.6} \]
Proof

For \(n=1\) the result is obvious; in fact we even have \(\operatorname {vol}(\mathcal{E})\le \tfrac {1}{2}\operatorname {vol}(\mathcal{E}_0)\).

For \(n\ge 2\) one can simply verify that the ellipsoid given by (2.5) and (2.6) satisfies the required properties (2.3) and (2.4). Rather than bluntly doing these computations we will show how to derive (2.5) and (2.6). As a by-product this will also show that this ellipsoid is the unique ellipsoid of minimal volume satisfying (2.3). Let us first focus on the case where \(\mathcal{E}_0\) is the Euclidean ball \(\mathcal B=\{ x\in \mathbb {R}^n:x^\top x\le 1\} \), and assume without loss of generality that \(w\) is a unit norm vector.

A quick picture shows that it makes sense to look for an ellipsoid \(\mathcal E\) centered at \(c=-tw\), with \(t\in [0,1]\) (presumably \(t\) will be small), such that one principal direction is \(w\) (with inverse squared semi-axis \(a{\gt}0\)), and the other principal directions are all orthogonal to \(w\) (with the same inverse squared semi-axes \(b{\gt}0\)). In other words we are looking for \(\mathcal E=\{ x:(x-c)^\top H^{-1}(x-c)\le 1\} \) with

\[ c=-tw,\quad \text{and}\quad H^{-1}=aww^\top +b(I_n-ww^\top ). \]

Now we have to express the constraint that \(\mathcal E\) should contain the half Euclidean ball \(\{ x\in \mathcal B:x^\top w\le 0\} \). Since we are also looking for \(\mathcal E\) to be as small as possible, it makes sense to ask for \(\mathcal E\) to "touch" the Euclidean ball, both at \(x=-w\) and at the equator \(\partial \mathcal B\cap w^\perp \). The former condition can be written as:

\[ (-w-c)^\top H^{-1}(-w-c)=1\Longleftrightarrow (t-1)^2a=1, \]

while the latter is expressed as:

\[ \forall y\in \partial \mathcal B\cap w^\perp ,\ (y-c)^\top H^{-1}(y-c)=1 \Longleftrightarrow b+t^2a=1. \]

As one can see from the above two equations, we are still free to choose any value for \(t\in [0,1/2)\) (the fact that we need \(t{\lt}1/2\) comes from \(b=1-\left(\dfrac {t}{1-t}\right)^2{\gt}0\)). Quite naturally we take the value that minimizes the volume of the resulting ellipsoid. Note that

\[ \frac{\operatorname {vol}(\mathcal E)}{\operatorname {vol}(\mathcal B)} =\frac{1}{\sqrt a}\Big(\frac{1}{\sqrt b}\Big)^{\, n-1} =\frac{1}{\sqrt{\dfrac {1}{(1-t)^2}\left(1-\left(\dfrac {t}{1-t}\right)^2\right)^{\, n-1}}} =\frac{1}{\sqrt{f\big(\frac{1}{1-t}\big)}}, \]

where \(f(h)=h^2(2h-h^2)^{\, n-1}\). Elementary computations show that the maximum of \(f\) (on \([1,2]\)) is attained at \(h=1+\frac{1}{n}\) (which corresponds to \(t=\frac{1}{n+1}\)), and the value is

\[ \Big(1+\frac{1}{n}\Big)^2\Big(1-\frac{1}{n^2}\Big)^{\, n-1}\ge \exp \! \big(\tfrac {1}{n}\big), \]

where the lower bound follows again from elementary computations. Thus we showed that, for \(\mathcal{E}_0=\mathcal{B}\), (2.3) and (2.4) are satisfied with the ellipsoid given by the set of points \(x\) satisfying:

\[ \left(x+\frac{w/\| w\| _2}{n+1}\right)^{\! \top } \! \left(\frac{n^2-1}{n^2}I_n+\frac{2(n+1)}{n^2}\frac{ww^{\top }}{\| w\| _2^2}\right) \! \left(x+\frac{w/\| w\| _2}{n+1}\right)\le 1. \tag {2.7} \]

We consider now an arbitrary ellipsoid \(\mathcal{E}_0=\{ x\in \mathbb {R}^n:(x-c_0)^{\top }H_0^{-1}(x-c_0)\le 1\} \). Let \(\Phi (x)=c_0+H_0^{1/2}x\), then clearly \(\mathcal{E}_0=\Phi (\mathcal{B})\) and \(\{ x:w^{\top }(x-c_0)\le 0\} =\Phi (\{ x:(H_0^{1/2}w)^{\top }x\le 0\} )\). Thus in this case the image by \(\Phi \) of the ellipsoid given in (2.7) with \(w\) replaced by \(H_0^{1/2}w\) will satisfy (2.3) and (2.4). It is easy to see that this corresponds to an ellipsoid defined by

\[ c=c_0-\frac{1}{n+1}\frac{H_0 w}{\sqrt{w^{\top }H_0 w}}, \]
\[ H^{-1}=\left(1-\frac{1}{n^2}\right)H_0^{-1}+\frac{2(n+1)}{n^2}\frac{ww^{\top }}{w^{\top }H_0 w}. \tag {2.8} \]

Applying the Sherman–Morrison formula to (2.8) one can recover (2.6), which concludes the proof.

Theorem 6.1.2
#

For \(t \ge 2n^2\log (R/r)\) the ellipsoid method satisfies \(\{ c_1,\dots ,c_t\} \cap \mathcal{X}\neq \varnothing \) and

\[ f(x_t)-\min _{x\in \mathcal{X}}f(x)\le \frac{2BR}{r}\exp \! \left(-\frac{t}{2n^2}\right). \]
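
Below is a minimal Python sketch of the ellipsoid method built on the update (2.5)-(2.6) of Lemma 6.1.2. For simplicity it assumes every center is feasible, so only a subgradient oracle for \(f\) is queried (no separation oracle for \(\mathcal X\)); all function names and the example objective are ours.

```python
import numpy as np

def ellipsoid_step(c0, H0, w):
    """One update of Lemma 6.1.2: the minimal-volume ellipsoid containing
    {x in E_0 : w^T (x - c_0) <= 0}, via formulas (2.5) and (2.6)."""
    n = len(c0)
    Hw = H0 @ w
    c = c0 - Hw / ((n + 1) * np.sqrt(w @ Hw))
    H = (n**2 / (n**2 - 1.0)) * (H0 - (2.0 / (n + 1)) * np.outer(Hw, Hw) / (w @ Hw))
    return c, H

def ellipsoid_method(subgrad, c0, R, T):
    """Sketch of the ellipsoid method started from the ball B(c0, R).
    Every center is assumed feasible, so w_t is just a subgradient of f at c_t."""
    c, H = np.array(c0, float), R**2 * np.eye(len(c0))
    centers = [c.copy()]
    for _ in range(T):
        w = subgrad(c)
        if np.allclose(w, 0):
            break
        c, H = ellipsoid_step(c, H, w)
        centers.append(c.copy())
    return centers

# Usage on f(x) = ||x - x_star||_2^2:
if __name__ == "__main__":
    x_star = np.array([1.0, -2.0, 0.5])
    grad = lambda x: 2 * (x - x_star)
    centers = ellipsoid_method(grad, c0=np.zeros(3), R=10.0, T=200)
    best = min(centers, key=lambda x: np.sum((x - x_star)**2))
    print(np.sum((best - x_star)**2))
```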

6.2 Dimension-free convex optimization

6.2.1 Subgradient descent

Definition 6.2.1 Projected subgradient descent
#

For \(t \ge 1\):

\begin{align} y_{t+1} & = x_t - \eta g_t,\quad \text{where } g_t \in \partial f(x_t), \label{eq:3.2}\\ x_{t+1} & = \Pi _{\mathcal X}(y_{t+1}). \label{eq:3.3} \end{align}
Theorem 6.2.1
#

The projected subgradient descent method with \(\eta = \frac{R}{L\sqrt{t}}\) satisfies

\[ f\! \left(\frac{1}{t}\sum _{s=1}^t x_s\right)-f(x^*) \le \frac{RL}{\sqrt{t}}. \]
Proof

Using the definition of subgradients, the definition of the method, and the elementary identity \(2a^\top b=\| a\| ^2+\| b\| ^2-\| a-b\| ^2\), one obtains

\[ f(x_s)-f(x^*) \le g_s^\top (x_s-x^*) = \frac{1}{\eta }(x_s-y_{s+1})^\top (x_s-x^*) \]
\[ = \frac{1}{2\eta }\Big(\| x_s-x^*\| ^2+\| x_s-y_{s+1}\| ^2-\| y_{s+1}-x^*\| ^2\Big) \]
\[ = \frac{1}{2\eta }\Big(\| x_s-x^*\| ^2-\| y_{s+1}-x^*\| ^2\Big)+\frac{\eta }{2}\| g_s\| ^2. \]

Now note that \(\| g_s\| \le L\), and furthermore by Lemma 3.1

\[ \| y_{s+1}-x^*\| \ge \| x_{s+1}-x^*\| . \]

Summing the resulting inequality over \(s\), and using that \(\| x_1-x^*\| \le R\), yields

\[ \sum _{s=1}^t\big(f(x_s)-f(x^*)\big)\le \frac{R^2}{2\eta }+\frac{\eta L^2 t}{2}. \]

Plugging in the value of \(\eta \) directly gives the statement (recall that by convexity \(f((1/t)\sum _{s=1}^t x_s)\le \tfrac {1}{t}\sum _{s=1}^t f(x_s)\)).
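
As an illustration, here is a minimal Python sketch of projected subgradient descent with the fixed step size \(\eta =\frac{R}{L\sqrt{t}}\) of the theorem; the choice of \(\mathcal X\) (a Euclidean ball, whose projection has a closed form) and the function names are illustrative choices of ours.

```python
import numpy as np

def projected_subgradient_descent(subgrad, proj, x1, eta, T):
    """Sketch of projected subgradient descent (Definition 6.2.1) with a fixed
    step size eta; returns the averaged iterate used in Theorem 6.2.1."""
    x = np.array(x1, float)
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += x / T
        g = subgrad(x)
        x = proj(x - eta * g)   # y_{t+1} = x_t - eta g_t ; x_{t+1} = Pi_X(y_{t+1})
    return avg

# Usage: minimize f(x) = ||x - a||_1 over the Euclidean ball of radius 1.
if __name__ == "__main__":
    a = np.array([2.0, -1.0])
    subgrad = lambda x: np.sign(x - a)                    # a subgradient of f
    proj_ball = lambda y: y / max(1.0, np.linalg.norm(y)) # projection onto the unit ball
    R, L, T = 2.0, np.sqrt(2.0), 1000   # R >= ||x_1 - x^*||, L = Lipschitz constant of f
    x_avg = projected_subgradient_descent(subgrad, proj_ball, np.zeros(2), R / (L * np.sqrt(T)), T)
    print(x_avg)
```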

Definition 6.2.2 Projected subgradient descent, varying step size
#

Projected subgradient descent with time-varying step size \((\eta _t)_{t\ge 1}\) is given by

\[ y_{t+1}=x_t-\eta _t g_t,\quad \text{where }g_t\in \partial f(x_t) \]
\[ x_{t+1}=\Pi _{\mathcal X}(y_{t+1}). \]
Theorem 6.2.2
#

Let \(f\) be \(\alpha \)-strongly convex and \(L\)-Lipschitz on \(\mathcal{X}\). Then projected subgradient descent with \(\eta _s=\frac{2}{\alpha (s+1)}\) satisfies

\[ f\! \left(\sum _{s=1}^t \frac{2s}{t(t+1)}x_s\right)-f(x^*) \le \frac{2L^2}{\alpha (t+1)}. \]
Proof

Coming back to our original analysis of projected subgradient descent in Section 6.2.1 and using the strong convexity assumption, one immediately obtains

\[ f(x_s)-f(x^*) \le \frac{\eta _s}{2}L^2 +\Big(\frac{1}{2\eta _s}-\frac{\alpha }{2}\Big)\| x_s-x^*\| ^2 -\frac{1}{2\eta _s}\| x_{s+1}-x^*\| ^2. \]

Multiplying this inequality by \(s\) yields

\[ s\big(f(x_s)-f(x^*)\big)\le \frac{L^2}{\alpha } +\frac{\alpha }{4}\Big(s(s-1)\| x_s-x^*\| ^2 - s(s+1)\| x_{s+1}-x^*\| ^2\Big). \]

Now sum the resulting inequality over \(s=1\) to \(s=t\), and apply Jensen’s inequality to obtain the claimed statement.

6.2.2 Gradient descent

Theorem 6.2.3
#

Let \(f\) be convex and \(\beta \)-smooth on \(\mathbb {R}^n\). Then gradient descent with \(\eta =\frac{1}{\beta }\) satisfies

\[ f(x_t)-f(x^*)\le \frac{2\beta \| x_1-x^*\| ^2}{t-1}. \]
Proof

Using (3.5) and the definition of the method one has

\[ f(x_{s+1})-f(x_s)\le -\frac{1}{2\beta }\| \nabla f(x_s)\| ^2. \]

In particular, denoting \(\delta _s=f(x_s)-f(x^*)\), this shows:

\[ \delta _{s+1}\le \delta _s-\frac{1}{2\beta }\| \nabla f(x_s)\| ^2. \]

One also has by convexity

\[ \delta _s\le \nabla f(x_s)^\top (x_s-x^*)\le \| x_s-x^*\| \cdot \| \nabla f(x_s)\| . \]

We will prove that \(\| x_s-x^*\| \) is decreasing with \(s\), which with the two above displays will imply

\[ \delta _{s+1}\le \delta _s-\frac{1}{2\beta \| x_1-x^*\| ^2}\delta _s^2. \]

Let us see how to use this last inequality to conclude the proof. Let

\[ \omega =\frac{1}{2\beta \| x_1-x^*\| ^2}, \]

then, using for the second implication below that \(\delta _{s+1}\le \delta _s\) (so that \(\delta _s/\delta _{s+1}\ge 1\)),

\[ \omega \delta _s^2+\delta _{s+1}\le \delta _s\iff \omega \frac{\delta _s}{\delta _{s+1}}+\frac{1}{\delta _s}\le \frac{1}{\delta _{s+1}} \implies \frac{1}{\delta _{s+1}}-\frac{1}{\delta _s}\ge \omega \implies \frac{1}{\delta _t}\ge \omega (t-1). \]

Thus it only remains to show that \(\| x_s-x^*\| \) is decreasing with \(s\). Using Lemma 6.2.2 (summing the inequality of the lemma applied to the pair \((x,y)\) with the one applied to \((y,x)\)) one immediately gets

\[ (\nabla f(x)-\nabla f(y))^{\top }(x-y)\ge \frac{1}{\beta }\| \nabla f(x)-\nabla f(y)\| ^2. \tag {3.6} \]

We use this as follows (together with \(\nabla f(x^*)=0\))

\[ \| x_{s+1}-x^*\| ^2=\| x_s-\tfrac {1}{\beta }\nabla f(x_s)-x^*\| ^2 \]
\[ =\| x_s-x^*\| ^2-\tfrac {2}{\beta }\nabla f(x_s)^{\top }(x_s-x^*)+\tfrac {1}{\beta ^2}\| \nabla f(x_s)\| ^2 \]
\[ \le \| x_s-x^*\| ^2-\tfrac {1}{\beta ^2}\| \nabla f(x_s)\| ^2 \]
\[ \le \| x_s-x^*\| ^2, \]

which concludes the proof.

Lemma 6.2.1

Let \(f\) be a \(\beta \)-smooth function on \(\mathbb {R}^n\). Then for any \(x,y\in \mathbb {R}^n\), one has

\[ |f(x)-f(y)-\nabla f(y)^{\top }(x-y)| \le \frac{\beta }{2}\| x-y\| ^2. \]
Proof

We represent \(f(x)-f(y)\) as an integral, apply Cauchy–Schwarz and then \(\beta \)-smoothness:

\[ |f(x)-f(y)-\nabla f(y)^{\top }(x-y)| = \left|\int _0^1 \nabla f\big(y+t(x-y)\big)^{\top }(x-y)\, dt - \nabla f(y)^{\top }(x-y)\right| \]
\[ \le \int _0^1 \| \nabla f\big(y+t(x-y)\big)-\nabla f(y)\| \cdot \| x-y\| \, dt \]
\[ \le \int _0^1 \beta t\| x-y\| ^2\, dt = \frac{\beta }{2}\| x-y\| ^2. \]
Lemma 6.2.2

Let \(f\) be such that (3.4) holds true. Then for any \(x,y\in \mathbb {R}^n\), one has

\[ f(x)-f(y)\le \nabla f(x)^\top (x-y)-\frac{1}{2\beta }\| \nabla f(x)-\nabla f(y)\| ^2. \]
Proof

Let \(z=y-\frac{1}{\beta }\big(\nabla f(y)-\nabla f(x)\big)\). Then one has

\[ \begin{aligned} f(x)-f(y) & =f(x)-f(z)+f(z)-f(y)\\ & \le \nabla f(x)^\top (x-z)+\nabla f(y)^\top (z-y)+\frac{\beta }{2}\| z-y\| ^2\\ & =\nabla f(x)^\top (x-y)+\big(\nabla f(x)-\nabla f(y)\big)^\top (y-z)+\frac{1}{2\beta }\| \nabla f(x)-\nabla f(y)\| ^2\\ & =\nabla f(x)^\top (x-y)-\frac{1}{2\beta }\| \nabla f(x)-\nabla f(y)\| ^2. \end{aligned} \]
Lemma 6.2.3

Let \(x,y\in \mathcal{X}\), \(x^{+}=\Pi _{\mathcal{X}}\bigl(x-\tfrac {1}{\beta }\nabla f(x)\bigr)\), and \(g_{\mathcal{X}}(x)=\beta (x-x^{+})\). Then the following holds true:

\[ f(x^{+})-f(y)\le g_{\mathcal{X}}(x)^{\top }(x-y)-\tfrac {1}{2\beta }\| g_{\mathcal{X}}(x)\| ^{2}. \]
Proof

We first observe that

\begin{equation} \tag {3.7} \nabla f(x)^{\top }(x^{+}-y)\le g_{\mathcal{X}}(x)^{\top }(x^{+}-y). \end{equation}

Indeed the above inequality is equivalent to

\[ \Bigl(x^{+}-\bigl(x-\tfrac {1}{\beta }\nabla f(x)\bigr)\Bigr)^{\top }(x^{+}-y)\le 0, \]

which follows from Lemma 3.1. Now we use (3.7) as follows to prove the lemma (we also use (3.4) which still holds true in the constrained case)

\[ \begin{aligned} f(x^{+})-f(y) & = f(x^{+})-f(x)+f(x)-f(y)\\ & \le \nabla f(x)^{\top }(x^{+}-x)+\tfrac {\beta }{2}\| x^{+}-x\| ^{2}+\nabla f(x)^{\top }(x-y)\\ & = \nabla f(x)^{\top }(x^{+}-y)+\tfrac {1}{2\beta }\| g_{\mathcal{X}}(x)\| ^{2}\\ & \le g_{\mathcal{X}}(x)^{\top }(x^{+}-y)+\tfrac {1}{2\beta }\| g_{\mathcal{X}}(x)\| ^{2}\\ & = g_{\mathcal{X}}(x)^{\top }(x-y)-\tfrac {1}{2\beta }\| g_{\mathcal{X}}(x)\| ^{2}. \end{aligned} \]
Theorem 6.2.4
#

Let \(f\) be convex and \(\beta \)-smooth on \(\mathcal{X}\). Then projected gradient descent with \(\eta =\frac{1}{\beta }\) satisfies

\[ f(x_t)-f(x^*) \le \frac{3\beta \| x_1-x^*\| ^2 + f(x_1)-f(x^*)}{t}. \]
Proof

Lemma 6.2.3 immediately gives

\[ f(x_{s+1})-f(x_s) \le -\frac{1}{2\beta }\| g_{\mathcal{X}}(x_s)\| ^2, \]

and

\[ f(x_{s+1})-f(x^*)\le \| g_{\mathcal X}(x_s)\| \cdot \| x_s-x^*\| . \]

We will prove that \(\| x_s-x^*\| \) is decreasing with \(s\), which with the two above displays will imply

\[ \delta _{s+1}\le \delta _s-\frac{1}{2\beta \| x_1-x^*\| ^2}\delta _{s+1}^2. \]

An easy induction shows that

\[ \delta _s\le \frac{3\beta \| x_1-x^*\| ^2+f(x_1)-f(x^*)}{s}. \]

Thus it only remains to show that \(\| x_s-x^*\| \) is decreasing with \(s\). Using Lemma 6.2.3 (with \(y=x^*\)) one can see that \(g_{\mathcal X}(x_s)^\top (x_s-x^*)\ge \frac{1}{2\beta }\| g_{\mathcal X}(x_s)\| ^2\), which implies

\[ \begin{aligned} \| x_{s+1}-x^*\| ^2& =\| x_s-\tfrac {1}{\beta }g_{\mathcal X}(x_s)-x^*\| ^2\\ & =\| x_s-x^*\| ^2-\tfrac {2}{\beta }g_{\mathcal X}(x_s)^\top (x_s-x^*)+\tfrac {1}{\beta ^2}\| g_{\mathcal X}(x_s)\| ^2\\ & \le \| x_s-x^*\| ^2. \end{aligned} \]

\(\square \)

Theorem 6.2.5

Let \(f\) be \(\alpha \)-strongly convex and \(\beta \)-smooth on \(\mathcal X\), and let \(\kappa =\beta /\alpha \) denote the condition number. Then projected gradient descent with \(\eta =\frac{1}{\beta }\) satisfies for \(t\ge 0\),

\[ \| x_{t+1}-x^*\| ^2 \le \exp \! \bigg(-\frac{t}{\kappa }\bigg)\| x_1-x^*\| ^2. \]
Proof

Using (3.14) with \(y=x^*\) one directly obtains

\[ \| x_{t+1}-x^*\| ^2 = \| x_t-\tfrac {1}{\beta }g_{\mathcal X}(x_t)-x^*\| ^2 \]
\[ = \| x_t-x^*\| ^2 - \tfrac {2}{\beta } g_{\mathcal X}(x_t)^\top (x_t-x^*) + \tfrac {1}{\beta ^2}\| g_{\mathcal X}(x_t)\| ^2 \]
\[ \le \Big(1-\tfrac {\alpha }{\beta }\Big)\| x_t-x^*\| ^2 \]
\[ \le \Big(1-\tfrac {\alpha }{\beta }\Big)^t \| x_1-x^*\| ^2 \]
\[ \le \exp \! \bigg(-\frac{t}{\kappa }\bigg)\| x_1-x^*\| ^2, \]

which concludes the proof.

Lemma 6.2.4

Let \(f\) be \(\beta \)-smooth and \(\alpha \)-strongly convex on \(\mathbb {R}^n\). Then for all \(x,y\in \mathbb {R}^n\), one has

\[ (\nabla f(x)-\nabla f(y))^\top (x-y) \; \ge \; \frac{\alpha \beta }{\beta +\alpha }\| x-y\| ^2 \; +\; \frac{1}{\beta +\alpha }\| \nabla f(x)-\nabla f(y)\| ^2. \]
Proof

Let \(\varphi (x)=f(x)-\frac{\alpha }{2}\| x\| ^2\). By definition of \(\alpha \)-strong convexity one has that \(\varphi \) is convex. Furthermore one can show that \(\varphi \) is \((\beta -\alpha )\)-smooth by proving (3.4) (and using that it implies smoothness). Thus using (3.6) one gets

\[ (\nabla \varphi (x)-\nabla \varphi (y))^\top (x-y)\ge \frac{1}{\beta -\alpha }\| \nabla \varphi (x)-\nabla \varphi (y)\| ^2, \]

which gives the claimed result with straightforward computations. (Note that if \(\alpha =\beta \) the smoothness of \(\varphi \) directly implies that \(\nabla f(x)-\nabla f(y)=\alpha (x-y)\) which proves the lemma in this case.)

Theorem 6.2.6

Let \(f\) be \(\beta \)-smooth and \(\alpha \)-strongly convex on \(\mathbb {R}^n\). Then gradient descent with \(\eta =\frac{2}{\alpha +\beta }\) satisfies

\[ f(x_{t+1})-f(x^*) \le \frac{\beta }{2}\exp \! \Big(-\frac{4t}{\kappa +1}\Big)\| x_1-x^*\| ^2. \]
Proof

First note that by \(\beta \)-smoothness (since \(\nabla f(x^*)=0\)) one has

\[ f(x_t)-f(x^*) \le \frac{\beta }{2}\| x_t-x^*\| ^2. \]

Now using Lemma 6.2.4 one obtains

\[ \| x_{t+1}-x^*\| ^2 = \| x_t-\eta \nabla f(x_t)-x^*\| ^2 = \| x_t-x^*\| ^2-2\eta \nabla f(x_t)^\top (x_t-x^*)+\eta ^2\| \nabla f(x_t)\| ^2 \]
\[ \le \Big(1-2\frac{\eta \alpha \beta }{\beta +\alpha }\Big)\| x_t-x^*\| ^2 +\Big(\eta ^2-2\frac{\eta }{\beta +\alpha }\Big)\| \nabla f(x_t)\| ^2 \]
\[ = \Big(\frac{\kappa -1}{\kappa +1}\Big)^2\| x_t-x^*\| ^2 \]
\[ \le \exp \! \Big(-\frac{4t}{\kappa +1}\Big)\| x_1-x^*\| ^2, \]

which concludes the proof.

6.2.3 Nesterov

Definition 6.2.3 Nesterov’s accelerated gradient descent
#

Nesterov’s accelerated gradient descent can be described as follows: Start at an arbitrary initial point \(x_1=y_1\) and then iterate the following equations for \(t\ge 1\),

\[ y_{t+1}=x_t-\frac{1}{\beta }\nabla f(x_t), \]
\[ x_{t+1}=\left(1+\frac{\sqrt{\kappa }-1}{\sqrt{\kappa }+1}\right)y_{t+1} -\frac{\sqrt{\kappa }-1}{\sqrt{\kappa }+1}y_t. \]
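
Here is a minimal Python sketch of this iteration; the quadratic test function, the assumption that \(\alpha \) and \(\beta \) are known to the caller, and all names are ours.

```python
import numpy as np

def nesterov_agd_strongly_convex(grad, x1, alpha, beta, T):
    """Sketch of Nesterov's accelerated gradient descent (Definition 6.2.3)
    for an alpha-strongly convex, beta-smooth f with gradient oracle `grad`."""
    kappa = beta / alpha
    momentum = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    x = y = np.array(x1, float)
    for _ in range(T):
        y_next = x - grad(x) / beta                  # gradient step
        x = (1 + momentum) * y_next - momentum * y   # momentum step
        y = y_next
    return y

# Usage on the quadratic f(x) = 0.5 x^T A x with eigenvalues in [1, 100]:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))
    A = Q @ np.diag(np.linspace(1.0, 100.0, 20)) @ Q.T
    y = nesterov_agd_strongly_convex(lambda x: A @ x, rng.normal(size=20), alpha=1.0, beta=100.0, T=200)
    print(0.5 * y @ A @ y)   # f(y_T), should be close to f(x^*) = 0
```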
Theorem 6.2.7
#

Let \(f\) be \(\alpha \)-strongly convex and \(\beta \)-smooth, then Nesterov’s accelerated gradient descent satisfies

\[ f(y_t)-f(x^*) \le \frac{\alpha +\beta }{2}\| x_1-x^*\| ^2\exp \! \left(-\frac{t-1}{\sqrt{\kappa }}\right). \]
Proof

We define \(\alpha \)-strongly convex quadratic functions \(\Phi _s\), \(s\ge 1\), by induction as follows:

\[ \Phi _1(x)=f(x_1)+\frac{\alpha }{2}\| x-x_1\| ^2, \]

\begin{equation} \tag {3.17} \Phi _{s+1}(x)=\left(1-\frac{1}{\sqrt{\kappa }}\right)\Phi _s(x) +\frac{1}{\sqrt{\kappa }}\Big(f(x_s)+\nabla f(x_s)^{\top }(x-x_s)+\frac{\alpha }{2}\| x-x_s\| ^2\Big). \end{equation}

Intuitively \(\Phi _s\) becomes a finer and finer approximation (from below) to \(f\) in the following sense:

\begin{equation} \tag {3.18} \Phi _{s+1}(x)\le f(x)+\left(1-\frac{1}{\sqrt{\kappa }}\right)^s\big(\Phi _1(x)-f(x)\big). \end{equation}

The above inequality can be proved immediately by induction, using the fact that by \(\alpha \)-strong convexity one has

\[ f(x_s)+\nabla f(x_s)^{\top }(x-x_s)+\frac{\alpha }{2}\| x-x_s\| ^2\le f(x). \]

Equation (3.18) by itself does not say much; for it to be useful one needs to understand how far below \(f\) the function \(\Phi _s\) is. The following inequality answers this question:

\begin{equation} \tag {3.19} f(y_s)\le \min _{x\in \mathbb {R}^n}\Phi _s(x). \end{equation}

The rest of the proof is devoted to showing that (3.19) holds true, but first let us see how to combine (3.18) and (3.19) to obtain the rate given by the theorem (we use that by \(\beta \)-smoothness one has \(f(x)-f(x^*)\le \frac{\beta }{2}\| x-x^*\| ^2\)):

\[ \begin{aligned} f(y_t)-f(x^*)& \le \Phi _t(x^*)-f(x^*)\\ & \le \left(1-\frac{1}{\sqrt{\kappa }}\right)^{t-1}\big(\Phi _1(x^*)-f(x^*)\big)\\ & \le \frac{\alpha +\beta }{2}\| x_1-x^*\| ^2\left(1-\frac{1}{\sqrt{\kappa }}\right)^{t-1}. \end{aligned} \]

We now prove (3.19) by induction (note that it is true at \(s=1\) since \(x_1=y_1\)). Let \(\Phi _s^*=\min _{x\in \mathbb {R}^n}\Phi _s(x)\). Using the definition of \(y_{s+1}\) (and \(\beta \)-smoothness), convexity, and the induction hypothesis, one gets

\[ f(y_{s+1}) \le f(x_s) - \frac{1}{2\beta }\| \nabla f(x_s)\| ^2 \]
\[ = \left(1-\frac{1}{\sqrt{\kappa }}\right) f(y_s) + \left(1-\frac{1}{\sqrt{\kappa }}\right)(f(x_s)-f(y_s)) +\frac{1}{\sqrt{\kappa }}f(x_s)-\frac{1}{2\beta }\| \nabla f(x_s)\| ^2 \]
\[ \le \left(1-\frac{1}{\sqrt{\kappa }}\right)\Phi _s^* + \left(1-\frac{1}{\sqrt{\kappa }}\right)\nabla f(x_s)^\top (x_s-y_s) +\frac{1}{\sqrt{\kappa }}f(x_s)-\frac{1}{2\beta }\| \nabla f(x_s)\| ^2. \]

Thus we now have to show that

\begin{equation} \tag {3.20} \Phi _{s+1}^* \ge \left(1-\frac{1}{\sqrt{\kappa }}\right)\Phi _s^* + \left(1-\frac{1}{\sqrt{\kappa }}\right)\nabla f(x_s)^\top (x_s-y_s) +\frac{1}{\sqrt{\kappa }}f(x_s)-\frac{1}{2\beta }\| \nabla f(x_s)\| ^2. \end{equation}

To prove this inequality we have to understand better the functions \(\Phi _s\). First note that \(\nabla ^2\Phi _s(x)=\alpha I_n\) (immediate by induction) and thus \(\Phi _s\) has to be of the following form:

\[ \Phi _s(x)=\Phi _s^* + \frac{\alpha }{2}\| x-v_s\| ^2, \]

for some \(v_s\in \mathbb {R}^n\). Now observe that by differentiating (3.17) and using the above form of \(\Phi _s\) one obtains

\[ \nabla \Phi _{s+1}(x)=\alpha \left(1-\frac{1}{\sqrt{\kappa }}\right)(x-v_s)+\frac{1}{\sqrt{\kappa }}\nabla f(x_s)+\frac{\alpha }{\sqrt{\kappa }}(x-x_s). \]

In particular \(\Phi _{s+1}\) is by definition minimized at \(v_{s+1}\) which can now be defined by induction using the above identity, precisely:

\begin{equation} \tag {3.21} v_{s+1}=\left(1-\frac{1}{\sqrt{\kappa }}\right)v_s + \frac{1}{\sqrt{\kappa }}x_s - \frac{1}{\alpha \sqrt{\kappa }}\nabla f(x_s). \end{equation}

Using the form of \(\Phi _s\) and \(\Phi _{s+1}\), as well as the original definition (3.17) one gets the following identity by evaluating \(\Phi _{s+1}\) at \(x_s\):

\begin{equation} \tag {3.22} \Phi _{s+1}^* + \frac{\alpha }{2}\| x_s-v_{s+1}\| ^2 = \left(1-\frac{1}{\sqrt{\kappa }}\right)\Phi _s^* + \frac{\alpha }{2}\left(1-\frac{1}{\sqrt{\kappa }}\right)\| x_s-v_s\| ^2 + \frac{1}{\sqrt{\kappa }}f(x_s). \end{equation}

Note that thanks to (3.21) one has

\[ \| x_s-v_{s+1}\| ^2 = \left(1-\frac{1}{\sqrt\kappa }\right)^2\| x_s-v_s\| ^2+\frac{1}{\alpha ^2\kappa }\| \nabla f(x_s)\| ^2 -\frac{2}{\alpha \sqrt\kappa }\left(1-\frac{1}{\sqrt\kappa }\right)\nabla f(x_s)^{\top }(v_s-x_s), \]

which combined with (3.22) yields

\[ \Phi ^*_{s+1} = \left(1-\frac{1}{\sqrt\kappa }\right)\Phi ^*_s+\frac{1}{\sqrt\kappa }f(x_s)+\frac{\alpha }{2\sqrt\kappa }\left(1-\frac{1}{\sqrt\kappa }\right)\| x_s-v_s\| ^2 -\frac{1}{2\beta }\| \nabla f(x_s)\| ^2+\frac{1}{\sqrt\kappa }\left(1-\frac{1}{\sqrt\kappa }\right)\nabla f(x_s)^{\top }(v_s-x_s). \]

Finally we show by induction that \(v_s-x_s=\sqrt\kappa (x_s-y_s)\), which concludes the proof of (3.20) and thus also concludes the proof of the theorem:

\[ \begin{aligned} v_{s+1}-x_{s+1} & = \left(1-\frac{1}{\sqrt\kappa }\right)v_s+\frac{1}{\sqrt\kappa }x_s-\frac{1}{\alpha \sqrt\kappa }\nabla f(x_s)-x_{s+1}\\ & = \sqrt\kappa x_s-(\sqrt\kappa -1)y_s-\frac{\sqrt\kappa }{\beta }\nabla f(x_s)-x_{s+1}\\ & = \sqrt\kappa y_{s+1}-(\sqrt\kappa -1)y_s-x_{s+1}\\ & = \sqrt\kappa (x_{s+1}-y_{s+1}), \end{aligned} \]

where the first equality comes from (3.21), the second from the induction hypothesis, the third from the definition of \(y_{s+1}\) and the last one from the definition of \(x_{s+1}\).

Definition 6.2.4 Nesterov, \(\alpha =0\)
#

Nesterov’s accelerated gradient descent for the case \(\alpha = 0\). First we define the following sequences:

\[ \lambda _0 = 0,\qquad \lambda _t = \frac{1 + \sqrt{1 + 4\lambda _{t-1}^2}}{2},\qquad \text{and}\qquad \gamma _t = \frac{1 - \lambda _t}{\lambda _{t+1}}. \]

(Note that \(\gamma _t \le 0\).) Now the algorithm is simply defined by the following equations, with \(x_1 = y_1\) an arbitrary initial point,

\[ y_{t+1} = x_t - \frac{1}{\beta }\nabla f(x_t), \]
\[ x_{t+1} = (1-\gamma _t)y_{t+1} + \gamma _t y_t. \]
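
A minimal Python sketch of this variant, with the \(\lambda _t,\gamma _t\) sequences computed on the fly; the least-squares example and the names are ours.

```python
import numpy as np

def nesterov_agd_smooth(grad, x1, beta, T):
    """Sketch of Nesterov's method for convex, beta-smooth f (Definition 6.2.4)."""
    lam = 0.0                 # lambda_0
    x = y = np.array(x1, float)
    for _ in range(T):
        lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2                      # lambda_t
        gamma = (1 - lam_next) / ((1 + np.sqrt(1 + 4 * lam_next**2)) / 2) # (1 - lambda_t)/lambda_{t+1}
        y_next = x - grad(x) / beta
        x = (1 - gamma) * y_next + gamma * y
        lam, y = lam_next, y_next
    return y

# Usage on least squares f(x) = 0.5||Mx - b||^2 (beta = largest eigenvalue of M^T M):
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M, b = rng.normal(size=(50, 20)), rng.normal(size=50)
    beta = np.linalg.norm(M, 2) ** 2
    y = nesterov_agd_smooth(lambda x: M.T @ (M @ x - b), np.zeros(20), beta, T=500)
    print(0.5 * np.sum((M @ y - b) ** 2))
```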
Theorem 6.2.8
#

Let \(f\) be a convex and \(\beta \)-smooth function, then Nesterov’s accelerated gradient descent satisfies

\[ f(y_t) - f(x^*) \le \frac{2\beta \| x_1 - x^*\| ^2}{t^2}. \]
Proof

Using the unconstrained version of Lemma 6.2.3 one obtains

\[ f(y_{s+1}) - f(y_s) \]
\[ \le \nabla f(x_s)^\top (x_s - y_s) - \frac{1}{2\beta }\| \nabla f(x_s)\| ^2 \]
\[ = \beta (x_s - y_{s+1})^\top (x_s - y_s) - \frac{\beta }{2}\| x_s - y_{s+1}\| ^2. \tag {3.23} \]

Similarly we also get

\begin{equation} \tag {3.24} f(y_{s+1}) - f(x^*) \le \beta (x_s - y_{s+1})^\top (x_s - x^*) - \frac{\beta }{2}\| x_s - y_{s+1}\| ^2. \end{equation}

Now multiplying (3.23) by \((\lambda _s - 1)\) and adding the result to (3.24), one obtains with \(\delta _s = f(y_s) - f(x^*)\),

\[ \lambda _s \delta _{s+1} - (\lambda _s - 1)\delta _s \]
\[ \le \beta (x_s - y_{s+1})^\top (\lambda _s x_s - (\lambda _s - 1)y_s - x^*) - \frac{\beta }{2}\lambda _s\| x_s - y_{s+1}\| ^2. \]

Multiplying this inequality by \(\lambda _s\) and using that by definition \(\lambda _{s-1}^2 = \lambda _s^2 - \lambda _s\), as well as the elementary identity \(2a^\top b - \| a\| ^2 = \| b\| ^2 - \| b-a\| ^2\), one obtains

\[ \lambda _s^2\delta _{s+1}-\lambda _{s-1}^2\delta _s \le \frac{\beta }{2}\Big(2\lambda _s(x_s-y_{s+1})^\top (\lambda _s x_s-(\lambda _s-1)y_s-x^*)-\| \lambda _s(y_{s+1}-x_s)\| ^2\Big) \]
\[ = \frac{\beta }{2}\big(\| \lambda _s x_s-(\lambda _s-1)y_s-x^*\| ^2-\| \lambda _s y_{s+1}-(\lambda _s-1)y_s-x^*\| ^2\big). \tag {3.25} \]

Next remark that, by definition, one has

\[ x_{s+1}=y_{s+1}+\gamma _s(y_s-y_{s+1}) \]
\[ \iff \lambda _{s+1}x_{s+1}=\lambda _{s+1}y_{s+1}+(1-\lambda _s)(y_s-y_{s+1}) \]
\[ \iff \lambda _{s+1}x_{s+1}-(\lambda _{s+1}-1)y_{s+1}=\lambda _s y_{s+1}-(\lambda _s-1)y_s. \tag {3.26} \]

Putting together (3.25) and (3.26) one gets with \(u_s=\lambda _s x_s-(\lambda _s-1)y_s-x^*\),

\[ \lambda _s^2\delta _{s+1}-\lambda _{s-1}^2\delta _s\le \frac{\beta }{2}\big(\| u_s\| ^2-\| u_{s+1}\| ^2\big). \]

Summing these inequalities from \(s=1\) to \(s=t-1\) one obtains:

\[ \delta _t\le \frac{\beta }{2\lambda _{t-1}^2}\| u_1\| ^2. \]

By induction it is easy to see that \(\lambda _{t-1}\ge \tfrac {t}{2}\) which concludes the proof.

6.2.4 Frank-Wolfe

Definition 6.2.5 Frank-Wolfe
#

Conditional gradient descent, introduced in Frank and Wolfe [1956], performs the following update for \(t\ge 1\), where \((\gamma _s)_{s\ge 1}\) is a fixed sequence,

\begin{align} y_t & \in \operatorname*{arg\, min}_{y\in \mathcal{X}} \nabla f(x_t)^\top y \label{eq:3.8}\\ x_{t+1} & = (1-\gamma _t)x_t + \gamma _t y_t. \label{eq:3.9} \end{align}
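
As an illustration, here is a minimal Python sketch of the method over an \(\ell _1\) ball, a set for which the linear minimization step (3.8) has a closed form (a signed vertex); the example problem and the names are ours.

```python
import numpy as np

def frank_wolfe_l1(grad, x1, radius, T):
    """Sketch of conditional gradient descent (Frank-Wolfe) over the l1 ball of
    the given radius; the linear oracle (3.8) returns a signed vertex."""
    x = np.array(x1, float)
    for s in range(1, T + 1):
        g = grad(x)
        i = np.argmax(np.abs(g))
        y = np.zeros_like(x)
        y[i] = -radius * np.sign(g[i])     # argmin_{||y||_1 <= radius} g^T y
        gamma = 2.0 / (s + 1)
        x = (1 - gamma) * x + gamma * y
    return x

# Usage: sparse least squares, f(x) = 0.5||Mx - b||^2 over {||x||_1 <= 1}.
if __name__ == "__main__":
    rng = np.random.default_rng(2)
    M, b = rng.normal(size=(40, 100)), rng.normal(size=40)
    x = frank_wolfe_l1(lambda x: M.T @ (M @ x - b), np.zeros(100), radius=1.0, T=300)
    print(np.count_nonzero(x), 0.5 * np.sum((M @ x - b) ** 2))
```

Note that the iterates are convex combinations of at most \(t\) vertices of the \(\ell _1\) ball, which is what makes this method attractive for sparse problems.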
Theorem 6.2.9
#

Let \(f\) be a convex and \(\beta \)-smooth function w.r.t. some norm \(\| \cdot \| \), \(R=\sup _{x,y\in \mathcal{X}}\| x-y\| \), and \(\gamma _s=\frac{2}{s+1}\) for \(s\ge 1\). Then for any \(t\ge 2\), one has

\[ f(x_t)-f(x^*)\le \frac{2\beta R^2}{t+1}. \]
Proof

The following inequalities hold true, using respectively \(\beta \)-smoothness (it can easily be seen that (3.4) holds true for smoothness in an arbitrary norm), the definition of \(x_{s+1}\), the definition of \(y_s\), and the convexity of \(f\):

\[ f(x_{s+1})-f(x_s) \le \nabla f(x_s)^{\top }(x_{s+1}-x_s)+\frac{\beta }{2}\| x_{s+1}-x_s\| ^2 \]
\[ \le \gamma _s\nabla f(x_s)^{\top }(y_s-x_s)+\frac{\beta }{2}\gamma _s^2R^2 \]
\[ \le \gamma _s\nabla f(x_s)^{\top }(x^*-x_s)+\frac{\beta }{2}\gamma _s^2R^2 \]
\[ \le \gamma _s\bigl(f(x^*)-f(x_s)\bigr)+\frac{\beta }{2}\gamma _s^2R^2. \]

Rewriting this inequality in terms of \(\delta _s=f(x_s)-f(x^*)\) one obtains

\[ \delta _{s+1}\le (1-\gamma _s)\delta _s+\frac{\beta }{2}\gamma _s^2R^2. \]

A simple induction using that \(\gamma _s=\frac{2}{s+1}\) finishes the proof (note that the initialization is done at step \(2\) with the above inequality yielding \(\delta _2\le \frac{\beta }{2}R^2\)).

6.3 Almost dimension-free convex optimization in non-Euclidean spaces

6.3.1 Mirror descent

Definition 6.3.1 Mirror map
#

Let \(\mathcal{D}\subset \mathbb {R}^n\) be a convex open set such that \(\mathcal{X}\) is included in its closure, that is \(\mathcal{X}\subset \overline{\mathcal{D}}\), and \(\mathcal{X}\cap \mathcal{D}\neq \varnothing \). We say that \(\Phi :\mathcal{D}\to \mathbb {R}\) is a mirror map if it satisfies the following properties:

  1. \(\Phi \) is strictly convex and differentiable.

  2. The gradient of \(\Phi \) takes all possible values, that is \(\nabla \Phi (\mathcal{D})=\mathbb {R}^n\).

  3. The gradient of \(\Phi \) diverges on the boundary of \(\mathcal{D}\), that is

    \[ \lim _{x\to \partial \mathcal{D}}\| \nabla \Phi (x)\| =+\infty . \]
Definition 6.3.2
#

In mirror descent the projection is done via the Bregman divergence associated to \(\Phi \), defined by \(D_{\Phi }(x,y)=\Phi (x)-\Phi (y)-\nabla \Phi (y)^{\top }(x-y)\). Precisely one defines

\[ \Pi ^{\Phi }_{\mathcal{X}}(y)=\arg \min _{x\in \mathcal{X}\cap \mathcal{D}} D_{\Phi }(x,y). \]
Lemma 6.3.1

Let \(x\in \mathcal{X}\cap \mathcal{D}\) and \(y\in \mathcal{D}\), then

\[ (\nabla \Phi (\Pi ^{\Phi }_{\mathcal{X}}(y))-\nabla \Phi (y))^{\top }(\Pi ^{\Phi }_{\mathcal{X}}(y)-x)\le 0, \]

which also implies

\[ D_{\Phi }(x,\Pi ^{\Phi }_{\mathcal{X}}(y))+D_{\Phi }(\Pi ^{\Phi }_{\mathcal{X}}(y),y)\le D_{\Phi }(x,y). \]
Proof

The proof is an immediate corollary of Proposition 1.3 together with the fact that \(\nabla _x D_{\Phi }(x,y)=\nabla \Phi (x)-\nabla \Phi (y)\).

Definition 6.3.3 Mirror descent
#

The mirror descent strategy based on a mirror map \(\Phi \) is as follows: let \(x_1\in \operatorname*{arg\, min}_{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)\). Then for \(t\ge 1\), let \(y_{t+1}\in \mathcal{D}\) such that

\[ \nabla \Phi (y_{t+1})=\nabla \Phi (x_t)-\eta g_t,\quad \text{where }g_t\in \partial f(x_t),\tag {4.2} \]

and

\[ x_{t+1}\in \Pi ^{\Phi }_{\mathcal{X}}(y_{t+1}).\tag {4.3} \]
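
To make this concrete, here is a minimal Python sketch of mirror descent instantiated with the negative entropy mirror map \(\Phi (x)=\sum _i x_i\log x_i\) on the simplex, for which (4.2)-(4.3) reduce to a multiplicative update followed by a normalization; this particular instantiation, the example objective, and the names are ours.

```python
import numpy as np

def mirror_descent_simplex(subgrad, n, eta, T):
    """Sketch of mirror descent (Definition 6.3.3) with the negative entropy
    mirror map on the simplex. The dual-space gradient step followed by the
    Bregman (KL) projection reduces to x_{t+1} proportional to x_t*exp(-eta g_t)."""
    x = np.full(n, 1.0 / n)          # x_1 = argmin of Phi over the simplex
    avg = np.zeros(n)
    for _ in range(T):
        avg += x / T
        g = subgrad(x)
        x = x * np.exp(-eta * g)     # step in dual space
        x /= x.sum()                 # Bregman projection back onto the simplex
    return avg

# Usage: f(x) = max_i (Ax)_i, with a subgradient given by a maximizing row of A.
if __name__ == "__main__":
    rng = np.random.default_rng(3)
    A = rng.normal(size=(30, 50))
    def subgrad(x):
        i = np.argmax(A @ x)
        return A[i]                  # a subgradient of x -> max_i (Ax)_i
    x_avg = mirror_descent_simplex(subgrad, n=50, eta=0.05, T=2000)
    print(np.max(A @ x_avg))
```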
Theorem 6.3.1
#

Let \(\Phi \) be a mirror map \(\rho \)-strongly convex on \(\mathcal{X}\cap \mathcal{D}\) w.r.t. \(\| \cdot \| \). Let \(R^2=\sup _{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)-\Phi (x_1)\), and \(f\) be convex and L-Lipschitz w.r.t. \(\| \cdot \| \). Then mirror descent with \(\eta =\dfrac {R}{L}\sqrt{\dfrac {2\rho }{t}}\) satisfies

\[ f\! \left(\frac{1}{t}\sum _{s=1}^t x_s\right)-f(x^*)\le RL\sqrt{\frac{2}{\rho t}}. \]
Proof

Let \(x\in \mathcal X\cap \mathcal{D}\). The claimed bound will be obtained by taking a limit \(x\to x^*\). Now by convexity of \(f\), the definition of mirror descent, equation (4.1), and Lemma 6.3.1, one has

\[ f(x_s)-f(x) \le g_s^\top (x_s-x) = \frac{1}{\eta }\big(\nabla \Phi (x_s)-\nabla \Phi (y_{s+1})\big)^\top (x_s-x) \]
\[ = \frac{1}{\eta }\Big(D_\Phi (x,x_s)+D_\Phi (x_s,y_{s+1})-D_\Phi (x,y_{s+1})\Big) \]
\[ \le \frac{1}{\eta }\Big(D_\Phi (x,x_s)+D_\Phi (x_s,y_{s+1})-D_\Phi (x,x_{s+1})-D_\Phi (x_{s+1},y_{s+1})\Big). \]

The term \(D_\Phi (x,x_s)-D_\Phi (x,x_{s+1})\) will lead to a telescopic sum when summing over \(s=1\) to \(s=t\), and it remains to bound the other term as follows using \(\rho \)-strong convexity of the mirror map and \(az-bz^2\le \tfrac {a^2}{4b},\ \forall z\in \mathbb R\):

\[ D_\Phi (x_s,y_{s+1})-D_\Phi (x_{s+1},y_{s+1}) = \Phi (x_s)-\Phi (x_{s+1})-\nabla \Phi (y_{s+1})^\top (x_s-x_{s+1}) \]
\[ \le \big(\nabla \Phi (x_s)-\nabla \Phi (y_{s+1})\big)^\top (x_s-x_{s+1})-\frac{\rho }{2}\| x_s-x_{s+1}\| ^2 \]
\[ = \eta g_s^\top (x_s-x_{s+1})-\frac{\rho }{2}\| x_s-x_{s+1}\| ^2 \]
\[ \le \eta L\| x_s-x_{s+1}\| -\frac{\rho }{2}\| x_s-x_{s+1}\| ^2 \]
\[ \le \frac{(\eta L)^2}{2\rho }. \]

We proved

\[ \sum _{s=1}^t\big(f(x_s)-f(x)\big)\le \frac{D_\Phi (x,x_1)}{\eta }+\eta \frac{L^2 t}{2\rho }, \]

which concludes the proof up to trivial computation.

6.3.2 Dual averaging

Theorem 6.3.2
#

Let \(\Phi \) be a mirror map \(\rho \)-strongly convex on \(\mathcal{X}\cap \mathcal{D}\) w.r.t. \(\| \cdot \| \). Let \(R^2=\sup _{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)-\Phi (x_1)\), and \(f\) be convex and \(L\)-Lipschitz w.r.t. \(\| \cdot \| \). Then dual averaging with \(\eta =\dfrac {R}{L}\sqrt{\dfrac {\rho }{2t}}\) satisfies

\[ f\! \left(\tfrac {1}{t}\sum _{s=1}^t x_s\right)-f(x^*)\le 2RL\sqrt{\dfrac {2}{\rho t}}. \]
Proof

We define \(\psi _t(x)=\eta \sum _{s=1}^t g_s^\top x+\Phi (x)\), so that \(x_t\in \arg \min _{x\in \mathcal{X}\cap \mathcal{D}}\psi _{t-1}(x)\). Since \(\Phi \) is \(\rho \)-strongly convex one clearly has that \(\psi _t\) is \(\rho \)-strongly convex, and thus

\[ \psi _t(x_{t+1}) - \psi _t(x_t) \le \nabla \psi _t(x_{t+1})^\top (x_{t+1}-x_t) - \frac{\rho }{2}\| x_{t+1}-x_t\| ^2 \le -\frac{\rho }{2}\| x_{t+1}-x_t\| ^2, \]

where the second inequality comes from the first order optimality condition for \(x_{t+1}\) (see Proposition 1.3). Next observe that

\[ \psi _t(x_{t+1}) - \psi _t(x_t) = \psi _{t-1}(x_{t+1}) - \psi _{t-1}(x_t) + \eta g_t^\top (x_{t+1}-x_t) \ge \eta g_t^\top (x_{t+1}-x_t). \]

Putting together the two above displays and using Cauchy-Schwarz (with the assumption \(\| g_t\| _* \le L\)) one obtains

\[ \frac{\rho }{2}\| x_{t+1}-x_t\| ^2 \le \eta g_t^\top (x_t-x_{t+1}) \le \eta L\| x_t-x_{t+1}\| . \]

In particular this shows that \(\| x_{t+1}-x_t\| \le \frac{2\eta L}{\rho }\) and thus with the above display

\[ g_t^\top (x_t - x_{t+1}) \le \frac{2\eta L^2}{\rho }. \tag {4.7} \]

Now we claim that for any \(x\in \mathcal X\cap \mathcal D\),

\[ \sum _{s=1}^t g_s^\top (x_s - x) \le \sum _{s=1}^t g_s^\top (x_s - x_{s+1}) + \frac{\Phi (x)-\Phi (x_1)}{\eta }, \tag {4.8} \]

which would clearly conclude the proof thanks to (4.7) and straightforward computations. Equation (4.8) is equivalent to

\[ \sum _{s=1}^t g_s^\top x_{s+1} + \frac{\Phi (x_1)}{\eta } \le \sum _{s=1}^t g_s^\top x + \frac{\Phi (x)}{\eta }, \]

and we now prove the latter equation by induction. At \(t=0\) it is true since \(x_1\in \arg \min _{x\in \mathcal X\cap \mathcal D}\Phi (x)\). The following inequalities prove the inductive step, where we use the induction hypothesis at \(x=x_{t+1}\) for the first inequality, and the definition of \(x_{t+1}\) for the second inequality:

\[ \sum _{s=1}^t g_s^\top x_{s+1} + \frac{\Phi (x_1)}{\eta } \le g_t^\top x_{t+1} + \sum _{s=1}^{t-1} g_s^\top x_{t+1} + \frac{\Phi (x_{t+1})}{\eta } \le \sum _{s=1}^t g_s^\top x + \frac{\Phi (x)}{\eta }. \]
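
The scheme analyzed above takes \(x_t\in \arg \min _{x\in \mathcal X\cap \mathcal D}\ \eta \sum _{s{\lt}t} g_s^\top x+\Phi (x)\). Here is a minimal Python sketch with the entropy mirror map on the simplex, where this minimizer has the closed form of a softmax of the negated accumulated subgradients; the instantiation and names are ours.

```python
import numpy as np

def dual_averaging_simplex(subgrad, n, eta, T):
    """Sketch of dual averaging on the simplex with the entropy mirror map:
    x_t = argmin_x eta * sum_{s<t} g_s^T x + sum_i x_i log x_i, which has the
    closed form x_t = softmax(-eta * sum_{s<t} g_s)."""
    G = np.zeros(n)                  # accumulated subgradients
    avg = np.zeros(n)
    for _ in range(T):
        z = -eta * G
        x = np.exp(z - z.max())      # numerically stable softmax
        x /= x.sum()                 # x_t
        avg += x / T
        G += subgrad(x)
    return avg
```

The same subgradient oracle as in the mirror descent sketch above can be plugged in for `subgrad`; note that only the accumulated dual vector has to be stored between iterations.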

6.3.3 Mirror prox

Definition 6.3.4 Mirror prox
#

Mirror prox is described by the following equations:

\[ \nabla \Phi (y'_{t+1})=\nabla \Phi (x_t)-\eta \nabla f(x_t), \]
\[ y_{t+1}\in \arg \min _{x\in \mathcal X\cap \mathcal D} D_{\Phi }(x,y'_{t+1}), \]
\[ \nabla \Phi (x'_{t+1})=\nabla \Phi (x_t)-\eta \nabla f(y_{t+1}), \]
\[ x_{t+1}\in \arg \min _{x\in \mathcal X\cap \mathcal D} D_{\Phi }(x,x'_{t+1}). \]
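
A minimal Python sketch of mirror prox with the entropy mirror map on the simplex, where both half-steps are multiplicative updates followed by a normalization; the step size and the smooth least-squares example are illustrative choices of ours, rather than the exact \(\eta =\rho /\beta \) of the theorem below.

```python
import numpy as np

def mirror_prox_simplex(grad, n, eta, T):
    """Sketch of mirror prox (Definition 6.3.4) with the entropy mirror map on
    the simplex: each half-step is a multiplicative update plus normalization
    (the Bregman projection), and the second half-step uses the gradient taken
    at the intermediate point y_{t+1}."""
    x = np.full(n, 1.0 / n)
    y_avg = np.zeros(n)
    for _ in range(T):
        y = x * np.exp(-eta * grad(x))       # y'_{t+1}, then projection:
        y /= y.sum()                         # y_{t+1}
        x_new = x * np.exp(-eta * grad(y))   # x'_{t+1}, then projection:
        x = x_new / x_new.sum()              # x_{t+1}
        y_avg += y / T                       # the theorem averages the y's
    return y_avg

# Usage: a smooth objective f(x) = 0.5||Ax - b||^2 over the simplex.
if __name__ == "__main__":
    rng = np.random.default_rng(5)
    A, b = rng.normal(size=(30, 40)), rng.normal(size=30)
    x = mirror_prox_simplex(lambda z: A.T @ (A @ z - b), n=40, eta=0.01, T=2000)
    print(0.5 * np.sum((A @ x - b) ** 2))
```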
Theorem 6.3.3
#

Let \(\Phi \) be a mirror map \(\rho \)-strongly convex on \(\mathcal{X}\cap \mathcal{D}\) w.r.t. \(\| \cdot \| \). Let \(R^{2}=\sup _{x\in \mathcal{X}\cap \mathcal{D}}\Phi (x)-\Phi (x_{1})\), and \(f\) be convex and \(\beta \)-smooth w.r.t. \(\| \cdot \| \). Then mirror prox with \(\eta =\frac{\rho }{\beta }\) satisfies

\[ f\! \left(\tfrac {1}{t}\sum _{s=1}^{t} y_{s+1}\right)-f(x^{*}) \le \frac{\beta R^{2}}{\rho t}. \]
Proof

Let \(x\in \mathcal{X}\cap \mathcal{D}\). We write

\[ \begin{aligned} f(y_{t+1})-f(x) & \leq \nabla f(y_{t+1})^{\top }(y_{t+1}-x)\\ & = \nabla f(y_{t+1})^{\top }(x_{t+1}-x)+\nabla f(x_t)^{\top }(y_{t+1}-x_{t+1})\\ & \quad +(\nabla f(y_{t+1})-\nabla f(x_t))^{\top }(y_{t+1}-x_{t+1}). \end{aligned} \]

We will now bound separately these three terms. For the first one, using the definition of the method, Lemma 6.3.1, and equation (4.1), one gets

\[ \begin{aligned} \eta \nabla f(y_{t+1})^{\top }(x_{t+1}-x) & =(\nabla \Phi (x_t)-\nabla \Phi (x'_{t+1}))^{\top }(x_{t+1}-x)\\ & \le (\nabla \Phi (x_t)-\nabla \Phi (x_{t+1}))^{\top }(x_{t+1}-x)\\ & =D_{\Phi }(x,x_t)-D_{\Phi }(x,x_{t+1})-D_{\Phi }(x_{t+1},x_t). \end{aligned} \]

For the second term, using the same properties as above together with the \(\rho \)-strong convexity of the mirror map, one obtains

\[ \eta \nabla f(x_t)^\top (y_{t+1}-x_{t+1}) \]
\[ \quad = (\nabla \Phi (x_t)-\nabla \Phi (y'_{t+1}))^\top (y_{t+1}-x_{t+1}) \]
\[ \quad \le (\nabla \Phi (x_t)-\nabla \Phi (y_{t+1}))^\top (y_{t+1}-x_{t+1}) \]
\[ \quad = D_{\Phi }(x_{t+1},x_t)-D_{\Phi }(x_{t+1},y_{t+1})-D_{\Phi }(y_{t+1},x_t) \tag {4.9} \]
\[ \quad \le D_{\Phi }(x_{t+1},x_t)-\frac{\rho }{2}\| x_{t+1}-y_{t+1}\| ^2-\frac{\rho }{2}\| y_{t+1}-x_t\| ^2. \]

Finally for the last term, using Cauchy–Schwarz, \(\beta \)-smoothness, and \(2ab\le a^2+b^2\) one gets

\[ (\nabla f(y_{t+1})-\nabla f(x_t))^\top (y_{t+1}-x_{t+1}) \]
\[ \quad \le \| \nabla f(y_{t+1})-\nabla f(x_t)\| _* \, \| y_{t+1}-x_{t+1}\| \]
\[ \quad \le \beta \| y_{t+1}-x_t\| \, \| y_{t+1}-x_{t+1}\| \]
\[ \quad \le \frac{\beta }{2}\| y_{t+1}-x_t\| ^2+\frac{\beta }{2}\| y_{t+1}-x_{t+1}\| ^2. \]

Thus summing up these three terms and using that \(\eta =\frac{\rho }{\beta }\) one gets

\[ f(y_{t+1})-f(x)\le \frac{D_{\Phi }(x,x_t)-D_{\Phi }(x,x_{t+1})}{\eta }. \]

The proof is concluded with straightforward computations.

6.4 Beyond the black-box model

Definition 6.4.1 ISTA, Iterative Shrinkage-Thresholding Algorithm
#
\[ x_{t+1}=\arg \min _{x\in \mathbb {R}^n}\eta (g(x)+\nabla f(x_t)^\top x)+\frac{1}{2}\| x-x_t\| _2^2 \]
\[ \quad =\arg \min _{x\in \mathbb {R}^n} g(x)+\frac{1}{2\eta }\| x-(x_t-\eta \nabla f(x_t))\| _2^2. \tag {5.1} \]
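
For the common choice \(g(x)=\lambda \| x\| _1\), the minimization in (5.1) has a closed form, the soft-thresholding operator. Here is a minimal Python sketch for this case; the lasso-type example and the names are ours.

```python
import numpy as np

def ista_lasso(M, b, lam, T):
    """Sketch of ISTA for f(x) = 0.5||Mx - b||^2 and g(x) = lam * ||x||_1.
    With this g, the minimization in (5.1) is soft-thresholding."""
    beta = np.linalg.norm(M, 2) ** 2          # smoothness constant of f
    x = np.zeros(M.shape[1])
    for _ in range(T):
        z = x - M.T @ (M @ x - b) / beta      # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - lam / beta, 0.0)  # prox of (lam/beta)||.||_1
    return x

# Usage:
if __name__ == "__main__":
    rng = np.random.default_rng(4)
    M, b = rng.normal(size=(40, 80)), rng.normal(size=40)
    x = ista_lasso(M, b, lam=0.5, T=500)
    print(np.count_nonzero(np.round(x, 6)))   # the l1 term promotes sparsity
```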
Theorem 6.4.1
#

Let \(f\) be convex and \(\beta \)-smooth, and let \(g\) be convex. Then ISTA with \(\eta =\frac{1}{\beta }\) satisfies

\[ f(x_t)+g(x_t)-(f(x^*)+g(x^*))\le \frac{\beta \| x_1-x^*\| _2^2}{2t}. \]
Definition 6.4.2 Fast-ISTA
#
\[ \lambda _0 = 0,\qquad \lambda _t = \frac{1 + \sqrt{1 + 4\lambda _{t-1}^2}}{2},\qquad \text{and}\qquad \gamma _t = \frac{1-\lambda _t}{\lambda _{t+1}}. \]

Let \(x_1 = y_1\) be an arbitrary initial point, and

\[ y_{t+1} \; =\; \arg \min _{x\in \mathbb {R}^n} g(x) + \frac{\beta }{2}\| x - (x_t - \tfrac {1}{\beta }\nabla f(x_t))\| _2^2, \]
\[ x_{t+1} \; =\; (1-\gamma _t)y_{t+1} + \gamma _t y_t. \]
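
A minimal Python sketch of FISTA for the same choice \(g(x)=\lambda \| x\| _1\) as in the ISTA sketch above: the prox step is unchanged and only the momentum sequence of the definition is added; names are ours.

```python
import numpy as np

def fista_lasso(M, b, lam, T):
    """Sketch of FISTA (Definition 6.4.2) for f(x) = 0.5||Mx - b||^2 and
    g(x) = lam * ||x||_1; the prox step is soft-thresholding as in ISTA."""
    beta = np.linalg.norm(M, 2) ** 2
    soft = lambda z, tau: np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
    x = y = np.zeros(M.shape[1])
    lam_t = 0.0                        # lambda_0 of the momentum sequence
    for _ in range(T):
        lam_next = (1 + np.sqrt(1 + 4 * lam_t**2)) / 2
        gamma = (1 - lam_next) / ((1 + np.sqrt(1 + 4 * lam_next**2)) / 2)
        y_next = soft(x - M.T @ (M @ x - b) / beta, lam / beta)   # prox-gradient step
        x = (1 - gamma) * y_next + gamma * y                      # momentum step
        lam_t, y = lam_next, y_next
    return y
```

The same data as in the ISTA sketch can be reused; the only change relative to ISTA is the extrapolation between consecutive prox-gradient iterates.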
Theorem 6.4.2
#

Let \(f\) be convex and \(\beta \)-smooth, and let \(g\) be convex. The rate of convergence of FISTA on \(f+g\) is similar to that of Nesterov's accelerated gradient descent on \(f\); more precisely:

\[ f(y_t) + g(y_t) - (f(x^*) + g(x^*)) \le \frac{2\beta \| x_1 - x^*\| ^2}{t^2}. \]

6.4.1 Newton’s method

Definition 6.4.3
#

\(F:\operatorname {int}(\mathcal{X})\to \mathbb {R}\) is a barrier for \(\mathcal{X}\) if

\[ \lim _{x\to \partial \mathcal{X}} F(x)=+\infty . \]
Theorem 6.4.3
#

Assume that \(f\) has a Lipschitz Hessian, that is

\[ \| \nabla ^2 f(x)-\nabla ^2 f(y)\| \le M\| x-y\| . \]

Let \(x^*\) be a local minimum of \(f\) with strictly positive Hessian, that is \(\nabla ^2 f(x^*)\succeq \mu I_n\), \(\mu {\gt}0\). Suppose that the initial starting point \(x_0\) of Newton’s method is such that

\[ \| x_0-x^*\| \le \frac{\mu }{2M}. \]

Then Newton’s method is well-defined and converges to \(x^*\) at a quadratic rate:

\[ \| x_{k+1}-x^*\| \le \frac{M}{\mu }\| x_k-x^*\| ^2. \]
Proof

We use the following simple formula, for \(x,h\in \mathbb {R}^n\),

\[ \int _0^1 \nabla ^2 f(x+sh)\, h\, ds = \nabla f(x+h)-\nabla f(x). \]

Now note that \(\nabla f(x^*)=0\), and thus with the above formula one obtains

\[ \nabla f(x_k)=\int _0^1 \nabla ^2 f(x^*+s(x_k-x^*))\, (x_k-x^*)\, ds, \]

which allows us to write:

\[ \begin{aligned} x_{k+1}-x^* & = x_k-x^*-[\nabla ^2 f(x_k)]^{-1}\nabla f(x_k)\\ & = x_k-x^*-[\nabla ^2 f(x_k)]^{-1}\int _0^1 \nabla ^2 f(x^*+s(x_k-x^*))\, (x_k-x^*)\, ds\\ & = [\nabla ^2 f(x_k)]^{-1}\int _0^1 [\nabla ^2 f(x_k)-\nabla ^2 f(x^*+s(x_k-x^*))]\, (x_k-x^*)\, ds. \end{aligned} \]

In particular one has

\[ \| x_{k+1}-x^*\| \le \| [\nabla ^2 f(x_k)]^{-1}\| \times \left(\int _0^1 \| \nabla ^2 f(x_k)-\nabla ^2 f(x^*+s(x_k-x^*))\| \, ds\right)\| x_k-x^*\| . \]

Using the Lipschitz property of the Hessian one immediately obtains that

\[ \left(\int _0^1 \| \nabla ^2 f(x_k)-\nabla ^2 f(x^*+s(x_k-x^*))\| \, ds\right)\le \frac{M}{2}\| x_k-x^*\| . \]

Using again the Lipschitz property of the Hessian (note that \(\| A-B\| \le s\iff sI_n\succeq A-B\succeq -sI_n\)), the hypothesis on \(x^*\), and an induction hypothesis that \(\| x_k-x^*\| \le \frac{\mu }{2M}\), one has

\[ \nabla ^2 f(x_k)\succeq \nabla ^2 f(x^*)-M\| x_k-x^*\| I_n\succeq (\mu -M\| x_k-x^*\| )I_n\succeq \frac{\mu }{2}I_n, \]

which concludes the proof.
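
For reference, here is a minimal Python sketch of the pure Newton iteration analyzed above (no damping or line search, so convergence is only guaranteed near \(x^*\)); the separable test function and the names are ours.

```python
import numpy as np

def newton(grad, hess, x0, T):
    """Sketch of Newton's method: x_{k+1} = x_k - [Hess f(x_k)]^{-1} grad f(x_k)."""
    x = np.array(x0, float)
    for _ in range(T):
        x = x - np.linalg.solve(hess(x), grad(x))   # one Newton step
    return x

# Usage on f(x) = sum_i (exp(x_i) - x_i), minimized at x = 0:
if __name__ == "__main__":
    grad = lambda x: np.exp(x) - 1.0
    hess = lambda x: np.diag(np.exp(x))
    print(newton(grad, hess, 0.4 * np.ones(5), T=6))   # quadratic convergence to 0
```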

6.4.2 Interior point methods

Definition 6.4.4
#

Let \(\mathcal X\) be a convex set with non-empty interior, and \(f\) a \(C^3\) convex function defined on \(\operatorname {int}(\mathcal X)\). Then \(f\) is self-concordant (with constant \(M\)) if for all \(x\in \operatorname {int}(\mathcal X)\), \(h\in \mathbb R^n\),

\[ \nabla ^3 f(x)[h,h,h]\le M\| h\| _x^3,\quad \text{where } \| h\| _x=\sqrt{h^\top \nabla ^2 f(x)\, h}\ \text{denotes the local norm at } x. \]

We say that \(f\) is standard self-concordant if \(f\) is self-concordant with constant \(M=2\).

Definition 6.4.5
#

Let \(f\) be a standard self-concordant function on \(\mathcal{X}\). For \(x\in \operatorname {int}(\mathcal{X})\), we say that \(\lambda _f(x)=\| \nabla f(x)\| _{x}^*=\sqrt{\nabla f(x)^\top [\nabla ^2 f(x)]^{-1}\nabla f(x)}\) is the Newton decrement of \(f\) at \(x\).

An important inequality is that for \(x\) such that \(\lambda _f(x){\lt}1\), and \(x^*=\operatorname*{arg\, min}f(x)\), one has

\begin{equation} \tag {5.5} \| x-x^*\| _x \le \frac{\lambda _f(x)}{1-\lambda _f(x)}. \end{equation}

Theorem 6.4.4
#

Let \(f\) be a standard self-concordant function on \(\mathcal{X}\), and \(x\in \operatorname {int}(\mathcal{X})\) such that \(\lambda _f(x)\le 1/4\), then

\[ \lambda _f\bigl(x-[\nabla ^2 f(x)]^{-1}\nabla f(x)\bigr)\le 2\lambda _f(x)^2. \]
Definition 6.4.6
#

\(F\) is a \(\nu \)-self-concordant barrier if it is a standard self-concordant function, and it is \(\tfrac {1}{\nu }\)-exp-concave.

Theorem 6.4.5
#

Let \(\mathcal X\subset \mathbb R^n\) be a closed convex set with non-empty interior. There exists \(F\) which is a \((c\, n)\)-self-concordant barrier for \(\mathcal X\) (where \(c\) is some universal constant).

For the optimization problem \(\min _{x\in \mathcal X} c^\top x\), let \(F_t(x)=tc^\top x+F(x)\) and let \(x^*(t)=\operatorname*{arg\, min}_{x\in \operatorname {int}(\mathcal X)}F_t(x)\) denote the corresponding point on the central path. A key property of \(\nu \)-self-concordant barriers is the following inequality:

\begin{equation} c^\top x^*(t)-\min _{x\in \mathcal X} c^\top x \le \frac{\nu }{t},\tag {5.10} \end{equation}

see [Equation (4.2.17), Nesterov (2004a)]. More generally using (5.10) together with (5.5) one obtains

\begin{align} c^\top y-\min _{x\in \mathcal X} c^\top x & \le \frac{\nu }{t}+c^\top (y-x^*(t))\nonumber \\ & = \frac{\nu }{t}+\frac{1}{t}(\nabla F_t(y)-\nabla F(y))^\top (y-x^*(t))\nonumber \\ & \le \frac{\nu }{t}+\frac{1}{t}\| \nabla F_t(y)-\nabla F(y)\| ^*_y\cdot \| y-x^*(t)\| _y\nonumber \\ & \le \frac{\nu }{t}+\frac{1}{t}\bigl(\lambda _{F_t}(y)+\sqrt{\nu }\bigr)\frac{\lambda _{F_t}(y)}{1-\lambda _{F_t}(y)}.\tag {5.11} \end{align}

Below we describe a precise algorithm based on the ideas developed above. As we will see, one cannot ensure that the iterates lie exactly on the central path, and thus it is useful to generalize the identity (5.7) to a point \(x\) close to the central path. We do this as follows:

\begin{equation} \tag {5.12} \lambda _{F_{t'}}(x) = \ \left\| t'c+\nabla F(x)\right\| _{x}^* = \ \left\| (t'/t)(tc+\nabla F(x))+(1-t'/t)\nabla F(x)\right\| _{x}^* \le \ \frac{t'}{t}\lambda _{F_t}(x)+\left(\frac{t'}{t}-1\right)\sqrt{\nu }. \end{equation}

We can now formally describe and analyze the most basic IPM called the path-following scheme. Let \(F\) be a \(\nu \)-self-concordant barrier for \(\mathcal{X}\). Assume that one can find \(x_0\) such that \(\lambda _{F_{t_0}}(x_0)\le 1/4\) for some small value \(t_0{\gt}0\) (we describe a method to find \(x_0\) at the end of this subsection). Then for \(k\ge 0\), let

\[ t_{k+1}=\Bigl(1+\frac{1}{13\sqrt{\nu }}\Bigr)t_k, \]
\[ x_{k+1}=x_k-[\nabla ^2 F(x_k)]^{-1}\bigl(t_{k+1}c+\nabla F(x_k)\bigr). \]
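
Here is a minimal Python sketch of this path-following scheme for a linear program \(\min c^\top x\) s.t. \(Ax\le b\), using the logarithmic barrier \(F(x)=-\sum _i\log (b_i-a_i^\top x)\) and taking \(\nu =m\) (the number of constraints). The sketch assumes a strictly feasible and approximately centered starting point \(x_0\) is supplied; the box example and the names are ours.

```python
import numpy as np

def path_following_lp(c, A, b, x0, t0, K):
    """Sketch of the path-following scheme for min c^T x s.t. Ax <= b, with the
    log barrier F(x) = -sum_i log(b_i - a_i^T x) (an m-self-concordant barrier).
    Assumes x0 is strictly feasible and approximately centered for t0."""
    nu = float(A.shape[0])                 # nu = m for the log barrier
    x, t = np.array(x0, float), float(t0)
    for _ in range(K):
        t *= 1 + 1 / (13 * np.sqrt(nu))
        s = b - A @ x                      # slacks, must stay positive
        grad_F = A.T @ (1.0 / s)
        hess_F = A.T @ (A / s[:, None] ** 2)
        x = x - np.linalg.solve(hess_F, t * c + grad_F)   # one Newton step on F_t
    return x

# Usage: minimize c^T x over the box -1 <= x_i <= 1 (A = [I; -I], b = 1).
if __name__ == "__main__":
    n = 5
    c = np.linspace(-1.0, 1.0, n)
    A = np.vstack([np.eye(n), -np.eye(n)])
    b = np.ones(2 * n)
    x = path_following_lp(c, A, b, x0=np.zeros(n), t0=1.0, K=400)
    print(x.round(3))   # each coordinate approaches -sign(c_i) (and stays 0 when c_i = 0)
```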
Theorem 6.4.6
#

The path-following scheme described above satisfies

\[ c^\top x_k - \min _{x\in \mathcal X} c^\top x \le \frac{2\nu }{t_0}\exp \! \left(-\frac{k}{1+13\sqrt{\nu }}\right). \]
Proof

We show that the iterates \((x_k)_{k\ge 0}\) remain close to the central path \((x^*(t_k))_{k\ge 0}\). Precisely one can easily prove by induction that

\[ \lambda _{F_{t_k}}(x_k)\le 1/4. \]

Indeed using Theorem 6.4.4 and equation (5.12) one immediately obtains

\[ \lambda _{F_{t_{k+1}}}(x_{k+1}) \le 2\lambda _{F_{t_{k+1}}}(x_k)^2 \le 2\Big(\frac{t_{k+1}}{t_k}\lambda _{F_{t_k}}(x_k)+\Big(\frac{t_{k+1}}{t_k}-1\Big)\sqrt{\nu }\Big)^2 \le 1/4, \]

where we used in the last inequality that \(t_{k+1}/t_k=1+\frac{1}{13\sqrt{\nu }}\) and \(\nu \ge 1\). Thus using (5.11) one obtains

\[ c^\top x_k-\min _{x\in \mathcal{X}}c^\top x\le \frac{\nu +\sqrt{\nu }/3+1/12}{t_k}\le \frac{2\nu }{t_k}. \]

Observe that \(t_k=\left(1+\frac{1}{13\sqrt{\nu }}\right)^k t_0\), which finally yields

\[ c^\top x_k-\min _{x\in \mathcal{X}}c^\top x\le \frac{2\nu }{t_0}\left(1+\frac{1}{13\sqrt{\nu }}\right)^{-k}. \]