Theoretical Optimisation

Let f : \mathbb R^n \to \mathbb R be Fréchet-differentiable.

Problem 1. Prove that if \mathbf x_0 \in \mathbb R^n gives a local extremum for f, then f'(\mathbf x_0) = 0.


Solution. Suppose \mathbf x_0 is a local maximum for f without loss of generality (otherwise, \mathbf x_0 is a local maximum for -f). Then there exists r > 0 such that for all \mathbf v with \| \mathbf v \| < r,

f( \mathbf x_0 + \mathbf v) \leq f(\mathbf x_0).

Consider the quantity f_{x_i}(\mathbf x_0). By definition,

\displaystyle f_{x_i}(\mathbf x_0) = \lim_{t \to 0} \frac{f(\mathbf x_0 + t \cdot \mathbf e_i) - f(\mathbf x_0)}{t}.

For 0 < |t| < r, we have f(\mathbf x_0 + t \mathbf e_i) \leq f(\mathbf x_0), so the difference quotient above is at most 0 when t > 0 and at least 0 when t < 0. Therefore, taking t \to 0^+ yields f_{x_i}(\mathbf x_0) \leq 0, and taking t \to 0^- yields f_{x_i}(\mathbf x_0) \geq 0, so that f_{x_i}(\mathbf x_0) = 0. Since i was arbitrary, (\nabla f)(\mathbf x_0) = \mathbf 0. Therefore, for any \mathbf v,

(f'(\mathbf x_0))(\mathbf v) = \langle (\nabla f)(\mathbf x_0), \mathbf v \rangle = \langle 0, \mathbf v \rangle = 0.

Therefore, f'(\mathbf x_0) = 0.
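
As a quick sanity check of Problem 1, here is a small SymPy sketch; the function f(x, y) = -(x^2 + y^2) is an illustrative choice, not taken from the text. It has a global maximum at the origin, and its gradient indeed vanishes there.

    import sympy as sp

    # Illustrative example (not from the text): f(x, y) = -(x^2 + y^2) has a
    # global maximum at the origin, so its gradient should vanish there.
    x, y = sp.symbols('x y', real=True)
    f = -(x**2 + y**2)

    grad_f = [sp.diff(f, v) for v in (x, y)]          # symbolic gradient
    print(grad_f)                                     # [-2*x, -2*y]
    print([g.subs({x: 0, y: 0}) for g in grad_f])     # [0, 0], consistent with Problem 1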

Suppose f has continuous first- and second-order partial derivatives, and let D \subseteq \mathbb R^n be a domain.

Definition 1. Define the Hessian matrix \mathbf H(f) of f by

(\mathbf H(f))(\cdot) := \begin{bmatrix} f_{x_1x_1}(\cdot) & f_{x_2x_1} (\cdot)& \cdots & f_{x_n x_1}(\cdot) \\ f_{x_1 x_2}(\cdot) & f_{x_2x_2}(\cdot) & \cdots & f_{x_n x_2}(\cdot) \\ \vdots & \vdots & \ddots & \vdots \\ f_{x_1 x_n}(\cdot) & f_{x_2 x_n}(\cdot) & \cdots & f_{x_n x_n}(\cdot) \end{bmatrix}.

We remark that \mathbf H(f) is symmetric by Clairaut’s theorem.
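
For readers who want to experiment with Definition 1, the following sketch (with a sample function chosen purely for illustration) assembles the Hessian symbolically and checks the symmetry guaranteed by Clairaut’s theorem.

    import sympy as sp

    # Illustrative sample function; sp.hessian assembles the matrix of
    # second-order partial derivatives from Definition 1.
    x, y = sp.symbols('x y', real=True)
    f = x**2 * y + sp.sin(y)

    H = sp.hessian(f, (x, y))
    print(H)            # Matrix([[2*y, 2*x], [2*x, -sin(y)]])
    print(H == H.T)     # True: symmetric, as Clairaut's theorem predicts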

Problem 2. Suppose n = 2 and f'(x_0, y_0) = 0. Prove the following:

  • (x_0,y_0) gives a local maximum if |(\mathbf H(f))(x_0,y_0)| > 0 and f_{xx}(x_0, y_0) < 0.
  • (x_0,y_0) gives a local minimum if |(\mathbf H(f))(x_0,y_0)| > 0 and f_{xx}(x_0, y_0) > 0.
  • (x_0,y_0) is not a local extremum if |(\mathbf H(f))(x_0,y_0)| < 0.

Solution. In the case n = 2,

(\mathbf H(f))(\cdot) = \begin{bmatrix} f_{xx}(\cdot) & f_{yx}(\cdot) \\ f_{xy}(\cdot) & f_{yy}(\cdot) \end{bmatrix},

and so, by Clairaut’s theorem, its determinant is

|(\mathbf H(f))(\cdot)| = f_{xx} \cdot f_{yy} - f_{xy}^2.

Fix \mathbf v \in \mathbb R^2 and assume \|\mathbf v\| = 1 without loss of generality. Compute \partial_{\mathbf v}^2(f):= \partial_{\mathbf v}(\partial_{\mathbf v}(f)) and simplify using Clairaut’s theorem and completing the square (valid wherever f_{xx} \neq 0) to obtain

\displaystyle \begin{aligned}\partial_{\mathbf v}^2(f) &= \partial_{\mathbf v}(\partial_{\mathbf v}(f)) \\ &= \partial_{\mathbf v}(v_1 \cdot f_x + v_2 \cdot f_y) \\ &= v_1 \cdot (v_1 \cdot f_x + v_2 \cdot f_y)_x + v_2 \cdot (v_1 \cdot f_x + v_2 \cdot f_y)_y \\ &= v_1^2 \cdot f_{xx} + 2v_1 v_2 \cdot f_{xy} + v_2^2 \cdot f_{yy} \\ &= f_{xx} \cdot \left( v_1 + v_2\cdot \frac{f_{xy}}{f_{xx}}\right)^2 + \frac{v_2^2}{f_{xx}} \cdot (f_{xx} \cdot f_{yy} - f_{xy}^2) \\ &= f_{xx} \cdot \left[ \left( v_1 + v_2\cdot \frac{f_{xy}}{f_{xx}}\right)^2 + \frac{v_2^2}{f_{xx}^2} \cdot |(\mathbf H(f))(\cdot)| \right]. \end{aligned}

Write \mathbf x_0 := (x_0, y_0) and suppose f_{xx}(\mathbf x_0) > 0 and |(\mathbf H(f))(\mathbf x_0)| > 0. Since the second-order partial derivatives are continuous, there exists \delta > 0 such that f_{xx} > 0 and |\mathbf H(f)| > 0 whenever \|\mathbf w\| \leq \delta; by the expression above, (\partial_{\mathbf v}^2(f))(\mathbf x_0 + \mathbf w) > 0 for every such \mathbf w and every unit vector \mathbf v. Consider the function \gamma : [-1, 1] \to \mathbb R defined by

\displaystyle \gamma(t) = f(\mathbf x_0 + t \cdot \delta \cdot \mathbf v).

Differentiating twice,

\begin{aligned} \gamma'(t) &= \delta \cdot (\partial_{\mathbf v}(f))(\mathbf x_0 + t \cdot \delta \cdot \mathbf v), \\ \gamma''(t) &= \delta^2 \cdot (\partial_{\mathbf v}^2(f))(\mathbf x_0 + t \cdot \delta \cdot \mathbf v).\end{aligned}

In particular, \gamma'(0) = \delta \cdot \langle (\nabla f)(\mathbf x_0), \mathbf v \rangle = 0 and \gamma''(t) > 0 for all t \in [-1, 1]. Hence \gamma is convex with a critical point at 0, so \gamma(t) \geq \gamma(0) = f(x_0, y_0) for all t \in [-1, 1]. Since every \mathbf w with \|\mathbf w\| \leq \delta can be written as t \cdot \delta \cdot \mathbf v for some unit vector \mathbf v and some t \in [-1, 1], it follows that f(\mathbf x_0 + \mathbf w) \geq f(\mathbf x_0) throughout this ball, so (x_0, y_0) gives a local minimum for f.

The case f_{xx}(x_0, y_0) < 0 follows by applying the same argument to -f. Finally, if |(\mathbf H(f))(x_0, y_0)| < 0, the quadratic form \mathbf v \mapsto (\partial_{\mathbf v}^2(f))(\mathbf x_0) has matrix (\mathbf H(f))(\mathbf x_0) with negative determinant and is therefore indefinite: along a direction where the form is positive, the one-variable second derivative test gives a strict local minimum of f restricted to that line, and along a direction where it is negative, a strict local maximum. Hence (x_0, y_0) is not a local extremum.
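
The test in Problem 2 is straightforward to mechanise. The sketch below uses an illustrative function, f(x, y) = x^3 - 3x + y^2, which is not from the text; it finds the critical points and classifies each one using the determinant of the Hessian and the sign of f_{xx}.

    import sympy as sp

    # Illustrative function (not from the text) with two critical points:
    # a local minimum at (1, 0) and a saddle at (-1, 0).
    x, y = sp.symbols('x y', real=True)
    f = x**3 - 3*x + y**2

    grad = [sp.diff(f, v) for v in (x, y)]
    H = sp.hessian(f, (x, y))

    for pt in sp.solve(grad, (x, y), dict=True):      # critical points
        det = H.det().subs(pt)
        fxx = H[0, 0].subs(pt)
        if det > 0 and fxx > 0:
            kind = "local minimum"
        elif det > 0 and fxx < 0:
            kind = "local maximum"
        elif det < 0:
            kind = "not a local extremum (saddle)"
        else:
            kind = "test is inconclusive"
        print(pt, kind)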

Problem 3. Write down the maximum value of (\partial_{\mathbf v}(f))(\mathbf x_0) over unit vectors \mathbf v in terms of \nabla f.


Solution. Let \mathbf v be a unit vector. Recall that

(\partial_{\mathbf v}(f))(\mathbf x_0) = \langle (\nabla f)(\mathbf x_0), \mathbf v \rangle = \| (\nabla f)(\mathbf x_0) \| \cdot \|\mathbf v \| \cdot \cos(\theta),

where \theta \in [0, \pi] is the angle between \mathbf v and (\nabla f)(\mathbf x_0). To maximise (\partial_{\mathbf v}(f))(\mathbf x_0), we need \cos(\theta) = 1, which means \theta = 0. In this case, \mathbf v \parallel (\nabla f)(\mathbf x_0), so that

\displaystyle \mathbf v = \frac{(\nabla f)(\mathbf x_0)}{\| (\nabla f)(\mathbf x_0) \|}.

Therefore, the maximum value of (\partial_{\mathbf v}(f))(\mathbf x_0) is \|(\nabla f)(\mathbf x_0)\|. On the opposite end, the vector

\displaystyle \mathbf w := -\frac{(\nabla f)(\mathbf x_0)}{\| (\nabla f)(\mathbf x_0) \|}

minimises the directional derivative (\partial_{\mathbf w}(f))(\mathbf x_0); this observation underlies the famous machine learning technique called gradient descent.
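
The following is a minimal gradient descent sketch; the quadratic objective, step size, and starting point are illustrative choices rather than anything prescribed above. Each iteration steps against the gradient, the direction of steepest descent.

    import numpy as np

    # Objective, gradient, step size, and starting point are illustrative choices.
    def f(v):
        return (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 0.5) ** 2

    def grad_f(v):
        return np.array([2.0 * (v[0] - 1.0), 4.0 * (v[1] + 0.5)])

    v = np.array([5.0, 5.0])          # starting point
    eta = 0.1                         # step size ("learning rate")
    for _ in range(200):
        v = v - eta * grad_f(v)       # step against the gradient: steepest descent

    print(v)                          # close to the minimiser (1, -0.5)
    print(f(v))                       # close to 0

With this step size the iterates converge quickly; too large a step size can make the iteration diverge, which is why the learning rate is tuned in practice.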

Let g : \mathbb R^n \to \mathbb R be Fréchet-differentiable.

Problem 4. Define the zero surface of g by

S := \{\mathbf x \in \mathbb R^n : g(\mathbf x) = 0\} \subseteq \mathbb R^n.

Let \mathbf r : [-1, 1] \to S be a continuously differentiable curve with \mathbf r(0) = \mathbf x_0. Prove that \langle (\nabla g)(\mathbf x_0), \mathbf r'(0) \rangle = 0.


Solution. Since \mathbf r takes values in S, we have g \circ \mathbf r \equiv 0 on [-1, 1]. By the chain rule,

\langle (\nabla g)(\mathbf x_0), \mathbf r'(0) \rangle = (g'(\mathbf r(0)))(\mathbf r'(0)) = (g \circ \mathbf r)'(0) = 0.
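
As an illustration of Problem 4, the sketch below takes g to be the unit-sphere constraint in \mathbb R^3 and \mathbf r to be the equator (both purely illustrative choices) and verifies symbolically that \mathbf r'(0) is orthogonal to (\nabla g)(\mathbf r(0)).

    import sympy as sp

    # Illustrative surface and curve: the unit sphere g = x^2 + y^2 + z^2 - 1
    # with the equator r(t) = (cos t, sin t, 0), which lies in S.
    t, x, y, z = sp.symbols('t x y z', real=True)
    g = x**2 + y**2 + z**2 - 1

    r = sp.Matrix([sp.cos(t), sp.sin(t), 0])
    grad_g = sp.Matrix([sp.diff(g, v) for v in (x, y, z)])

    x0 = r.subs(t, 0)                                     # r(0) = (1, 0, 0)
    r_prime_0 = r.diff(t).subs(t, 0)                      # r'(0) = (0, 1, 0)
    grad_at_x0 = grad_g.subs({x: x0[0], y: x0[1], z: x0[2]})

    print(grad_at_x0.dot(r_prime_0))                      # 0, as Problem 4 asserts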

Problem 5. The tangent plane T to S at \mathbf x_0 is defined by

T := \{\mathbf x \in \mathbb R^n : \langle \mathbf x - \mathbf x_0, (\nabla g)(\mathbf x_0) \rangle = 0\}.

Prove that if (\nabla g)(\mathbf x_0) \neq \mathbf 0, then for any vector \mathbf u \in \mathbb R^n,

\mathbf u \perp T \quad \iff \quad \mathbf u \in \mathrm{span}\{ (\nabla g)(\mathbf x_0) \}.


Solution. Since T is the translate of a linear subspace of \mathbb R^n by \mathbf x_0, we may assume \mathbf x_0 = \mathbf 0 without loss of generality. Then T = \mathrm{span}\{ (\nabla g)(\mathbf x_0) \}^{\perp}, a subspace of dimension n - 1 because (\nabla g)(\mathbf x_0) \neq \mathbf 0. Hence \mathbf u \perp T if and only if \mathbf u \in T^{\perp} = \mathrm{span}\{ (\nabla g)(\mathbf x_0) \}.

Problem 6. Suppose g has continuous first-order partial derivatives and \nabla g \neq \mathbf 0 on S. Characterise the local extrema of f on S whenever they exist.


Solution. Let \mathbf x_0 be a local extremum of f on S. We claim that there exists \lambda \in \mathbb R such that

(\nabla f)(\mathbf x_0) = \lambda \cdot (\nabla g)(\mathbf x_0).

Let \mathbf r : [-1, 1] \to S be any continuously differentiable curve with \mathbf r(0) = \mathbf x_0, and define h := f \circ \mathbf r : [-1, 1] \to \mathbb R. Since \mathbf r takes values in S and \mathbf x_0 gives a local extremum of f on S, h has a local extremum at 0, so h'(0) = 0. By the chain rule,

\displaystyle 0 = h'(0) = (f'(\mathbf r(0)))(\mathbf r'(0)) = \langle (\nabla f)(\mathbf x_0), \mathbf r'(0) \rangle.

In particular, (\nabla f)(\mathbf x_0) is orthogonal to \mathbf r'(0) for every such curve. Since (\nabla g)(\mathbf x_0) \neq \mathbf 0, the implicit function theorem shows that every vector in T - \mathbf x_0 arises as \mathbf r'(0) for some such curve, so (\nabla f)(\mathbf x_0) \perp T. By Problem 5, (\nabla f)(\mathbf x_0) \in \mathrm{span}\{ (\nabla g)(\mathbf x_0) \}. Hence, there exists \lambda \in \mathbb R such that

(\nabla f)(\mathbf x_0) = \lambda \cdot (\nabla g)(\mathbf x_0).

Corollary 1. Define the Lagrangian \mathcal L : \mathbb R^n \times \mathbb R \to \mathbb R by

\mathcal L(\mathbf x, \lambda) := f(\mathbf x) - \lambda \cdot g(\mathbf x).

Under the hypotheses of Problems 4 to 6, if \mathbf x_0 is a local extremum of f on S, then there exists \lambda \in \mathbb R such that (\nabla \mathcal L)(\mathbf x_0, \lambda) = \mathbf 0; the component of this equation in the \lambda-direction recovers the constraint g(\mathbf x_0) = 0, while the remaining components give (\nabla f)(\mathbf x_0) = \lambda \cdot (\nabla g)(\mathbf x_0). In this case, we call \lambda a Lagrange multiplier for the optimisation problem.
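
To see Corollary 1 in action, here is a sketch on an illustrative constrained problem (not from the text): optimise f(x, y) = xy subject to g(x, y) = x + y - 4 = 0 by solving \nabla \mathcal L = \mathbf 0 symbolically.

    import sympy as sp

    # Illustrative constrained problem (not from the text):
    # optimise f(x, y) = x*y subject to g(x, y) = x + y - 4 = 0.
    x, y, lam = sp.symbols('x y lambda', real=True)
    f = x * y
    g = x + y - 4

    L = f - lam * g                                        # the Lagrangian
    stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)

    print(stationary)                # [{lambda: 2, x: 2, y: 2}]
    print(f.subs(stationary[0]))     # 4, the constrained extremum

The single stationary point (2, 2), with \lambda = 2, gives the constrained maximum value 4 on the line x + y = 4.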

—Joel Kindiak, 13 Aug 25, 1243H
