Baby Approximation Theory

When I took my A-Level examinations, I was required to take a Chemistry practical assessment, in which I collected data and needed to draw a line of best fit. Around that time, I was learning linear algebra, and I thought I’d apply what I had learned to compute the perfect best-fit line. It helped quite a bit: I got an ‘A’ for Chemistry in the end.

Given the dataset \{(x_i, y_i) : 1 \leq i \leq N\} \subseteq \mathbb R \times \mathbb R, we want to find a line of best fit, usually denoted

\hat y_i = \beta_0 + \beta_1 x_i,\quad 1 \leq i \leq N,

that minimises the total squared error \displaystyle \sum_{i=1}^N (y_i - \hat y_i )^2. How do we achieve this goal? Let’s first try to fit the data exactly, demanding \beta_0 + \beta_1 x_i = y_i for every i, which gives the matrix equation

\begin{bmatrix}1 & x_1 \\ \vdots & \vdots \\ 1 & x_N\end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}.

In general we cannot solve this exactly: we have more equations than unknowns. For instance, the simple data set \{ (1, 1), (2, 3), (3, 4) \} is already enough to confound this equation, since the three points are not collinear. With this obstacle at hand, let’s ask a more general question: what would be a “best-fit” solution to the usually-unsolvable equation \mathbf A \mathbf x = \mathbf b?
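As a quick sanity check (a minimal NumPy sketch; np.linalg.lstsq is used here as a black-box least-squares routine whose inner workings we have not yet justified), the toy dataset above really does defeat the exact equation, while a best-fit line still exists:

```python
import numpy as np

# Toy dataset {(1, 1), (2, 3), (3, 4)}: three equations, two unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 3.0, 4.0])

# The exact system A @ beta = b is inconsistent:
# the augmented matrix has larger rank than A itself.
print(np.linalg.matrix_rank(A))                          # 2
print(np.linalg.matrix_rank(np.column_stack([A, b])))    # 3

# A least-squares "best-fit" line still exists (black-box routine for now).
beta, *_ = np.linalg.lstsq(A, b, rcond=None)
print(beta)                                              # approx [-0.333, 1.5]
```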

Even more generally, if V is an inner product space over \mathbb K, and T : V \to V is a linear transformation, what would be a “best-fit” solution to the equation T(\mathbf x) = \mathbf y? By definition, a solution exists if and only if \mathbf y \in T(V) \subseteq V. What if \mathbf y \notin T(V)?

Lemma 1. For any subspace W \subseteq V, the orthogonal complement

W^{\perp} := \{ \mathbf u \in V : \mathbf u \perp W \}

is also a subspace of V. Furthermore, if V is finite-dimensional, then

V = W \oplus W^{\perp}.

Proof. That W^{\perp} is a subspace follows directly from the linearity properties of the inner product: it contains \mathbf 0 and is closed under addition and scalar multiplication. Now suppose V is finite-dimensional. Then W is finite-dimensional and has an orthonormal basis, which defines the orthogonal projection \mathrm{proj}_W onto W, and

W = \ker(I_V - \mathrm{proj}_W), \quad W^{\perp} = \ker(\mathrm{proj}_W).

The decomposition I_V = (I_V - \mathrm{proj}_W) + \mathrm{proj}_W together with the identity

\begin{aligned} \mathrm{proj}_W \circ (I_V - \mathrm{proj}_W) &= \mathrm{proj}_W - \mathrm{proj}_W^2 \\ &= \mathrm{proj}_W - \mathrm{proj}_W = O_V\end{aligned}

shows that every vector splits as an element of W plus an element of W^{\perp}. Moreover, if \mathbf v \in W \cap W^{\perp}, then \mathbf v = \mathrm{proj}_W(\mathbf v) = \mathbf 0, so the sum is direct and V = W \oplus W^{\perp}.
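Numerically, the two identities in this proof are easy to spot-check (a minimal NumPy sketch; np.linalg.qr is used only as a convenient way to get an orthonormal basis of W, and Q @ Q.T is the matrix implementing \mathrm{proj}_W with respect to it):

```python
import numpy as np

rng = np.random.default_rng(0)

# Take V = R^5 and W = the column space of a random 5x2 matrix.
B = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(B)        # columns of Q: an orthonormal basis of W

P = Q @ Q.T                   # the matrix of proj_W in the standard basis
I = np.eye(5)

print(np.allclose(P @ P, P))            # proj_W^2 = proj_W
print(np.allclose(P @ (I - P), 0))      # proj_W o (I_V - proj_W) = O_V
```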

Theorem 1. Let V be finite-dimensional and W \subseteq V a subspace. Then for any \mathbf y \in V, there exist unique vectors \mathbf w \in W and \mathbf n \in W^\perp such that

\mathbf y = \mathbf w + \mathbf n.

We call this expression the orthogonal decomposition of \mathbf y. In fact, by Lemma 1, we must have

\mathbf y = \mathrm{proj}_W(\mathbf y) + (\mathbf y - \mathrm{proj}_W(\mathbf y)).

Proof. We have established existence by Lemma 1, and it remains to prove uniqueness. Consider two possible decompositions

\mathbf y = \mathbf w + \mathbf n = \mathbf w' + \mathbf n'.

Then

\mathbf w - \mathbf w' = \mathbf n' - \mathbf n \in W \cap W^{\perp} = \{\mathbf 0\},

yielding \mathbf w = \mathbf w' and \mathbf n = \mathbf n'.
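Here is the orthogonal decomposition in action (same sketchy NumPy setup as before; W is again a random column space and Q holds an orthonormal basis for it):

```python
import numpy as np

rng = np.random.default_rng(1)

B = rng.standard_normal((5, 2))     # W = column space of B inside R^5
Q, _ = np.linalg.qr(B)              # orthonormal basis of W
P = Q @ Q.T                         # proj_W

y = rng.standard_normal(5)
w = P @ y                           # the component of y in W
n = y - w                           # the component of y in W_perp

print(np.allclose(y, w + n))        # y = w + n
print(np.allclose(Q.T @ n, 0))      # n is orthogonal to every basis vector of W
```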

So the projection is indeed a powerful tool, since we can always write \mathbf y in terms of its projection. Furthermore, the orthogonal component \mathbf y - \mathrm{proj}_{T(V)}(\mathbf y) is guaranteed to be orthogonal to the subspace T(V). But this raises the question: is \mathrm{proj}_{T(V)}(\mathbf y) really the best approximation to \mathbf y that we can find inside T(V)?

Definition 1. For any \mathbf v \in V and subspace W \subseteq V, we call \mathbf u \in W a best approximation of \mathbf v onto W if

\| \mathbf v - \mathbf u \| \leq \| \mathbf v - \mathbf w \| \quad \text{for all } \mathbf w \in W.

Theorem 2. For any \mathbf v \in V and subspace W \subseteq V, \mathrm{proj}_W(\mathbf v) is the best approximation of \mathbf v onto W.

Proof. Firstly, for vectors \mathbf u, \mathbf v \in V,

\begin{aligned} \| \mathbf u + \mathbf v \|^2 &= \langle \mathbf u + \mathbf v, \mathbf u + \mathbf v \rangle \\ &= \langle \mathbf u, \mathbf u \rangle + \langle \mathbf u, \mathbf v \rangle + \langle \mathbf v, \mathbf u \rangle + \langle \mathbf v, \mathbf v \rangle \\ &= \| \mathbf u \|^2 + \| \mathbf v \|^2 + \langle \mathbf u, \mathbf v \rangle + \langle \mathbf v, \mathbf u \rangle. \end{aligned}

Thus, we have Pythagoras’ theorem for inner product spaces: if \langle \mathbf u, \mathbf v \rangle = 0 (and hence \langle \mathbf v, \mathbf u \rangle = 0), then

\| \mathbf u + \mathbf v \|^2  = \| \mathbf u \|^2 + \| \mathbf v \|^2.

Therefore, for any \mathbf w \in W, since \mathbf v - \mathrm{proj}_W(\mathbf v) \in W^{\perp} and \mathrm{proj}_W(\mathbf v) - \mathbf w \in W, Pythagoras’ theorem gives

\begin{aligned} \| \mathbf v -\mathbf w \|^2 &= \|  (\mathbf v - \mathrm{proj}_W(\mathbf v))  + (\mathrm{proj}_W(\mathbf v) - \mathbf w)\|^2 \\ &= \| \mathbf v - \mathrm{proj}_W(\mathbf v) \|^2 + \|  \mathrm{proj}_W(\mathbf v) - \mathbf w \|^2  \\ &\geq \| \mathbf v - \mathrm{proj}_W(\mathbf v) \|^2 + 0 \\ &= \| \mathbf v - \mathrm{proj}_W(\mathbf v) \|^2. \end{aligned}

For uniqueness, suppose \mathbf u_1, \mathbf u_2 are best approximations. Define \mathbf u_3 := \frac 12 (\mathbf u_1 + \mathbf u_2). Then

\begin{aligned} \| \mathbf v - \mathbf u_3 \| &= \textstyle \| \frac 12 (\mathbf v - \mathbf u_1) + \frac 12 (\mathbf v - \mathbf u_2) \| \\ &\leq \textstyle \frac 12 \| \mathbf v - \mathbf u_1 \| + \frac 12 \| \mathbf v - \mathbf u_2 \| \\ &=  \| \mathbf v - \mathbf u_1 \|.\end{aligned}

Since \mathbf u_3 \in W and \mathbf u_1 is a best approximation, we also have \| \mathbf v - \mathbf u_1 \| \leq \| \mathbf v - \mathbf u_3 \|, so equality holds throughout and \mathbf u_3 is also a best approximation. Defining \mathbf z_i := \mathbf v - \mathbf u_i, we have the equality

\textstyle \| \mathbf z_1 \| = \| \mathbf z_2 \| = \| \frac 12 (\mathbf z_1 + \mathbf z_2) \| =: d.

If d = 0, then \mathbf u_1 = \mathbf u_2 = \mathbf v and we are done; otherwise, rescale so that d = 1 without loss of generality. Expanding both sides with the inner product, as in the computation above, gives the parallelogram identity

\|\mathbf z_1 - \mathbf z_2 \|^2 + \|\mathbf z_1 + \mathbf z_2 \|^2 = 2 \|\mathbf z_1\|^2 + 2 \|\mathbf z_2 \|^2.

Substituting the various values,

\|\mathbf z_1 - \mathbf z_2 \|^2 + 2^2 = 2 \cdot 1^2 + 2 \cdot 1^2.

Hence, \|\mathbf z_1 - \mathbf z_2 \| = 0, which yields \mathbf z_1 = \mathbf z_2 and therefore \mathbf u_1 = \mathbf u_2. The best approximation is unique.
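Theorem 2 is also easy to spot-check numerically (a minimal sketch under the same assumptions as the earlier snippets; the candidates B @ c below are just random elements of W):

```python
import numpy as np

rng = np.random.default_rng(2)

B = rng.standard_normal((6, 3))     # W = column space of B inside R^6
Q, _ = np.linalg.qr(B)
P = Q @ Q.T                         # proj_W

v = rng.standard_normal(6)
best = np.linalg.norm(v - P @ v)    # distance from v to its projection

# The projection should beat (or tie) every other candidate w in W.
others = [np.linalg.norm(v - B @ rng.standard_normal(3)) for _ in range(10_000)]
print(best <= min(others))          # True
```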

The key take-away is this: the best approximation of \mathbf v onto W is its projection \mathrm{proj}_W(\mathbf v). But how does this answer our earlier question: finding a best-fit solution to T(\mathbf x) = \mathbf y?

Corollary 1. Let V, W be finite-dimensional vector spaces over \mathbb K, with W an inner product space. Let T : V \to W be a linear transformation and \mathbf y \in W. Then the equation T(\mathbf x) = \mathbf y has a best-fit solution: any \mathbf u \in V such that T(\mathbf u) = \mathrm{proj}_{T(V)}(\mathbf y), so that T(\mathbf u) is the best approximation of \mathbf y onto T(V).

Proof. Apply Theorem 2 to the subspace T(V) \subseteq W: since \mathrm{proj}_{T(V)}(\mathbf y) lies in T(V), it equals T(\mathbf u) for some \mathbf u \in V.

This result partly solves our original problem. Let \mathbf A : \mathbb K^n \to \mathbb K^m. By Corollary 1, to find a best-fit solution to the equation \mathbf A \mathbf x = \mathbf b, we need to find \mathbf u \in \mathbb K^n such that \mathbf A \mathbf u = \mathrm{proj}_{\mathbf A(\mathbb K^n)} (\mathbf b). But how do we actually compute such a \mathbf u?
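Before answering that properly, here is a numerical sanity check that such a \mathbf u behaves as promised (a minimal sketch; np.linalg.qr and np.linalg.lstsq are black boxes here, and the Q.T below quietly uses the transpose that the next post will introduce):

```python
import numpy as np

rng = np.random.default_rng(3)

A = rng.standard_normal((7, 3))     # an overdetermined system: 7 equations, 3 unknowns
b = rng.standard_normal(7)

Q, R = np.linalg.qr(A)              # columns of Q: orthonormal basis of A(K^n)
proj_b = Q @ (Q.T @ b)              # projection of b onto the column space of A

# Solve A @ u = proj_b; this system is consistent since proj_b lies in A(K^n).
u = np.linalg.solve(R, Q.T @ b)

print(np.allclose(A @ u, proj_b))                             # True
print(np.allclose(u, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches the library answer
```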

The key property that we want to exploit is that (\mathbf b - \mathbf A \mathbf u) \perp \mathbf A(\mathbb K^n). Since the columns \mathbf A \mathbf e_j span \mathbf A(\mathbb K^n), this means that for every standard basis vector \mathbf e_j,

\langle \mathbf A \mathbf e_j, \mathbf b - \mathbf A \mathbf u \rangle = 0 \quad \iff \quad \langle \mathbf A \mathbf e_j, \mathbf A \mathbf u  \rangle = \langle \mathbf A \mathbf e_j, \mathbf b \rangle.
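Here is that orthogonality condition checked numerically (again a sketch; np.linalg.lstsq supplies the best-fit \mathbf u as a black box):

```python
import numpy as np

rng = np.random.default_rng(4)

A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)

u = np.linalg.lstsq(A, b, rcond=None)[0]    # best-fit solution (black box for now)
residual = b - A @ u

# <A e_j, b - A u> = 0 for every standard basis vector e_j,
# i.e. the residual is orthogonal to every column of A.
for j in range(A.shape[1]):
    print(np.isclose(A[:, j] @ residual, 0.0))    # True for each column
```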

It turns out that in solving this problem, we are finally forced to introduce the idea of the matrix transpose, which we will properly discuss in the next post. Remember, we haven’t technically answered our best-fit line question: find meaningful constants \beta_0,\beta_1 such that

\hat y_i = \beta_0 + \beta_1 x_i,\quad 1 \leq i \leq N.

As a side-note, we can generalise this problem to a more realistic setting involving machine learning datasets \{(\mathbf x_i, y_i) : 1 \leq i \leq N\} \subseteq \mathbb R^n \times \mathbb R, where we seek sufficiently meaningful parameters \theta_0 \in \mathbb R, \boldsymbol{\theta} \in \mathbb R^n such that

\hat y_i = \theta_0 + \langle \boldsymbol{\theta}, \mathbf x_i \rangle,\quad 1 \leq i \leq N.
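As a sneak peek at that setting (a minimal sketch with made-up data; np.linalg.lstsq is again a black box, and prepending a column of ones is one standard way to absorb \theta_0 into the same least-squares problem):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic dataset {(x_i, y_i)} with x_i in R^3 (illustrative numbers only).
N, n = 50, 3
X = rng.standard_normal((N, n))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.standard_normal(N)

# Prepend a column of ones so that theta_0 plays the role of beta_0 above.
X1 = np.column_stack([np.ones(N), X])

theta = np.linalg.lstsq(X1, y, rcond=None)[0]
theta0, theta_rest = theta[0], theta[1:]
print(theta0, theta_rest)   # close to 2.0 and [1.0, -0.5, 0.25]
```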

More on these ideas in future posts.

—Joel Kindiak, 12 Mar 25, 2102H
