Linear Algebra – Orthogonality and least squares – Least-squares problems
A system of equations \(A\mathbf{x}=\mathbf{b}\) is consistent only if \(\mathbf{b}\in\text{Col}(A)\). In practice such a system is often inconsistent, for instance because of measurement or rounding errors, which is unsatisfactory. In order to still 'solve' the system, one replaces the vector \(\mathbf{b}\) by a vector that is contained in \(\text{Col}(A)\). To keep the resulting 'error' as small as possible, one chooses the vector in \(\text{Col}(A)\) that is closest to \(\mathbf{b}\): the (orthogonal) projection of \(\mathbf{b}\) onto \(\text{Col}(A)\). A solution of the system of equations obtained in this manner is called a least-squares solution of \(A\mathbf{x}=\mathbf{b}\):
Definition: Suppose that \(A\) is an \(m\times n\) matrix and \(\mathbf{b}\in\mathbb{R}^m\). Then \(\hat{\mathbf{x}}\in\mathbb{R}^n\) is called a least-squares solution of \(A\mathbf{x}=\mathbf{b}\) if \(||\mathbf{b}-A\hat{\mathbf{x}}||\leq||\mathbf{b}-A\mathbf{x}||\) for all \(\mathbf{x}\in\mathbb{R}^n\).
Finding such a least-squares solution comes down to minimizing the distance \(||\mathbf{b}-A\mathbf{x}||\), or equivalently \(||\mathbf{b}-A\mathbf{x}||^2\), which is a sum of squares. This explains the phrase "least-squares solution".
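Explicitly, writing \((A\mathbf{x})_i\) for the \(i\)-th entry of \(A\mathbf{x}\), the quantity being minimized is
\[||\mathbf{b}-A\mathbf{x}||^2=\sum_{i=1}^{m}\bigl(b_i-(A\mathbf{x})_i\bigr)^2.\]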
Now we look for such a least-squares solution. Let \(\hat{\mathbf{b}}=\text{proj}_{\text{Col}(A)}\mathbf{b}\); then \(A\mathbf{x}=\hat{\mathbf{b}}\) is consistent. Suppose that \(\hat{\mathbf{x}}\) is a solution, so that \(A\hat{\mathbf{x}}=\hat{\mathbf{b}}\). By the properties of the orthogonal projection, \(\mathbf{b}-\hat{\mathbf{b}}\) is orthogonal to \(\text{Col}(A)\). So: \(\mathbf{b}-A\hat{\mathbf{x}}\) is orthogonal to every column of \(A\). If \(\mathbf{a}_j\) is such a column of \(A\), then \(\mathbf{a}_j\cdot(\mathbf{b}-A\hat{\mathbf{x}})=0\), or equivalently \(\mathbf{a}_j^T(\mathbf{b}-A\hat{\mathbf{x}})=0\). This holds for each \(j=1,2,\ldots,n\), so:
\[A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}\quad\Longleftrightarrow\quad A^T\mathbf{b}-A^TA\hat{\mathbf{x}}=\mathbf{0} \quad\Longleftrightarrow\quad A^TA\hat{\mathbf{x}}=A^T\mathbf{b}.\]The latter system of equations is referred to as the normal equations:
Theorem: The set of least-squares solutions \(\hat{\mathbf{x}}\) of \(A\mathbf{x}=\mathbf{b}\) coincides with the nonempty set of solutions of the normal equations \(A^TA\hat{\mathbf{x}}=A^T\mathbf{b}\).
Proof: We have already seen that a least-squares solution \(\hat{\mathbf{x}}\) of \(A\mathbf{x}=\mathbf{b}\) must satisfy the normal equations \(A^TA\hat{\mathbf{x}}=A^T\mathbf{b}\). Conversely, if \(\hat{\mathbf{x}}\) is a solution of \(A^TA\hat{\mathbf{x}}=A^T\mathbf{b}\), then \(\mathbf{b}-A\hat{\mathbf{x}}\) is orthogonal to all rows of \(A^T\) and therefore to all columns of \(A\). Hence: \(\mathbf{b}−A\hat{\mathbf{x}}\) is orthogonal to \(\text{Col}(A)\). This implies that \(\mathbf{b}=A\hat{\mathbf{x}}+(\mathbf{b}-A\hat{\mathbf{x}})\) is a decomposition of \(\mathbf{b}\) with \(A\hat{\mathbf{x}}\in\text{Col}(A)\) and \(\mathbf{b}-A\hat{\mathbf{x}}\perp\text{Col}(A)\). Since such a decomposition is unique, we have that \(A\hat{\mathbf{x}}\) must be the orthogonal projection of \(\mathbf{b}\) onto \(\text{Col}(A)\). Hence: \(A\hat{\mathbf{x}}=\hat{\mathbf{b}}\) and this implies that \(\hat{\mathbf{x}}\) is a least-squares solution of \(A\mathbf{x}=\mathbf{b}\).
Remark: a least-squares solution is unique if the matrix \(A^TA\) is invertible. This is the case precisely when the columns of \(A\) are linearly independent. We will not prove this.
The distance \(||\mathbf{b}-\hat{\mathbf{b}}||=||\mathbf{b}-A\hat{\mathbf{x}}||\) is called the least-squares error.
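The procedure described above is easy to carry out numerically. Below is a minimal sketch in Python/NumPy (the function name is ours, chosen for illustration) that forms and solves the normal equations and computes the least-squares error; it assumes that the columns of \(A\) are linearly independent, so that \(A^TA\) is invertible.

```python
import numpy as np

def least_squares_via_normal_equations(A, b):
    """Solve the normal equations A^T A x = A^T b and return the least-squares
    solution x_hat together with the least-squares error ||b - A x_hat||.
    Assumes the columns of A are linearly independent (A^T A invertible)."""
    x_hat = np.linalg.solve(A.T @ A, A.T @ b)
    error = np.linalg.norm(b - A @ x_hat)
    return x_hat, error
```

For a matrix with linearly dependent columns, \(A^TA\) is singular and this direct solve fails; the second example below illustrates that case.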
Examples:
1) If \(A=\begin{pmatrix}1&0\\1&1\\0&1\end{pmatrix}\) and \(\mathbf{b}=\begin{pmatrix}1\\2\\2\end{pmatrix}\), then \(A\mathbf{x}=\mathbf{b}\) is inconsistent.
Now we have:
\[A^TA=\begin{pmatrix}1&1&0\\0&1&1\end{pmatrix}\begin{pmatrix}1&0\\1&1\\0&1\end{pmatrix}=\begin{pmatrix}2&1\\1&2\end{pmatrix} \quad\text{and}\quad A^T\mathbf{b}=\begin{pmatrix}1&1&0\\0&1&1\end{pmatrix}\begin{pmatrix}1\\2\\2\end{pmatrix}=\begin{pmatrix}3\\4\end{pmatrix}.\]Hence:
\[A^TA\mathbf{x}=A^T\mathbf{b}:\quad\left(\left.\begin{matrix}2&1\\1&2\end{matrix}\,\right|\,\begin{matrix}3\\4\end{matrix}\right) \sim\left(\left.\begin{matrix}1&2\\0&-3\end{matrix}\,\right|\,\begin{matrix}4\\-5\end{matrix}\right) \sim\left(\left.\begin{matrix}3&0\\0&3\end{matrix}\,\right|\,\begin{matrix}2\\5\end{matrix}\right)\quad\Longrightarrow\quad \hat{\mathbf{x}}=\frac{1}{3}\begin{pmatrix}2\\5\end{pmatrix}.\]So we have that \(\displaystyle\hat{\mathbf{b}}=A\hat{\mathbf{x}}=\frac{1}{3}\begin{pmatrix}1&0\\1&1\\0&1\end{pmatrix} \begin{pmatrix}2\\5\end{pmatrix}=\frac{1}{3}\begin{pmatrix}2\\7\\5\end{pmatrix}\) is the orthogonal projection of \(\mathbf{b}\) onto \(\text{Col}(A)\). Then the least-squares error is
\[||\mathbf{b}-\hat{\mathbf{b}}||=||\mathbf{b}-A\hat{\mathbf{x}}||=||\begin{pmatrix}1\\2\\2\end{pmatrix} -\frac{1}{3}\begin{pmatrix}2\\7\\5\end{pmatrix}||=||\frac{1}{3}\begin{pmatrix}1\\-1\\1\end{pmatrix}||=\frac{1}{3}\sqrt{3}.\]
2) Suppose that \(A=\begin{pmatrix}1&0&1\\1&0&1\\0&1&1\\0&1&1\end{pmatrix}\) and \(\mathbf{b}=\begin{pmatrix}1\\3\\1\\3\end{pmatrix}\). Then the columns of \(A\) are linearly dependent, since the third column is the sum of the first and the second column. Moreover, the system \(A\mathbf{x}=\mathbf{b}\) is inconsistent.
Now we have:
\[A^TA=\begin{pmatrix}1&1&0&0\\0&0&1&1\\1&1&1&1\end{pmatrix}\begin{pmatrix}1&0&1\\1&0&1\\0&1&1\\0&1&1\end{pmatrix} =\begin{pmatrix}2&0&2\\0&2&2\\2&2&4\end{pmatrix}\quad\text{and}\quad A^T\mathbf{b}=\begin{pmatrix}1&1&0&0\\0&0&1&1\\1&1&1&1\end{pmatrix} \begin{pmatrix}1\\3\\1\\3\end{pmatrix}=\begin{pmatrix}4\\4\\8\end{pmatrix}.\]Hence:
\[A^TA\mathbf{x}=A^T\mathbf{b}:\quad\left(\left.\begin{matrix}2&0&2\\0&2&2\\2&2&4\end{matrix}\,\right|\,\begin{matrix}4\\4\\8\end{matrix}\right) \sim\left(\left.\begin{matrix}2&0&2\\0&2&2\\0&2&2\end{matrix}\,\right|\,\begin{matrix}4\\4\\4\end{matrix}\right) \sim\left(\left.\begin{matrix}1&0&1\\0&1&1\\0&0&0\end{matrix}\,\right|\,\begin{matrix}2\\2\\0\end{matrix}\right).\]In this case there are thus infinitely many solutions:
\[\left\{\begin{array}{l}x_1=2-x_3\\x_2=2-x_3\\x_3\;\text{is free}\end{array}\right.\quad\Longrightarrow\quad \hat{\mathbf{x}}=\begin{pmatrix}2-x_3\\2-x_3\\x_3\end{pmatrix}=\begin{pmatrix}2\\2\\0\end{pmatrix}+x_3\begin{pmatrix}-1\\-1\\1\end{pmatrix}.\]Of course, the orthogonal projection of \(\mathbf{b}\) onto \(\text{Col}(A)\) is still unique:
\[\hat{\mathbf{b}}=A\hat{\mathbf{x}}=\begin{pmatrix}1&0&1\\1&0&1\\0&1&1\\0&1&1\end{pmatrix} \begin{pmatrix}2-x_3\\2-x_3\\x_3\end{pmatrix}=\begin{pmatrix}2\\2\\2\\2\end{pmatrix}.\]
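Both examples can be checked numerically. The sketch below assumes NumPy; for the rank-deficient matrix of the second example, numpy.linalg.lstsq returns one particular least-squares solution (the one of minimum norm), while the projection \(A\hat{\mathbf{x}}\) is the same for every least-squares solution.

```python
import numpy as np

# Example 1: independent columns, so the least-squares solution is unique.
A1 = np.array([[1.0, 0.0],
               [1.0, 1.0],
               [0.0, 1.0]])
b1 = np.array([1.0, 2.0, 2.0])
x1 = np.linalg.solve(A1.T @ A1, A1.T @ b1)   # expected (2/3, 5/3)
err1 = np.linalg.norm(b1 - A1 @ x1)          # expected sqrt(3)/3 ≈ 0.577

# Example 2: dependent columns, so A^T A is singular and there are infinitely
# many least-squares solutions; lstsq returns the minimum-norm one.
A2 = np.array([[1.0, 0.0, 1.0],
               [1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0]])
b2 = np.array([1.0, 3.0, 1.0, 3.0])
x2, *_ = np.linalg.lstsq(A2, b2, rcond=None)
b2_hat = A2 @ x2                             # expected (2, 2, 2, 2), the unique projection

print(x1, err1)
print(x2, b2_hat)
```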
The \(QR\) factorization of a matrix can help to shorten the computation:
Theorem: If \(A\) is an \(m\times n\) matrix with linearly independent columns and \(A=QR\) is the \(QR\) factorization of \(A\), then we have: for each \(\mathbf{b}\in\mathbb{R}^m\) the equation \(A\mathbf{x}=\mathbf{b}\) has a unique least-squares solution \(\hat{\mathbf{x}}\) given by \(R\hat{\mathbf{x}}=Q^T\mathbf{b}\).
Remark: the least-squares solution \(\hat{\mathbf{x}}\) is unique because the columns of \(A\) are linearly independent. Since \(R\) is an upper triangular matrix, the equation \(R\hat{\mathbf{x}}=Q^T\mathbf{b}\) can (easily) be solved by using back substitution.
Proof: For a \(QR\) factorization of \(A\) we have that \(Q^TQ=I\) (since \(Q\) has orthonormal columns) and that \(R\) is invertible. Hence:
\[A^TA=(QR)^TQR=R^TQ^TQR=R^TIR=R^TR\quad\text{and}\quad A^T\mathbf{b}=(QR)^T\mathbf{b}=R^TQ^T\mathbf{b}.\]Since \(R\) is invertible (\(\text{det}(R)\neq0\)), \(R^T\) is invertible as well (since: \(\det(R^T)=\det(R)\neq0\)). Hence:
\[A^TA\hat{\mathbf{x}}=A^T\mathbf{b}\quad\Longleftrightarrow\quad R^TR\hat{\mathbf{x}}=R^TQ^T\mathbf{b} \quad\Longleftrightarrow\quad R\hat{\mathbf{x}}=Q^T\mathbf{b}.\]
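A minimal sketch of this QR-based procedure, assuming NumPy and SciPy and reusing the matrix of the first example (numpy.linalg.qr may flip signs so that some diagonal entries of \(R\) are negative; this does not affect the solution):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 2.0])

Q, R = np.linalg.qr(A)                 # reduced QR: Q has orthonormal columns, R is upper triangular
x_hat = solve_triangular(R, Q.T @ b)   # back substitution for R x_hat = Q^T b
print(x_hat)                           # expected (2/3, 5/3), as in the first example
```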