# Deriving the Normal Equation

Consider a linear model

where

is a matrix of real numbers with as the number of samples (or rows), and is the number of features (or columns),

is a matrix (also called a *vector*) of coefficients , and

is a matrix of target variables per ith sample.

The normal equation is a matrix equation that allows you to solve above. By doing so, you can construct a linear regression model that allows you to predict the value of the target variable for any new samples. It is given by

The goal of the normal equation is to provide you with so that the cost function

is minimized.

You can find derivations of the normal equation on the internet, but most of them include a lot of handwaving and skip seemingly trivial, but actually tedious and crucial steps. I will derive it in this article.

## Minimizing Using Vector Calculus

Recall back in calculus that to minimize a function, you simply take its derivative and set it to zero. We will do the same for the cost function. But first, we need to borrow some concepts from vector calculus.

The derivative of a vector is called a gradient, and is denoted by the symbol . We subscript the symbol with some variable to denote that we’re taking a vector’s derivative with respect to that variable. Thus, the derivative with respect to is simply .

Minimizing the cost function means taking its derivative with respect to and setting the result to zero:

We will use this equation to derive the normal equation.

## Some Useful Theorems

In order to proceed with the derivation, we’re gonna need the help of some theorems. You can find these as well in any proper vector calculus textbook. Let , , and be matrices. The two theorems we will use are the following:

where the *trace* of a matrix is just the sum of its diagonal elements.

If we set the matrix as the identity matrix in the second theorem above, we simply get

By setting , we get

due to the fact that the trace of a matrix is a function that maps from to . Taking the transpose of the last equation and noting that , we get

But the left-hand side of the equation is just the first theorem. Thus, we have

We can finally proceed with our derivation.

## All Together Now

Let’s take the gradient of with respect to

The term on the right side is a real number. This indicates that the whole term in the parentheses is some function that maps to a real number. If we use the fact that the trace of a real number is just itself, and that , we get

The term has vanished because it’s treated as a constant when we’re taking the gradient with respect to .

We are now left with

If we use our derived theorem above () and substitute as follows,

we should get

Since , we finally have

which is the normal equation. How cool is that! :)