Linear Regression Through Least Squares / A Tutorial

December 25th 2012

This is a simple, easy-to-understand tutorial on doing regression mathematically. For those who found this page accidentally, and are unsure what’s going on, regression is, roughly, a way to find a line/curve that approximates a set of data ( a line of best fit ). This is often necessary when dealing with data from experimentation, since it is unlikely to fit the expected type of line perfectly. Real life isn’t linear!

Down to business

O.K, so here we go. This first post is on the least squares method. You may have heard of it before – it’s easily the most popular method for regression. I’m choosing to use it now because it is both simple to implement and simple to understand conceptually. Note: The following tutorial assumes very little about prior mathematical knowledge. However, to full understand the theory behind the method, knowledge of differential calculus is necessary. Luckily, use and programming of this method does not require such knowledge, so feel free to skip any parts that you feel like.

Here’s the basic idea:

We are given a set of (x, y) coordinates, and we want to find the line that best models the relationship between x and y. The important part here is in the definition of “best”. For our purposes, “best” has a very special meaning: The sum of the squares of the differences between the best-fit line’s y value for a given x and it’s actual (empirical) y value. Let’s make this a little less abstract. What does this mean for some real data?

We start with the following data:

$(1,6) (2, 5) (3, 7) (4, 10)$

As you can see, this data is not perfectly linear. However, it does have a general upward trend (positive slope). So what we want to end up with is an equation in the following form:

$y = \beta_1+\beta_2x$

Here, $\beta_1$ is the y intercept and $\beta_2$ is the slope of the line. With this method, we are trying to minimize the square of the “errors” between the data points and the best-fit line. Calculating the error at a given x is very simple: just subtract the best-fit-line value for that x from the actual value. For example, for $(2,5)$, the error squared would be $(5-(\beta_1+\beta_2(2)))^2$. Since the errors are squared, all errors are positive numbers.

Now we simply apply this error formula to all of the points to find an expression that would represent, in a way, the “total error” of all the data points:


Obviously, for our line to fit the data most accurately, we want this sum to be as small as possible. Put another way, we want to find the $\beta_1$ and $\beta_2$ that together result in the least squares of the errors, thus the name of this method.

Note: This next section uses some calculus, and is not strictly necessary. However, if you, like me, enjoy understanding WHY, then this should satisfy your curiosity. I advise you to at least skim over the section, at least to identify when it ends and the practical method starts again.

Okay, so we want to minimize the aforementioned sum. How do we do that? Well, the minimum of the graph of this sum will be where the rate of change is zero, or where the derivative is equal to zero. Before we find this, lets write the sum of the errors squared as a general function for any set of $k$ points from $(x_1,y_1)$ to $(x_k,y_k)$, with slope $m$ and y-intercept $b$:

f(m,b) is the error of a line using m and b as slope and y-intercept


To find this position, we want to find the partial derivative (which is the derivative with respect to one variable where the other is held constant) of this sum with respect to both $b$ (what we originally called $\beta_1$) and $m$ (originally $\beta_2$) individually, and set them each equal to zero.

Note: How to find the basic derivative of a function is outside the scope of this tutorial; If you wan’t to understand this portion but don’t know how to derive, try Paul’s Notes, an awesome resource.

Before deriving, we must establish two rules:

  • To find the partial derivative of $f(x,y)=x^2+2x^3y+y^4+5$ with respect to $x$, pretend that y is a constant. So, the first derivative is $2x+6x^2y$. The $y$ in $2x^3y$ stays as-is, since it is a coefficient. The last term, 5, is a constant and thus goes away. Importantly, the second term, $y^4$, is also a constant, since $y$ is constant. It also goes away.

  • Derivatives involving summation: $\frac{d}{dy}\sum f(x)=\sum\frac{d}{dy}f(x)$

O.K, down to business. First, lets find $\frac{\partial}{\partial b}$

$\frac{\partial}{\partial b}\sum\limits_{i=1}^k[y_i-(b+mx_i)]^2$

$\frac{\partial}{\partial b}\sum\limits_{i=1}^k[y_i-b-mx_i]^2$

$\sum\limits_{i=1}^k\frac{\partial}{\partial b}[y_i-b-mx_i]^2$

Using Chain Rule, the derivative of the inside is -1 since $y_i$ and $m$ and $x_i$ are all constants:


Take the 2 outside (Distributive Property):


Now for $\frac{\partial}{\partial m}$. Same method.

$\frac{\partial}{\partial m}\sum\limits_{i=1}^k[y_i-(b+mx_i)]^2$

Some steps omitted – See the previous derivation.


In the previous step, we used the chain rule again. This time, $y_i$ and $b$ disappear as constants. For the final term, $-x_i$ is the coefficient and $m$ is the variable. when derived, the $m$ becomes a $1$ , leaving only the coefficient of $-x_i$ as the derivative of the inside.

So now we have two expressions, the partial derivatives that we just found, that we will set equal to zero to minimize the square of the errors:



Note that the twos at the beginning of each equation have been divided away.

Now we must simplify each of these equations to a more useful form:




Move part to the other side, and make $\sum\limits_{i=1}^k(b)$ to $kb$, since $b$ is constant, so summing it $k$ times is the same as multiplying by $k$. Also bring constants out of summations (Distributive property).


The other equation:





Now we can really get started! Also, all calculus is over as of here.


These two equations are really important; They even have a special name. These are the Normal Equations:



Now we have just a basic system of equations, with two variables and two equations. Lets solve! To get the result we want, we must solve in a somewhat specific way. We start with the second equation above, and divide both sides of the equation by $k$, the number of data points we have. A note: remember, the summation of some values, divided by the number of values, is the mean


Using the aforementioned rule, where $\overline{A}$ is the mean of all As.




Then we go back to the first equation, with our newly crafted $b$, thus:


Plug in what we got for $b$:


Distribute and simplify, making $\overline{A}$ back to $\sum{A_i}$, like so:


Now we just solve for m, the slope of our line, through basic algebra (Remember, that was what we were finding!):


Now, we can solve for $m$ with this, but it’s easier to use more simplified form that uses means. I’m not going to go to this form step by step, since it is not necessary, but just try going form the previous step. I’m removing the limits on the sums for ease of writing; You can pretend they are still there.


And that’s it! The above equation is pretty much our final destination. From here on out, we’re going to revert to using real data. In case you forgot (Oops, I did!), Here are the data points to which we want to fit a line:


We can find all the numbers necessary to get $m$:



and $m$ itself (I’ve calculated the sums):


Now we have a slope! Just plug it back in to $b=\overline{y}-m\overline{x}$:


And thats it! The equation of our final line:


The Actual, Straightforward Method (TL;DR)

In case you just want the method, here you go:

Given a set of n points:

  1. Find the averages of the x’s and the y’s individually, as $\overline{x}$ and $\overline{y}$

  2. Calculate the following sum to find the slope of the best-fit line, $m$:


  1. Plug it into this equation to find $b$, the y-intercept:


  1. $y=mx+b$ is the line of best-fit, and estimates a linear model for the given data.

  2. Done!

Thanks for reading! I hope this tutorial has helped you not only understand how to do linear regression, but also why this method works. Why is it useful to know why? Well, now you can apply this skill in other scenarios. For example, if you read carefully, you should have no problem fitting your data to higher-order polynomials. To get a quadratic curve as a result, simply minimize the error with three variables (2 coefficients and a constant) instead of two. Now, no data is too wild to tame!

blog comments powered by Disqus