This is something I’ve been meaning to write for ages. My formal training for most things is limited. Like a lot of folks, I’m an autodidact. This is good in that I’m always learning and always studying those things that I enjoy. At the same time, it means that I take in information in a highly unstructured, probably inefficient way. Linear regression is a fantastic example of this. I feel as though I’ve known about it- from adding trendlines to Excel charts, to pondering significance of multiple parameters- for decades. My formal understanding of it- the Zen of OLS, if you will- has never really been there.

It was some time ago, that I concluded that all of statistical inference amounts to solving an optimization problem. The target function is typically, though not always, likelihood. (This epiphany was likely driven by a seminar I attended which was given by Stuart Klugman. We spent a lot of time using Excel’s Solver add-in. It really helped to de-mystify loss distributions.) If you know- or can guess- what probability distribution is involved, you can optimize for parameters and start to make inferences. Linear regression is a natural extension of this. Simply make a linear transform and solve the optimization. The cool thing is that we’ve introduced a constraint on the function, which allows us to guess the parameters of the linear transform.

Let’s take a step back. First, we’ll generate a sample of normally distributed random variables and set up a likelihood function.

set.seed(1234) N = 100 E = rnorm(N, mean = 0, sd = 2) lnLike = function(x, mu, sigma) { n = length(x) lnLike = -n / 2 * log(2*pi) lnLike = lnLike - n/2 * log(sigma ^2) lnLike = lnLike - 1/(2*sigma^2)*sum((x - mu)^2) lnLike }

With this in hand, we can see what values for mu and sigma will lead to particular values of the likelihood function. (Log-likelihood, that is.)

# for a fixed sd, plot the likelihood function for various mu's testMu = -50:50 / 100 likelihood = numeric(length(testMu)) for (i in 1:length(likelihood)){ likelihood[i] = lnLike(e, testMu[i], 1) } plot(likelihood ~ testMu, pch = 19) abline(v = 0) abline(v = testMu[likelihood == max(likelihood)]) testMu[likelihood == max(likelihood)] # for a fixed mu, plot the likelihood function for various sigma's testSigma = 50:150 / 100 likelihood = numeric(length(testSigma)) for (i in 1:length(testSigma)){ likelihood[i] = lnLike(E, 0, testSigma[i]) } plot(likelihood ~ testSigma, pch = 19) abline(v = 1) abline(v = testSigma[likelihood == max(likelihood)]) testSigma[likelihood == max(likelihood)]

These plots look like the following:

Note the big asymetry for the sigma values. Also note that it’s fairly flat for a sizeable range around its true value. The one dimensional cases help us understand some of the marginal behavior, but the likelihood function exists in three dimensions. We can get a sense of that using the contour function. First, we’ll have to build up a set of values.

params = expand.grid(mu = testMu, sigma = testSigma) params$Likelihood = mapply(lnLike, params$mu, params$sigma, MoreArgs = list(x = e)) z = matrix(params$Likelihood, length(testMu), length(testSigma)) filled.contour(x=testMu, y=testSigma, z=z, color.palette = heat.colors, xlab = "mu", ylab = "sigma")

Which produces this cool plot:

We used the max function above to identify the best fits for mu and sigma when holding the other parameter fixed. For the two-dimensional case (and also to measure in between the test values) we need something more sophisticated. Here, we can use the optim function. As is usually the case, Markus Gesman over at Mages‘ blog got there first, with a nice overview of optim. We’ll add a wrapper to our likelihood function so that we can fit for both parameters.

lnLike2 = function(x, par) { mu = par[1] sigma = par[2] lnLike(x, mu, sigma) } optim(par = c(-1,4), fn = lnLike2, control = list(fnscale = -1), x = e)

Now the fun part. Let’s alter the E variable by first adding a constant and assuming that mu is zero. That assumption allows us to use the likelihood function unaltered, knowing that the mu parameter which is returned will correspond to the constant.

B0 = 5 Y = B0 + e optim(par = c(-1,4), fn = lnLike2, control = list(fnscale = -1), x = Y) fit = lm(Y ~ 1) fit$coefficients[[1]]

We’ve also fit using the lm function. Note that the parameters come out the same. Let’s try adding another term, this time with a slope parameter:

X = as.double(1:length(E)) B1 = 1.5 Y = B0 + B1 * X + E lnLike3 = function(par, Y, X) { B0 = par[1] B1 = par[2] sigma = par[3] x = Y - B0 - B1 * X mu = 0 lnLike(x, mu, sigma) } optim(par = c(3, 2, 4), fn = lnLike3, control = list(fnscale = -1), Y = Y, X = X) fit = lm(Y ~ 1 + X) fit$coefficients

Once again, we’ve got agreements with the coefficients, although the values aren’t exactly the same. This is either due to the optimization algorithm or something ridiculous I’ve done. Either way, comments are welcome.

As usual, this is probably old hat to most folks. But for an enthusiastic novice like me, this is loads of fun. It means that when using a linear model, one need not assume a normal distribution. It’s possible to test the sensitivity of results against any other distribution. More fundamentally, it takes me closer to that Zen-like state of understanding. It’s been decades, but I’m finally starting to grok linear regression.

Code to reproduce the examples may be fond at this Gist: AnotherOLS.R

Actually there is a nice theory which says that estimates gotten via any optimisation are close to the true parameters, provided optimisation function is reasonable (the correct wording is that the estimates centered around true parameter values are asymptotically normal with zero mean and covariance matrix which depends on data distribution and the optimisation function). This theory also says that maximum likelihood estimates are the best in a sense, that their covariance matrix is the smallest. You can find more about it here http://en.wikipedia.org/wiki/M-estimate.

FYI, there is a case mismatch between your initial assignment of the random data (“E = …”) and your first function call (“lnLike(e, …)”). That said, I enjoyed this post!

Thanks for that correction. I’ve altered the post to reflect it. After the post was originally published I added a link to a Gist with the full code, which I hope is error free. Here’s that Gist link again: AnotherOLS.R

Thank you for the nice post. I am formally untrained as well and searching bits of insight. It appears to me, that your basic idea, to find regression parameters by optimizing over given distributions, is what Bayesians do over and over. Maybe have a look at what people do with BUGS (WinBUGS, OpenBUGS, JAGS or even Stan). They might have some very interesting input und show, that you can go very far into regression with this approach.

For Gaussian linear models, maximum likelihood estimates have a close form: they are nothing but the least-squares estimates. So this optimization is useless…

The closed form solution for homoskedastic, Gaussian error terms involves taking the inverse of a matrix. When one has more than one slope coefficient, numerical methods are needed.

The example shown here is a toy model intended to illustrate a concept. I would never go through the hassle of setting up and optimizing a likelihood function when canned functions such as lm, glm, lmer and so forth are available. However, if I’d like to assume that error terms have a different distribution like a skewed normal, or Burr, or whatever, direct optimization of the likelihood function would allow me to do that.

More fundamentally, this is meant to illustrate the idea that this sort of inference is all based around optimization of some objective function. For the univariate linear model, I know the formulae for calibration of the parameters in my sleep. It’s nice to remember where they came from.

Let me correct you. If you look at lm function code, you would find that there is no need of taking the inverse. To avoid these problems QR matrix decomposition is used which is numerically stable and it reduced down to simple algebraic problem once again.

require(“ElemStatLearn”)

y=galaxy[,5]

QR=qr(as.matrix(galaxy[,3:4]))

Q=qr.Q(QR)

R=qr.R(QR)

backsolve(R,t(Q)%*%y)

QR=qr(cbind(rep(1,323),galaxy[,3:4]))

Q=qr.Q(QR)

R=qr.R(QR)

backsolve(R,t(Q)%*%y)

Anyways, your post is interesting and easy to read. Which is the reason why I read it all even I knew what is coming next.

If to speak about more complicated distributions, there are many nice ways of fitting them. I have used quantile regression, GMM, ML or in desperate cases I optimised squared distances between quantiles using deoptim package. Nice way of investigating is plotting CDF of your sample and your optimised parameters on the same chart.

I’ll take your word for it on whether or not an inversion is needed. I don’t often get my hands dirty with the details of the optimization. I thought QR was an iterative algorithm? Anyway, I’m probably wrong about that, too :-). It’ll make for interesting reading one of these days.

You are right. 🙂 It is iterative algorithm, but it does not mean that it isn’t stable. What means a unique solution exists.