### Ridge regression

In this post I want to write about the connection between Ridge regression and robust regression. Ridge regression (also known as Tikhonov regularization) is a form of regularization or shrinkage, where the parameters of a linear regression are *shrunk* towards 0.

There are several reasons why one might want to use methods like this. A very simple motivation is the case of *multicollinearity*. If the regression covariates suffer from multicollinearity, the moment matrix $X^\top X$ is (close to) singular and computing the least squares solution becomes difficult or impossible. An easy way to make $X^\top X$ invertible is to add a ($\lambda$-scaled) identity matrix and use the estimator $\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$. It turns out this is the solution to the optimization problem

$$
\min_\beta \, \|y - X\beta\|_2^2 + \lambda\,\|\beta\|_2^2,
$$

which corresponds to the usual least squares objective plus a penalty term on the size of the parameters. Another interpretation can be found by considering a Bayesian linear model where the parameters are endowed with a Gaussian prior distribution. These models are well known. However, a less well-known fact is the connection to robust regression.
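As a quick illustration, here is a minimal NumPy sketch of the closed-form Ridge estimator on collinear covariates (the toy data and the choice $\lambda = 1$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix with strong multicollinearity: the second
# column is almost an exact copy of the first one.
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 1.0  # regularization strength (lambda)
p = X.shape[1]

# Ridge estimator: (X^T X + lambda * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ordinary least squares for comparison; the near-singular moment
# matrix makes this numerically unstable.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("ridge:", beta_ridge)
print("ols:  ", beta_ols)
```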

### Robust regression

Now let us consider the case where the observed covariates are random. Let us denote the random covariate matrix as $X = \bar X + U$, where we now write $\bar X$ for the mean and $U$ for a centered random matrix such that $\mathbb{E}[U] = 0$ and $\mathbb{E}[U^\top U] = P$. One could still perform ordinary least squares, i.e. minimize $\|y - \bar X\beta\|_2^2$; however, a more robust choice would be to take the randomness into account and minimize the *expected* squared norm

$$
\mathbb{E}\,\|y - X\beta\|_2^2 = \mathbb{E}\,\|y - (\bar X + U)\beta\|_2^2.
$$

For the Euclidean norm as above, the expectation can be rewritten as

$$
\mathbb{E}\,\|y - (\bar X + U)\beta\|_2^2 = \|y - \bar X\beta\|_2^2 + \beta^\top \mathbb{E}[U^\top U]\,\beta = \|y - \bar X\beta\|_2^2 + \beta^\top P\,\beta,
$$

where the cross term vanishes because $U$ is centered, i.e. $\mathbb{E}[U] = 0$.

This is the same optimization objective as Ridge regression! For example, if the variance of the error matrix is *isotropic*, i.e. $P = \lambda I$ for some value $\lambda > 0$, we get $\beta^\top P\,\beta = \lambda\,\|\beta\|_2^2$. As before, the solution is

$$
\hat\beta_\lambda = (\bar X^\top \bar X + \lambda I)^{-1} \bar X^\top y.
$$

Hence, Ridge regression has another interesting interpretation as a robust optimization problem (see Boyd & Vandenberghe, 2004, Section 6.4.1), which I find useful for understanding why Ridge regression approaches can be advantageous, especially when the number of observed data points is low. There is another interesting detail: it is a well-known fact that Ridge regression corresponds to a Bayesian model where the *parameters* are random and assumed to have an a priori Gaussian distribution, whereas the matrix of covariates is assumed fixed.
Here, the parameters are fixed and the *covariates* are random.
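To make the equivalence concrete, here is a minimal Monte Carlo sketch (the data and the isotropic noise model are illustrative assumptions) that checks the identity $\mathbb{E}\,\|y - X\beta\|_2^2 = \|y - \bar X\beta\|_2^2 + \lambda\,\|\beta\|_2^2$ when $\mathbb{E}[U^\top U] = \lambda I$:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 200, 3
X_bar = rng.normal(size=(n, p))  # mean of the random covariate matrix
beta = rng.normal(size=p)        # a fixed parameter vector
y = rng.normal(size=n)           # fixed observations

lam = 0.5
# iid N(0, lam/n) entries give E[U^T U] = n * (lam/n) * I = lam * I
sigma = np.sqrt(lam / n)

# Monte Carlo estimate of E || y - (X_bar + U) beta ||^2
n_samples = 20_000
vals = np.empty(n_samples)
for i in range(n_samples):
    U = sigma * rng.normal(size=(n, p))
    vals[i] = np.sum((y - (X_bar + U) @ beta) ** 2)

expected_loss = vals.mean()
ridge_objective = np.sum((y - X_bar @ beta) ** 2) + lam * np.sum(beta ** 2)

print("Monte Carlo expected loss:", expected_loss)
print("Ridge objective:          ", ridge_objective)
```

With entries of $U$ drawn iid from $\mathcal{N}(0, \lambda/n)$ we have $\mathbb{E}[U^\top U] = \lambda I$, so the two printed values should agree up to Monte Carlo error.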

#### References

Boyd, S., & Vandenberghe, L. (2004). *Convex Optimization*, Section 6.4.1. Cambridge University Press.