Why squared error?

Someone recently asked on the statistics Stack Exchange why the squared error is used in statistics. This is something I’d been wondering about myself recently, so I decided to take a crack at answering it. The post below is adapted from that answer.

Why squared error?

It’s true that one could choose to use, say, the absolute error instead of the squared error. In fact, the absolute error is often closer to what you “care about” when making predictions from your model. For instance, if you buy a stock expecting its future price to be $P_{predicted}$ and its future price is $P_{actual}$ instead, you lose money proportional to $(P_{predicted} - P_{actual})$, not its square! The same is true in many other contexts.

However, the squared error has much nicer mathematical properties. For instance:

If $X$ is a random variable, then the constant estimate of $X$ that minimizes the expected squared error is the mean, $E(X)$. On the other hand, the estimate that minimizes the expected absolute error is the median, $m(X)$. The mean has much nicer properties than the median; for instance, $E(X + Y) = E(X) + E(Y)$, but there is no general expression for $m(X + Y)$.
If you have a vector $\vec X = (X_1, X_2)$ estimated by $\vec x = (x_1, x_2)$, then for the squared error it doesn’t matter whether you consider the components separately or together: $\left|\left|\vec X - \vec x\right|\right|^2 = (X_1 - x_1)^2 + (X_2 - x_2)^2$, so the squared error of the components just adds. You can’t do that with absolute error. This means that the squared error is independent of re-parameterizations: for instance, if you define $\vec Y = (X_1 + X_2, X_1 - X_2)$, then the minimum-squared-deviance estimators for $\vec Y$ and $\vec X$ are the same, but the minimum-absolute-deviance estimators are not.
For independent random variables, variances (expected squared errors) add: $Var(X + Y) = Var(X) + Var(Y)$ . The same is not true for expected absolute error.
For a sample from a multivariate Gaussian distribution (where probability density is exponential in the squared distance from the mean), all of its coordinates are Gaussian, no matter what coordinate system you use. For a multivariate Laplace distribution (like a Gaussian but with absolute, not squared, distance), this isn’t true.
The squared error of a probabilistic classifier (known as the Brier score) is a proper scoring rule. If you had an oracle telling you the actual probability of each class for each item, and you were being scored based on your Brier score, your best bet would be to predict what the oracle told you for each class. This is not true for absolute error. (For instance, if the oracle tells you that $P(Y=1) = 0.9$, then predicting that $P(Y=1) = 0.9$ yields an expected absolute error of $0.9\cdot 0.1 + 0.1 \cdot 0.9 = 0.18$; you should instead predict that $P(Y=1) = 1$, for an expected absolute error of $0.9\cdot 0 + 0.1 \cdot 1 = 0.1$.)
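
The scoring-rule arithmetic in the last item can be checked numerically. The following is a minimal sketch, not a general-purpose implementation: the function names are made up for illustration, the outcome is assumed to be binary, and the best prediction is found by a coarse grid search over $[0, 1]$.

```python
def expected_squared_error(p_true: float, p_pred: float) -> float:
    """Expected Brier score for a binary outcome with P(Y=1) = p_true."""
    return p_true * (1 - p_pred) ** 2 + (1 - p_true) * (0 - p_pred) ** 2


def expected_absolute_error(p_true: float, p_pred: float) -> float:
    """Expected absolute error for the same binary outcome."""
    return p_true * abs(1 - p_pred) + (1 - p_true) * abs(0 - p_pred)


def best_prediction(loss, p_true: float, steps: int = 100) -> float:
    """Grid-search the prediction in [0, 1] that minimizes the expected loss."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda p_pred: loss(p_true, p_pred))


p_true = 0.9  # the oracle's probability from the example above
print(best_prediction(expected_squared_error, p_true))   # the honest forecast wins
print(best_prediction(expected_absolute_error, p_true))  # the extreme forecast wins
```

Under squared error the minimizer sits at the oracle's probability, while under absolute error it is pushed to the nearest extreme.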

I would say that these nice properties are merely “convenient”: we might choose to use the absolute error instead if it didn’t pose technical issues when solving problems. But some mathematical coincidences involving the squared error are more important. They don’t just pose technical problem-solving issues; rather, they give us intrinsic reasons why minimizing the squared error might be a good idea:

When fitting a Gaussian distribution to a set of data, the maximum-likelihood fit is that which minimizes the squared error, not the absolute error.
When doing dimensionality reduction, finding the basis that minimizes the squared reconstruction error yields principal component analysis, which is nice to compute, coordinate-independent, and has a natural interpretation for multivariate Gaussian distributions (finding the axes of the ellipse that the distribution makes). There’s a variant called “robust PCA” that is sometimes applied to minimizing absolute reconstruction error, but it seems to be less well-studied and harder to understand and compute.
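
The Gaussian maximum-likelihood point can be illustrated with a small sketch. The assumptions here are mine: the sample is made up, the standard deviation is held fixed at 1, and only the location parameter is fitted, by grid search rather than in closed form. Because the Gaussian log-likelihood is a constant minus the squared error divided by $2\sigma^2$, the two criteria should pick the same location.

```python
import math


def gaussian_log_likelihood(data, mu, sigma=1.0):
    """Log-likelihood of i.i.d. Gaussian data with mean mu and std sigma."""
    n = len(data)
    sq_err = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sq_err / (2 * sigma ** 2)


def squared_error(data, mu):
    """Total squared error of the constant estimate mu."""
    return sum((x - mu) ** 2 for x in data)


data = [1.0, 2.0, 2.5, 4.0, 10.0]          # made-up sample
grid = [i / 100 for i in range(0, 1201)]   # candidate means 0.00 .. 12.00

mu_mle = max(grid, key=lambda mu: gaussian_log_likelihood(data, mu))
mu_lsq = min(grid, key=lambda mu: squared_error(data, mu))

print(mu_mle, mu_lsq, sum(data) / len(data))  # all three agree
```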

Looking deeper

One might well ask whether there is some deep mathematical truth underlying the many different conveniences of the squared error. As far as I know, there are a few (which are related in some sense, but not, I would say, the same):

Differentiability

The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization. To optimize the squared error, you can just set its derivative equal to 0 and solve; optimizing the absolute error often requires more complex techniques.
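
As a small worked example of “set the derivative to 0 and solve,” take the simplest case of fitting a single constant $c$ to a sample $x_1, \dots, x_n$ (this setup and notation are my own, chosen for illustration):

```latex
\begin{align}
\frac{d}{dc} \sum_{i=1}^n (x_i - c)^2 &= -2 \sum_{i=1}^n (x_i - c) = 0
  \quad\Longrightarrow\quad c = \frac{1}{n} \sum_{i=1}^n x_i, \\
\frac{d}{dc} \sum_{i=1}^n |x_i - c| &= -\sum_{i=1}^n \operatorname{sign}(x_i - c)
  \quad \text{(undefined whenever } c = x_i \text{ for some } i\text{)}.
\end{align}
```

The first line gives the mean in closed form. The second has no equation to solve: the derivative is piecewise constant, and the sum is minimized where the points above and below $c$ balance out, which is a median.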

Inner products

The squared error is induced by an inner product on the underlying space. An inner product is basically a way of “projecting vector $x$ along vector $y$ ,” or figuring out “how much does $x$ point in the same direction as $y$ .” In finite dimensions this is the standard (Euclidean) inner product $\langle a, b\rangle = \sum_i a_ib_i$ . Inner products are what allow us to think geometrically about a space, because they give a notion of:

a right angle ( $x$ and $y$ are at right angles if $\langle x, y\rangle = 0$ );
and a length (the length of $x$ is $\left|\left|x\right|\right| = \sqrt{\langle x, x\rangle}$ ).

By “the squared error is induced by the Euclidean inner product” I mean that the squared error between $x$ and $y$ is $\left|\left|x-y\right|\right|^2$ , the (squared) Euclidean distance between them. In fact the Euclidean inner product is in some sense the “only possible” axis-independent inner product in a finite-dimensional vector space, which means that the squared error has uniquely nice geometric properties.

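For random variables, in fact, you can define a similar inner product: $\langle X, Y\rangle = E(XY)$. This means that we can think of a “geometry” of random variables, in which two variables make a “right angle” if $E(XY) = 0$. Not coincidentally, the squared “length” of $X$ is $E(X^2)$, which is its variance when $X$ has mean zero. In fact, in this framework, “independent variances add” is just a consequence of the Pythagorean Theorem. Writing $\mu_X = E(X)$ and $\mu_Y = E(Y)$, independence makes $X - \mu_X$ and $Y - \mu_Y$ orthogonal, so:

$Var(X + Y) = \left|\left|(X - \mu_X) + (Y - \mu_Y)\right|\right|^2 = \left|\left|X - \mu_X\right|\right|^2 + \left|\left|Y - \mu_Y\right|\right|^2 = Var(X) + Var(Y)$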
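
The additivity of variances for independent variables can be verified exactly on a toy case. This is a sketch under my own assumptions: two fair six-sided dice stand in for $X$ and $Y$, distributions are plain dictionaries mapping values to probabilities, and the helper names are invented for the example. It also shows that the expected absolute deviation does not add in the same way.

```python
from fractions import Fraction
from itertools import product


def mean(dist):
    """Expected value of a finite distribution {value: probability}."""
    return sum(p * v for v, p in dist.items())


def variance(dist):
    """Expected squared deviation from the mean."""
    m = mean(dist)
    return sum(p * (v - m) ** 2 for v, p in dist.items())


def mean_abs_dev(dist):
    """Expected absolute deviation from the mean."""
    m = mean(dist)
    return sum(p * abs(v - m) for v, p in dist.items())


def sum_of_independent(dist_x, dist_y):
    """Distribution of X + Y when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out


die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = sum_of_independent(die, die)

print(variance(die), variance(two_dice))          # 35/12 and 35/6: variances add
print(mean_abs_dev(die), mean_abs_dev(two_dice))  # 3/2 and 35/18: these do not
```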

Beyond squared error

Given these nice mathematical properties, would we ever not want to use squared error? Well, as I mentioned at the very beginning, sometimes absolute error is closer to what we “care about” in practice. For instance, if your data has tails that are fatter than Gaussian, then minimizing the squared error can cause your model to spend too much effort getting close to outliers, because it “cares too much” about the one large error component on the outlier relative to the many moderate errors on the rest of the data.

The absolute error is less sensitive to such outliers. (For instance, if you observe an outlier in your sample, it changes the squared-error-minimizing mean proportionally to the magnitude of the outlier, but hardly changes the absolute-error-minimizing median at all!) And although the absolute error doesn’t enjoy the same nice mathematical properties as the squared error, that just means absolute-error problems are harder to solve, not that they’re objectively worse in some sense. The upshot is that as computational methods have advanced, we’ve become able to solve absolute-error problems numerically, leading to the rise of the subfield of robust statistical methods.
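
The outlier remark is easy to see on a made-up sample. The numbers below are purely illustrative, and `summarize` is a hypothetical helper; the sketch just appends one increasingly extreme value and reports the mean and median each time.

```python
import statistics

sample = [9.8, 10.1, 10.0, 9.9, 10.2]  # made-up, well-behaved data


def summarize(sample, outlier):
    """Return (mean, median) of the sample with one outlier appended."""
    contaminated = sample + [outlier]
    return statistics.mean(contaminated), statistics.median(contaminated)


for outlier in (20.0, 200.0, 2000.0):
    mean_value, median_value = summarize(sample, outlier)
    print(f"outlier={outlier:7.1f}  mean={mean_value:8.2f}  median={median_value:.2f}")
```

The mean moves by one sixth of the outlier's size each time, while the median stays put no matter how extreme the outlier gets.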

In fact, there’s a fairly nice correspondence between some squared-error and absolute-error methods:

Squared error	Absolute error
Mean	Median
Variance	Expected absolute deviation
Gaussian distribution	Laplace distribution
Linear regression	Quantile regression
PCA	Robust PCA
Ridge regression	LASSO

As we get better at modern numerical methods, no doubt we’ll find other useful absolute-error-based techniques, and the gap between squared-error and absolute-error methods will narrow. But because of the connection between the squared error and the Gaussian distribution, I don’t think it will ever go away entirely.

Comments

Jeff Wu

December 2014

Can you explain the second bullet again? Neither part of it seems true to me (and the claims seem somewhat unrelated)

Ben

Can you comment on what specific statements in the first part don’t seem true? I think “squared error of a vector is sum of squared errors of coordinates” is pretty uncontroversial.

You’re right that I didn’t explain the second part very clearly, and I didn’t state that it’s only true for re-parameterizations that preserve the norm (up to a scalar). The argument (and why they’re related) is as follows:

I just showed that the squared error of $\vec x$ to $\vec X$ is the sum of the coordinate-wise squared errors.
But this argument didn’t rely on the coordinate system that we used.
Therefore, if you change coordinate systems (a.k.a. re-parameterize your problem), as long as your change preserves the norm, your squared error stays the same (so the estimator that minimizes it stays the same, suitably re-parameterized).

If that clears things up, I’ll edit this into the post.

Sorry for being so brief in my comment in the morning. The part I was objecting to in the first part is “You can’t do that with absolute error.” It seems like absolute error is a sum of absolute error of coordinates? But looking again, I’m not sure that I had in mind the same notion as what you had in mind.

I see - FWIW I do think the post is slightly misleading, in that it becomes untrue if you use the transformation Y1 = X1 + X2, Y2 = X1 - 2X2. At that point, it seems like the parameterizations you’re allowing are basically defined to be the ones that work. (But, nothing except swapping coordinates and negating works for absolute error, so it does still have a leg up!)

I guess I was equivocating between two senses of absolute error. Absolute error in the sense of “non-squared L2 distance between points” does not work that way, but is ok with orthogonal re-parameterizations. Absolute error in the sense of “L1 distance between points” works that way, but is not ok with any re-parameterizations (except for signed permutations). I’ll edit the bullet point when I think about what I actually want to say. Thanks for catching it!

Matt

April 2015

I would add that we have the nice decomposition

Var[Y] = Var(E[Y|X]) + E[Var[Y|X]]

Which is a combination of 1 and 3.

Bayesian interpretation of regressions with gaussian prior

and E[E[Y|X]] = E[Y]

@Matt: What do you mean by “Bayesian interpretation of regressions with gaussian prior”? Do you mean interpreting Tikhonov regularization as placing a Gaussian prior on the coefficients? And if so, is there not a similar interpretation of penalized quantile regression?

John Mount

May 2015

Nice article. A point I emphasize is minimizing square-error (while not obviously natural) gets expected values right. So it tends to point you towards unbiased estimators. Some of my notes on this: http://www.win-vector.com/blog/2014/01/use-standard-deviation-not-mad-about-mad/

@John Mount: That’s true, but you could equally well say that minimizing absolute error tends to point you towards median-unbiased estimators!

In fact, I would say that unbiasedness could just as easily be motivated by the niceness of squared error as the other way around. Unbiasedness is defined in terms of expected value, but the reason expected value is a “special” statistic is that it minimizes squared error.

Kevin

Great post!

Leon

“I would say that unbiasedness could just as easily be motivated by the niceness of squared error as the other way around […] the reason expected value is a ‘special’ statistic is that it minimizes squared error”

I think averages are a much more primitive concept than “squared error”. We take averages of things all the time in pre-probability maths. Averages correspond to evenly distributing the pie. Averages play nice with affine transformations. (Higher-dimensional) averages correspond to centre of mass.

So I think it makes most sense to go from averages to squared error, normality, etc. (as I think Gauss did back in the day) rather than the other way around.

Anonymous

September 2016

Finally understand inner products, woot.

Daniel Vainsencher

November 2016

Ridge Regression:Lasso is about the regularization type, not about the loss, so it disagrees with everything else in your post. There is no reason not to use absolute errors (or Huber, or epsilon insensitive, or…) loss with either $l_1$ or $l_2$ or other regularization types.

Thanks for a good post on a point I care about, still trying to understand why I care about expected values (hence squared errors), and how I might convince myself not to.

Hand-waving follows:

One reason you might care about expected values is the Von Neumann-Morgenstern theorem, which roughly states that any decision-maker, whose decisions satisfy certain consistency properties, has a utility function for which they are trying to maximize the expected value.

If your utility function is smooth, then it’s locally linear in anything you care about, and so at least locally, you end up caring about the expected value of those things as well!

Roland

Nice article. You’re mentioning the Gaussian distribution already, but I would also emphasize that the squared error occurs as a natural parameter of the Gaussian (as variance / standard deviation). And because the Gaussian arises as the large-sample limit of means (the central limit theorem), the squared error becomes a central property in statistical theory.

Martin Roberts

This is a great post that tries to find a resolution to this commonly posed question, in a variety of different ways.

I think that one of the reasons why we naturally think that the squared error is more mathematically amenable is because our mathematics education has been traditionally and primarily driven with calculus at the pinnacle. This has been because a career in science or engineering (which fundamentally depends on calculus) has been typically considered more favorably than a career in statistics, and thus discrete maths has traditionally been considered a poor cousin of calculus.

This, in turn, has meant that in many ways the absolute value function has been a poor cousin to the quadratic function. However, with the rise of computing / data science, and the ubiquitous use of computers that are able to handle absolute error as easily as mathematicians handle squared error, I believe we will see a rise in the popularity of the absolute value function as a tool, of discrete maths as a branch of mathematics as important as calculus, and of stats/maths as a career as important as engineering.

To cite two quick examples that come to mind: in the deep learning space, most neural networks were originally based on the classic differentiable sigmoid functions such as the logistic function or hyperbolic tangent, whereas now the non-differentiable rectified linear units (ReLUs) are becoming the standard and default functions.

Similarly, prior to deep learning becoming all the rage, the data science geeks were discovering that many classification and regression algorithms (such as LASSO) that were originally formulated for the quadratic error have much nicer and more intuitive results (e.g. variable selection) if recast in terms of absolute error. And secondly,

That is, I do not think that the value of differentiability and of mathematical formulations that admit closed-form solutions (including the quadratic loss) will decrease per se, but I do believe that we are only recently starting to discover (rediscover?) the potential of the absolute error, especially in discrete maths and computational mathematics.

Nic Szerman

January 2023

Spheres optimize volume to surface ratio. Spheres also have a shape defined by the square distance from the origin (along the space’s basis). If sphere volume ~ prediction error, minimizing square distance is akin to reducing sphere radius. Therefore, using square error sort of optimally improves accuracy.