I found Simon Raper’s article, “The shock of the mean”, on the history of the arithmetic mean, to be quite appealing and informative. I agree that the idea of combining observations was frowned upon, especially before the middle of the 1700s. However, astronomical observations often required combination, which led to consideration of measurement errors and introduced key elements leading to modern statistical science. What I missed in the article is what I consider to be a crucial element: the connection with the development of objective statistical methods for choosing among statistical procedures.
Clearly, the use of the mean would be neither shocking nor even problematic if there were no alternatives. The introduction of criteria for comparing and analysing statistical procedures is critical, both for the ascendancy of the mean and for the establishment of a scholarly field of statistics that we may call “statistical science”. I will argue here that the primacy of the mean owes much more to the introduction of least squares, the recognition of the normal approximation for means, and to the reputation of Gauss than to objective statistical principles. Furthermore, while key ingredients pre-dated Gauss, the systematic development of expected loss (e.g., mean squared error and error probabilities) and asymptotic optimality had to wait for the 20th century. It is this recent development, especially following Fisher, Neyman, Wald and others, that constitutes modern statistical science.
The first ingredient to be introduced appears to be Galileo’s use of an objective criterion for choosing among alternative estimators. The appearance of a nova in 1572 posed a serious challenge to the traditional Aristotelian dictum of the immutability of stars. In criticism of the new science, Scipione Chiaramonti presented data consisting of pairs of parallax observations which could be combined to estimate the distance of the nova. He claimed that the nova was “sub-lunar” and thus not a stellar object subject to immutability. Galileo adopted a different set of measurements for which the angular differences had a much smaller sum of absolute deviations (residuals). These measurements gave a distance estimate far beyond the moon. Galileo claimed that, unlike Chiaramonti’s estimate, his estimate had a negligible total deviation compared to what measuring error might suggest, and thus his estimate should be preferred.1,2
Another fundamental ingredient of statistical analysis occurs in the work of Boscovich.3,4 While the mean was indeed becoming standard (as noted by Raper), the notion of keeping absolute deviations small continued to be considered the appropriate criterion. However, Boscovich was faced with an inherently bivariate problem: scientists thought that the earth’s rotation should cause a bulge at the equator, making the earth elliptic (rather than spherical). Boscovich had five measurements of the length of 1 degree of latitude, each taken at a different latitude. If the earth were elliptical, the length (y) would be (approximately) proportional to the squared sine of the latitude (x), leading to a linear model with unknown intercept and slope (zero for sphericity). In a first attempt, published in joint work with Christopher Maire in 1755, it was realised that each pair of observations determined a linear fit. Taking the mean of the 10 slopes would seem to be appropriate, but Boscovich and Maire noticed that one of the pairs gave a negative slope, and under ellipticity the true slope would need to be non-negative. Recognising that simply dropping this value would bias the result, the authors suggested a trimmed mean of the eight middle slopes. Note that the desire to discard the negative slope estimate would be recognised today as a confusion of estimate and parameter, and that modern robustness theory would suggest the value of the trimmed mean (even in the presence of dependency5).
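The pairwise-slope idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five data points are hypothetical (they are not Boscovich's measurements), and the function names are invented for this sketch. Five points give 10 pairwise slopes; dropping the smallest and largest leaves the eight middle slopes.

```python
from itertools import combinations

def pairwise_slopes(x, y):
    """Slope of the line through each pair of points (distinct x values assumed)."""
    return [(y[j] - y[i]) / (x[j] - x[i])
            for i, j in combinations(range(len(x)), 2)]

def trimmed_mean(values, trim_each_side=1):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[trim_each_side:len(ordered) - trim_each_side]
    return sum(kept) / len(kept)

# Hypothetical data for illustration only (NOT Boscovich's measurements).
x = [0.0, 0.3, 0.5, 0.7, 1.0]        # squared sine of latitude
y = [10.0, 10.5, 10.2, 11.3, 12.0]   # arc length, arbitrary units

slopes = pairwise_slopes(x, y)        # 10 slopes, one of them negative
estimate = trimmed_mean(slopes)       # mean of the eight middle slopes
```

With these made-up numbers exactly one pairwise slope is negative, and trimming removes it together with the largest slope, so nothing is dropped from one side only.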
This approach clearly left Boscovich unsatisfied. Five years later he published an alternative solution. He introduced two criteria: (1) the sum of residuals should be zero (that is, the line should fit the data means (x̄, ȳ)) and (2) among such estimators, the sum of absolute residuals should be minimised. The first principle was a requirement of mean unbiasedness, and so the two criteria represented a mixture of conditions that would be considered rather odd today. Boscovich developed a graphical procedure to find this estimator, and for the first time introduced the fundamental idea of finding statistical procedures that were optimal for a specified loss function.
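In modern terms, the two criteria can be computed with a weighted median. The sketch below is not Boscovich's graphical procedure; it is an algebraic equivalent under the assumption that the line is forced through (x̄, ȳ), so that only the slope remains to be chosen. The function name is hypothetical.

```python
def boscovich_line(x, y):
    """Line through (x_bar, y_bar) minimising the sum of absolute residuals.

    With u_i = x_i - x_bar and v_i = y_i - y_bar, the objective is
    sum |u_i| * |v_i / u_i - b|, which is minimised by a weighted
    median of the ratios v_i / u_i with weights |u_i|.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Points with x_i == x_bar do not depend on the slope and are skipped.
    candidates = sorted(
        ((yi - y_bar) / (xi - x_bar), abs(xi - x_bar))
        for xi, yi in zip(x, y) if xi != x_bar
    )
    if not candidates:
        raise ValueError("need at least two distinct x values")
    half = sum(w for _, w in candidates) / 2.0
    running = 0.0
    for slope, weight in candidates:
        running += weight
        if running >= half:   # first slope reaching half the total weight
            break
    intercept = y_bar - slope * x_bar
    return intercept, slope
```

Because the fitted line passes through the point of means, the residuals sum to zero by construction, which is criterion (1); the weighted median then takes care of criterion (2).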
The primacy of the sum of absolute deviations as a criterion continued to hold for some time. A generation later, Laplace made two further crucial advances: he showed that the median indeed minimised the sum of absolute deviations, and he found the distribution of errors (double exponential) for which the median was “most probable” (i.e., maximum likelihood). This advance might have led to the main ingredients of statistical science (expected loss and asymptotic analysis), but in fact there was little systematic work in this direction for over a century. Specifically, there was no systematic attempt to show that the “most probable” value would be a good estimator in any objective sense. While suggesting the use of the “least absolute deviations” (LAD) estimator, Laplace offered no way of finding it in the multi-parameter problems that were now arising in astronomy, especially for the determination of orbits of celestial objects.
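Both of Laplace's results can be sketched in modern notation. The symbols here (location θ, scale b, observations x_1, ..., x_n) are assumptions of this sketch, not Laplace's own notation.

```latex
% Double exponential error density with location \theta and scale b > 0
f(x \mid \theta) = \frac{1}{2b} \exp\!\left( -\frac{|x - \theta|}{b} \right)

% Log-likelihood of n independent observations
\ell(\theta) = -n \log(2b) - \frac{1}{b} \sum_{i=1}^{n} |x_i - \theta|

% Maximising \ell(\theta) is therefore the same as minimising
S(\theta) = \sum_{i=1}^{n} |x_i - \theta|

% For \theta different from every x_i,
S'(\theta) = \#\{ i : x_i < \theta \} - \#\{ i : x_i > \theta \}
```

The derivative is negative while fewer than half of the observations lie below θ and positive once more than half do, so S(θ) is smallest at a median, which is therefore also the “most probable” value under this error density.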
In the late 18th century, Legendre and Gauss introduced the criterion of the sum of squares of the deviations, and applied the method of least squares specifically for fitting astronomical observations. It seems clear that the main reason for considering least squares is that the estimates for multiple regression models could be found by known (matrix inversion) methods. The three-part argument of Gauss discussed by Raper appears to be more concerned with justifying the method of least squares for multiple regression estimation. Gauss never presented statistical reasons (such as the mean squared error) for preferring least squares to LAD (or, equivalently, the mean to the median). In 1809, he presented an extensive development of the application of least squares to astronomical data, but only toward the end of the book did he raise the issue of alternative approaches.6 Observing that Laplace had suggested an alternative method based on least absolute deviations, he (correctly) noted (1) that such an estimator in linear regression models with p slope coefficients (and one intercept) would fit p+1 observations exactly (that is, have p+1 zero residuals); and (2) that the observations not fit exactly would enter the computation only through the signs of the residuals. Gauss apparently considered these to be good reasons to prefer least squares, perhaps because these properties would be intuitively unreasonable for model errors. Over 10 years later, and well after Laplace used the Central Limit Theorem to justify the normal assumption for the error distribution, Gauss still based his preference on subjective reasons and not on more objective statistical bases. Gauss wrote in 1823:
“Laplace has also considered the problem in a similar manner, but he adopted the absolute value of the error as his measure of loss. Now if I am not mistaken this convention is no less arbitrary than mine. Should an error of double size be considered as tolerable as a single error twice repeated or worse? Is it better to assign only twice as much influence to a double error or more? The answers are not self evident, and the problem cannot be resolved by mathematical proofs, but only by an arbitrary decision. Moreover, it cannot be denied that Laplace’s convention violates continuity and hence resists analytic treatment, while the results that my convention leads to are distinguished by their wonderful simplicity and generality.”7
In response to the argument of Gauss, if measurements are meaningful, the absolute error (measured in the same units) would certainly be more meaningful and natural than the squared error. In fact, estimators can now be justified by formal theoretical statistics, with robustness theory explicitly leading to the LAD estimator and other alternatives. Note that the objections in the last sentence of the quotation have been overcome. The “discontinuity” of LAD estimators occurs only when they are not unique; as set-valued functions, they are continuous in appropriate metrics on sets. Linear programming provides simple computation, and in fact, in certain circumstances, LAD regression estimators are computationally faster (in probability) than least squares estimators.8 Finally, it is intriguing to note that Gauss’s characterisation of the properties of LAD estimators would allow computation in many of the moderately small data sets he considered, where enumerating all of the (p+1) dimensional exact fits would be feasible.
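The enumeration idea can be illustrated for the simplest case, p = 1 (one slope and one intercept), where an LAD line can be taken to pass through two of the observations. The following is a minimal sketch under that assumption; the function name and the data in the usage note are hypothetical, and this brute-force search is neither Gauss's procedure nor the linear programming approach.

```python
from itertools import combinations

def lad_line_by_enumeration(x, y):
    """Search all lines through two observations for the smallest
    sum of absolute residuals. Returns (loss, intercept, slope)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a regression fit
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yk - intercept - slope * xk)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best
```

For example, with x = [0, 1, 2, 3, 4] and y = [1, 3, 5, 7, 30], the search returns the line with intercept 1 and slope 2 through the first four points; the outlying fifth observation affects the criterion only through its residual of 21. With n observations there are n(n-1)/2 candidate lines, which is manageable by hand only for small n.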
Indeed, as described by Raper, the mean became the standard location estimator in the years following Gauss. However, despite the existence of the principal ingredients (loss functions and distributional models), it seems clear that the primacy of the mean (and the normal model) was based primarily on convenience and subjective preferences, and not on a firmer (and more modern) theory of statistical science. In fact, the question of whether and when one statistical procedure may be better than another was rarely broached, and systematic development of conditions for such comparisons to hold had to await the 20th century. It was not until the last half of the 20th century that robustness and regression quantiles (and modern LAD computation) brought the median and LAD methods back into their own.
As a final remark, the earliest comparison of estimators I am aware of occurs in the rabbinic writings, specifically, in the Mishnah (3rd century; for one translation, see Neusner, 1988, Tractate: Kelim9). Earlier Rabbis had specified an “egg’s bulk” concerning food purity. The later Rabbis felt that such a measure required specific definition appropriate for use in the home. The majority opinion was to specify the “middle-sized” egg (“neither small nor big”), but one Rabbi suggested placing the smallest and largest eggs in water and dividing the amount of water displaced into two parts. The majority opinion prevailed on the grounds: “who knows which is the smallest and which is the largest; it all depends on the observer’s eyesight”. Perhaps oddly, the objection seems to overlook that the smallest and largest eggs are more likely to be identifiable by eye than the middle-sized egg.
About the author
Stephen Portnoy is professor emeritus in the Department of Statistics, University of Illinois at Urbana-Champaign.
References
1. Galileo, G. (1967). Dialogue Concerning the Two Chief World Systems, trans. Stillman Drake. Second revised edition. University of California Press, Berkeley.
2. Meli, D. B. (2004). The role of numerical tables in Galileo and Mersenne. Perspectives on Science, 12(2), 164-190.
3. Stigler, S. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, Mass.
4. Farebrother, R. W. (1999). Fitting Linear Relationships: A History of the Calculus of Observations 1750-1900. Springer, New York.
5. Portnoy, S. (1977). Robust estimation in dependent situations. Ann. Statist., 5, 22-43.
6. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Perthes et Besser, Hamburg. Translated, 1857, as Theory of Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, trans. C. H. Davis. Little, Brown, Boston. Reprinted, 1963, Dover, New York.
7. Gauss, C. F. (1823). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae. Dieterich, Göttingen. Translated as Theory of the Combination of Observations Least Subject to Errors, trans. G. W. Stewart. SIAM, 1995.
8. Koenker, R. and Portnoy, S. (1997). The Gaussian hare and the Laplacian tortoise: computability of squared-error vs. absolute-error estimators (with discussion). Statistical Science, 12, 279-300.
9. Neusner, J. (1988). The Mishnah: A New Translation. Yale University Press.