Outliers: Which prior are you using?
The problem of outliers from a Bayesian viewpoint
This post is concerned with the ubiquitous problem of outliers¹. They are infamous for degrading the performance of many models and algorithms, and as a result there are ongoing attempts to accommodate them by deriving robust estimators. Unfortunately, these estimators have drawbacks of their own, such as being less efficient. In this post, I approach the problem from a Bayesian viewpoint. I illustrate how the issue of outliers connects with our prior beliefs about the data collection procedure. This leads me to show how a simple but flexible Bayesian model allows us to accommodate outliers without inheriting the drawbacks of other estimators.
Disclaimer: This post is heavily inspired by the work of Jaynes (2003).
The problem
Imagine we are interested in a quantity $\theta$ and collect measurements $D = \{x_1, \ldots, x_n\}$ in order to learn about it. When inspecting the data, we notice that one observation deviates markedly from all the others: an outlier. What should we do with it?
The dilemma
Two opposite views have been expressed:
- The outlier should not have been included in the data. The data have been contaminated, and the outlier needs to be removed; otherwise we may draw erroneous conclusions.
- The outlier may be the most important datapoint we have, so it must be taken into account in the analysis. In other words, it may be desirable to describe the population including all observations, for only in that way do we describe what is actually happening (Dixon 1950).
These viewpoints reflect different prior information about the data collection procedure. The first view is reasonable if we believe a priori that the data collection procedure is unreliable; that is, every now and then, and without warning, we can get an erroneous measurement. The second view is reasonable if we have absolute confidence in the data collection procedure; then the outlier is an important result, and ignoring it may harm us.
Clearly these are extreme positions, and in real life the researcher is usually somewhere in between. If they knew the apparatus was unreliable, they would have chosen not to collect the data in the first place, or would have improved the apparatus. Of course, in some situations we are obliged to use whatever “apparatus” we have access to. So the question arises: can we formalise an intermediate position?
Robustness
One such intermediate position is the idea of robustness. Researchers sometimes use various “robust” procedures, which protect against the possibility (or presence) of outliers. These techniques do not directly examine the outliers but accommodate them at no serious inconvenience (Barnett and Lewis 1974). Certain estimators, especially the mean and least-squares estimators, are particularly vulnerable to outliers, or have low breakdown values².
For this reason, researchers turn to robust or high-breakdown methods to provide alternative estimators for these important aspects of the data. A common robust estimation method for univariate distributions is the trimmed mean, which is calculated by temporarily eliminating extreme observations at both ends of the sample (very high and very low values) (Anscombe 1960). Alternatively, researchers may compute a Winsorized mean, for which the highest and lowest observations are temporarily censored and replaced with the adjacent values from the remaining data.
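To make these two estimators concrete, here is a minimal sketch using SciPy; the sample values and the 10% cut fraction are illustrative choices, not taken from the post.

```python
# A minimal sketch of the trimmed and Winsorized means using SciPy; the sample
# and the 10% cut fraction are made-up illustrations.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

x = np.array([4.2, 4.5, 4.7, 4.8, 5.0, 5.1, 5.3, 5.4, 5.6, 12.9])  # one suspicious value

plain = x.mean()                                     # pulled towards 12.9
trimmed = stats.trim_mean(x, 0.1)                    # drop the lowest and highest 10%
winsorized = winsorize(x, limits=(0.1, 0.1)).mean()  # replace extremes with adjacent values

print(f"mean={plain:.2f}  trimmed={trimmed:.2f}  winsorized={winsorized:.2f}")
```

Both estimators shrug off the single large value, but they do so by discarding or altering perfectly legitimate observations at the other end of the sample as well.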
The issue is that robust qualities, however defined, must be bought at a price: poorer performance when the model is correct. This is usually expressed as a trade-off between the conflicting requirements of robustness and accuracy.
As an example, let's look at the median, which is often cited as a robust estimator. The downside of the median is that it is less efficient than the mean: it does not take into account the precise value of each observation and hence does not use all the information available in the data. For large samples from a normal distribution, the standard error of the median ($\tilde{x}$) is approximately

$$\text{SE}(\tilde{x}) \approx 1.25\, \frac{\sigma}{\sqrt{n}},$$

where $\sigma$ is the population standard deviation and $n$ is the sample size, compared with $\sigma/\sqrt{n}$ for the mean (Maindonald and Braun 2006; Serfling 2011). In other words, we pay roughly a 25% penalty in standard error for the protection the median offers.
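A quick Monte Carlo check of that factor (a sketch; the sample size and number of replications are arbitrary choices):

```python
# Monte Carlo check of the ~1.25 factor quoted above.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 20_000
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

se_mean = samples.mean(axis=1).std()
se_median = np.median(samples, axis=1).std()

print(f"SE(mean)   = {se_mean:.4f}")    # close to 1/sqrt(100) = 0.10
print(f"SE(median) = {se_median:.4f}")  # close to 1.25 * 0.10
print(f"ratio      = {se_median / se_mean:.2f}")
```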
The model
Following Box and Tiao (1968) I assume that the apparatus produces good and bad measurements. So we have a “good” sampling distribution

$$G(x \mid \theta) \tag{1}$$

parametrized by the parameter of interest $\theta$, and a “bad” sampling distribution

$$B(x \mid \eta)$$

possibly containing an uninteresting parameter $\eta$. Any particular observation may have come from either distribution, and we assign joint prior probabilities to the $2^n$ conceivable ways of attaching the labels “good” and “bad” to the $n$ observations.

Consider the most common case where our prior information about the good and bad observations does not depend on the particular trial at which they occur. That is, the probability of any sequence of good and bad labels depends only on how many of them are good, not on which trials they fall on: the labels are exchangeable. By de Finetti's representation theorem (De Finetti 1972), such an exchangeable assignment is equivalent to assuming that each observation is, independently, good with probability $u$ and bad with probability $1-u$, where $u$ is an unknown parameter with its own prior density $p(u \mid I)$. Given $u$, the sampling distribution for a single observation is then the two-component mixture

$$p(x \mid \theta, \eta, u) = u\, G(x \mid \theta) + (1-u)\, B(x \mid \eta). \tag{2}$$
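As a concrete, purely illustrative instance of (2), the sketch below takes $G$ to be a unit-variance Gaussian centred on $\theta$ and $B$ a much wider Gaussian; neither choice, nor any of the numbers, comes from the post itself.

```python
# A minimal generative sketch of the mixture model (2). The Gaussian forms of
# G and B and every numerical value here are illustrative assumptions.
import numpy as np
from scipy import stats

def mixture_pdf(x, theta, eta, u):
    """p(x | theta, eta, u) = u * G(x | theta) + (1 - u) * B(x | eta)."""
    good = stats.norm.pdf(x, loc=theta, scale=1.0)   # G: narrow, centred on theta
    bad = stats.norm.pdf(x, loc=eta, scale=10.0)     # B: much wider, "anything goes"
    return u * good + (1.0 - u) * bad

def sample(n, theta, eta, u, rng):
    """Each observation is good with probability u, bad with probability 1 - u."""
    is_good = rng.random(n) < u
    return np.where(is_good, rng.normal(theta, 1.0, n), rng.normal(eta, 10.0, n))

rng = np.random.default_rng(1)
data = sample(10, theta=5.0, eta=5.0, u=0.9, rng=rng)
print(np.round(data, 2))   # most values cluster near 5; occasionally one strays far away
```

With $u$ close to 1 most draws come from $G$, but every now and then one lands in the wide tails of $B$: exactly the unreliable-apparatus situation described earlier.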
The solution
Let $D = \{x_1, \ldots, x_n\}$ stand for the data and $p(\theta, \eta, u \mid I)$ for the joint prior density of the parameters. Bayes' theorem gives the joint posterior

$$p(\theta, \eta, u \mid D, I) \propto p(\theta, \eta, u \mid I)\, L(\theta, \eta, u), \tag{3}$$

and from (2),

$$L(\theta, \eta, u) = \prod_{i=1}^{n} \bigl[\, u\, G(x_i \mid \theta) + (1-u)\, B(x_i \mid \eta) \,\bigr]$$

is the likelihood. The marginal posterior density for the parameter of interest $\theta$ follows by integrating out the nuisance parameters $\eta$ and $u$:

$$p(\theta \mid D, I) \propto \int_0^1 \! du \int \! d\eta \; p(\theta, \eta, u \mid I)\, L(\theta, \eta, u). \tag{4}$$

Another formulation of (4) is

$$p(\theta \mid D, I) = q_0\, p_0(\theta \mid D) + \sum_{j} q_j\, p_j(\theta \mid D) + \sum_{j<k} q_{jk}\, p_{jk}(\theta \mid D) + \cdots, \tag{5}$$

where

$$p_0(\theta \mid D) \propto p(\theta \mid I)\, L_0(\theta), \quad p_j(\theta \mid D) \propto p(\theta \mid I)\, L_j(\theta), \quad p_{jk}(\theta \mid D) \propto p(\theta \mid I)\, L_{jk}(\theta), \ \ldots \tag{6}$$

which results from decomposing the prior joint density³

$$p(\theta, \eta, u \mid I) = p(\theta, \eta \mid I)\, p(u \mid I) \tag{7}$$

(taking, in addition, $\theta$ and $\eta$ to be independent a priori) and expanding the likelihood product over all $2^n$ ways of labelling the observations good or bad. Here

$$L_0(\theta) = \prod_{i=1}^{n} G(x_i \mid \theta), \quad L_j(\theta) = \prod_{i \neq j} G(x_i \mid \theta), \quad L_{jk}(\theta) = \prod_{i \neq j,\, k} G(x_i \mid \theta), \ \ldots$$

are a sequence of likelihood functions for the good distribution in which we use all the data, all except $x_j$, all except $x_j$ and $x_k$, and so on; and

$$q_0 = p(\text{all observations are good} \mid D, I)$$

is the probability that all the data $D$ are good. Following the same reasoning, $q_j$ is the probability, given $D$ and $I$, that $x_j$ is the only bad observation, $q_{jk}$ the probability that $x_j$ and $x_k$ are the only bad ones, and so on.

In short, the marginal posterior for $\theta$ is a weighted average of the posteriors we would obtain by applying the good distribution to every possible subset of the data, with weights equal to the posterior probabilities that exactly those subsets are the good ones. No observation is ever discarded; the data and the prior information jointly decide how much each observation should count.
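To make the decomposition (5) concrete, here is a numerical sketch of the weights $q_0$ and $q_j$. Everything in it is an assumption made for illustration: $G$ is a unit-variance Gaussian centred on $\theta$, $B$ is a fixed wide Gaussian (so there is no $\eta$ to integrate over), the prior on $\theta$ is flat over a grid, $u$ has a Beta(9, 1) prior, and only the hypotheses “all observations good” and “exactly one bad” are retained.

```python
# Sketch of the weights q_0 and q_j in (5) under the simplifying assumptions above.
import numpy as np
from scipy import stats
from scipy.special import betaln

data = np.array([4.2, 4.5, 4.7, 4.8, 5.0, 5.1, 5.3, 5.4, 5.6, 12.9])
n = len(data)
theta = np.linspace(-10, 20, 2001)
dtheta = theta[1] - theta[0]
a, b = 9.0, 1.0                       # Beta prior on u: roughly 90% good data expected

def log_evidence_good(good_idx):
    """log of the integral over theta of prod_i G(x_i | theta), flat prior on the grid."""
    logL = stats.norm.logpdf(data[good_idx][None, :], loc=theta[:, None], scale=1.0).sum(axis=1)
    m = logL.max()
    return m + np.log(np.exp(logL - m).sum() * dtheta)

def log_weight(bad_idx):
    """Unnormalised log posterior probability that exactly the points in bad_idx are bad."""
    good_idx = np.setdiff1d(np.arange(n), bad_idx)
    r = len(bad_idx)
    lw = betaln(a + n - r, b + r) - betaln(a, b)                         # E_u[u^(n-r) (1-u)^r]
    lw += stats.norm.logpdf(data[bad_idx], loc=0.0, scale=10.0).sum()    # bad-distribution factors
    lw += log_evidence_good(good_idx)
    return lw

hypotheses = [np.array([], dtype=int)] + [np.array([j]) for j in range(n)]
logw = np.array([log_weight(h) for h in hypotheses])
q = np.exp(logw - logw.max())
q /= q.sum()
print(f"q0 = {q[0]:.3f}, q for the isolated point 12.9 = {q[-1]:.3f}")
```

With these numbers essentially all of the posterior weight falls on the hypothesis that the isolated observation is the bad one, which is precisely the mechanism by which (5) down-weights an outlier.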
An example
Suppose we are interested in a location parameter, and have a sample of 10 observations. But one datapoint lies far away from the other nine. Take the good distribution $G(x \mid \theta)$ to be a Gaussian centred on $\theta$ and the bad distribution $B(x)$ to be a much wider Gaussian. Applying (4), the posterior for $\theta$ is dominated by the term that treats the nine clustered observations as good: as the outlier moves further from the bulk of the data, its posterior weight as a “good” observation shrinks and its influence on $\theta$ smoothly vanishes, whereas the sample mean would follow it indefinitely. If, on the other hand, the suspicious datapoint is only mildly discordant, it retains most of its weight and contributes to the inference almost as usual. A sketch of this behaviour is given below.
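The following is a self-contained sketch of that behaviour, under the same kind of illustrative assumptions as before (Gaussian $G$, wide Gaussian $B$ with no free $\eta$, flat prior on $\theta$, Beta(9, 1) prior on $u$); none of the numbers come from the original post.

```python
# Posterior mean of theta under the mixture model (4), as one observation wanders off.
import numpy as np
from scipy import stats

theta = np.linspace(-10, 30, 1001)
u = np.linspace(1e-3, 1 - 1e-3, 200)
p_u = stats.beta.pdf(u, 9, 1)                  # prior p(u | I)

def posterior_mean(data):
    G = stats.norm.pdf(data[None, :], loc=theta[:, None], scale=1.0)   # good distribution
    B = stats.norm.pdf(data, loc=0.0, scale=10.0)                      # fixed wide bad distribution
    mix = u[:, None, None] * G[None] + (1 - u)[:, None, None] * B[None, None]
    post = (p_u[:, None] * mix.prod(axis=2)).sum(axis=0)               # integrate out u
    post /= post.sum()                                                 # normalise over the grid
    return (theta * post).sum()

good = np.array([4.2, 4.5, 4.7, 4.8, 5.0, 5.1, 5.3, 5.4, 5.6])
for outlier in (6.0, 9.0, 13.0, 20.0):
    data = np.append(good, outlier)
    print(f"outlier={outlier:5.1f}  sample mean={data.mean():5.2f}  "
          f"median={np.median(data):4.2f}  posterior mean={posterior_mean(data):4.2f}")
```

As the outlier is moved from 6 to 20 the sample mean climbs by well over a unit, while the posterior mean stays close to the centre of the nine clustered observations.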
Connection with adversarial training in Machine Learning
In fact, model (2) is the cornerstone of adversarial training in Machine Learning (ML). In adversarial training, the basic idea is to create adversarial data and incorporate them into the training process. The researcher then evaluates how robust the model's output is to such perturbations of the input data. The entire area of adversarial ML studies ways to create robust learning algorithms that withstand such perturbations, and it arose after observing that standard learning methods degrade rapidly in the presence of perturbations (Kurakin, Goodfellow, and Bengio 2016).
The formal study of robust estimation was initiated by Huber (1964, 1965), who considered estimation procedures under the $\varepsilon$-contamination model

$$\mathcal{F}_\varepsilon = \{\, F : F = (1 - \varepsilon)\, F_0 + \varepsilon\, H \,\},$$

where $F_0$ is the assumed model (for example a normal distribution), $\varepsilon \in [0, 1)$ is the fraction of contamination, and $H$ is an arbitrary contaminating distribution. Note how closely this class resembles the mixture (2): the difference is that the Bayesian model places a prior on which observations are bad and integrates it out, whereas the robust-estimation approach seeks a single estimator whose worst-case performance over the whole class $\mathcal{F}_\varepsilon$ is acceptable.
Summarising, the Bayesian solution can capture our prior knowledge about how the data are being generated. Allowing for a more flexible Bayesian model gives the desirable qualities of robustness automatically. As a result, we may be able to bypass the need to derive robust estimators which, as we saw, come with drawbacks of their own. This fact could be exploited in adversarial ML applications.
References
Anscombe, Frank J. 1960. “Rejection of Outliers.” Technometrics 2 (2): 123–46.
Barnett, Vic, and Toby Lewis. 1974. Outliers in Statistical Data. Wiley.
Box, George EP, and George C Tiao. 1968. “A Bayesian Approach to Some Outlier Problems.” Biometrika 55 (1): 119–29.
De Finetti, Bruno. 1972. Probability, Induction and Statistics. Wiley.
Dixon, Wilfred J. 1950. “Analysis of Extreme Values.” The Annals of Mathematical Statistics 21 (4): 488–506.
Grubbs, Frank E. 1969. “Procedures for Detecting Outlying Observations in Samples.” Technometrics 11 (1): 1–21.
Huber, Peter J. 1964. “Robust Estimation of a Location Parameter.” Ann. Math. Statist. 35 (1): 73–101. https://doi.org/10.1214/aoms/1177703732.
———. 1965. “A Robust Version of the Probability Ratio Test.” Ann. Math. Statist. 36 (6): 1753–8. https://doi.org/10.1214/aoms/1177699803.
Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University Press.
Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. 2016. “Adversarial Machine Learning at Scale.” arXiv Preprint arXiv:1611.01236.
Maindonald, John, and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach. Vol. 10. Cambridge University Press.
Serfling, Robert. 2011. “Asymptotic Relative Efficiency in Estimation.” International Encyclopedia of Statistical Science 23 (13): 68–72.
1. I define an outlier as an observation which seems “to deviate markedly from the other members of the data sample in which it appears” (Grubbs 1969).
2. The breakdown point of an estimator is the proportion of incorrect observations (e.g., arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result. See Serfling (2011) for a formal definition.
3. In (7) I assume that $u$ and $(\theta, \eta)$ are independent. That is, $p(\theta, \eta, u \mid I) = p(\theta, \eta \mid I)\, p(u \mid I)$, which is a reasonable assumption.