Outliers: Which prior are you using?

The problem of outliers from a Bayesian viewpoint

This post is concerned with a ubiquitous problem of outliers. They are infamous for degrading the performance of many models/algorithms. As a result, ongoing attempts try to accommodate them by deriving robust estimators. Unfortunately, these estimators have drawbacks such as being less efficient. In this post, I approach the problem from a Bayesian viewpoint. I illustrate how the issue of outliers connects with our prior beliefs about the data collection procedure. This leads me to show how a simple but flexible Bayesian model allows us to accommodate outliers without inheriting the drawbacks of other estimators.

Disclaimer: This post is heavily inspired by the work of Jaynes (2003).

The problem

Imagine we are interested in a quantity \(\theta\), which is unknown. The subsequent, logical step is to try to quantify our uncertainty about \(\theta\) by collecting some data. That is, we are trying to measure \(\theta\). But the data collection procedure (or apparatus) is always imperfect and so having \(n\) independent measurements of \(\theta\), we have \(n\) different results ($x_1, …, x_n $). How are we going to proceed on estimating \(\theta\), what is the “best” estimate to use? If the \(n\) data points are “close” together the problem of drawing conclusion about \(\theta\) is not very difficult. But if they are not nicely clustered: one value, \(x_j\), lies far away from the other \(n-1\) values? How are we going to deal with this outlier1?

The dilemma

Two opposite views have been expressed:

  1. The outlier should not have been included in the data. The data have been contaminated and the outlier needs to be removed otherwise we may get erroneous conclusions.
  2. The outlier may be the most important datapoint we have so it must be taken into account in the analysis. In other words, it may be desirable to describe the population including all observations. For only in that way do we describe what is actually happening (Dixon 1950).

These viewpoints reflect different prior information about the data collection procedure. The first view is reasonable if we believe a priori the data collection procedure is unreliable. That is, any now and then and without warning we can get an erroneous measurement. The second view is reasonable if we have absolute confidence in the data collection procedure. Then the outlier is an important result and ignoring it may harm us.

Clearly these are extreme positions, and in real-life the researcher is in a intermediate position. If they knew the apparatus is unreliable they would have choose not to collect data in the first place or improve the apparatus. Of course, in some situations we are obliged to use whatever “apparatus” we have access to. So the question arises can we formalise an intermediate position?

Robustness

Such an intermediate position is the idea of robustness. Researchers sometimes use various “robust” procedures, which protect against the possibility (or presence) of outliers. These techniques do not directly examine the outliers but accommodate them at no serious inconvenience (Barnett and Lewis 1974). Certain estimators, especially the mean and least squares estimators, are particularly vulnerable to outliers, or have low breakdown values2.

For this reason, researchers turn to robust or high breakdown methods to provide alternative estimators for these important aspects of the data. A common robust estimation method for univariate distributions involves the use of a trimmed mean, which is calculated by temporarily eliminating extreme observations at both ends of the sample (very high and low values) (Anscombe 1960). Alternatively, researchers may choose to compute a Windsorized mean, for which the highest and lowest observations are temporarily censored, and replaced with adjacent values from the remaining data.

The issue arises from the fact that robust qualities - however defined - must be bought at a price: poorer performance when the model is correct. This is usually reported by some trade-off between the conflicting requirements of robustness and accuracy.

As an example, lets look at the median which is often cited as a robust estimator. The downside of the median is that it is less efficient than the mean. This is because it does not take into account the precise value of each observation and hence does not use all information available in the data. The standard error of the median (\(\sigma_{median}\)) for large samples and normal distributions is:

\[ \sigma_{median} \approx 1.25 \frac{\sigma}{\sqrt{N}} = 1.25 \sigma_{mean}\]

where \(\sigma\) is the population standard deviation and \(N\) the sample size. Thus, the standard error of the median is about \(25\%\) larger than that for the mean (Maindonald and Braun 2006, Chapter 4). Hence, the median is less efficient estimator when the model in correct, i.e the data come from normal distributions. Later, I will show that Bayesian analysis automatically delivers robustness whenever it is desirable without throwing away relevant information. But first I introduce how the apparatus generates data.

The model

Following Box and Tiao (1968) I assume that the apparatus produces good and bad measurements. So we have a “good” sampling distribution

\[G(x|\theta)\]

parametrized by \(\theta\). The “bad” sampling distribution

\[B(x|\xi)\]

possibly containing an uninteresting parameter \(\xi\). Data from \(B(x|\xi)\) are useless or worse for estimating \(\theta\), since their occurrence probability has nothing to do with \(\theta\). Our sample consists of \(n\) observations

\[D = (x_1 \dots x_n)\] The trouble is we do not know which is which. However, we may be able to guess since a datapoint far away from the tails of \(G(x|\theta)\) can be suspected of being bad. Let’s define

\[\begin{equation} q_i = \begin{cases} 1 & \text{if the ith datapoint is good} \\ 0 & \text{if it is bad,} \end{cases} \end{equation}\]

with joint prior probabilities

\[p(q_1 \dots q_n)\]

to the \(2^n\) sequences of good and bad.

Consider the most common case where our prior information about the good and bad observations is invariant on the particular trial at which they occur. That is, the probability of any sequence of \(n\) good/bad observations depends only on the numbers \(r\), \(n-r\) of good and bad ones. Then, under de Finetti’s representation theorem (De Finetti 1972)

\[\begin{equation} p(q_1 \dots q_n) = \int_{0}^{1} u^r (1-u)^{n-r} dg(u) \tag{1} \end{equation}\]

The theorem above is equivalent to assuming that \(q_i\) are independent Bern(\(u\)) (Bernoulli) random variables with \(u\), given a prior distribution \(g(u)\). Consequently, our sampling distribution can be written as a probability mixture of the good and bad distributions

\[\begin{equation} p(x|\theta,\xi,u) = u G(x|\theta) + (1-u) B(x|\xi) \tag{2} \end{equation}\]

\(\theta\) can be thought of the parameter of interest while (\(\xi,u\)) are nuisance parameters. In the next section, I show how a simple, flexible Bayesian solution allows for robustness. Throughout I assume \(u\) is unknown, which is in line with real-life scenarios.

The solution

Let \(p(\theta,\xi,u)\) be the joint prior density for the parameters. Under Bayes theorem their joint posterior density, given the data \(D\), becomes

\[p(\theta,\xi,u|D) \propto L(\theta,\xi,u) p(\theta,\xi,u),\]

and from (2),

\[\begin{equation} L(\theta,\xi,u) = \prod_{i=1}^{n} \Big[ u G(x|\theta) + (1-u) B(x|\xi) \Big] \tag{3} \end{equation}\]

is the likelihood. The marginal posterior density for the parameter of interest \(\theta\) is

\[\begin{equation} p(\theta|D) = \int \int p(\theta,\xi,u|D) d\xi du. \tag{4} \end{equation}\]

Another formulation of (4) is

\[ p(\theta|D) = \frac{p(\theta) \bar{L}(\theta)} {\int p(\theta) \bar{L}(\theta) d\theta}\]

where \(p(\theta)\) is the marginal prior density for \(\theta\) and \(\bar{L}(\theta)\) is the quasi-likelihood defined as

\[\begin{equation} \bar{L}(\theta) = \int \int L(\theta,\xi,u) h(\xi,u|\theta) d\xi du. \tag{5} \end{equation}\]

which results from decomposing the prior joint density \(p(\theta,\xi,u)\) into

\[p(\theta,\xi,u) = h(\xi,u|\theta) p(\theta)\]

where \(h(\xi,u|\theta)\) is the joint prior for \((\xi,u)\) given \(\theta\). Substituting (3) into (5), we have

\[\begin{equation} \begin{split} \bar{L}(\theta) = \int \int h(\xi,u|\theta) d\xi du \Big[ u^n L(\theta) + u^{n-1} (1-u) \sum_{j=1}^n B(x_j|\xi) L_j(\theta) \\ + n^{n-2} (1-u)^2 \sum_{j< k} B(x_j|\xi) B(x_k|\xi) L_{jk}(\theta) + \dots \\ + (1-u)^n B(x_1|\xi) \dots B(x_n|\xi) \Big] \end{split} \tag{6} \end{equation}\]

where

\[\begin{equation} \begin{split} L(\theta) = \prod_{i = 1}^n G(x_i|\theta) \\ L_j(\theta) = \prod_{i \neq j} G(x_i|\theta) \\ L_{jk}(\theta) = \prod_{i \neq j,k} G(x_i|\theta) \dots \end{split} \end{equation}\]

are a sequence of likelihood functions for the good distributions in which we use all the data, all except \(x_j\), all except \(x_j\) and \(x_k\) etc. Note that the coefficient of \(L(\theta)\) in (6),

\[\begin{equation} \int \int h(\xi,u|\theta) u^n d\xi du = \int h(u|\theta)u^n du, \tag{7} \end{equation}\]

is the probability that all the data \(D\) are good conditional on \(\theta\)3. This is in the form (1), in which the function \(g(u)\) is the prior \(h(u|\theta)\). Likewise, the coefficient of \(L_j(\theta)\) is

\[ \int \int h(\xi,u|\theta) u^{n-1} (1-u) B(x_j|\xi) d\xi du =\\ \int u^{n-1} (1-u) du \int B(x_j|\xi) h(\xi,u|\theta) d\xi.\]

Following the same reasoning, this is the probability, given \(\theta\), that the jth datapoint would be bad and would have the value \(x_j\) and the other data would be good. Putting \(\bar{L}(\theta)\) into words

\[\begin{equation} \begin{array}{l@{}l} \bar{L}(\theta) &{} = \text{prob(all the data are good)} \times \text{(likelihood using all the data)} \\ &{} + \sum_j \text{prob(only $x_j$ bad)} \times \text{(likelihood using all the data except $x_j$)} \\ &{} + \dots \\ &{} + \text{prob(all the data are bad)}. \end{array} \label{quasiinwords} \end{equation}\]

In short, \(\bar{L}(\theta)\) is a weighted average of likelihoods resulting from every possible assumption about each datapoint \(x_j\), weighted by the prior probabilities of those assumptions.

An example

Suppose we are interested in a location parameter, and have a sample of 10 observations. But one datapoint \(x_j\) moves away from the cluster of the others. How will this datapoint affect our conclusions about \(\theta\)? The answer depends on the model we specify. If we assume the sampling distribution \(G(x|\theta)\) to be Gaussian i.e. \(x \sim N(\theta, \sigma)\), and our prior for \(\theta\) wide, then the Bayesian estimate will remain equal to the sample average and our datapoint \(x_j\) will pull the estimate far away from the average indicated by the nine other data values. However, this analysis assumes that we know in advance that \(u =1\), all the data are good i.e. come from \(G\). In such a case the study of datapoint \(x_j\) may be of significance since it gives us information about \(\theta\). The rejection of \(x_j\) would then be fault. On the other hand, if we believe that \(x_j\) should be thrown out, then we don’t actually believe in our assumption that \(u = 1\) strongly enough to adhere to it in the presence of the this surprising datapoint. A model like (2) would then be more realistic.

Connection with adversarial training in Machine Learning

In fact, model (2) is the cornerstone of adversarial training in Machine Learning (ML). In adversarial training, the basic idea is to simply create and then incorporate adversarial data into the training process. The researcher then evaluates how robust is the output of the model to such perturbations of the input data. The entire area of adversarial ML studies ways to create robust learning algorithms that withstand such perturbations. The area of adversarial ML arose after observing that standard learning methods degrade rapidly in the presence of perturbations (Kurakin, Goodfellow, and Bengio 2016).

The formal study of robust estimation was initiated by (Huber 1964, 1965) who considered estimation procedures under the \(\epsilon\)-contamination model, where samples are obtained from a mixture model of the form:

\[\begin{equation} P_{\epsilon} = (1 - \epsilon) P + \epsilon Q, \label{Huber_contamination} \end{equation}\]

where \(P\) is the uncontaminated target distribution, \(Q\) is an arbitrary outlier distribution and \(\epsilon\) is the expected fraction of contamination. The distribution \(Q\) allows for arbitrary contamination, which may correspond to gross corruptions or more subtle deviations from the assumed model. This is exactly our model in (2).

Summarising, the Bayesian solution can capture our prior knowledge about how the data are being generated. Allowing for a more flexible Bayesian model gives desirable qualities of robustness automatically. As a result, we may be able to bypass the need to derive robust estimators which, as we saw, come with drawbacks. This fact could be used in adversarial ML applications.

References

Anscombe, Frank J. 1960. “Rejection of Outliers.” Technometrics 2 (2): 123–46.

Barnett, Vic, and Toby Lewis. 1974. Outliers in Statistical Data. Wiley.

Box, George EP, and George C Tiao. 1968. “A Bayesian Approach to Some Outlier Problems.” Biometrika 55 (1): 119–29.

De Finetti, Bruno. 1972. “Probability, Induction, and Statistics.”

Dixon, Wilfred J. 1950. “Analysis of Extreme Values.” The Annals of Mathematical Statistics 21 (4): 488–506.

Grubbs, Frank E. 1969. “Procedures for Detecting Outlying Observations in Samples.” Technometrics 11 (1): 1–21.

Huber, Peter J. 1964. “Robust Estimation of a Location Parameter.” Ann. Math. Statist. 35 (1): 73–101. https://doi.org/10.1214/aoms/1177703732.

———. 1965. “A Robust Version of the Probability Ratio Test.” Ann. Math. Statist. 36 (6): 1753–8. https://doi.org/10.1214/aoms/1177699803.

Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University Press.

Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. 2016. “Adversarial Machine Learning at Scale.” arXiv Preprint arXiv:1611.01236.

Maindonald, John, and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach. Vol. 10. Cambridge University Press.

Serfling, Robert. 2011. “Asymptotic Relative Efficiency in Estimation.” International Encyclopedia of Statistical Science 23 (13): 68–72.


  1. I define an outlier as an observation which seems “to deviate markedly from the other members of the data sample in which it appears.” (Grubbs 1969)?↩︎

  2. The breakdown point of an estimator is the proportion of incorrect observations (e.g. arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result. See Serfling (2011) for a formal definition.↩︎

  3. In (7) I assume that \(u\) and \(\xi\) are independent. That is, \(h(\xi,u) = h(\xi) h(u)\), which a reasonable assumption.↩︎

Solon Karapanagiotis
Solon Karapanagiotis
Research Associate
MRC Biostatistics Unit

Related