<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ML |</title><link>https://www.solon-karapanagiotis.com/tag/ml/</link><atom:link href="https://www.solon-karapanagiotis.com/tag/ml/index.xml" rel="self" type="application/rss+xml"/><description>ML</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© Solon Karapanagiotis 2018-2026</copyright><lastBuildDate>Fri, 20 Aug 2021 00:00:00 +0000</lastBuildDate><image><url>https://www.solon-karapanagiotis.com/media/icon.png</url><title>ML</title><link>https://www.solon-karapanagiotis.com/tag/ml/</link></image><item><title>Table vs graph</title><link>https://www.solon-karapanagiotis.com/post/table_vs_graph/ml/</link><pubDate>Fri, 20 Aug 2021 00:00:00 +0000</pubDate><guid>https://www.solon-karapanagiotis.com/post/table_vs_graph/ml/</guid><description>
&lt;p>Shall I display my data using a table or a graph? The usual answer is: it depends. Mostly, it depends on who the audience is and how the data will be used. I agree, but &lt;a href="https://doi.org/10.1038/s42256-021-00353-8">Alaa et al, 2021&lt;/a> may have gone a bit too far using tables.&lt;/p>
&lt;p>I’ll start with a brief summary of the paper.&lt;/p>
&lt;p>It is about the development of Adjutorium - a machine learning algorithm for breast cancer prognostication. The authors motivate the development of Adjutorium by stating that a widely used model (PREDICT v2.1) under-performs in specific subgroups of patients. They then compare the accuracy of Adjutorium in predicting all-cause and breast cancer-specific mortality at 3, 5 and 10 years from baseline with PREDICT v2.1. In addition, they compare Adjutorium to an in-house Cox proportional hazards (Cox PH) regression model. They use three measures to assess the three models: AUC-ROC, Harrell’s C-index and Uno’s C-index. They conclude that “Adjutorium uniformly outperformed PREDICT v2.1 and the conventional Cox PH model in predicting all-cause and breast cancer-specific mortality”.&lt;/p>
&lt;p>This statement is mostly based on &lt;a href="https://www.nature.com/articles/s42256-021-00353-8/tables/1">Table 1&lt;/a>. But the table is crammed with so many values that it is difficult to draw any conclusions - unless you spend hours on it.&lt;/p>
&lt;p>I argue that the main message they are trying to convey is not contained in the actual values, which would justify this tabular form, but in the “shape” of the values. They want to reveal the relationships among the three models. That is why I believe a graph would communicate the message more efficiently. So, below I plot the bottom panel (external validation cohort) of their Table 1.&lt;/p>
&lt;p>The horizontal lines show the performance of Adjutorium. In general, Adjutorium performs better. The improvement in performance is more evident for the cancer-specific mortality (right panel).&lt;/p>
&lt;p>Interestingly though, the conclusions depend on the choice of performance measure. For example, using the AUC-ROC and Uno’s C-index, the simpler Cox PH model predicts all-cause mortality as well as Adjutorium.&lt;/p>
&lt;p>In general, I find graphs more informative - it is easier to see trends in the data when it is displayed visually compared to when it is displayed numerically in a table.&lt;/p>
&lt;p>&lt;img src="https://www.solon-karapanagiotis.com/post/table_vs_graph/table_vs_graph_files/figure-html/unnamed-chunk-1-1.png" width="672" />&lt;/p></description></item><item><title>Outliers: Which prior are you using?</title><link>https://www.solon-karapanagiotis.com/post/outliers/outliers/</link><pubDate>Sat, 22 May 2021 00:00:00 +0000</pubDate><guid>https://www.solon-karapanagiotis.com/post/outliers/outliers/</guid><description>
&lt;p>This post is concerned with the ubiquitous problem of outliers. They are infamous for degrading the performance of many models/algorithms. As a result, there are ongoing attempts to accommodate them by deriving robust estimators. Unfortunately, these estimators have drawbacks such as being less efficient. In this post, I approach the problem from a Bayesian viewpoint. I illustrate how the issue of outliers connects with our prior beliefs about the data collection procedure. This leads me to show how a simple but flexible Bayesian model allows us to accommodate outliers without inheriting the drawbacks of other estimators.&lt;/p>
&lt;p>Disclaimer: This post is heavily inspired by the work of &lt;span class="citation">Jaynes (&lt;a href="#ref-jaynes2003probability" role="doc-biblioref">2003&lt;/a>)&lt;/span>.&lt;/p>
&lt;div id="the-problem" class="section level2">
&lt;h2>The problem&lt;/h2>
&lt;p>Imagine we are interested in a quantity &lt;span class="math inline">\(\theta\)&lt;/span>, which is unknown. The logical next step is to reduce our uncertainty about &lt;span class="math inline">\(\theta\)&lt;/span> by collecting some data. That is, we try to measure &lt;span class="math inline">\(\theta\)&lt;/span>. But the data collection procedure (or apparatus) is always imperfect, so &lt;span class="math inline">\(n\)&lt;/span> independent measurements of &lt;span class="math inline">\(\theta\)&lt;/span> give us &lt;span class="math inline">\(n\)&lt;/span> different results (&lt;span class="math inline">\(x_1, \dots, x_n\)&lt;/span>). How should we proceed in estimating &lt;span class="math inline">\(\theta\)&lt;/span>? What is the “best” estimate to use?
If the &lt;span class="math inline">\(n\)&lt;/span> data points are “close” together, the problem of drawing conclusions about &lt;span class="math inline">\(\theta\)&lt;/span> is not very difficult. But what if they are not nicely clustered, and one value, &lt;span class="math inline">\(x_j\)&lt;/span>, lies far away from the other &lt;span class="math inline">\(n-1\)&lt;/span> values? How are we going to deal with this outlier&lt;a href="#fn1" class="footnote-ref" id="fnref1">&lt;sup>1&lt;/sup>&lt;/a>?&lt;/p>
&lt;/div>
&lt;div id="the-dilemma" class="section level2">
&lt;h2>The dilemma&lt;/h2>
&lt;p>Two opposite views have been expressed:&lt;/p>
&lt;ol style="list-style-type: decimal">
&lt;li>The outlier should not have been included in the data. The data have been contaminated and the outlier needs to be removed; otherwise we may reach erroneous conclusions.&lt;/li>
&lt;li>The outlier may be the most important datapoint we have so it must be taken into account in the analysis. In other words, it may be desirable to describe the population including all observations. For only in that way do we describe what is actually happening &lt;span class="citation">(Dixon &lt;a href="#ref-dixon1950analysis" role="doc-biblioref">1950&lt;/a>)&lt;/span>.&lt;/li>
&lt;/ol>
&lt;p>These viewpoints reflect different prior information about the data collection procedure. The first view is reasonable if we believe &lt;em>a priori&lt;/em> that the data collection procedure is unreliable. That is, every now and then and without warning we can get an erroneous measurement. The second view is reasonable if we have absolute confidence in the data collection procedure. Then the outlier is an important result and ignoring it may harm us.&lt;/p>
&lt;p>Clearly these are extreme positions, and in real life the researcher is in an intermediate position. If they knew the apparatus was unreliable, they would have chosen not to collect data in the first place, or to improve the apparatus. Of course, in some situations we are obliged to use whatever “apparatus” we have access to. So the question arises: can we formalise an intermediate position?&lt;/p>
&lt;/div>
&lt;div id="robustness" class="section level2">
&lt;h2>Robustness&lt;/h2>
&lt;p>Such an intermediate position is the idea of robustness. Researchers sometimes use various “robust” procedures, which protect against the possibility (or presence) of outliers. These techniques do not directly examine the outliers but accommodate them at no serious inconvenience &lt;span class="citation">(Barnett and Lewis &lt;a href="#ref-barnett1974outliers" role="doc-biblioref">1974&lt;/a>)&lt;/span>. Certain estimators, especially the mean and least squares estimators, are particularly vulnerable to outliers; that is, they have low breakdown values&lt;a href="#fn2" class="footnote-ref" id="fnref2">&lt;sup>2&lt;/sup>&lt;/a>.&lt;/p>
&lt;p>For this reason, researchers turn to robust or high breakdown methods to provide alternative estimators for these important aspects of the data. A common robust estimation method for univariate distributions involves the use of a trimmed mean, which is calculated by temporarily eliminating extreme observations at both ends of the sample (very high and low values) &lt;span class="citation">(Anscombe &lt;a href="#ref-anscombe1960rejection" role="doc-biblioref">1960&lt;/a>)&lt;/span>. Alternatively, researchers may choose to compute a Winsorized mean, for which the highest and lowest observations are temporarily censored, and replaced with adjacent values from the remaining data.&lt;/p>
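&lt;p>The two estimators just described can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names, the choice of dropping &lt;span class="math inline">\(k\)&lt;/span> observations per tail, and the sample values are all assumptions made for the example.&lt;/p>

```python
from statistics import fmean


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return fmean(kept)


def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest observations
    with the nearest retained value on each side."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    censored = [kept[0]] * k + kept + [kept[-1]] * k
    return fmean(censored)


# Hypothetical sample with one gross outlier (98).
sample = [1, 4, 5, 6, 7, 8, 9, 10, 12, 98]

plain = fmean(sample)                      # pulled upwards by the outlier
trimmed = trimmed_mean(sample, k=1)        # ignores 1 and 98
winsorized = winsorized_mean(sample, k=1)  # replaces 1 by 4 and 98 by 12
print(plain, trimmed, winsorized)
```

&lt;p>Both functions assume that &lt;span class="math inline">\(2k\)&lt;/span> is smaller than the sample size, so that at least one observation is retained.&lt;/p>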
&lt;p>The issue arises from the fact that robust qualities - however defined - must
be bought at a price: poorer performance when the model is correct. This is usually reported by some trade-off between the conflicting requirements of robustness and accuracy.&lt;/p>
&lt;p>As an example, let’s look at the median, which is often cited as a robust estimator. The downside of the median is that it is less efficient than the mean. This is because it does not take into account the precise value of each observation and hence does not use all the information available in the data. The standard error of the median (&lt;span class="math inline">\(\sigma_{median}\)&lt;/span>) for large samples and normal distributions is:&lt;/p>
&lt;p>&lt;span class="math display">\[ \sigma_{median} \approx 1.25 \frac{\sigma}{\sqrt{N}} = 1.25 \sigma_{mean}\]&lt;/span>&lt;/p>
&lt;p>where &lt;span class="math inline">\(\sigma\)&lt;/span> is the population standard deviation and &lt;span class="math inline">\(N\)&lt;/span> the sample size.
Thus, the standard error of the median is about &lt;span class="math inline">\(25\%\)&lt;/span> larger than that of the mean &lt;span class="citation">(Maindonald and Braun &lt;a href="#ref-maindonald2006data" role="doc-biblioref">2006&lt;/a>, Chapter 4)&lt;/span>. Hence, the median is a less efficient estimator when the model is correct, i.e. when the data come from a normal distribution. Later, I will show that Bayesian analysis automatically delivers robustness whenever it is desirable, without throwing away relevant information. But first I introduce how the apparatus generates data.&lt;/p>
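&lt;p>The factor of roughly 1.25 can be checked with a small simulation. The sketch below is only an illustration: the sample size, number of replications and seed are arbitrary choices, and the function name is hypothetical. It draws repeated standard normal samples and compares the spread of the sample medians with the spread of the sample means.&lt;/p>

```python
import random
from statistics import fmean, median, stdev


def simulate_se_ratio(n=101, reps=2000, seed=1):
    """Empirical ratio SE(median) / SE(mean) for N(0, 1) samples of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(fmean(sample))
        medians.append(median(sample))
    # The standard deviation across replications estimates each standard error.
    return stdev(medians) / stdev(means)


ratio = simulate_se_ratio()
print(round(ratio, 2))
```

&lt;p>With these settings the ratio should land near the large-sample value &lt;span class="math inline">\(\sqrt{\pi/2} \approx 1.25\)&lt;/span>, up to simulation noise.&lt;/p>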
&lt;/div>
&lt;div id="the-model" class="section level2">
&lt;h2>The model&lt;/h2>
&lt;p>Following &lt;span class="citation">Box and Tiao (&lt;a href="#ref-box1968bayesian" role="doc-biblioref">1968&lt;/a>)&lt;/span> I assume that the apparatus produces good and bad measurements. So we have a “good” sampling distribution&lt;/p>
&lt;p>&lt;span class="math display">\[G(x|\theta)\]&lt;/span>&lt;/p>
&lt;p>parametrized by &lt;span class="math inline">\(\theta\)&lt;/span>, and a “bad” sampling distribution&lt;/p>
&lt;p>&lt;span class="math display">\[B(x|\xi)\]&lt;/span>&lt;/p>
&lt;p>possibly containing an uninteresting parameter &lt;span class="math inline">\(\xi\)&lt;/span>. Data from &lt;span class="math inline">\(B(x|\xi)\)&lt;/span> are useless or worse for estimating &lt;span class="math inline">\(\theta\)&lt;/span>, since their occurrence probability has nothing to do with &lt;span class="math inline">\(\theta\)&lt;/span>. Our sample consists of &lt;span class="math inline">\(n\)&lt;/span> observations&lt;/p>
&lt;p>&lt;span class="math display">\[D = (x_1 \dots x_n)\]&lt;/span>
The trouble is that we do not know which observations are good and which are bad. However, we may be able to guess, since a datapoint far out in the tails of &lt;span class="math inline">\(G(x|\theta)\)&lt;/span> can be suspected of being bad. Let’s define&lt;/p>
&lt;p>&lt;span class="math display">\[\begin{equation}
q_i =
\begin{cases}
1 &amp;amp; \text{if the ith datapoint is good} \\
0 &amp;amp; \text{if it is bad,}
\end{cases}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>and assign joint prior probabilities&lt;/p>
&lt;p>&lt;span class="math display">\[p(q_1 \dots q_n)\]&lt;/span>&lt;/p>
&lt;p>to the &lt;span class="math inline">\(2^n\)&lt;/span> possible sequences of good and bad observations.&lt;/p>
&lt;p>Consider the most common case, where our prior information about the good and bad observations does not depend on the particular trials at which they occur. That is, the probability of any sequence of &lt;span class="math inline">\(n\)&lt;/span> good/bad observations depends only on the numbers &lt;span class="math inline">\(r\)&lt;/span> and &lt;span class="math inline">\(n-r\)&lt;/span> of good and bad ones. Then, by de Finetti’s representation theorem &lt;span class="citation">(De Finetti &lt;a href="#ref-de1972probability" role="doc-biblioref">1972&lt;/a>)&lt;/span>,&lt;/p>
&lt;p>&lt;span class="math display" id="eq:deFinetti">\[\begin{equation}
p(q_1 \dots q_n) = \int_{0}^{1} u^r (1-u)^{n-r} dg(u)
\tag{1}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>The representation above is equivalent to assuming that, conditional on &lt;span class="math inline">\(u\)&lt;/span>, the &lt;span class="math inline">\(q_i\)&lt;/span> are independent Bern(&lt;span class="math inline">\(u\)&lt;/span>) (Bernoulli) random variables, where &lt;span class="math inline">\(u\)&lt;/span> itself is given a prior distribution &lt;span class="math inline">\(g(u)\)&lt;/span>. Consequently, our sampling distribution can be written as a probability mixture of the good and bad distributions&lt;/p>
&lt;p>&lt;span class="math display" id="eq:mixturedistr">\[\begin{equation}
p(x|\theta,\xi,u) = u G(x|\theta) + (1-u) B(x|\xi)
\tag{2}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>&lt;span class="math inline">\(\theta\)&lt;/span> can be thought of as the parameter of interest, while (&lt;span class="math inline">\(\xi,u\)&lt;/span>) are nuisance parameters.
In the next section, I show how a simple, flexible Bayesian solution allows for robustness. Throughout I assume &lt;span class="math inline">\(u\)&lt;/span> is unknown, which is in line with real-life scenarios.&lt;/p>
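&lt;p>To make the data-generating story behind the mixture concrete, here is a minimal simulation sketch. The specific choices are assumptions made for illustration only: the good distribution is taken to be &lt;span class="math inline">\(N(\theta, 1)\)&lt;/span>, the bad distribution is a wide &lt;span class="math inline">\(N(0, \xi^2)\)&lt;/span> that does not depend on &lt;span class="math inline">\(\theta\)&lt;/span>, and the function name and default values are hypothetical.&lt;/p>

```python
import random


def sample_mixture(n, u, theta, sigma_good=1.0, xi=10.0, seed=0):
    """Draw n observations from u * G + (1 - u) * B.

    G is N(theta, sigma_good) and B is N(0, xi); both are illustrative
    choices. Returns the observations and the (normally hidden) labels q_i.
    """
    rng = random.Random(seed)
    xs, qs = [], []
    for _ in range(n):
        q = 1 if u > rng.random() else 0  # q = 1 (good) with probability u
        if q == 1:
            xs.append(rng.gauss(theta, sigma_good))
        else:
            xs.append(rng.gauss(0.0, xi))  # bad draw carries no information on theta
        qs.append(q)
    return xs, qs


xs, qs = sample_mixture(n=10000, u=0.9, theta=3.0)
share_good = sum(qs) / len(qs)
print(round(share_good, 2))
```

&lt;p>In a real analysis only the values in xs are observed; the labels in qs are exactly what we do not know.&lt;/p>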
&lt;/div>
&lt;div id="the-solution" class="section level2">
&lt;h2>The solution&lt;/h2>
&lt;p>Let &lt;span class="math inline">\(p(\theta,\xi,u)\)&lt;/span> be the joint prior density for the parameters. By Bayes’ theorem their joint posterior density, given the data &lt;span class="math inline">\(D\)&lt;/span>, becomes&lt;/p>
&lt;p>&lt;span class="math display">\[p(\theta,\xi,u|D) \propto L(\theta,\xi,u) p(\theta,\xi,u),\]&lt;/span>&lt;/p>
&lt;p>and from &lt;a href="#eq:mixturedistr">(2)&lt;/a>,&lt;/p>
&lt;p>&lt;span class="math display" id="eq:jointlikelihood">\[\begin{equation}
L(\theta,\xi,u) = \prod_{i=1}^{n} \Big[ u G(x_i|\theta) + (1-u) B(x_i|\xi) \Big]
\tag{3}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>is the likelihood. The marginal posterior density for the parameter of interest &lt;span class="math inline">\(\theta\)&lt;/span> is&lt;/p>
&lt;p>&lt;span class="math display" id="eq:marginaltheta">\[\begin{equation}
p(\theta|D) = \int \int p(\theta,\xi,u|D) d\xi du.
\tag{4}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>Another formulation of &lt;a href="#eq:marginaltheta">(4)&lt;/a> is&lt;/p>
&lt;p>&lt;span class="math display">\[ p(\theta|D) = \frac{p(\theta) \bar{L}(\theta)} {\int p(\theta) \bar{L}(\theta) d\theta}\]&lt;/span>&lt;/p>
&lt;p>where &lt;span class="math inline">\(p(\theta)\)&lt;/span> is the marginal prior density for &lt;span class="math inline">\(\theta\)&lt;/span> and &lt;span class="math inline">\(\bar{L}(\theta)\)&lt;/span> is the quasi-likelihood defined as&lt;/p>
&lt;p>&lt;span class="math display" id="eq:quasilikelihood">\[\begin{equation}
\bar{L}(\theta) = \int \int L(\theta,\xi,u) h(\xi,u|\theta) d\xi du.
\tag{5}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>which results from decomposing the prior joint density &lt;span class="math inline">\(p(\theta,\xi,u)\)&lt;/span> into&lt;/p>
&lt;p>&lt;span class="math display">\[p(\theta,\xi,u) = h(\xi,u|\theta) p(\theta)\]&lt;/span>&lt;/p>
&lt;p>where &lt;span class="math inline">\(h(\xi,u|\theta)\)&lt;/span> is the joint prior for &lt;span class="math inline">\((\xi,u)\)&lt;/span> given &lt;span class="math inline">\(\theta\)&lt;/span>.
Substituting &lt;a href="#eq:jointlikelihood">(3)&lt;/a> into &lt;a href="#eq:quasilikelihood">(5)&lt;/a>, we have&lt;/p>
&lt;p>&lt;span class="math display" id="eq:quasilikelihoodex">\[\begin{equation}
\begin{split}
\bar{L}(\theta) = \int \int h(\xi,u|\theta) d\xi du \Big[ u^n L(\theta) + u^{n-1} (1-u) \sum_{j=1}^n B(x_j|\xi) L_j(\theta) \\
+ u^{n-2} (1-u)^2 \sum_{j&amp;lt; k} B(x_j|\xi) B(x_k|\xi) L_{jk}(\theta) + \dots \\
+ (1-u)^n B(x_1|\xi) \dots B(x_n|\xi) \Big]
\end{split}
\tag{6}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>where&lt;/p>
&lt;p>&lt;span class="math display">\[\begin{equation}
\begin{split}
L(\theta) = \prod_{i = 1}^n G(x_i|\theta) \\
L_j(\theta) = \prod_{i \neq j} G(x_i|\theta) \\
L_{jk}(\theta) = \prod_{i \neq j,k} G(x_i|\theta) \dots
\end{split}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>are a sequence of likelihood functions for the good distributions in which we use all the data, all except &lt;span class="math inline">\(x_j\)&lt;/span>, all except &lt;span class="math inline">\(x_j\)&lt;/span> and &lt;span class="math inline">\(x_k\)&lt;/span> etc. Note that the coefficient of &lt;span class="math inline">\(L(\theta)\)&lt;/span> in &lt;a href="#eq:quasilikelihoodex">(6)&lt;/a>,&lt;/p>
&lt;p>&lt;span class="math display" id="eq:simplification">\[\begin{equation}
\int \int h(\xi,u|\theta) u^n d\xi du = \int h(u|\theta)u^n du,
\tag{7}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>is the probability that all the data &lt;span class="math inline">\(D\)&lt;/span> are good conditional on &lt;span class="math inline">\(\theta\)&lt;/span>&lt;a href="#fn3" class="footnote-ref" id="fnref3">&lt;sup>3&lt;/sup>&lt;/a>. This is in the form &lt;a href="#eq:deFinetti">(1)&lt;/a>, in which the function &lt;span class="math inline">\(g(u)\)&lt;/span> is the prior &lt;span class="math inline">\(h(u|\theta)\)&lt;/span>. Likewise, the coefficient of &lt;span class="math inline">\(L_j(\theta)\)&lt;/span> is&lt;/p>
&lt;p>&lt;span class="math display">\[ \int \int h(\xi,u|\theta) u^{n-1} (1-u) B(x_j|\xi) d\xi du =\\
\int h(u|\theta) u^{n-1} (1-u) du \int B(x_j|\xi) h(\xi|\theta) d\xi.\]&lt;/span>&lt;/p>
&lt;p>Following the same reasoning, this is the probability, given &lt;span class="math inline">\(\theta\)&lt;/span>, that the jth datapoint would be bad and would have the value &lt;span class="math inline">\(x_j\)&lt;/span> and the other data would be good. Putting &lt;span class="math inline">\(\bar{L}(\theta)\)&lt;/span> into words&lt;/p>
&lt;p>&lt;span class="math display">\[\begin{equation}
\begin{array}{l@{}l}
\bar{L}(\theta) &amp;amp;{} = \text{prob(all the data are good)} \times \text{(likelihood using all the data)} \\
&amp;amp;{} + \sum_j \text{prob(only $x_j$ bad)} \times \text{(likelihood using all the data except $x_j$)} \\
&amp;amp;{} + \dots \\
&amp;amp;{} + \text{prob(all the data are bad)}.
\end{array}
\label{quasiinwords}
\end{equation}\]&lt;/span>&lt;/p>
&lt;p>In short, &lt;span class="math inline">\(\bar{L}(\theta)\)&lt;/span> is a weighted average of likelihoods resulting from every possible assumption about each datapoint &lt;span class="math inline">\(x_j\)&lt;/span>, weighted by the prior probabilities of those assumptions.&lt;/p>
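&lt;p>As a concrete check of these weights, assume (purely for illustration; the derivation above does not require it) that &lt;span class="math inline">\(u\)&lt;/span> has a uniform prior on &lt;span class="math inline">\([0,1]\)&lt;/span>, independent of &lt;span class="math inline">\(\theta\)&lt;/span> and &lt;span class="math inline">\(\xi\)&lt;/span>. The &lt;span class="math inline">\(u\)&lt;/span>-integrals in the coefficients are then Beta integrals:&lt;/p>
&lt;p>&lt;span class="math display">\[ \int_0^1 u^{r} (1-u)^{n-r} du = \frac{r! \, (n-r)!}{(n+1)!}, \qquad \int_0^1 u^{n} du = \frac{1}{n+1}, \qquad \int_0^1 u^{n-1} (1-u) du = \frac{1}{n(n+1)}. \]&lt;/span>&lt;/p>
&lt;p>For &lt;span class="math inline">\(n = 10\)&lt;/span>, the prior probability that all the data are good is &lt;span class="math inline">\(1/11\)&lt;/span>, and the prior probability that one particular datapoint is bad while the others are good is &lt;span class="math inline">\(1/110\)&lt;/span>. Summed over the ten possible choices of that datapoint this again gives &lt;span class="math inline">\(1/11\)&lt;/span>: under this prior, every possible number of bad observations from &lt;span class="math inline">\(0\)&lt;/span> to &lt;span class="math inline">\(n\)&lt;/span> is equally probable.&lt;/p>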
&lt;/div>
&lt;div id="an-example" class="section level2">
&lt;h2>An example&lt;/h2>
&lt;p>Suppose we are interested in a location parameter, and have a sample of 10 observations. But one datapoint &lt;span class="math inline">\(x_j\)&lt;/span> moves away from the cluster of the others. How will this datapoint affect our conclusions about &lt;span class="math inline">\(\theta\)&lt;/span>? The answer depends on the model we specify. If we assume the sampling distribution &lt;span class="math inline">\(G(x|\theta)\)&lt;/span> to be Gaussian, i.e. &lt;span class="math inline">\(x \sim N(\theta, \sigma)\)&lt;/span>, and our prior for &lt;span class="math inline">\(\theta\)&lt;/span> to be wide, then the Bayesian estimate will remain equal to the sample average, and our datapoint &lt;span class="math inline">\(x_j\)&lt;/span> will pull the estimate far away from the average indicated by the nine other data values. However, this analysis assumes that we know in advance that &lt;span class="math inline">\(u = 1\)&lt;/span>, i.e. that all the data are good and come from &lt;span class="math inline">\(G\)&lt;/span>. In such a case the study of datapoint &lt;span class="math inline">\(x_j\)&lt;/span> may be of significance, since it gives us information about &lt;span class="math inline">\(\theta\)&lt;/span>. Rejecting &lt;span class="math inline">\(x_j\)&lt;/span> would then be a mistake. On the other hand, if we believe that &lt;span class="math inline">\(x_j\)&lt;/span> should be thrown out, then we don’t actually believe in our assumption that &lt;span class="math inline">\(u = 1\)&lt;/span> strongly enough to adhere to it in the presence of this surprising datapoint. A model like &lt;a href="#eq:mixturedistr">(2)&lt;/a> would then be more realistic.&lt;/p>
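&lt;p>The contrast between the two analyses can be sketched numerically on a grid. Everything specific here is an assumption made for illustration: the ten data values are made up, the good distribution is &lt;span class="math inline">\(N(\theta, 1)\)&lt;/span> with known spread, the bad distribution is fixed at &lt;span class="math inline">\(N(0, 10^2)\)&lt;/span> so that &lt;span class="math inline">\(\xi\)&lt;/span> need not be integrated out, the prior for &lt;span class="math inline">\(\theta\)&lt;/span> is flat on &lt;span class="math inline">\([-5, 15]\)&lt;/span>, and the prior for &lt;span class="math inline">\(u\)&lt;/span> puts equal weight on the grid values supplied.&lt;/p>

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_mean(data, u_grid, sigma=1.0, xi=10.0):
    """Grid approximation to the posterior mean of theta under the mixture
    u * N(theta, sigma) + (1 - u) * N(0, xi), with a flat prior on theta
    and equal prior weight on every value in u_grid."""
    thetas = [-5.0 + 0.05 * i for i in range(401)]  # grid on [-5, 15]
    bad = [normal_pdf(x, 0.0, xi) for x in data]    # does not depend on theta
    weights = []
    for theta in thetas:
        good = [normal_pdf(x, theta, sigma) for x in data]
        total = 0.0
        for u in u_grid:
            lik = 1.0
            for g, b in zip(good, bad):
                lik *= u * g + (1.0 - u) * b        # one factor per datapoint
            total += lik                            # marginalise over u
        weights.append(total)
    norm = sum(weights)
    return sum(t * w for t, w in zip(thetas, weights)) / norm


# Nine clustered values plus one distant datapoint (hypothetical data).
clustered = [-0.5, 0.3, 0.1, -0.2, 0.4, -0.1, 0.0, 0.2, -0.3]
data = clustered + [10.0]

all_good = posterior_mean(data, u_grid=[1.0])  # insists that u = 1
unknown_u = posterior_mean(data, u_grid=[(j + 0.5) / 50 for j in range(50)])
print(round(all_good, 2), round(unknown_u, 2))
```

&lt;p>Under &lt;span class="math inline">\(u = 1\)&lt;/span> the posterior mean sits at the sample average of all ten values (0.99 for these data), whereas with &lt;span class="math inline">\(u\)&lt;/span> unknown it stays close to the average of the nine clustered values. No datapoint is removed by hand; the distant one is simply down-weighted by the model.&lt;/p>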
&lt;/div>
&lt;div id="connection-with-adversarial-training-in-machine-learning" class="section level2">
&lt;h2>Connection with adversarial training in Machine Learning&lt;/h2>
&lt;p>In fact, model &lt;a href="#eq:mixturedistr">(2)&lt;/a> is the cornerstone of adversarial training in Machine Learning (ML). In adversarial training, the basic idea is to create adversarial data and then incorporate them into the training process. The researcher then evaluates how robust the output of the model is to such perturbations of the input data. The entire area of adversarial ML studies ways to create robust learning algorithms that withstand such perturbations. It arose after observing that standard learning methods degrade rapidly in the presence of perturbations &lt;span class="citation">(Kurakin, Goodfellow, and Bengio &lt;a href="#ref-kurakin2016adversarial" role="doc-biblioref">2016&lt;/a>)&lt;/span>.&lt;/p>
&lt;p>The formal study of robust estimation was initiated by &lt;span class="citation">Huber (&lt;a href="#ref-huber1964" role="doc-biblioref">1964&lt;/a>, &lt;a href="#ref-huber1965" role="doc-biblioref">1965&lt;/a>)&lt;/span>, who considered estimation procedures under the &lt;span class="math inline">\(\epsilon\)&lt;/span>-contamination model, where samples are obtained from a mixture model of the form:&lt;/p>
&lt;p>&lt;span class="math display">\[\begin{equation}
P_{\epsilon} = (1 - \epsilon) P + \epsilon Q,
\label{Huber_contamination}
\end{equation}\]&lt;/span>&lt;/p>
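&lt;p>where &lt;span class="math inline">\(P\)&lt;/span> is the uncontaminated target distribution, &lt;span class="math inline">\(Q\)&lt;/span> is an arbitrary outlier distribution and &lt;span class="math inline">\(\epsilon\)&lt;/span> is the expected fraction of contamination. The distribution &lt;span class="math inline">\(Q\)&lt;/span> allows for arbitrary contamination, which may correspond to gross corruptions or more subtle deviations from the assumed model. This has the same form as our model in &lt;a href="#eq:mixturedistr">(2)&lt;/a>, with &lt;span class="math inline">\(u = 1 - \epsilon\)&lt;/span>, &lt;span class="math inline">\(G\)&lt;/span> in the role of &lt;span class="math inline">\(P\)&lt;/span> and &lt;span class="math inline">\(B\)&lt;/span> in the role of &lt;span class="math inline">\(Q\)&lt;/span>.&lt;/p>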
&lt;p>Summarising, the Bayesian solution can capture our prior knowledge about how the data are being generated. Allowing for a more flexible Bayesian model gives desirable qualities of robustness &lt;em>automatically&lt;/em>. As a result, we may be able to bypass the need to derive robust estimators which, as we saw, come with drawbacks. This fact could be used in adversarial ML applications.&lt;/p>
&lt;/div>
&lt;div id="references" class="section level2 unnumbered">
&lt;h2>References&lt;/h2>
&lt;div id="refs" class="references">
&lt;div id="ref-anscombe1960rejection">
&lt;p>Anscombe, Frank J. 1960. “Rejection of Outliers.” &lt;em>Technometrics&lt;/em> 2 (2): 123–46.&lt;/p>
&lt;/div>
&lt;div id="ref-barnett1974outliers">
&lt;p>Barnett, Vic, and Toby Lewis. 1974. &lt;em>Outliers in Statistical Data&lt;/em>. Wiley.&lt;/p>
&lt;/div>
&lt;div id="ref-box1968bayesian">
&lt;p>Box, George EP, and George C Tiao. 1968. “A Bayesian Approach to Some Outlier Problems.” &lt;em>Biometrika&lt;/em> 55 (1): 119–29.&lt;/p>
&lt;/div>
&lt;div id="ref-de1972probability">
&lt;p>De Finetti, Bruno. 1972. “Probability, Induction, and Statistics.”&lt;/p>
&lt;/div>
&lt;div id="ref-dixon1950analysis">
&lt;p>Dixon, Wilfred J. 1950. “Analysis of Extreme Values.” &lt;em>The Annals of Mathematical Statistics&lt;/em> 21 (4): 488–506.&lt;/p>
&lt;/div>
&lt;div id="ref-grubbs1969procedures">
&lt;p>Grubbs, Frank E. 1969. “Procedures for Detecting Outlying Observations in Samples.” &lt;em>Technometrics&lt;/em> 11 (1): 1–21.&lt;/p>
&lt;/div>
&lt;div id="ref-huber1964">
&lt;p>Huber, Peter J. 1964. “Robust Estimation of a Location Parameter.” &lt;em>Ann. Math. Statist.&lt;/em> 35 (1): 73–101. &lt;a href="https://doi.org/10.1214/aoms/1177703732">https://doi.org/10.1214/aoms/1177703732&lt;/a>.&lt;/p>
&lt;/div>
&lt;div id="ref-huber1965">
&lt;p>———. 1965. “A Robust Version of the Probability Ratio Test.” &lt;em>Ann. Math. Statist.&lt;/em> 36 (6): 1753–8. &lt;a href="https://doi.org/10.1214/aoms/1177699803">https://doi.org/10.1214/aoms/1177699803&lt;/a>.&lt;/p>
&lt;/div>
&lt;div id="ref-jaynes2003probability">
&lt;p>Jaynes, Edwin T. 2003. &lt;em>Probability Theory: The Logic of Science&lt;/em>. Cambridge University Press.&lt;/p>
&lt;/div>
&lt;div id="ref-kurakin2016adversarial">
&lt;p>Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. 2016. “Adversarial Machine Learning at Scale.” &lt;em>arXiv Preprint arXiv:1611.01236&lt;/em>.&lt;/p>
&lt;/div>
&lt;div id="ref-maindonald2006data">
&lt;p>Maindonald, John, and John Braun. 2006. &lt;em>Data Analysis and Graphics Using R: An Example-Based Approach&lt;/em>. Vol. 10. Cambridge University Press.&lt;/p>
&lt;/div>
&lt;div id="ref-serfling2011asymptotic">
&lt;p>Serfling, Robert. 2011. “Asymptotic Relative Efficiency in Estimation.” &lt;em>International Encyclopedia of Statistical Science&lt;/em> 23 (13): 68–72.&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="footnotes">
&lt;hr />
&lt;ol>
&lt;li id="fn1">&lt;p>I define an outlier as an observation which seems “to deviate markedly from the other members of the data sample in which it appears” &lt;span class="citation">(Grubbs &lt;a href="#ref-grubbs1969procedures" role="doc-biblioref">1969&lt;/a>)&lt;/span>.&lt;a href="#fnref1" class="footnote-back">↩︎&lt;/a>&lt;/p>&lt;/li>
&lt;li id="fn2">&lt;p>The breakdown point of an estimator is the proportion of incorrect observations (e.g. arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result. See &lt;span class="citation">Serfling (&lt;a href="#ref-serfling2011asymptotic" role="doc-biblioref">2011&lt;/a>)&lt;/span> for a formal definition.&lt;a href="#fnref2" class="footnote-back">↩︎&lt;/a>&lt;/p>&lt;/li>
&lt;li id="fn3">&lt;p>In &lt;a href="#eq:simplification">(7)&lt;/a> I assume that &lt;span class="math inline">\(u\)&lt;/span> and &lt;span class="math inline">\(\xi\)&lt;/span> are independent. That is, &lt;span class="math inline">\(h(\xi,u) = h(\xi) h(u)\)&lt;/span>, which is a reasonable assumption.&lt;a href="#fnref3" class="footnote-back">↩︎&lt;/a>&lt;/p>&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>