— 5 —

Test set reuse

Summary

Statistics prescribes the iron vault for test data. But the empirical reality of machine learning benchmarks couldn’t be further from the prescription. Repeated adaptive testing brings theoretical risks and practical power.


The previous chapter covered powerful theoretical guarantees for the holdout method. Unfortunately, these guarantees hold only under one-time use. If a researcher builds a model based on prior interactions with the holdout set, the holdout set loses those formal guarantees. This chapter develops a theoretical model that accommodates how people actually use the holdout method in practice. We then work out generalization bounds in this setting, contrasting them with the results of the previous chapter.

Test set reuse in machine learning benchmarks

Most machine learning benchmarks have a fixed test set freely available to researchers. Benchmarks are typically publicly available and researchers may use the test set for evaluation purposes without any limits or restrictions. Software frameworks and data sharing platforms make it easy to download and evaluate on popular test sets. Model evaluations on test sets have become so common that it’s easy to overlook just how many evaluations there are. Given the enormous amount of research and engineering work in machine learning, it’s safe to say that popular test sets often support millions of evaluations.

But what’s concerning is not so much the sheer number of test set evaluations. Keep in mind, the previous chapter developed guarantees for the holdout method that deteriorate very slowly with the number of evaluations. The price for making k evaluations is a factor \sqrt{\log k} in the error bound. By this calculation, the price of a million evaluations is a factor 3.71 in the error of the holdout method. This alone would be easy to address by making test sets just a bit larger.
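To spell out the arithmetic behind this number (assuming, as in the bound from the previous chapter, that \log denotes the natural logarithm): \sqrt{\log(10^6)} = \sqrt{6\ln 10} \approx \sqrt{13.8} \approx 3.7\,.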

What’s more concerning is that these guarantees simply don’t apply to how the holdout method is used in practice. The strong theoretical guarantees require the test set to be fresh: The models we evaluate may not in any way depend on previous evaluations on the test set. The only safe way to guarantee this is to keep the test set “in a vault”—as the textbooks prescribe—and use it only once in the end. But that’s far from reality. Machine learning researchers and engineers instead work repeatedly and incrementally with various test sets. Model builders typically use test sets as part of their model building pipeline for continual model improvements. This might include trying out tweaks to the architecture, its hyperparameters, the optimization method, and various other moving pieces. Future design choices necessarily depend on past evaluations. Likewise, scientific papers report new achievements on benchmark test sets. These reported numbers influence what scientists do in the future and what directions they choose to pursue.

By working incrementally with test sets, researchers essentially do what Duda and Hart called training on the test set. They incrementally refine their model by repeated evaluations against the test set.

Left: Textbook view of non-adaptive data analysis. Right: Adaptive data analysis.

The practice of machine learning is, by nature, an adaptive process in which new steps depend on previous interactions with the data. Formally, this means there is a feedback loop between the model and the data. Evaluations on the data inform changes to the model, which in turn lead to new evaluations on the same data. We call this adaptive data analysis. In contrast, the traditional view of data analysis that we worked with in the last chapter is non-adaptive.

Adaptive data analysis

To formally reason about adaptive data analysis, it is helpful to frame the problem as an interaction between two parties. One party holds the dataset S. Think of this party as implementing the holdout method. The other party—called the analyst—can query the dataset by requesting an estimate of the risk R(f) for a given predictor f on the dataset S. In reality, the two parties might be one and the same researcher. Nevertheless, conceptually it’s helpful to think of the problem as an interaction between these two parties.

The standard holdout method returns the empirical risk R_S(f) given a query function f\colon{\cal X}\to{\cal Y}. But we’ll allow for other methods that don’t necessarily output the exact empirical risk. In fact, there are interesting alternatives to the standard holdout mechanism that enjoy stronger guarantees in the adaptive setting. Intuitively, these alternatives limit the amount of information about the holdout set revealed by each evaluation.
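As a minimal sketch of this interface—with illustrative names, not a prescribed API—the standard holdout mechanism might look as follows in code:

```python
# A minimal sketch of the standard holdout mechanism (illustrative names):
# it holds the test data and answers each query with the exact empirical
# risk under the zero-one loss.
import numpy as np

class StandardHoldout:
    def __init__(self, x_test, y_test):
        self.x, self.y = x_test, y_test

    def query(self, predictor):
        """Return the empirical risk R_S(f) of the queried predictor f."""
        return np.mean(predictor(self.x) != self.y)
```

An adaptive analyst would call query repeatedly, choosing each new predictor based on the answers received so far.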

Throughout this chapter, we restrict our attention to the case of the zero-one loss and binary prediction, although the theory extends to other settings.

The two parties interact for some number k of rounds, thus creating a sequence of adaptively chosen predictors f_1,\dots,f_k. Keep in mind that this sequence depends on the dataset! In particular, when S is drawn at random, f_2,\dots, f_k become random variables, too, that are in general not independent of each other.

An adaptive analyst interacts with a dataset repeatedly.

Adaptive analysts

Formally, an adaptive analyst is an algorithm {\cal A} that, given a sequence f_1,R_1,\dots,f_t,R_t of queries and responses, returns a new query f_{t+1}\colon{\cal X}\to{\cal Y}, where we let f_1={\cal A}(\emptyset) be the first query the analyst chooses. We assume that the analyst {\cal A} is a deterministic algorithm.

A useful idea is to represent an adaptive analyst {\cal A} as a tree. The root node is labeled by f_1={\cal A}(\emptyset), i.e., the first function that the analyst queries without any context. The holdout mechanism then returns a response R_1. This response is generally a sample quantity like the empirical risk. Note that the empirical risk R_S(f_1) can only take on n+1 possible values. This is because under the zero-one loss the empirical risk is an average of n values in \{0,1\} and therefore lies in the set {\cal R}=\{0, 1/n, 2/n,\dots, 1\}. So, it makes sense to assume that R_1\in{\cal R} takes values in the same set. We can always achieve this by rounding R_1 to the nearest multiple of 1/n without changing the response significantly.

Each possible response value r\in{\cal R} creates a new child node in the tree corresponding to the function f={\cal A}(r) that the analyst queries when receiving the response r to the first query f_1. This gives us n+1 children of the root node, one for each possible response r\in{\cal R}. Each child corresponds to one function. In this manner, we recursively continue the process until we have a tree of depth k and degree n+1.

Constructing a tree of depth k and degree n+1 representing an adaptive analyst. Each node corresponds to the predictor the analyst chooses based on the responses seen so far.

Note that this tree depends only on the analyst {\cal A} and how it responds to all the possible transcripts that can occur in the interaction with a holdout set. The tree does not depend on the random sample, however. The tree is therefore data-independent. It’s an explicit representation of the algorithm {\cal A}. This property is a useful tool for proving generalization bounds in the adaptive setting.
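As a toy illustration of this representation—with a hypothetical analyst interface, not anything prescribed by the text—one can enumerate the data-independent query tree by iterating over all possible response transcripts:

```python
# Sketch: enumerate the query tree of a deterministic adaptive analyst.
# The analyst is modeled as a function from the transcript of responses
# seen so far to the next query; responses are rounded empirical risks
# in {0, 1/n, ..., 1}. The resulting tree depends only on the analyst.
from itertools import product

def query_tree(analyst, n, k):
    """Map each possible response transcript (length < k) to a query."""
    responses = [i / n for i in range(n + 1)]
    tree = {(): analyst(())}                 # root: first query, no context
    for depth in range(1, k):
        for transcript in product(responses, repeat=depth):
            tree[transcript] = analyst(transcript)
    return tree

# Toy analyst: a threshold rule whose cutoff shifts with past responses.
def toy_analyst(transcript):
    cutoff = 0.5 + 0.1 * sum(transcript)
    return lambda x: int(x > cutoff)

tree = query_tree(toy_analyst, n=3, k=2)
print(len(tree))   # 1 + (n + 1) = 5 functions for n = 3, k = 2
```

The exponential size of this tree in k is exactly what drives the \sqrt{k/n} bound in the next section.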

Guarantees of the holdout method under adaptivity

We can now derive guarantees for the holdout method in the adaptive analyst model. The idea is to fix an analyst and apply the analysis from Chapter 4 to all the functions appearing in the tree corresponding to the analyst.

Proposition.

For every \delta>0 and any sequence of k adaptively chosen predictors f_1,\dots, f_k, the holdout method satisfies \mathop{\mathrm{\mathbb P}}\left\{ \max_{1\le t\le k} \left|\Delta_S(f_t)\right| \le \sqrt{\frac{k\log(4(n+1)/\delta)}{2n}} \right\} \ge 1-\delta\,.

Proof.

Fix an adaptive analyst {\cal A} and the corresponding tree. The tree is of size 1 + (n+1) + (n+1)^2 + \dots + (n+1)^k = \frac{(n+1)^{k+1} - 1}{n} \le 2(n+1)^k\,. Let F be the set of functions appearing at any of the nodes in the tree. Since each node has one function, we have |F|\le 2(n+1)^k\,. Moreover, the set of functions is fully determined by the algorithm {\cal A} and therefore does not depend on any sample. Now, pick a random sample S of size n. The model selection guarantee for the holdout method from the previous chapter gives us for every \delta>0, \mathop{\mathrm{\mathbb P}} \left\{ \max_{f\in F} \left|\Delta_S(f) \right| \leq \sqrt{\frac{\log(2|F|/\delta)}{2n}} \right\} \geq 1-\delta\,. Plugging in the upper bound on |F|, we get \log(2|F|/\delta) \le k\log(4(n+1)/\delta)\,. Since every predictor f_t that can arise in the interaction appears somewhere in F, this completes the proof.

In the typical regime where k\ge\log(n) is at least logarithmic in n and \delta\ge 1/n is not too small, the bound on the maximum error simplifies to O(\sqrt{k/n}). This contrasts with the bound O(\sqrt{\log(k)/n}) that holds in the non-adaptive setting.

Analyst Error bound
non-adaptive O(\sqrt{\log(k)/n})
adaptive O(\sqrt{k/n})

The dependence on k in the adaptive setting is exponentially worse than in the non-adaptive setting. This raises the question of whether there’s a way to do better.

Climbing the leaderboard without looking at the data

So far, the guarantee of the standard holdout mechanism in the adaptive case is exponentially worse in k compared with the non-adaptive case. As it turns out, this is unavoidable in the worst case. What we’ll work out is a worst-case lower bound on the gap between risk and empirical risk in the adaptive setting. To prove such a lower bound it is enough to exhibit an adaptive analyst that forces a large gap.

Indeed, there is a fairly natural sequence of k adaptively chosen predictors, resembling the practice of ensembling, on which the empirical risk diverges from the risk by at least \Omega(\sqrt{k/n}). This matches the upper bound from the previous section up to a constant factor. In particular, with k\approx n queries, you can force a constant generalization gap. The lower bound is worst-case: It is witnessed by this one particular analyst and doesn’t say anything about the typical behavior of the holdout method.

Throughout, we focus on the zero-one loss in a binary prediction problem. The core idea extends to many other settings.

Algorithm.

Overfitting by ensembling:

  1. Choose k random binary predictors f_1,\dots, f_k\colon{\cal X}\to\{0,1\}.
  2. Compute the set I=\{i\in[k]\colon R_S(f_i) < 1/2\}.
  3. Output the predictor f=\mathrm{majority}\{ f_i \colon i\in I \} that takes a majority vote over all the predictors computed in the second step.

The key idea of the algorithm is to select all the predictors that have accuracy strictly better than random guessing. This selection step creates a bias that gives each selected predictor an advantage over random guessing. The majority vote in the third step amplifies this initial advantage into a larger advantage that grows with k. If we want to be a bit more clever, we don’t have to throw away any functions at all. If R_S(f_i)>1/2, then flipping all the bits gives R_S(1-f_i)<1/2. So, include 1-f_i in the ensemble. This saves a factor two in the number of queries.

In practice, we can do a bit better still by weighting each function with its advantage over random guessing. The larger the advantage, the larger the weight of the function in the ensemble. This makes intuitive sense and leads to some improvements. What this algorithm essentially does is boosting. Boosting is a well-known technique for turning weak predictors into strong predictors.
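To make the attack concrete, here is a minimal simulation sketch under simple assumptions: the test labels are uniformly random bits, each query is an independent random labeling of the n test points, and the holdout returns exact empirical risks. All names are illustrative.

```python
# Sketch of overfitting by ensembling against a fixed test set.
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 1_000
y = rng.integers(0, 2, size=n)                  # test labels

def empirical_risk(pred, y):
    return np.mean(pred != y)                   # zero-one loss on the test set

preds = rng.integers(0, 2, size=(k, n))         # k random "predictors"
risks = np.array([empirical_risk(p, y) for p in preds])

# Keep predictors better than chance; flip the rest so no query is wasted.
ensemble = np.where(risks[:, None] < 0.5, preds, 1 - preds)
majority = (ensemble.mean(axis=0) > 0.5).astype(int)

print(empirical_risk(majority, y))              # noticeably below 0.5
```

On a run with these parameters, the empirical risk of the majority vote typically lands around 0.4, even though its risk on fresh data is exactly 1/2.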

The next proposition confirms that this strategy indeed finds a predictor whose empirical risk is bounded away from 1/2 (random guessing) by a margin of \Omega(\sqrt{k/n}). Since the predictor does nothing but take a majority vote over random functions, its risk is of course no better than 1/2.

Overfitting by ensembling on a test set of size n=10000. A thousand queries force a generalization gap of 10%.
Proposition.

For sufficiently large k\le n, overfitting by ensembling returns a predictor f that with probability at least 1/3 satisfies R_S(f)\le \frac12-\Omega(\sqrt{k/n})\,. In particular, \Delta_S(f)\ge\Omega(\sqrt{k/n}).

Proof.

(Sketch)

The formal proof is a bit tedious, but it’s good to develop the intuition.

For a randomly chosen function, the number of errors on the test set S is distributed according to the binomial distribution B(n,1/2). We expect to see n/2 errors and one standard deviation is \sqrt{n}/2. This is because the variance of B(n,p) is np(1-p).

When we observe that R_S(f_t)<1/2 for a randomly chosen f_t, we know that it makes fewer than n/2 errors on the test set. But apart from this condition, the errors are still randomly distributed. So, we get the distribution of B(n,1/2) conditional on falling below its mean. The mean of this conditional distribution sits below n/2 by a constant fraction of one standard deviation, that is, by order \sqrt{n}. A constant fraction of the time, the count even falls below the mean by a full standard deviation or more.

This is the first part of the argument. Our selection step biases the functions to make on the order of \sqrt{n} fewer errors on the test set than a random function.
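A quick numerical check of this step (an illustrative simulation, not part of the formal argument):

```python
# Conditioning a Binomial(n, 1/2) error count on falling below n/2
# shifts its mean down by on the order of sqrt(n).
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
errors = rng.binomial(n, 0.5, size=200_000)
selected = errors[errors < n / 2]

print(n / 2 - selected.mean())   # about 0.4 * sqrt(n) = 40 for n = 10,000
print(np.sqrt(n) / 2)            # one standard deviation = 50
```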

How many functions do we select? In other words, how large is the index set I? Since the binomial distribution is symmetric around its mean, the probability of any given function being selected is about 1/2. So we expect the index set to be of size m=k/2. In fact, m concentrates around k/2. Therefore, we can be pretty sure that we select at least, say, k/4 functions. Up to constant factors, we can think of m and k as being the same.

To summarize, we know that we select many functions and each selected function is a bit better than random guessing.

The last step is to show that the majority vote amplifies this small advantage over random guessing. This is the same kind of argument we do when we reason about ensembling methods. Taking the majority vote over many experts—each a bit better than random guessing—results in a more accurate prediction.

So, consider the majority predictor f=\mathrm{majority}\{ f_i \colon i\in I \}. Fix a data point (x,y)\in S. What is the probability that f makes a mistake on x? For convenience, assume y=1. The case of y=0 is the same. By definition, the majority classifier makes a mistake if more than half the functions in I vote 0. Formally, \sum_{i\in I} f_i(x) < \frac m2\,. What is the chance that this happens? The errors that each f_i makes on the test set are randomly located, since our selection step is invariant under permutation of the test set. Therefore, each function we select is correct on x with probability 1/2 + \epsilon for some \epsilon > 1/\sqrt{n}.

This means that the random variable X = \sum_{i\in I} f_i(x) follows the binomial distribution B(m,1/2+\epsilon). Its mean and variance are \mathop{\mathrm{\mathbb E}}X = \frac m2 + \epsilon m\,, \qquad \mathop{\mathrm{\mathbb V}}X = m(1/2+\epsilon)(1/2-\epsilon)\le m/4\,. For the majority vote f to be wrong, the random variable X has to be below its mean by an additive \epsilon m. Since one standard deviation is less than \sqrt{m}, this deviation corresponds to \epsilon \sqrt{m} in units of standard deviation. Note that \epsilon\sqrt{m} = \sqrt{\frac m n} \approx \sqrt{\frac k n}\,. Since we assume k\le n, we’re talking about less than one standard deviation. For large enough m and constant p, the binomial distribution B(m,p) behaves very much like a normal distribution with mean mp and variance mp(1-p). Within one standard deviation of its mean, the normal distribution acts a lot like a uniform distribution. In particular, we can calculate that a deviation of \sqrt{k/n} standard deviations below the mean has probability at most 1/2-c\sqrt{k/n} for some positive constant c>0.
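One way to make this last step concrete, using the normal approximation: letting Z denote a standard normal variable with density \phi, for any 0\le t\le 1 we have \mathop{\mathrm{\mathbb P}}\{Z\le -t\} = \frac12 - \int_0^t \phi(s)\,ds \le \frac12 - t\,\phi(1)\,, since \phi(s)\ge\phi(1) for all 0\le s\le 1. Plugging in t=\epsilon\sqrt{m}\approx\sqrt{k/n} gives the claimed bound with constant c=\phi(1)\approx 0.24.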

So, we can conclude that the probability of a mistake by our majority vote is at most \mathop{\mathrm{\mathbb P}}\left\{\sum_{i\in I} f_i(x) < \frac m2\right\} \le \frac12-c\sqrt{\frac kn}\,, for some positive constant c>0. Written differently, this means R_S(f) \le \frac{1}{2}-c\sqrt{\frac kn}\,. That’s what we wanted to show. This concludes the proof sketch.

Zooming out again, ensembling by majority voting shows that the problem of “training on the testing data” is real. In the worst case, holdout data can quickly lose its guarantees.

Alternatives to the holdout method

In the worst case, the error of the holdout method grows with \sqrt{k} rather than \sqrt{\log k} as in the non-adaptive case. This is exponentially worse. If this pessimistic bound manifested in practice, popular benchmark datasets would quickly become useless. Is there anything we can do about the pessimistic lower bound we just encountered?

It turns out, there is. So far, we’ve analyzed the standard holdout method that returns the exact empirical risk R_S(f) given a query function f. There’s hope that alternative mechanisms might have stronger guarantees. For example, we could release an approximate answer rather than an exact answer. We have to be a bit careful though. The argument from the previous section extends to the case where the holdout method only gives approximate answers so long as these are within o(1/\sqrt{n}) of the exact answer. The reason is that 1/\sqrt{n} is roughly one standard deviation of the empirical risk. If all answers deviate by less than one standard deviation, the argument is roughly the same. So, the solution isn’t as easy as rounding the answer to four digits of precision. This is a common heuristic in machine learning competitions, but it has no formal guarantees.

Nevertheless, a related idea works. If we carefully add noise to each answer, we can improve the dependence on k from \sqrt{k} to k^{1/4}. This alternative holdout method permits a quadratic number k=n^2 of queries before it becomes useless in the worst case. The proof of this result is surprisingly tricky and requires several tools that are beyond the scope of this chapter. The key idea is based on the privacy notion of differential privacy: If we can make the holdout mechanism differentially private, then no adaptive analyst can learn enough about the test set to overfit to it. This theorem forms the basis for many alternative mechanisms that use noise addition to achieve the quadratic number of queries. Unfortunately, researchers also showed that in the worst case no holdout mechanism can answer more than a quadratic number of queries with small constant error on every query.
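As an illustrative sketch—a simplification in the spirit of these noise-adding mechanisms, not the exact method from the literature—a holdout that perturbs each answer might look like this:

```python
# Sketch of a noise-adding holdout mechanism. The noise scale is a
# tunable parameter; in the differential-privacy-based analyses it is
# calibrated to the number of queries and the test set size.
import numpy as np

class NoisyHoldout:
    def __init__(self, x_test, y_test, scale, seed=0):
        self.x, self.y = x_test, y_test
        self.scale = scale
        self.rng = np.random.default_rng(seed)

    def query(self, predictor):
        risk = np.mean(predictor(self.x) != self.y)        # exact empirical risk
        return risk + self.rng.laplace(scale=self.scale)   # perturbed answer
```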

The bounds in the previous chapter are optimistic. They are strong, but they require an assumption that is clearly violated in practice. The bounds in this chapter go about it from the other end. They allow for powerful adaptive analyses, like those you’ll see in practice. The downside is that we end up with fairly pessimistic worst-case guarantees. If we typically encountered these bounds in practice, holdout sets would have a serious problem. Both perspectives are useful though. One tells us the best that we can hope for. The other shows the worst that could happen.

In the next two chapters we’ll look at the empirical phenomena around test set reuse in machine learning research. The practice of benchmarking seems to successfully avoid the worst-case bounds from this chapter. Chapter 8 collects possible explanations for why this is. In that chapter, we’ll also see another powerful alternative to the standard holdout method. We end this chapter with a closer look at the problem of adaptivity in the context of statistics.

Freedman’s paradox

The problem with the holdout method we just saw has a close cousin in the field of statistics called Freedman’s paradox. The statistician David Freedman pointed out this problem in 1983, although he thought that his observation was neither new nor a paradox.

Freedman’s problem routinely arises in the context of variable selection for statistical modeling. If we first select variables in a data-dependent way and then fit a model on the selected variables, the model will appear to fit better than it actually does. At a high level, this is directly analogous to the ensembling procedure we discussed above. The details are different though and require a bit of statistical nomenclature.

Freedman considers a linear regression problem y=X\beta + \eta\,, where the error term \eta\in\mathbb{R}^n has independent centered Gaussian entries \eta_i\sim N(0,\sigma^2) with variance \sigma^2. The matrix X has shape n\times d where n is the number of observations and d is the number of features. The vector \beta\in\mathbb{R}^d corresponds to the unknown solution to the system of equations. We only observe the matrix X and the vector y.

Least squares regression attempts to find an approximate solution to the linear equations by solving the optimization problem: \min_{\hat\beta\in\mathbb{R}^d}\quad \|X\hat\beta -y\|^2\,. Let \hat\beta denote the optimal solution and \hat y = X\hat\beta the corresponding vector of fitted values. Due to the noise in the system of equations, we can’t hope to recover \beta exactly.

The situation statisticians want to avoid is that \beta=0 but somehow we come away thinking that \beta\ne0. The case \beta=0 corresponds to the situation where there is no signal in the data. The vector y is just a random normal vector with no dependence on X. Statisticians call this a null model. A null model is a data-generating distribution corresponding to the case where there is nothing to discover. In contrast, the case \beta\ne 0 means that there is some relationship between X and y that we would like to discover and report.

To test if we’re dealing with the null model, statisticians use hypothesis tests: Compute a test statistic of the data. If it exceeds a threshold, reject the null model. A common hypothesis test is the F-test that uses the observed F-value as a test statistic: F^{\mathrm{obs}}= \frac{\|\hat y - \bar y\|^2 / d}{\|y-\hat y\|^2/(n-d-1)}\,. Here, \bar y is the mean \bar y=\frac1n\sum_{i=1}^n y_i of the observations. We call F^{\mathrm{obs}} the observed F-value, since it’s the one we compute from our sample.

The observed F-value is the test statistic. The way the hypothesis test works is that we’re going to reject the null model if the observed F-value is above a certain threshold, call it \tau. Under the null model, the observed F-value follows an F-distribution. It doesn’t matter what exactly this distribution is. It’s just some continuous distribution we can compute.

The probability that the F-value we observe exceeds some threshold \tau is given by 1-\mathrm{CDF}(\tau), where \mathrm{CDF} is the cumulative distribution function of the F-distribution. The probability that the F-value exceeds the value we observed is therefore p = 1-\mathrm{CDF}(F^{\mathrm{obs}})\,. Here, the tail probability p is called the p-value of the test. The p-value is the probability under the null model that the F-distribution exceeds the value F^{\mathrm{obs}} we observe.

That means p-values are tail probabilities under the null model.

A fact called the probability integral transform implies that \mathrm{CDF}(F^{\mathrm{obs}}) follows the uniform distribution \mathrm{Uniform}([0,1]) over the interval [0,1]. Therefore, the distribution of a p-value under the null model is the uniform distribution. This is the only fact about p-values we’ll need.
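To see why, suppose the test statistic has a continuous, strictly increasing CDF under the null model. Then for any u\in[0,1], \mathop{\mathrm{\mathbb P}}\left\{\mathrm{CDF}(F^{\mathrm{obs}})\le u\right\} = \mathop{\mathrm{\mathbb P}}\left\{F^{\mathrm{obs}}\le \mathrm{CDF}^{-1}(u)\right\} = \mathrm{CDF}(\mathrm{CDF}^{-1}(u)) = u\,, which is exactly the CDF of the uniform distribution on [0,1].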

So far, everything is working as intended. The p-values we see under the null model are uniformly distributed as they should. The chance of seeing a p-value as small as 0.05 is, by construction, at most 5%. This is the now infamous significance level at which statisticians will reject the null hypothesis. The idea is that seeing p\le 0.05 renders the null model implausible.

Variable selection

Our goal is not only to avoid rejecting the null model when \beta=0. This alone would be easy: Never reject anything. But our goal is also to actually reject the null model when \beta\ne0, as we should. The higher the dimension d of our regression problem, the harder it generally is to find signal in the data. An F-test on all variables—when there are many—is safe, but it’s unlikely to find any signal in the data.

Therefore, it is common practice to first select some promising variables from the set of all variables, before fitting a model on the selected variables. To select variables, we can again perform hypothesis tests. Specifically, we can test individual coefficients in the regression model to see if they are significantly large using the T-statistic: T^{\mathrm{obs}}_j = \hat\beta_j/s_j,\quad\text{with}\quad s_j^2 = \hat\sigma^2\left[(X^TX)^{-1}\right]_{jj}\,, where \hat\sigma^2 is an estimate of the noise variance. Call the p-value associated with the j-th test p_j. That is, p_j is the probability of seeing a value as extreme as T^{\mathrm{obs}}_j under the null model, i.e., \beta=0 and Gaussian errors. Again, under the null model all these p-values are uniformly distributed.

Now consider the following two step procedure:

Algorithm.

Regression after variable selection:

  1. Select all variables with p_j < 0.25. Call that index set I.
  2. Solve a regression problem on X_I, the n\times|I| matrix where we retain only the columns in X corresponding to selected variables.

Since the p-values in the first step are all uniformly distributed, the selection step simply picks a random subset of roughly d/4 variables. Any variable is equally likely to be chosen. Starting from \beta=0, of course, there’s still no signal in the linear system given by X_I and y. So, we still shouldn’t reject the null hypothesis.

What Freedman observed, however, is that the p-value corresponding to an F-test on the final model is sharply biased towards smaller p-values. We’re therefore much more likely to reject the null hypothesis (\beta=0) than we should be.

Biased p-values in Freedman’s two-stage regression with n=1000 samples and d=50 variables
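To make this concrete, here is a simulation sketch of the two-stage procedure under the null model, in the spirit of the figure above (parameter names and the helper function are illustrative; it assumes numpy and scipy):

```python
# Sketch of Freedman's two-stage regression under the null model (beta = 0),
# with n = 1000 observations and d = 50 variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d = 1_000, 50

def two_stage_p_value():
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)                      # null model: no signal
    # Stage 1: t-test each coefficient in the full regression.
    beta_hat, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (n - d)                       # noise variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    p_vals = 2 * stats.t.sf(np.abs(beta_hat / se), df=n - d)
    # Stage 2: refit on the selected columns and run an F-test.
    X_sel = X[:, p_vals < 0.25]
    d_sel = X_sel.shape[1]
    y_hat = X_sel @ np.linalg.lstsq(X_sel, y, rcond=None)[0]
    f_obs = (np.sum((y_hat - y.mean()) ** 2) / d_sel) / \
            (np.sum((y - y_hat) ** 2) / (n - d_sel - 1))
    return stats.f.sf(f_obs, d_sel, n - d_sel - 1)

p_values = [two_stage_p_value() for _ in range(200)]
print(np.mean(np.array(p_values) < 0.05))           # far above the nominal 0.05
```

Collecting the returned p-values over many runs shows a distribution that is far from uniform: rejections at the 5% level occur much more often than 5% of the time, even though there is nothing to discover.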

The problem Freedman discovered is now often called inference after selection. We first select variables and then we’d like to do inference, i.e., compute p-values or confidence intervals, on the selected variables. There’s nothing specific about F-tests or T-tests in this example. It’s a fundamental problem with two-stage data-dependent analyses.

A safe way to do this is to perform the two steps on independent datasets. So, in a sense, sample splitting solves this two-step problem. Statisticians have also come up with a number of clever methods to do the two steps correctly on the same sample.

More broadly, Freedman’s paradox relates to a practice that later became known as p-hacking. The term describes the practice of deliberately choosing an analysis in a data-dependent way so as to find a significant p-value. We’ll return to this problem in the next chapter. Freedman’s observation foreshadowed problems to come.

Notes

Dwork et al. initiated the study of adaptive data analysis.1 Tools from differential privacy give variants of the holdout method that have stronger guarantees under adaptive use. The idea is to add a sufficient amount of noise to each empirical risk computation on the holdout data. Whereas the plain holdout method supports a linear number of queries in the worst case, a holdout method based on noise addition can support a quadratic number of queries. Bassily et al.2 improved the error bounds compared with the suboptimal bounds by Dwork et al. Under cryptographic hardness assumptions, however, no holdout method can support more than a quadratic number of queries.3

Blum and Hardt4 study holdout reuse in the context of machine learning benchmarks. The key observation they make is that ranking is different from evaluation: If the goal is to identify the best model from a sequence of model evaluations, a variant of the holdout method supports an exponential number of queries. We will return to this result in Chapter 8. In the same paper, they also point out how the basic holdout method fails in the adaptive setting due to overfitting via ensembling. Hardt provided a proof of this proposition in a subsequent paper.5 Dwork et al. gave a similar argument for how adaptively fitting a linear model can lead to the same deviation between risk and empirical risk.6 Feldman, Frostig, and Hardt further develop the connection between boosting and overfitting.7

In a 2015 blog post, I illustrated how overfitting by ensembling may be applied in a real data science competition.8 On the Heritage Health Prize competition—which featured 3 million dollars in prize money—a variant of the algorithm can climb the public leaderboard from rank 146 to 6 with 700 queries. This requires solving an additional technical challenge. The algorithm in this chapter starts from accuracy 0.5. This isn’t helpful if the top of the leaderboard is already much more accurate than that. Instead of picking random functions in the ensemble, the fix is to instead pick functions that randomize over two well-performing models: Pick a random classifier by choosing for each input one of the two models at random. The ensemble will then improve relative to the average accuracy of the two models. You can think of the method in this chapter as randomizing over the all-ones and the all-zeros classifiers.

Freedman described the problem with p-values after selection in a short 1983 paper.9 Freedman neither called it a paradox, nor did he believe that the observation was new. Nevertheless, his note has been quite influential. There’s now much work on this problem in statistics under the name post-selection inference10 or inference after selection.11 Hastie, Tibshirani, and Friedman discuss a variant of Freedman’s paradox in the context of cross validation in Chapter 7.10.2 of their book.12

While adaptivity brings theoretical challenges, many recognize the need for permitting adaptivity in statistical analysis. In particular, John Tukey and George Box both advocated for allowing incremental progress in data analysis.13 See also the argument by Gelman and Loken14 that we’ll return to in the next chapter.

References

Bassily, Raef, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. “Algorithmic Stability for Adaptive Data Analysis.” In ACM Symposium on Theory of Computing (STOC), 1046–59, 2016.
Blum, Avrim, and Moritz Hardt. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions.” In International Conference on Machine Learning (ICML), 1006–14. PMLR, 2015.
Box, George. “Scientific Method: The Generation of Knowledge and Quality.” Quality Progress 30, no. 1 (1997): 47.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. “The Reusable Holdout: Preserving Validity in Adaptive Data Analysis.” Science 349, no. 6248 (2015): 636–38.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. “Preserving Statistical Validity in Adaptive Data Analysis.” In ACM Symposium on Theory of Computing (STOC), 117–26, 2015.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. “Generalization in Adaptive Data Analysis and Holdout Reuse.” In Neural Information Processing Systems (NeurIPS), 2015.
Feldman, Vitaly, Roy Frostig, and Moritz Hardt. “The Advantages of Multiple Classes for Reducing Overfitting from Test Set Reuse.” In International Conference on Machine Learning (ICML), 1892–1900. PMLR, 2019.
Fithian, William, Dennis Sun, and Jonathan Taylor. “Optimal Inference After Model Selection.” arXiv:1410.2597, 2014.
Freedman, David A. “A Note on Screening Regression Equations.” The American Statistician 37, no. 2 (1983): 152–55.
Gelman, Andrew, and Eric Loken. “The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No ‘Fishing Expedition’ or ‘p-Hacking’ and the Research Hypothesis Was Posited Ahead of Time.” Department of Statistics, Columbia University 348, no. 1-17 (2013): 3.
Hardt, Moritz. “Climbing a Shaky Ladder: Better Adaptive Risk Estimation.” arXiv:1706.02733, 2017.
———. “Competing in a Data Science Contest Without Reading the Data.” https://blog.mrtz.org/2015/03/09/competition.html, March 2015.
Hardt, Moritz, and Jonathan Ullman. “Preventing False Discovery in Interactive Data Analysis Is Hard.” In Annual Symposium on Foundations of Computer Science (FOCS), 454–63. IEEE, 2014.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Corrected 12th Printing). Springer, 2017.
Lee, Jason D, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. “Exact Post-Selection Inference, with Application to the Lasso.” The Annals of Statistics 44, no. 3 (2016): 907.
Tukey, John W. “We Need Both Exploratory and Confirmatory.” The American Statistician 34, no. 1 (1980): 23–25.

  1. Dwork et al., “The Reusable Holdout”; Dwork et al., “Preserving Statistical Validity in Adaptive Data Analysis”; Dwork et al., “Generalization in Adaptive Data Analysis and Holdout Reuse”.↩︎

  2. Bassily et al., “Algorithmic Stability for Adaptive Data Analysis”.↩︎

  3. Hardt and Ullman, “Preventing False Discovery in Interactive Data Analysis Is Hard”.↩︎

  4. Blum and Hardt, “The Ladder”.↩︎

  5. Hardt, “Climbing a Shaky Ladder”.↩︎

  6. Dwork et al., “Preserving Statistical Validity in Adaptive Data Analysis”.↩︎

  7. Feldman, Frostig, and Hardt, “The Advantages of Multiple Classes for Reducing Overfitting from Test Set Reuse”.↩︎

  8. Hardt, “Competing in a Data Science Contest Without Reading the Data”.↩︎

  9. Freedman, “A Note on Screening Regression Equations”.↩︎

  10. Lee et al., “Exact Post-Selection Inference, with Application to the Lasso”.↩︎

  11. Fithian, Sun, and Taylor, “Optimal Inference After Model Selection”.↩︎

  12. Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.↩︎

  13. Tukey, “We Need Both Exploratory and Confirmatory”; Box, “Scientific Method”.↩︎

  14. Gelman and Loken, “The Garden of Forking Paths”.↩︎