If machine learning thwarted the scientific crisis, the question is why. Some powerful explanations emerge. The key lies in the social norms and practices of the community rather than in statistical methodology.
Our closer empirical look at the scientific crisis in the applied statistical sciences made one thing clear. By all accounts, many fields that apply statistical methods have been engulfed in a major replication crisis. Machine learning shares the Achilles heel of statistical measurement with these data-driven disciplines. In addition, the research ecosystem in machine learning is far from perfect. Yet, some central empirical phenomena are different and unique to machine learning.
If the core scientific mandate of machine learning is to identify better and better models, benchmarking seems to be doing a reasonable job at promoting progress. At least, it did in the ImageNet era. Replication studies in machine learning reveal that model rankings have internal validity and that they are often stable across different domains, so long as models train on the same training data. This is a better outcome than we had reason to hope for given the formal guarantees of the holdout method.
The empirical reality begs an explanation. How could it be that a decade of competitive use of the ImageNet test set didn’t go south? The excessive use of ImageNet in the deep learning era of the 2010s certainly looks like what Duda and Hart called training on the testing set: “a long series of refinements guided by the results of repeated testing”. This concern is more than a hypothetical. We saw simple and practical demonstrations, supported by theoretical arguments, showing that it is entirely possible to overfit to test sets through repeated use. A few thousand evaluations are enough, in principle, to break the holdout method. The community has easily made hundreds of thousands of evaluations on popular test sets. So, why didn’t the benchmarking enterprise run off a cliff, leading to overspecialized results of little scientific value?
We now examine a few emerging explanations. These explanatory pieces all have something in common. They aren’t so much about statistical method, but rather about how scientists operate individually and as a community. Several psychological and social patterns in the community turn out to be unexpected forces against the scientific crisis. Throughout this chapter, the focus is on the internal validity of the iron rule. Why does competitive empirical testing on benchmarks clear the bar of replication on fresh data? Theoretically we can answer this question by proving stronger guarantees for the holdout method under different assumptions that capture how the community operates. We’ll see a few such assumptions and theorems that follow from them.
Internal validity rules out gaming a benchmark by climbing the leaderboard through spurious actions, such as ensembling random models as we discussed in Chapter 5. It’s the first necessary step towards external validity. External validity of benchmark results is less well understood than internal validity. But at least there is an intuitive link between internal and external validity in benchmarking. If there is no spurious way to climb the leaderboard, your best strategy is to make honest progress. Real progress, in turn, should hold up in other domains as well.
Fully understanding the empirical phenomena is an ongoing research quest, full of intriguing open problems. This chapter provides a starting point.
A key characteristic of the machine learning community is its focus on achieving state-of-the-art results. State-of-the-art is whatever sits at the top of the leaderboard, and that is where the community’s attention naturally gravitates. The basic research mechanic is beating the previous best. And, to oversimplify, that’s primarily what benchmarking is about.
Competition over the top spots on the leaderboard is a unique aspect of the benchmarking research ecosystem. In other scientific applications of statistics, we may be interested in estimating numerical quantities, such as the treatment effect of a new intervention. In machine learning, researchers compete over the best model for a given task. Competing over the top of the leaderboard might seem narrow and, in a sense, anti-scientific. Shouldn’t science explore more freely?
But if the scientific crisis had something to do with researcher degrees of freedom, we can also see the benefits of competitive testing. Competition over the top spot in a benchmark radically reduces researcher degrees of freedom. The test set is fixed, the metric is fixed, and the evaluation protocol is fixed. In traditional benchmarks like ImageNet, even the choice of training data is set in stone and agreed on. The only freedom is in the choice of the training algorithm and model architecture.
For a benchmark to work, it must be able to correctly identify the best model at any point in time under competitive pressure. Correctly identifying the best is also sufficient to get a full ranking. After all, the k-th best model is the best model after discarding the top k-1 models. So, we can rank by repeatedly identifying the best.
As it turns out, identifying the best model from a sequence of adaptively chosen models is easier than accurately estimating the loss of each model. While estimating the loss of each model is sufficient for identifying the best, it is not necessary. To put it succinctly, if all we care about is beating the previous best, there is no way to overuse a test set. This is a theoretical result we prove next.
To make the point precise, we need a definition of what it means for a holdout mechanism to correctly identify the best model at any point in time. Given a sequence of models, we require that the holdout mechanism correctly estimates the smallest risk seen so far. This is different from trying to give an accurate answer to all requests. The mechanism might err—or decline to answer—on models that are nowhere near the best.
This relaxed requirement motivates the notion of leaderboard error. Leaderboard error is small if the holdout method keeps track of the risk of the best performing model over time, rather than the risk of all models ever evaluated.
The leaderboard error of a holdout mechanism producing a sequence of estimates R_1,\dots,R_k given a sequence of adaptively chosen models f_1,\dots,f_k is defined as: \mathrm{lberr}(R_1,\dots,R_k)= \max_{\text{step } t\in\{1,\dots,k\}}\left|R_t - \min_{\text{model }i\in\{1,\dots, t\}} R(f_i)\right|\,.
Leaderboard error is small if at any point in time t, the estimate R_t is a good estimate of the smallest risk seen up until point t, i.e., \min_{1\le i\le t}R(f_i).
Operationally, if the leaderboard error is bounded by \epsilon>0 and the holdout mechanism returns a value R_t such that R_t < R_{t-1}-2\epsilon, we know that model f_t improved over all prior models f_{t-1},\dots,f_1: the smallest risk seen so far must have dropped, since \min_{i\le t}R(f_i)\le R_t+\epsilon < R_{t-1}-\epsilon\le \min_{i\le t-1}R(f_i). Note that the leaderboard error, like empirical risk, is a sample quantity, i.e., a function of the sample that the holdout mechanism works with.
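To make the definition concrete, here is a minimal sketch in Python; the function name is ours, not part of any library.

```python
def leaderboard_error(reported, true_risks):
    """Leaderboard error of reported estimates R_1,...,R_k against
    the true risks R(f_1),...,R(f_k) of the submitted models."""
    best_so_far = float("inf")
    worst_gap = 0.0
    for r_t, risk_t in zip(reported, true_risks):
        best_so_far = min(best_so_far, risk_t)          # min over i <= t of R(f_i)
        worst_gap = max(worst_gap, abs(r_t - best_so_far))
    return worst_gap

# Example: the reported values track the running best to within 0.01.
print(leaderboard_error([0.30, 0.30, 0.21], [0.31, 0.35, 0.20]))    # about 0.01
```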
There is a simple and natural holdout method that achieves small leaderboard error. It works like this: Given a new model, check if it beats the previous best by a significant amount on the test set. If so, update the previous best, otherwise do nothing.
We call this the Ladder mechanism. More precisely, for each given model the Ladder compares the empirical risk estimate of the model to the smallest empirical risk achieved by any model encountered previously. If the empirical risk is below the previous best by some margin, it announces the empirical risk of the current model and stores it as the best seen so far. However, if the empirical risk is not smaller by a margin, the Ladder simply continues to report the previous best. This reveals no new information other than that the model didn’t improve.
We focus on risk with respect to the zero-one loss. However, the ideas apply more generally to any bounded loss function.
Ladder mechanism:
Given dataset S, threshold \eta>0, initialize R_0=\infty. In each round t=1,2,\dots,k:
- Receive model f_t and compute its empirical risk R_S(f_t) on the dataset S.
- If R_S(f_t) < R_{t-1}-\eta, announce R_t = R_S(f_t) and store it as the new best.
- Otherwise, announce R_t = R_{t-1}.
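Here is a minimal sketch of the Ladder in Python, following the prose description above; the class name and interface are ours, and the published algorithm may differ in small details.

```python
import numpy as np

class Ladder:
    """Ladder mechanism for the zero-one loss (a sketch, not a library API)."""

    def __init__(self, test_features, test_labels, eta):
        self.x = test_features
        self.y = test_labels
        self.eta = eta
        self.best = float("inf")    # best value reported so far

    def submit(self, model):
        """Evaluate a model and return the current leaderboard value."""
        empirical_risk = np.mean(model(self.x) != self.y)
        # Announce a new value only if it beats the previous best by more
        # than the threshold eta; otherwise repeat the previous answer.
        if empirical_risk < self.best - self.eta:
            self.best = empirical_risk
        return self.best
```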
The next theorem shows that the Ladder algorithm indeed achieves small leaderboard error. The main feature of the error bound is a logarithmic dependence on the number k of models that we evaluate. This is the kind of dependence we get out of the Hoeffding bound in the non-adaptive case. However, here we get the logarithmic dependence even in the adaptive case, if we only care about identifying the best model.
For a suitably chosen threshold parameter, for any sequence of adaptively chosen models f_1,\dots,f_k, the Ladder algorithm achieves with probability 1-o(1): \mathrm{lberr}(R_1,\dots,R_k) \le O\left(\frac{\log^{1/3}(kn)}{n^{1/3}}\right)\,.
The proof of the theorem follows from a variant of the adaptive tree argument introduced in Chapter 5. The new element here is that we carefully prune the tree so as to better bound its size.
Set the threshold parameter to \eta = \log^{1/3}(kn)/n^{1/3}. This technical choice will become clear later. For this threshold \eta, we need to show that with probability 1-o(1) we have for every round t\in[k] the error bound |R_S(f_t)-R(f_t)|\le O(\eta) = O(\log^{1/3}(kn)/n^{1/3})\,.
Let {\cal A} be the adaptive analyst generating the function sequence. The analyst {\cal A} naturally defines a rooted tree {\cal T} of depth k, constructed recursively as follows: the root is labeled by the first model f_1, which the analyst chooses before seeing any response. Each node at depth t is labeled by a model; its children correspond to the possible answers the Ladder can give in round t, and each child is labeled by the model the deterministic analyst would submit next after observing that answer. Every interaction between the analyst and the Ladder traces out a root-to-leaf path in {\cal T}, so every model the analyst could ever submit appears as the label of some node.
We are going to bound the size of the tree using properties of the Ladder algorithm. Let B = (1/\eta+1)\log(4k(n+1)). We claim that the size of the tree satisfies |{\cal T}|\le 2^B. To prove the claim, we will uniquely encode each node in the tree using B bits of information. The claim then follows directly.
This is called a compression argument and it goes as follows. To encode a node at depth t in the tree, we record the depth t itself, the rounds i\le t at which the Ladder updated its estimate, that is, rounds with R_i < R_{i-1}-\eta, and the value R_i announced at each such round. The key observation is that since R_i \in[0,1] there can be at most \lceil 1/\eta\rceil\le (1/\eta)+1 such update steps. This is because the reported value is lower bounded by 0 and decreases by more than \eta each time there is an update.
Moreover, there are at most n+1 possible values for R_i, since we’re talking about the zero-one loss on a dataset of size n. Hence, each update round together with its announced value takes at most \log(2k)+\log(n+1) bits, and specifying all of them requires at most (1/\eta + 1)(\log(n+1)+\log(2k)) bits. Together with \log(2k) bits for the depth of the node, these bits of information uniquely identify each node in the tree, since for every round i not explicitly listed we know that R_i=R_{i-1}. The total number of bits we used is: (1/\eta+1)(\log(n+1)+\log(2k))+\log(2k) \le (1/\eta +1)\log(4k(n+1)) = B\,. This establishes the claim we made. The theorem now follows by applying a union bound over all nodes in {\cal T} and using Hoeffding’s inequality for each fixed node. Let \mathcal{F} be the set of all functions appearing in {\cal T}. By a union bound, we have \begin{aligned} \mathop{\mathrm{\mathbb P}}\left\{\exists f\in \mathcal{F}\colon \left|R(f)-R_S(f)\right|>\epsilon\right\} & \le 2|\mathcal{F}|\exp(-2\epsilon^2n)\\ & \le 2\exp(-2\epsilon^2n+B)\,. \end{aligned} Verify that by putting \epsilon=5\eta, this expression can be upper bounded by 2\exp(-n^{1/3})=o(1), and thus the theorem follows.
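To get a feel for the quantities in this calculation, here is a small numeric check; the values of n and k are our own illustrative choices, and logs are natural, which only affects constants.

```python
import math

# Illustrative values: a large test set and a large number of adaptive queries.
n, k = 10**6, 10**6

eta = math.log(k * n) ** (1 / 3) / n ** (1 / 3)     # threshold from the proof
B = (1 / eta + 1) * math.log(4 * k * (n + 1))       # bit budget of the encoding
eps = 5 * eta                                       # error level in the union bound

# Exponent (base 10) of the union bound 2*exp(-2*eps^2*n + B), ignoring the factor 2.
exponent = (-2 * eps**2 * n + B) / math.log(10)
print(f"eta = {eta:.3f}, B = {B:.0f} bits, failure probability below 10^{exponent:.0f}")
```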
The logarithmic dependence on the number of evaluations is great, but the term n^{-1/3} that depends on the sample size doesn’t meet the more familiar n^{-1/2} rate. The exponent 1/3 in the error just doesn’t seem quite right. A bit of work, however, can show that the analysis is tight for the Ladder algorithm. You can find a sequence of functions that forces an error matching the upper bound. But there could still be another algorithm that gets a better exponent.
We know from Chapter 2 that we’re not going to beat the exponent \frac12. But between \frac13 and \frac12 we don’t know what’s possible. Although not obvious, a more sophisticated randomized version of the Ladder algorithm achieves the exponent 2/5. But that’s the best researchers know to date. The following table summarizes the state of affairs.
| Analyst | General estimation error | Leaderboard error |
|---|---|---|
| non-adaptive | O(\sqrt{\log k/n}) | O(\sqrt{\log k/n}) |
| adaptive | O(\sqrt{k/n}) | O((\log k/n)^{2/5}) |
The Ladder algorithm shows how to guarantee benchmark integrity under an excessive number of model evaluations. Rather than guaranteeing accurate evaluations for all possible queries, the algorithm focuses on identifying the best. The mechanism couldn’t be any simpler: Check to see if you beat the previous best by some margin. If you did, publish the result. Otherwise, try again.
The principle is so simple, in fact, that researchers could easily apply it in their minds. And perhaps they knowingly or unknowingly do—as sort of a mental heuristic. Heuristics always discard some information and avoid excessive computation in favor of simple ways of going about things. Humans excel at finding good heuristics to navigate complex systems.
So, let’s entertain a hypothesis about the community. Imagine we’re in a world where researchers primarily care whether their model improved over the state of the art (SOTA). Call this community mindset “SOTA psychology”. It’s fundamentally the same principle that lies behind the iron rule. All scientific questions must ultimately be settled by competitive empirical testing. Once we decide on a benchmark, we want the best model to win.
In this view of the research community, benchmarks have the primary purpose of identifying the best model. Rather than providing a numerical estimate of accuracy for each model we evaluate on the test set, the goal is to identify the best model at any point in time. Even if syntactically researchers apply the standard holdout method, they primarily use the information from the test set to tell when they improved. If they didn’t improve, they don’t read much into it beyond “I didn’t improve.” This mental pattern is enough to implement the Ladder algorithm. The community reinforces the pattern by rewarding state-of-the-art results with conference publications, awards, and attention not afforded to results that didn’t beat the previous best.
Consequently, there are two ways to think of the Ladder algorithm. One is prescriptive. It’s an alternative holdout method you can actively and explicitly implement to avoid test set overuse in a benchmark or a competition. This is the typical way we think about algorithm design. In this view, algorithms are interventions that change the way a running system works.
The other way to think about the Ladder algorithm and its analysis is as a descriptive model of the community. We can think of the Ladder algorithm as a kind of social algorithm that the community implements on its own. From this perspective, the algorithm describes a psychological and sociological aspect of how the community works. The apparent flaw that critics of machine learning point to—SOTA chasing—is actually a mechanism (the Ladder algorithm) that prevents the holdout method from collapsing.
The Ladder algorithm is a heuristic resulting from an excessive focus on top results. It might seem narrow-minded as a scientific credo to only care about identifying the best, but it has the benefit of protecting the test set from overuse. Ignorance is bliss, even if counterintuitive in the context of scientific work. There’s more to learn by looking at how humans make decisions.
Decades of research have compared human decision making to the old economic ideal of a rational decision maker. Rational agents compute optimal—that is, utility maximizing—solutions to decision problems using all available information. Rational agents all strategize alike. Optimality makes them interchangeable so that in theory we’re left with one representative agent. The rational agent model—homo economicus—is the formal basis for large swaths of economic theory.
Humans almost never meet the economic ideal of rationality. You may not be shocked to find out that humans aren’t all alike. They don’t optimize perfectly. And they don’t exhaustively use all available information when choosing what to have for dinner. This clash of common sense with economic theory is what motivated economist and AI pioneer Herb Simon to develop the theory of bounded rationality. Simon’s work in turn influenced the heuristics and biases research program of Kahneman and Tversky. Core to this line of research is the idea of cognitive biases. Cognitive biases are systematic deviations from rationality commonly found in human decision making.
The last chapter touched on how cognitive biases can fuel the scientific crisis. Researchers, for example, are vulnerable to confirmation bias. Scientists tend to confirm what they think is true in the first place. They select evidence that supports their beliefs, while discarding information that is in conflict with their prior assumptions. Fischhoff’s hindsight bias is another such culprit.1 Once an uncertain event is resolved one way or the other, we mistakenly believe—with hindsight—that we predicted this outcome all along. “I’ve always been saying” is a phrase you’ll hear a lot among scientists. This cognitive pattern makes us dismiss valuable experimental data as obvious or unsurprising.
But cognitive shortcuts of the analyst can also mitigate test set overuse. A researcher is unlikely to behave like the worst-case analyst that gave us the pessimistic bounds in Chapter 5. In this worst case example, the analyst had to act on minor differences, such as distinguishing between a model achieving error 0.501 and one achieving error 0.499. This defies common sense. Moreover, the analyst had to look back at all the things ever tried and precisely put them together in some specific way. This again seems to run counter to intuitive human reasoning.
Try it out. Ask your office mates to multiply out 1\times2\times3\times4\times5\times6\times7\times8 in under 5 seconds. Because they’ll likely run out of time, they’ll make some approximation. Now ask another group of friends to multiply out the same sequence but in reverse order 8\times 7\times 6\times 5\times 4\times 3\times 2\times 1. They will again make some approximation. It turns out that humans tend to come up with an underestimate if they go about multiplying the numbers in natural order and a significantly higher estimate if they go in reverse. Kahneman and Tversky discovered this bias some fifty years ago and called it the anchoring effect.2 Humans rely heavily on initial pieces of information (anchors) when solving problems. This is the kind of observation that launched the study of cognitive biases.
A related observation is even older.3 Humans tend to have a recency bias that assigns greater weight and significance to more recent events compared to earlier ones. A similar but broader cognitive tendency is known as the availability heuristic.4 Humans instinctively prioritize and give undue weight to evidence that’s easier to recall or retrieve, whether due to recency, convenience, prominence, or frequency of exposure.
Anchoring, recency, and availability all have something in common. When applied to a scientist reusing a test set, they limit the amount of information that can pass from the test set to the analyst. This in turn mitigates overfitting. It’s as if the analyst made fewer evaluations. Just like the obsession with chasing state-of-the-art protects the test set, so do many cognitive biases.
We can make the intuition mathematically rigorous by modeling the analyst as a dynamical system.5 The analyst has some internal mental state h_{t+1}=\Psi(h_t, a_t) evolving according to state transition map \Psi. The state transition map depends on the internal state h_t and the holdout answer a_t that the analyst observes in response to the query q_t=q_t(h_t) chosen based on the internal state. By making different assumptions on the state transition map, we can model different cognitive limitations. Consider, for example, a contractive map satisfying \|\Psi(h, a) - \Psi(h', a)\|\le \lambda \|h-h'\| for some scalar \lambda\in(0,1). This corresponds to an analyst that discounts past information in each step, resulting in recency bias. We can dial \lambda between 0 and 1 to quantify the strength of the bias. What’s neat is that this interpolates between the two extremes of a non-adaptive analyst and a fully adaptive analyst, giving us a continuum of generalization bounds from optimistic to pessimistic.
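As a toy illustration, here is a sketch of such a contractive state transition in Python; the specific map and all parameter values are our own choices, not something the theory prescribes.

```python
import numpy as np

def recency_biased_update(state, answer, lam):
    """One step of a contractive state transition h_{t+1} = Psi(h_t, a_t).

    Shrinking the old state by lam in (0,1) before mixing in the new holdout
    answer means an answer from t rounds ago is attenuated by lam**t,
    a simple stand-in for recency bias."""
    return lam * state + (1 - lam) * answer

# The map is contractive: |Psi(h, a) - Psi(h', a)| = lam * |h - h'|.
rng = np.random.default_rng(0)
answers = rng.uniform(size=20)      # stand-in for holdout responses a_1, ..., a_20
state = 0.0
for a in answers:
    state = recency_biased_update(state, a, lam=0.3)
print(state)    # essentially an exponentially weighted average of recent answers
```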
Gigerenzer and Brighton argue that heuristics that rely on less information and computation can often lead to better results—a less is more phenomenon.6 We can see an instance of this general pattern in holdout reuse. A simple mental heuristic like the Ladder principle that discards information ultimately leads to better estimates. Likewise, different cognitive shortcuts effectively reduce test set leakage.
Limiting information is a powerful tool in arguing about test set reuse. Formally, the proof of the Ladder theorem is based on a compression argument: If we can show that at most B bits of information were revealed about the test set, then the generalization gap can be at most O(\sqrt{B/n}). The reason is that B bits of information can index at most k=2^B different functions. We can therefore apply Hoeffding’s inequality with a union bound over k=2^B functions and make use of its \sqrt{\log k}=\sqrt{B} dependence in the error. You have all the tools you need to formalize and prove this statement as an exercise.
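For reference, here is one precise version of the statement: if a deterministic analyst’s interaction reveals at most B bits about the test set, then the set \mathcal{F}_B of functions the analyst could possibly evaluate has size at most 2^B, and Hoeffding’s inequality combined with a union bound gives, with probability at least 1-\delta, \max_{f\in\mathcal{F}_B}\left|R(f)-R_S(f)\right| \le \sqrt{\frac{B\ln 2+\ln(2/\delta)}{2n}}\,.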
Compression is a useful tool, since it works for arbitrary algorithms and also applies to the adaptive case. Any deterministic adaptive analyst using B bits of information from the test set can implicitly only search over 2^B functions, even if the analyst uses the test set adaptively. It makes no difference for the argument.
The compression argument has another application to benchmark longevity that starts from a lovely thought experiment proposed by researchers Arora and Zhang. Call it Rip Van Winkle’s replication problem.
Rip Van Winkle is an American short story about a Dutch-American villager who drinks a mysterious liquor in the New York state mountains, falls asleep for 20 years, and wakes to find he has slept through the American Revolution. Imagine a modern day Rip who fell asleep in 2012, days after the release of ILSVRC-2012, and wakes up in 2022. Before Rip fell asleep, he trained linear models on top of handcrafted feature transforms. On ILSVRC-2012, these models would include the correct label among their top five predictions about 75% of the time, showing modest success in a metric called top-5 accuracy. Building highly accurate models for ImageNet looked like the holy grail of computer vision. When Rip woke up, however, end-to-end trained deep neural networks would get the correct label right with one guess about 90% of the time. People had stopped looking at top-5 accuracy, because it was too easy. To the extent that models made errors at all, it was largely due to ambiguous instances.7 For academic purposes, researchers considered ImageNet solved.
Shocked by the developments, Rip rushes to get caught up. Now, how many bits of information about the ILSVRC-2012 test set do you have to tell Rip so that he can reproduce, say, ResNet-152? Rip has everything you could’ve asked for in 2012, but not more than that. You need to describe whatever is new, the model architecture, the hyperparameters, and so forth. Since these design choices directly or indirectly depend on test set information, they potentially reveal information about the test set. Based on some plausible assumptions and simplifications, Arora and Zhang estimate that 1032 bits are enough to replicate ResNet-152 from available knowledge in 2012. With an optimized compression bound, this gives an upper bound on the generalization gap of at most 5%. Not bad!
There is a subtlety though. It would likely be quite hard to actually train ResNet-152 in the 2012 software ecosystem. Modern deep learning libraries and GPU optimizations were still missing. Training deep neural networks in 2012 was still a high-friction grind: what is an import statement in Python now was months of overcoming obstacles back then. On the other hand, if we allow Rip to use the 2022 software ecosystem, we’re cheating a bit. After all, the software ecosystem co-evolved in tandem with progress on ImageNet. Various software developments happened with ImageNet in mind. It’s not clear how much information from the test set leaked into the software ecosystem.
How much can we overfit to the ImageNet test set if we really try? Let’s put on our adversarial hat and try to come up with an attack on the ILSVRC test set. Ignoring all better judgment, our only goal is to find a sequence of adaptive queries to the test set that drives up the generalization gap as much as possible. In Chapter 5, we saw such an attack for binary classification problems that works by ensembling random models. This attack is quite effective, forcing a large generalization gap with a few thousand queries on relatively large test sets. Can we do the same on ImageNet?
The attack from Chapter 5 was for binary classification problems and doesn’t directly apply to ImageNet, since ILSVRC-2012 has 1000 classes. But it’s not hard to generalize the idea. First, we pick k random classifiers as we did before. We evaluate each on the test set. Select the classifiers that have better than random chance accuracy on the test set. For C=1000 classes, accuracy of random guessing equals 1/C. So, we select the random classifiers that exceed accuracy 1/C. We then ensemble the selected classifiers. To do so, replace the majority vote with a plurality vote that picks the most frequently predicted label on each data point.
Code it up and try it out. You’ll be surprised to find out how poorly it performs. The attack barely gets off the ground. Although the algorithmic idea of ensembling random functions generalizes from two classes to any number of classes, it somehow just doesn’t work well at all. This is for a fundamental reason. Ensembling random functions in the binary case forced a generalization gap around \sqrt{k/n} where k is the number of functions and n is the sample size. But in the case of C classes, the analogous attack only achieves generalization gap about C^{-1}\sqrt{k/n}. It’s worse by a factor C. So, for 1000 classes, we’d need one million times as many queries to achieve the same error as in the binary case.
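If you do try it, here is a minimal, fully synthetic simulation sketch; we draw random labels for a stand-in test set rather than touching ImageNet, and all names and parameter choices are ours. Since the random classifiers are independent of the labels, their true accuracy on fresh data is 1/C, so the ensemble’s excess test accuracy over 1/C is exactly the generalization gap the attack forces.

```python
import numpy as np

def plurality_attack_gap(n, k, C, seed=0):
    """Gap forced by ensembling random classifiers with a plurality vote."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(C, size=n)                 # stand-in test set labels
    preds = rng.integers(C, size=(k, n))             # k random classifiers
    accs = (preds == labels).mean(axis=1)
    selected = preds[accs > 1.0 / C]                 # keep better-than-chance ones
    if selected.shape[0] == 0:                       # unlikely with these parameters
        return 0.0
    # Plurality vote: the most frequently predicted label on each test point.
    votes = np.apply_along_axis(np.bincount, 0, selected, minlength=C)
    ensemble = votes.argmax(axis=0)
    return (ensemble == labels).mean() - 1.0 / C

for C in [2, 10, 100]:
    print(C, round(plurality_attack_gap(n=5_000, k=1_000, C=C), 4))
```

The gap shrinks quickly as the number of classes grows, in line with the C^{-1}\sqrt{k/n} scaling.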
There is a more clever attack that forces a generalization gap \sqrt{k/Cn}. But that is where it ends. No attack in the multi-class setting can do asymptotically better than this (up to perhaps a logarithmic factor). Having many classes therefore mitigates the risk of overfitting due to test set reuse. The larger the number of classes, the harder it is to overfit to a test set. It’s as though the sample size increases from n to Cn. For ImageNet ILSVRC-2012 that’s a factor 1000 improvement in effective sample size. It’s like having a test set of size 50,000,000.
There is only one catch. These bounds are for an analyst that knows nothing about the test set and just starts querying the holdout from scratch. This is the same setup as the method we saw in Chapter 5 that overfits by ensembling random functions. What the above bounds show is that any such method no longer works well on a task with many classes.
In practice, however, researchers always have some prior information about the test set. In particular, we might already know a model that performs reasonably well on the test set. This gives us some good guess to start from about what the labels in the test set are. So, to make our attack more powerful, we give it access to a good model to begin with. What this means is that our attack doesn’t have to start from scratch. It can utilize good predictions.
But how exactly can we use a well-performing model to overfit with fewer queries? There are at least three increasingly effective strategies:
While the first two show only minor improvements over the baseline, the third idea has legs. Based on this idea, researchers came up with a clever attack on the ImageNet ILSVRC test set that forces a 3% generalization gap with 5000 queries.8 This is not too far from the Rip Van Winkle estimate.
You might wonder if we can prevent overfitting by artificially splitting two classes into many classes. Unfortunately, this alone doesn’t help, since a good model for the problem would have high top-2 accuracy, thereby reducing the problem to the binary case. For multiple classes to help, the classes have to be reasonably hard to distinguish.
What’s the moral of this story? ImageNet curiously contains some 120 different dog breeds, such as the Appenzeller, Blenheim Spaniel, Otterhound, Bluetick Coonhound, Groenendael, and the Schipperke, to name a few. This benchmark artifact might seem frivolous. After all, most humans struggle to identify any of these. Unexpectedly, though, having many similar classes makes it harder—not easier—to overfit to the test set. Much like SOTA psychology, what appears to be an irrational quirk of the community’s benchmark design turns out to be a structural safeguard.
Researchers compete but they collaborate just as much. The impressive software ecosystem growing around benchmarks makes it easy to share resources for model development. You’d never start from scratch if you tried out a new idea. You’d always take the most similar piece of code you can find and incrementally tweak it.
Collaboration and code sharing mean that researchers try out similar things. We can see this empirically by looking at model similarity. Suppose one model has 90% accuracy and another model has 75% accuracy. What’s the chance that the two binary classifiers agree on a randomly chosen point? If each classifier erred on a random subset of points independently, we’d see the two models agree on a randomly chosen point with probability 0.9\times 0.75 + 0.1\times 0.25 = 0.7\,. This is the baseline level of agreement that follows from the accuracy numbers if errors are independent and random. For a multi-class problem with C classes, the chance that both models are wrong but agree goes down to 0.1\times 0.25\times \frac1{C}. For ILSVRC-2012 (C=1000) the baseline is therefore roughly 0.675 rather than 0.7. Still, highly accurate models must agree a lot. Two models are more similar than expected if the agreement is significantly higher than this random baseline value.
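Here is a small sketch that computes how much two models’ observed agreement exceeds this random baseline; the function name and the 1/C approximation for agreement on errors are ours.

```python
import numpy as np

def excess_agreement(preds_a, preds_b, labels, num_classes):
    """Observed agreement of two models minus the agreement expected if
    their errors were independent and random."""
    acc_a = (preds_a == labels).mean()
    acc_b = (preds_b == labels).mean()
    if num_classes == 2:
        # In binary classification, both being wrong means they agree.
        both_wrong_agree = (1 - acc_a) * (1 - acc_b)
    else:
        # With many classes, both wrong and agreeing happens with chance roughly 1/C.
        both_wrong_agree = (1 - acc_a) * (1 - acc_b) / num_classes
    baseline = acc_a * acc_b + both_wrong_agree
    observed = (preds_a == preds_b).mean()
    return observed - baseline
```

With accuracies 0.9 and 0.75 on a binary task, the baseline is 0.7, matching the calculation above; a markedly positive return value signals excess model similarity.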
We can look at a set of ImageNet models and ask, what is the average similarity between pairs of models from this set. ImageNet models turn out to have higher than expected model similarity. This has interesting theoretical consequences. Evaluating models with high similarity is like evaluating fewer models. There can only be so many functions that agree a lot. As a result, it’s possible to prove stronger generalization bounds for the holdout method for model families of high similarity. This works in the non-adaptive as well as the adaptive case.
To summarize, dataset and code sharing lead to high model similarity, which in turn mitigates test set overuse somewhat.
In Data Science at the Singularity, Donoho coins the term frictionless reproducibility to describe the advances in the data and software ecosystem that power applied machine learning research. Frictionless reproducibility has three components:

- Data sharing: datasets are openly available for anyone to download and use.
- Code sharing: the code behind a result is available and can be re-executed without obstacles.
- Competitive challenges: a shared task with a fixed dataset, an agreed-upon metric, and a public leaderboard.
These aspects of machine learning research deserve attention. Whereas model architectures and optimization methods often take the limelight, the software ecosystem around machine learning benchmarks is the quiet infrastructure that we all use but often take for granted. And it’s a precious public infrastructure that the scientific community might lose again, if software moves behind the closed doors of a few hegemonic corporations.
When it works, frictionless reproducibility acts as a lubricant in the gears of a scientific machine built after the iron rule. If we’re going to settle all disputes by competitive empirical testing, it helps that we can carry out these empirical tests efficiently. As the cost of empirical dispute resolution approaches zero, the field can iterate through hypotheses at a pace never before seen in the history of science. Lower friction makes the wheels of the machine spin faster. But who sets its direction?
Take the data science platform Kaggle as a case study. In its heyday, Kaggle organized hundreds of popular data science competitions, growing from competition platform into a general data science community hub. In 2025, Kaggle issued nearly $4 million in prize money.9 Kaggle checks all three boxes of frictionless reproducibility. The datasets are easily available. The code is ready to run; Kaggle even provides Docker images—virtual containers to run reproducible workflows. A steady stream of challenges with public leaderboards gamify the experience, while cash prizes incentivize participants to put in the work. Surely, Kaggle excels at all three criteria for frictionless reproducibility.
Kaggle and similar platforms undoubtedly promote productivity by lowering friction. But there’s a difference between productivity and scientific advances. Productivity doesn’t guarantee scientific value. Indeed, Kaggle competitions have rarely produced novel science. Winning submissions in competitions often combine standard methods, such as gradient boosting, problem-specific feature engineering, and ensembling. The primary model-building lesson from years of ML competitions is that gradient boosting is the go-to choice for most competitions.10 This is an intriguing meta-finding about machine learning, but competitions rarely result in academic papers.
Code from competitions tends to have little reuse value. Companies sponsoring competitions rarely end up using the winning submissions in their own systems. Netflix notably discarded the code from its $1 million Netflix Prize competition, one of the most widely publicized machine learning competitions, held years before Kaggle launched.11 Companies were more interested in branding and recruiting than in the code itself. Kaggle recognized this early on. Before its acquisition by Google in 2017, the startup had pivoted from competition platform to community hub. What mattered wasn’t the output of the competitions, but the people competing in them.
Lack of friction doesn’t on its own create value. Conversely, scientists often overcome significant friction in pursuit of scientific advances. Return to ImageNet. Early work on ImageNet was relatively high friction. Predating the software we use today, building the breakthrough AlexNet architecture required writing low-level GPU optimizations from scratch. Especially in the early days, training and testing on ImageNet was slow. Reproducing the results from other labs was tedious, if possible at all. Yet, progress was steepest in those early days. ResNets, one of the greatest highlights of the era, came long before Hugging Face, PyTorch, or Weights and Biases. The development of residual networks even predates the first public release of TensorFlow. The first release of the ResNet code was written in Caffe, an academic deep learning framework that by now mostly senior researchers remember. The early ImageNet model development happened despite relatively high friction. The same is true for the early large language model development that we’ll get to in Chapter 10.
The difference between a Kaggle competition and science more broadly clarifies an important aspect of the iron rule. On the one hand, ImageNet catalyzed the deep learning revolution instead of promoting endlessly tweaked feature transforms. Kaggle competitions, on the other hand, often result in tweaked standard methods. These two facts seem at odds with each other, since Kaggle and ImageNet both run on competitive empirical testing. What’s different? And why doesn’t competitive empirical testing in the form of scientific benchmarking descend into mindless metric optimizing, a recipe for Goodhart’s law?
While the iron rule dictates that scientists resolve disputes via competitive empirical testing, it does not dictate what motivates scientists in the first place. Scientists often passionately hold beliefs that motivate what they do, what they try out, and what they build. The real prize is that their idea wins—not that they reach the top of the leaderboard. In a Kaggle competition, in contrast, the cash prize is directly equivalent to being at the top of the leaderboard.
Competitive empirical testing is an organizing principle for modern science. It prescribes a regulatory mechanism for scientific acceptance. But it doesn’t prescribe what motivates scientists. Nor does it mean that tweaking numbers on datasets necessarily promotes scientific progress. There’s no inherent direction to the iron rule. Scientists hold the steering wheel.
Several theoretical results in this chapter speak to the strong internal validity of machine learning benchmarks. The holdout method and its variants often do better than the results from Chapters 4 and 5 suggest. Key to these results are psychological and social factors in how individual scientists and the community as a whole utilize benchmarking. Focusing on beating the previous best promotes benchmark longevity. Mental heuristics and shortcuts can thwart overfitting. Practices like code sharing give test sets higher mileage. Even seemingly mundane problem artifacts can curb the possibility of climbing a leaderboard through spurious improvements. All of these results give guarantees that limit the extent to which we can deliberately or inadvertently climb a leaderboard with actions that don’t correspond to progress on the task. Benchmarks turn out to be more resilient than we had reason to believe. But this resilience alone isn’t enough.
Successful benchmark design has two components. First, we need to make sure that there is at least one interesting solution to the problem. In some domains—like image classification—this is easy to guarantee, because we can verify that humans solve the task with high accuracy. In other domains, this is not as straightforward as it sounds. The Fragile Families Challenge, for example, was a well-run machine learning competition about predicting life outcomes. But the dataset seemed to contain no signal that would let models beat baseline predictions.12 When benchmarks go wrong, it’s often because there isn’t a good solution.
This first part of benchmark design is a matter of domain knowledge. We need to have substantive reasons to believe that there is an interesting solution to the benchmark problem. There is no universal, meta-scientific formula for guaranteeing that a benchmark will yield an interesting solution. Scientists have to leap into the unknown. Benchmarking itself is a meta-scientific heuristic. It requires trial and error.
Benchmarks can also go wrong, if there are multiple ways of being good at the task, of which only some are interesting. In this case, we might end up with the uninteresting solution. If you unleash the community on a benchmark, you will eventually find a risk minimizer, but you can’t choose which one you’ll get. You need to set up the problem so that you’re happy with any risk minimizer. For example, there seems to be no easy way to do image classification on a sufficiently large and diverse dataset. Whatever the dataset—be it ImageNet or ImageNot—achieving high accuracy forces the model to do something impressive.
For the second component of benchmarking, we need to make sure that climbing the leaderboard corresponds to decreasing risk on the task. If there are easy ways to climb the leaderboard without improving on the task, it’s likely that one way or the other competitors will—knowingly or unknowingly—exploit these channels. This is where the results of this chapter come in. There are several factors that limit the scope of spurious improvements. Put together, if we curb spurious improvements and there is at least one good solution, we have reason to hope that leaderboard climbing converges to the good solution.
Benchmarking is no silver bullet. In fact, the second part of this book, starting with Chapter 10, examines several ways that benchmarking can fail. Scientists come around to benchmarking—often reluctantly—for lack of a better option. Ultimately, its success depends on psychological and sociological factors within the research community. That is part of what makes a metascience of benchmarking challenging. It isn’t a purely statistical problem. Nevertheless, the last couple of chapters covered empirical and theoretical tools to better understand benchmarking by focusing on how the scientific community works.
Blum and Hardt introduced the Ladder algorithm and the definition of leaderboard error.13 Hardt gave improved bounds on the leaderboard error using noise addition ideas from differential privacy.14 Chazelle coined the term natural algorithms to describe algorithms that nature implements on its own.15 My use of the term social algorithm to describe the Ladder mechanism draws on this idea.
Zrnic and Hardt made the connection between cognitive biases and adaptive overfitting.16 Arora and Zhang contributed the Rip Van Winkle thought experiment and calculated the compression bounds for ImageNet.17 Compression bounds in the context of adaptive data analysis appeared in the work of Dwork et al.18 Mania et al.19 study the role of model similarity in mitigating test set overuse. Mania and Sra offer a theoretical explanation of the on the line phenomenon in terms of model similarity and distributional closeness.20
Feldman, Frostig, and Hardt studied the problem of adaptive test set reuse in the multi-class setting, providing both the theoretical bounds, and the empirical results for ImageNet that I discussed.21 In particular, they showed how an adaptive analyst can force an error of \Omega(\sqrt{k/nC^2}). This bound follows from a boosting argument that is, roughly speaking, a more sophisticated version of the ensembling argument in Chapter 5. Acharya and Suresh22 achieved the nearly optimal bound \Omega(\sqrt{k/nC}) using a sophisticated adaptive strategy.
Donoho’s article Data Science at the Singularity develops the argument for frictionless reproducibility, emphasizing its role in the progress of empirical machine learning.23 The machine learning competitions ecosystem is alive and well in 2025, according to a recent study.24 In particular, competitions have become more diverse in terms of data modalities and methods.
Simon developed bounded rationality in the 1950s.25 Kahneman and Tversky developed and popularized the idea of behavioral biases.26 These ideas have been hugely influential, spawning research fields in several disciplines. Gigerenzer argues against the tendency in behavioral economics to overemphasize biases as flaws in human decision making, a problem he calls bias bias, pointing out how human heuristics often successfully navigate complex scenarios.27
Zrnic and Hardt, “Natural Analysts in Adaptive Data Analysis”.↩︎
Vasudevan et al., “When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on Imagenet”.↩︎
Feldman, Frostig, and Hardt, “The Advantages of Multiple Classes for Reducing Overfitting from Test Set Reuse”.↩︎
Carlens, “State of Machine Learning Competitions in 2025”.↩︎
Fernández-Delgado et al., “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?”; Grinsztajn, Oyallon, and Varoquaux, “Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data?”↩︎
Johnston, “Netflix Never Used Its $1 Million Algorithm Due to Engineering Costs”.↩︎
Salganik et al., “Measuring the Predictability of Life Outcomes with a Scientific Mass Collaboration”.↩︎
Chazelle, “Natural Algorithms”; Chazelle, “Natural Algorithms and Influence Systems”.↩︎
Zrnic and Hardt, “Natural Analysts in Adaptive Data Analysis”.↩︎
Dwork et al., “Generalization in Adaptive Data Analysis and Holdout Reuse”.↩︎
Mania et al., “Model Similarity Mitigates Test Set Overuse”.↩︎
Mania and Sra, “Why Do Classifier Accuracies Show Linear Trends Under Distribution Shift?”↩︎
Feldman, Frostig, and Hardt, “The Advantages of Multiple Classes for Reducing Overfitting from Test Set Reuse”; Feldman, Frostig, and Hardt, “Open Problem”.↩︎
Acharya and Suresh, “Optimal Multiclass Overfitting by Sequence Reconstruction from Hamming Queries”.↩︎
Carlens, “State of Machine Learning Competitions in 2025”.↩︎
Simon, “A Behavioral Model of Rational Choice”; Simon, Models of Man; Simon, “Theories of Bounded Rationality”.↩︎
Tversky and Kahneman, “Judgment Under Uncertainty”; Kahneman, Slovic, and Tversky, Judgment Under Uncertainty; Kahneman and Tversky, “Prospect Theory”.↩︎