Machine learning turns on one simple trick: Split your data into training and test sets. Anything goes on the training set. Rank models on the test set. Let model builders compete. Call it a benchmark.
Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of machine learning benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart’s law cautions against applying competitive pressure to statistical measurement. Over time, they say, researchers overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities, deceiving us especially when comparing humans and machines. Top off the list of issues with a slew of reasons why things don’t transfer from benchmarks to the real world.
These scorching critiques go hand in hand with serious ethical objections. Benchmarks reinforce and perpetuate biases in our representation of people, social relationships, culture, and society. Worse, the creation of massive human-annotated datasets extracts labor from a marginalized workforce excluded from the economic gains it enables.
All of this is true.
Many have said it well. Many have argued it convincingly. I’m particularly drawn to the claim that benchmarks serve industry objectives, giving big tech labs a structural advantage. The case against benchmarks is clear, in my view.
What’s far less clear is the scientific case for benchmarks.
It’s undeniable that benchmarks have been successful as a driver of progress in the field. ImageNet was inseparable from the deep learning revolution of the 2010s, with companies competing fiercely over the best dog breed classifiers. The difference between a Blenheim Spaniel and a Welsh Springer became a matter of serious rivalry. A decade later, language model benchmarks reached geopolitical significance in the global competition over artificial intelligence. Tech CEOs now recite their company’s number on MMLU—a set of 14,042 college-level multiple-choice questions—in presentations to shareholders. I’m writing after news broke that David beat Goliath on reasoning benchmarks, a sensation that shook global stock markets.
Benchmarks come and go, but their centrality hasn’t changed. Competitive leaderboard climbing has been the main way machine learning advances.
If we accept that progress in artificial intelligence is real, we must also accept that benchmarks have, in some sense, worked. But the fact that benchmarks worked is more of a hindsight observation than a scientific lesson. Benchmarks emerged in the early days of pattern recognition. They followed no scientific principles. To the extent that benchmarks had any theoretical support, that theory was readily invalidated by how people used benchmarks in practice. Statistics prescribed locking test sets in a vault, but machine learning practitioners did the opposite. They put them on the internet for everyone to use freely. Popular benchmarks draw millions of downloads and evaluations as model builders incrementally compete over better numbers.
Benchmarks are the mistake that made machine learning. They shouldn’t have worked and, yet, they did. In this book, my goal is to shed light on why benchmarks work and what for.
The first part of this book covers foundations, some mathematical, some empirical. The first two chapters after the introduction add just enough standard background material to make the book self-contained. Here, I stick closely to the canon.
The next few chapters cover the train/test split, called the holdout method. I start with the classical guarantees for the holdout method and related tools in the family of cross-validation methods. These guarantees, however, don’t apply to how people use the holdout method in practice. The problem is adaptivity: Repeated use creates a feedback loop between the model and the data that invalidates traditional analysis. This problem of adaptivity is a cousin of Freedman’s paradox, a conundrum that has vexed statisticians since the 1980s. Freedman noticed how easily data-dependent statistical analyses can go wrong.
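To make the problem of adaptivity concrete, here is a minimal simulation of my own construction, not an example from the literature: the features carry no information about the labels, so any classifier built from them has a true accuracy of 50 percent, yet an analyst who repeatedly consults the test set to decide which features to keep walks away with a test-set score well above chance.

```python
# A minimal sketch of adaptive test-set reuse (my own toy construction).
# Features and labels are independent, so the true accuracy of any linear
# classifier here is 50%. Greedy feature selection guided by the test set
# nevertheless inflates the measured test accuracy.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 200, 200, 500

X_train = rng.standard_normal((n_train, n_features))
y_train = rng.integers(0, 2, n_train) * 2 - 1
X_test = rng.standard_normal((n_test, n_features))
y_test = rng.integers(0, 2, n_test) * 2 - 1

def test_accuracy(weights):
    return np.mean(np.sign(X_test @ weights + 1e-12) == y_test)

# Adaptive analyst: add one feature at a time, keep it only if the
# test-set score improves -- a feedback loop between model and test data.
weights = np.zeros(n_features)
score = test_accuracy(weights)
for j in range(n_features):
    candidate = weights.copy()
    candidate[j] = X_train[:, j] @ y_train  # correlation with labels on training data
    cand_score = test_accuracy(candidate)
    if cand_score > score:
        weights, score = candidate, cand_score

print(f"Test accuracy after adaptive selection: {score:.2f}")  # typically well above 0.5
print("True accuracy of any such classifier:   0.50")
```

The inflation comes entirely from the feedback loop between the analyst’s choices and the fixed test set, the same mechanism Freedman pointed to.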
Freedman’s observation foreshadowed an ongoing scientific crisis in the statistical sciences. Evidently, successful replication is limited and false discovery common when researchers compete on the basis of statistics, such as p-values. But p-values aren’t the main culprit. Researcher degrees of freedom always seem to outwit statistical measurement. Indeed, Goodhart’s law predicts that statistical measurement breaks down under competitive pressure. What does that say about the benchmarking ecosystem, where researchers compete over statistics computed on a fixed test set?
The preconditions for crisis exist in machine learning, too. For one, it shares the Achilles’ heel of statistical measurement with other empirical sciences. In addition, machine learning operates in an ecosystem of maximal researcher degrees of freedom, rapid publication, and weak peer review. It might come as no surprise that absolute accuracy numbers—thought of as measurements of some capability—are woefully unreliable, failing to replicate even under similar conditions. Nevertheless, the situation in machine learning is markedly different. Model rankings replicate to a surprising degree. More specifically, three empirical facts emerge from the ImageNet era, which the first part of the book examines in detail.
If machine learning appears to have thwarted a scientific crisis, the question is why. I argue that the social norms and practices of the community, rather than statistical methodology alone, are key to understanding the function of benchmarks. A fundamental result shows that if the community only cares about identifying the best-performing model at any point in time, the holdout method enjoys surprisingly strong theoretical guarantees.
Summarizing what we have so far, model rankings—rather than model evaluations—are the primary scientific export of machine learning benchmarks.
The first half of the book draws on lessons primarily from the ImageNet era, that is, roughly the decade following 2012. Characteristic of the ImageNet era was a single central benchmark that featured both a training set and a test set. The creators took care to clean labels thoroughly through aggregation. A chapter on labeling and annotation shows why some common practices of label cleaning are inefficient when the primary goal is model ranking.
The second part of this book is about recent developments around generative models, in particular large language models. I cover the basics of large language models, scaling laws, emergent capabilities, and alignment methods, providing the background necessary to appreciate the challenges of benchmarking in this day and age.
The new era departs from the old in some significant ways.
First, models train on the internet, or at least on massive, minimally curated web crawls. At the point of evaluation, we therefore don’t know and can’t control what training data the model saw. This turns out to have profound implications for benchmarking. The extent to which a model has encountered data similar to the test task during training skews model comparisons and threatens the validity of model rankings. A worse model may simply have crammed better for the test. Would you prefer the worse student who came better prepared for the exam, or the better student who was less prepared? If you prefer the latter, then you’ll need to adjust for the difference in test preparation. Thankfully, this can be done by fine-tuning each model on the same task-specific data before evaluation, without the need to train from scratch.
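As a sketch of this adjustment, the following few lines spell out the protocol using a hypothetical model interface; the `finetune` and `accuracy` methods are stand-ins of my own, not any real library’s API.

```python
# A sketch of equal test preparation before ranking. The `finetune` and
# `accuracy` methods are hypothetical stand-ins, not a real API.
def rank_with_equal_preparation(models, task_train, task_test):
    """Fine-tune every model on the same task data, then rank on the test set."""
    scores = {}
    for name, model in models.items():
        prepared = model.finetune(task_train)        # identical preparation for all
        scores[name] = prepared.accuracy(task_test)  # evaluate only after preparation
    # Sort best-first; the ranking now reflects ability given equal preparation,
    # not differences in what each model happened to see during pretraining.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The point of the design is that the test set only enters after every model has received identical task-specific preparation.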
Second, models no longer solve a single task, but can be prompted to tackle pretty much any task. In response, multi-task benchmarks have emerged as the de facto standard, providing a holistic evaluation of recent models by aggregating performance across numerous tasks into a single ranking. Aggregating rankings, however, is a thorny problem in social choice theory that has no perfect solution. Working from an analogy between multi-task benchmarks and voting systems, ideas from social choice theory reveal inherent trade-offs that multi-task benchmarks face. Specifically, greater task diversity necessarily comes at the cost of greater sensitivity to irrelevant changes. For example, adding weak models to popular multi-task benchmarks can change the order of the top contenders. The familiar stability of model rankings, characteristic of the ImageNet era, therefore does not extend to multi-task benchmarks in the LLM era.
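To see this sensitivity concretely, consider a toy calculation with made-up scores, not numbers from any real benchmark: under mean-rank aggregation across five tasks, adding a weak model C reverses the order of A and B, even though no individual comparison between A and B changed.

```python
# Toy illustration (my own numbers, not from any real benchmark) of how
# mean-rank aggregation across tasks can violate independence of irrelevant
# alternatives: adding a weak model C flips the order of A and B.
import numpy as np

# Per-task scores for each model (higher is better), five tasks.
scores = {
    "A": [0.9, 0.9, 0.9, 0.3, 0.3],
    "B": [0.8, 0.8, 0.8, 0.9, 0.9],
    "C": [0.1, 0.1, 0.1, 0.6, 0.6],  # weak model, last on aggregate
}

def mean_ranks(models):
    table = np.array([scores[m] for m in models])
    # Rank within each task: 1 = best score on that task.
    ranks = table.shape[0] - table.argsort(axis=0).argsort(axis=0)
    return {m: ranks[i].mean() for i, m in enumerate(models)}

print(mean_ranks(["A", "B"]))       # A: 1.4, B: 1.6 -> A ranked first
print(mean_ranks(["A", "B", "C"]))  # A: 1.8, B: 1.6, C: 2.6 -> B ranked first
```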
Unlike ImageNet era image classifiers, chatbots interact with hundreds of millions of people globally. The massive reach of AI deployments has repercussions on evaluation. Models deployed at scale always influence future data, a phenomenon called performativity. Performativity challenges evaluation, since there is no longer model-independent data. The notion of ground truth—time-honored bedrock of evaluation—unravels when data and model create a closed feedback loop. Research on performativity sheds light on the problem of data feedback loops that many see as a fundamental risk to the machine learning ecosystem. Dynamic benchmarks try to make a virtue out of data feedback loops by creating benchmarks that evolve as models improve.
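A toy simulation, again my own construction rather than an example from the literature, illustrates the basic mechanism: a deployed model recommends one of two options, a fraction of users adopt the recommendation, and the data used to evaluate the model in the next round already bears its fingerprint.

```python
# Toy illustration of performativity (my own construction): the deployed
# model's recommendation shifts future user behavior, so its measured
# accuracy drifts upward without the model changing at all.
import numpy as np

rng = np.random.default_rng(2)
n_users, influence = 100_000, 0.1  # 10% of users adopt the recommendation
p = 0.55                           # current share of users preferring option 1
prediction = 1                     # the deployed model recommends option 1

for t in range(5):
    choices = rng.binomial(1, p, n_users)             # this round's "ground truth"
    accuracy = np.mean(choices == prediction)
    print(f"round {t}: share choosing option 1 = {p:.3f}, accuracy = {accuracy:.3f}")
    p = (1 - influence) * p + influence * prediction  # deployment shapes future data
```

The measured accuracy climbs round after round even though the model never changes; the ground truth is moving toward it.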
The final problem benchmarking faces is an existential one. As model capabilities exceed those of human evaluators, researchers are running out of ways to test new models. There’s hope that models might be able to evaluate each other. But the idea of using models as judges runs into some serious hurdles. LLM judges are biased, unsurprisingly, in their own favor. Intriguing recent methods from statistics promise to debias model judgments using only a few human ground truth labels. Unfortunately, at the evaluation frontier—where new models are at least as good as the judge—even the optimal debiasing method is no better than collecting twice as many ground truth labels for evaluation.
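The flavor of these methods is easy to convey in a small synthetic example of my own making, loosely in the spirit of prediction-powered estimation: an abundant but inflated judge is corrected with a small sample of human labels.

```python
# A minimal sketch (synthetic data, my own numbers) of debiasing an LLM judge
# with a small human-labeled subsample to estimate true model quality.
import numpy as np

rng = np.random.default_rng(1)
n_total, n_human = 10_000, 200

# Ground truth approval for every response; humans would approve 70%.
human = rng.binomial(1, 0.70, n_total)
# The judge agrees with human approvals but also approves about a third of
# the bad responses, inflating its raw approval rate to roughly 0.80.
judge = np.maximum(human, rng.binomial(1, 1 / 3, n_total))

labeled = rng.choice(n_total, n_human, replace=False)  # the few human labels we pay for

naive = judge.mean()                                    # biased estimate (~0.80)
correction = (human[labeled] - judge[labeled]).mean()   # judge bias estimated from humans
debiased = naive + correction                           # unbiased for the 0.70 target

print(f"judge only: {naive:.3f}  debiased: {debiased:.3f}  truth: 0.70")
```

As the chapter discusses, how much this trick saves hinges on how well the judge tracks human judgment at the frontier.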
And so… will our old engine of progress grind to a halt?
In a moment of crisis, we tend to accelerate. What if instead we stepped back and asked why we expected benchmarks to work in the first place—and what for? This question leads us into uncharted territory. For the longest time, we took benchmarks for granted and didn’t bother to work out the method behind them. We got away with it mostly by sheer luck, but we might not this time. Over the last decade, however, a growing body of work has begun to map out the foundations of a science of machine learning benchmarks. What emerges is a rich set of observations—both theoretical and empirical—raising intriguing open problems that deserve the community’s attention. If benchmarks are to serve us well in the future, we must put them on solid scientific ground. Supporting this development is the goal of this book.
There are many excellent books on machine learning; I’ll highlight several of them throughout. This book, however, covers a topic central to the development of machine learning that is largely missing from all of them. Existing textbooks overwhelmingly focus on the three classical pillars of supervised learning: representation, optimization, and generalization. These topics are important. But benchmarks are as vital to the functioning of the machine learning ecosystem as any of these. It’s impossible to do machine learning without using the holdout method and benchmarks extensively. For the longest time, the topic has primarily been the purview of blog posts, Reddit threads, and industry chatter. Academic conferences, such as NeurIPS, have finally embraced the topic as part of the core discipline. But as a subject of scientific study, benchmarks still lack a foundation.
This is a book for all students and researchers who want to learn about machine learning benchmarks. As such, it’s suitable for self-study. Some mathematical training is required, mostly a bit of probability theory and statistics. The math is at the upper undergraduate level. I’d like to think, though, that a much broader audience can skip some of the math and still get much out of it by reading the surrounding narrative. A consistent story runs throughout the book; the analytical index summarizes key points from each chapter.
Instructors may use this book alongside their preferred machine learning text to incorporate benchmarks into their curriculum. I took a conservative approach to the foundations by using the standard supervised learning framework, thus making the book easily compatible with other textbooks. While most instructors will likely integrate this book with other course materials, it can also support a standalone class. I have taught a one-semester course based on this material, with each chapter suited to a 90-minute lecture. A full set of homework exercises, including coding, data work, and experimentation in the Python machine learning ecosystem, will be available online.
Theory and observation run closely together throughout this book. It’s neither a theory book nor a practical guide to machine learning. I use theory where it illuminates empirical phenomena, while recognizing that not every plot in the literature reflects one. I highlight robust empirical facts, while avoiding less established observations, speculations, and practical details that may be too ephemeral for a textbook.
This book is fundamentally about why benchmarks work. An answer to this question necessarily also reveals important limitations of benchmarks. There’s a lot more, however, that goes into the successful design of a benchmark or the execution of a machine learning competition in practice that I don’t cover. Likewise, there’s a lot more to the broader topic of dataset creation, as well as the broader topic of evaluation. I give pointers to additional reading throughout.
My interest in machine learning benchmarks dates back to collaborations at the Simons Institute for the Theory of Computing in the Fall of 2013. These collaborations led to the development of adaptive data analysis, an area of theoretical computer science that studies the challenges of data-dependent statistical analyses. I’m indebted to my close collaborators at the time, Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer Reingold, Aaron Roth, and Jon Ullman, who all shaped my thinking on the topic. Avrim Blum was the first to make the connection between adaptive data analysis and machine learning benchmarks, conjecturing that dataset reuse was less of a concern when the only goal is to identify the best performing model. This observation has been deeply influential for me. We collaborated to formalize and prove this conjecture, and the results form a good part of a chapter in this book.
Thanks to an invitation from Percy Liang, I had the good fortune to moderate a panel on “The Role of Benchmarks in the Scientific Progress of Machine Learning” at NeurIPS 2021. The participants Lora Aroyo, Sam Bowman, Isabelle Guyon, and Joaquin Vanschoren contributed significant perspectives on the topic that had a lasting influence on me. I frequently come back to my 14-page transcript from the panel. At various points over the last ten years, I benefited from conversations with Sanjeev Arora and Sham Kakade about topics relating to this book. I’m thankful to Ben Recht for our discussions about benchmarks in preparation for our book Patterns, Predictions, and Actions, which informed my perspective on the history of pattern recognition and benchmarks. I learned a lot from Ludwig Schmidt about robustness, replication, and distribution shift in machine learning. Ludwig also made the connection between Strevens’s The Knowledge Machine and machine learning research.
The second part of the book significantly draws on contributions from my recent collaborators Rediet Abebe, Ricardo Dominguez-Olmedo, Florian Dorner, Vivian Nastl, Celestine Mendler-Dünner, Olawale Salaudeen, Ali Shirali, and Guanhua Zhang.
I’m grateful for the participants of my class on this topic in the Fall of 2024 at the University of Tübingen. Special thanks to the graduate instructors Ricardo Dominguez-Olmedo, Tom Sühr, and Guanhua Zhang.
Throughout, I used ChatGPT for spelling, grammar, and TikZ figures. No unicorns were harmed.