The Emerging Science of Machine Learning Benchmarks

Moritz Hardt

Why benchmarks advanced machine learning, the crisis they now face, and the science we need to sustain progress

Contents

1
From its roots, machine learning embraces the anything-goes principle of scientific discovery. Machine learning benchmarks become the iron rule that tames the anything goes. But after decades of service, a crisis grips the benchmarking enterprise.
2
The mathematical foundations of machine learning follow the astronomical conception of society: populations are probability distributions. Optimal predictors minimize expected loss on these distributions.
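
In symbols, a minimal sketch of this formulation (the notation is mine, not quoted from the chapter): a population is a probability distribution P over instance-label pairs (x, y), and an optimal predictor solves

\[
f^\star \in \operatorname*{argmin}_{f}\;\mathbb{E}_{(x,y)\sim P}\big[\ell(f(x), y)\big],
\]

where \ell is a loss function such as the zero-one loss.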
3
A single statistical problem illuminates much of the mathematical toolkit necessary for benchmarking. The key lesson is that sample requirements grow quadratically in the inverse of the difference we try to detect.
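
To make the quadratic scaling concrete, here is a minimal sketch in Python (my own illustration via Hoeffding's inequality; the chapter may develop the bound differently):

import math

def samples_needed(eps: float, delta: float = 0.05) -> int:
    # Two-sided Hoeffding bound: n >= ln(2/delta) / (2 * eps**2) samples
    # estimate an accuracy to within +/- eps with probability 1 - delta.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Halving the difference we want to detect roughly quadruples the sample size.
for eps in (0.04, 0.02, 0.01):
    print(f"eps = {eps}: n >= {samples_needed(eps)}")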
4
The holdout method separates training and testing data, permitting anything goes on the training data, while enforcing the iron rule on the testing data. Not all uses of the holdout method are alike.
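
As a minimal sketch of the protocol (assuming scikit-learn; the dataset, model, and split sizes are arbitrary choices of mine):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Anything goes on the training split: feature engineering, tuning, retraining.
# The iron rule governs the held-out split: touch it once, report the number.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))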
5
Statistics prescribes the iron vault for test data. But the empirical reality of machine learning benchmarks couldn’t be further from the prescription. Repeated adaptive testing brings theoretical risks and practical power.
6
A replication crisis has long gripped the empirical sciences. Statistical practice is vulnerable for fundamental reasons. Under competition, researcher degrees of freedom outwit statistical measurement.
7
The preconditions for crisis exist in machine learning, too. And yet, the situation in machine learning is different. While accuracy numbers don’t replicate, model rankings replicate to a significant degree.
8
If machine learning thwarted scientific crisis, the question is why. Some powerful explanations emerge. The key lies in the social norms and practices of the community rather than in statistical methodology.
9
Labeling and annotation (coming soon)
If the holdout method is the greatest unsung hero, data annotation is not far behind. But conventional wisdom clouds the subtle role that annotation plays in benchmarking.
10
Generative models (coming soon)
The ImageNet era ends as attention shifts to powerful generative models trained on the internet. The new era also marks a turning point for machine learning benchmarks.
11
Post-training (coming soon)
After training, alignment fits pretrained models to human preferences. At a fraction of the cost of pretraining, alignment transforms evaluation results. How so little makes such a big difference teaches us about models and benchmarks.
12
On training data (coming soon)
Benchmarking works best if all models have the same training data. Model builders today optimize training data mixes with the test task in mind, leading to confounded evaluations. A simple adjustment provides relief.
13
The problem of aggregation (coming soon)
Multi-task benchmarks promise a holistic evaluation of complex models. Social choice theory reveals limitations in aggregate benchmarks. Greater diversity comes at the cost of greater sensitivity to artifacts.
14
When the model moves the data (coming soon)
Models deployed at scale always influence future data, a phenomenon called performativity. Performativity sheds light on the problem of data feedback loops. Dynamic benchmarks try to make a virtue out of it.
15
Evaluation at the frontier (coming soon)
Machine learning is running out of data. As models gain in capabilities, human supervision increasingly becomes a bottleneck. Researchers rush to find ways to have models supervise other models.
16
Outlook (coming soon)
Competition over AI models reaches geopolitical intensity. Rapidly changing model capabilities and expanding applications put unprecedented pressure on the iron rule. And yet, there are good reasons why the iron rule refuses to retire.

Contact

This is a work in progress. I'll add new chapters throughout the summer of 2025.

Reach out at contact@mlbenchmarks.org for feedback, questions, and suggestions. Please let me know if you find any errors. I appreciate your comments.

Cite

BibTeX
@misc{hardt2025emerging,
  author = {Moritz Hardt},
  title = {The Emerging Science of Machine Learning Benchmarks},
  year = {2025},
  howpublished = {Online at \url{https://mlbenchmarks.org}},
  note = {Manuscript}
}