Contents
1
From its roots, machine learning embraces the anything goes
principle of scientific discovery. Machine learning
benchmarks become the iron rule to tame the anything goes.
But after decades of service, a crisis grips the
benchmarking enterprise.
2
The mathematical foundations of machine learning follow the
astronomical conception of society: Populations are
probability distributions. Optimal predictors minimize loss
functions on a probability distribution.
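In symbols (a standard formulation, sketched here rather than quoted from the chapter): a population is a probability distribution P over pairs (x, y), and an optimal predictor minimizes expected loss under P.

\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% A population is a distribution P over pairs (x, y); an optimal
% predictor f minimizes the expected loss \ell under P.
\[
  f^{\star} \in \arg\min_{f} \; \mathbb{E}_{(x,y) \sim P} \bigl[ \ell(f(x), y) \bigr]
\]
\end{document}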
3
A single statistical problem illuminates many of the
mathematical tools necessary for benchmarking. The key
lesson is that sample requirements grow quadratically in
the inverse of the difference we try to detect.
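A back-of-the-envelope illustration of that lesson (my sketch, not the chapter's derivation): the hypothetical simulation below doubles the test-set size until a model with accuracy 0.5 + eps is reliably separated from a coin flip. Halving eps roughly quadruples the required size.

import numpy as np

rng = np.random.default_rng(0)

def samples_to_detect(eps, reps=200):
    # Find, by doubling, a test-set size n at which a one-sided
    # normal-approximation test at level 0.05 rejects "accuracy = 0.5"
    # for a model of true accuracy 0.5 + eps in at least 80% of runs.
    n = 16
    while True:
        threshold = 0.5 + 1.645 * 0.5 / np.sqrt(n)
        power = (rng.binomial(n, 0.5 + eps, size=reps) / n > threshold).mean()
        if power >= 0.8:
            return n
        n *= 2

for eps in [0.08, 0.04, 0.02, 0.01]:
    print(f"eps={eps:.2f}: n ~ {samples_to_detect(eps)}")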
4
The holdout method separates training and testing data,
permitting anything goes on the training data, while
enforcing the iron rule on the testing data. Not all uses
of the holdout method are alike.
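The mechanics of the split are simple; here is a minimal sketch (illustrative code, not taken from the chapter), with numpy arrays X and y standing in for a dataset.

import numpy as np

def holdout_split(X, y, test_frac=0.2, seed=0):
    # Partition the data once. Anything goes on the training split;
    # the test split is touched only to report the final score.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_frac))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]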
5
Statistics prescribes the iron vault for test data. But the
empirical reality of machine learning benchmarks couldn’t
be further from the prescription. Repeated adaptive testing
brings theoretical risks and practical power.
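A toy illustration of the theoretical risk (my construction, not the chapter's): reuse the same test labels to select among many candidate models, and the best reported number inflates even when every candidate is pure noise.

import numpy as np

rng = np.random.default_rng(0)
n_test, n_queries = 2000, 1000

# Fixed holdout labels. Every "model" below guesses at random,
# so each one's true accuracy is exactly 0.5.
labels = rng.integers(0, 2, n_test)
best = max(
    (rng.integers(0, 2, n_test) == labels).mean()
    for _ in range(n_queries)
)
print(f"best accuracy after {n_queries} adaptive queries: {best:.3f}")
# Typically around 0.54, although no model beats chance.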
6
A replication crisis has long gripped the empirical
sciences. Statistical practice is vulnerable for
fundamental reasons. Under competition, researcher degrees
of freedom outwit statistical measurement.
7
The preconditions for crisis exist in machine learning,
too. And yet, the situation in machine learning is
different. While accuracy numbers don’t replicate, model
rankings replicate to a significant degree.
8
If machine learning thwarted scientific crisis, the
question is why. Some powerful explanations emerge. Key are
the social norms and practices of the community rather than
statistical methodology.
9
Labeling and annotation (coming soon)
If the holdout method is the greatest unsung hero, data
annotation is not far behind. But conventional wisdom
clouds the subtle role that annotation plays for
benchmarking.
10
Generative models (coming soon)
The ImageNet era ends as attention shifts to powerful
generative models trained on the internet. The new era also
marks a turning point for machine learning benchmarks.
11
Post-training (coming soon)
After training, alignment fits pretrained models to human
preferences. At a fraction of the cost of pretraining,
alignment transforms evaluation results. How so little
makes such a big difference teaches us about models and
benchmarks.
12
On training data (coming soon)
Benchmarking works best if all models have the same
training data. Model builders today optimize training data
mixes with the test task in mind, leading to confounded
evaluations. A simple adjustment provides relief.
13
The problem of aggregation (coming soon)
Multi-task benchmarks promise a holistic evaluation of
complex models. Social choice theory reveals limitations in
aggregate benchmarks. Greater diversity comes at the cost
of greater sensitivity to artefacts.
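A tiny numeric illustration of that sensitivity (hypothetical scores, chosen only for the example): rank two models by their average score across tasks, and the leader flips when one more task joins the mix.

import numpy as np

# Hypothetical per-task scores: rows are models A and B, columns are tasks.
scores = np.array([
    [0.90, 0.60],  # model A
    [0.70, 0.75],  # model B
])
print(scores.mean(axis=1))  # [0.75, 0.725] -> A leads on two tasks

extra = np.array([[0.10], [0.60]])  # one added task where B shines
print(np.hstack([scores, extra]).mean(axis=1))  # [0.533, 0.683] -> B leads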
14
When the model moves the data (coming soon)
Models deployed at scale always influence future data, a
phenomenon called performativity. Performativity sheds
light on the problem of data feedback loops. Dynamic
benchmarks try to make a virtue out of it.
15
Evaluation at the frontier (coming soon)
Machine learning runs out of data. As models gain in
capabilities, human supervision increasingly becomes a
bottleneck. Researchers rush to find ways to have models
supervise other models.
16
Outlook (coming soon)
Competition over AI models reaches geopolitical intensity.
Rapidly changing model capabilities and expanding
applications put unprecedented pressure on the iron rule.
And yet, there are good reasons why the iron rule refuses
to retire.
Contact
This is a work in progress. I'll add new chapters throughout the summer of 2025.
Reach out at contact@mlbenchmarks.org
for feedback, questions, and suggestions. Please let me know if you find any errors. I appreciate your comments.
Cite
@misc{hardt2025emerging,
  author       = {Moritz Hardt},
  title        = {The Emerging Science of Machine Learning Benchmarks},
  year         = {2025},
  howpublished = {Online at \url{https://mlbenchmarks.org}},
  note         = {Manuscript}
}