AI benchmarks systematically ignore how humans disagree, Google study finds
Google study reveals flaws in AI benchmark evaluations

A study by Google Research and the Rochester Institute of Technology reveals that traditional AI benchmarks, which typically rely on three to five human evaluators per test example, fail to capture the full diversity of human opinion. The research indicates that at least ten raters are necessary to produce reliable results, emphasizing the importance of how annotation budgets are allocated between the number of test examples and the number of raters. This finding challenges existing practices in AI evaluations, suggesting that a more nuanced approach is needed to accurately assess model performance.
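To illustrate why a panel of three to five raters can be unstable on items where people genuinely disagree, here is a minimal sketch in Python. The 60/40 preference split, the independence assumption, and the function name are illustrative assumptions, not figures or methods from the study.

```python
import random

def majority_agrees_with_population(p_prefer_a: float, n_raters: int,
                                    n_trials: int = 10_000) -> float:
    """Estimate how often a majority vote of n_raters matches the
    population-level preference, assuming each rater independently
    prefers option A with probability p_prefer_a (> 0.5)."""
    hits = 0
    for _ in range(n_trials):
        votes_for_a = sum(random.random() < p_prefer_a for _ in range(n_raters))
        if votes_for_a * 2 > n_raters:  # strict majority for A
            hits += 1
    return hits / n_trials

# On an item with a genuine 60/40 split of human opinion, small panels
# often return the minority view; larger panels are far more stable.
for n in (3, 5, 10, 25):
    print(f"{n:>2} raters: majority matches population "
          f"{majority_agrees_with_population(0.6, n):.2f} of the time")
```

Under this toy model, a three-rater majority lands on the minority view roughly a third of the time, which illustrates the kind of disagreement a small panel can mask.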
Key Takeaways
1. Current benchmarks often use only three to five raters per test example.
2. Reliable results typically require more than ten raters per example.
3. An optimal budget split is crucial for accurate AI model comparisons (see the sketch after this list).
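As a rough illustration of the budget-split trade-off in the last takeaway, the sketch below enumerates how a fixed number of human judgments can be divided between test examples and raters per example. The budget figure and the helper function are hypothetical; the study's actual allocation analysis is not reproduced here.

```python
def budget_splits(total_judgments: int, raters_per_example=(3, 5, 10, 25)):
    """For a fixed annotation budget, list how many test examples can be
    covered at each choice of raters per example."""
    return [(total_judgments // r, r) for r in raters_per_example]

# A budget of 10,000 judgments stretches to 3,333 examples with 3 raters
# each, but only 400 examples with 25 raters each.
for n_examples, n_raters in budget_splits(10_000):
    print(f"{n_examples:>5} examples x {n_raters:>2} raters per example")
```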