Rethinking the Significance of Benchmarks

Why Benchmarks Might Not Matter as Much as You Think

Since the early days of large language models (LLMs), benchmarks have been the go-to method for evaluating their effectiveness, at least on paper. However, the race to top the leaderboards often leads companies to report results selectively, for example by choosing favorable evaluation settings, which makes it hard to determine a clear winner.

Benchmark Manipulation in Action

  • Google's launch of Gemini, and its comparison with GPT-4, showed how benchmark results can be skewed.
  • Google claimed superiority on the MMLU benchmark, but its headline score used CoT@32 (chain-of-thought prompting with 32 samples) rather than the standard 5-shot setting.
  • Microsoft countered with Medprompt+ on GPT-4, reaching a record MMLU score of 90.10% and underscoring the importance of prompt engineering.
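
To make the difference between the two settings concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Google's or Microsoft's actual evaluation code: the function names, the prompt wording, and the stub sampler are all hypothetical, and CoT@k is simplified here to plain majority voting over k sampled answers.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def five_shot_answer(sample: Callable[[str], str], question: str,
                     examples: List[str]) -> str:
    """5-shot setting: one prompt with five worked examples, one answer."""
    prompt = "\n\n".join(examples[:5] + [question])
    return sample(prompt)


def cot_at_k_answer(sample: Callable[[str], str], question: str,
                    k: int = 32) -> str:
    """Simplified CoT@k: sample k reasoning chains, return the most common answer."""
    prompt = question + "\nLet's think step by step."
    votes = Counter(sample(prompt) for _ in range(k))
    return votes.most_common(1)[0][0]


def make_stub_sampler(answers: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays a fixed cycle of answers."""
    stream = cycle(answers)
    return lambda prompt: next(stream)


if __name__ == "__main__":
    # A stub "model" that is right ("A") two times out of three.
    single = five_shot_answer(make_stub_sampler(["B", "A", "A"]),
                              "Q?", ["worked example"] * 5)
    voted = cot_at_k_answer(make_stub_sampler(["B", "A", "A"]), "Q?", k=32)
    print(single, voted)
```

With the stub above, the single 5-shot call happens to return the wrong answer, while the 32-sample vote recovers the right one. The same underlying model can therefore post noticeably different scores depending on the setting, which is why scores obtained under different settings are not directly comparable.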

The Role of Prompt Engineering

  • Microsoft's recent paper highlights the effectiveness of systematic prompt engineering in making a generalist GPT-4 perform like a specialist on medical challenges.
  • Microsoft also continues to release models of its own, recently launching Phi-2, which it claims outperforms Mistral 7B, Llama 2, and Gemini Nano.
  • Mistral AI likewise claims that its Mixtral 8x7B model matches or outperforms GPT-3.5 and outperforms Llama 2 70B on most benchmarks.

Benchmarks vs. Real-World Performance

  • While benchmarks provide a general idea of a model's capabilities, they shouldn't be the sole criterion for judgment.
  • The primary purpose of any LLM should be to serve its customers by streamlining tasks.
  • Users care more about practical improvements in tasks than marginal increases in benchmark scores.

Practical Usability Over Benchmarks

  • For tasks like creating meeting notes, summaries, and blogs, an LLM's ability to do the job accurately matters more than its benchmark score.
  • AI advisor Vin Vashishta emphasizes that the winners in generative AI will be decided by products that win over customers, not by benchmark scores alone.

Skepticism and Goodhart's Law

  • Some users are skeptical about the relevance of LLM benchmarks, suggesting that the field may have reached the point where Goodhart's law applies.
  • Goodhart's law states that when a measure becomes a target, it ceases to be a good measure. Individuals or organizations optimize for that specific metric, which distorts what the metric was meant to capture.


While benchmarks have their place, they might not be as crucial as they seem. The emphasis should be on practical usability and real-world performance, ensuring that LLMs meet the needs of users rather than just excelling in benchmark competitions.
