Beyond the Leaderboard: How to Evaluate Generative AI in Practice
AI models are evaluated on benchmarks, but doing well on benchmarks doesn't necessarily mean strong performance in every use case
Over the past two years, I often found myself drawn to the excitement around new models topping the leaderboards and to viral posts about their magical abilities, only to realize that they rarely reflected real-world fit for my use case. Digging deeper into what benchmarks actually measure helped me shift from chasing hype to choosing models more thoughtfully.
In this post, I’ll walk through what benchmarks really tell us (and don’t), what recent research says, and how to evaluate GenAI models in the real world.
Why Benchmarks Still Matter
Benchmarks are a valuable starting point for early model evaluation. Use them to:
Compare baseline capabilities across models
Track regressions during fine-tuning
Quickly flag poor performance before user testing
The problem is using them as a replacement for real-world evaluation.
Where Benchmarks Fall Short
Even the best-known benchmarks—like MMLU, GSM8K, or MT-Bench—have real limitations:
⚠️ 1. Data contamination
Many benchmark questions have leaked into model training data. That means models might not be solving problems, just regurgitating answers.
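If you can get your hands on even a sample of the training corpus (often you can't), one rough heuristic is to check n-gram overlap between benchmark items and training documents. A minimal sketch, where benchmark_questions and training_docs are placeholders for your own data, and the 8-gram window is an arbitrary choice:

```python
# Rough n-gram overlap check between benchmark items and training text.
# A heuristic sketch only, not a substitute for a proper contamination audit.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(question: str, corpus_ngrams: set[tuple[str, ...]], n: int = 8) -> float:
    q = ngrams(question, n)
    return len(q & corpus_ngrams) / len(q) if q else 0.0

# Hypothetical data: replace with your benchmark and a sample of training text.
benchmark_questions = ["For Socrates, an unexamined life is a tragedy because ..."]
training_docs = ["...some crawled document text..."]

corpus_ngrams = set().union(*(ngrams(doc) for doc in training_docs))
for question in benchmark_questions:
    print(f"{overlap_ratio(question, corpus_ngrams):.2f}  {question[:60]}")
```

A high overlap ratio doesn't prove the model memorized the answer, but it tells you the benchmark score should be read with suspicion.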
⚠️ 2. Output-centric, not user-centric
Benchmarks test correctness against ground truth. But real users care about helpfulness, recoverability, and context — none of which are easy to score.
⚠️ 3. Human comparisons are misleading
“Outperforms humans” often means “scores higher on multiple-choice trivia.” It says little about whether a model can adapt to unknown situations or make judgment calls.
⚠️ 4. The answer can depend on how you ask the question
GenAI models are non-deterministic, and their answers to the same question can vary with how you phrase it. Some models score poorly on a benchmark simply because they expect differently formatted prompts. Check out Anthropic’s insightful posts about challenges in evaluating AI systems.
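You can see this for yourself by asking the same question a few different ways and comparing the answers. A minimal sketch using the Anthropic Python SDK; the model name, phrasings, and temperature are placeholder choices, not recommendations:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# The same underlying question, phrased three ways.
phrasings = [
    "What is 17% of 240?",
    "Q: 17% of 240 = ?\nA:",
    "Calculate seventeen percent of two hundred forty. Answer with a number only.",
]

for prompt in phrasings:
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",   # placeholder model name
        max_tokens=100,
        temperature=1.0,                    # higher temperature -> more run-to-run variation
        messages=[{"role": "user", "content": prompt}],
    )
    print(repr(prompt[:40]), "->", message.content[0].text)
```

Even when the underlying answer is the same, the format and confidence of the response can shift, which is exactly what rigid benchmark scoring struggles to account for.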
There's a pressing need for an evaluation science that mirrors the rigor found in fields like medicine and aerospace, focusing on safety, reliability, and contextual performance (as highlighted by Weidinger et al. (2025)).
A Real-World Evaluation Loop
Appropriate benchmarks act as an early signal, but real evaluation begins with the model in context: with the whole system and with users in the loop. Here’s an evaluation loop that works when evaluating AI models for real-world use: Observe → Simulate → Target → Monitor

1. Observe
Go where your users already are — Reddit threads, customer tickets, community Slack. What are they asking? What do they expect? Where do they give up?
2. Simulate
Before launching anything, run wide internal testing with both technical and non-technical users; a small capture sketch follows this list. This will surface:
Unexpected prompts
User frustration signals
UX flaws no benchmark would catch
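To make those sessions useful later, capture every tester prompt along with a quick reaction and a free-text note. A minimal sketch of the record I’d keep; the field names and CSV format are my own choices, not a standard schema:

```python
import csv
import os
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class TestSession:
    tester: str          # who typed the prompt (technical / non-technical)
    prompt: str          # what they actually asked
    response: str        # what the system answered
    reaction: str        # "thumbs_up", "thumbs_down", "confused", "gave_up"
    note: str = ""       # free-text comment, e.g. "expected a table, got prose"
    timestamp: str = ""

def log_session(record: TestSession, path: str = "internal_testing.csv") -> None:
    record.timestamp = datetime.now(timezone.utc).isoformat()
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(asdict(record).keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(record))
```

A week of records like this is often enough to seed the targeted benchmark in the next step.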
3. Target
Build a lightweight benchmark focused on your use case (a minimal harness sketch follows this list):
50–100 examples from real conversations
Edge cases, vague queries, failure recovery
Metrics: not just accuracy, but clarity, confidence, and ability to acknowledge mistakes
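A targeted eval like this doesn’t need heavy tooling. Below is a minimal sketch of a harness over a couple of real-style cases; model_answer is a placeholder for whatever model or pipeline call your stack uses, and the graders are deliberately crude stand-ins for clarity and tone checks:

```python
import json

# Each case comes from a real conversation: a prompt plus things a good answer should do.
cases = [
    {
        "prompt": "My export failed again, can you just fix it??",
        "must_mention": ["export"],                # stays on topic
        "should_acknowledge_frustration": True,    # tone matters, not just accuracy
    },
    {
        "prompt": "What's your refund policy for annual plans?",
        "must_mention": ["refund", "annual"],
        "should_acknowledge_frustration": False,
    },
]

def model_answer(prompt: str) -> str:
    # Placeholder: swap in your real model / RAG pipeline call here.
    return "Sorry the export keeps failing - let's check your export settings together."

def grade(case: dict, answer: str) -> dict:
    text = answer.lower()
    return {
        "on_topic": all(term in text for term in case["must_mention"]),
        "acknowledges_frustration": (
            not case["should_acknowledge_frustration"]
            or any(word in text for word in ("sorry", "apologize", "frustrating"))
        ),
    }

if __name__ == "__main__":
    results = [grade(case, model_answer(case["prompt"])) for case in cases]
    print(json.dumps(results, indent=2))
```

Keyword checks are a blunt instrument; the point is that even 50 cases scored against your own criteria tell you more than a leaderboard rank.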
4. Monitor
Once live, log (see the sketch after this list):
What users asked that’s beyond the system’s capabilities
Where users got frustrated
Where conversations broke down
What feedback patterns correlated with churn or success
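In practice this can start as one structured log event per conversation turn, tagged with the signals above. A sketch assuming Python’s standard logging module; the event fields are my own naming, not a standard schema:

```python
import json
import logging

logger = logging.getLogger("genai.monitoring")

def log_turn(conversation_id: str, user_message: str, model_reply: str,
             out_of_scope: bool, user_frustrated: bool, conversation_ended: bool) -> None:
    """Emit one structured event per turn so the signals can be aggregated later."""
    logger.info(json.dumps({
        "event": "turn",
        "conversation_id": conversation_id,
        "user_message": user_message,
        "model_reply": model_reply,
        "out_of_scope": out_of_scope,          # asked for something the system can't do
        "user_frustrated": user_frustrated,    # e.g. repeated rephrasing, "this is useless"
        "conversation_ended": conversation_ended,
    }))
```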
Iteratively test and improve your system over time with your users in the loop.
What the Research Says
This shift from static benchmarks to contextual evaluation is echoed in recent papers:
Weidinger et al. (2025) call for a formal science of evaluation, modeled on safety-critical domains like medicine and aviation.
Tamkin et al. (2024) discuss how Anthropic monitors real-world usage of Claude and uses that monitoring to improve it.
Wallach et al. (2025) argue that evaluating generative AI is a social science problem, and we need human-centered metrics to assess trust, alignment, and helpfulness.
Wang et al. (2023) show that LLM-as-a-judge setups — like MT-Bench — are sensitive to response order, challenging their reliability; a position-swap mitigation is sketched after this list.
Guo et al. (2023) outline three evaluation levels: core ability, alignment, and safety — and remind us that strength in one doesn’t imply strength in the others.
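One common mitigation for the order sensitivity Wang et al. describe is to judge each pair twice with the answer positions swapped and only count a win when the two verdicts agree. A minimal sketch; ask_judge is a placeholder for your judge-model call:

```python
def ask_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask a judge model which answer is better; returns 'A', 'B', or 'tie'. Placeholder."""
    raise NotImplementedError("call your judge model here")

def position_debiased_verdict(question: str, answer_1: str, answer_2: str) -> str:
    first = ask_judge(question, answer_1, answer_2)    # answer_1 shown first
    second = ask_judge(question, answer_2, answer_1)   # order swapped
    if first == "A" and second == "B":
        return "answer_1"        # preferred regardless of position
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"                 # preference flipped with position -> treat as a tie
```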
A Detailed Look: What Benchmarks Measure
MMLU (Massive Multitask Language Understanding)
Focus: Multitask knowledge across 57 subjects
Example: For Socrates, an unexamined life is a tragedy because it results in grievous harm to _____. Options: [ "the state", "the justice system", "the body", "the soul" ]
Criticism: Commonly leaked into training data; models often memorize rather than reason. More on: AI has a measurement problem
Refinements: MMLU-Pro (Paper, Leaderboard) and MMLU-Redux (Paper, Evaluation). Better for measuring model knowledge, but still limited to static Q&A formats.
HELM (Holistic Evaluation of Language Models)
Focus: Accuracy, robustness, fairness, efficiency
Example: "Sheep are afraid of mice. Cats are afraid of mice. Jessica is a sheep. Wolves are afraid of mice. Mice are afraid of wolves. Emily is a wolf. Gertrude is a wolf. Winona is a mouse. Question: What is emily afraid of?"
Criticism: Broad, but the “multi-metric” approach can dilute sharp insights into specific weaknesses.
MT-Bench
Focus: Multi-turn conversational evaluations for chat models
Example: "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."-->"Rewrite your previous response. Start every sentence with the letter A."
Criticism: Uses single-turn judgments for multi-turn problems; doesn’t capture ongoing dialogue dynamics.
Chatbot Arena
Focus: Crowdsourced head-to-head human preference testing
Example: Real conversations in the wild, like "Write a polite email to reschedule a meeting due to illness."
Criticism: Votes can reflect popularity or style rather than task accuracy or trustworthiness. More: The AI industry is obsessed with Chatbot Arena, but it might not be the best benchmark
HumanEval
Focus: Code generation and functional correctness
Example: Write a function that takes a list of integers and returns the sum of the two largest numbers.
Criticism: Tasks are small and clean, unlike the messy, iterative nature of real-world software development.
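HumanEval scores a completion by executing it against unit tests. For the toy prompt above, a passing solution and the style of assertion-based check look roughly like this (my own illustration, not an actual HumanEval task):

```python
def sum_two_largest(numbers: list[int]) -> int:
    """Return the sum of the two largest values in the list."""
    top_two = sorted(numbers, reverse=True)[:2]
    return sum(top_two)

# HumanEval-style functional check: the completion passes only if the asserts hold.
assert sum_two_largest([1, 5, 3, 9]) == 14
assert sum_two_largest([-2, -7, -1]) == -3
```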
GSM8K
Focus: Grade school math word problems
Example: "Beth bakes 4, 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?"
Criticism: Focuses narrowly on grade-school arithmetic; strong scores say little about broader or more open-ended reasoning.
Humanity’s Last Exam (HLE)
Focus: Multi-modal benchmark at the frontier of human knowledge
Example: "In Greek mythology, who was Jason's maternal great-grandfather?"
Criticism: Questions sit at the extreme academic frontier, so strong scores on esoteric trivia say little about practical usefulness in everyday tasks.
50%-Task Horizon (METR)
Focus: Measures the longest real-world tasks an AI model can complete successfully at least 50% of the time. Tests models on extended, multi-step problems that mimic human workflows rather than one-off questions.
Example Question: "Using the Arxiv API, search for recent papers on 'Transformer architectures' and compile a table summarizing each paper’s title, authors, and publication date in a CSV file."
Criticism: While METR sets a higher bar for testing real-world AI usefulness, it simplifies success to binary completion. Many real-world tasks require nuance, collaboration, and judgment — things this metric does not fully assess.
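To make that concrete, a successful run of the example task looks roughly like the sketch below, using the public arXiv Atom API and only the Python standard library; the query string and field choices are simplified assumptions:

```python
import csv
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
query = urllib.parse.urlencode({
    "search_query": "all:transformer architectures",
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "max_results": 20,
})
url = f"http://export.arxiv.org/api/query?{query}"

with urllib.request.urlopen(url) as resp:
    feed = ET.fromstring(resp.read())

# Write one row per paper: title, authors, publication date.
with open("transformer_papers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "authors", "published"])
    for entry in feed.findall(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", "").strip()
        authors = "; ".join(author.findtext(f"{ATOM}name", "")
                            for author in entry.findall(f"{ATOM}author"))
        published = entry.findtext(f"{ATOM}published", "")
        writer.writerow([title, authors, published])
```

The point of the benchmark is whether a model can plan and execute all of these steps end to end, not whether any single step is hard.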
