Short Answer
Build a 50-100 example eval set from your real worst-case inputs, score each model on it, and look at the variance, not just the mean. Don't trust benchmarks: they're not your workload.
Detailed Answer
Benchmark leaderboards are useful for shortlisting but not for making the final choice. Your workload has distribution shifts the benchmarks don't capture, so build your own eval. A practical one: collect 50-100 real examples of the inputs you actually see, write down what 'correct' looks like for each (or run them through the existing solution and call that the baseline), and run every candidate model on the same set. Look at: (1) pass rate against your quality bar, (2) distribution of failures: are they consistent or random?, (3) cost per successful answer, (4) latency variance. For the hard cases (the 10% where models fail), use a stronger model or a human-in-the-loop. For the easy 90%, route to the cheapest model that passes.
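As a rough sketch of the scoring step, the four measurements above can be computed from per-example results along these lines. Everything here is an assumption for illustration: the `Result` record, the function names, and the idea that you have already run each model and recorded pass/fail, cost, and latency yourself. Nothing below calls a real model API.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class Result:
    """One model's outcome on one eval example (hypothetical record shape)."""
    example_id: str
    passed: bool       # did the output meet your quality bar?
    cost_usd: float    # what this call cost
    latency_s: float   # how long this call took


def summarize(results: list[Result]) -> dict:
    """Pass rate, cost per successful answer, and latency spread for one run."""
    passes = sum(r.passed for r in results)
    latencies = [r.latency_s for r in results]
    total_cost = sum(r.cost_usd for r in results)
    return {
        "pass_rate": passes / len(results),
        # Failed calls still cost money, so divide total spend by successes.
        "cost_per_success": total_cost / passes if passes else float("inf"),
        "latency_mean": mean(latencies),
        "latency_stdev": pstdev(latencies),
    }


def failure_consistency(runs: list[list[Result]]) -> tuple[list[str], list[str]]:
    """Split failing examples into 'fails every run' and 'fails only sometimes'.

    Assumes each example appears exactly once per run.
    """
    fails = Counter(r.example_id for run in runs for r in run if not r.passed)
    always = sorted(i for i, c in fails.items() if c == len(runs))
    sometimes = sorted(i for i, c in fails.items() if c < len(runs))
    return always, sometimes


def cheapest_passing(summaries: dict[str, dict], quality_bar: float) -> Optional[str]:
    """Pick the model with the lowest cost per success among those over the bar."""
    ok = {name: s for name, s in summaries.items() if s["pass_rate"] >= quality_bar}
    if not ok:
        return None
    return min(ok, key=lambda name: ok[name]["cost_per_success"])
```

Running the same set several times and passing the runs to `failure_consistency` separates examples a model always gets wrong (candidates for a stronger model or a human) from ones it gets wrong only intermittently. `cheapest_passing` is one simple way to pick the default model for the easy cases; the quality bar itself is yours to choose.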