# OCR mini-bench
Business-first OCR benchmark for standard operational documents.
This benchmark compares OCR extraction performance on real business documents across repeated runs, so you can see both quality and consistency rather than a single score. Each task gives a model an input document and a set of expected keys; the model must return the correct value for every key. The benchmark highlights what matters in production: critical-field success, reliability over repeats (pass^n), latency, stability, and cost per successful outcome.
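As a minimal sketch of the task described above: a run "succeeds" on a document only if every expected key maps to the correct value. The function names and the exact-match-after-normalization rule below are illustrative assumptions; the benchmark's actual matching rules (e.g. fuzzy matching, per-field tolerances) are not specified here.

```python
# Hypothetical scoring sketch: strict document-level success.
# Assumes exact match after light whitespace/case normalization,
# which may be stricter or looser than the benchmark's real judge.

def normalize(value: str) -> str:
    """Collapse whitespace and lowercase, so 'INV-001 ' == 'inv-001'."""
    return " ".join(value.strip().lower().split())

def doc_success(extracted: dict[str, str], expected: dict[str, str]) -> bool:
    """A run succeeds only if every expected key is present and correct."""
    return all(
        normalize(extracted.get(key, "")) == normalize(truth)
        for key, truth in expected.items()
    )

# One wrong or missing field fails the whole document:
doc_success({"invoice_no": "INV-001 ", "total": "42.00"},
            {"invoice_no": "inv-001", "total": "42.00"})   # True
doc_success({"invoice_no": "INV-001"},
            {"invoice_no": "inv-001", "total": "42.00"})   # False
```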
## Leaderboard
42 docs • 16 models • 6,720 runs
Last updated: 18 March 2026
| # | Model | Success | pass^3 | pass^5 | Cost/success | Latency | Critical | All fields | Cost/doc | Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Flash | 73.8% | 73.8% | 73.8% | 0.67¢ | 16.0s | 96.7% | 95.3% | 0.46¢ | |
| 2 | Claude Sonnet 4.6 | 73.8% | 73.8% | 73.8% | 3.61¢ | 18.9s | 94.3% | 94.7% | 2.46¢ | |
| 3 | Claude Opus 4.6 | 73.0% | 72.2% | 71.8% | 6.25¢ | 18.9s | 95.8% | 94.8% | 4.15¢ | |
| 4 | Gemini 3.1 Pro | 68.7% | 68.7% | 68.7% | 2.55¢ | 65.3s | 91.5% | 89.5% | 1.63¢ | |
| 5 | Gemini 3.1 Flash-Lite | 61.2% | 61.2% | 61.2% | 0.32¢ | 12.8s | 93.3% | 93.2% | 0.19¢ | |
| 6 | Gemini 2.5 Flash-Lite | 58.6% | 58.6% | 58.6% | 0.10¢ | 14.4s | 94.1% | 91.9% | 0.06¢ | |
| 7 | Medium | 54.1% | 49.8% | 47.5% | 0.69¢ | 21.0s | 91.0% | 88.4% | 0.29¢ | |
| 8 | Large | 50.5% | 48.4% | 47.3% | 0.31¢ | 23.2s | 92.0% | 89.9% | 0.28¢ | |
| 9 | OCR | 48.4% | 43.5% | 41.6% | 0.67¢ | 11.8s | 92.3% | 91.8% | 0.30¢ | |
| 10 | Small | 46.2% | 43.1% | 41.9% | 0.12¢ | 12.6s | 88.6% | 88.0% | 0.05¢ | |
| 11 | GPT-5 | 44.6% | 39.3% | 37.9% | 24.20¢ | 19.8s | 88.8% | 89.4% | 1.01¢ | |
| 12 | GPT-5.4 mini | 43.2% | 35.9% | 32.4% | 3.30¢ | 13.7s | 91.5% | 92.0% | 0.65¢ | |
| 13 | GPT-5 mini | 39.3% | 32.7% | 30.5% | 3.09¢ | 25.0s | 90.2% | 89.9% | 0.28¢ | |
| 14 | Claude Haiku 4.5 | 34.9% | 34.9% | 34.9% | 3.73¢ | 13.6s | 89.9% | 89.9% | 0.97¢ | |
| 15 | GPT-5.4 nano | 23.6% | 13.7% | 11.3% | 7.05¢ | 19.9s | 82.8% | 78.7% | 0.27¢ | |
| 16 | GPT-5 nano | 8.7% | 5.2% | 4.1% | 2.88¢ | 17.0s | 63.1% | 52.6% | 0.05¢ | |
† pass^n metric: Probability of n consecutive successes in n runs (strict).
† Variance column: Shows the min–max interval, with bar width indicating spread (rendered graphically on the interactive leaderboard; blank in this text export).
† All metrics: Aggregated across all documents, with 10 runs per model per document.
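Given 10 observed runs per document, the pass^n footnote above can be estimated without bias with the standard combinatorial formula: the probability that n runs drawn without replacement from the observed runs all succeed. This is a sketch of one reasonable estimator; the exact estimator used by this leaderboard is an assumption, not stated in the source.

```python
from math import comb

def pass_n(successes: int, runs: int, n: int) -> float:
    """Unbiased pass^n estimate: P(n runs sampled without replacement
    from `runs` observed runs all succeed). comb(s, n) is 0 when s < n,
    so documents with fewer than n successes contribute 0."""
    if runs < n:
        raise ValueError("need at least n observed runs")
    return comb(successes, n) / comb(runs, n)

def benchmark_pass_n(per_doc_successes: list[int], runs: int, n: int) -> float:
    """Benchmark-level pass^n: mean of per-document estimates (assumed
    aggregation; matches the 'aggregated across all documents' footnote)."""
    return sum(pass_n(s, runs, n) for s in per_doc_successes) / len(per_doc_successes)

# With 10 runs per document, as in this benchmark:
pass_n(8, 10, 1)   # 0.8
pass_n(8, 10, 3)   # comb(8,3)/comb(10,3) = 56/120 ≈ 0.467
pass_n(10, 10, 5)  # 1.0 — fully stable models have pass^5 == pass^1
```

Note how the table reflects this: models with identical Success, pass^3, and pass^5 (e.g. Gemini 3 Flash) succeeded or failed deterministically per document, while models whose pass^5 drops well below Success (e.g. GPT-5.4 nano) are inconsistent across repeats.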