OCR mini-bench

Business-first OCR benchmark for standard operational documents.

This benchmark compares OCR extraction performance on real business documents across repeated runs, so you can see both quality and consistency rather than a single score. Each model receives an input document and a set of expected keys, and is measured on how accurately it returns the correct output values. The metrics highlight what matters in production: critical-field success, reliability over repeats (pass^n), latency, stability, and cost per successful outcome.
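As a rough illustration of the task shape, a run can be graded by comparing the model's extracted values against the expected values for each key. The field names, the normalization rule, and the scoring helper below are hypothetical, not the benchmark's actual harness:

```python
# Illustrative scoring sketch: the model is given a document plus expected
# keys, and each returned value is graded against the expected value.
# Field names and the normalization rule are assumptions for this example.

def normalize(value: str) -> str:
    """Collapse whitespace and lowercase, a common leniency choice."""
    return " ".join(value.split()).lower()

def score_run(expected: dict, extracted: dict, critical: set):
    """Return (all-fields accuracy, critical-fields accuracy) for one run."""
    correct = {k for k, v in expected.items()
               if normalize(extracted.get(k, "")) == normalize(v)}
    return len(correct) / len(expected), len(correct & critical) / len(critical)

expected = {"invoice_number": "INV-1042", "total": "1,250.00", "due_date": "2026-04-01"}
extracted = {"invoice_number": "INV-1042", "total": "1250.00", "due_date": "2026-04-01"}
# "total" misses on the dropped comma under this strict string comparison.
print(score_run(expected, extracted, critical={"invoice_number", "total"}))
```

Real harnesses differ mainly in the normalization step (numeric parsing, date formats), which is why strict string matching like the above is a conservative lower bound.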

42 docs • 16 models • 6,720 runs

Last updated: 18/03/2026

| # | Model | Provider | Tier | Success | pass^3 | pass^5 | Cost/success | Latency | Critical | All fields | Cost/doc | Variance |
|---|-------|----------|------|--------:|-------:|-------:|-------------:|--------:|---------:|-----------:|---------:|----------|
| 1 | Gemini 3 Flash | Google | BALANCED | 73.8% | 73.8% | 73.8% | 0.67¢ | 16.0s | 96.7% | 95.3% | 0.46¢ | – |
| 2 | Claude Sonnet 4.6 | Anthropic | BALANCED | 73.8% | 73.8% | 73.8% | 3.61¢ | 18.9s | 94.3% | 94.7% | 2.46¢ | – |
| 3 | Claude Opus 4.6 | Anthropic | SOTA | 73.0% | 72.2% | 71.8% | 6.25¢ | 18.9s | 95.8% | 94.8% | 4.15¢ | – |
| 4 | Gemini 3.1 Pro | Google | SOTA | 68.7% | 68.7% | 68.7% | 2.55¢ | 65.3s | 91.5% | 89.5% | 1.63¢ | – |
| 5 | Gemini 3.1 Flash-Lite | Google | BALANCED | 61.2% | 61.2% | 61.2% | 0.32¢ | 12.8s | 93.3% | 93.2% | 0.19¢ | – |
| 6 | Gemini 2.5 Flash-Lite | Google | BUDGET | 58.6% | 58.6% | 58.6% | 0.10¢ | 14.4s | 94.1% | 91.9% | 0.06¢ | – |
| 7 | Medium | Mistral | BALANCED | 54.1% | 49.8% | 47.5% | 0.69¢ | 21.0s | 91.0% | 88.4% | 0.29¢ | – |
| 8 | Large | Mistral | SOTA | 50.5% | 48.4% | 47.3% | 0.31¢ | 23.2s | 92.0% | 89.9% | 0.28¢ | – |
| 9 | OCR | Mistral | SOTA | 48.4% | 43.5% | 41.6% | 0.67¢ | 11.8s | 92.3% | 91.8% | 0.30¢ | – |
| 10 | Small | Mistral | BUDGET | 46.2% | 43.1% | 41.9% | 0.12¢ | 12.6s | 88.6% | 88.0% | 0.05¢ | – |
| 11 | GPT-5 | OpenAI | SOTA | 44.6% | 39.3% | 37.9% | 24.20¢ | 19.8s | 88.8% | 89.4% | 1.01¢ | – |
| 12 | GPT-5.4 mini | OpenAI | BALANCED | 43.2% | 35.9% | 32.4% | 3.30¢ | 13.7s | 91.5% | 92.0% | 0.65¢ | – |
| 13 | GPT-5 mini | OpenAI | BALANCED | 39.3% | 32.7% | 30.5% | 3.09¢ | 25.0s | 90.2% | 89.9% | 0.28¢ | – |
| 14 | Claude Haiku 4.5 | Anthropic | BUDGET | 34.9% | 34.9% | 34.9% | 3.73¢ | 13.6s | 89.9% | 89.9% | 0.97¢ | – |
| 15 | GPT-5.4 nano | OpenAI | BUDGET | 23.6% | 13.7% | 11.3% | 7.05¢ | 19.9s | 82.8% | 78.7% | 0.27¢ | – |
| 16 | GPT-5 nano | OpenAI | BUDGET | 8.7% | 5.2% | 4.1% | 2.88¢ | 17.0s | 63.1% | 52.6% | 0.05¢ | – |

pass^n metric: Probability of n consecutive successes in n runs (strict: every run must succeed).
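A standard unbiased way to estimate such a metric from repeated runs is the combinatorial pass@k estimator, adapted to pass^n: the chance that n runs drawn without replacement from the observed runs are all successes. Whether the benchmark computes it exactly this way is an assumption; this is a sketch of the idea:

```python
from math import comb

def pass_n(successes: int, runs: int, n: int) -> float:
    """Estimate P(n consecutive successes) as the probability that n runs
    sampled without replacement from the observed runs all succeeded.
    This mirrors the standard combinatorial pass@k estimator; the
    benchmark's exact estimator is an assumption here."""
    if successes < n:
        return 0.0
    return comb(successes, n) / comb(runs, n)

# With 10 runs per model, 7 successes gives:
print(pass_n(7, 10, 3))  # comb(7,3)/comb(10,3) = 35/120
```

Note that this explains why several models show identical Success, pass^3, and pass^5: if every document either succeeds on all 10 runs or fails on all 10, the estimator reduces to the plain success rate.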

Variance column: Shows the min–max interval across runs as a bar whose width indicates spread (graphical; per-model values are not reproduced in this text version).

All metrics: Aggregated across all documents and 10 runs per model.