Back to Blog · AI & Automation

OCR Extraction Benchmark

I built a benchmark to compare three approaches for structured data extraction from Quebec receipts.

MF
Martin Fournier
· May 25, 2026 · 5 MIN READ
Illustration for: OCR Extraction Benchmark

OCR Extraction Benchmark: DeepSeek API vs Regex vs NuExtract-tiny

Every freelancer in Quebec has the same tax-time ritual: a stack of receipts, hours of manual data entry, and the nagging feeling that somewhere, a deduction slipped through the cracks.

I set out to automate this. The goal: feed a PDF receipt through Tesseract, extract structured data (vendor, date, subtotal, TPS, TVQ, total, payment method), and dump it straight into an expense report. The question was which extraction method works best for real Quebec receipts — the kind with French text, QC-specific tax lines, and the occasional handwritten total.

The Contenders

Three approaches, spanning the spectrum from fully local to cloud API:

Approach Model Where it runs Size Cost
DeepSeek API DeepSeek V3 Cloud $0.50/1M output tokens
Regex baseline Custom regex Local Free
NuExtract-tiny NuExtract-1.5-tiny Local (CPU) 350 MB GGUF Free

DeepSeek is the heavy lifter — a frontier LLM with deep reasoning. The regex baseline is a hand-crafted set of patterns for structured Quebec receipts. NuExtract-tiny is a 0.5B parameter model purpose-built for structured extraction, running locally via llama-cpp-python.

The Method

I built a benchmark framework (benchmark/) with three layers:

1. Fixtures — 18 Synthetic Receipts

Each fixture is a realistic Quebec receipt as Tesseract would see it: item lines, subtotals, TPS/TVQ lines (at 5% and 9.975%), payment method, and footer text. Every receipt is paired with ground-truth JSON of what the extraction should return.

The receipts span 15 professions and 18 stores — from a contractor buying lumber at RONA to a lawyer expensing a client dinner at Le Pois Penché. The full list:

Construction   → RONA, Canac
TI/Consulting  → Starbucks, Best Buy
Juridique      → Le Pois Penché, Stationnement Indigo
Sante          → Bureau en Gros
Coiffure       → Sally Beauty
Creatif        → L.L. Lozeau
Transport      → Shell
Restauration   → Marche Atwater
Immobilier     → L'Express
Comptable      → Bureau en Gros
Mecanique      → Napa
Fitness        → Sportium
Tuteur         → Librairie Renaud-Bray
Entretien      → Canadian Tire
Coaching       → Tim Hortons

2. Model Runners

Both runners implement the same ModelRunner protocol:

class ModelRunner(ABC):
    @abstractmethod
    def extract(self, ocr_text: str, fixture_id: str) -> RunnerResult: ...
    @abstractmethod
    def is_available(self) -> bool: ...

DeepSeekRunner sends OCR text to the DeepSeek API with a system prompt requesting structured JSON output. Temperature is pinned to 0.0 for deterministic results.

RegexRunner uses hand-crafted patterns per field: the first non-address line for vendor name, date regex for date, subtotal patterns, TPS/TVQ patterns for taxes, and total. It runs in microseconds.

3. Evaluator — +/-2 cent Tolerance

The evaluator compares each extracted field against ground truth. Numeric fields (subtotal, taxes, total) get a +/-$0.02 tolerance. String fields require exact case-insensitive match. Each of the 7 fields is scored per receipt, then rolled up.

Results

The benchmark produced concrete numbers across all 18 fixtures:

Model Accuracy Success Receipts Avg time
DeepSeek API 89.7% 5/18 18 1.07s
Regex baseline 93.7% 10/18 18 0.00s
Field DeepSeek API Regex baseline
fournisseur 27.8% 55.6%
date 100.0% 100.0%
subtotal 100.0% 100.0%
tps 100.0% 100.0%
tvq 100.0% 100.0%
total 100.0% 100.0%
payment_method 100.0% 100.0%

The results surprised me. Let's break down what happened.

The Regex Baseline Won

A hand-crafted regex pipeline achieved 93.7% accuracy vs DeepSeek's 89.7% — not because the regex is smarter, but because DeepSeek is too literal. When the OCR text says RONA - Magasin 1234, DeepSeek returns that verbatim as the vendor name. The regex pipeline strips store numbers, suffixes, and address fragments.

This is a genuine limitation of LLM-based extraction: a frontier model with deep reasoning doesn't know the convention for vendor names on Quebec receipts.

DeepSeek's One Real Weakness

Beyond the fournisseur issue, DeepSeek was perfect on every other field: dates, subtotals, taxes, totals, and payment methods all hit 100%. It correctly handles decimal separators, French tax line labels ("TPS (5%)"), and the rare receipt where the tax line doesn't match the item total at the cent level.

The regex was also perfect on non-fournisseur fields — a testament to how structured and consistent retail receipts are.

The Timing Reality

DeepSeek averaged 1.07 seconds per receipt, dominated by network latency. At 18 receipts, the total was 19.3 seconds. The regex baseline ran in microseconds — functionally instant.

Cost Analysis

Metric DeepSeek API Regex baseline
Cost per receipt ~$0.0004 $0.00
500 receipts/mo ~$0.20 $0.00
Speed (per receipt) ~1.07s (network) ~0.00001s
Privacy Data leaves machine Fully local
Vendor name handling Requires post-processing Built-in

The actual cost for the full benchmark (18 receipts): $0.0025 in DeepSeek API fees. For a solo freelancer processing 50–100 receipts per month, that's essentially free.

What I Learned

  1. Structured data extraction doesn't need an LLM. For consistent receipt formats — which Quebec retail receipts absolutely are — well-crafted regex beats DeepSeek on both accuracy and speed.

  2. But you need the right domain knowledge. The regex approach works because I know Quebec receipts: TPS is always 5%, TVQ is always 9.975%, payment methods appear at the bottom, store names are on the first line.

  3. DeepSeek's vendor name issue is fixable. A simple post-processing step — strip common suffixes (store numbers, "Magasin", "Succursale", department names) — would push DeepSeek to 100% on fournisseur.

  4. For inconsistent receipts, use DeepSeek. If a receipt is handwritten, crumpled, or from an unusual vendor, the LLM handles edge cases gracefully while regex breaks silently. The hybrid approach is the real win.

What's Next

  1. Add the NuExtract-tiny runner (350 MB GGUF via llama-cpp-python)
  2. Test on real scanned receipts (the current fixtures are idealized Tesseract output)
  3. Build a hybrid pipeline: regex for standard fields to DeepSeek for vendor name normalization to CSV export

Built with Python, httpx, and too many Quebec receipts.

Source code: github.com/martinfou/ocr-benchmark