A practical, honest comparison of three popular document understanding models — Donut, LiLT, and LayoutLMv3 — for invoice and form extraction. Architecture, inputs, F1 benchmarks, and when to pick each.

Teams building invoice and form extraction pipelines keep asking the same question: Donut vs LiLT vs LayoutLM for invoices — which one should I fine-tune? The short answer is that all three are strong document understanding models with real trade-offs, and the right pick depends on whether you need OCR-free inference, multilingual coverage, or maximum English F1. This page lays out the architectures, the input formats, the published benchmarks, and a decision flow so you can pick the right model for your corpus — and then actually train it on enough data to win.

Skip the rest of the page if you just want the verdict. Each column lists the conditions under which that model is the best choice.

Pick Donut if…
OCR-free end-to-end
- You want one model, no OCR pipeline to maintain.
- Your documents are scans, images, or low-quality PDFs.
- You need nested structured JSON output directly.
- You can afford generative-decode latency at inference.

Pick LiLT if…
Language-independent layout
- Your forms span multiple languages or need cross-lingual transfer.
- You have an OCR pass that produces tokens and bboxes.
- You want to pair the layout encoder with a pretrained language model of your choice.

Pick LayoutLMv3 if…
Maximum English F1
- Your forms are English and OCR is available.
- You want the highest published F1 on form understanding.
- The CC BY-NC-SA 4.0 license on the weights fits your use case.
All three models process a document page, but they do it in fundamentally different ways. Understanding those differences is the first step to choosing between them.
| Aspect | Donut | LiLT | LayoutLMv3 |
|---|---|---|---|
| Paradigm | Seq2seq image → text | Two-stream encoder | Unified dual-stream |
| Visual encoder | Swin Transformer | None (layout-only) | Linear patch embedding |
| Text backbone | BART-style autoregressive decoder | Swappable encoder (XLM-R, InfoXLM) | RoBERTa-style encoder |
| OCR required? | No | Yes | Yes |
| Pretraining | Synthetic doc rendering | IIT-CDIP + MLM-style | IIT-CDIP + MLM/MIM/WPA |
| Output head | Generated JSON string | Token classifier (BIO) | Token classifier (BIO) |
| Multilingual? | Partial (tokenizer dependent) | Yes, by design | v3 is English-only (v2 multilingual) |
The key architectural insight: Donut is a seq2seq image-to-text model that decodes a JSON string directly from page pixels. No OCR, no layout encoder — the Swin visual encoder reads the page and a BART-style decoder writes the answer. LiLT decouples the text encoder from the layout encoder so you can plug in any pretrained language model, which is why it works across languages. LayoutLMv3 unifies text, image patches, and layout embeddings in a single dual-stream transformer with pretraining objectives on all three modalities at once (masked language modeling, masked image modeling, word-patch alignment).
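To make Donut's output format concrete: the decoder emits a flat tag sequence such as `<s_invoice><s_po_number>Po92</s_po_number></s_invoice>`, which is then folded into nested JSON. Below is a minimal sketch of that conversion — the `transformers` library's `DonutProcessor` ships a fuller `token2json`; this toy version assumes well-nested tags with non-repeating keys at each level:

```python
import re

def token2json(seq: str) -> dict:
    # Toy converter: assumes well-nested <s_key>…</s_key> pairs with unique
    # keys per level. The real DonutProcessor.token2json also handles lists,
    # special tokens, and malformed model output.
    out = {}
    while True:
        m = re.search(r"<s_(.*?)>", seq)       # next opening tag
        if m is None:
            break
        key = m.group(1)
        close = f"</s_{key}>"
        start = m.end()
        stop = seq.find(close, start)          # matching closing tag
        if stop < 0:
            break
        body = seq[start:stop]
        # Recurse when the body contains nested tags, else keep the raw value
        out[key] = token2json(body) if "<s_" in body else body.strip()
        seq = seq[stop + len(close):]
    return out
```

This is why Donut needs no token classifier head: the structure of the answer lives in the generated string itself.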
What each model actually eats at training and inference time. This determines your data pipeline, your OCR dependency, and your per-page compute.
Donut
A single rendered page image (RGB). No tokens, no OCR, no bboxes. The model reads pixels and writes a JSON string.
input: page.png (1280×960 RGB)
output: "<s_invoice_parsing>…</s>"

LiLT
Tokens from OCR plus [x0, y0, x1, y1] bboxes normalized to a 0–1000 coordinate system. No page image required for the base layout stream.
input: tokens, bboxes (0-1000)
output: BIO tags per token

LayoutLMv3
Tokens from OCR with pixel-space bboxes plus the page image as patch tokens. The model cross-attends across all three streams.
input: tokens, bboxes (pixel), image
output: BIO tags per token
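Since both LiLT and LayoutLMv3 train against BIO tags, the label side of the data pipeline is the same for both. A minimal sketch of turning field-annotated OCR tokens into BIO tags (the field names are hypothetical examples, not a fixed schema):

```python
def bio_tags(tokens_with_fields):
    """Turn (token, field_or_None) pairs into BIO tags for token classification.

    B- marks the first token of an entity span, I- its continuation,
    and O marks tokens that belong to no field.
    """
    tags, prev = [], None
    for _tok, field in tokens_with_fields:
        if field is None:
            tags.append("O")
        elif field == prev:
            tags.append(f"I-{field}")   # continuing the same entity span
        else:
            tags.append(f"B-{field}")   # starting a new entity span
        prev = field
    return tags
```

The same tag sequence, paired with 0–1000 bboxes, feeds LiLT; paired with bboxes plus the page image, it feeds LayoutLMv3.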
This is the part every comparison article glosses over. All three models were published with impressive benchmark numbers on public datasets — and all three will underperform on your forms unless you fine-tune them on a large, high-quality, labeled corpus that matches your document distribution.
The public benchmarks everyone quotes are tiny. FUNSD has 199 forms total (149 train / 50 test). CORD has about 1,000 receipts. SROIE has 626 training receipts. Those sizes are fine for publishing a paper; they are nowhere near enough to productionize document AI on real invoices, claims, or tax forms.
Real fine-tuning runs want thousands to tens of thousands of labeled examples per document type, with realistic variation in layout, handwriting, stamps, stains, and edge cases. Manual annotation of real documents runs roughly $0.50–$3 per field and carries compliance risk (HIPAA, GLBA, GDPR) for anything containing actual PII. Public synthetic datasets like SynthDoG are rendered from Wikipedia pages — not invoices, not tax forms, not claims.
This is the gap SymageDocs closes. We generate pixel-perfect synthetic invoices, W-2s, 1040s, 1099s, CMS-1500s, and more — with ground-truth labels pre-formatted for every target: nested JSON for Donut synthetic data, token + layout pairs for LiLT synthetic data, and FUNSD-format data for LayoutLM. No PII, no annotation bottleneck, no volume ceiling.
For the bigger picture on training strategy across document models, see our pillar on synthetic training data for document AI.
The same invoice, three different model targets. This is what ends the "which format should I generate?" debate — you don't have to pick. The same labeled document can emit Donut JSON, LiLT token/bbox pairs, and LayoutLMv3 token/bbox pairs in a single generation run.
Form: invoice_classic. Donut JSON is the decode target; LiLT and LayoutLM feed BIO-tagged token/bbox sequences.
{
"task": "invoice_parsing",
"instance": {
"document": {
"date": "11/23/2020",
"reference_number": "Invoice48"
},
"employer": {
"name": "Atlas Logistics Co",
"phone": "(858) 433-1954"
},
"invoice": {
"po_number": "Po92",
"due_date": "11/28/2018",
"payment_terms": "Payment15",
"line_item": {
"amount": "73,123.17",
"description": "Line65",
"quantity": "Line25",
"unit_price": "Line94"
}
}
}
}

Note the coordinate-space difference between LiLT and LayoutLM: LiLT normalizes every bbox into a 0–1000 grid (the LayoutLMv1 convention it inherits); LayoutLMv3 processors expect pixel coordinates and handle rescaling internally.
Head-to-head comparison is tricky because these models were evaluated on different datasets with different task formulations. The numbers below come straight from the original papers — click through to verify.
| Model | Dataset | F1 | Source |
|---|---|---|---|
| LayoutLMv3 (base) | FUNSD | 92.08 | Huang et al., 2022, "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking" |
| LiLT (with InfoXLM-base) | FUNSD | 88.41 | Wang et al., 2022, "LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding" |
| Donut | CORD | 84.1 | Kim et al., 2022, "OCR-free Document Understanding Transformer" |
Read these numbers carefully: LayoutLMv3's 92.08 F1 on FUNSD and LiLT's 88.41 on the same benchmark are directly comparable — both do token-level entity recognition on the 50-form FUNSD test split. Donut's 84.1 F1 is on CORD (receipt parsing), a different task formulation — field-level structured extraction — so you can't line it up one-for-one with the other two. The takeaway: on English form understanding, LayoutLMv3 posts the highest published F1 of the three; LiLT is close behind and trades a few points of F1 for language coverage; Donut trades F1 for eliminating OCR entirely.
If you just want to be told what to do, walk this flow top-to-bottom.
Step 1
OCR-free requirement? (Low-quality scans, no OCR budget, need a single model.)
Yes → Donut. Stop here.
No → continue.
Step 2
Multilingual corpus? (Forms in multiple languages, cross-lingual transfer, non-English target.)
Yes → LiLT. Stop here.
No → continue.
Step 3
Default: English forms, OCR available, want maximum F1.
→ LayoutLMv3. Mind the CC BY-NC-SA 4.0 license on the weights for commercial use.
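The three steps above collapse to a tiny routing function — a sketch, with the two booleans as the only inputs the flow actually consults:

```python
def pick_model(needs_ocr_free: bool, multilingual: bool) -> str:
    """Encode the decision flow: OCR-free trumps multilingual trumps default."""
    if needs_ocr_free:       # Step 1: low-quality scans, no OCR budget
        return "Donut"
    if multilingual:         # Step 2: multi-language corpus, cross-lingual transfer
        return "LiLT"
    return "LayoutLMv3"      # Step 3: English forms, OCR available, max F1
```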
Whichever you pick, you still need a detector or layout pass to isolate the regions the extractor reads. For the upstream half of a two-stage pipeline, see synthetic training data for YOLOv8 document detection.
SymageDocs generates Donut JSON, LiLT inputs, and LayoutLM inputs from the same synthetic forms. No PII, no manual annotation, no volume ceiling. Start with 250 free credits.
Start for free