Model comparison

Donut vs LiLT vs LayoutLM — for invoices and forms

A practical, honest comparison of three popular document understanding models — Donut, LiLT, and LayoutLMv3 — for invoice and form extraction. Architecture, inputs, F1 benchmarks, and when to pick each.

Teams building invoice and form extraction pipelines keep asking the same question: Donut vs LiLT vs LayoutLM for invoices — which one should I fine-tune? The short answer is that all three are strong document understanding models with real trade-offs, and the right pick depends on whether you need OCR-free inference, multilingual coverage, or maximum English F1. This page lays out the architectures, the input formats, the published benchmarks, and a decision flow so you can pick the right model for your corpus — and then actually train it on enough data to win.

TL;DR — which model should you pick?

Skip the rest of the page if you just want the verdict. Each list below gives the conditions under which that model is the best choice.

Pick Donut if…

OCR-free end-to-end

  • You want one model, no OCR pipeline to maintain.
  • Your documents are scans, images, or low-quality PDFs.
  • You need nested structured JSON output directly.
  • You can afford generative-decode latency at inference.

Pick LiLT if…

Multilingual forms

  • Your corpus spans multiple languages.
  • You want to swap in XLM-R or InfoXLM as the text encoder.
  • You already run OCR and want to reuse tokens + boxes.
  • You need cross-lingual zero-shot transfer.

Pick LayoutLMv3 if…

Maximum English F1

  • English forms where every F1 point matters.
  • You already have OCR tokens and page images.
  • You want a unified text+image+layout transformer.
  • You can license-wrangle the CC BY-NC-SA 4.0 weights.

Architecture comparison

All three models process a document page, but they do it in fundamentally different ways. Getting the architecture right is the first step to choosing between them.

Aspect | Donut | LiLT | LayoutLMv3
Paradigm | Seq2seq image → text | Two-stream encoder | Unified multimodal transformer
Visual encoder | Swin Transformer | None (layout-only) | Linear patch embedding
Text backbone | BART-style autoregressive decoder | Swappable encoder (XLM-R, InfoXLM) | RoBERTa-style encoder
OCR required? | No | Yes | Yes
Pretraining | Synthetic doc rendering | IIT-CDIP + MLM-style | IIT-CDIP + MLM/MIM/WPA
Output head | Generated JSON string | Token classifier (BIO) | Token classifier (BIO)
Multilingual? | Partial (tokenizer-dependent) | Yes, by design | v3 is English-only (v2 multilingual)

The key architectural insight: Donut is a seq2seq image-to-text model that decodes a JSON string directly from page pixels. No OCR, no layout encoder — the Swin visual encoder reads the page and a BART-style decoder writes the answer. LiLT decouples the text encoder from the layout encoder so you can plug in any pretrained language model, which is why it works across languages. LayoutLMv3 unifies text, image patches, and layout embeddings in a single multimodal transformer with pretraining objectives on all three modalities at once (masked language modeling, masked image modeling, word-patch alignment).
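
In the Hugging Face transformers library, the three architectures map to three different model classes. A minimal loading sketch, assuming the commonly published Hub checkpoints and a placeholder 9-label BIO scheme; swap in your own fine-tunes:

python
# Donut: vision encoder-decoder (Swin encoder -> BART-style decoder), no OCR.
# LiLT / LayoutLMv3: token classifiers over OCR tokens + bounding boxes.
from transformers import (
    VisionEncoderDecoderModel,
    LiltForTokenClassification,
    LayoutLMv3ForTokenClassification,
)

donut = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
lilt = LiltForTokenClassification.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base", num_labels=9
)
layoutlmv3 = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=9
)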

Input format comparison

What each model actually eats at training and inference time. This determines your data pipeline, your OCR dependency, and your per-page compute.

Donut: image only

A single rendered page image (RGB). No tokens, no OCR, no bboxes. The model reads pixels and writes a JSON string.

input: page.png (1280×960 RGB)
output: "<s_invoice_parsing>…</s>"

LiLT: tokens + layout

Tokens from OCR plus [x0, y0, x1, y1] bboxes normalized to a 0–1000 coordinate system. No page image required; LiLT has no visual stream.

input: tokens, bboxes (0-1000)
output: BIO tags per token
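
A sketch of the matching forward pass, assuming your OCR engine already produced words plus boxes you have rescaled to the 0–1000 grid (the words, boxes, and label count below are placeholders):

python
import torch
from transformers import AutoTokenizer, LiltForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base")
model = LiltForTokenClassification.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base", num_labels=9
)

words = ["Invoice", "#48", "Due:", "11/28/2018"]   # from your OCR step
boxes = [[70, 40, 180, 62], [190, 40, 240, 62],    # already normalized
         [70, 80, 130, 102], [140, 80, 250, 102]]  # to the 0-1000 grid

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Repeat each word's box for all its subword pieces; special tokens get zeros.
bbox = [[0, 0, 0, 0] if i is None else boxes[i] for i in enc.word_ids()]
enc["bbox"] = torch.tensor([bbox])

logits = model(**enc).logits   # one row of BIO logits per subword
predictions = logits.argmax(-1)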

LayoutLMv3: tokens + layout + image

Tokens from OCR with bounding boxes, plus the page image as patch tokens. Raw annotations (FUNSD-style) carry pixel coordinates that preprocessing rescales to the 0–1000 grid. The model attends jointly over all three streams.

input: tokens, bboxes (pixel → 0–1000), image
output: BIO tags per token
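
A sketch with LayoutLMv3Processor, assuming you bring your own OCR (apply_ocr=False) and boxes already on the 0–1000 grid; with the default apply_ocr=True the processor instead runs Tesseract on the image and produces words and normalized boxes itself:

python
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=9
)

image = Image.open("page.png").convert("RGB")
words = ["Invoice", "#48", "Due:", "11/28/2018"]   # your OCR output
boxes = [[70, 40, 180, 62], [190, 40, 240, 62],
         [70, 80, 130, 102], [140, 80, 250, 102]]  # 0-1000 grid

# Tokenizes, aligns boxes to subwords, and turns the page into patch tokens.
enc = processor(image, words, boxes=boxes, return_tensors="pt")
logits = model(**enc).logits   # BIO logits per token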

Training data needs — where every team hits the wall

This is the part every comparison article glosses over. All three models were published with impressive benchmark numbers on public datasets — and all three will underperform on your forms unless you fine-tune them on a large, high-quality, labeled corpus that matches your document distribution.

The public benchmarks everyone quotes are tiny. FUNSD has 199 forms total (149 train / 50 test). CORD has about 1,000 receipts. SROIE has 626 training receipts. Those sizes are fine for publishing a paper; they are nowhere near enough to productionize document AI on real invoices, claims, or tax forms.

Real fine-tuning runs want thousands to tens of thousands of labeled examples per document type, with realistic variation in layout, handwriting, stamps, stains, and edge cases. Manual annotation of real documents runs roughly $0.50–$3 per field and carries compliance risk (HIPAA, GLBA, GDPR) for anything containing actual PII. Public synthetic datasets like SynthDoG are rendered from Wikipedia pages — not invoices, not tax forms, not claims.

This is the gap SymageDocs closes. We generate pixel-perfect synthetic invoices, W-2s, 1040s, 1099s, CMS-1500s, and more — with ground-truth labels pre-formatted for every target: nested JSON for Donut synthetic data, token + layout pairs for LiLT synthetic data, and FUNSD-format data for LayoutLM. No PII, no annotation bottleneck, no volume ceiling.

For the bigger picture on training strategy across document models, see our pillar on synthetic training data for document AI.

One dataset, three model targets

The same invoice, three different model targets. This is what ends the "which format should I generate?" debate — you don't have to pick. The same labeled document can emit Donut JSON, LiLT token/bbox pairs, and LayoutLMv3 token/bbox pairs in a single generation run.

Same invoice, three targets

Form: invoice_classic. Donut JSON is the decode target; LiLT and LayoutLM feed BIO-tagged token/bbox sequences.

json
{
  "task": "invoice_parsing",
  "instance": {
    "document": {
      "date": "11/23/2020",
      "reference_number": "Invoice48"
    },
    "employer": {
      "name": "Atlas Logistics Co",
      "phone": "(858) 433-1954"
    },
    "invoice": {
      "po_number": "Po92",
      "due_date": "11/28/2018",
      "payment_terms": "Payment15",
      "line_item": {
        "amount": "73,123.17",
        "description": "Line65",
        "quantity": "Line25",
        "unit_price": "Line94"
      }
    }
  }
}
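
To make the Donut target concrete: the nested dict above is flattened into the tag sequence the decoder learns to emit, with each key k wrapped as <s_k>…</s_k>. A minimal sketch of that convention (the helper name is ours; the upstream Donut repo ships an equivalent json2token, and list-valued fields additionally need a separator token that this sketch omits):

python
def dict_to_donut_tokens(obj) -> str:
    """Serialize a nested dict into Donut's <s_key>value</s_key> tag scheme."""
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{dict_to_donut_tokens(v)}</s_{k}>" for k, v in obj.items()
        )
    return str(obj)

# Applied to the invoice above, the decode target begins:
# <s_task>invoice_parsing</s_task><s_instance><s_document><s_date>11/23/2020</s_date>...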

Note the coordinate-space nuance between the two OCR-dependent models. LiLT expects bboxes already normalized to the 0–1000 grid (the LayoutLMv1 convention it inherits). LayoutLMv3's position embeddings use the same 0–1000 scale; the difference is tooling. The Hugging Face processor's built-in OCR path (apply_ocr=True) emits normalized boxes for you, while FUNSD-style annotations ship pixel coordinates that you rescale during preprocessing.
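
If your source annotations are in pixels (as FUNSD-style data is), the rescaling is a one-liner per box. A hedged helper, assuming [x0, y0, x1, y1] boxes and known page dimensions:

python
def normalize_box(box, page_width, page_height):
    """Map a pixel-space [x0, y0, x1, y1] box onto the 0-1000 grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# normalize_box([128, 64, 512, 96], page_width=1224, page_height=1584)
# -> [104, 40, 418, 60]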

Published F1 benchmarks

Head-to-head comparison is tricky because these models were evaluated on different datasets with different task formulations. The numbers below come straight from the original papers — click through to verify.

Model | Dataset | F1 | Source
LayoutLMv3 (large) | FUNSD | 92.08 | Huang et al., 2022, "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking"
LiLT (with InfoXLM-base) | FUNSD | 88.41 | Wang et al., 2022, "LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding"
Donut | CORD | 84.1 | Kim et al., 2022, "OCR-free Document Understanding Transformer"

Read these numbers carefully: LayoutLMv3's 92.08 F1 on FUNSD and LiLT's 88.41 on the same benchmark are directly comparable — both do token-level entity recognition on the 50-form FUNSD test split. Donut's 84.1 F1 is on CORD (receipt parsing), which is a different task formulation — field-level structured extraction — so you can't line it up one-for-one with the other two. What you can take away: on English form understanding, LayoutLMv3 is the highest-F1 public result today; LiLT is close behind and trades raw F1 for language coverage; Donut trades F1 for eliminating OCR entirely.

Decision flowchart

If you just want to be told what to do, walk this flow top-to-bottom.

  1. OCR-free requirement? (Low-quality scans, no OCR budget, need a single model.)

     Yes → Donut. Stop here.

     No → continue.

  2. Multilingual corpus? (Forms in multiple languages, cross-lingual transfer, non-English target.)

     Yes → LiLT. Stop here.

     No → continue.

  3. Default: English forms, OCR available, want maximum F1.

     LayoutLMv3. Mind the CC BY-NC-SA 4.0 license on the weights for commercial use.

Whichever you pick, you still need a detector or layout pass to isolate the regions the extractor reads. For the upstream half of a two-stage pipeline, see synthetic training data for YOLOv8 document detection.

Frequently asked questions

Which model should I pick for invoice extraction — Donut, LiLT, or LayoutLM?
If you want a single OCR-free model and can afford the compute, pick Donut. If your corpus is multilingual, pick LiLT — its layout backbone transfers across languages with a swapped text encoder. For English-only forms where OCR is already in place, LayoutLMv3 is the highest-F1 option on FUNSD (92.08 with the large model) and the safest default.
Do I need OCR for LiLT and LayoutLM but not Donut?
Yes. LiLT and LayoutLM consume tokens plus bounding boxes that come from an OCR step (Tesseract, AWS Textract, Azure Read, etc.). Donut is OCR-free — it reads pixels directly and decodes structured JSON. That removes an error-propagation step but costs more compute per page at inference time.
How much training data do these models need?
All three benefit from large, high-quality, labeled form corpora. The public benchmarks are tiny — FUNSD has 199 forms and CORD has about 1,000 receipts — so real-world fine-tuning usually hits a data wall. Synthetic documents with ground-truth labels remove that ceiling and are how teams push past 90 F1 on their own document types.
Can I use the same dataset to train Donut, LiLT, and LayoutLM?
Yes, as long as the source labels carry both text content and spatial coordinates. From one labeled form you can emit Donut's nested-JSON target, LiLT's tokens with 0–1000 normalized bboxes, and LayoutLM's tokens with pixel bboxes. SymageDocs generates all three formats from the same underlying document.
What is the difference in architecture between LiLT and LayoutLMv3?
LayoutLMv3 is a single dual-stream transformer that jointly encodes text, layout, and image patches with masked modeling objectives on all three. LiLT uses a dedicated layout-only backbone wired to a swappable off-the-shelf text encoder (XLM-R, InfoXLM, etc.) via bi-directional attention complementation — which is what makes it language-agnostic.

Train whichever one you pick on enough data

SymageDocs generates Donut JSON, LiLT inputs, and LayoutLM inputs from the same synthetic forms. No PII, no manual annotation, no volume ceiling. Start with 250 free credits.

Start for free