ModelDonut

Synthetic training data for Donut (OCR-free)

Fine-tune Donut on business forms it will actually see in production. Every synthetic page ships with the nested, schema-rich JSON Donut's decoder is meant to emit — no OCR step, no bbox remapping, no label re-formatting.

Donut is the model that ditched OCR. Instead of a three-stage pipeline (detect text, recognize text, re-assemble into structure), it reads the pixels and writes the JSON. That elegance comes with a training data problem: Donut learns the schema from the label, so if your fine-tuning labels are shallow, the model stays shallow. This page walks through how our platform produces the deep, internally-consistent JSON Donut needs, how we compare to the SynthDoG generator Donut ships with, and exactly how to wire our output into a Hugging Face VisionEncoderDecoderModel fine-tuning loop. The example on this page is a classic commercial invoice with 30 linked fields, the kind of structured target that SynthDoG's generic text-on-image output cannot supply.

Donut: the OCR-free seq2seq model

Donut was introduced in OCR-free Document Understanding Transformer (Kim et al., ECCV 2022). Architecturally it is two pieces glued end-to-end: a Swin Transformer vision encoder that ingests a full-page document image, and a BART-style decoder that autoregressively emits a token sequence. The twist is what that token sequence represents — not transcribed text, not token-level classifications, but the target structured output itself, formatted as a serialized JSON tree.

The appeal is that Donut sidesteps cascaded error. OCR pipelines lose information between every stage: glyphs get mis-read, tokenization drifts, entity spans don't line up with layout blocks. Donut learns all of that jointly. The drawback is that the model is only as good as the structure it's trained to reproduce, and there's nowhere to hide a weak label — the decoder sees exactly what you give it, character-for-character.

That's the theme of this page. The bigger the gap between SynthDoG-style pre-training (generic) and your production data (specific business forms), the more fine-tuning matters, and the more your fine-tuning labels matter. Let's address SynthDoG head-on.

The SynthDoG question: why not just use that?

If you've read the Donut repo you already know about SynthDoG (Synthetic Document Generator). It ships in the same codebase as the model, and it's the source of the million-page multilingual corpus Donut was pre-trained on. It's a real piece of engineering, and we're not going to pretend otherwise. So the honest question is: Donut already has its own synthetic data tool — why would I use yours?

The short answer is that SynthDoG and SymageDocs solve different stages of the same problem. SynthDoG is a pre-training tool: it composites multilingual text onto document-like backgrounds at massive scale, so the vision encoder learns to read character glyphs and the decoder learns what document structure looks like in the abstract. It is not trying to match any particular downstream schema. The Donut paper calls SynthDoG's output “generic document images” for good reason — fields are not linked, amounts don't sum, vendor addresses are not coherent with vendor names. That is a feature for pre-training and a showstopper for fine-tuning.

SymageDocs is a fine-tuning tool. We generate a specific form template — this page demonstrates an invoice_classic — with a known schema, realistic values, and the JSON target Donut's decoder is meant to emit for that specific task. That's the shape of data that moves the fine-tuning loss down the curve. The workflow is complementary: start from the SynthDoG-pretrained checkpoint on Hugging Face, then fine-tune on our output for whatever business form you actually care about.

We wrote more about the trade-offs — language coverage, schema richness, volume, licensing — on our SynthDoG alternative page. For the rest of this page we assume you agree that fine-tuning matters, and we show you the mechanics.

Same model, two stages

SynthDoG vs. SymageDocs for Donut

  • SynthDoG (pre-training)

    Generic multilingual document images with no field linkage. Great for teaching glyph recognition. Wrong granularity for schema fine-tuning.

  • SynthDoG + manual labelling

    You hand-annotate SynthDoG outputs to match your schema — which defeats the point of synthetic data and introduces annotator drift.

  • Fine-tuning on real documents

    PII-heavy, expensive to annotate, legally fraught. Hard to scale beyond a few thousand pages for a new schema.

  • SymageDocs (fine-tuning)

    Schema-first synthetic data: real business form templates paired with the nested JSON target Donut is meant to emit. You skip pre-training (use the public checkpoint), then fine-tune directly on our output. No annotation, no PII.

Why structured business JSON beats raw synthetic images for fine-tuning

There are three properties of fine-tuning data that matter for a seq2seq OCR-free model like Donut, and all three break down when you try to use SynthDoG-style output for your downstream task:

  • Schema richness. Donut emits nested JSON with section-level grouping (vendor, bill-to, line items). If the label doesn't have that nesting, the decoder never learns to produce it. Our invoice example ships with a full document / employer / invoice.bill_to / invoice.line_item tree — the same shape a human reviewer would expect.
  • Field linking. The PO number on an invoice corresponds to a real bill-to customer; the line-item amount corresponds to real unit prices and quantities. SynthDoG does not model those relationships — it renders plausible-looking text. Our forms are driven by a single underlying entity model, so the relationships are coherent by construction.
  • Real-world distributions. Invoice totals follow realistic magnitudes. Dates are in realistic relative order (due date after issue date). Customer names have the right character-class distribution for the region. This is what teaches Donut to generalize rather than memorize.

None of this requires you to throw away SynthDoG-style pre-training. It just means you can't end with it. The synthetic training data for document AI pillar page unpacks the same argument for LayoutLMv3, LiLT, and the other entity-token models.

Donut's prompt → JSON output format

Donut doesn't take a natural-language prompt the way a chat model does. Instead, each fine-tuning task gets its own start token that conditions the decoder. In the reference codebase these look like <s_docvqa> for DocVQA, <s_cord> for CORD receipt parsing, and so on. The decoder is taught to emit the task-appropriate JSON between that start token and a matching end token.

For the invoice task we use on this page, the token boundaries are <s_invoice> and </s_invoice>. Internally, Donut linearizes the nested JSON by wrapping every key in its own start/end token pair — <s_vendor>Acme</s_vendor> — which is why the target sequence looks more like XML than JSON in the training logs. At inference time you reverse the process: run the decoder, collect the token stream, and parse it back into a Python dict with a small utility (the Donut codebase calls it token2json).
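As a concrete illustration, here is a minimal re-implementation of that linearization, modeled on the json2token utility in the Donut codebase (the reference version also registers every <s_key>/</s_key> pair as a new tokenizer token; treat this as a sketch, not a drop-in replacement):

python
def json2token(obj):
    # Wrap every dict key in its own <s_key>...</s_key> pair and join
    # list elements with <sep/>. Keys are sorted so the target sequence
    # is deterministic across runs.
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in sorted(obj.items())
        )
    if isinstance(obj, list):
        return "<sep/>".join(json2token(v) for v in obj)
    return str(obj)

print(json2token({"vendor": "Acme", "invoice": {"po_number": "Po92"}}))
# <s_invoice><s_po_number>Po92</s_po_number></s_invoice><s_vendor>Acme</s_vendor>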

Here's Donut-target JSON from our platform

Below is the actual output from one SymageDocs job on the invoice_classic template: the nested structure Donut's decoder should reproduce. Further down the page we show how that JSON is wrapped into a prompt → output pair for the seq2seq loss, and the code to turn a directory of SymageDocs jobs into a Hugging Face dataset.

One rendered Classic Commercial Invoice page — training input for Donut. The model reads pixels directly; no OCR, no bboxes required in the label.

Paste-ready Donut training data

Source form: Classic Commercial Invoice (30 linked fields)

json
{
  "instance": {
    "document": {
      "date": "11/23/2020",
      "reference_number": "Invoice48"
    },
    "employer": {
      "address": {
        "full": "Pinebrook Services",
        "street": "6252 Oak Street"
      },
      "name": "Atlas Logistics Co",
      "phone": "(858) 433-1954"
    },
    "invoice": {
      "bill_to": {
        "account_number": "Customer34",
        "address": {
          "full": "Customer67",
          "street": "4108 Oak Street"
        },
        "email": "rowan@example.com",
        "name": "Customer68"
      },
      "due_date": "11/28/2018",
      "line_item": {
        "amount": "73,123.17",
        "description": "Line65",
        "quantity": "Line25",
        "unit_price": "Line94"
      },
      "payment_terms": "Payment15",
      "po_number": "Po92",
      "vendor": {
        "email": "rowan@example.com"
      }
    }
  },
  "task": "invoice_parsing"
}
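Run through that linearizer, the instance above becomes the decoder's target. A short sketch, reusing json2token from the section above (raw_json is a stand-in for however you load the job output):

python
import json

gt = json.loads(raw_json)  # raw_json: the SymageDocs output shown above
target_sequence = "<s_invoice>" + json2token(gt["instance"]) + "</s_invoice>"
# Starts: <s_invoice><s_document><s_date>11/23/2020</s_date>
#         <s_reference_number>Invoice48</s_reference_number></s_document>...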

Fine-tuning pipeline: our PDFs + our JSON as ground-truth pairs

The pipeline is about as short as seq2seq fine-tuning gets. You generate N synthetic forms with output_formats=["pdf_typed", "json"], rasterize the PDFs to page images (one call to pdf2image), build a Hugging Face dataset keyed on image + ground_truth, and kick off a standard VisionEncoderDecoderModel training loop against the naver-clova-ix/donut-base checkpoint.
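Here is a minimal sketch of those middle steps, with hypothetical file names (ground_truth.json, page.pdf) standing in for however your job directories are actually laid out:

python
import json
from pathlib import Path

from datasets import Dataset
from pdf2image import convert_from_path

def load_jobs(root):
    # One SymageDocs job per directory; the file names are illustrative.
    for job in sorted(Path(root).iterdir()):
        gt = json.loads((job / "ground_truth.json").read_text())
        page = convert_from_path(str(job / "page.pdf"), dpi=200)[0]
        # json2token is the linearizer sketched earlier on this page.
        target = "<s_invoice>" + json2token(gt["instance"]) + "</s_invoice>"
        yield {"image": page, "target_sequence": target}

dataset = Dataset.from_generator(load_jobs, gen_kwargs={"root": "./symagedocs_jobs"})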

End-to-end Donut fine-tuning with SymageDocs

Three stages — we own the first two, Hugging Face owns the rest.

  1. Pick a form

    Classic Commercial Invoice

  2. Generate at volume

    pdf_typed + json, N examples

  3. Fine-tune Donut

    VisionEncoderDecoder seq2seq

VisionEncoderDecoderModel fine-tuning loop

Abbreviated — drop into your own Trainer or Accelerate setup.

python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# 1. Start from the public Donut checkpoint pre-trained on SynthDoG.
#    We keep SynthDoG's visual grounding and fine-tune on our schema.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# 2. Add our task-specific start token so the decoder knows what schema
#    to emit. This is the same mechanism CORD/DocVQA variants use.
processor.tokenizer.add_tokens(["<s_invoice>", "</s_invoice>"])
model.decoder.resize_token_embeddings(len(processor.tokenizer))
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids("<s_invoice>")

# 3. Training step — image is our rendered PDF page, labels are the
#    tokenized target_sequence built from our JSON.
def collate(batch):
    pixel_values = processor(
        [b["image"] for b in batch], return_tensors="pt"
    ).pixel_values
    targets = processor.tokenizer(
        [b["target_sequence"] for b in batch],
        padding="longest",
        return_tensors="pt",
    ).input_ids
    targets[targets == processor.tokenizer.pad_token_id] = -100
    return {"pixel_values": pixel_values, "labels": targets}

# Plug collate() into Trainer or Accelerate — standard seq2seq loop.
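At inference time you reverse the linearization. A sketch of that loop, reusing the processor and model from above; the cleanup steps mirror the official Donut examples, and token2json is the utility transformers ships on DonutProcessor:

python
import re

from PIL import Image

image = Image.open("invoice_page.png")  # hypothetical held-out page
pixel_values = processor(image, return_tensors="pt").pixel_values
prompt_ids = processor.tokenizer(
    "<s_invoice>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_length=768)
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1)  # strip the task start token
print(processor.token2json(sequence))  # nested dict, ready to diff against gt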

If your target task is token-level instead of end-to-end seq2seq — for example, you want a per-field bounding box with each extracted value — Donut is the wrong model. Reach for the LiLT training data pathway instead, or start from the FUNSD benchmark format and fine-tune a LayoutLMv3-class model. Donut is the right answer when you want a single end-to-end model that emits JSON directly, and you're comfortable with the trade-off that you can't easily introspect per-field confidence.

Side-by-side: training Donut on SynthDoG vs. SymageDocs

Qualitative comparison across the axes that actually matter when you are deciding how to produce fine-tuning data for a production-facing Donut deployment.

  • Primary use. SynthDoG: pre-training (glyph recognition, language coverage). SymageDocs: fine-tuning (business schema, JSON output).
  • JSON schema. SynthDoG: flat text blocks, no nested fields. SymageDocs: nested, form-specific; 30 linked fields in this example.
  • Field linking. SynthDoG: none — composited glyphs have no semantic relationship. SymageDocs: vendor ↔ bill-to and line items ↔ totals, coherent by construction.
  • Realism of values. SynthDoG: sampled from Wikipedia corpora — plausible strings. SymageDocs: domain-realistic amounts, dates, addresses, terms.
  • Template fidelity. SynthDoG: generic document backgrounds, no real form layout. SymageDocs: actual W-2, CMS-1500, I-9, invoice templates.
  • Best checkpoint to start from. SynthDoG: train from scratch (rarely what you want). SymageDocs: naver-clova-ix/donut-base, fine-tuned on our data.

FAQ

What is Donut, and why is it called OCR-free?
Donut (Document Understanding Transformer, Kim et al. ECCV 2022) is an end-to-end seq2seq model that reads a document image and emits a structured JSON string directly. It has no OCR stage, no bounding-box regression head, and no separate layout encoder — just a Swin-style vision encoder feeding a BART-style text decoder. "OCR-free" means the model learns character recognition implicitly as a byproduct of predicting the target JSON.
How is this different from SynthDoG?
SynthDoG (Synthetic Document Generator) is the rasterizer that ships inside the Donut paper's repo. It composites synthetic text onto background images in many languages to create generic pre-training material. SymageDocs generates fully-realized business forms — invoices, W-2s, CMS-1500s, I-9s — with the structured JSON that Donut's decoder is supposed to emit. SynthDoG is for pre-training warm-up; we are for the fine-tuning step that actually teaches the model your target schema.
Do I still need SynthDoG if I use SymageDocs?
No, because the public naver-clova-ix/donut-base checkpoint on Hugging Face already contains the SynthDoG pre-training. You start from that checkpoint and fine-tune on our data. You do not need to run SynthDoG yourself unless you are training a brand-new Donut from random weights, which almost nobody should do.
What format do I need my labels in for a Donut fine-tune?
Donut expects a JSON object under a gt_parse key, which it then linearizes into a token sequence wrapped in schema-specific start and end tokens (for example, <s_invoice>...</s_invoice>). Our platform emits the JSON directly — you pass it through json.dumps and you are done. No bbox remapping, no IOB tagging, no FUNSD-style entity linking required.
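If you use the reference codebase's DonutDataset loader instead of your own collate, each image is paired with one line of a metadata.jsonl in which ground_truth is a JSON-encoded string. A sketch of one such line (file name hypothetical, parse abbreviated):

json
{"file_name": "invoice_0001.png", "ground_truth": "{\"gt_parse\": {\"invoice\": {\"po_number\": \"Po92\"}}}"}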
How many examples do I need to fine-tune Donut for a new document type?
The original Donut paper fine-tunes on CORD (800 training examples) and DocVQA (around 40k), so the practical range is wide. For a single structured business form with a stable schema — like the invoice layout this page demonstrates — teams typically see usable accuracy at 500–2,000 examples and plateau-quality results around 5,000–10,000. SymageDocs can produce that volume in an afternoon with fully deterministic ground truth.
Why is JSON from a real form better than SynthDoG's synthetic text blocks?
SynthDoG produces images that look like documents but do not have the field relationships that power downstream extraction — no vendor-to-bill-to linkage, no line-item grouping, no address grounded in a real schema. Our output is a real business form template populated with internally consistent values, and the JSON encodes the same nested structure you would want Donut to predict at inference time. Fine-tuning on that teaches the decoder real schema shapes, not just glyph recognition.

Ready to fine-tune Donut on data that matches your schema?

Generate thousands of schema-rich training examples on our free tier. No credit card required.