CompareDonut

SynthDoG alternative for Donut fine-tuning

SynthDoG is the pre-training generator the Donut authors shipped with the model. It's excellent at what it does, but it was never designed to emit the structured JSON targets fine-tuning requires. Here's where it stops short, and how SymageDocs fills the gap.

If you landed here, you are probably fine-tuning Donut on a business document schema — invoices, W-2s, CMS-1500s, purchase orders — and you have realized SynthDoG alone does not give you what the fine-tuning stage needs. This page is an honest breakdown of the two tools, what each is built for, and how to stack them. No FUD. SynthDoG is open source, it works, and it comes from the Donut team. It just solves a different problem than the one most commercial teams reach for it to solve.

What SynthDoG is

SynthDoG — short for Synthetic Document Generator — was released alongside the Donut paper (Kim et al., ECCV 2022) and lives in the clovaai/donut GitHub repository. It is an MIT-licensed image generator: it samples text from a corpus, renders it onto scanned-paper-like backgrounds with realistic distortions, and emits one image per sample with a simple text annotation. Its job is to teach a Swin-Transformer-based vision encoder to read printed text at scale.

The output looks roughly like {image, text_annotations}. It is deliberately schema-free. SynthDoG does not know what a “bill_to.email” field is, and it does not try.
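For concreteness, a single SynthDoG sample looks roughly like the record below. The field names follow the metadata.jsonl convention in the clovaai/donut repo, and the text content is illustrative; verify the exact shape against your SynthDoG version:

```python
import json

# One line of SynthDoG's metadata.jsonl, reconstructed illustratively.
# The only payload is a flat text sequence: no fields, no linking.
record = {
    "file_name": "image_0.jpg",
    "ground_truth": json.dumps(
        {"gt_parse": {"text_sequence": "Invoice 48 Atlas Logistics Co 6252 Oak Street"}}
    ),
}

gt = json.loads(record["ground_truth"])
print(gt["gt_parse"]["text_sequence"])  # free-form text, nothing more
```

Note what is absent: no nested schema, no bounding boxes, no entity types. That absence is the whole gap this page is about.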

What SynthDoG is good at

  • Language and script coverage. Swap the corpus and you get a SynthDoG for Japanese, Korean, Arabic, Hindi, or any script your tokenizer covers. This is why it ships with the Donut paper — the authors needed a way to pre-train the encoder in multiple languages without sourcing real documents.
  • Volume. Pre-training wants millions of pages. Rendering is cheap relative to licensing real scans, and SynthDoG is tuned to produce that volume fast.
  • Reading-text-on-paper generalization. The paper-texture, warp, and noise augmentations teach the encoder to handle smudges, skew, and compression artifacts — the kind of distribution you get when real documents are scanned or faxed.

If you are pre-training Donut from scratch — new language, new script, new domain — SynthDoG is still the right tool. Keep using it.

Where SynthDoG stops short for fine-tuning

Fine-tuning Donut on a specific business schema is a different problem than pre-training. SynthDoG was never meant to solve it, and reaching for SynthDoG when what you actually want is fine-tuning data leaves four concrete gaps:

  • No structured JSON target. SynthDoG emits free-form text. Donut fine-tuning expects an image paired with the exact nested JSON your parser should output (Donut’s gt_parse field). That structure has to come from somewhere.
  • No field linking. A real invoice has a bill_to block: a name, an address, maybe an email, all belonging to the same logical entity. SynthDoG renders glyphs; it does not model which glyphs belong together.
  • No realistic field distributions. Invoice numbers look like invoice numbers. NPIs are ten digits with a valid checksum. Dates are dates. Random text drawn from Wikipedia does not give Donut these priors, and the model learns whatever distribution it sees.
  • No entity-linked labels. If you also want to train a downstream layout model on the same corpus — LayoutLM, LiLT, a field classifier — you need per-word bounding boxes and entity types. SynthDoG text annotations have neither.
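To make the field-distribution point concrete: a valid NPI is not ten random digits. Its last digit is a Luhn check digit computed over the first nine, prefixed with the constant "80840", per the NPI standard. A population model has to honor constraints like this; a minimal sketch:

```python
def npi_check_digit(first9: str) -> int:
    """Luhn check digit for a 9-digit NPI base, computed over the base
    prefixed with the card-issuer constant '80840' per the NPI standard."""
    digits = [int(d) for d in "80840" + first9]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 0:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

# A generator draws the first nine digits, then appends the check digit
# so downstream validators accept the value.
print("123456789" + str(npi_check_digit("123456789")))  # 1234567893, a valid NPI
```

Text sampled from Wikipedia will never teach a model this structure; a model trained on it will happily emit NPIs that fail validation.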

What fine-tuning actually needs

Donut’s fine-tuning recipe, as described in the original paper and reinforced across downstream work on CORD, SROIE, and industry invoice datasets, needs four things from each training sample:

  1. A realistic document image with print/scan-style augmentation.
  2. A structured JSON target matching the downstream schema — the exact shape the decoder should learn to emit.
  3. Field-level linking so compound entities (bill_to, ship_to, line_item) stay coherent across samples.
  4. Field distributions drawn from a population model, not random strings — checksummed IDs, plausible dates, balance-preserving line totals.

SynthDoG covers number one. The other three are what SymageDocs exists to provide.
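The second requirement deserves a note on mechanics: Donut consumes the structured JSON target as a flattened token sequence, with each key turned into wrapper tokens. Below is a simplified sketch of that json2token conversion; the donut-python reference implementation additionally registers each `<s_key>` wrapper as a special tokenizer token and controls key ordering, and the alphabetical sort here is an assumption made for determinism:

```python
def json2token(obj) -> str:
    # Flatten a nested gt_parse dict into Donut's decoder target:
    # every key k wraps its value in <s_k> ... </s_k> tokens,
    # and list items are joined with <sep/>.
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in sorted(obj.items())
        )
    if isinstance(obj, list):
        return "<sep/>".join(json2token(v) for v in obj)
    return str(obj)

target = {"document": {"date": "11/23/2020", "reference_number": "Invoice48"}}
print(json2token(target))
# <s_document><s_date>11/23/2020</s_date><s_reference_number>Invoice48</s_reference_number></s_document>
```

The decoder learns to emit exactly this sequence, which is then parsed back into JSON at inference time. This is why the training data must carry the nested structure, not just the rendered text.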

How SymageDocs fills the gap

Every SymageDocs document is generated from a schema-aware population model. The render is paired with the exact structured JSON you want Donut to learn to emit, plus per-word bounding boxes, BIO-tagged tokens, and field linkage — the raw material fine-tuning actually needs. The same artifact feeds Donut training data, a LiLT pipeline, a LayoutLM pipeline, or FUNSD-format synthetic data — one dataset, many targets.

Output shape comparison

Based on the Classic Commercial Invoice form (30 fields).

```json
// SymageDocs output (fine-tuning target)
{
  "image": "invoice_classic_000001.png",
  "donut_target_json": {
    "instance": {
      "document": {
        "date": "11/23/2020",
        "reference_number": "Invoice48"
      },
      "employer": {
        "address": {
          "full": "Pinebrook Services",
          "street": "6252 Oak Street"
        },
        "name": "Atlas Logistics Co",
        "phone": "(858) 433-1954"
      },
      "invoice": {
        "bill_to": {
          "account_number": "Customer34",
          "address": {
            "full": "Customer67",
            "street": "4108 Oak Street"
          },
          "email": "rowan@example.com",
          "name": "Customer68"
        },
        "due_date": "11/28/2018",
        "line_item": {
          "amount": "73,123.17",
          "description": "Line65",
          "quantity": "Line25",
          "unit_price": "Line94"
        },
        "payment_terms": "Payment15",
        "po_number": "Po92",
        "vendor": {
          "email": "rowan@example.com"
        }
      }
    },
    "task": "invoice_parsing"
  },
  "word_annotations": "[...per-word bbox + field_id + entity_type]",
  "bio_tokens": "[...BIO-tagged tokens with page position]",
  "form_id": "invoice_classic",
  "field_count": 30
}
```
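The word_annotations field above is summarized; if you route those per-word pixel boxes to a LayoutLM-family model, they typically need scaling to the 0-1000 integer grid those models expect. A minimal sketch, assuming boxes arrive as [x0, y0, x1, y1] in pixels:

```python
def normalize_bbox(bbox, page_width, page_height):
    # LayoutLM/LiLT expect each word box on a 0-1000 integer grid,
    # independent of the rendered page resolution.
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# e.g. a word box on a 2550 x 3300 px render (letter page at 300 DPI):
print(normalize_bbox([255, 330, 510, 396], 2550, 3300))  # [100, 100, 200, 120]
```

Because the same record carries the Donut target, the boxes, and the BIO tags, one generated dataset can feed all three pipeline styles.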

For fine-tuning

SynthDoG vs SymageDocs — fine-tuning readiness

  • SynthDoG (open source)

    Emits images + free-form text. No structured JSON target, no field linking, no entity labels.

  • Hand-annotated real documents

    Expensive, slow, privacy-sensitive, and impossible to scale to the thousands of samples Donut wants.

  • SymageDocs (commercial)

    Image + exact structured JSON target + per-word bbox/entity labels + BIO tokens, generated from realistic field distributions. Ready to drop into a Donut fine-tuning loop.

When to use each

  • Pre-training from scratch. SynthDoG: use it, this is its job. SymageDocs: overkill for this stage.
  • New language / script warmup. SynthDoG: best-in-class, just swap the corpus. SymageDocs: not the right tool here.
  • Fine-tuning on business schema. SynthDoG: missing JSON target, labels, linking. SymageDocs: purpose-built for this.
  • Evaluation / regression sets. SynthDoG: not labeled for extraction. SymageDocs: reproducible with seeded JSON.

The honest answer for most teams is “both”: SynthDoG for the pre-training warmup when you need it, SymageDocs for the fine-tuning stage where the labels matter. If you want the broader story on how this fits into document AI training data in general, the synthetic training data for document AI pillar walks through every model family we support.

Frequently asked questions

Do I need to replace SynthDoG?
No. SynthDoG is a pre-training tool from the Donut authors — it teaches a vision-encoder to read rendered text on paper-like backgrounds. If you are pre-training Donut from scratch in a new language or script, SynthDoG is still the right tool. SymageDocs is complementary: it produces the image + structured-JSON target pairs you need for the fine-tuning stage, where Donut learns your specific business document schema.
Can I combine SynthDoG and SymageDocs in one training run?
Yes, and this is a common pattern. Stage 1: pre-train (or continue pre-training) on SynthDoG-rendered pages so the encoder learns to read at scale. Stage 2: fine-tune on SymageDocs output, where every sample is an image paired with the exact structured JSON your downstream parser expects. The two tools cover different stages of the Donut training recipe and do not conflict.
What does licensing look like for each?
SynthDoG is MIT-licensed open source, maintained by the Clova AI team at Naver. SymageDocs is a commercial platform — you pay per generated document, own the output under a commercial license, and avoid the setup burden of rendering pipelines, corpus sourcing, schema authoring, and population modeling. Your Donut checkpoint and training code remain yours either way.

Ready to fine-tune Donut on real business schemas?

Generate image + structured JSON pairs from the same population model. Start with 250 free credits — no credit card required.

Start for free