Synthetic training data for document AI
Pixel-perfect labeled documents for every major document-AI architecture — YOLOv8, LiLT, Donut, LayoutLM, and BIO-tagging NER — generated as one training-ready payload of image, structured JSON, BIO tokens, and entity-linked word annotations. This is the home for everything about synthetic training data for document AI at SymageDocs.
Why synthetic document data exists
Document understanding is bottlenecked by data. The public benchmarks most teams fall back on were built to measure progress, not to train production systems. FUNSD, the Form Understanding in Noisy Scanned Documents dataset published by Jaume, Ekenel, and Thiran at ICDAR-OST 2019, contains exactly 199 forms — 149 training and 50 test. CORD, the Consolidated Receipt Dataset from Park et al., contains roughly 1,000 receipts. SROIE, the ICDAR-2019 Scanned Receipt OCR and Information Extraction challenge, ships 973 scanned receipts. IIT-CDIP Test Collection 1.0, the largest public document corpus, is unlabeled.
Those volumes are useful as yardsticks. They are not enough to train a production document extractor on a new form type. A single enterprise workflow — a W-2 ingestion pipeline, a CMS-1500 claims classifier, a 1040 field extractor — routinely needs thousands to tens of thousands of labeled examples to reach usable accuracy on the long tail of layouts, fonts, stamp overlays, and field combinations that real documents contain.
Collecting that volume from real documents runs into three walls. First, privacy: tax returns, medical claims, and identity documents carry regulated PII, PHI, and financial data that cannot simply be uploaded to a labeling vendor. Second, cost: manual bounding-box and entity annotation on dense forms is the expensive kind of human labeling — minutes per page, not seconds — and a modest training set can burn five-figure budgets before the first model converges. Third, coverage: real-document corpora reflect the sampling bias of whoever collected them. You get whatever forms that team happened to scan, not the full distribution of layouts your production system will see.
Synthetic document generation flips all three constraints. There is no real PII to exfiltrate because every field value is drawn from a coherent fictitious identity. There is no human labeler because the generator emits the ground truth — it knows what it drew, down to the pixel. And the distribution is yours to control: you can oversample the layouts, fonts, and anomalies your downstream pipeline struggles with.
What “training-ready” actually means
Most synthetic data tools stop at an image. A TrainingExample from SymageDocs is a bundle: the rendered page image, the structured JSON that produced it, token-level BIO tags aligned to words, and word-level annotations with field linkage. One API call, one payload, every downstream format derivable from it.
The example below is drawn from form irs_w2_2024 (W-2 Wage and Tax Statement). In this single example the generator emitted 47 tokens spanning 4 distinct entity labels, each tied back to a specific field_id on the source form. The JSON tab shows the word-annotation view; the BIO tokens tab shows the same content projected into CoNLL token-classification format; the Word annotations tab is the full annotation array you’d feed to a layout model.

One TrainingExample, three views
Same data, every schema a downstream model expects.
[
{
"entity_type": "answer",
"field_id": "employee_ssn",
"field_type": "other",
"height": 12.04,
"linked_field_id": null,
"page": 1,
"text": "621-56-9071",
"width": 119.3,
"x": 153.18,
"y": 48
},
{
"entity_type": "answer",
"field_id": "employer_ein",
"field_type": "other",
"height": 12.04,
"linked_field_id": null,
"page": 1,
"text": "31-3044107",
"width": 277.62,
"x": 38.01,
"y": 71.99
},
{
"entity_type": "answer",
"field_id": "box1_wages",
"field_type": "other",
"height": 12.04,
"linked_field_id": null,
"page": 1,
"text": "50,297.29",
"width": 112.5,
"x": 333.17,
"y": 71.99
},
{
"entity_type": "answer",
"field_id": "box2_fed_tax",
"field_type": "other",
"height": 12.04,
"linked_field_id": null,
"page": 1,
"text": "39,214.42",
"width": 112.5,
"x": 455.57,
"y": 71.99
},
{
"entity_type": "answer",
"field_id": "employer_name_address",
"field_type": "other",
"height": 60.03,
"linked_field_id": null,
"page": 1,
"text": "Meridian",
"width": 92.54,
"x": 38.01,
"y": 95.99
},
{
"entity_type": "answer",
"field_id": "employer_name_address",
"field_type": "other",
"height": 60.03,
"linked_field_id": null,
"page": 1,
"text": "Labs",
"width":
// ... truncatedEvery downstream dataset format this page links to is a view over this payload. You never need to regenerate your documents to switch from BIO NER to FUNSD layout — the word annotations and BIO tokens you see above already contain everything required to emit both.
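To make that concrete, here is a minimal sketch that works directly against the word-annotation view: it loads a saved copy of the array above and regroups words by field_id to reassemble each field's value. The file name is a placeholder, and the annotation keys are the ones shown in the sample.

```python
import json
from collections import defaultdict

# Load a saved word-annotation array (the JSON view shown above).
# "w2_example.json" is a placeholder path, not an SDK artifact.
with open("w2_example.json") as f:
    words = json.load(f)

# Group words by the field they belong to on the source form.
fields = defaultdict(list)
for w in words:
    fields[w["field_id"]].append(w)

# Reassemble each field's value in reading order (page, then top-to-bottom, left-to-right).
for field_id, ws in fields.items():
    ws.sort(key=lambda a: (a["page"], a["y"], a["x"]))
    print(field_id, "->", " ".join(a["text"] for a in ws))
```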
Compatible models and output formats
Every model family below consumes a different input schema. Rather than translate by hand, pull the emitted format straight from the SDK. Each row links to a dedicated page with a copy-paste loader, fixture output, and notes on what that model cares about.
| Model / format | Output shape | Best for | Page |
|---|---|---|---|
| YOLOv8 document detection | YOLO txt + data.yaml | Normalized bounding boxes and a paste-ready Ultralytics data.yaml for region-level detectors. | synthetic training data for YOLOv8 |
| BIO / IOB2 token sequences | CoNLL TSV + JSONL | CoNLL-style BIO tags aligned to word tokens — drop-in for Hugging Face token-classification trainers. | BIO-tagged synthetic NER data |
| FUNSD-format entities and links | FUNSD JSON | Question / answer / header entities with word-level boxes and linking edges in the exact FUNSD JSON schema. | FUNSD-format synthetic data |
| LiLT-ready token input | LiLT tokens + bboxes | Tokens with 0-1000 normalized bboxes and aligned labels — exactly the input LiLT's text-plus-layout encoder expects. | synthetic training data for LiLT |
| Donut target JSON | Donut target JSON | End-to-end image-to-JSON targets with the task prompt Donut consumes during supervised fine-tuning. | synthetic training data for Donut |
How one annotation maps to every schema
The canonical unit in our output is the WordAnnotation. Every word the renderer places on the page carries its text, its pixel bbox (x, y, width, height), a page number, a field_id tying it back to the source form definition, an entity_type (e.g. answer, label, header), and a linked_field_id when the word is a label pointing at a value field. Every downstream model format is a projection of that core record.
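For concreteness, that record can be sketched as a plain TypedDict. The class name is ours for illustration, and the field names mirror the sample payload above.

```python
from typing import Optional, TypedDict

class WordAnnotation(TypedDict):
    """One rendered word plus its ground-truth labels (illustrative, not an SDK type)."""
    text: str                       # the word as drawn on the page
    x: float                        # pixel bbox: left edge
    y: float                        # pixel bbox: top edge
    width: float                    # pixel bbox width
    height: float                   # pixel bbox height
    page: int                       # 1-based page number
    field_id: str                   # key into the source form definition
    field_type: str                 # coarse field category, e.g. "other"
    entity_type: str                # "answer", "label", "header", ...
    linked_field_id: Optional[str]  # set when a label word points at a value field
```

Each projection below is then just a pure function over a list of these records.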
- FUNSD linking. Words with entity_type = "label" become question entities; their linked words become answer entities; the linked_field_id edges become the FUNSD linking pairs.
- LiLT input. Word text becomes the token sequence, bboxes are normalized into LiLT's [x0, y0, x1, y1] 0-1000 scale, and the entity type becomes the per-token class label.
- Donut prompt. The structured answer tree is serialized into Donut's <s_key>value</s_key> target grammar, keyed by field_id. No bboxes are needed — Donut is OCR-free.
- YOLO labels. Word bboxes are grouped by field_type to form class regions, then normalized to YOLO's (cls, cx, cy, w, h) 0-1 space using the page dimensions the document was rendered at.
- BIO sequence. The word list is tokenized in reading order; the first word of each entity span gets a B- tag, continuations get I-, and words outside any labeled span get O. The canonical BIO sample above is generated directly from the same fixture.
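Two of those projections are light enough to sketch inline. The helpers below are written against the word-annotation schema shown earlier; the handling of unlabeled words (treated here as an empty entity_type) and the page size in the usage comment are assumptions, since the real values ship with each render.

```python
def to_lilt_bbox(w, page_width, page_height):
    """Normalize a pixel bbox into the 0-1000 [x0, y0, x1, y1] scale LiLT expects."""
    return [
        int(1000 * w["x"] / page_width),
        int(1000 * w["y"] / page_height),
        int(1000 * (w["x"] + w["width"]) / page_width),
        int(1000 * (w["y"] + w["height"]) / page_height),
    ]

def to_bio_tags(words):
    """Assign B-/I-/O tags in reading order; a new B- tag starts whenever field_id changes."""
    tags, prev_field = [], None
    for w in sorted(words, key=lambda a: (a["page"], a["y"], a["x"])):
        if not w.get("entity_type"):          # assumed marker for unlabeled words
            tags.append("O")
            prev_field = None
        elif w["field_id"] == prev_field:     # continuation of the current span
            tags.append(f"I-{w['entity_type'].upper()}")
        else:                                 # first word of a new span
            tags.append(f"B-{w['entity_type'].upper()}")
            prev_field = w["field_id"]
    return tags

# Usage against the `words` list from the earlier sketch; 612x792 assumes a
# US-Letter page in points, the real dimensions come with each render.
# boxes = [to_lilt_bbox(w, 612, 792) for w in words]
# labels = to_bio_tags(words)
```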
Because every format is derived from one underlying annotation graph, switching architectures is free. You don’t have to regenerate your batch to experiment with a Donut target instead of a LiLT tagger; one call gives you both.
Scale economics: volume, cost, privacy
The math for synthetic is straightforward and ugly for the alternative.
Volume. A labeled FUNSD-scale dataset of 199 forms takes minutes to generate and a few hundred credits to render. A dataset an order of magnitude larger than CORD’s 1,000 receipts costs less than the coffee budget for a single human labeling session. The constraint on dataset size stops being your bank account and becomes your training infrastructure.
Cost. Published figures for professional document annotation vary, but dense forms typically take a few minutes per page and cost a few tens of cents per labeled field. For a W-2 with dozens of labeled fields, that's real dollars per document before quality review. The synthetic-equivalent marginal cost is whatever a single render costs — orders of magnitude less, and the labels come out right the first time.
Privacy. A real document dataset imports every compliance obligation the source documents carried. Real W-2s bring tax-data handling rules. Real CMS-1500 claims bring HIPAA obligations. Real driver’s licenses bring identity-document handling requirements. Synthetic documents bring none of them, because no real person’s data ever touched the pipeline.
Iteration speed. Real datasets are static snapshots. Need to stress-test your model against a new layout variant, a new optional field, or a new date format? With synthetic data you change the form definition and regenerate in minutes. With real data you launch another labeling project and wait.
Common pipelines: two-stage and end-to-end
Most document-AI systems in production fall into one of two architectures. Synthetic data slots naturally into both.
Two-stage pipeline: detection then extraction
The classical pattern is a detector that finds field regions followed by an extractor that reads structured values from those regions. A YOLOv8 region detector trained on synthetic bounding boxes localizes header, line-item, total, and party regions on an invoice; a LiLT or LayoutLMv3 tagger then consumes each region and emits structured key-value pairs. For the detection half, see our guide on synthetic training data for YOLOv8. For the extraction half, see synthetic training data for LiLT.
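For the detection half, a rough sketch of the pattern with the Ultralytics API looks like this. The weights file and class names are placeholders standing in for a YOLOv8 model fine-tuned on synthetic region boxes.

```python
from PIL import Image
from ultralytics import YOLO

# Stage 1: region detection. "invoice_regions.pt" is a placeholder for a YOLOv8
# checkpoint fine-tuned on synthetic region boxes.
detector = YOLO("invoice_regions.pt")
page = Image.open("invoice_page.png")
result = detector(page)[0]

# Crop each detected region so the extractor only sees the pixels it needs.
crops = []
for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
    x0, y0, x1, y1 = (int(v) for v in box)
    crops.append((result.names[int(cls)], page.crop((x0, y0, x1, y1))))

# Stage 2: each crop goes to a layout-aware extractor (LiLT / LayoutLMv3) that
# emits key-value pairs; see the LiLT guide linked above.
```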
End-to-end pipeline: OCR-free image-to-JSON
The newer approach skips OCR entirely. Donut, Pix2Struct, and related vision-language models take the raw page image and emit structured JSON directly, bypassing the OCR-then-parse pipeline. These models need a lot of training data to learn both the visual reading and the structured decoding, which is exactly the kind of volume synthetic data unlocks. For the full end-to-end pattern, see synthetic training data for Donut.
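The supervision target for this pattern is just a string. Here is a minimal sketch of serializing extracted field values into a Donut-style <s_key>value</s_key> target; the task-prompt token and the helper name are illustrative, not part of a fixed SDK.

```python
def to_donut_target(fields: dict, task_prompt: str = "<s_w2>") -> str:
    """Serialize a (possibly nested) field dict into a Donut-style training target.
    The task prompt token is illustrative; Donut learns to decode this string from pixels."""
    def serialize(node: dict) -> str:
        parts = []
        for key, value in node.items():
            inner = serialize(value) if isinstance(value, dict) else str(value)
            parts.append(f"<s_{key}>{inner}</s_{key}>")
        return "".join(parts)
    return task_prompt + serialize(fields) + "</s>"

# Values taken from the sample W-2 payload above.
target = to_donut_target({
    "employee_ssn": "621-56-9071",
    "employer_ein": "31-3044107",
    "box1_wages": "50,297.29",
})
# -> "<s_w2><s_employee_ssn>621-56-9071</s_employee_ssn>...</s>"
```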
A third pattern worth mentioning is the classification head: a lightweight model that decides which template a document belongs to before routing it to a specialized extractor. Synthetic data is especially useful here because you control the class balance exactly — real-world document streams are always skewed, and a classifier trained on the real skew will quietly misclassify the long tail.
Compliance and privacy
Synthetic document data exists outside the regulatory frameworks that govern real personal and financial records. That’s not a loophole — it’s the direct consequence of the data never having described a real person.
Every identity in a SymageDocs generation batch is drawn from a coherent fictitious persona. The name on a synthetic W-2 is fictional; the SSN is a syntactically valid but unassigned value; the EIN, address, and wage numbers are generated so the overall record looks plausible but corresponds to no real taxpayer. The same applies to CMS-1500 patients, driver’s licenses, passports, and every other form class we support.
The practical consequences:
- HIPAA. Synthetic CMS-1500 and healthcare claims data contain no PHI by construction, so training a claims model on them does not trigger HIPAA handling obligations.
- GLBA / financial data. Synthetic tax and banking documents carry no customer information, so they sidestep the GLBA Safeguards Rule entirely.
- GDPR / CCPA. Synthetic data is not personal data. It can be moved across borders, shared with vendors, and retained indefinitely without DSAR, erasure, or processing-basis obligations.
- Model sharing. A model trained on synthetic data can be published, open-sourced, or handed to a partner without the reidentification risk that comes with models memorizing rows of real training data.
Further reading
These support pages go deeper than the pillar allows:
- Donut vs LiLT vs LayoutLM comparison
When to pick an OCR-free image-to-JSON model versus a layout-aware token classifier, with notes on latency, label-efficiency, and failure modes.
- SynthDoG alternative
Why random-text-on-backgrounds generators stop at OCR pretraining, and what changes when you want structured labels alongside the rendered image.
Frequently asked questions
- What is synthetic training data for document AI?
- Synthetic training data for document AI is algorithmically generated document images paired with ground-truth labels — bounding boxes, field values, token-level entity tags, and linking edges — produced without scanning or transcribing any real-world documents. Because the generator knows every label it placed, the annotations are pixel-perfect by construction rather than reconstructed by a human labeler.
- Why not just train on real documents like FUNSD or CORD?
- FUNSD (Jaume et al., 2019) contains only 199 noisy scanned forms and CORD (Park et al., 2019) contains roughly 1,000 receipts. Both are widely used benchmarks, not production-scale training sets. Most teams that reach a deployable model supplement real data with synthetic data to cover long-tail layouts, unseen field combinations, and distribution shift.
- Is synthetic document data good enough on its own?
- For many tasks, yes — particularly schema-driven extraction on well-defined forms like tax documents, insurance claims, and structured invoices. For OCR on degraded scans, a synth-first, real-finetune approach typically works best: pretrain on synthetic for scale, then adapt to a small, representative real-document set to close the domain gap.
- Which model architectures can I train on this data?
- Every major document-AI architecture: YOLOv8 and similar detectors for region detection, LiLT and LayoutLMv3 for layout-aware token classification, Donut and Pix2Struct for end-to-end OCR-free extraction, and any BIO-tagging NER model built on BERT, RoBERTa, or DeBERTa backbones. Each shipping page below includes copy-paste loaders for that model family.
- How does this compare to SynthDoG?
- SynthDoG (the Donut synthetic dataset generator) composites random text over background images. It's useful for OCR pretraining but produces no structured labels — no field types, no linking, no entity spans. SymageDocs emits real form layouts with field-accurate ground truth that works for detection, NER, layout models, and end-to-end models alike.
- How do I avoid leaking PII when training document models?
- Train on synthetic data. SymageDocs draws every value (names, SSNs, EINs, dollar amounts, addresses) from coherent fictitious identities — no record traces back to a real person. That means no HIPAA obligations, no GLBA handling requirements, no residual de-identification risk, and no data-sharing agreements needed to move the dataset across teams or vendors.
- Can I use synthetic data commercially?
- Yes. Our generated documents and their labels are free for you to use in commercial model training, internal tooling, and downstream products. No viral licensing, no attribution requirements on the model weights themselves.
- How much data do I need to fine-tune a document model?
- A useful starting point is 2,000-5,000 labeled examples per document class for token-classification fine-tuning, and 5,000-20,000 for end-to-end models like Donut. Synthetic generation makes those volumes trivially reachable — generating 5,000 labeled W-2s is a batch job, not a six-month annotation project.
Start generating synthetic training data for document AI
250 credits free on signup. No credit card. Every format above emitted from one API call.