Dataset format: FUNSD

Synthetic training data in FUNSD format

The FUNSD benchmark ships 199 scanned forms. That is not enough to train a production form-understanding model. SymageDocs emits the exact same JSON schema — question / answer / header entities with word-level bboxes and explicit linking — at unlimited scale, from modern US forms, with zero PII risk.

What FUNSD is

FUNSD stands for “Form Understanding in Noisy Scanned Documents.” It was introduced by Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran in the 2019 paper “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents” at ICDAR-OST 2019. The dataset is a carefully annotated subset of the RVL-CDIP form category: 199 fully labeled scanned pages drawn from older, noisy US business forms, most of them from the 1980s and 1990s.

Each page is represented by a single JSON file that lists entities — semantic groupings of words — with four possible labels: question, answer, header, and other. Each entity carries its own bounding box, a list of constituent words (each with their own box), and an explicit linking array that ties questions to their answers. That last piece — the question → answer graph — is what made FUNSD the de-facto benchmark for form understanding and what later models like LayoutLMv3, LiLT, and XDoc all evaluate against.

FUNSD is a benchmark, not a model. When you see a paper report “F1 on FUNSD,” that is a number measured by training on the 149-page train split and predicting on the 50-page test split. That small size is the crux of the problem this page addresses.

The original release sat alongside LayoutLM and helped define the modern “layout-aware language model” wave. Since then it has shown up as the headline evaluation for LayoutLMv2, LayoutLMv3, LiLT, StrucTexT, XDoc, GeoLayoutLM, and DocFormer — basically every form-understanding paper measures against those 199 pages. That entrenchment is why the shape of FUNSD became the lingua franca of form annotation, even for teams that have no interest in the original scans.

Why FUNSD alone isn't enough

FUNSD is brilliant as a benchmark. As training data, it is painfully thin. Three hard limits matter in practice:

  • Scale. 199 pages total, 149 for training. Modern layout-aware transformers, with parameter counts in the hundreds of millions, overfit this within a handful of epochs. To generalize, you need orders of magnitude more pages.
  • Domain. The underlying pages are photocopied, faxed, and typewriter-era business forms. If your production use case is modern IRS, CMS, USCIS, or ACORD forms, FUNSD's typography, layout density, and noise profile are a poor match.
  • Language. FUNSD is English-only. There is no native support for Spanish tax forms, Chinese medical intake, or any multilingual layout understanding.

These aren't gaps anyone ever claimed FUNSD would fill — it's a measurement instrument, not a training corpus. The good news: the shape of FUNSD — the JSON schema, the labels, the linking graph — is the right shape for training. It just needs to be filled with a lot more pages. That is exactly what SymageDocs does.

Teams looking for a larger sibling sometimes reach for FUNSD+ (Konfuzio, 2022), which re-annotates additional scans and brings the corpus to roughly 1,500 pages. That helps, but the underlying pages are still photocopied legacy business forms with the same narrow domain and English-only assumption. Even a 7x bigger FUNSD is orders of magnitude smaller than what you'd pretrain against in any other NLP or vision setting, and it still doesn't look like the documents real users will send your model in production.

The FUNSD JSON schema, walked through

A FUNSD document is one JSON file per page. The top level is a single key:

{
  "form": [ <entity>, <entity>, ... ]
}

Each entity has six fields. Knowing them by heart makes the rest of this page much easier:

  • id — an integer, unique per page. Used as the endpoint for linking.
  • text — the entity's surface text (joined words).
  • box — four integers, [x0, y0, x1, y1] in image pixels (top-left origin).
  • label — one of question, answer, header, other.
  • words — a list of { text, box } objects, one per token.
  • linking — a list of [from_id, to_id] pairs connecting this entity to others (typically a question pointing at its answer).

That is the entire schema. Nothing else. Every form understanding pipeline in the last five years has been trained against some variation of it.
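To make the schema concrete, here is a small sketch that walks one FUNSD-format file and prints each question alongside the answer it links to. The file name is a placeholder; everything else follows the six fields above.

```python
import json

# Placeholder path: any single-page FUNSD-format file works here.
with open("page_0001.funsd.json") as f:
    entities = json.load(f)["form"]

by_id = {ent["id"]: ent for ent in entities}

# Follow each question's outgoing edge to the answer it was rendered with.
for ent in entities:
    if ent["label"] != "question":
        continue
    for from_id, to_id in ent["linking"]:
        if from_id == ent["id"]:
            print(f'{ent["text"]} -> {by_id[to_id]["text"]}')
```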

Here's FUNSD-format data from our platform

Below is a single synthetic I-9 instance generated from our renderer, exported in native FUNSD shape. This one page carries 80 entities and 80 linking pairs — compare that to the 199-page source dataset's per-page density. The JSON tab shows the first 8 entities verbatim; the Python tabs show how to generate the same output at scale and feed it straight into a LayoutLMv3 fine-tune.

[Figure: 8 regions overlaid. I-9 form with FUNSD question/answer entities shown as colored boxes. Teal boxes are question entities; magenta are answers. Overlay derived from the first 8 FUNSD entities in the fixture.]

Single-page FUNSD export

i9_2024, seed=99. First 8 entities shown; full page emits all entities.

{
  "form": [
    {
      "box": [
        42.72,
        156.76,
        198.4,
        172
      ],
      "id": 0,
      "label": "question",
      "linking": [
        [
          0,
          1
        ]
      ],
      "text": "Section 1: Last Name (Family Name)",
      "words": [
        {
          "box": [
            42.72,
            156.76,
            67.37,
            172
          ],
          "text": "Section"
        },
        {
          "box": [
            68.67,
            156.76,
            93.32,
            172
          ],
          "text": "1:"
        },
        {
          "box": [
            94.62,
            156.76,
            119.27,
            172
          ],
          "text": "Last"
        },
        {
          "box": [
            120.56,
            156.76,
            145.21,
            172
          ],
          "text": "Name"
        },
        {
          "box": [
            146.51,
            156.76,
            171.16,
            172
          ],
          "text": "(Family"
        },
        {
          "box": [
            172.45,
            156.76,
            197.1,
            172
          ],
          "text": "Name)"
        }
      ]
    },
    {
      "box": [
        42.72,
        172,
        198.4,
        187.24
      ],
      "id": 1,
      "label": "answer",
      "linking": [
        [
          0,
          1
        ]
      ],
      "text": "Patel",
      "words": [
        {
          "box": [
            42.72,
            172,
            190.62,
            187.24
          ],
          "text": "Patel"
        }
      ]
    },
    {
      "box": [
        204.16,
        156.77,
        341.84,
        172
      ],
      "id": 2,
      "label": "question",
      "linking": [
        [
          2,
          3
        ]
      ],
      "text": "Section 1: First Name (Given Name)",
      "words": [
        {
          "box": [
            204.16,
            156.77,
            225.96,
            172
          ],
          "text": "Section"
        },
        {
          "box": [
            227.11,
            156.77,
            248.91,
            172
          ],
          "text": "1:"
        },
        {
          "box": [
            250.06,
            156.77,
            271.85,
            172
          ],
          "text": "First"
        },
        {
          "box": [
            273,
            156.77,
            294.8,
            172
          ],
          "text": "Name"
        },
        {
          "box": [
            295.95,
            156.77,
            317.75,
            172
          ],
          "text": "(Given"
        },
        {
          "box": [
            318.89,
            156.77,
            340.69,
            172
          ],
          "text": "Name)"
        }
      ]
    },
    {
      "box": [
        204.16,
        172,
        341.84,
        187.24
      ],
      "id": 3,
      "label": "answer",
      "linking": [
        [
          2,
          3
        ]
      ],
      "text": "Casey",
      "words": [
        {
          "box": [
            204.16,
            172,
            334.96,
            187.24
          ],
          "text": "Casey"
        }
      ]
    },
    {
      "box": [
        348.44,
        156.77,
        413.84,
        172
      ],
      "id": 4,
      "label": "question",
      "linking": [
        [
          4,
          5
        ]
      ],
      "text": "Section 1: Middle Initial",
      "words": [
        {
          "box": [
            348.44,
            156.77,
            363.98,
            172
          ],
          "text": "Section"
        },
        {
          "box": [
            364.79,
            156.77,
            380.32,
            172
          ],
          "text": "1:"
        },
        {
          "box": [
            381.14,
            156.77,
            396.67,
            172
          ],
          "text": "Middle"
        },
        {
          "box": [
            397.49,
            156.77,
            413.02,
            172
          ],
          "text": "Initial"
        }
      ]
    },
    {
      "box": [
        348.44,
        172,
        413.84,
        187.24
      ],
      "id": 5,
      "label": "answer",
      "linking": [
        [
          4,
          5
        ]
      ],
      "text": "E",
      "words": [
        {
          "box": [
            348.44,
            172,
            410.57,
            187.24
          ],
          "text": "E"
        }
      ]
    },
    {
      "box": [
        420.44,
        156.77,
        576.12,
        172
      ],
      "id": 6,
      "label": "question",
      "linking": [
        [
          6,
          7
        ]
      ],
      "text": "Section 1: Other Last Names Used",
      "words": [
        {
          "box": [
            420.44,
            156.77,
            445.09,
            172
          ],
          "text": "Section"
        },
        {
          "box": [
            446.39,
            156.77,
            471.04,
            172
          ],
          "text": "1:"
        },
        {
          "box": [
            472.34,
            156.77,
            496.99,
            172
          ],
          "text": "Other"
        },
        {
          "box": [
            498.28,
            156.77,
            522.93,
            172
          ],
          "text": "Last"
        },
        {
          "box": [
            524.23,
            156.77,
            548.88,
            172
          ],
          "text": "Names"
        },
        {
          "box": [
            550.17,
            156.77,
            574.82,
            172
          ],
          "text": "Used"
        }
      ]
    },
    {
      "box": [
        420.44,
        172,
        576.12,
        187.24
      ],
      "id": 7,
      "label": "answer",
      "linking": [
        [
          6,
          7
        ]
      ],
      "text": "Kim",
      "words": [
        {
          "box": [
            420.44,
            172,
            568.34,
            187.24
          ],
          "text": "Kim"
        }
      ]
    }
  ]
}

Field-level linking, preserved natively

The hardest part of FUNSD — harder than the labels, harder than the boxes — is the linking graph. If you scrape OCR output from real scans, the question → answer edges have to be recovered heuristically (proximity, colon detection, layout columns), which is noisy and an entire research sub-field on its own.

SymageDocs sidesteps the problem entirely. Our rendering engine already knows, at layout time, which label goes with which input — it has to, in order to fill the form correctly. Every word that lands on the page carries a field_id, an entity_type (question / answer / header / other), and, when it applies, a linked_field_id pointing at its counterpart. Our FUNSD emitter then does a trivial 1:1 mapping:

| SymageDocs WordAnnotation | FUNSD entity field |
| --- | --- |
| field_id (hashed) | id |
| entity_type | label |
| (x, y, width, height) | box [x0, y0, x1, y1] |
| text + per-word box | words[] |
| linked_field_id | linking [[from_id, to_id]] |
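As a rough illustration (the WordAnnotation shape below is a stand-in for explanation, not our internal type), the emitter's job reduces to grouping words by field and emitting one entity per group:

```python
from dataclasses import dataclass

@dataclass
class WordAnnotation:
    # Illustrative stand-in for a renderer word record; names follow the
    # mapping table above, not any published SymageDocs type.
    field_id: int
    entity_type: str              # question / answer / header / other
    text: str
    x: float
    y: float
    width: float
    height: float
    linked_field_id: int | None = None

def to_funsd_entities(words_by_field: dict[int, list[WordAnnotation]]) -> list[dict]:
    entities = []
    for field_id, words in words_by_field.items():
        boxes = [[w.x, w.y, w.x + w.width, w.y + w.height] for w in words]
        label = words[0].entity_type
        linked = words[0].linked_field_id
        if linked is None:
            linking = []
        elif label == "question":
            linking = [[field_id, linked]]    # question -> answer
        else:
            linking = [[linked, field_id]]    # keep pairs question-first
        entities.append({
            "id": field_id,
            "text": " ".join(w.text for w in words),
            "box": [min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes)],
            "label": label,
            "words": [{"text": w.text, "box": b} for w, b in zip(words, boxes)],
            "linking": linking,
        })
    return entities
```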

The upshot: our linking graph is ground truth, not an inference. Every question points at the answer that was actually rendered into that field. There is no annotation drift, no missed edge, no colon-heuristic false positive — making SymageDocs output the cleanest FUNSD-shape training signal you can feed to a form understanding model, stronger than what hand-annotation can produce on a real scan.

This matters more than it sounds. A common failure mode when training LayoutLMv3 or LiLT on real FUNSD is that the linking head plateaus well below the entity head — because the entity labels are easier for human annotators to get right than the edges between them. Teams that audit their FUNSD labels usually find a few percent of links are wrong or missing, and that noise puts a ceiling on linking F1. Synthetic data doesn't have that ceiling. Because every edge is generated, not labeled, the dataset is as clean as your renderer — which is to say, essentially perfect.

Fine-tuning LayoutLMv3 or LiLT on our FUNSD-format data

Because the export is FUNSD-shaped, the HuggingFace training path is a single load_dataset call away. The snippet in the HuggingFace tab above is a complete LayoutLMv3 fine-tune — tokenizer, label map, processor, model, trainer — that consumes the zip SymageDocs writes out, with no glue code in between.
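The full tab snippet isn't reproduced on this page, but a minimal sketch of that path looks like the following. File names, the label map, and the assumption that each page has been rasterized to a PNG alongside its `.funsd.json` are placeholders to adapt; the processor call itself is standard HuggingFace LayoutLMv3 usage with apply_ocr=False.

```python
import json
from PIL import Image
from transformers import AutoProcessor

# Assumed flat label map; extend it if your export uses more labels.
LABEL2ID = {"other": 0, "question": 1, "answer": 2, "header": 3}

def load_page(json_path, image_path):
    """Flatten one FUNSD-format page into word-level lists."""
    with open(json_path) as f:
        entities = json.load(f)["form"]
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    words, boxes, labels = [], [], []
    for ent in entities:
        for word in ent["words"]:
            x0, y0, x1, y1 = word["box"]
            words.append(word["text"])
            # LayoutLMv3 wants boxes on a 0-1000 grid. This assumes the JSON
            # coordinates share the raster's coordinate space; rescale first
            # if you render the PDF at a different DPI.
            boxes.append([int(1000 * x0 / width), int(1000 * y0 / height),
                          int(1000 * x1 / width), int(1000 * y1 / height)])
            labels.append(LABEL2ID[ent["label"]])
    return image, words, boxes, labels

processor = AutoProcessor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False)

# Placeholder file names for one exported page.
image, words, boxes, labels = load_page("i9_2024_seed99.funsd.json",
                                        "i9_2024_seed99.png")
encoding = processor(image, words, boxes=boxes, word_labels=labels,
                     truncation=True, padding="max_length",
                     return_tensors="pt")
# encoding carries input_ids, attention_mask, bbox, pixel_values, and labels:
# everything LayoutLMv3ForTokenClassification(num_labels=len(LABEL2ID)) needs
# inside a standard Trainer loop.
```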

The same approach works for LiLT. LiLT decouples the layout transformer from the text transformer, which means the encoded bboxes + label_ids produced by the snippet above plug directly into SCUT-DLVCLab/lilt-roberta-en-base. See our LiLT training data spoke for the exact fine-tuning recipe, including the language-independent bbox preprocessing LiLT expects.
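A minimal LiLT counterpart, reusing the words, boxes, and labels lists built by load_page() in the sketch above (boxes already on the 0-1000 grid); the exact preprocessing recipe lives on the LiLT spoke:

```python
import torch
from transformers import AutoTokenizer, LiltForTokenClassification

# add_prefix_space=True is required when feeding pre-split words to a
# RoBERTa-style fast tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base", add_prefix_space=True)
model = LiltForTokenClassification.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base", num_labels=4)

# words, boxes, labels: word-level lists as built by load_page() above.
enc = tokenizer(words, is_split_into_words=True, truncation=True,
                padding="max_length", return_tensors="pt")

# Expand word-level boxes and labels onto subword tokens via word_ids().
bbox, token_labels = [], []
for wid in enc.word_ids(0):
    if wid is None:                       # special / padding tokens
        bbox.append([0, 0, 0, 0])
        token_labels.append(-100)         # ignored by the loss
    else:
        bbox.append(boxes[wid])
        token_labels.append(labels[wid])

enc["bbox"] = torch.tensor([bbox])
outputs = model(**enc, labels=torch.tensor([token_labels]))
```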

For encoder-decoder approaches that forgo bboxes entirely, see how Donut on form understanding handles the same I-9 inputs with pure image-to-JSON supervision — no FUNSD shape required, same backing renderer.

A practical recipe we've seen teams get good results with is a two-phase schedule: pre-train on a large SymageDocs synthetic corpus (say, 10k pages spread across the form types in your production mix), then fine-tune for a few hundred steps on real FUNSD or a small hand-labeled subset of your own scans. The synthetic phase teaches the model layout, typography, and FUNSD conventions at scale; the real phase closes the small domain gap between rendered and photographed. That combination beats either diet alone on nearly every internal benchmark we have data on.

Because every generation is deterministic in the seed argument, your training corpus is exactly reproducible — a property real scanned data simply cannot offer. The same (form_id, seed, token_budget) tuple will produce the same FUNSD JSON on every run, which makes regression testing, ablations, and leaderboard submissions far cleaner than anything you can do with hand-annotated scans.

Scale comparison: real FUNSD vs FUNSD+ vs SymageDocs

Three numbers drive the decision of what to train on. Here are those numbers for the three options most teams consider.

| Dimension | FUNSD (2019) | FUNSD+ (2022) | SymageDocs |
| --- | --- | --- | --- |
| Forms | 199 pages | ~1,500 pages | Unlimited, on demand |
| Domains | RVL-CDIP legacy scans | Extended legacy scans | IRS, CMS, USCIS, ACORD, custom |
| Languages | English | English | Template-driven, multilingual |
| Linking ground truth | Hand-annotated | Hand-annotated | Rendered, always exact |
| PII risk | Redacted scans | Redacted scans | Zero, synthetic only |
| License | Research use | Research use | Commercial, yours to use |

For a production model, the right strategy is almost always the same: evaluate on real FUNSD (so your paper numbers are comparable), but train on a much larger synthetic corpus that matches your actual target domain. SymageDocs fills the second half of that equation.

FAQ

Is SymageDocs output a drop-in replacement for the FUNSD dataset?
Yes. Every per-document JSON we emit matches the canonical FUNSD schema — a top-level `form` array of entities, each with `id`, `text`, `box`, `label` (question / answer / header / other), a `words` array of token-level boxes, and a `linking` array of [from_id, to_id] pairs. Existing LayoutLMv3, LiLT, and XDoc training scripts that target FUNSD consume our exports without code changes.
How is this different from FUNSD+?
FUNSD+ (Konfuzio, 2022) expanded the original 199-form benchmark to around 1,500 annotated pages. That is still a fixed, mostly English, mostly legacy-scan corpus. SymageDocs generates unlimited FUNSD-format pages on demand across modern US tax, healthcare, insurance, and immigration forms, with deterministic seeds for reproducibility and zero PII risk.
Do you preserve FUNSD's question → answer linking?
Yes, natively. Our renderer tracks each label/value pair during layout, so the `linking` array is produced from real semantic relationships — not heuristics run after the fact. In this demo fixture, we emit 80 linked pairs across 80 entities on a single I-9 instance.
Can I train LayoutLMv3 or LiLT directly from the output zip?
Yes. Each batch export ships as a zip of PDFs with sibling `.funsd.json` files. Point `datasets.load_dataset("json", data_dir=...)` at the extracted folder and HuggingFace will load it like any FUNSD mirror. The HuggingFace snippet above shows a full LayoutLMv3 fine-tuning loop that takes our output unchanged.
What license is the generated data under?
All synthetic data you generate on SymageDocs is yours to use commercially, including as training data for internal or released models. Full terms are in our Terms of Service. Unlike the original FUNSD (RVL-CDIP derived, with research-only constraints), our output has no academic-only clause.
How does this compare to OCR'ing my own scans and labeling them?
Labeling a few hundred pages by hand is feasible; labeling the 5,000-50,000 pages a modern form-understanding model actually wants to see is a multi-week, multi-annotator project that still leaves you with noisy linking edges. SymageDocs produces the same scale in minutes, with exact ground truth for every entity, every word, and every link — the labels come from the renderer, not a human.
Can I mix FUNSD-format output from different form types in one training run?
Yes, and we recommend it. Every form_id in the SymageDocs library emits the same FUNSD shape, so you can interleave I-9, W-2, CMS-1500, and ACORD pages in a single dataset without reshaping anything. Models trained on a mix generalize noticeably better to unseen form types than models trained on a single template.


Ready to scale past 199 forms?

Generate thousands of FUNSD-format pages in minutes. 250 free credits, no credit card.

Generate my dataset