What LiLT actually is
LiLT was introduced in the ACL 2022 paper “LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding” by Wang, Jin, and Ding. Its design choice is narrow and clever: keep the layout encoder and the text encoder in two separate streams, and only couple them through a bi-directional attention complementation mechanism. Pre-train the layout backbone once; at fine-tuning time bolt on whichever text encoder matches your target language.
The practical consequence is that one layout model covers every language, and you don’t re-pre-train the expensive part every time you move from English invoices to Japanese receipts or Spanish medical claims. The paper pairs LiLT with XLM-RoBERTa as the default multilingual text stream, and that combination is what most production fine-tunes inherit today.
It’s worth sitting with why this matters. Most layout-aware models in the LayoutLM family (LayoutLMv1, LayoutLMv2, LayoutLMv3) tie layout (and, in v2 and v3, vision) to a specific text encoder during pre-training. If you want Japanese support, you pre-train the whole stack on a Japanese corpus, which is both expensive and slow. LiLT’s decoupling turns that problem into a fine-tuning problem, which is a problem synthetic data can solve cheaply.
Why LiLT matters for document AI
Document AI teams converge on a familiar shortlist: LayoutLMv3 for English-first extraction, Donut when OCR is unreliable, and LiLT when a single pipeline needs to cover many languages without training and hosting a separate model per locale. LiLT also sits in an unusually friendly spot on the cost curve — its base configuration runs on a single modest GPU for both training and inference, and its token-classification head makes extraction results directly auditable against the input boxes.
The catch is data. A token-classification model can only learn from tokens, boxes, and labels, and those three have to stay perfectly aligned. Real documents rarely arrive that way. Scanning artifacts split words, OCR drops characters, and hand-labeling bounding boxes across thousands of pages is slow and expensive — that’s the precise gap synthetic training data for document AI fills for every model in this family, LiLT included.
There’s a second, subtler reason synthetic data pairs well with LiLT specifically. Because the layout encoder is frozen in spirit — pre-trained once, reused across languages — fine-tuning is almost entirely about teaching the text-plus-layout classifier the shape of your target forms. That is a tiny amount of task-specific knowledge, and it lives entirely inside the BIO labels. Generate enough synthetic forms with the right labels and LiLT converges fast, regardless of how exotic the form layout is.
LiLT’s training input schema
A single LiLT training example is three parallel arrays of the same length:
- tokens — a list of word strings, in reading order.
- bboxes — one [x0, y0, x1, y1] per token, integers in the 0-1000 range.
- labels — BIO-style tags, one per token (B-NAME, I-NAME, O, and so on).
The 0-1000 normalization is inherited from LayoutLM: rescale every coordinate to a 0-1000 integer relative to the page’s width and height. A token whose raw bounding box is (306, 396, 336, 416) on an 8.5 × 11 inch page (612 × 792 points) becomes [500, 500, 549, 525] after normalization. This removes DPI, zoom, and page-size variation in one step, which is what lets LiLT batch documents of wildly different sizes together.
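The rescaling step is a one-liner per coordinate. A minimal sketch (the helper name is ours; pass the page size in the same units as the raw boxes):

```python
def normalize_bbox(bbox, page_w, page_h):
    """Rescale a raw (x0, y0, x1, y1) box into LiLT's 0-1000 integer space."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_w),
        int(1000 * y0 / page_h),
        int(1000 * x1 / page_w),
        int(1000 * y1 / page_h),
    ]

# The worked example from the text: US Letter is 612 x 792 PDF points.
print(normalize_bbox((306, 396, 336, 416), 612, 792))  # -> [500, 500, 549, 525]
```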
The labels follow the same BIO tagging format used throughout sequence labeling: B- marks the first token of an entity, I- marks continuation, and O marks everything outside a tagged entity.
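Decoding those tags back into entity spans is a dozen lines, and seeing the decoding makes the B/I/O roles concrete. A minimal sketch, assuming well-formed BIO input:

```python
def bio_to_spans(labels):
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(labels):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:          # close the previous entity
                spans.append((etype, start, i))
            start, etype = i, tag[2:]      # B- (or type switch) opens a new one
        elif tag == "O":
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:                  # entity running to the end of the sequence
        spans.append((etype, start, len(labels)))
    return spans

print(bio_to_spans(["B-NAME", "I-NAME", "O", "B-DATE"]))
# -> [('NAME', 0, 2), ('DATE', 3, 4)]
```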
One subtlety that trips up first-time users: the LiLT processor expects word-level boxes, not character-level or subword-level boxes. Hugging Face’s AutoProcessor does the subword splitting internally and propagates each word’s box across its subwords, so your job is to hand it clean whitespace tokens with one box per token. SymageDocs emits exactly that from the rendering engine, so there is no tokenizer drift between what the model sees and what you labeled.
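Since perfect alignment between the three arrays is non-negotiable, a cheap sanity check before anything reaches the trainer catches most pipeline bugs. A sketch (the helper name is hypothetical):

```python
def check_lilt_example(tokens, bboxes, labels):
    """Raise if the three parallel arrays are misaligned or a box is malformed."""
    assert len(tokens) == len(bboxes) == len(labels), "arrays must be the same length"
    for box in bboxes:
        x0, y0, x1, y1 = box
        assert all(0 <= v <= 1000 for v in box), f"coordinate outside 0-1000: {box}"
        assert x0 <= x1 and y0 <= y1, f"degenerate box: {box}"

# One word per box, one label per word.
check_lilt_example(["John", "Doe"], [[10, 10, 80, 30], [90, 10, 160, 30]], ["B-NAME", "I-NAME"])
```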
Here’s LiLT-ready training data from our platform
The example below is a real, rendered CMS-1500 Health Insurance Claim Form — a dense, entity-rich form that exercises LiLT’s layout-aware strength. The tokens, bboxes, and labels you see are produced by the same pipeline described in the next section; we slice off the first 20 tokens for readability. The full example uses only two labels, B-data and I-data, which is a deliberate simplification in the demo schema. Production runs can subdivide data into richer entity types (patient name, diagnosis code, service date, and so on), and LiLT will learn those distinctions as long as the labels are consistent across examples.
CMS-1500 Health Insurance Claim Form — LiLT training tensors
Three parallel arrays plus a paste-ready Hugging Face Dataset builder.
[
"982216",
"Box124",
"Box137",
"Box150",
"Box132",
"Box148",
"Box140",
"Box115",
"10/16/2022",
"12/21/2024",
"04/27/1991",
"Box261",
"Box412",
"Box373",
"Box315",
"6149",
"Sycamore Ct",
"2394",
"Sycamore Ct",
"Box649"
]
The same form structure also underpins our FUNSD training data — LiLT is commonly benchmarked on FUNSD, so generating synthetic data in FUNSD format and fine-tuning LiLT on it composes cleanly.
Generating the data in one script
The snippet below is the entire ETL step. It creates a batch, pulls one training example through iter_training_examples, rescales the raw bboxes into the 0-1000 LiLT coordinate space, and drops them into three parallel arrays. Run it in a loop to generate thousands of examples; point the dataset path at a directory and Hugging Face’s load_dataset will pick them up. A few things worth noting about the loop:
- The page width and height constants come from US Letter at 72 DPI. If you’re generating A4 forms or receipts with non-standard dimensions, swap in the correct page size before rescaling.
- The BIO labels on example.bio.tokens are already in the format LiLT needs, so no further remapping is required.
- Saving the result as JSON Lines (one example per line) is the path of least resistance; it loads directly into datasets and plays nicely with streaming when your corpus grows past a single machine’s RAM.
Generate one LiLT example
Paste-ready. Requires a SymageDocs API key and the symagedocs Python SDK.
from symagedocs import SymageDocsClient

client = SymageDocsClient()

batch = client.batches.create(
    name="lilt-demo",
    form_id="cms_1500_02_12",
    token_budget=300,
    output_formats=["pdf_typed"],
    label_scheme="nist3",
)
client.batches.generate(batch.batch_id, quantity=1, seed=17)

example = next(iter(client.batches.iter_training_examples(batch.batch_id)))

# LiLT expects bboxes normalized into the 0-1000 coordinate space.
PAGE_W, PAGE_H = 612.0, 792.0  # US Letter in PDF points

tokens, bboxes, labels = [], [], []
for t in example.bio.tokens:
    x, y, w, h = t.bbox
    tokens.append(t.text)
    bboxes.append([
        int(1000 * x / PAGE_W),
        int(1000 * y / PAGE_H),
        int(1000 * (x + w) / PAGE_W),
        int(1000 * (y + h) / PAGE_H),
    ])
    labels.append(t.label)

lilt_example = {
    "tokens": tokens,
    "bboxes": bboxes,
    "labels": labels,
}
Fine-tuning LiLT on the generated data
The training loop below is loosely adapted from Phil Schmid’s LiLT walkthrough (philschmid.de/fine-tuning-lilt), which is still the best paste-ready pattern on the open web. Swap nielsr/lilt-xlm-roberta-base for a language-specific LiLT checkpoint if you’re fine-tuning on English-only data and want a smaller text stream. A single RTX 4090 or an A10G happily handles this configuration; we keep the batch size at 4 with mixed-precision fp16 because LiLT’s attention complementation adds a little memory pressure versus a vanilla BERT-sized encoder.
Production-bound runs typically add a validation split, a seqeval metric computation (precision, recall, F1 at the entity level, not the token level), and early stopping on validation F1. We trim all of that for clarity — once you have the three arrays shaped right, wiring up evaluation is mechanical.
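seqeval handles the entity-level computation in practice, but the token-vs-entity distinction is easy to see in a dependency-free sketch (these helpers are illustrative, not the seqeval API):

```python
def entity_spans(tags):
    """Decode BIO tags into a set of (type, start, end) spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel O closes a trailing entity
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != etype):
            spans.add((etype, start, i))
            start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def entity_f1(true_tags, pred_tags):
    """Entity-level F1: a prediction counts only if type AND boundaries match exactly."""
    t, p = entity_spans(true_tags), entity_spans(pred_tags)
    tp = len(t & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)

true = ["B-NAME", "I-NAME", "O", "B-DATE"]
pred = ["B-NAME", "O", "O", "B-DATE"]  # truncated NAME: 3/4 tokens right, but the entity is wrong
print(entity_f1(true, pred))  # -> 0.5
```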
Fine-tune LiLT with Hugging Face Trainer
Drop in your JSONL training file and go. Pattern adapted from Phil Schmid.
# Adapted from Phil Schmid's LiLT walkthrough:
# https://www.philschmid.de/fine-tuning-lilt
from transformers import (
    AutoProcessor,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

# 1. Load your generated dataset (from disk, Hub, or Dataset.from_dict).
ds = load_dataset("json", data_files="lilt_train.jsonl", split="train")

# 2. Build the label vocabulary from your BIO tags.
label_list = sorted({l for ex in ds for l in ex["labels"]})
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for l, i in label2id.items()}

# 3. LiLT + the XLM-RoBERTa text encoder (multilingual out of the box;
#    swapping the text stream per language is LiLT's whole pitch).
processor = AutoProcessor.from_pretrained("nielsr/lilt-xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "nielsr/lilt-xlm-roberta-base",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

def encode(batch):
    # The processor splits words into subwords and propagates each
    # word's box and label across its subword pieces.
    return processor(
        text=batch["tokens"],
        boxes=batch["bboxes"],
        word_labels=[[label2id[l] for l in ex] for ex in batch["labels"]],
        padding="max_length",
        truncation=True,
    )

tokenized = ds.map(encode, batched=True, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="lilt-finetuned",
    num_train_epochs=8,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    fp16=True,
)
Trainer(model=model, args=args, train_dataset=tokenized).train()
Performance expectations
The original LiLT paper reports 88.41 F1 on FUNSD with the base configuration, which remains a useful ceiling number to calibrate against when you train on augmented synthetic data. The paper’s Table 2 benchmarks also cover CORD, RVL-CDIP, and the multilingual XFUND benchmark, where the language-independent design pays off most clearly.
For synthetic-augmented fine-tunes, teams typically report F1 in the same ballpark as a fully real-data fine-tune as long as the synthetic entity distribution covers the real form’s vocabulary. We don’t quote a specific number here because it is workload-dependent; we quote the paper’s benchmark so you have an honest starting target. Two levers move synthetic-trained F1 more than anything else in our experience: first, the diversity of entity values (real-looking names, addresses, dates, codes) — models trained on narrow value distributions over-fit to surface-level patterns; and second, the balance of negative context (the O-labeled tokens) — too little and the model labels everything as an entity, too much and it labels nothing. Good synthetic generators give you control over both.
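The O-balance lever is cheap to measure before you train. A minimal sketch over the generated examples (the helper name is ours):

```python
from collections import Counter

def label_stats(examples):
    """Fraction of O-labeled tokens and per-label counts across a corpus."""
    counts = Counter(l for ex in examples for l in ex["labels"])
    total = sum(counts.values())
    return counts["O"] / total, counts

o_ratio, counts = label_stats([
    {"labels": ["O", "B-NAME", "I-NAME", "O"]},
    {"labels": ["O", "O", "B-DATE", "O"]},
])
print(round(o_ratio, 3))  # -> 0.625
```

If the ratio drifts toward 0 or 1, adjust the generator's filled-field density before spending GPU hours.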
If you’re also considering an OCR-free alternative, see our companion piece on Donut training data; Donut and LiLT make very different trade-offs, and picking the right one is often a bigger win than squeezing another F1 point out of either. For a rigorous architecture-vs-architecture view, cross-reference the Donut vs LiLT vs LayoutLM comparison deep dive when you’re evaluating the full field.
Frequently asked questions
- What is LiLT and why do the bounding boxes use a 0-1000 range?
- LiLT (Language-independent Layout Transformer) is a two-stream model introduced by Wang et al. at ACL 2022. One stream encodes layout (token bounding boxes) and the other encodes text, so a single pre-trained layout backbone can be paired with any text encoder at fine-tuning time. The 0-1000 normalization comes from LayoutLM's processor pipeline: every coordinate is rescaled to an integer between 0 and 1000 relative to page width/height so the model is resolution-independent and can mix documents of different DPIs in one batch.
- Do I need real labeled documents to fine-tune LiLT, or will synthetic data work?
- Synthetic data works — that is the entire point of this page. LiLT is fine-tuned with tokens, 0-1000 bboxes, and BIO-style labels per token. SymageDocs emits exactly that schema from every synthetic form, with coordinates computed from the rendering engine rather than OCR, so there is no label noise. Teams typically use synthetic data for warm-up and augmentation, then fine-tune on a small held-out set of real documents for final calibration.
- How does LiLT compare to Donut for structured extraction?
- LiLT is an encoder that classifies each input token; Donut is an encoder-decoder that generates a JSON string directly from a document image and skips OCR entirely. LiLT is smaller, faster to fine-tune, and easier to debug at the token level, which makes it a strong default when you already have OCR or OCR-free word boxes. Donut wins when OCR is unreliable or when output is heavily structured (receipts, business cards). See our Donut training data page for a side-by-side comparison.
- Can I use LiLT on non-English forms with the same training loop?
- Yes — that is the language-independent part of the name. The published LiLT checkpoints pair the layout backbone with XLM-RoBERTa, which supports 100+ languages out of the box. The training loop is identical: swap the text encoder, regenerate tokens + bboxes + labels for the target language, and fine-tune. Because SymageDocs generates locale-aware entity values (names, addresses, dates) tied to real form schemas, you can produce multilingual training data without reshaping your pipeline.
- What does 'LiLT-ready' actually mean in the output schema?
- Three parallel arrays of equal length per example: tokens (list of strings), bboxes (list of [x0, y0, x1, y1] integers in 0-1000 space), and labels (BIO tags as strings). You pass those three arrays directly into Hugging Face's AutoProcessor for a LiLT checkpoint — no extra preprocessing, no coordinate gymnastics. Our platform also exposes the raw 0-1 normalized bboxes if you prefer to rescale yourself, for example when targeting a custom page size.
Generate LiLT-ready training data today
250 free credits, no credit card. Pair it with the paste-ready Hugging Face snippet above and you’ll have a fine-tuning dataset before your next coffee.