NER format: BIO / IOB2

Synthetic NER training data in BIO format

BIO is a tagging format, not a model. Generate privacy-safe, IOB2-compliant token-level training data for BERT, RoBERTa, LiLT, and any other transformer encoder you want to fine-tune for document NER.

BIO (sometimes written IOB2) is the canonical token-level label scheme used to train named-entity-recognition models. It is not a model architecture, a dataset, or a benchmark — it is the way you annotate tokens so that a transformer can learn where each entity starts, where it continues, and where it ends. This page shows real BIO-tagged output from the SymageDocs platform for the irs_w2_2024 form, explains the surrounding formats (CoNLL, JSONL, HuggingFace datasets), and drops you into a paste-ready fine-tuning loop.

The sample below was produced from a single generated W-2. It contains 46 BIO-tagged tokens drawn from the nist3 label scheme — scale that up by the size of your batch to get millions of labeled tokens in minutes, not months. If you want the broader picture of how this fits into a document-AI pipeline, start with the synthetic training data for document AI pillar.

1. What BIO tagging is (and why it's called IOB2)

BIO stands for Beginning, Inside, Outside. Every token in a document gets exactly one label:

  • B-<type> — this token is the first word of an entity of this type.
  • I-<type> — this token continues the previous entity.
  • O — this token is outside any entity.

The “IOB2” name comes from a 1999 reformulation by Sang and Veenstra. The original IOB scheme only emitted a B- tag when two entities of the same type were directly adjacent; the “2” variant always tags the first token of every entity with B-. IOB2 is strictly more consistent and it is what every modern library — HuggingFace, spaCy, Flair, AllenNLP — produces and consumes today. When people say “BIO” in 2026, they mean IOB2.

Alice   B-name
Nguyen  I-name
earned  O
55,928.49       B-data
in      O
wages   O
.       O

EIN     O
78-5646026      B-data
A BIO-tagged W-2 fragment: three entities (Alice Nguyen, 55,928.49, 78-5646026), one blank line between documents.
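
To make the start/continue/end mechanics concrete, here is a minimal decoding sketch (ours, not part of any export) that collapses an IOB2-tagged token sequence back into entity spans; the tokens and labels mirror the fragment above.

def bio_to_spans(tokens, labels):
    """Collapse IOB2 tags into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_type:  # close any open entity before starting a new one
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continuation of the open entity
        else:  # O (or a malformed I-) closes any open entity
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Alice", "Nguyen", "earned", "55,928.49", "in", "wages", "."]
labels = ["B-name", "I-name", "O", "B-data", "O", "O", "O"]
print(bio_to_spans(tokens, labels))
# [('name', 'Alice Nguyen'), ('data', '55,928.49')]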

2. The canonical formats: CoNLL, JSONL, HuggingFace datasets

BIO is the label scheme; you also have to pick a file format. The three that matter in practice:

  • CoNLL (often just called “BIO TSV”) — one token per line, text\tlabel, blank line separates documents. This is what the classic CoNLL-2003 NER dataset used and what every token-classification tutorial starts from.
  • JSONL — one JSON object per line, each with tokens and labels arrays. Streams nicely, survives arbitrary Unicode, and maps directly to a HuggingFace Dataset.
  • HuggingFace datasets — technically not a file format but a loader target. If you train with Trainer, this is where your data has to end up. Both of the above formats convert in under ten lines of Python.

SymageDocs emits CoNLL and JSONL directly out of the same batch, so you never have to hand-write a converter. A single export zip gives you the tab-separated file for legacy pipelines and the JSON-per-line file for modern HuggingFace loaders. Everything downstream — the tokenizer, the aligner, the trainer, the metric — treats the two as interchangeable as long as the BIO label contract is preserved.
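
As a rough illustration of how small those converters are, here is a sketch that reads a CoNLL-style BIO file into the JSONL shape described above and then into a HuggingFace Dataset; the file names are placeholders, not fixed SymageDocs export names.

import json
from datasets import Dataset

def read_conll(path):
    """Yield one {"tokens": [...], "labels": [...]} dict per blank-line-delimited document."""
    tokens, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line = document boundary
                if tokens:
                    yield {"tokens": tokens, "labels": labels}
                tokens, labels = [], []
                continue
            token, label = line.split("\t")
            tokens.append(token)
            labels.append(label)
    if tokens:  # last document may lack a trailing blank line
        yield {"tokens": tokens, "labels": labels}

docs = list(read_conll("bio_train.conll"))

# CoNLL -> JSONL: one JSON object per line.
with open("bio_train.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Either representation loads straight into a HuggingFace Dataset.
ds = Dataset.from_list(docs)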

A few common mistakes when rolling your own BIO data (the validator sketch after this list catches the first two):

  • Emitting an I- tag without a preceding B-. Every entity must start with a B- tag. Scorers such as seqeval handle orphaned inside-tags differently depending on the scheme and mode you pass, so your F1 can shift without any error being raised.
  • Mixing label types inside one entity. A run of B-name I-addr is malformed. It should either be two separate entities or a single entity under one type.
  • Forgetting the blank line between documents in CoNLL. Blank lines are document delimiters; without them, every token in the corpus becomes one giant sequence that overflows the model's context window.
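
The snippet below is a minimal validator sketch of our own, not part of any library: it flags orphaned I- tags and type changes inside a single entity, the first two mistakes in the list above.

def validate_bio(labels):
    """Return (index, message) problems for one document's BIO label sequence."""
    problems, prev = [], "O"
    for i, label in enumerate(labels):
        if label.startswith("I-"):
            if prev == "O":
                problems.append((i, f"orphaned {label}: no preceding B- tag"))
            elif prev[2:] != label[2:]:
                problems.append((i, f"type change inside entity: {prev} -> {label}"))
        prev = label
    return problems

print(validate_bio(["O", "I-name", "B-name", "I-addr"]))
# [(1, 'orphaned I-name: no preceding B- tag'),
#  (3, 'type change inside entity: B-name -> I-addr')]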

3. Why synthetic beats real for NER

Document NER has three problems that synthetic data solves cleanly:

Privacy

Real W-2s, CMS-1500s, and I-9s contain SSNs, EINs, addresses, and PHI. You cannot send them to annotation vendors and you probably cannot send them across trust boundaries inside your own company. Synthetic documents have zero real PII — train freely.

Volume

A single human annotator labels a few hundred pages per week. A SymageDocs batch emits thousands of pixel-accurate, BIO-tagged pages per hour. Your NER model stops being data-starved.

Edge cases

Real samples under-represent rare entities and unusual layouts. Synthetic generation lets you choose your distribution: oversample rare classes, inject OCR noise, rotate pages, swap fonts.

There is a fourth, quieter benefit: label consistency. Real-world annotation projects accumulate drift — one annotator tags a middle initial as part of the name, another tags it as a separate entity, and your model learns the disagreement rather than the underlying signal. Synthetic BIO output is emitted by a single deterministic labeler, so the inside/outside boundary is identical across every document in the batch. When your evaluation set is also synthetic you can compute an upper-bound F1 first and know exactly how much headroom remains before you hit annotator noise.
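
If you want to put a number on that headroom, seqeval scores BIO sequences at the entity level; the gold and predicted tags below are made-up placeholders, but the call pattern is the standard one.

# pip install seqeval
from seqeval.metrics import classification_report, f1_score

# One inner list of BIO tags per document (placeholder values).
y_true = [["B-name", "I-name", "O", "B-data", "O"]]
y_pred = [["B-name", "I-name", "O", "O", "O"]]

print(f1_score(y_true, y_pred))               # entity-level F1
print(classification_report(y_true, y_pred))  # per-type precision / recall / F1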

4. Here's real BIO data from our platform

Below is actual output from a synthetic irs_w2_2024 (W-2 Wage and Tax Statement) batch — no lorem-ipsum filler, no placeholder labels. The CoNLL tab is what you'd pipe into a CoNLL-2003-style loader; the JSONL tab is what streams into a HuggingFace pipeline; the HF tab shows the three-line load step.

BIO export — irs_w2_2024

Rendered from the nist3 label scheme. Copy a tab to paste into your own notebook.

conll
181-64-1617	B-ssn
78-5646026	B-data
55,928.49	B-data
69,828.34	B-data
Summit	B-data
Analytics	I-data
50,016.68	B-data
14,345.85	B-data
62,268.82	B-data
56,062.17	B-data
45,693.10	B-data
55,789.57	B-data
73,656.96	B-data
45,347.42	B-data
8,325.50	B-data
Avery	B-name
Hassan	B-name
Employee72	B-name
26,053.70	B-data
23,069.31	B-data
25,326.35	B-data
Employee56	B-data
25,395.75	B-data
86,152.06	B-data
66,282.93	B-data
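
The JSONL and HF tabs pair the same two columns into one object per line and a single load call. A sketch, assuming the {"tokens": [...], "labels": [...]} shape described in section 2, truncated to the first six token/label pairs of the export above:

# One JSONL line for this document (truncated):
# {"tokens": ["181-64-1617", "78-5646026", "55,928.49", "69,828.34", "Summit", "Analytics"],
#  "labels": ["B-ssn", "B-data", "B-data", "B-data", "B-data", "I-data"]}

# The HF load step:
from datasets import load_dataset
ds = load_dataset("json", data_files="bio_train.jsonl", split="train")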

5. Our label schemes explained

The sample above was produced under the nist3 label scheme — a three-bucket taxonomy (PII / data / structure) that the SymageDocs platform ships as the safe default. Every BIO label in this batch is one of:

Label	Meaning
B-data	Beginning of a numeric data value (wage, tax, box amount).
B-name	Beginning of a person-name entity (employee, employer, dependent).
B-ssn	Beginning of a Social Security Number entity.
I-data	Continuation token inside a numeric data value.

Richer forms — invoices, SOAP notes, discharge summaries — expose more label classes at generation time. The inventory above reflects what this particular W-2 needed; the scheme itself is open-ended and will expand as you raise token_budget or mix in additional forms.
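
One practical note on using the inventory: if you want label ids to stay stable across batches and train/eval splits rather than depend on whichever labels happen to appear in one export, you can pin the list up front. A sketch, assuming the labels seen in this batch plus O; richer forms would extend the list.

from datasets import ClassLabel, Features, Sequence, Value

# Labels observed in this irs_w2_2024 batch; extend as richer forms add classes.
nist3_labels = ["O", "B-data", "I-data", "B-name", "B-ssn"]

features = Features({
    "tokens": Sequence(Value("string")),
    "labels": Sequence(ClassLabel(names=nist3_labels)),
})
# ds = ds.cast(features)  # maps the string labels onto stable integer ids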

6. Drop-in fine-tuning loop (HuggingFace Trainer)

The snippet below is the entire training loop: load JSONL, tokenize, align subword labels, fine-tune. It runs on a single GPU, produces a standard AutoModelForTokenClassification checkpoint, and needs zero hand-labeled data.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# 1. Load your SymageDocs BIO JSONL export.
import json

with open("bio_train.jsonl") as f:
    rows = [json.loads(line) for line in f]
raw = Dataset.from_dict({
    "tokens": [r["tokens"] for r in rows],
    "labels": [r["labels"] for r in rows],
})

# 2. Build the label <-> id maps from the labels present in the export.
label_list = sorted({l for row in rows for l in row["labels"]})
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for l, i in label2id.items()}

# 3. Tokenize + align labels to wordpiece tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align(batch):
    enc = tokenizer(
        batch["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding=False,
    )
    all_labels = []
    for i, labels in enumerate(batch["labels"]):
        word_ids = enc.word_ids(batch_index=i)
        aligned = []
        prev = None
        for wid in word_ids:
            if wid is None:
                aligned.append(-100)
            elif wid != prev:
                aligned.append(label2id[labels[wid]])
            else:
                # Continuation wordpieces of the same word are masked with -100
                # so each word contributes exactly one label to the loss.
                aligned.append(-100)
            prev = wid
        all_labels.append(aligned)
    enc["labels"] = all_labels
    return enc

ds = raw.map(tokenize_and_align, batched=True, remove_columns=raw.column_names)

# 4. Fine-tune.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="./bio-ner",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

Swap bert-base-cased for roberta-base, microsoft/layoutlmv3-base, or any other encoder — the BIO contract stays the same.
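
One swap-specific detail worth flagging: RoBERTa-style byte-pair tokenizers need add_prefix_space=True when they receive pre-split tokens, which is exactly what is_split_into_words=True feeds them. A sketch of the lines that change, reusing label_list, id2label, and label2id from the loop above:

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)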

7. From BIO tokens to LiLT and FUNSD

BIO tokens are not a dead end. They are the input to every layout-aware transformer on the market. If you want to train a model that sees both text and spatial position, the same tokens feed directly into LiLT fine-tuning — SymageDocs emits the BIO labels and the 0-1000 normalized bounding boxes that LiLT expects, generated in a single batch.
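
The 0-1000 normalization itself is one division per coordinate. A minimal sketch, assuming pixel-space (x0, y0, x1, y1) boxes and a known page size; the field names are placeholders, not the exact keys of the SymageDocs export.

def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box into the 0-1000 range LiLT expects."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# e.g. a box on a 612 x 792 pt page
print(normalize_bbox((72, 90, 220, 110), 612, 792))  # [117, 113, 359, 138]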

The same BIO tokens are also compatible with FUNSD training data. FUNSD wraps its words in a higher-level form-entity JSON, but each word carries a BIO-compatible label — you can convert between the two without regenerating documents. If your pipeline already speaks one format, you already speak the other.

8. Frequently asked questions

Is BIO a model or a tagging format?

BIO is a tagging format, not a model. BIO (short for Beginning-Inside-Outside, also known as IOB2) is the standard token-level label scheme used to train named-entity-recognition models. Any transformer encoder — BERT, RoBERTa, LayoutLM, LiLT — can be fine-tuned on BIO-tagged data.

What's the difference between BIO, IOB, and IOB2?

IOB (the original) only used B- at the start of an entity when the previous token was part of a different entity of the same type. IOB2, which the NLP community now calls BIO, always marks the first token of every entity with B- regardless of context. BIO and IOB2 are interchangeable; BIO is simply the preferred term today.

Can I use this data with HuggingFace Transformers?

Yes. The JSONL export is structured as {"tokens": [...], "labels": [...]} per line, which loads directly into a datasets.Dataset. From there, AutoTokenizer and AutoModelForTokenClassification handle the rest — see the fine-tuning snippet in section 6 for a paste-ready loop.

Why is synthetic data better than labeling real documents?

Three reasons. First, privacy: real W-2s, CMS-1500s, and I-9s contain PII and PHI that cannot be shared with annotators or third-party training platforms. Second, volume: synthetic generation produces tens of thousands of labeled documents in the time it takes a human annotator to label a few hundred. Third, edge-case coverage: you can deliberately oversample rare label classes, layout variations, and OCR-noisy conditions that almost never appear in a random real-world sample.

What label schemes are supported?

SymageDocs currently ships the nist3 scheme by default, which collapses entities into three practical buckets — PII (names, SSNs, addresses), data values (numeric amounts and dates), and structural tokens. Forms with richer inherent structure expose more specific labels at generation time; the label inventory on this page is drawn from a real irs_w2_2024 batch.

How does BIO data flow into LiLT and FUNSD pipelines?

BIO tokens are the common substrate. LiLT consumes token-level BIO labels alongside normalized bounding boxes to jointly learn text and layout. FUNSD uses a higher-level form-entity JSON, but every FUNSD word carries a BIO-compatible label, so you can pivot between the two formats without regenerating data.

Ready to generate BIO-tagged synthetic training data?

Start with 250 free credits. No credit card required. Export CoNLL, JSONL, and HuggingFace-ready bundles on the same batch.