YOLOv8 for document region detection — synthetic training data

Paste-ready synthetic training data for YOLOv8: one image plus a matching labels.txt per document, with every header, line item, total, and date region already annotated. Skip the annotation bottleneck and start training the first-stage region detector for your document AI pipeline.

YOLOv8 is the dominant first-stage region detector for modern document AI pipelines. Before a model like Donut or LiLT can extract structured values, something has to draw the boxes — and on invoices, claims, and government forms those boxes are almost never labeled at the scale a modern detector needs.

This page is part of our pillar guide to synthetic training data for document AI. SymageDocs renders realistic, structurally-correct documents and emits the YOLO labels that match every rendered pixel — no manual annotation, no sampling of real PII.

What YOLOv8 is — and why document teams use it

YOLOv8 is the eighth generation of the You Only Look Once family of single-stage object detectors from Ultralytics. It's fast (real-time on commodity GPUs), flexible across model sizes (n / s / m / l / x), and — critically — the labels.txt format it consumes is dead simple. That last property is what put it at the front of every document-AI pipeline shipping today.

Teams don't use YOLOv8 as their extractor. They use it as the upstream region detector: given a rasterized page, crop out the header, the line-items table, the totals block, and the signature region. A narrow, well-trained YOLOv8 model turns a messy 8.5×11 scan into a set of tidy crops that downstream transformer extractors (Donut, LiLT, DocFormerv2) can actually reason about.

This split matters because the constraints on the two stages are so different. A detector needs to be fast, robust to skew and noise, and good at localizing tightly-packed regions on a cluttered page. An extractor needs deep semantic understanding of a single cropped region. Asking a single model to do both jobs is expensive, and in practice it underperforms the two-stage decomposition — which is why YOLOv8 ended up as the default first stage across the modern document-AI literature.

Where YOLOv8 fits

Two-stage document AI pipeline: detect regions first, extract structure second.

  1. Scanned page

    PDF / JPG / PNG

  2. YOLOv8 region detector

    header, line_item, total, date, party

  3. Downstream extractor

    LiLT / Donut / DocFormerv2

The bounding-box data problem

Training a production-grade YOLOv8 document detector requires thousands of pages with every field region correctly boxed. The public benchmarks don't come close: FUNSD, one of the most-cited academic form-understanding datasets, ships 199 total forms across its train and test splits. DocVQA, RVL-CDIP, and the SROIE receipt corpus were designed for different tasks entirely — they rarely include clean per-field bounding boxes at the granularity a YOLOv8 invoice-region detector needs.

In-house annotation runs $0.50 to $2.00 per labeled region on commercial BPO platforms, and an invoice can carry 40+ regions per page. Before a single epoch runs you've spent tens of thousands of dollars just labeling the data — and you're still blocked by PII review for any real customer document.

Synthetic data short-circuits both problems. Because we render each document from a structured template, we already know where every field is on the page down to the sub-pixel. Emitting a YOLO labels.txt is a coordinate transform away from the ground truth that produced the pixels in the first place. The labels are, by construction, perfect — no annotator drift, no missed small regions, no inconsistent class decisions between labelers.

The YOLO label format, briefly

The YOLO label format is designed to be trivially human-readable. Each image has a sibling .txt file containing one line per object — no XML, no JSON, no nested structures.

<class_id> <center_x> <center_y> <width> <height>

class_id — zero-indexed integer matching the names list in your data.yaml.

center_x, center_y — box center, normalized to [0, 1] against the image width and height.

width, height — box dimensions, also normalized to [0, 1].

Pairing the labels.txt with a data.yaml that points at your train/ and val/ image directories is the full dataset contract. That's the whole interface — which is what makes YOLOv8 datasets so portable and what makes them easy for us to emit mechanically.
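
For illustration, here's what a minimal data.yaml can look like for the six-class schema used on this page. The paths are placeholders for your own dataset layout, and the class-ID ordering is an assumption that must match the CLASS_MAP used when the labels were written:

# Ultralytics dataset config; paths are placeholders for your own layout.
path: datasets/symagedocs_invoices
train: images/train
val: images/val
names:
  0: party
  1: header
  2: line_item
  3: total
  4: date
  5: other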

Here's what our platform outputs for YOLOv8 training

The sample below was generated from our invoice_modern form. A single rendered page produced 54 labeled regions across 6 classes (party, header, line_item, total, date, and other) — delivered as a labels.txt, a paste-ready data.yaml, and a class map; the overlay and the excerpt below show a subset of those regions. Multiply that by thousands of generated pages for your actual dataset.

[Figure: rendered synthetic invoice with 24 YOLOv8 region overlays for header, line items, totals, and dates. Overlays derived by parsing the labels.txt below back into pixel coordinates — the same file you'd feed YOLOv8.]

Paste-ready YOLOv8 dataset output

Three files per dataset: labels.txt, data.yaml, and a class map. Together they form the full contract; an excerpt of the labels.txt is shown below.

labels.txt (excerpt)
1 0.335000 0.138900 0.212400 0.015200
0 0.747500 0.138900 0.318600 0.015200
0 0.747500 0.157250 0.318600 0.013900
4 0.335000 0.159100 0.212400 0.015200
0 0.747500 0.173650 0.318600 0.013900
4 0.335000 0.179300 0.212400 0.015200
0 0.747500 0.190050 0.318600 0.013900
5 0.335000 0.199500 0.212400 0.015200
0 0.747500 0.206450 0.318600 0.013900
5 0.335000 0.219700 0.212400 0.015200
0 0.285950 0.286600 0.408500 0.015200
0 0.285950 0.304950 0.408500 0.013900
0 0.285950 0.321350 0.408500 0.013900
0 0.285950 0.337750 0.408500 0.013900
0 0.285950 0.354150 0.408500 0.013900
2 0.310500 0.431800 0.441200 0.015200
2 0.592350 0.431800 0.089900 0.015200
2 0.706700 0.431800 0.106200 0.015200
2 0.851300 0.431800 0.117600 0.015200
2 0.310500 0.459600 0.441200 0.015200
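
The overlays in the figure above come from inverting this format. As a sketch, here's the reverse transform back to pixel-space corner boxes; the image dimensions are an assumption (US Letter rendered at 150 DPI) and should be read from your actual rendered image:

IMG_W, IMG_H = 1275, 1650  # assumed render size: 8.5 x 11 in at 150 DPI

boxes = []
with open("labels.txt") as f:
    for line in f:
        cls, cx, cy, w, h = line.split()
        cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
        # Normalized center-xywh back to pixel-space corners (x1, y1, x2, y2).
        x1, y1 = (cx - w / 2) * IMG_W, (cy - h / 2) * IMG_H
        x2, y2 = (cx + w / 2) * IMG_W, (cy + h / 2) * IMG_H
        boxes.append((int(cls), x1, y1, x2, y2))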

From our WordAnnotation to YOLO format

Internally SymageDocs represents every rendered field as a WordAnnotation with pixel-space x, y, width, height. Converting to YOLOv8's normalized center-xywh format is five lines — here's the exact snippet our fixture generator uses:

# From our WordAnnotation (pixel coords) to YOLOv8 normalized labels.
PAGE_W, PAGE_H = 612.0, 792.0  # US Letter points
for w in words:
    cx = (w.x + w.width / 2) / PAGE_W
    cy = (w.y + w.height / 2) / PAGE_H
    nw, nh = w.width / PAGE_W, w.height / PAGE_H
    print(f"{CLASS_MAP[w.field_type]} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}")

That's it. The CLASS_MAP keys come from each WordAnnotation's field_type, so you can taxonomize regions however your downstream extractor expects them.
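
To make the fragment above runnable end to end, here's a self-contained sketch. The WordAnnotation dataclass and CLASS_MAP values are illustrative stand-ins, not our exact internal definitions:

from dataclasses import dataclass

@dataclass
class WordAnnotation:  # illustrative stand-in for the internal type
    field_type: str
    x: float       # top-left corner, page coordinates
    y: float
    width: float
    height: float

CLASS_MAP = {"party": 0, "header": 1, "line_item": 2, "total": 3, "date": 4, "other": 5}
PAGE_W, PAGE_H = 612.0, 792.0  # US Letter points

def to_yolo_lines(words):
    lines = []
    for w in words:
        cx = (w.x + w.width / 2) / PAGE_W
        cy = (w.y + w.height / 2) / PAGE_H
        nw, nh = w.width / PAGE_W, w.height / PAGE_H
        lines.append(f"{CLASS_MAP[w.field_type]} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}")
    return lines

# One labels file per rendered page, written next to the image.
words = [WordAnnotation("total", 306.0, 540.0, 270.0, 12.0)]
with open("invoice_0001.txt", "w") as f:
    f.write("\n".join(to_yolo_lines(words)) + "\n")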

Training YOLOv8 on your synthetic dataset

Once you've generated a batch and written out the labels, training is a one-liner in the Ultralytics Python API. Point model.train at the data.yaml we emitted in the tabs above and kick it off:

from ultralytics import YOLO

# Start from a pretrained YOLOv8 nano backbone — swap for yolov8s/m/l/x as needed.
model = YOLO("yolov8n.pt")

# Train on the synthetic document-region dataset emitted by SymageDocs.
results = model.train(
    data="data.yaml",
    epochs=100,
    imgsz=1280,       # documents benefit from higher input resolution
    batch=16,
    patience=20,
)

For document images, imgsz=1280 is usually the right default — tiny fields like invoice numbers and decimal-point totals get squashed at the 640 default. If training hardware is the bottleneck, stay on the yolov8n.pt backbone; if you have the GPU budget, yolov8l.pt is the sweet spot for document-region accuracy in our benchmarks.
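
If you'd rather drive the same run from the shell, the equivalent Ultralytics CLI invocation is:

yolo detect train data=data.yaml model=yolov8n.pt epochs=100 imgsz=1280 batch=16 patience=20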

When you take the trained model to real documents, hold out a few hundred hand-labeled real pages as your validation split. Synthetic training data is a powerful prior, but the sim-to-real gap is real — tracking mAP on a real validation set from day one is the difference between a detector that looks great in Tensorboard and one that actually ships. Most teams iterate by mixing a small real-document seed set with the bulk of the synthetic data and annealing the synthetic ratio down over training.
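
A minimal sketch of that real-document check, assuming you've written a separate real_val.yaml whose names list matches the synthetic schema and whose val split points at your hand-labeled real pages:

from ultralytics import YOLO

# Evaluate synthetic-trained weights on the held-out real-document split.
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="real_val.yaml", imgsz=1280)
print(metrics.box.map50)  # mAP@0.5 on real pages; track this from day one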

Two-stage pipelines: YOLO regions, then transformer extraction

YOLOv8 is rarely the last step. In production document AI stacks, YOLO outputs regions; a transformer extracts the actual values inside each region. The two most common downstream pairings are Donut and LiLT.

For Donut downstream extraction, crop each YOLO region and feed it to an OCR-free encoder-decoder that generates a JSON string directly. Great for invoices where the regions are already semantically meaningful (totals, line items) — Donut just fills in the structured values.

For LiLT layout extraction, feed the YOLO-cropped region along with its layout bbox into a language-independent token classifier. This is the path most multinational teams take: LiLT transfers across languages where Donut would need re-pretraining per locale.

In both cases the YOLOv8 labels you generate here are the upstream enabler. Train the detector once, freeze it, and iterate on your downstream extractor with clean, well-scoped crops.
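
As a sketch of that hand-off (the weights path and confidence threshold are illustrative), here's how the frozen detector turns a page into class-tagged crops for the extractor:

from ultralytics import YOLO
from PIL import Image

detector = YOLO("runs/detect/train/weights/best.pt")  # frozen first-stage detector
page = Image.open("scanned_invoice.png")

result = detector(page, conf=0.4)[0]
for box in result.boxes:
    x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].tolist())
    region_class = result.names[int(box.cls)]
    crop = page.crop((x1, y1, x2, y2))
    # Hand each (region_class, crop) pair to Donut or LiLT for value extraction.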

Not sure which downstream extractor fits? Our Donut vs LiLT vs LayoutLM comparison walks through the decision — license terms, multilingual support, and where each architecture wins on real document benchmarks.

Frequently asked questions

How many synthetic YOLOv8 training examples can I generate?
There is no per-dataset cap. Each generated document produces one image plus a matching YOLO labels.txt with every field region annotated. Teams typically start with 2,000–10,000 pages per form type for a solid first-stage region detector, then scale to tens of thousands as they add layout variants.
What's the license on SymageDocs-generated YOLOv8 training data?
You own the output. Generated documents and their bounding-box labels are yours to train on, redistribute internally, or ship as part of a derived model. There are no data-use clawbacks, no royalty obligations, and no clauses restricting commercial use of downstream models.
Which YOLO class taxonomy should I use for document region detection?
Our fixture uses a compact six-class schema — party, header, line_item, total, date, other — which maps cleanly to invoice, receipt, and form layouts. You can refine this by swapping the CLASS_MAP in the conversion step to match your own downstream extractor's region vocabulary.
How many epochs should I train YOLOv8 on synthetic document data?
Start with 100 epochs at imgsz=1280 with early stopping (patience=20). Synthetic data converges faster than real annotated documents because the labels are noise-free; most teams see validation mAP@0.5 plateau between epochs 40 and 80. Hold a small real-document validation split to track the sim-to-real gap.
Does this work with Ultralytics YOLOv9, YOLOv10, or YOLOv11?
Yes. The YOLO labels.txt format and data.yaml schema are shared across the Ultralytics family, so the same SymageDocs output drops into YOLOv8, YOLOv9, YOLOv10, and YOLOv11 training loops with no changes. Swap the pretrained weights file (e.g. yolov10n.pt) and the rest of the pipeline is identical.

Ready to generate YOLOv8 training data?

Start with 250 free credits. No credit card required. Pipe the output straight into your Ultralytics training loop.