Synthetic CMS-1500 Health Insurance Claim Form Data

Synthetic training data — no real PII, fully coherent identities

healthcare · 2024

Generate synthetic CMS-1500 health insurance claim forms with realistic patient demographics, diagnosis codes, and procedure data. The CMS-1500 is the standard healthcare claim form used across the U.S., which makes it essential training data for healthcare document AI and claims-processing pipelines.

Fields per document: 68

Pages: 1

Category: healthcare

What this document is

The CMS-1500 is the standard paper claim form that physicians, clinics, and other non-institutional healthcare providers use to bill Medicare, Medicaid, and private insurers across the United States. Its single-page layout packs 33 numbered boxes covering patient information, insured party details, diagnosis codes (ICD-10), procedure codes (CPT/HCPCS), and billing amounts into a dense, red-dropout grid designed for optical scanning.

Why generate synthetically

Healthcare claims processing is one of the largest document automation markets, and the CMS-1500 is its foundational form. Synthetic CMS-1500s enable training OCR, field extraction, and claims validation models without the HIPAA compliance burden of using real patient health information. They are essential for any healthcare IDP pipeline.

What makes synthetic data useful

Each synthetic CMS-1500 represents a coherent patient encounter: diagnosis codes (Box 21) relate to the procedure codes (Box 24D), dates of service are consistent, and billing amounts reflect realistic charge patterns for the given procedures. Provider NPIs follow valid formats, and patient demographics, including insurance ID numbers, are fabricated but structurally valid.
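As an illustration of what "structurally valid" can mean for a provider NPI, the sketch below checks the published NPI format: 10 digits, with the last digit a Luhn check digit computed over the constant prefix 80840 followed by the NPI. Whether the generator enforces the check digit in addition to the 10-digit shape is an assumption here, and the function name `npi_check_digit_ok` is hypothetical.

```python
def npi_check_digit_ok(npi: str) -> bool:
    """Return True if `npi` has 10 ASCII digits and passes the Luhn check
    computed over the '80840' prefix followed by the NPI itself."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left. The check digit (position 0) is added as-is;
    # every second digit after it is doubled, and a two-digit product
    # is reduced by 9 (equivalent to summing its digits).
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    for candidate in ("1234567893", "1234567890", "12345"):
        print(candidate, npi_check_digit_ok(candidate))
```

A check like this is useful as a post-extraction sanity filter: an OCR misread of a single digit in Box 24J or Box 33a will almost always break the check digit.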

Training challenges

The CMS-1500's red-dropout ink is designed for OCR scanners but, paradoxically, causes problems for modern ML models trained on standard black-and-white documents: the red grid lines appear in color scans but vanish in black-and-white scans made with red dropout, creating two very different visual inputs from the same form.

Box 24 (the service line table) packs up to 6 rows of procedure codes, modifiers, diagnosis pointers, and charges into a grid where column widths vary and cell boundaries are defined only by the red dropout pattern.

Boxes 1-13 (patient/insured information) use a split-column layout in which the left and right halves of the page collect parallel data for the patient and the insured party, and models must assign each field to the correct entity.

The diagnosis pointer field (Box 24E) uses single-letter references (A-L) that cross-reference Box 21's diagnosis codes, requiring relational extraction across non-adjacent form regions.
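The Box 24E to Box 21 cross-reference can be sketched as a small lookup. This is a minimal illustration, assuming the extractor has already produced an ordered list of Box 21 codes (slot A first) and a raw pointer string such as "AB"; the function name `resolve_pointers` and the input shapes are hypothetical, not part of any published schema.

```python
import string

# Box 21 has twelve diagnosis slots labelled A through L.
POINTER_LETTERS = string.ascii_uppercase[:12]


def resolve_pointers(box21_codes: list[str], pointer: str) -> list[str]:
    """Map a Box 24E pointer string (e.g. 'AB') to the Box 21 codes it names.

    `box21_codes` is ordered so that index 0 is slot A, index 1 is slot B,
    and so on. Raises ValueError for a letter outside A-L or a letter that
    points at an empty or missing slot.
    """
    resolved = []
    for letter in pointer.strip().upper():
        index = POINTER_LETTERS.find(letter)
        if index == -1:
            raise ValueError(f"invalid diagnosis pointer {letter!r}")
        if index >= len(box21_codes) or not box21_codes[index]:
            raise ValueError(f"pointer {letter!r} refers to an empty Box 21 slot")
        resolved.append(box21_codes[index])
    return resolved


if __name__ == "__main__":
    diagnoses = ["E11.9", "I10"]  # slot A and slot B
    print(resolve_pointers(diagnoses, "A"))
    print(resolve_pointers(diagnoses, "AB"))
```

Raising on a dangling pointer, rather than silently skipping it, makes extraction errors in either region visible during evaluation.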

Who uses this data

Healthcare-AI platforms, claims-processing pipelines, RCM (revenue cycle management) vendors, payer adjudication systems, medical billing SaaS, HIPAA-compliant IDP vendors, and fraud-detection teams training on upcoding, unbundling, and fabricated claim patterns. The CMS-1500 is the standard paper claim for professional services in US healthcare, so an extractor without CMS-1500 support is a hard sell in healthcare RCM.

Document complexity profile

212 fields on a single rendered page — an order of magnitude more than any other form in the corpus. Breakdown: 185 text, 27 checkboxes. 440 annotation relations (by far the most of any form we generate). 6 conditional bindings, 56 function-call bindings, 1 arithmetic binding, max expression depth 4. The Box 24 service line table alone contains 6 rows × 13 columns of relational data that must extract correctly with diagnosis pointers (Box 24E) cross-referencing Box 21's ICD-10 codes.
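The profile above does not say which fields the single arithmetic binding connects. One plausible candidate, assumed here purely for illustration, is that Box 28 (Total Charge) equals the sum of the Box 24F charges across the service lines. The sketch below checks that relationship with exact decimal arithmetic; the function name and the string inputs are hypothetical.

```python
from decimal import Decimal


def total_charge_matches(line_charges: list[str], box28_total: str) -> bool:
    """Return True if the Box 24F line charges sum exactly to Box 28.

    Amounts are passed as strings (as they would appear in an annotation
    file) and parsed with Decimal to avoid binary floating-point drift.
    """
    expected = sum((Decimal(charge) for charge in line_charges), Decimal("0"))
    return expected == Decimal(box28_total)


if __name__ == "__main__":
    print(total_charge_matches(["125.00", "45.50"], "170.50"))
    print(total_charge_matches(["125.00"], "125.01"))
```

A check of this kind is a cheap consistency test for extraction output: a misread digit in any one service line breaks the equality.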

Key stats from our synthetic corpus

Quantitative characteristics of the CMS-1500 Health Insurance Claim Form documents our generator produces.

| Metric | Value | Detail |
| --- | --- | --- |
| CMS-1500 population coverage | 100% | 100% of synthetic identities in our 1,000-identity corpus are CMS-1500-eligible (i.e., have a plausible healthcare encounter). CMS-1500 has the broadest base of any form we generate, since every US resident is a potential patient. |
| Place of Service 11 (Office) | 64% | 64% of synthetic CMS-1500s use Place of Service code 11 (Office). Models that see mostly this high-volume code must also handle the long tail: POS 22 (Outpatient Hospital) 9%, POS 20 (Urgent Care) 8%, POS 19 (Off-Campus Outpatient) 6%, POS 81 (Independent Lab) 5%. |
| Employer-group insurance | 49% | 49% of synthetic CMS-1500s list EMPLOYER GROUP as the insurance plan name in Box 11c, followed by PRIVATE 19%, ACA MARKETPLACE 18%, MEDICARE 9%, and MEDICAID 5%. This payer mix is modeled on real-world US commercial-healthcare distributions and exercises Box 1's seven-checkbox payer-selection region. |
| Single-diagnosis claims | 72% | 72% of synthetic Box 24E Diagnosis Pointers are single-letter (just 'A'), while 28% are multi-pointer (e.g., 'AB' pointing at two Box 21 ICD-10 codes). Multi-pointer claims are the cross-reference case that diagnosis/procedure linking models must master. |
| Zero-patient-payment claims | 85% | 85% of synthetic CMS-1500s list $0.00 in Box 29 Amount Paid, the common pattern of an initial submission on which no payment from the patient or another payer has been recorded. The remaining 15% exercise the partial-payment path with small currency values. |
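To see how a batch of generated claims compares with the Place of Service figures above, the codes can be tallied directly. The sketch below is a minimal example, assuming the Box 24B values have already been pulled out of the annotation files into a flat list of strings; the function names are hypothetical.

```python
from collections import Counter


def pos_shares(pos_codes: list[str]) -> dict[str, float]:
    """Return the share of each Place of Service code, most common first."""
    if not pos_codes:
        raise ValueError("no Place of Service codes supplied")
    counts = Counter(pos_codes)
    total = len(pos_codes)
    return {code: count / total for code, count in counts.most_common()}


def tail_share(pos_codes: list[str], head_code: str = "11") -> float:
    """Return the fraction of codes that are NOT the high-volume head code."""
    if not pos_codes:
        raise ValueError("no Place of Service codes supplied")
    tail_count = sum(1 for code in pos_codes if code != head_code)
    return tail_count / len(pos_codes)


if __name__ == "__main__":
    batch = ["11", "11", "11", "22", "20"]  # toy batch, not corpus data
    print(pos_shares(batch))
    print(tail_share(batch))
```

Tracking the tail share separately is a simple way to confirm that an evaluation set contains enough non-office claims to measure long-tail accuracy.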

How this document co-occurs with others

Rates at which identities in our corpus that produce a CMS-1500 Health Insurance Claim Form also produce other documents.

| Correlation | Rate | Detail |
| --- | --- | --- |
| Patient also files a 1040 | 100% | Every CMS-1500 patient identity also produces a matching Form 1040 at year-end. Healthcare-AI pipelines use this to link HSA/medical-deduction evidence on the 1040 (Schedule A line 1) back to specific CMS-1500 claims. |
| Provider files a W-9 | 100% | Every CMS-1500 claim in our corpus has an associated provider W-9. The billing provider name in Box 33 and the federal tax ID in Box 25 of the CMS-1500 reconcile to the name and the Part I TIN on the provider's W-9, a direct cross-form link used by payer credentialing systems. |
| Patient is also a W-2 wage earner | 65% | 65% of synthetic CMS-1500 patients are also W-2 earners (i.e., have employer-sponsored health coverage). This pairing is the anchor for eligibility-verification systems checking employer coverage against claim submissions. |
| Provider invoicing overlap | 65% | 65% of CMS-1500 providers also issue commercial invoices for cash-pay services. This is the direct-pay / cash-practice overlap that healthcare-fintech platforms must handle alongside insurance claims. |
| Married filers | 45% | 45% of synthetic CMS-1500 patients are married. Box 6 (Patient Relationship to Insured) must correctly distinguish cases where the patient is the insured (Self) from cases where the patient is covered through a spouse (Spouse), a primary eligibility-verification failure mode. |
| Retired primaries (Medicare track) | 10% | 10% of synthetic CMS-1500 patients are retired primaries: the Medicare patient population that exercises the Box 1 Medicare checkbox, the Box 1a Medicare ID format, and the provider-billing rules specific to Medicare claims. |
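The provider-to-W-9 link in the table above can be sketched as a normalized comparison. This is an illustrative example only: it assumes the tax ID is read from Box 25 (Federal Tax I.D. Number) and the billing provider name from Box 33, and every dictionary key below is a hypothetical field name rather than the actual annotation schema.

```python
import re


def normalize_tin(tin: str) -> str:
    """Strip everything except digits, so '12-3456789' equals '123456789'."""
    return re.sub(r"\D", "", tin)


def normalize_name(name: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    return " ".join(cleaned.split())


def provider_matches_w9(claim: dict, w9: dict) -> bool:
    """Return True if the claim's billing provider reconciles to the W-9.

    Both the 9-digit TIN and the normalized name must agree.
    """
    claim_tin = normalize_tin(claim["box25_federal_tax_id"])
    w9_tin = normalize_tin(w9["part1_tin"])
    if len(claim_tin) != 9 or claim_tin != w9_tin:
        return False
    return normalize_name(claim["box33_billing_provider_name"]) == normalize_name(
        w9["line1_name"]
    )


if __name__ == "__main__":
    claim = {
        "box33_billing_provider_name": "Lakeside Family Medicine, LLC",
        "box25_federal_tax_id": "12-3456789",
    }
    w9 = {"line1_name": "LAKESIDE FAMILY MEDICINE LLC", "part1_tin": "123456789"}
    print(provider_matches_w9(claim, w9))
```

Exact matching after normalization is deliberately strict; a production reconciler would likely add fuzzy name matching, which is outside the scope of this sketch.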

All stats above are corpus-derived: they were computed on a local synthetic corpus of 1,000 generated identities produced by SymageDocs' World Simulation Engine. No real patient health information (PHI) or provider billing data was used. Regenerate the corpus at any time with `make corpus-stats`.

Frequently asked questions

What data formats do synthetic CMS-1500 documents come in?
Each generated identity produces a filled PDF and a structured JSON annotation file containing bounding boxes and field values for all 68 fields on the single-page form.
Can I use this data commercially?
Yes. All synthetic data is generated from statistical models, contains no real PHI or PII, and is licensed for commercial use including ML model training and benchmarking.
How does the synthetic data differ from real CMS-1500 forms?
Synthetic CMS-1500s use fabricated patient identities, provider NPIs, and diagnosis/procedure code combinations. The data is clinically plausible but does not come from real patient encounters.
Does the synthetic data comply with HIPAA?
Yes. HIPAA regulates protected health information (PHI); because all data is synthetically generated and contains no real patient information, there are no HIPAA restrictions on storage, sharing, or use in model training.
Are the diagnosis and procedure codes realistic?
Yes. The generator uses valid ICD-10 diagnosis codes and CPT/HCPCS procedure codes in clinically plausible combinations, so models learn real-world code patterns without memorizing synthetic-only sequences.