SymageDocs API User Manual

The SymageDocs API lets you programmatically generate synthetic documents and identities. This manual covers authentication, endpoints, billing, and common workflows.

Python users: Consider the Python SDK (pip install symagedocs) instead of raw HTTP. It handles authentication, retries, pagination, and polling automatically.

Quick Reference

Base URL: https://symagedocs.ai/api/v1
Auth: Authorization: Bearer sk_live_YOUR_KEY
Rate limit: Per-key, tier-based sustained rate (free: 10 req/s; pro: 30; scale: 100; enterprise: 200) with 3× short burst

Endpoints at a Glance

These are the public Bearer-authenticated endpoints under /api/v1. Webhook and API-key management are session-authenticated and live under /api/internal/... (see Webhooks and Key Management).

Method	Endpoint	Scope	Description
`GET`	`/forms`	read	List available forms (filter: `?category=tax`)
`GET`	`/forms/{id}`	read	Form details with field definitions
`POST`	`/generate`	generate	Create an async generation job
`GET`	`/jobs`	read	List your jobs (paginated: `?limit=50&cursor=...`)
`GET`	`/jobs/{id}`	read	Job status and progress
`GET`	`/jobs/{id}/download/{fmt}`	read	Download output (`dataset`, `csv`, or `json`)
`GET`	`/jobs/{id}/downloads`	read	Presigned download URLs for all job output files
`GET`	`/jobs/{id}/items`	read	List per-item records for a job
`GET`	`/jobs/{id}/items/{item_id}/download`	read	Presigned URLs for one item's files
`POST`	`/jobs/{id}/cancel`	generate	Cancel a running job; refund unspent credits
`POST`	`/identities`	generate	Generate raw identities (JSON only, no forms)
`GET`	`/account/balance`	account	Credit usage totals
`GET`	`/account/usage`	account	Usage summary (filter: `?days=30`)
`POST`	`/tabular/parse`	generate	NL description to column schema (LLM-powered)
`POST`	`/tabular/generate`	generate	Generate tabular data from column schema
`GET`	`/tabular/{id}/status`	read	Tabular job progress and ETA
`GET`	`/tabular/{id}/download/{fmt}`	read	Download tabular output (`csv` or `json`)
`GET`	`/pricing/rates`	—	Get current credit rate constants (no auth)
`POST`	`/pricing/estimate`	—	Estimate credit cost for a job (no auth)
`GET`	`/jobs/size-rates`	—	Get calibrated bundle-size rate constants (no auth)
`POST`	`/jobs/size-estimate`	—	Estimate compressed bundle download size (no auth)
`GET`	`/health`	—	Liveness probe (no auth)

Typical Workflow (Jobs)

1. GET  /forms                          → pick a form_id
2. POST /generate  {form_id, quantity}  → get job_id
3. GET  /jobs/{job_id}                  → poll until status = "completed"
4. GET  /jobs/{job_id}/download/dataset → download all artifacts as a single zip
   — or —
   GET  /jobs/{job_id}/downloads        → get presigned URLs for direct S3 download

Typical Workflow (Per-item training data)

1. GET  /forms                                 → pick a form_id
2. POST /generate  {form_id, quantity, output_formats: ["pdf_typed", "bio"]}
   ↑ optional: send `Idempotency-Key: <key>` header for retry-safety
3. GET  /jobs/{job_id}                         → poll until status = "completed"
4. GET  /jobs/{job_id}/items                   → enumerate items (paginated)
5. GET  /jobs/{job_id}/items/{id}/download     → presigned URLs per item

Typical Tabular Workflow

1. POST /tabular/parse  {prompt}           → get column schema
2. POST /tabular/generate  {columns, qty}  → get job_id (status = "processing")
3. GET  /tabular/{job_id}/status           → poll until status = "completed"
4. GET  /tabular/{job_id}/download/csv     → download results

# Tabular jobs use status = "processing" while in flight (not "pending"/"generating",
# which apply only to /jobs).

Key Scopes

Scope	What it allows
`generate`	Create jobs, generate identities, tabular parse/generate
`read`	List forms, view jobs, download output, tabular status/download
`account`	View balance and usage

Credit Cost (per document)

Output	Formula
Typed PDF	`20 + ceil(max(fields - 25, 0) / 25) * 2`
Handwritten PDF	`40 + ceil(max(fields - 25, 0) / 25) * 4`
Filled PDF	`12 + ceil(max(fields - 25, 0) / 25) * 1.2`
Tabular (CSV/JSON)	`ceil(rows / 50 + ceil(rows * max(cols - 10, 0) / 500))`

JSON and CSV are always free. Every dataset zip includes ground_truth/json/ and ground_truth/data.csv at no extra cost. They are not selectable in output_formats.
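
The formulas above can be checked with a few lines of Python. This is a minimal sketch, not the billing implementation: the function names are hypothetical, and mapping the table rows to the output_formats tokens pdf_typed, pdf_handwritten, and pdf_filled is an assumption. For authoritative numbers, POST /pricing/estimate remains the reference.

```python
import math


def extra_field_blocks(fields: int) -> int:
    """Number of started 25-field blocks beyond the first 25 fields."""
    return math.ceil(max(fields - 25, 0) / 25)


def credits_per_document(output: str, fields: int) -> float:
    """Per-document credit cost for one PDF output, following the table above."""
    # (base cost, cost per extra 25-field block); token names are assumed.
    base, step = {
        "pdf_typed": (20, 2),
        "pdf_handwritten": (40, 4),
        "pdf_filled": (12, 1.2),
    }[output]
    return base + extra_field_blocks(fields) * step


def tabular_credits(rows: int, cols: int) -> int:
    """Credit cost for a tabular job with the given row and column counts."""
    wide_surcharge = math.ceil(rows * max(cols - 10, 0) / 500)
    return math.ceil(rows / 50 + wide_surcharge)
```

Under these formulas a 60-field form costs 24 credits per typed document and 48 per handwritten document, and a 1,000-row, 12-column table costs 24 credits.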

Response Envelope

{"data": ..., "meta": {"request_id": "uuid", ...}}

Common Errors

Code	Meaning
401	Invalid or revoked API key
402	Insufficient credits
403	Key missing required scope
404	Resource not found
409	Job not yet completed (for downloads)
429	Rate limit exceeded

Quick Start

1. Create an API Key

Log in to the SymageDocs web application and navigate to Account > API Keys. Click Create Key, give it a name, and copy the key immediately — it will not be shown again.

Your key looks like: sk_live_a1b2c3d4e5f6... (56 characters total).

2. Make Your First Request

curl https://symagedocs.ai/api/v1/forms \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE"

3. Generate a Document

# Create a generation job
curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "form_id": "irs_w2_2025",
    "quantity": 10,
    "output_formats": ["pdf_typed"]
  }'

# Poll for completion (replace JOB_ID with the job_id from above)
curl https://symagedocs.ai/api/v1/jobs/JOB_ID \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE"

# Download results when status is "completed" — every artifact in one zip
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/dataset \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE" \
  -o output.zip

Authentication

Every API request must include your key in the Authorization header:

Authorization: Bearer sk_live_YOUR_KEY_HERE

No username or password is needed — your identity is embedded in the key itself.
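
For callers not using the SDK, the header can be attached with the Python standard library alone. The sketch below only builds the request object and does not send it; build_request is a hypothetical helper name, not part of any SymageDocs package.

```python
import json
import urllib.request

BASE_URL = "https://symagedocs.ai/api/v1"


def build_request(method, path, api_key, payload=None):
    """Build (but do not send) an authenticated request for the given path."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if payload is not None:
        # JSON bodies need an explicit content type.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


request = build_request("GET", "/forms", "sk_live_YOUR_KEY_HERE")
# Sending it is one more call: urllib.request.urlopen(request, timeout=30)
```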

API Key Scopes

Keys can be created with specific scopes to limit what they can do:

Scope	Grants access to
`generate`	Create generation jobs, generate raw identities, tabular parse/generate
`read`	List forms, check job status, download outputs, tabular status/download
`account`	View credit balance and usage statistics

By default, new keys have all three scopes. You can create restricted keys (e.g., a read-only key for monitoring) through the web UI or API.

Key Management

You manage keys through the web UI at Account > API Keys, or programmatically:

Action	Description
Create	Generates a new key. The raw key is shown once — copy it immediately.
List	Shows all your keys with their prefix, name, scopes, and last-used date. Full keys are never shown.
Revoke	Permanently disables a key. Takes effect immediately. Cannot be undone.
Rotate	Creates a new key and revokes the old one in a single operation. The new key inherits the old key's name and scopes.

Rate Limiting

The API enforces a per-key rate limit determined by the key's tier. The sustained rate is the token-bucket refill rate; a fresh client can issue up to 3× the sustained rate as an initial burst before throttling kicks in.

Tier	Sustained	Burst
Free	10 requests/second	30 requests
Pro	30 requests/second	90 requests
Scale	100 requests/second	300 requests
Enterprise	200 requests/second	600 requests
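
A client can mirror this behavior locally to avoid hitting 429 in the first place. The class below is a generic token-bucket sketch, not the server's implementation; the class name and the injectable clock parameter are illustrative choices.

```python
import time


class TokenBucket:
    """Client-side limiter: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # a fresh client starts with a full burst
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Free tier from the table above: 10 requests/second sustained, burst of 30.
free_tier = TokenBucket(rate=10, capacity=30)
```

Call try_acquire() before each request and sleep briefly when it returns False.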

If you exceed your tier's limit, the API returns 429 Too Many Requests with the following response headers so clients can back off intelligently:

Header	Meaning
`Retry-After`	Seconds to wait before retrying (rounded up to at least 1)
`X-RateLimit-Limit`	The configured sustained rate for your tier (requests/second)
`X-RateLimit-Remaining`	Remaining budget within the current window
`X-RateLimit-Reset`	Unix timestamp (seconds) at which the next request slot becomes available
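
One way to turn these headers into a wait time is sketched below. The helper name, the case-insensitive lookup, and the 1-second fallback when neither header is present are assumptions made for illustration.

```python
import time


def backoff_seconds(headers, now=None):
    """Pick a wait time in seconds from the headers of a 429 response."""
    lowered = {key.lower(): value for key, value in headers.items()}
    if "retry-after" in lowered:
        # Preferred: the server states how long to wait.
        return max(1.0, float(lowered["retry-after"]))
    if "x-ratelimit-reset" in lowered:
        # Otherwise wait until the Unix timestamp of the next free slot.
        now = time.time() if now is None else now
        return max(1.0, float(lowered["x-ratelimit-reset"]) - now)
    return 1.0  # assumed fallback, not documented by the API
```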

Response Format

Success Responses

All endpoints return a consistent envelope:

{
  "data": { ... },
  "meta": {
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "count": 5,
    ...
  }
}

data contains the response payload (object or array depending on the endpoint).
meta contains request metadata. request_id is always present and useful for support inquiries.

Error Responses

Errors follow a standard format:

{
  "detail": {
    "code": "unknown_form",
    "message": "Unknown form: irs_w99_2024",
    "status": 400
  }
}
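
Client code usually wants this body converted into an exception. The following is a small sketch of that conversion; ApiError and raise_for_error are hypothetical names, and the fallback values for missing keys are assumptions.

```python
import json


class ApiError(Exception):
    """Raised for a 4xx/5xx response that carries the error body shown above."""

    def __init__(self, code, message, status):
        super().__init__(f"{status} {code}: {message}")
        self.code = code
        self.message = message
        self.status = status


def raise_for_error(status, body):
    """Return the parsed body on success; raise ApiError on a 4xx/5xx status."""
    parsed = json.loads(body)
    if status < 400:
        return parsed
    detail = parsed.get("detail", {})
    if not isinstance(detail, dict):
        # Defensive: tolerate a plain-string detail (assumed possibility).
        detail = {"message": str(detail)}
    raise ApiError(
        code=detail.get("code", "unknown"),
        message=detail.get("message", ""),
        status=detail.get("status", status),
    )
```

Branching on the code attribute (for example unknown_form) is more robust than matching on the message text.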

HTTP Status Codes

Code	Meaning
200	Success
400	Bad request (invalid parameters)
401	Authentication failed (invalid or revoked key)
402	Insufficient credits
403	Forbidden (missing required scope)
404	Resource not found
409	Conflict (e.g., downloading from an incomplete job)
422	Unprocessable request (e.g., per-user webhook cap reached)
429	Rate limit exceeded

Endpoints

Forms Catalog

List Forms

GET /api/v1/forms
GET /api/v1/forms?category=tax
GET /api/v1/forms?family=irs_w2

Returns all available forms. Optionally filter by category or family, and cap the number of results with limit.

Query parameters:

Parameter	Type	Default	Description
`category`	string	—	Filter by form category (e.g., `tax`, `healthcare`, `business`)
`family`	string	—	Filter by form family (e.g., `irs_w2`, `irs_f1040`)
`limit`	integer	—	Maximum number of forms to return (≥ 1)

Scope required: read

Response fields per form:

Field	Type	Description
`id`	string	Unique form identifier (use this in generation requests)
`name`	string	Human-readable form name
`category`	string	Form category (e.g., `tax`, `healthcare`, `business`)
`year`	integer	Tax year or form version year
`family`	string	Form family (e.g., `irs_w2`, `irs_f1040`)
`entity_type`	string	Entity type (`individual`, `business`, `healthcare`)
`credit_cost`	integer	Credits per document (at default PDF output)
`field_count`	integer	Number of fields on the form

Example:

curl "https://symagedocs.ai/api/v1/forms?category=tax" \
  -H "Authorization: Bearer sk_live_YOUR_KEY"

Get Form Details

GET /api/v1/forms/{form_id}

Returns full details for a specific form, including all field definitions.

Scope required: read

Additional response fields:

Field	Type	Description
`version`	string	Form definition version
`page_count`	integer	Number of pages in the form
`fields`	array	List of field definitions
`fields[].id`	string	Field identifier
`fields[].label`	string	Human-readable field label
`fields[].type`	string	Field type (e.g., `text`, `enum`, `ssn`, `currency`)
`fields[].page`	integer	Page number the field appears on

Example:

curl https://symagedocs.ai/api/v1/forms/irs_w2_2025 \
  -H "Authorization: Bearer sk_live_YOUR_KEY"

Generation

Create a Generation Job

POST /api/v1/generate

Creates an asynchronous generation job. The job runs in the background — poll the job status endpoint to track progress.

Scope required: generate

Request body:

Provide exactly one of form_id or form_ids (not both).

Field	Type	Required	Default	Description
`form_id`	string	conditional	—	Single form to generate. Mutually exclusive with `form_ids`.
`form_ids`	string[]	conditional	—	List of form IDs to generate coherently for the same identity (e.g., W-2 + 1040 with matching wages, SSN, and name). Mutually exclusive with `form_id`.
`quantity`	integer	no	1	Number of identities to generate (1–10,000,000). Each identity produces one document per form. Large jobs are routed to the bulk worker and may take minutes to hours; poll `/jobs/{job_id}` for status.
`output_formats`	string[]	no	`["pdf_typed"]`	Output formats (see below)
`config`	object	no	null	Generation configuration overrides
`seed`	integer	no	null	Seed for reproducible output
`ink_color`	string	no	`"blue"`	Ink color for handwritten output: `black`, `blue`, or `red`
`ink_color_distribution`	object	no	null	Distribution of ink colors across documents. Keys are color names, values are integer weights that must sum to 100. Example: `{"blue": 60, "black": 30, "red": 10}`. Overrides `ink_color` when provided.
`writer_consistency`	string	no	`"per_document"`	Writer consistency: `per_document` or `per_field`
`realism_level`	string	no	`"high"`	Deprecated. Accepted for backward compatibility; ignored. All handwritten output is rendered at high realism.
`webhook_url`	string	no	null	Deprecated. Inline webhook registration; pass a URL to receive `job.completed`/`job.failed` notifications (max 2,048 chars). Still works but will be removed in a future major version — register webhooks via `POST /api/internal/webhooks` instead. See Webhooks.

Output formats:

Format	Description
`pdf_typed`	Filled PDF with typed text
`pdf_handwritten`	Filled PDF with handwritten-style text (not available for all forms)
`pdf_filled`	Completed PDF whose values sit in live, editable AcroForm widgets (fillable forms only)
`png_typed`	Per-page PNG rasterizations of `pdf_typed` (requires `pdf_typed`)
`png_handwritten`	Per-page PNG rasterizations of `pdf_handwritten` (requires `pdf_handwritten`)

pdf_filled delivers editable widgets, not a flattened render. The completed values sit in live, editable AcroForm widgets, so a downstream consumer can open the PDF and change them. It is available only for fillable forms: requesting pdf_filled for a non-fillable form rejects the whole submission with 400 code=format_unsupported_for_form, and error.details.unsupported_form_ids lists every non-fillable form in the request. pdf_filled is not a render surface: it cannot satisfy the ML-format render dependency, is never paired with a PNG, and is never degraded (a degradation_profile has no effect on it). Filled PDFs are delivered under pdfs/filled/ in the bundle.

JSON, CSV, and FUNSD are always included. Every dataset zip automatically contains ground_truth/json/ (per-document structured JSON), ground_truth/data.csv (aggregated tabular CSV), and per-page FUNSD annotations regardless of which formats you request. Do not include "json", "csv", or "funsd" in output_formats — the API will return 400 code=invalid_output_format if you do.

Unknown formats are rejected. Any output_formats token outside the documented set above is rejected with 400 code=invalid_output_format, and the error lists the valid set. This includes typos and bare "png" (use "png_typed" or "png_handwritten"). Invalid tokens are never silently dropped.

PNG output requires its paired PDF. png_typed must be requested together with pdf_typed, and png_handwritten must be requested together with pdf_handwritten. PNG always travels with its paired PDF; PNG-only output is not a supported configuration. Submitting png_typed without pdf_typed (or png_handwritten without pdf_handwritten) is rejected by the API with 400 code=missing_pdf_dependency. Cross-surface pairs do not satisfy the dependency — e.g. ["png_typed", "pdf_handwritten"] is still rejected because typed PNG needs typed PDF specifically.

ML training formats (gated behind the ml-output-formats-enabled feature flag — requesting any of these without the flag returns 400 code=ml_formats_disabled):

Format	Description
`bio`	Token classification (B/I/O) labels
`yolo`	YOLOv8 object detection labels
`coco`	COCO JSON object detection annotations
`donut`	Donut OCR-free `metadata.jsonl` + per-page images

ML formats require a render format. All ML training formats (bio, yolo, coco, donut) derive word-level annotations from the renderer pipeline. You must include at least one render format — pdf_typed, pdf_handwritten, or png_typed — in the same request. Submitting ML formats without a render format returns 400 code=missing_render_dependency. Example: ["bio", "donut"] alone is invalid; use ["pdf_typed", "bio", "donut"].

You can request multiple formats in a single job (e.g., ["pdf_typed", "bio", "yolo"]). Regardless of how many formats you request, the completed job is downloaded as a single dataset zip containing every produced artifact — including JSON and CSV ground truth, which are always present.
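
The format rules above can be checked client-side before spending a request. This sketch encodes only the rules stated in this section; the function name and the order in which the checks run are assumptions, and it cannot see server-side conditions such as the ML feature flag or whether a form is fillable.

```python
PDF_FORMATS = {"pdf_typed", "pdf_handwritten", "pdf_filled"}
PNG_REQUIRES = {"png_typed": "pdf_typed", "png_handwritten": "pdf_handwritten"}
ML_FORMATS = {"bio", "yolo", "coco", "donut"}
# Render formats exactly as listed in the ML dependency rule above.
RENDER_FORMATS = {"pdf_typed", "pdf_handwritten", "png_typed"}
VALID_FORMATS = PDF_FORMATS | set(PNG_REQUIRES) | ML_FORMATS


def check_output_formats(formats):
    """Return the documented error code a request would hit, or None if it passes."""
    requested = set(formats)
    if requested - VALID_FORMATS:
        # Covers typos, bare "png", and the always-included json/csv/funsd.
        return "invalid_output_format"
    for png, pdf in PNG_REQUIRES.items():
        if png in requested and pdf not in requested:
            return "missing_pdf_dependency"
    if requested & ML_FORMATS and not requested & RENDER_FORMATS:
        return "missing_render_dependency"
    return None
```

For example, ["pdf_typed", "bio", "donut"] passes, while ["bio", "donut"] alone yields missing_render_dependency.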

Multi-surface jobs (["pdf_typed", "pdf_handwritten"]) run in parallel by default (STAX-1707). Before 2026-05-15, such jobs were forced through a sequential code path as a defensive workaround; bulk jobs now use the full bulk-worker core count regardless of how many PDF surfaces are requested. Output bundles separate the two surfaces into pdfs/typed/ and pdfs/handwritten/ (plus images/{typed,handwritten}/ and the per-format annotations/<format>/{typed,handwritten}/ trees). A typed and a handwritten render of the same logical document share one identity (see ADR-058 for the architectural commitment). Filled PDFs (pdf_filled), when requested, are delivered alongside under pdfs/filled/ with no paired images or annotations.

Response:

{
  "data": {
    "job_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "pending",
    "credits_required": 2200
  },
  "meta": {
    "request_id": "...",
    "credits_charged": 2200
  }
}

Example — single form:

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_id": "irs_w2_2025",
    "quantity": 50,
    "output_formats": ["pdf_typed"],
    "seed": 12345
  }'

Example — multi-form (coherent generation):

Generate W-2 and 1040 documents for the same identities. Each identity's W-2 wages will match the 1040 line 1 wages, and SSN/name/address fields are consistent across both forms.

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_ids": ["irs_w2_2025", "irs_f1040_2024"],
    "quantity": 10,
    "output_formats": ["pdf_typed"],
    "seed": 42
  }'

Note: Calling /generate twice with the same seed and different form_id values does not produce coherent data — the identity biasing differs per form. Use form_ids to get matching documents.

Coherence Mode (ML ablation control)

Multi-form jobs accept a coherence_mode flag inside config that controls how identities and field values relate across the forms of a single item_index:

coherent (default) — one identity per item_index; every form in that item is filled from it. Cross-form correlations (same SSN on W-2 and 1040, same wages on W-2 box 1 and 1040 line 1) are preserved. Use this for realistic tax-prep datasets and any production workload.
shuffled — identities are still generated one-per-item_index, but each field's final value vector is permuted across items with a seed derived from (seed, form_id, field_id). Marginal distributions and spatial layout stay valid; within-image correlations are broken. Use this for ML ablation experiments that need to measure how much a model leans on cross-field correlations inside one document.
random — each (item_index, form_id) pair gets its own independently generated identity, so W-2 and 1040 in the same item will usually carry different SSNs, names, and addresses. Use this as the strongest ablation baseline, or to stress-test downstream code that should not assume identity continuity across forms.

Reproducibility: passing the same (form_ids, quantity, seed, output_formats, coherence_mode) produces byte-identical output for all three modes. The chosen mode is recorded in manifest.json under dataset_info.generation_params.coherence_mode so dataset consumers can branch on it.

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_ids": ["irs_w2_2025", "irs_f1040_2024"],
    "quantity": 100,
    "seed": 42,
    "output_formats": ["pdf_typed"],
    "config": {"coherence_mode": "shuffled"}
  }'
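
To make the shuffled mode concrete, the sketch below permutes each field's column of values across items with a seed derived from (seed, form_id, field_id). It only illustrates the idea: the hashing scheme, the function names, and the use of Python's random module are assumptions, and the service's actual permutation will differ.

```python
import hashlib
import random


def derived_seed(seed, form_id, field_id):
    """Stable integer derived from (seed, form_id, field_id); scheme is assumed."""
    digest = hashlib.sha256(f"{seed}:{form_id}:{field_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def shuffle_fields(items, seed, form_id):
    """Permute each field's values across items, independently per field."""
    if not items:
        return []
    shuffled = [dict(item) for item in items]  # leave the input untouched
    for field_id in sorted(items[0]):
        column = [item[field_id] for item in items]
        random.Random(derived_seed(seed, form_id, field_id)).shuffle(column)
        for row, value in zip(shuffled, column):
            row[field_id] = value
    return shuffled
```

Each field keeps exactly the same set of values (the marginal distribution), but because every field is permuted with its own seed, the pairing between fields within one item is broken. Repeating the call with the same arguments gives the same result.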

List Jobs

GET /api/v1/jobs
GET /api/v1/jobs?limit=20&status=completed

Returns your generation jobs, newest first. Supports cursor-based pagination.

Scope required: read

Query parameters:

Parameter	Type	Default	Description
`limit`	integer	50	Results per page (1–100)
`cursor`	string	—	Pagination cursor from a previous response
`status`	string	—	Filter by status: `pending`, `generating`, `completed`, `failed`
`form_id`	string	—	Filter to jobs whose primary form is this form id

Pagination:

The response meta includes:

has_more — true if more results exist beyond this page
next_cursor — pass this as the cursor parameter to get the next page

# First page
curl "https://symagedocs.ai/api/v1/jobs?limit=10" \
  -H "Authorization: Bearer sk_live_YOUR_KEY"

# Next page (next_cursor from the previous response, URL-encoded: "+" becomes %2B)
curl "https://symagedocs.ai/api/v1/jobs?limit=10&cursor=2026-03-24T10:00:00%2B00:00" \
  -H "Authorization: Bearer sk_live_YOUR_KEY"
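
The same loop in Python might look like the sketch below. fetch_page is a hypothetical caller-supplied function that performs the GET for a query string and returns the decoded response envelope; urlencode takes care of percent-encoding the cursor.

```python
from urllib.parse import urlencode


def jobs_query(limit=50, cursor=None, status=None):
    """Build the query string for GET /jobs, percent-encoding each value."""
    params = {"limit": limit}
    if cursor is not None:
        params["cursor"] = cursor
    if status is not None:
        params["status"] = status
    return urlencode(params)


def iter_jobs(fetch_page, limit=50):
    """Yield every job, following meta.next_cursor while meta.has_more is true."""
    cursor = None
    while True:
        page = fetch_page(jobs_query(limit=limit, cursor=cursor))
        yield from page["data"]
        if not page["meta"].get("has_more"):
            return
        cursor = page["meta"]["next_cursor"]
```

Because iter_jobs is a generator, pages are fetched lazily and iteration can stop early without requesting the remaining pages.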

Get Job Status

GET /api/v1/jobs/{job_id}

Returns detailed status for a specific job.

Scope required: read

Response fields:

Field	Type	Description
`job_id`	string	Job identifier
`status`	string	`pending`, `generating`, `completed`, `failed`, `canceling`, `canceled`, or `expired`
`progress`	float	Completion progress (0.0 to 1.0)
`form_id`	string	Form used for generation
`form_name`	string	Human-readable form name
`quantity`	integer	Number of documents requested
`output_formats`	string[]	Requested output formats
`credits_required`	integer	Total credits for this job
`credits_charged`	integer	Credits actually charged
`seed`	integer	Seed used (null if random)
`error`	string	Error message if job failed (null otherwise)
`created_at`	string	ISO 8601 creation timestamp
`completed_at`	string	ISO 8601 completion timestamp (null if not done)
`generation_time_ms`	integer	Total generation time in milliseconds (null if not completed)
`worker_count`	integer	Number of workers used for this job
`peak_workers`	integer	Peak concurrent workers during generation
`throughput_docs_per_min`	float	Generation throughput in documents per minute
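
A polling loop over this endpoint can be sketched as follows. get_status is a hypothetical caller-supplied function that performs GET /jobs/{job_id} and returns the data object; the interval, growth factor, and timeout defaults are illustrative, not recommended values from SymageDocs. The terminal statuses are the ones named under Download Job Output.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "canceled", "expired"}


def wait_for_job(get_status, interval=2.0, max_interval=30.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status, backing off between polls."""
    deadline = clock() + timeout
    while True:
        job = get_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout} seconds")
        sleep(interval)
        # Grow the wait gradually so long bulk jobs are not polled every 2 seconds.
        interval = min(interval * 1.5, max_interval)
```

The caller should still inspect the returned status, since failed, canceled, and expired also end the loop.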

Download Job Output

GET /api/v1/jobs/{job_id}/download/{format}

Downloads the generated output in the specified format. The job must be in a terminal status: completed, or — so partial output is recoverable — canceled, failed, or expired. Downloading a job that is still running returns 409.

Scope required: read

Path parameters:

Parameter	Values	Description
`format`	`dataset`, `csv`, `json`	One format per request. Per-format downloads (`pdf_typed`, `pdf_handwritten`, `coco`, `bio`, `yolo`, `funsd`, `donut`, `paperlives`, `hf_datasets`, `coco_layout`, `funsd_layout`) have been retired in favor of the unified `dataset`. Requesting any of those returns `400 code=invalid_format`. `bundle`, the pre-rename token for `dataset`, is still accepted by this endpoint during the migration soak but is deprecated and rejected by the Python SDK — use `dataset`.

Response: Binary file download.

dataset returns a single ZIP archive containing every artifact the job produced (PDFs, PNGs, JSON, CSV, and any ML annotations that were requested) plus a manifest.json describing the contents.
json returns the aggregated structured data as JSON.
csv returns the aggregated data as a CSV file.

Very large jobs are stored as a sharded dataset with no single archive; download/dataset then returns 409 with a pointer to the dataset manifest (GET /jobs/{job_id}/dataset/manifest), from which the preview and shard archives can be fetched individually. The Python SDK's generate.download_dataset() handles both layouts automatically.

For per-file access (e.g., to fetch a single PDF), use GET /jobs/{job_id}/downloads to obtain presigned S3 URLs.

Example:

# Download every artifact as a single ZIP
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/dataset \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -o output.zip

# Download JSON data
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/json \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -o data.json

Presigned Downloads

Get Job Download URLs

GET /api/v1/jobs/{job_id}/downloads

Returns presigned S3 URLs for all output files of a completed job. Files can be downloaded directly from S3 without proxying through the server.

Scope required: read

Response:

{
  "data": {
    "job_id": "550e8400-...",
    "files": [
      {
        "filename": "irs_w2_2025_abc123.pdf",
        "url": "https://s3.amazonaws.com/...presigned",
        "content_type": "application/pdf",
        "item_index": 0
      }
    ],
    "expires_in": 3600
  }
}

URLs expire after 1 hour. Request new URLs if needed.
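
When processing the files array, it can help to group entries by item and to track whether the URLs are still usable. The helpers below are a sketch; their names and the 60-second safety margin are assumptions.

```python
from collections import defaultdict


def group_by_item(files):
    """Map item_index to a list of (filename, url) pairs from the files array."""
    grouped = defaultdict(list)
    for entry in files:
        grouped[entry["item_index"]].append((entry["filename"], entry["url"]))
    return dict(grouped)


def urls_still_valid(fetched_at, expires_in, now, safety_margin=60):
    """True while URLs fetched at `fetched_at` (epoch seconds) can still be used."""
    # Stop a little early so a download does not start on an expiring URL.
    return now < fetched_at + expires_in - safety_margin
```

When urls_still_valid returns False, call GET /jobs/{job_id}/downloads again to obtain fresh URLs.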

Cancel a Job

POST /api/v1/jobs/{job_id}/cancel

Cancels a running job. Items already rendered when the cancel takes effect remain on storage and can be downloaded via the dataset download endpoint. The call is idempotent: cancelling a job that is already canceling or canceled returns 200 with the current state. Cancelling a job in any other terminal status (completed, failed, or expired) returns 409.

Scope required: generate

Response: standard envelope with {"job_id", "status"}, where status is either canceling (the worker is draining its current item) or canceled (the job was still pending and skipped straight to terminal). The worker observes the cancel flag at the per-item boundary. If the server cannot finalize the cancel inline, the response also includes an optional note field, "Cancellation in progress; will complete within 10 minutes.", indicating that a background reconciler will converge the job to canceled.

Refunds: Cancelling a pending job (no work performed yet) refunds the full credit debit. Cancelling a generating job refunds only the unspent portion (prorated); credits already consumed by items rendered before the cancel took effect are not returned. Either way, the refund happens automatically.

Idempotency-Key Header

POST /api/v1/generate accepts an optional Idempotency-Key header. When the same (api-key user, key) pair is seen again within 24 hours the original job_id is returned and credits are not debited a second time. Useful for safe network retries.

The key may be at most 255 characters; an empty or longer key returns 400 code=invalid_idempotency_key.

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Idempotency-Key: my-retry-safe-key-001" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 100}'

The replay response sets meta.idempotent_replay = true and meta.credits_charged = 0.
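
A retry-safe caller generates the key once per logical job and reuses it on every retry. The sketch below prepares the headers and body and checks the replay flag; the helper names are hypothetical, and using a UUID as the key is just one reasonable choice.

```python
import json
import uuid


def generate_call(api_key, payload, idempotency_key=None):
    """Return (headers, body) for POST /generate with an Idempotency-Key header."""
    key = str(uuid.uuid4()) if idempotency_key is None else idempotency_key
    if not 1 <= len(key) <= 255:
        # Mirrors the documented limit instead of waiting for a 400 response.
        raise ValueError("Idempotency-Key must be 1 to 255 characters")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Idempotency-Key": key,
    }
    return headers, json.dumps(payload).encode("utf-8")


def was_replayed(envelope):
    """True when the response envelope marks the call as an idempotent replay."""
    return envelope.get("meta", {}).get("idempotent_replay") is True
```

Store the returned headers alongside the pending request so that a retry after a timeout sends the identical key.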

Per-Item Endpoints

Per-item training-data access lives on the /jobs surface:

Endpoint	Description
`GET /api/v1/jobs/{job_id}/items`	Cursor-paginated list of per-item records. Each item carries presigned download URLs.
`GET /api/v1/jobs/{job_id}/items/{item_id}/download`	Presigned URLs for a single item's files.

Raw Identities

Generate Identities

POST /api/v1/identities

Generates synthetic identities as raw JSON — no form rendering, no PDF output. Useful for testing, schema exploration, or when you need identity data without filled documents.

Scope required: generate

Request body:

Field	Type	Required	Default	Description
`quantity`	integer	no	1	Number of identities (1–10,000)
`config`	object	no	null	Generation configuration
`seed`	integer	no	null	Seed for reproducibility

Response: The standard envelope with data containing an array of identity objects. Each identity has fields like first_name, last_name, ssn, dob, address, employer information, and more.

Example:

curl -X POST https://symagedocs.ai/api/v1/identities \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"quantity": 5, "seed": 42}'

Account

Get Credit Balance

GET /api/v1/account/balance

Returns your credit usage totals.

Scope required: account

Response fields:

Field	Type	Description
`credits_used`	integer	Total credits charged across all completed jobs
`credits_allocated`	integer	Total credits reserved across all jobs (including pending)

Get Usage Summary

GET /api/v1/account/usage?days=30

Returns a usage summary for the specified period.

Scope required: account

Query parameters:

Parameter	Type	Default	Description
`days`	integer	30	Number of days to summarize (1–365)

Response fields:

Field	Type	Description
`period_days`	integer	Period covered
`total_jobs`	integer	Number of jobs created
`total_items_generated`	integer	Total documents generated
`total_credits_used`	integer	Total credits charged
`jobs_by_status`	object	Job count by status (`completed`, `failed`, etc.)

Webhooks

Webhooks let you receive real-time HTTP notifications when events occur — no polling required. Register a URL, and SymageDocs will POST a signed JSON payload whenever a subscribed event fires.

Events

Event	Fires when
`job.completed`	A generation job finishes successfully (includes presigned download URLs)
`job.failed`	A generation job fails

Inline Registration via `webhook_url` (deprecated)

Deprecated. The webhook_url parameter still works but will be removed in a future major version. Prefer pre-registering webhooks via POST /api/internal/webhooks (below).

For integrations that have not migrated yet, passing a webhook_url parameter when creating a job registers a webhook inline:

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_id": "irs_w2_2025",
    "quantity": 10,
    "webhook_url": "https://your-server.com/hooks"
  }'

A temporary webhook subscription is auto-created for job.completed and job.failed. This counts against the per-user webhook cap (see below): if you are already at the cap, the job is rejected with 422, and the credit debit for the job is refunded.

Register a Webhook

POST /api/internal/webhooks

Auth: Session (web UI) — uses x-user-id header, not API key auth.

Request body:

Field	Type	Required	Default	Description
`url`	string	yes	—	HTTPS endpoint to receive events (max 2,048 chars)
`events`	string[]	no	`["job.completed", "job.failed"]`	Event types to subscribe to (the two events above)
`description`	string	no	null	Optional label (max 500 chars)

Response:

{
  "webhook_id": "550e8400-...",
  "signing_secret": "whsec_a1b2c3d4e5f6...",
  "secret_prefix": "whsec_a1b2c3d4",
  "url": "https://example.com/hook",
  "events": ["job.completed", "job.failed"],
  "description": null,
  "created_at": "2026-03-26T12:00:00+00:00"
}

Important: The signing_secret is shown once at creation. Store it securely — you'll need it to verify payload signatures.

Per-user cap: Each user may have at most 20 active webhooks. Registering one beyond the cap returns 422. Disable or delete an existing webhook to free a slot.

List Webhooks

GET /api/internal/webhooks

Returns all active webhooks for the current user. The full signing secret is never included — only the secret_prefix for identification.

Update a Webhook

PATCH /api/internal/webhooks/{webhook_id}

Update a webhook's URL, subscribed events, description, or active status. All fields are optional — only include what you want to change.

Request body:

Field	Type	Description
`url`	string	New endpoint URL
`events`	string[]	New event subscriptions
`description`	string	New description
`is_active`	boolean	Enable or disable delivery

Disable a Webhook

DELETE /api/internal/webhooks/{webhook_id}

Soft-deletes the webhook. It stops receiving events immediately but remains in the database for audit purposes.

Payload Format

Every webhook delivery is a POST request with a JSON body:

{
  "event": "job.completed",
  "timestamp": "2026-03-30T12:05:00+00:00",
  "webhook_id": "550e8400-...",
  "data": {
    "job_id": "661e9500-...",
    "status": "completed",
    "form_id": "irs_w2_2025",
    "quantity": 100,
    "credits_charged": 2000,
    "completed_at": "2026-03-30T12:05:00+00:00",
    "download_urls": [
      { "filename": "irs_w2_2025_abc123.pdf", "url": "https://s3...presigned" },
      { "filename": "irs_w2_2025_abc123.json", "url": "https://s3...presigned" }
    ]
  }
}

Headers included with every delivery:

Header	Description
`X-Symagedocs-Signature`	HMAC-SHA256 hex digest of the raw body
`X-Symagedocs-Event`	Event type (e.g., `job.completed`)
`X-Symagedocs-Delivery`	Unique delivery ID for deduplication

Verifying Signatures

Verify the X-Symagedocs-Signature header to ensure the payload came from SymageDocs and wasn't tampered with:

import hashlib, hmac

def verify_webhook(body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

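const crypto = require("crypto");

function verifyWebhook(body, signature, secret) {
  const expected = Buffer.from(crypto.createHmac("sha256", secret).update(body).digest("hex"));
  const received = Buffer.from(signature);
  // timingSafeEqual throws if the buffers differ in length, so check length first.
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}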
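As a sketch of how the pieces above fit together on the receiving side, the handler below verifies the signature over the raw body and then uses X-Symagedocs-Delivery to skip deliveries it has already processed, since a retried delivery can arrive more than once. The function name handle_delivery, the in-memory seen_deliveries set, the exact-case header lookup, and the returned status codes are assumptions for illustration, not part of the API.

```python
import hashlib
import hmac
import json


def handle_delivery(headers: dict, body: bytes, secret: str, seen_deliveries: set) -> int:
    """Return the HTTP status code to send back for one webhook delivery."""
    signature = headers.get("X-Symagedocs-Signature", "")
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Compare as bytes so a non-ASCII header value cannot raise TypeError.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401  # assumed choice: reject unsigned or tampered payloads

    delivery_id = headers.get("X-Symagedocs-Delivery")
    if delivery_id in seen_deliveries:
        return 200  # already processed; acknowledge so it is not retried again

    event = json.loads(body)  # parse only after the signature checks out
    if event["event"] == "job.completed":
        for item in event["data"].get("download_urls", []):
            print(item["filename"])  # placeholder for real download logic
    seen_deliveries.add(delivery_id)
    return 200
```

A real service would likely keep delivery IDs in a persistent store with an expiry rather than in a process-local set.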

Retry Policy

Failed deliveries (non-2xx response or network error) are retried up to 3 times with increasing backoff delays:

Attempt	Delay
1st retry	10 seconds
2nd retry	60 seconds
3rd retry	300 seconds (5 minutes)

After all retries are exhausted, the delivery is marked as failed. Your webhook remains active and will continue receiving future events.

Tabular Generation

The tabular API generates structured CSV/JSON data from either a natural-language description or an explicit column schema. It supports 50+ column types including identity-derived fields (names, SSNs, addresses) and synthetic fields (random numbers, enums, patterns).

Parse NL Description

POST /api/v1/tabular/parse

Converts a natural-language description into a structured column schema using an LLM, with a heuristic fallback.

Scope required: generate

Request body:

Field	Type	Required	Description
`prompt`	string	yes	Natural language description of desired columns (1–2,000 chars)

Example:

curl -X POST https://symagedocs.ai/api/v1/tabular/parse \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "first name, last name, age, city, state, annual salary"}'

Response:

{
  "data": {
    "columns": [
      { "name": "first_name", "type": "text_first_name", "constraints": {} },
      { "name": "last_name", "type": "text_last_name", "constraints": {} },
      { "name": "age", "type": "identity_age", "constraints": {} },
      { "name": "city", "type": "address_city", "constraints": {} },
      { "name": "state", "type": "address_state", "constraints": {} },
      {
        "name": "annual_salary",
        "type": "number_integer",
        "constraints": { "min": 20000, "max": 200000 }
      }
    ],
    "parser_type": "llm",
    "model": "claude-haiku-4-5-20251001"
  },
  "meta": { "request_id": "...", "count": 6 }
}

You can modify the returned columns (rename, remove, adjust constraints) before passing them to the generate endpoint.
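A minimal sketch of that editing step, working on a columns array shaped like the response above. The helper name adjust_columns and its parameters are hypothetical; only the name, type, and constraints keys come from the API.

```python
import copy


def adjust_columns(columns, rename=None, drop=(), constraints=None):
    """Return an edited copy of a parsed schema; the input list is left untouched."""
    rename = rename or {}
    constraints = constraints or {}
    edited = []
    for column in columns:
        if column["name"] in drop:
            continue  # remove unwanted columns
        column = copy.deepcopy(column)
        # Merge constraint overrides, keyed by the original column name.
        column.setdefault("constraints", {}).update(constraints.get(column["name"], {}))
        column["name"] = rename.get(column["name"], column["name"])
        edited.append(column)
    return edited


parsed = [
    {"name": "age", "type": "identity_age", "constraints": {}},
    {"name": "city", "type": "address_city", "constraints": {}},
    {"name": "annual_salary", "type": "number_integer",
     "constraints": {"min": 20000, "max": 200000}},
]
columns = adjust_columns(
    parsed,
    rename={"annual_salary": "salary"},
    drop={"city"},
    constraints={"annual_salary": {"min": 30000, "max": 150000}},
)
```

The resulting columns list can be sent as the columns field of the generate request.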

Generate Tabular Data

POST /api/v1/tabular/generate

Creates an async tabular generation job from an explicit column schema. Poll the status endpoint for progress.

Scope required: generate

Request body:

Field	Type	Required	Default	Description
`columns`	object[]	yes	—	Column definitions (1–50 columns, see below)
`quantity`	integer	no	100	Number of rows (1–10,000)
`output_formats`	string[]	no	`["csv"]`	Output formats: `csv`, `json`
`seed`	integer	no	null	Seed for reproducibility

Column definition:

Each column is an object with:

Field	Type	Description
`name`	string	Column header name
`type`	string	Column type (see common types below)
`constraints`	object	Type-specific parameters (min, max, values, weights, etc.)

Common column types:

Type	Description	Constraints
`text_first_name`	First name from synthetic identity	—
`text_last_name`	Last name	—
`text_full_name`	Full name	—
`ssn`	SSN (###-##-####)	—
`identity_age`	Age in years	—
`identity_gender`	Gender (M/F)	—
`address_city`	City	—
`address_state`	State abbreviation	—
`address_zip`	ZIP code	—
`phone`	Phone number	—
`email`	Email address	—
`number_integer`	Random integer	`min`, `max`, `distribution_type`, `distribution_params`
`number_decimal`	Random decimal	`min`, `max`, `decimals`
`custom_enum`	Weighted random choice	`values: [...]`, `weights: [...]`
`custom_mask`	Pattern-based string	`mask` (A=upper, a=lower, 9=digit, X=any)

Example:

curl -X POST https://symagedocs.ai/api/v1/tabular/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "columns": [
      {"name": "first_name", "type": "text_first_name", "constraints": {}},
      {"name": "last_name", "type": "text_last_name", "constraints": {}},
      {"name": "age", "type": "identity_age", "constraints": {}},
      {"name": "salary", "type": "number_integer", "constraints": {"min": 30000, "max": 150000}}
    ],
    "quantity": 1000,
    "output_formats": ["csv", "json"]
  }'

Response:

{
  "data": {
    "job_id": "550e8400-...",
    "status": "processing",
    "credits_required": 20
  },
  "meta": { "request_id": "...", "credits_charged": 20 }
}

Credits: Base rate is 1 credit per 50 rows (0.02 credits/row). A column surcharge applies only when the schema has more than 10 columns: each 500 "extra cells" (rows × columns beyond 10) costs 1 additional credit. Final cost = ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500)). The example above uses 4 columns and 1,000 rows: base = 1,000/50 = 20 credits, no surcharge (4 < 10), total = 20 credits.
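The formula above can be checked locally with a few lines of Python. This is a sketch of the stated arithmetic only; the function name tabular_credits is hypothetical, and the pricing endpoints remain the authoritative source for actual charges.

```python
import math


def tabular_credits(rows: int, cols: int) -> int:
    """Credits for a tabular job: base rate plus the surcharge for columns beyond 10."""
    base = rows / 50
    surcharge = math.ceil(rows * max(cols - 10, 0) / 500)
    return math.ceil(base + surcharge)


print(tabular_credits(1000, 4))   # 20
print(tabular_credits(1000, 20))  # 40
```

With 20 columns and 1,000 rows, the base is still 20 credits, and the 10,000 extra cells add another 20.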

Get Tabular Job Status

GET /api/v1/tabular/{job_id}/status

Returns progress and ETA for a tabular generation job.

Scope required: read

Response: The payload is wrapped in the standard { "data": …, "meta": … } envelope (see Success Responses). The fields below are returned inside data:

Field	Type	Description
`job_id`	string	Job identifier
`status`	string	`processing`, `completed`, or `failed`
`rows_generated`	integer	Rows completed so far
`total_rows`	integer	Total rows requested
`percent`	float	Completion percentage (0.0–100.0)
`estimated_seconds_remaining`	float	ETA in seconds (null if unavailable)
`error`	string	Error message if failed (null otherwise)

{
  "data": {
    "job_id": "tab_3f2a…",
    "status": "processing",
    "rows_generated": 1200,
    "total_rows": 5000,
    "percent": 24.0,
    "estimated_seconds_remaining": 18.0,
    "error": null
  },
  "meta": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
}
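A polling loop over these fields might look like the sketch below. The function name wait_for_tabular_job and the fetch_status callable are assumptions: fetch_status stands in for whatever HTTP client performs the GET and returns the parsed JSON envelope, which keeps the loop independent of any particular library.

```python
import time


def wait_for_tabular_job(fetch_status, poll_seconds=2.0, timeout_seconds=600.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until the job completes and return the final `data` object."""
    deadline = clock() + timeout_seconds
    while True:
        data = fetch_status()["data"]
        if data["status"] == "completed":
            return data
        if data["status"] == "failed":
            raise RuntimeError(f"tabular job {data['job_id']} failed: {data['error']}")
        if clock() >= deadline:
            raise TimeoutError(f"tabular job {data['job_id']} is still {data['status']}")
        # Use the server's ETA when it is shorter than the poll interval.
        eta = data.get("estimated_seconds_remaining")
        sleep(poll_seconds if eta is None else min(poll_seconds, max(eta, 0.1)))
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting; the defaults are the usual standard-library functions.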

Download Tabular Output

GET /api/v1/tabular/{job_id}/download/{format}

Downloads the generated tabular data. The job must be in completed status. Data expires after 48 hours.

Scope required: read

Path parameters:

Parameter	Values	Description
`format`	`csv`, `json`	One format per request.

Response: File download (text/csv or application/json).

Errors:

Code	Meaning
400	Invalid format (not `csv` or `json`)
404	Job not found
409	Job not yet completed
410	Data expired (older than 48 hours)

Pricing

These endpoints require no authentication and can be used to preview costs before creating jobs.

Get Pricing Rates

GET /api/v1/pricing/rates

Returns the current credit rate constants used for cost calculation.

Authentication: None required.

Response:

{
  "csv": {
    "rows_per_credit": 50,
    "extra_column_threshold": 10,
    "extra_cells_per_credit": 500
  },
  "pdf_typed": {
    "base_credits": 20,
    "field_threshold": 25,
    "surcharge_per_band": 2,
    "band_size": 25
  },
  "pdf_handwritten": {
    "base_credits": 40,
    "field_threshold": 25,
    "surcharge_per_band": 4,
    "band_size": 25
  },
  "pdf_filled": {
    "base_credits": 12,
    "field_threshold": 25,
    "surcharge_per_band": 1.2,
    "band_size": 25
  },
  "degradation_multipliers": { ... },
  "rounding": {"free_threshold": 0, "minimum_credits": 1, "method": "ceil"},
  "bundling": "PDF output includes CSV/JSON free"
}

Estimate Credit Cost

POST /api/v1/pricing/estimate

Estimates the credit cost for a job without creating it.

Authentication: None required.

Request body:

Field	Type	Required	Default	Description
`field_count`	integer	yes	—	Number of fields/columns on the form
`output_formats`	string[]	yes	—	Output formats (e.g., `["pdf_typed"]`)
`record_count`	integer	yes	—	Number of records/rows to generate
`degradation_profile`	string	no	null	Degradation profile to apply (see valid values below)

When provided, degradation_profile must be one of clean, scanned, faxed, photographed, or mixed. Any other value returns 400 code=invalid_degradation_profile.

Response:

{
  "credits": 2200
}

Bundle Size Estimation

These endpoints require no authentication and can be used to preview the compressed download size of a job before submitting it. The estimates are calibrated against real staging bundles (see calibration_sample_count and calibrated_at in the response). Expected accuracy is +/- 30% for 80% of jobs.

Get Bundle-Size Rate Constants

GET /api/v1/jobs/size-rates

Returns the calibrated per-format byte constants and tier thresholds used by the size estimator.

Authentication: None required.

Response:

{
  "rates": {
    "pdf_typed_bytes_per_file": 518426,
    "pdf_handwritten_bytes_per_file": 2240759,
    "pdf_filled_bytes_per_file": 518426,
    "png_typed_bytes_per_file": 1549775,
    "png_handwritten_bytes_per_file": 549089,
    "funsd_bytes_per_file": 5093,
    "bio_bytes_per_instance": 43661,
    "yolo_bytes_per_file": 1960,
    "coco_bytes_per_surface": 55508,
    "donut_bytes_per_row": 2703,
    "ground_truth_json_bytes_per_file": 1984,
    "ground_truth_csv_bytes_per_job": 4698,
    "bundle_overhead_bytes_per_job": 2757,
    "funsd_emit_rate": 0.858,
    "bio_emit_rate": 1.0
  },
  "tiers": {
    "small_max_bytes": 104857600,
    "medium_max_bytes": 1073741824
  },
  "render_dpi": {
    "clean": 300,
    "degraded": 200
  },
  "expected_accuracy": "+/- 30%",
  "calibration_sample_count": 296,
  "calibrated_at": "2026-05-08"
}

The render_dpi block reports the source-of-truth raster DPI the bundle ships at for each condition: clean renders are rasterized at a higher DPI than degraded ones (degradation noise and warp inflate the payload, so a lower DPI is used).

Estimate Bundle Download Size

POST /api/v1/jobs/size-estimate

Estimates the compressed zip download size for a job based on its forms, page counts, output formats, and record count.

Authentication: None required.

Request body:

Field	Type	Required	Default	Description
`field_counts`	integer[]	yes	—	Number of fields on each form (one entry per form)
`page_counts`	integer[]	yes	—	Number of pages on each form (must match `field_counts` length)
`output_formats`	string[]	no	`[]`	Requested output format tokens (e.g., `["pdf_typed", "bio"]`)
`record_count`	integer	yes	—	Number of identities to generate (>= 0)
`degradation_profile`	string	no	null	Optional degradation profile (`clean`, `scanned`, `faxed`, `photographed`, `mixed`). Degraded renders are materially larger, so supplying the profile yields a more accurate estimate; omit it for the clean-path estimate.

field_counts and page_counts must have the same length (>= 1). A length mismatch returns 422.

Unknown or rejected format tokens (e.g. "pdf_unicorn") return 400 with code=invalid_output_format.

Response:

{
  "total_bytes": 5255252,
  "human_readable": "~5.0 MB",
  "tier": "small",
  "breakdown": {
    "pdf_typed": 5184260,
    "pdf_handwritten": 0,
    "pdf_filled": 0,
    "png_typed": 0,
    "png_handwritten": 0,
    "funsd": 43697,
    "bio": 0,
    "yolo": 0,
    "coco": 0,
    "donut": 0,
    "ground_truth": 19840,
    "overhead": 7455
  },
  "expected_accuracy": "+/- 30%",
  "calibration_sample_count": 296,
  "calibrated_at": "2026-05-08",
  "notes": []
}

Tier values:

Tier	Range
`small`	< 100 MB
`medium`	100 MB to < 1 GB
`large`	>= 1 GB
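The tier can also be derived client-side from total_bytes. The sketch below hardcodes the two thresholds from the sample size-rates response, which correspond to binary units (100 × 1024² bytes and 1024³ bytes). The function name size_tier is hypothetical, and treating a value exactly equal to a threshold as the next tier up follows the table above rather than any documented server rule.

```python
SMALL_MAX_BYTES = 104_857_600      # tiers.small_max_bytes in the sample response
MEDIUM_MAX_BYTES = 1_073_741_824   # tiers.medium_max_bytes in the sample response


def size_tier(total_bytes: int) -> str:
    """Classify an estimated bundle size using the thresholds above."""
    if total_bytes < SMALL_MAX_BYTES:
        return "small"
    if total_bytes < MEDIUM_MAX_BYTES:
        return "medium"
    return "large"


print(size_tier(5_255_252))  # small
```

In practice the thresholds are better read from GET /api/v1/jobs/size-rates than hardcoded, since they may be recalibrated.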

Notes field: When record_count is 0, notes includes "record_count = 0; size shown is per-job overhead floor." and total_bytes reflects only the fixed per-job overhead.

FUNSD behavior: FUNSD ground-truth files are always included in the dataset zip when any render format (pdf_typed, pdf_handwritten, png_typed, png_handwritten) is requested. The funsd breakdown entry will be non-zero whenever a render format is present, regardless of whether funsd appears in output_formats.

Health Check

GET /api/v1/health

Lightweight liveness probe for external monitoring (ALB, Pingdom, StatusCake). No authentication required.

Response:

{
  "status": "healthy",
  "git_sha": "a1b2c3d"
}

git_sha is the commit the live backend image was built from ("dev" when unset locally). Use it to confirm which backend image is serving traffic after a deploy.

API Key Management

Manage API keys programmatically. All key management endpoints use session authentication (x-user-id header from the web UI), not Bearer token auth.

Create an API Key

POST /api/internal/api-keys

Creates a new API key. The raw key is returned once — store it securely.

Request body:

Field	Type	Required	Default	Description
`name`	string	yes	—	Key label (1–200 chars)
`scopes`	string[]	no	`["generate", "read", "account"]`	Scopes for this key

Response:

{
  "key_id": "abc-123",
  "api_key": "sk_live_a1b2c3d4e5f6...",
  "prefix": "sk_live_a1b2c3d4",
  "name": "My Key",
  "scopes": ["generate", "read", "account"],
  "created_at": "2026-03-26T12:00:00+00:00"
}

List API Keys

GET /api/internal/api-keys

Returns all API keys for the current user. Full keys are never included — only the prefix for identification.

Response fields per key:

Field	Type	Description
`key_id`	string	Key identifier
`prefix`	string	First 16 characters of the key
`name`	string	Key label
`scopes`	string[]	Granted scopes
`tier`	string	Rate limit tier
`last_used_at`	string	Last usage timestamp (null if never used)
`created_at`	string	Creation timestamp
`is_active`	boolean	Whether the key is active

Revoke an API Key

DELETE /api/internal/api-keys/{key_id}

Permanently disables a key. Takes effect immediately and cannot be undone.

Rotate an API Key

POST /api/internal/api-keys/{key_id}/rotate

Creates a new key and revokes the old one atomically. The new key inherits the old key's name and scopes.

Response:

{
  "old_key_id": "abc-123",
  "new_key": {
    "key_id": "def-456",
    "api_key": "sk_live_x9y8z7w6...",
    "prefix": "sk_live_x9y8z7w6",
    "name": "My Key",
    "scopes": ["generate", "read", "account"],
    "created_at": "2026-04-01T12:00:00+00:00"
  }
}

Training Data (ML)

/generate jobs can emit labeled training data for document AI models. Pass label_scheme inside the request config, request the ML output formats you need (bio, coco, donut, yolo) alongside a render format (pdf_typed/pdf_handwritten), and pull per-item annotations from the /jobs/{id}/items surface.

Label schemes:

Scheme	Description
`field_id`	Labels use form field IDs (e.g., `employee_ssn`)
`semantic_concept`	Labels use semantic concepts (e.g., `social_security_number`)
`nist3`	Labels use NIST Type-3 categories
`field_type`	Labels use field types (e.g., `ssn`, `currency`, `text`)

Workflow:

1. POST /generate  {form_id, quantity, output_formats: ["pdf_typed", "bio", "coco"],
                    config: {label_scheme: "nist3"}}
2. GET  /jobs/{job_id}                                → poll until status = "completed"
3. GET  /jobs/{job_id}/items                          → enumerate items (paginated)
4. GET  /jobs/{job_id}/items/{item_id}/download       → presigned URLs (PDF, JSON, COCO, BIO)
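The request body for step 1 can be assembled and sanity-checked before sending. The sketch below is illustrative: the helper name build_training_request is hypothetical, and it treats the "alongside a render format" guidance as a hard client-side rule, which is an assumption. The server remains the authority on what it accepts.

```python
import json

LABEL_SCHEMES = {"field_id", "semantic_concept", "nist3", "field_type"}
ML_FORMATS = {"bio", "coco", "donut", "yolo"}
RENDER_FORMATS = {"pdf_typed", "pdf_handwritten"}


def build_training_request(form_id, quantity, label_scheme, formats):
    """Build the JSON body for POST /generate with a label scheme in config."""
    if label_scheme not in LABEL_SCHEMES:
        raise ValueError(f"unknown label_scheme: {label_scheme!r}")
    requested = set(formats)
    if requested & ML_FORMATS and not requested & RENDER_FORMATS:
        raise ValueError("ML formats must be requested alongside a render format")
    return {
        "form_id": form_id,
        "quantity": quantity,
        "output_formats": list(formats),
        "config": {"label_scheme": label_scheme},
    }


body = build_training_request("irs_w2_2025", 10, "nist3", ["pdf_typed", "bio", "coco"])
print(json.dumps(body, indent=2))
```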

For a complete working example, see the Python SDK's examples/train_kie_model.py, which demonstrates this workflow end to end: creating a job with NIST3 labels and BIO output, iterating training examples via iter_training_examples, and fetching word annotations. The SDK README's Training data section documents the same surface from the client side.

Anti-pattern: list-valued fields in semantic_concept scheme. When two or more form fields share the same semantic_concept without being a checkbox selection group (see below), the Donut exporter collapses their values into a JSON array, and the BIO exporter emits the same B/I tag pair for every token — destroying positional information. This is an anti-pattern. If a form has multiple fixed-position slots for what is conceptually the same data (e.g., the four W-2 Box 12 rows), each slot must have its own per-slot concept (e.g., tax.box12a.code, tax.box12b.code, …).

Checkbox selection groups. Radio-style checkbox groups (one concept, several option boxes — e.g., the CMS-1500 insurance type with seven boxes, or patient sex M/F) are the deliberate exception to the rule above. Each option box carries the shared concept plus an option name, and the exporters emit the selection rather than per-box values. In Donut, the concept's leaf holds the selected option's name as a string ("medicare"), an empty string when no box is marked, or a list of names in field order when several boxes are marked (check-all-that-apply). Plain yes/no question pairs still emit a single JSON boolean. In BIO, each option box's tag carries the option name as a suffix (B-insurance.type/medicare), the same scheme used for /yes and /no on binary pairs.

Donut consumer conventions

Each row in metadata.jsonl has a ground_truth field containing a JSON string of the form {"gt_parse": {...}}. The consumer — typically a HuggingFace Donut training script — is responsible for serializing gt_parse into Donut's tagged-token sequence. Two conventions apply.

Nested list values. When multiple form fields resolve to the same leaf path in the gt_parse tree (e.g., per-slot W-2 Box 12 amounts that share a parent path), DonutExporter collects their values into a JSON array at that leaf. The canonical Donut recursive flattener (used in clovaai/donut's token2json round-trip) emits one sibling tagged block per list element:

// gt_parse input (list-valued leaf)
{ "box12": { "amount": ["1200.00", "450.00"] } }

// Donut token sequence (sibling blocks per element)
<s_box12><s_amount>1200.00</s_amount><s_amount>450.00</s_amount></s_box12>

This is correct Donut convention. A consumer should not assume scalar values at every leaf.

Task-prefix wrapping token. Donut requires a per-task start token (e.g., <s_w2>, <s_invoice>) wrapping every gt_parse sequence. SymageDocs does not emit this token; metadata.jsonl rows contain only the bare gt_parse object. The consumer must prepend the task token before serializing the sequence. Derive it from the request's form_id (e.g., irs_w2_2025 → <s_irs_w2_2025>, or <s_w2> for a coarser scheme). The token must also be added to the tokenizer's additional_special_tokens, and the model's token embeddings resized to match, as in standard Donut fine-tuning setup.
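Both conventions can be illustrated with a small serializer. This is a sketch of the behavior described in this section, not a copy of the clovaai/donut code, so compare it against the flattener your training script actually uses. The function names to_donut_sequence and row_to_sequence are hypothetical, and the task token is chosen by the consumer.

```python
import json


def to_donut_sequence(value, key=None):
    """Serialize a gt_parse value, emitting one sibling block per list element."""
    if isinstance(value, dict):
        inner = "".join(to_donut_sequence(v, k) for k, v in value.items())
        return inner if key is None else f"<s_{key}>{inner}</s_{key}>"
    if isinstance(value, list):
        # Each element gets its own block, tagged with the parent key.
        return "".join(to_donut_sequence(item, key) for item in value)
    return str(value) if key is None else f"<s_{key}>{value}</s_{key}>"


def row_to_sequence(row, task_token):
    """Prepend the consumer-chosen task token to one metadata.jsonl row."""
    gt_parse = json.loads(row["ground_truth"])["gt_parse"]
    return task_token + to_donut_sequence(gt_parse)


row = {"ground_truth": json.dumps({"gt_parse": {"box12": {"amount": ["1200.00", "450.00"]}}})}
print(row_to_sequence(row, "<s_irs_w2_2025>"))
# <s_irs_w2_2025><s_box12><s_amount>1200.00</s_amount><s_amount>450.00</s_amount></s_box12>
```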

Credit Pricing

Credits are charged when a generation job is created. The cost depends on the form's field count, the output format, and the quantity.

Pricing Formulas

Typed PDF (standard):

cost_per_document = 20 + ceil(max(field_count - 25, 0) / 25) * 2
total_credits = cost_per_document * quantity

Handwritten PDF:

cost_per_document = 40 + ceil(max(field_count - 25, 0) / 25) * 4
total_credits = cost_per_document * quantity

Filled PDF (pdf_filled, fillable forms only):

Priced in a tier between typed and CSV at 60% of the typed rate. Values are delivered in live, editable AcroForm widgets with no rasterization.

cost_per_document = 12 + ceil(max(field_count - 25, 0) / 25) * 1.2
total_credits = ceil(cost_per_document * quantity)

When a job also requests pdf_typed or pdf_handwritten, the higher tier is billed and the filled artifact rides free. This is the same bundling rule that includes CSV/JSON with any PDF.

JSON and CSV ground truth: Every dataset zip automatically includes ground_truth/json/ (per-document structured JSON) and ground_truth/data.csv (aggregated tabular CSV) at no additional cost. These are not selectable in output_formats — passing "json" or "csv" there returns 400 code=invalid_output_format.

Examples

Form	Fields	Quantity	Formats	Credits
W-2	20	100	pdf_typed	2,000
1040	50	100	pdf_typed	2,200
1040	50	10	pdf_typed	220
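The three PDF formulas can be reproduced with a short function. This is a sketch of the arithmetic stated above only: it ignores degradation multipliers and bundling, and the name pdf_credits is hypothetical. Rates are held in tenths of a credit so the 1.2-credit surcharge for pdf_filled stays in exact integer arithmetic.

```python
# (base, surcharge per 25-field band) in tenths of a credit, from the formulas above
RATES_TENTHS = {
    "pdf_typed": (200, 20),
    "pdf_handwritten": (400, 40),
    "pdf_filled": (120, 12),
}


def pdf_credits(output_format: str, field_count: int, quantity: int) -> int:
    """Total credits for one PDF format, rounded up to a whole credit."""
    base_tenths, band_tenths = RATES_TENTHS[output_format]
    bands = -(-max(field_count - 25, 0) // 25)  # ceiling division
    total_tenths = (base_tenths + bands * band_tenths) * quantity
    return -(-total_tenths // 10)  # round up to whole credits


print(pdf_credits("pdf_typed", 20, 100))  # 2000
print(pdf_credits("pdf_typed", 50, 100))  # 2200
print(pdf_credits("pdf_typed", 50, 10))   # 220
```

For quotes that include degradation profiles, POST /api/v1/pricing/estimate is the better source.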

Tabular data: Base rate is 1 credit per 50 rows. A surcharge of 1 credit per 500 extra cells applies when columns exceed 10 (extra cells = rows × (columns − 10)). Final cost = ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500)).

Data	Columns	Rows	Credits
Tabular	5	1,000	20
Tabular	20	1,000	40
Tabular	10	5,000	100
Tabular	50	10,000	1,000

Use the form catalog's credit_cost field to see the per-document cost for each form at default (typed PDF) output.

Common Workflows

Batch Document Generation

# 1. Find the form you need
curl "https://symagedocs.ai/api/v1/forms?category=tax" \
  -H "Authorization: Bearer $API_KEY" | jq '.data[] | {id, name, credit_cost}'

# 2. Create a generation job
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 100, "output_formats": ["pdf_typed"]}' \
  | jq -r '.data.job_id')

echo "Job created: $JOB_ID"

# 3. Poll until complete
while true; do
  STATUS=$(curl -s https://symagedocs.ai/api/v1/jobs/$JOB_ID \
    -H "Authorization: Bearer $API_KEY" | jq -r '.data.status')
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && echo "Job failed!" && exit 1
  sleep 5
done

# 4. Download results — every artifact in one zip
curl https://symagedocs.ai/api/v1/jobs/$JOB_ID/download/dataset \
  -H "Authorization: Bearer $API_KEY" -o w2_dataset.zip

# Or just the structured data
curl https://symagedocs.ai/api/v1/jobs/$JOB_ID/download/json \
  -H "Authorization: Bearer $API_KEY" -o w2_data.json

Tabular Data Generation

# 1. Parse a natural language description into a schema
SCHEMA=$(curl -s -X POST https://symagedocs.ai/api/v1/tabular/parse \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "employee name, department, hire date, salary, city"}' \
  | jq '.data.columns')

echo "Parsed schema: $SCHEMA"

# 2. Generate 5,000 rows from the schema
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/tabular/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"columns\": $SCHEMA, \"quantity\": 5000, \"output_formats\": [\"csv\"]}" \
  | jq -r '.data.job_id')

echo "Job created: $JOB_ID"

# 3. Poll until complete
while true; do
  STATUS=$(curl -s https://symagedocs.ai/api/v1/tabular/$JOB_ID/status \
    -H "Authorization: Bearer $API_KEY" | jq -r '.data.status')
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && echo "Job failed!" && exit 1
  sleep 2
done

# 4. Download CSV
curl https://symagedocs.ai/api/v1/tabular/$JOB_ID/download/csv \
  -H "Authorization: Bearer $API_KEY" -o employees.csv

Event-Driven Generation (Webhooks)

Instead of polling for job completion, register a webhook and let SymageDocs notify you:

# 1. Register a webhook (once, via the web UI or the /api/internal/webhooks endpoint)
curl -X POST https://symagedocs.ai/api/internal/webhooks \
  -H "x-user-id: YOUR_USER_ID" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-server.com/webhooks/symagedocs", "events": ["job.completed", "job.failed"]}'

# Save the signing_secret from the response!

# 2. Generate documents — no need to poll
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 100, "output_formats": ["pdf_typed"]}' \
  | jq -r '.data.job_id')

echo "Job $JOB_ID created — webhook will notify when done"

# 3. Your server receives a POST to /webhooks/symagedocs with:
#    {"event": "job.completed", "data": {"job_id": "...", "status": "completed", ...}}
#    Verify the X-Symagedocs-Signature header, then download results.

Reproducible Generation

Use the seed parameter to generate identical output across runs:

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 10, "seed": 12345}'

Running this request again with the same seed and quantity produces the same documents.

Key Rotation

Rotate keys periodically for security. The old key stops working immediately and the new key takes over:

# Rotate via the web UI, or programmatically:
curl -X POST https://symagedocs.ai/api/internal/api-keys/OLD_KEY_ID/rotate \
  -H "x-user-id: YOUR_USER_ID"

Troubleshooting

Symptom	Cause	Fix
`401 Invalid API key`	Key doesn't exist or was revoked	Create a new key in the web UI
`403 Missing required scopes`	Key lacks the scope for this endpoint	Create a new key with the needed scopes
`404 Job not found`	Job doesn't exist or belongs to another user	Check the job ID; keys can only access their own jobs
`409 Job is not completed`	Trying to download before generation finishes	Poll `GET /jobs/{id}` until status is `completed`
`429 Rate limit exceeded`	Exceeded your tier's per-key request rate	Back off using the `X-RateLimit-Remaining` header and retry
