SymageDocs API User Manual

The SymageDocs API lets you programmatically generate synthetic documents and identities. This manual covers authentication, endpoints, billing, and common workflows.

Python users: Consider the Python SDK (pip install symagedocs) instead of raw HTTP. It handles authentication, retries, pagination, and polling automatically.

Quick Reference

Base URL: https://symagedocs.ai/api/v1
Auth: Authorization: Bearer sk_live_YOUR_KEY
Rate limit: Per-key, tier-based sustained rate (free: 10 req/s; pro: 30; scale: 100; enterprise: 200) with a 3× short burst

Endpoints at a Glance

These are the public Bearer-authenticated endpoints under /api/v1. Webhook and API-key management are session-authenticated and live under /api/internal/... — see Webhooks and API Key Management.

Method | Endpoint | Scope | Description
GET | /forms | read | List available forms (filter: ?category=tax)
GET | /forms/{id} | read | Form details with field definitions
POST | /generate | generate | Create an async generation job
GET | /jobs | read | List your jobs (paginated: ?limit=50&cursor=...)
GET | /jobs/{id} | read | Job status and progress
GET | /jobs/{id}/download/{fmt} | read | Download output (bundle, csv, or json)
GET | /jobs/{id}/downloads | read | Presigned download URLs for all job output files
GET | /jobs/{id}/items | read | List per-item records for a job
GET | /jobs/{id}/items/{item_id}/download | read | Presigned URLs for one item's files
POST | /jobs/{id}/cancel | generate | Cancel a running job; refund unspent credits
POST | /identities | generate | Generate raw identities (JSON only, no forms)
GET | /account/balance | account | Credit usage totals
GET | /account/usage | account | Usage summary (filter: ?days=30)
POST | /tabular/parse | generate | NL description to column schema (LLM-powered)
POST | /tabular/generate | generate | Generate tabular data from column schema
GET | /tabular/{id}/status | read | Tabular job progress and ETA
GET | /tabular/{id}/download/{fmt} | read | Download tabular output (csv or json)
GET | /pricing/rates | - | Get current credit rate constants (no auth)
POST | /pricing/estimate | - | Estimate credit cost for a job (no auth)
GET | /jobs/size-rates | - | Get calibrated bundle-size rate constants (no auth)
POST | /jobs/size-estimate | - | Estimate compressed bundle download size (no auth)
GET | /health | - | Liveness probe (no auth)

Typical Workflow (Jobs)

1. GET  /forms                          → pick a form_id
2. POST /generate  {form_id, quantity}  → get job_id
3. GET  /jobs/{job_id}                  → poll until status = "completed"
4. GET  /jobs/{job_id}/download/bundle  → download all artifacts as a single zip
   — or —
   GET  /jobs/{job_id}/downloads        → get presigned URLs for direct S3 download
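
The polling step of this workflow can be sketched in Python as follows. This is a minimal sketch, not the SDK's implementation: fetch_status is a hypothetical caller-supplied function that would wrap GET /jobs/{job_id} and return the data object, and the interval, timeout, and the choice to treat failed as terminal are assumptions rather than documented defaults.

```python
import time
from typing import Callable

# Statuses after which polling stops. Both values come from the job status
# field; treating "failed" as terminal is an assumption of this sketch.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reaches a terminal status.

    fetch_status is a hypothetical callable that performs GET /jobs/{job_id}
    and returns the "data" object of the response envelope.
    """
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.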

Typical Workflow (Per-item training data)

1. GET  /forms                                 → pick a form_id
2. POST /generate  {form_id, quantity, output_formats: ["pdf_typed", "bio"]}
   ↑ optional: send `Idempotency-Key: <key>` header for retry-safety
3. GET  /jobs/{job_id}                         → poll until status = "completed"
4. GET  /jobs/{job_id}/items                   → enumerate items (paginated)
5. GET  /jobs/{job_id}/items/{id}/download     → presigned URLs per item

Typical Tabular Workflow

1. POST /tabular/parse  {prompt}           → get column schema
2. POST /tabular/generate  {columns, qty}  → get job_id (status = "processing")
3. GET  /tabular/{job_id}/status           → poll until status = "completed"
4. GET  /tabular/{job_id}/download/csv     → download results

# Tabular jobs use status = "processing" while in flight (not "pending"/"generating",
# which apply only to /jobs).

Key Scopes

Scope | What it allows
generate | Create jobs, generate identities, tabular parse/generate
read | List forms, view jobs, download output, tabular status/download
account | View balance and usage

Credit Cost (per document)

Output | Formula
Typed PDF | 20 + ceil(max(fields - 25, 0) / 25) * 2
Handwritten PDF | 40 + ceil(max(fields - 25, 0) / 25) * 4
Tabular (CSV/JSON) | ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500))

JSON and CSV are always free. Every bundle includes ground_truth/json/ and ground_truth/data.csv at no extra cost. They are not selectable in output_formats.
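
As a worked illustration of the two PDF formulas in the table above, the sketch below evaluates them for a few field counts. The function name is made up for this example; only the constants (base 20 or 40, threshold 25, band size 25, surcharge 2 or 4) come from the table.

```python
import math


def pdf_credits_per_document(field_count: int, handwritten: bool = False) -> int:
    """Credits for one PDF, following the per-document formulas in the table."""
    base, per_band = (40, 4) if handwritten else (20, 2)
    extra_fields = max(field_count - 25, 0)   # fields beyond the first 25
    bands = math.ceil(extra_fields / 25)      # each started band of 25 adds a surcharge
    return base + bands * per_band


print(pdf_credits_per_document(25))                    # 20
print(pdf_credits_per_document(60))                    # 20 + ceil(35/25) * 2 = 24
print(pdf_credits_per_document(60, handwritten=True))  # 40 + ceil(35/25) * 4 = 48
```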

Response Envelope

{"data": ..., "meta": {"request_id": "uuid", ...}}

Common Errors

Code | Meaning
401 | Invalid or revoked API key
402 | Insufficient credits
403 | Key missing required scope
404 | Resource not found
409 | Job not yet completed (for downloads)
429 | Rate limit exceeded

Quick Start

1. Create an API Key

Log in to the SymageDocs web application and navigate to Account > API Keys. Click Create Key, give it a name, and copy the key immediately — it will not be shown again.

Your key looks like: sk_live_a1b2c3d4e5f6... (56 characters total).

2. Make Your First Request

curl https://symagedocs.ai/api/v1/forms \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE"

3. Generate a Document

# Create a generation job
curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "form_id": "irs_w2_2025",
    "quantity": 10,
    "output_formats": ["pdf_typed"]
  }'

# Poll for completion (replace JOB_ID with the job_id from above)
curl https://symagedocs.ai/api/v1/jobs/JOB_ID \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE"

# Download results when status is "completed" — every artifact in one zip
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/bundle \
  -H "Authorization: Bearer sk_live_YOUR_KEY_HERE" \
  -o output.zip

Authentication

Every API request must include your key in the Authorization header:

Authorization: Bearer sk_live_YOUR_KEY_HERE

No username or password is needed — your identity is embedded in the key itself.

API Key Scopes

Keys can be created with specific scopes to limit what they can do:

Scope | Grants access to
generate | Create generation jobs, generate raw identities
read | List forms, check job status, download outputs
account | View credit balance and usage statistics

By default, new keys have all three scopes. You can create restricted keys (e.g., a read-only key for monitoring) through the web UI or API.

Key Management

You manage keys through the web UI at Account > API Keys, or programmatically:

Action | Description
Create | Generates a new key. The raw key is shown once; copy it immediately.
List | Shows all your keys with their prefix, name, scopes, and last-used date. Full keys are never shown.
Revoke | Permanently disables a key. Takes effect immediately. Cannot be undone.
Rotate | Creates a new key and revokes the old one in a single operation. The new key inherits the old key's name and scopes.

Rate Limiting

The API enforces a per-key rate limit determined by the key's tier. The sustained rate is the token-bucket refill rate; a fresh client can issue up to 3× the sustained rate as an initial burst before throttling kicks in.

Tier | Sustained | Burst
Free | 10 requests/second | 30 requests
Pro | 30 requests/second | 90 requests
Scale | 100 requests/second | 300 requests
Enterprise | 200 requests/second | 600 requests
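
To make the sustained-versus-burst distinction concrete, here is a minimal client-side token bucket in Python. It is a sketch of the general technique, useful for throttling your own requests; how the server implements its bucket is not documented here beyond the rate and burst figures, and the class and method names are invented for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side model: refills at `rate` tokens/second, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # a fresh client starts with a full burst
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Free tier from the table: 10 requests/second sustained, burst of 30.
bucket = TokenBucket(rate=10, capacity=30)
```

Calling bucket.try_acquire() before each request and sleeping briefly when it returns False keeps a client at or under its tier's sustained rate.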

If you exceed your tier's limit, the API returns 429 Too Many Requests with the following response headers so clients can back off intelligently:

Header | Meaning
Retry-After | Seconds to wait before retrying (rounded up to at least 1)
X-RateLimit-Limit | The configured sustained rate for your tier (requests/second)
X-RateLimit-Remaining | Remaining budget within the current window
X-RateLimit-Reset | Unix timestamp (seconds) at which the bucket is fully refilled
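
One way a client might act on these headers after a 429 is sketched below. The function name, the 60-second cap, and the jittered exponential fallback for a missing or unparseable header are assumptions of this example, not documented behavior. The lookup also assumes the headers mapping uses the exact casing shown (many HTTP libraries provide case-insensitive mappings).

```python
import random
from typing import Mapping


def retry_delay_seconds(headers: Mapping[str, str], attempt: int,
                        cap_s: float = 60.0) -> float:
    """Seconds to wait after a 429 response.

    Prefers the server's Retry-After value. If it is missing or not a number,
    falls back to capped exponential backoff with jitter (an assumption).
    """
    raw = headers.get("Retry-After")
    if raw is not None:
        try:
            return max(float(raw), 1.0)   # the API documents a minimum of 1 second
        except ValueError:
            pass
    backoff = min(cap_s, 2.0 ** attempt)
    return backoff * random.uniform(0.5, 1.0)
```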

Response Format

Success Responses

All endpoints return a consistent envelope:

{
  "data": { ... },
  "meta": {
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "count": 5,
    ...
  }
}

  • data contains the response payload (object or array depending on the endpoint).
  • meta contains request metadata. request_id is always present and useful for support inquiries.

Error Responses

Errors follow a standard format:

{
  "detail": {
    "code": "unknown_form",
    "message": "Unknown form: irs_w99_2024",
    "status": 400
  }
}
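
A client can separate the two body shapes with a small helper like the one below. This is a sketch under the assumption that error bodies always carry a detail object with code, message, and status as shown; ApiError and unwrap are names invented for this example.

```python
class ApiError(Exception):
    """Raised for bodies in the error format ({"detail": {...}})."""

    def __init__(self, code: str, message: str, status: int):
        super().__init__(f"{status} {code}: {message}")
        self.code = code
        self.message = message
        self.status = status


def unwrap(body: dict):
    """Return (data, meta) from a success envelope, or raise ApiError."""
    detail = body.get("detail")
    if isinstance(detail, dict):
        raise ApiError(
            detail.get("code", "unknown"),
            detail.get("message", ""),
            detail.get("status", 0),
        )
    return body["data"], body.get("meta", {})
```

Branching on the machine-readable code (for example unknown_form) rather than the message text keeps error handling stable if wording changes.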

HTTP Status Codes

Code | Meaning
200 | Success
400 | Bad request (invalid parameters)
401 | Authentication failed (invalid or revoked key)
402 | Insufficient credits
403 | Forbidden (missing required scope)
404 | Resource not found
409 | Conflict (e.g., downloading from an incomplete job)
429 | Rate limit exceeded

Endpoints

Forms Catalog

List Forms

GET /api/v1/forms
GET /api/v1/forms?category=tax

Returns all available forms. Optionally filter by category.

Scope required: read

Response fields per form:

Field | Type | Description
id | string | Unique form identifier (use this in generation requests)
name | string | Human-readable form name
category | string | Form category (e.g., tax, healthcare, business)
year | integer | Tax year or form version year
family | string | Form family (e.g., 1040, w2)
entity_type | string | Entity type (individual, business, healthcare)
credit_cost | integer | Credits per document (at default PDF output)
field_count | integer | Number of fields on the form

Example:

curl "https://symagedocs.ai/api/v1/forms?category=tax" \
  -H "Authorization: Bearer sk_live_YOUR_KEY"

Get Form Details

GET /api/v1/forms/{form_id}

Returns full details for a specific form, including all field definitions.

Scope required: read

Additional response fields:

Field | Type | Description
version | string | Form definition version
page_count | integer | Number of pages in the form
fields | array | List of field definitions
fields[].id | string | Field identifier
fields[].label | string | Human-readable field label
fields[].type | string | Field type (e.g., text, enum, ssn, currency)
fields[].page | integer | Page number the field appears on

Example:

curl https://symagedocs.ai/api/v1/forms/irs_w2_2025 \
  -H "Authorization: Bearer sk_live_YOUR_KEY"

Generation

Create a Generation Job

POST /api/v1/generate

Creates an asynchronous generation job. The job runs in the background — poll the job status endpoint to track progress.

Scope required: generate

Request body:

Provide exactly one of form_id or form_ids (not both).

Field | Type | Required | Default | Description
form_id | string | conditional | - | Single form to generate. Mutually exclusive with form_ids.
form_ids | string[] | conditional | - | List of form IDs to generate coherently for the same identity (e.g., W-2 + 1040 with matching wages, SSN, and name). Mutually exclusive with form_id.
quantity | integer | no | 1 | Number of identities to generate (1–10,000,000). Each identity produces one document per form. Large jobs are routed to the bulk worker and may take minutes-to-hours; poll /jobs/{job_id} for status.
output_formats | string[] | no | ["pdf_typed"] | Output formats (see below)
config | object | no | null | Generation configuration overrides
seed | integer | no | null | Seed for reproducible output
ink_color | string | no | "blue" | Ink color for handwritten output: black, blue, or red
ink_color_distribution | object | no | null | Distribution of ink colors across documents. Keys are color names, values are integer weights that must sum to 100. Example: {"blue": 60, "black": 30, "red": 10}. Overrides ink_color when provided.
writer_consistency | string | no | "per_document" | Writer consistency: per_document or per_field
realism_level | string | no | "high" | Deprecated. Accepted for backward compatibility; ignored. All handwritten output is rendered at high realism.
webhook_url | string | no | null | URL to receive job.completed/job.failed webhook notifications (max 2,048 chars). See Webhooks for details.
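
The ink_color_distribution rule (known color names, integer weights summing to 100) can be checked client-side before submitting a job. The helper below is a sketch with an invented name; how the API itself reports a violation is not specified in this manual.

```python
ALLOWED_INK_COLORS = {"black", "blue", "red"}


def validate_ink_distribution(distribution: dict) -> None:
    """Raise ValueError unless keys are known colors and integer weights sum to 100."""
    unknown = set(distribution) - ALLOWED_INK_COLORS
    if unknown:
        raise ValueError(f"unknown ink colors: {sorted(unknown)}")
    for weight in distribution.values():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(weight, int) or isinstance(weight, bool):
            raise ValueError("weights must be integers")
    total = sum(distribution.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected 100")


validate_ink_distribution({"blue": 60, "black": 30, "red": 10})  # passes silently
```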

Output formats:

Format | Description
pdf_typed | Filled PDF with typed text
pdf_handwritten | Filled PDF with handwritten-style text (not available for all forms)
png_typed | Per-page PNG rasterizations of pdf_typed (requires pdf_typed)
png_handwritten | Per-page PNG rasterizations of pdf_handwritten (requires pdf_handwritten)

JSON, CSV, and FUNSD are always included. Every bundle automatically contains ground_truth/json/ (per-document structured JSON), ground_truth/data.csv (aggregated tabular CSV), and per-page FUNSD annotations regardless of which formats you request. Do not include "json", "csv", or "funsd" in output_formats — the API will return 400 code=invalid_output_format if you do.

PNG output requires its paired PDF. png_typed must be requested together with pdf_typed, and png_handwritten must be requested together with pdf_handwritten. PNG always travels with its paired PDF; PNG-only output is not a supported configuration. Submitting png_typed without pdf_typed (or png_handwritten without pdf_handwritten) is rejected by the API with 400 code=missing_pdf_dependency. Cross-surface pairs do not satisfy the dependency — e.g. ["png_typed", "pdf_handwritten"] is still rejected because typed PNG needs typed PDF specifically.

ML training formats (gated behind the ml-output-formats-enabled feature flag — requesting any of these without the flag returns 400 code=ml_formats_disabled):

Format | Description
bio | Token classification (B/I/O) labels
yolo | YOLOv8 object detection labels
coco | COCO JSON object detection annotations
donut | Donut OCR-free metadata.jsonl + per-page images

ML formats require a render format. All ML training formats (bio, yolo, coco, donut) derive word-level annotations from the renderer pipeline. You must include at least one render format — pdf_typed, pdf_handwritten, or png_typed — in the same request. Submitting ML formats without a render format returns 400 code=missing_render_dependency. Example: ["bio", "donut"] alone is invalid; use ["pdf_typed", "bio", "donut"].
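
The format rules above can be mirrored in a client-side pre-check, sketched below. The error codes are the ones named in this section; the function name, the order in which rules are checked, and the ml_enabled parameter (standing in for the ml-output-formats-enabled flag) are assumptions, and the server remains the authority on what it accepts.

```python
from typing import Optional, Sequence

ALWAYS_INCLUDED = {"json", "csv", "funsd"}
PNG_REQUIRES = {"png_typed": "pdf_typed", "png_handwritten": "pdf_handwritten"}
ML_FORMATS = {"bio", "yolo", "coco", "donut"}
# Render formats exactly as listed in the paragraph above.
RENDER_FORMATS = {"pdf_typed", "pdf_handwritten", "png_typed"}


def check_output_formats(formats: Sequence[str],
                         ml_enabled: bool = True) -> Optional[str]:
    """Return the 400 error code the request would be expected to hit, or None."""
    requested = set(formats)
    if requested & ALWAYS_INCLUDED:
        return "invalid_output_format"
    for png, pdf in PNG_REQUIRES.items():
        if png in requested and pdf not in requested:
            return "missing_pdf_dependency"
    if requested & ML_FORMATS:
        if not ml_enabled:
            return "ml_formats_disabled"
        if not requested & RENDER_FORMATS:
            return "missing_render_dependency"
    return None


print(check_output_formats(["pdf_typed", "bio", "donut"]))    # None
print(check_output_formats(["bio", "donut"]))                 # missing_render_dependency
print(check_output_formats(["png_typed", "pdf_handwritten"])) # missing_pdf_dependency
```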

You can request multiple formats in a single job (e.g., ["pdf_typed", "bio", "yolo"]). Regardless of how many formats you request, the completed job is downloaded as a single bundle zip containing every produced artifact — including JSON and CSV ground truth, which are always present.

Multi-surface jobs (["pdf_typed", "pdf_handwritten"]) run in parallel by default (STAX-1707). Before 2026-05-15, such jobs were forced through a sequential code path as a defensive workaround. Bulk jobs now use the full bulk-worker core count regardless of how many PDF surfaces are requested.

Output bundles separate the two surfaces into pdfs/typed/ and pdfs/handwritten/, plus images/{typed,handwritten}/ and the per-format annotations/<format>/{typed,handwritten}/ trees. A typed and a handwritten render of the same logical document share one identity (see ADR-058 for the architectural commitment).

Response:

{
  "data": {
    "job_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "pending",
    "credits_required": 2200
  },
  "meta": {
    "request_id": "...",
    "credits_charged": 2200
  }
}

Example — single form:

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_id": "irs_w2_2025",
    "quantity": 50,
    "output_formats": ["pdf_typed"],
    "seed": 12345
  }'

Example — multi-form (coherent generation):

Generate W-2 and 1040 documents for the same identities. Each identity's W-2 wages will match the 1040 line 1 wages, and SSN/name/address fields are consistent across both forms.

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_ids": ["irs_w2_2025", "irs_f1040_2024"],
    "quantity": 10,
    "output_formats": ["pdf_typed"],
    "seed": 42
  }'

Note: Calling /generate twice with the same seed and different form_id values does not produce coherent data — the identity biasing differs per form. Use form_ids to get matching documents.

Coherence Mode (ML ablation control)

Multi-form jobs accept a coherence_mode flag inside config that controls how identities and field values relate across the forms of a single item_index:

  • coherent (default) — one identity per item_index; every form in that item is filled from it. Cross-form correlations (same SSN on W-2 and 1040, same wages on W-2 box 1 and 1040 line 1) are preserved. Use this for realistic tax-prep datasets and any production workload.
  • shuffled — identities are still generated one-per-item_index, but each field's final value vector is permuted across items with a seed derived from (seed, form_id, field_id). Marginal distributions and spatial layout stay valid; within-image correlations are broken. Use this for ML ablation experiments that need to measure how much a model leans on cross-field correlations inside one document.
  • random — each (item_index, form_id) pair gets its own independently generated identity, so W-2 and 1040 in the same item will usually carry different SSNs, names, and addresses. Use this as the strongest ablation baseline, or to stress-test downstream code that should not assume identity continuity across forms.

Reproducibility: passing the same (form_ids, quantity, seed, output_formats, coherence_mode) produces byte-identical output for all three modes. The chosen mode is recorded in manifest.json under dataset_info.generation_params.coherence_mode so dataset consumers can branch on it.
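
To illustrate what shuffled mode does to a single field, the sketch below permutes one column of values across items with a seed derived from (seed, form_id, field_id). The SHA-256 derivation, the function name, and the field id "box_1" are assumptions made for illustration; the service's actual derivation is not documented here.

```python
import hashlib
import random


def shuffle_field_across_items(values: list, seed: int,
                               form_id: str, field_id: str) -> list:
    """Permute one field's values across items, deterministically per key.

    The hash-based seed derivation is an assumption of this sketch.
    """
    digest = hashlib.sha256(f"{seed}:{form_id}:{field_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    permuted = list(values)
    rng.shuffle(permuted)
    return permuted


wages = [41000, 52500, 67250, 88000]
once = shuffle_field_across_items(wages, 42, "irs_w2_2025", "box_1")
again = shuffle_field_across_items(wages, 42, "irs_w2_2025", "box_1")
assert once == again                  # same inputs give the same permutation
assert sorted(once) == sorted(wages)  # the set of values (marginal) is unchanged
```

Because each field gets its own permutation, a value such as wages no longer lines up with the SSN or name it was generated with, which is the correlation break the mode is meant to provide.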

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_ids": ["irs_w2_2025", "irs_f1040_2024"],
    "quantity": 100,
    "seed": 42,
    "output_formats": ["pdf_typed"],
    "config": {"coherence_mode": "shuffled"}
  }'

List Jobs

GET /api/v1/jobs
GET /api/v1/jobs?limit=20&status=completed

Returns your generation jobs, newest first. Supports cursor-based pagination.

Scope required: read

Query parameters:

Parameter | Type | Default | Description
limit | integer | 50 | Results per page (1–100)
cursor | string | - | Pagination cursor from a previous response
status | string | - | Filter by status: pending, generating, completed, failed

Pagination:

The response meta includes:

  • has_more: true if more results exist beyond this page
  • next_cursor: pass this as the cursor parameter to get the next page

# First page
curl "https://symagedocs.ai/api/v1/jobs?limit=10" \
  -H "Authorization: Bearer sk_live_YOUR_KEY"

# Next page (using next_cursor from the previous response; "+" is URL-encoded as %2B)
curl "https://symagedocs.ai/api/v1/jobs?limit=10&cursor=2026-03-24T10:00:00%2B00:00" \
  -H "Authorization: Bearer sk_live_YOUR_KEY"
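
Following the cursor until has_more is false can be wrapped in a generator, as sketched below. fetch_page is a hypothetical caller-supplied function that performs GET /jobs?<query> and returns the parsed envelope; the sketch also assumes data is a list of job objects. Building the query with urlencode escapes characters such as "+" in timestamp-style cursors.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def jobs_query(limit: int, cursor: Optional[str] = None) -> str:
    """Build the query string; urlencode escapes '+' and ':' in the cursor."""
    params = {"limit": limit}
    if cursor is not None:
        params["cursor"] = cursor
    return urlencode(params)


def iter_jobs(fetch_page: Callable[[str], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every job by following meta.next_cursor while meta.has_more is true.

    fetch_page is a hypothetical callable that performs GET /jobs?<query>
    and returns the parsed response envelope.
    """
    cursor = None
    while True:
        page = fetch_page(jobs_query(limit, cursor))
        yield from page["data"]
        meta = page.get("meta", {})
        if not meta.get("has_more"):
            return
        cursor = meta["next_cursor"]
```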

Get Job Status

GET /api/v1/jobs/{job_id}

Returns detailed status for a specific job.

Scope required: read

Response fields:

Field | Type | Description
job_id | string | Job identifier
status | string | pending, generating, completed, or failed
progress | float | Completion progress (0.0 to 1.0)
form_id | string | Form used for generation
form_name | string | Human-readable form name
quantity | integer | Number of documents requested
output_formats | string[] | Requested output formats
credits_required | integer | Total credits for this job
credits_charged | integer | Credits actually charged
seed | integer | Seed used (null if random)
error | string | Error message if job failed (null otherwise)
created_at | string | ISO 8601 creation timestamp
completed_at | string | ISO 8601 completion timestamp (null if not done)
generation_time_ms | integer | Total generation time in milliseconds (null if not completed)
worker_count | integer | Number of workers used for this job
peak_workers | integer | Peak concurrent workers during generation
throughput_docs_per_min | float | Generation throughput in documents per minute

Download Job Output

GET /api/v1/jobs/{job_id}/download/{format}

Downloads the generated output in the specified format. The job must be in completed status.

Scope required: read

Path parameters:

Parameter | Values | Description
format | bundle, csv, json | One format per request. Per-format downloads (pdf_typed, pdf_handwritten, coco, bio, yolo, funsd, donut, paperlives, hf_datasets, coco_layout, funsd_layout) have been retired in favor of bundle. Requesting any of those returns 400 code=invalid_format.

Response: Binary file download.

  • bundle returns a single ZIP archive containing every artifact the job produced (PDFs, PNGs, JSON, CSV, and any ML annotations that were requested).
  • json returns the aggregated structured data as JSON.
  • csv returns the aggregated data as a CSV file.

For per-file access (e.g., to fetch a single PDF), use GET /jobs/{job_id}/downloads to obtain presigned S3 URLs.

Example:

# Download every artifact as a single ZIP
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/bundle \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -o output.zip

# Download JSON data
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/json \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -o data.json

Presigned Downloads

Get Job Download URLs

GET /api/v1/jobs/{job_id}/downloads

Returns presigned S3 URLs for all output files of a completed job. Files can be downloaded directly from S3 without proxying through the server.

Scope required: read

Response:

{
  "data": {
    "job_id": "550e8400-...",
    "files": [
      {
        "filename": "irs_w2_2025_abc123.pdf",
        "url": "https://s3.amazonaws.com/...presigned",
        "content_type": "application/pdf",
        "item_index": 0
      }
    ],
    "expires_in": 3600
  }
}

URLs expire after 1 hour. Request new URLs if needed.

Cancel a Job

POST /api/v1/jobs/{job_id}/cancel

Cancels a running job. Items already rendered when the cancel takes effect remain on storage and can be downloaded via the bundle endpoint. Idempotent: cancelling a job that's already CANCELING / CANCELED returns 200 with the current state. Cancelling a terminal-but-not-canceled job (COMPLETED / FAILED / EXPIRED) returns 409.

Scope required: generate

Response: standard envelope with {"job_id", "status"} where status is either canceling (worker is draining its current item) or canceled (job was still PENDING and skipped straight to terminal). The worker observes the cancel flag at the per-item boundary; refunding of unspent credits happens automatically.

Idempotency-Key Header

POST /api/v1/generate accepts an optional Idempotency-Key header. When the same (API key user, key) pair is seen again within 24 hours, the original job_id is returned and credits are not debited a second time. This makes network retries safe.

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Idempotency-Key: my-retry-safe-key-001" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 100}'

The replay response sets meta.idempotent_replay = true and meta.credits_charged = 0.
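
A client-side pattern for this is to mint one key per logical job and reuse it on every retry of that job. The sketch below uses a UUID for the key, which is an assumption (any unique string may be acceptable; the accepted key format is not specified here), and the helper names are invented for this example.

```python
import uuid


def new_idempotency_key() -> str:
    """One key per logical job; reuse the same value on every retry of that job."""
    return str(uuid.uuid4())


def generate_headers(api_key: str, idempotency_key: str) -> dict:
    """Headers for POST /generate with retry-safety enabled."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Idempotency-Key": idempotency_key,
    }


def was_replay(envelope: dict) -> bool:
    """True when the response reports that an earlier job was returned."""
    return bool(envelope.get("meta", {}).get("idempotent_replay"))


key = new_idempotency_key()
first_attempt = generate_headers("sk_live_YOUR_KEY", key)
retry_attempt = generate_headers("sk_live_YOUR_KEY", key)   # same key on retry
assert first_attempt["Idempotency-Key"] == retry_attempt["Idempotency-Key"]
```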

Per-Item Endpoints

Per-item training-data access lives on the /jobs surface:

Endpoint | Description
GET /api/v1/jobs/{job_id}/items | Cursor-paginated list of per-item records. Each item carries presigned download URLs.
GET /api/v1/jobs/{job_id}/items/{item_id}/download | Presigned URLs for a single item's files.

Raw Identities

Generate Identities

POST /api/v1/identities

Generates synthetic identities as raw JSON — no form rendering, no PDF output. Useful for testing, schema exploration, or when you need identity data without filled documents.

Scope required: generate

Request body:

Field | Type | Required | Default | Description
quantity | integer | no | 1 | Number of identities (1–10,000)
config | object | no | null | Generation configuration
seed | integer | no | null | Seed for reproducibility

Response: The standard envelope with data containing an array of identity objects. Each identity has fields like first_name, last_name, ssn, dob, address, employer information, and more.

Example:

curl -X POST https://symagedocs.ai/api/v1/identities \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"quantity": 5, "seed": 42}'

Account

Get Credit Balance

GET /api/v1/account/balance

Returns your credit usage totals.

Scope required: account

Response fields:

Field | Type | Description
credits_used | integer | Total credits charged across all completed jobs
credits_allocated | integer | Total credits reserved across all jobs (including pending)

Get Usage Summary

GET /api/v1/account/usage?days=30

Returns a usage summary for the specified period.

Scope required: account

Query parameters:

Parameter | Type | Default | Description
days | integer | 30 | Number of days to summarize (1–365)

Response fields:

Field | Type | Description
period_days | integer | Period covered
total_jobs | integer | Number of jobs created
total_items_generated | integer | Total documents generated
total_credits_used | integer | Total credits charged
jobs_by_status | object | Job count by status (completed, failed, etc.)

Webhooks

Webhooks let you receive real-time HTTP notifications when events occur — no polling required. Register a URL, and SymageDocs will POST a signed JSON payload whenever a subscribed event fires.

Events

Event | Fires when
job.completed | A generation job finishes successfully (includes presigned download URLs)
job.failed | A generation job fails

Inline Registration via webhook_url

Instead of pre-registering webhooks, pass a webhook_url parameter when creating a job:

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "form_id": "irs_w2_2025",
    "quantity": 10,
    "webhook_url": "https://your-server.com/hooks"
  }'

A temporary webhook subscription is auto-created for job.completed and job.failed.

Register a Webhook

POST /api/internal/webhooks

Auth: Session (web UI) — uses x-user-id header, not API key auth.

Request body:

Field | Type | Required | Default | Description
url | string | yes | - | HTTPS endpoint to receive events (max 2,048 chars)
events | string[] | no | all events | Event types to subscribe to
description | string | no | null | Optional label (max 500 chars)

Response:

{
  "webhook_id": "550e8400-...",
  "signing_secret": "whsec_a1b2c3d4e5f6...",
  "secret_prefix": "whsec_a1b2c3d4",
  "url": "https://example.com/hook",
  "events": ["job.completed", "job.failed"],
  "description": null,
  "created_at": "2026-03-26T12:00:00+00:00"
}

Important: The signing_secret is shown once at creation. Store it securely — you'll need it to verify payload signatures.

List Webhooks

GET /api/internal/webhooks

Returns all active webhooks for the current user. The full signing secret is never included — only the secret_prefix for identification.

Update a Webhook

PATCH /api/internal/webhooks/{webhook_id}

Update a webhook's URL, subscribed events, description, or active status. All fields are optional — only include what you want to change.

Request body:

Field | Type | Description
url | string | New endpoint URL
events | string[] | New event subscriptions
description | string | New description
is_active | boolean | Enable or disable delivery

Disable a Webhook

DELETE /api/internal/webhooks/{webhook_id}

Soft-deletes the webhook. It stops receiving events immediately but remains in the database for audit purposes.

Payload Format

Every webhook delivery is a POST request with a JSON body:

{
  "event": "job.completed",
  "timestamp": "2026-03-30T12:05:00+00:00",
  "webhook_id": "550e8400-...",
  "data": {
    "job_id": "661e9500-...",
    "status": "completed",
    "form_id": "irs_w2_2025",
    "quantity": 100,
    "credits_charged": 2000,
    "completed_at": "2026-03-30T12:05:00+00:00",
    "download_urls": [
      { "filename": "irs_w2_2025_abc123.pdf", "url": "https://s3...presigned" },
      { "filename": "irs_w2_2025_abc123.json", "url": "https://s3...presigned" }
    ]
  }
}

Headers included with every delivery:

Header | Description
X-Symagedocs-Signature | HMAC-SHA256 hex digest of the raw body
X-Symagedocs-Event | Event type (e.g., job.completed)
X-Symagedocs-Delivery | Unique delivery ID for deduplication

Verifying Signatures

Verify the X-Symagedocs-Signature header to ensure the payload came from SymageDocs and wasn't tampered with:

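Python:

import hashlib
import hmac

def verify_webhook(body: bytes, signature: str, secret: str) -> bool:
    # Compute the HMAC-SHA256 hex digest over the raw (unparsed) request body.
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the digest through timing.
    return hmac.compare_digest(expected, signature)

Node.js:

const crypto = require("crypto");

function verifyWebhook(body, signature, secret) {
  const expected = crypto.createHmac("sha256", secret).update(body).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // timingSafeEqual throws on a length mismatch, so reject unequal lengths first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}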

Retry Policy

Failed deliveries (non-2xx response or network error) are retried up to 3 times with exponential backoff:

Attempt | Delay
1st retry | 10 seconds
2nd retry | 60 seconds
3rd retry | 300 seconds (5 minutes)

After all retries are exhausted, the delivery is marked as failed. Your webhook remains active and will continue receiving future events.

Tabular Generation

The tabular API generates structured CSV/JSON data from either a natural-language description or an explicit column schema. It supports 50+ column types including identity-derived fields (names, SSNs, addresses) and synthetic fields (random numbers, enums, patterns).

Parse NL Description

POST /api/v1/tabular/parse

Converts a natural-language description into a structured column schema using an LLM with heuristic fallback.

Scope required: generate

Request body:

Field | Type | Required | Description
prompt | string | yes | Natural language description of desired columns (1–2,000 chars)

Example:

curl -X POST https://symagedocs.ai/api/v1/tabular/parse \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "first name, last name, age, city, state, annual salary"}'

Response:

{
  "data": {
    "columns": [
      { "name": "first_name", "type": "text_first_name", "constraints": {} },
      { "name": "last_name", "type": "text_last_name", "constraints": {} },
      { "name": "age", "type": "identity_age", "constraints": {} },
      { "name": "city", "type": "address_city", "constraints": {} },
      { "name": "state", "type": "address_state", "constraints": {} },
      {
        "name": "annual_salary",
        "type": "number_integer",
        "constraints": { "min": 20000, "max": 200000 }
      }
    ],
    "parser_type": "llm",
    "model": "claude-haiku-4-5-20251001"
  },
  "meta": { "request_id": "...", "count": 6 }
}

You can modify the returned columns (rename, remove, adjust constraints) before passing them to the generate endpoint.

Generate Tabular Data

POST /api/v1/tabular/generate

Creates an async tabular generation job from an explicit column schema. Poll the status endpoint for progress.

Scope required: generate

Request body:

Field | Type | Required | Default | Description
columns | object[] | yes | - | Column definitions (1–50 columns, see below)
quantity | integer | no | 100 | Number of rows (1–10,000)
output_formats | string[] | no | ["csv"] | Output formats: csv, json
seed | integer | no | null | Seed for reproducibility

Column definition:

Each column is an object with:

Field | Type | Description
name | string | Column header name
type | string | Column type (see common types below)
constraints | object | Type-specific parameters (min, max, values, weights, etc.)

Common column types:

Type | Description | Constraints
text_first_name | First name from synthetic identity | -
text_last_name | Last name | -
text_full_name | Full name | -
ssn | SSN (###-##-####) | -
identity_age | Age in years | -
identity_gender | Gender (M/F) | -
address_city | City | -
address_state | State abbreviation | -
address_zip | ZIP code | -
phone | Phone number | -
email | Email address | -
number_integer | Random integer | min, max, distribution_type, distribution_params
number_decimal | Random decimal | min, max, decimals
custom_enum | Weighted random choice | values: [...], weights: [...]
custom_mask | Pattern-based string | mask (A=upper, a=lower, 9=digit, X=any)
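
As an illustration of how the constrained types might be combined, the sketch below builds a request body for POST /tabular/generate in Python. The column names, the mask string, and the weight scale (relative integers) are assumptions chosen for the example; only the type names and constraint keys come from the table above.

```python
import json

columns = [
    # Mask characters per the table: A=upper, 9=digit.
    {"name": "employee_id", "type": "custom_mask", "constraints": {"mask": "AA9999"}},
    {"name": "full_name", "type": "text_full_name", "constraints": {}},
    {
        "name": "department",
        "type": "custom_enum",
        # Whether weights must follow a particular scale is not documented here.
        "constraints": {"values": ["eng", "sales", "ops"], "weights": [5, 3, 2]},
    },
    {
        "name": "hourly_rate",
        "type": "number_decimal",
        "constraints": {"min": 15, "max": 95, "decimals": 2},
    },
]

# Documented limit: 1 to 50 columns per request.
assert 1 <= len(columns) <= 50

payload = json.dumps(
    {"columns": columns, "quantity": 500, "output_formats": ["csv"], "seed": 7}
)
```

The resulting payload string can be sent as the request body with Content-Type: application/json.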

Example:

curl -X POST https://symagedocs.ai/api/v1/tabular/generate \
  -H "Authorization: Bearer sk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "columns": [
      {"name": "first_name", "type": "text_first_name", "constraints": {}},
      {"name": "last_name", "type": "text_last_name", "constraints": {}},
      {"name": "age", "type": "identity_age", "constraints": {}},
      {"name": "salary", "type": "number_integer", "constraints": {"min": 30000, "max": 150000}}
    ],
    "quantity": 1000,
    "output_formats": ["csv", "json"]
  }'

Response:

{
  "data": {
    "job_id": "550e8400-...",
    "status": "processing",
    "credits_required": 20
  },
  "meta": { "request_id": "...", "credits_charged": 20 }
}

Credits: Base rate is 1 credit per 50 rows (0.02 credits/row). A column surcharge applies only when the schema has more than 10 columns: each 500 "extra cells" (rows × columns beyond 10) costs 1 additional credit. Final cost = ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500)). The example above uses 4 columns and 1,000 rows: base = 1,000/50 = 20 credits, no surcharge (4 < 10), total = 20 credits.
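
The formula can be evaluated with integer arithmetic, as in the sketch below; the function name is invented for this example. Because the column surcharge is already a whole number, ceil(rows/50 + surcharge) equals ceil(rows/50) + surcharge, which avoids floating-point rounding.

```python
def tabular_credits(rows: int, cols: int) -> int:
    """ceil(rows/50 + ceil(rows * max(cols - 10, 0) / 500)) in integer arithmetic."""
    extra_cells = rows * max(cols - 10, 0)
    surcharge = -(-extra_cells // 500)   # ceiling division
    base = -(-rows // 50)                # ceiling division
    return base + surcharge


print(tabular_credits(1000, 4))    # 20: no surcharge at 10 columns or fewer
print(tabular_credits(1000, 12))   # 24: 20 base + ceil(2000 / 500) = 4
print(tabular_credits(120, 15))    # 5: ceil(2.4 + 2) with surcharge ceil(600 / 500) = 2
```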

Get Tabular Job Status

GET /api/v1/tabular/{job_id}/status

Returns progress and ETA for a tabular generation job.

Scope required: read

Response fields:

| Field | Type | Description |
|---|---|---|
| job_id | string | Job identifier |
| status | string | processing, completed, or failed |
| rows_generated | integer | Rows completed so far |
| total_rows | integer | Total rows requested |
| percent | float | Completion percentage (0.0–100.0) |
| estimated_seconds_remaining | float | ETA in seconds (null if unavailable) |
| error | string | Error message if failed (null otherwise) |
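A polling loop over these fields might look like the sketch below. It assumes a caller-supplied fetch_status function (hypothetical, not part of any SDK) that performs the GET request and returns the status fields as a dict; canned responses stand in for HTTP calls so the example runs offline.

```python
import time
from typing import Callable


def wait_for_tabular_job(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until status is 'completed'; raise on 'failed' or on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_status()
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"job {job['job_id']} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job['job_id']} still {job['status']}")
        print(f"{job['rows_generated']}/{job['total_rows']} rows ({job['percent']:.1f}%)")
        sleep(poll_seconds)


# Canned responses (made-up values) replace real HTTP calls for this demo.
canned = iter([
    {"job_id": "demo", "status": "processing", "rows_generated": 400,
     "total_rows": 1000, "percent": 40.0,
     "estimated_seconds_remaining": 1.5, "error": None},
    {"job_id": "demo", "status": "completed", "rows_generated": 1000,
     "total_rows": 1000, "percent": 100.0,
     "estimated_seconds_remaining": None, "error": None},
])
final = wait_for_tabular_job(lambda: next(canned), sleep=lambda seconds: None)
print(final["status"])
```

Injecting fetch_status and sleep keeps the loop independent of any HTTP client and easy to test; the timeout guards against a job that never leaves processing.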

Download Tabular Output

GET /api/v1/tabular/{job_id}/download/{format}

Downloads the generated tabular data. The job must be in completed status. Data expires after 48 hours.

Scope required: read

Path parameters:

| Parameter | Values | Description |
|---|---|---|
| format | csv, json | One format per request. |

Response: File download (text/csv or application/json).

Errors:

| Code | Meaning |
|---|---|
| 400 | Invalid format (not csv or json) |
| 404 | Job not found |
| 409 | Job not yet completed |
| 410 | Data expired (older than 48 hours) |

Pricing

These endpoints require no authentication and can be used to preview costs before creating jobs.

Get Pricing Rates

GET /api/v1/pricing/rates

Returns the current credit rate constants used for cost calculation.

Authentication: None required.

Response:

{
  "csv": {
    "rows_per_credit": 50,
    "extra_column_threshold": 10,
    "extra_cells_per_credit": 500
  },
  "pdf_typed": {
    "base_credits": 20,
    "field_threshold": 25,
    "surcharge_per_band": 2,
    "band_size": 25
  },
  "pdf_handwritten": {
    "base_credits": 40,
    "field_threshold": 25,
    "surcharge_per_band": 4,
    "band_size": 25
  },
  "degradation_multipliers": { ... },
  "rounding": {"free_threshold": 0, "minimum_credits": 1, "method": "ceil"},
  "bundling": "PDF output includes CSV/JSON free"
}

Estimate Credit Cost

POST /api/v1/pricing/estimate

Estimates the credit cost for a job without creating it.

Authentication: None required.

Request body:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| field_count | integer | yes | | Number of fields/columns on the form |
| output_formats | string[] | yes | | Output formats (e.g., ["pdf_typed"]) |
| record_count | integer | yes | | Number of records/rows to generate |
| degradation_profile | string | no | null | Degradation profile to apply |

Response:

{
  "credits": 2200
}
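As a rough illustration, the request can be assembled with the Python standard library. This sketch only builds the request object and does not send it; the helper name build_estimate_request is hypothetical, and the body fields follow the table above.

```python
import json
import urllib.request

BASE_URL = "https://symagedocs.ai/api/v1"


def build_estimate_request(field_count, output_formats, record_count,
                           degradation_profile=None):
    """Build (but do not send) a POST /pricing/estimate request."""
    body = {
        "field_count": field_count,
        "output_formats": list(output_formats),
        "record_count": record_count,
    }
    # Optional field: omit it entirely when no profile is requested.
    if degradation_profile is not None:
        body["degradation_profile"] = degradation_profile
    return urllib.request.Request(
        f"{BASE_URL}/pricing/estimate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_estimate_request(50, ["pdf_typed"], 100)
print(req.get_method(), req.full_url)
# To send it (requires network access):
#   with urllib.request.urlopen(req) as resp:
#       credits = json.load(resp)["credits"]
```

No Authorization header is set because this endpoint requires no authentication.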

Bundle Size Estimation

These endpoints require no authentication and can be used to preview the compressed download size of a job before submitting it. The estimates are calibrated against real staging bundles (see calibration_sample_count and calibrated_at in the response). Expected accuracy is +/- 30% for 80% of jobs.

Get Bundle-Size Rate Constants

GET /api/v1/jobs/size-rates

Returns the calibrated per-format byte constants and tier thresholds used by the size estimator.

Authentication: None required.

Response:

{
  "rates": {
    "pdf_typed_bytes_per_file": 518426,
    "pdf_handwritten_bytes_per_file": 2240759,
    "png_typed_bytes_per_file": 1549775,
    "png_handwritten_bytes_per_file": 549089,
    "funsd_bytes_per_file": 5093,
    "bio_bytes_per_instance": 43661,
    "yolo_bytes_per_file": 1960,
    "coco_bytes_per_surface": 55508,
    "donut_bytes_per_row": 2703,
    "ground_truth_json_bytes_per_file": 1984,
    "ground_truth_csv_bytes_per_job": 4698,
    "bundle_overhead_bytes_per_job": 2757,
    "funsd_emit_rate": 0.858,
    "bio_emit_rate": 1.0
  },
  "tiers": {
    "small_max_bytes": 104857600,
    "medium_max_bytes": 1073741824
  },
  "expected_accuracy": "+/- 30%",
  "calibration_sample_count": 296,
  "calibrated_at": "2026-05-08"
}

Estimate Bundle Download Size

POST /api/v1/jobs/size-estimate

Estimates the compressed zip download size for a job based on its forms, page counts, output formats, and record count.

Authentication: None required.

Request body:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| field_counts | integer[] | yes | | Number of fields on each form (one entry per form) |
| page_counts | integer[] | yes | | Number of pages on each form (must match field_counts length) |
| output_formats | string[] | no | [] | Requested output format tokens (e.g., ["pdf_typed", "bio"]) |
| record_count | integer | yes | | Number of identities to generate (>= 0) |

field_counts and page_counts must have the same length (>= 1). A length mismatch returns 422.

Unknown or rejected format tokens (e.g. "pdf_unicorn") return 400 with code=invalid_output_format.
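The length and range rules above can be checked on the client before the request is sent, which avoids a round trip for an obviously malformed body. A minimal sketch follows; the function name is illustrative, and it does not validate format tokens because the full list of accepted tokens is not reproduced here.

```python
def validate_size_estimate_body(body: dict) -> list:
    """Return a list of problems; an empty list means these local checks pass."""
    problems = []
    field_counts = body.get("field_counts", [])
    page_counts = body.get("page_counts", [])
    if len(field_counts) < 1:
        problems.append("field_counts must contain at least one entry")
    if len(field_counts) != len(page_counts):
        problems.append("field_counts and page_counts must have the same length")
    record_count = body.get("record_count")
    if not isinstance(record_count, int) or record_count < 0:
        problems.append("record_count must be an integer >= 0")
    return problems


ok = {"field_counts": [20, 50], "page_counts": [1, 2],
      "output_formats": ["pdf_typed"], "record_count": 10}
bad = {"field_counts": [20, 50], "page_counts": [1], "record_count": 10}
print(validate_size_estimate_body(ok))
print(validate_size_estimate_body(bad))
```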

Response:

{
  "total_bytes": 5255252,
  "human_readable": "~5.0 MB",
  "tier": "small",
  "breakdown": {
    "pdf_typed": 5184260,
    "pdf_handwritten": 0,
    "png_typed": 0,
    "png_handwritten": 0,
    "funsd": 43697,
    "bio": 0,
    "yolo": 0,
    "coco": 0,
    "donut": 0,
    "ground_truth": 19840,
    "overhead": 7455
  },
  "expected_accuracy": "+/- 30%",
  "calibration_sample_count": 296,
  "calibrated_at": "2026-05-08",
  "notes": []
}

Tier values:

| Tier | Range |
|---|---|
| small | < 100 MB |
| medium | 100 MB to < 1 GB |
| large | >= 1 GB |
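The tier boundaries can also be applied locally to any byte count. The sketch below assumes the thresholds are the small_max_bytes and medium_max_bytes values from the size-rates response shown earlier and that each upper bound is exclusive, as the table indicates; the function name is illustrative.

```python
SMALL_MAX_BYTES = 104_857_600      # 100 * 1024**2, from tiers.small_max_bytes
MEDIUM_MAX_BYTES = 1_073_741_824   # 1024**3, from tiers.medium_max_bytes


def size_tier(total_bytes: int) -> str:
    """Map a byte count to small / medium / large using exclusive upper bounds."""
    if total_bytes < SMALL_MAX_BYTES:
        return "small"
    if total_bytes < MEDIUM_MAX_BYTES:
        return "medium"
    return "large"


print(size_tier(5_255_252))  # total_bytes from the sample response
```

Fetching the thresholds from GET /api/v1/jobs/size-rates at runtime, rather than hard-coding them, would keep the classification in step with future recalibrations.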

Notes field: When record_count is 0, notes includes "record_count = 0; size shown is per-job overhead floor." and total_bytes reflects only the fixed per-job overhead.

FUNSD behavior: FUNSD ground-truth files are always included in the bundle when any render format (pdf_typed, pdf_handwritten, png_typed, png_handwritten) is requested. The funsd breakdown entry will be non-zero whenever a render format is present, regardless of whether funsd appears in output_formats.

API Key Management

Manage API keys programmatically. All key management endpoints use session authentication (x-user-id header from the web UI), not Bearer token auth.

Create an API Key

POST /api/internal/api-keys

Creates a new API key. The raw key is returned once — store it securely.

Request body:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | yes | | Key label (1–200 chars) |
| scopes | string[] | no | ["generate", "read", "account"] | Scopes for this key |

Response:

{
  "key_id": "abc-123",
  "api_key": "sk_live_a1b2c3d4e5f6...",
  "prefix": "sk_live_a1b2c3d4",
  "name": "My Key",
  "scopes": ["generate", "read", "account"],
  "created_at": "2026-03-26T12:00:00+00:00"
}

List API Keys

GET /api/internal/api-keys

Returns all API keys for the current user. Full keys are never included — only the prefix for identification.

Response fields per key:

| Field | Type | Description |
|---|---|---|
| key_id | string | Key identifier |
| prefix | string | First 16 characters of the key |
| name | string | Key label |
| scopes | string[] | Granted scopes |
| tier | string | Rate limit tier |
| last_used_at | string | Last usage timestamp (null if never used) |
| created_at | string | Creation timestamp |
| is_active | boolean | Whether the key is active |

Revoke an API Key

DELETE /api/internal/api-keys/{key_id}

Permanently disables a key. Takes effect immediately and cannot be undone.

Rotate an API Key

POST /api/internal/api-keys/{key_id}/rotate

Creates a new key and revokes the old one atomically. The new key inherits the old key's name and scopes.

Response:

{
  "old_key_id": "abc-123",
  "new_key": {
    "key_id": "def-456",
    "api_key": "sk_live_x9y8z7w6...",
    "prefix": "sk_live_x9y8z7w6",
    "name": "My Key",
    "scopes": ["generate", "read", "account"],
    "created_at": "2026-04-01T12:00:00+00:00"
  }
}

Training Data (ML)

/generate jobs can emit labeled training data for document AI models. Pass label_scheme inside the request config, request the ML output formats you need (bio, coco, donut, yolo) alongside a render format (pdf_typed/pdf_handwritten), and pull per-item annotations from the /jobs/{id}/items surface.

Label schemes:

| Scheme | Description |
|---|---|
| field_id | Labels use form field IDs (e.g., employee_ssn) |
| semantic_concept | Labels use semantic concepts (e.g., social_security_number) |
| nist3 | Labels use NIST Type-3 categories |
| field_type | Labels use field types (e.g., ssn, currency, text) |

Workflow:

1. POST /generate  {form_id, quantity, output_formats: ["pdf_typed", "bio", "coco"],
                    config: {label_scheme: "nist3"}}
2. GET  /jobs/{job_id}                                → poll until status = "completed"
3. GET  /jobs/{job_id}/items                          → enumerate items (paginated)
4. GET  /jobs/{job_id}/items/{item_id}/download       → presigned URLs (PDF, JSON, COCO, BIO)
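Step 1 of the workflow can be sketched as a small helper that assembles the /generate body. The helper name and the local checks are assumptions made for illustration; the server's own validation may differ. The scheme and format names come from this section.

```python
import json

LABEL_SCHEMES = {"field_id", "semantic_concept", "nist3", "field_type"}
RENDER_FORMATS = {"pdf_typed", "pdf_handwritten"}


def build_training_job_body(form_id, quantity, output_formats, label_scheme):
    """Assemble a /generate request body that asks for labeled training data."""
    formats = list(output_formats)
    if label_scheme not in LABEL_SCHEMES:
        raise ValueError(f"unknown label_scheme: {label_scheme!r}")
    # ML formats are requested alongside a render format, per the text above.
    if not RENDER_FORMATS.intersection(formats):
        raise ValueError("include a render format such as pdf_typed")
    return {
        "form_id": form_id,
        "quantity": quantity,
        "output_formats": formats,
        "config": {"label_scheme": label_scheme},
    }


body = build_training_job_body(
    "irs_w2_2025", 100, ["pdf_typed", "bio", "coco"], "nist3")
print(json.dumps(body, indent=2))
```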

For a complete working example, see the Python SDK's train_kie_model.py example, which demonstrates creating a job with NIST3 labels and iterating over training examples with BIO labels and spatial annotations.

Anti-pattern: list-valued fields in the semantic_concept scheme. When two or more form fields share the same semantic_concept, the Donut exporter collapses their values into a JSON array, and the BIO exporter emits the same B/I tag pair for every token, which destroys positional information. If a form has multiple fixed-position slots for what is conceptually the same data (e.g., the four W-2 Box 12 rows), each slot must have its own per-slot concept (e.g., tax.box12a.code, tax.box12b.code, …).

Donut consumer conventions

Each row in metadata.jsonl has a ground_truth field containing a JSON string of the form {"gt_parse": {...}}. The consumer — typically a HuggingFace Donut training script — is responsible for serializing gt_parse into Donut's tagged-token sequence. Two conventions apply.

Nested list values. When multiple form fields resolve to the same leaf path in the gt_parse tree (e.g., per-slot W-2 Box 12 amounts that share a parent path), DonutExporter collects their values into a JSON array at that leaf. The canonical Donut recursive flattener (used in clovaai/donut's token2json round-trip) emits one sibling tagged block per list element:

// gt_parse input (list-valued leaf)
{ "box12": { "amount": ["1200.00", "450.00"] } }
// Donut token sequence (sibling blocks per element)
<s_box12><s_amount>1200.00</s_amount><s_amount>450.00</s_amount></s_box12>

This is correct Donut convention. A consumer should not assume scalar values at every leaf.
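A consumer-side flattener that follows the sibling-block convention described in this section could look like the sketch below. It is written from that description, not copied from clovaai/donut, and the function name is illustrative. Donut implementations can differ in how they serialize lists (some join elements with a separator token), so compare the output against the json2token routine in the version you train with.

```python
import json


def gt_parse_to_tokens(node) -> str:
    """Serialize a gt_parse tree, emitting one tagged block per list element."""
    if isinstance(node, dict):
        parts = []
        for key, value in node.items():
            # Treat a scalar or dict as a one-element list so both paths match.
            items = value if isinstance(value, list) else [value]
            for item in items:
                parts.append(f"<s_{key}>{gt_parse_to_tokens(item)}</s_{key}>")
        return "".join(parts)
    if isinstance(node, list):
        return "".join(gt_parse_to_tokens(item) for item in node)
    return str(node)


# A metadata.jsonl row stores ground_truth as a JSON string.
row = {"ground_truth": json.dumps(
    {"gt_parse": {"box12": {"amount": ["1200.00", "450.00"]}}})}
gt_parse = json.loads(row["ground_truth"])["gt_parse"]
print(gt_parse_to_tokens(gt_parse))
```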

Task-prefix wrapping token. Donut requires a per-task start token (e.g., <s_w2>, <s_invoice>) wrapping every gt_parse sequence. SymageDocs does not emit this token: metadata.jsonl rows contain only the bare gt_parse object. The consumer must prepend the task token before serializing the sequence. Derive it from the request's form_id (e.g., irs_w2_2025 → <s_irs_w2_2025>, or <s_w2> for a coarser scheme). As in any standard Donut fine-tuning setup, the token must also be added to the tokenizer's additional_special_tokens, and the model's token embeddings must be resized to match.
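Deriving and prepending the task token is a one-line transformation. The helper names below are hypothetical, and the sketch covers only the string handling; registering the token with the tokenizer and resizing the embeddings is left to the training script.

```python
def task_start_token(form_id: str) -> str:
    """Derive a per-task start token from the request's form_id."""
    return f"<s_{form_id}>"


def with_task_prefix(form_id: str, sequence: str) -> str:
    """Prepend the task token to an already serialized gt_parse sequence."""
    return task_start_token(form_id) + sequence


sequence = "<s_box12><s_amount>1200.00</s_amount></s_box12>"
print(with_task_prefix("irs_w2_2025", sequence))
```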

Credit Pricing

Credits are charged when a generation job is created. The cost depends on the form's field count, the output format, and the quantity.

Pricing Formulas

Typed PDF (standard):

cost_per_document = 20 + ceil(max(field_count - 25, 0) / 25) * 2
total_credits = cost_per_document * quantity

Handwritten PDF:

cost_per_document = 40 + ceil(max(field_count - 25, 0) / 25) * 4
total_credits = cost_per_document * quantity

JSON and CSV ground truth: Every bundle automatically includes ground_truth/json/ (per-document structured JSON) and ground_truth/data.csv (aggregated tabular CSV) at no additional cost. These are not selectable in output_formats — passing "json" or "csv" there returns 400 code=invalid_output_format.

Examples

| Form | Fields | Quantity | Formats | Credits |
|---|---|---|---|---|
| W-2 | 20 | 100 | pdf_typed | 2,000 |
| 1040 | 50 | 100 | pdf_typed | 2,200 |
| 1040 | 50 | 10 | pdf_typed | 220 |
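The rows in this table can be reproduced from the two PDF formulas. The sketch below is illustrative only: the function name is made up, the constants mirror the pdf_typed and pdf_handwritten entries returned by GET /api/v1/pricing/rates, and degradation multipliers are not modeled.

```python
import math

RATES = {
    "pdf_typed": {"base_credits": 20, "field_threshold": 25,
                  "surcharge_per_band": 2, "band_size": 25},
    "pdf_handwritten": {"base_credits": 40, "field_threshold": 25,
                        "surcharge_per_band": 4, "band_size": 25},
}


def pdf_credits(fmt: str, field_count: int, quantity: int) -> int:
    """base + ceil(max(field_count - threshold, 0) / band_size) * surcharge, per document."""
    rate = RATES[fmt]
    extra_fields = max(field_count - rate["field_threshold"], 0)
    bands = math.ceil(extra_fields / rate["band_size"])
    per_document = rate["base_credits"] + bands * rate["surcharge_per_band"]
    return per_document * quantity


print(pdf_credits("pdf_typed", 20, 100))  # W-2 row
print(pdf_credits("pdf_typed", 50, 100))  # 1040 row, 100 documents
print(pdf_credits("pdf_typed", 50, 10))   # 1040 row, 10 documents
```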

Tabular data: Base rate is 1 credit per 50 rows. A surcharge of 1 credit per 500 extra cells applies when columns exceed 10 (extra cells = rows × (columns − 10)). Final cost = ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500)).

| Data | Columns | Rows | Credits |
|---|---|---|---|
| Tabular | 5 | 1,000 | 20 |
| Tabular | 20 | 1,000 | 40 |
| Tabular | 10 | 5,000 | 100 |
| Tabular | 50 | 10,000 | 1,000 |

Use the form catalog's credit_cost field to see the per-document cost for each form at default (typed PDF) output.

Common Workflows

Batch Document Generation

# 1. Find the form you need
curl -s "https://symagedocs.ai/api/v1/forms?category=tax" \
  -H "Authorization: Bearer $API_KEY" | jq '.data[] | {id, name, credit_cost}'

# 2. Create a generation job
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 100, "output_formats": ["pdf_typed"]}' \
  | jq -r '.data.job_id')

echo "Job created: $JOB_ID"

# 3. Poll until complete
while true; do
  STATUS=$(curl -s https://symagedocs.ai/api/v1/jobs/$JOB_ID \
    -H "Authorization: Bearer $API_KEY" | jq -r '.data.status')
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && echo "Job failed!" && exit 1
  sleep 5
done

# 4. Download results — every artifact in one zip
curl https://symagedocs.ai/api/v1/jobs/$JOB_ID/download/bundle \
  -H "Authorization: Bearer $API_KEY" -o w2_bundle.zip

# Or just the structured data
curl https://symagedocs.ai/api/v1/jobs/$JOB_ID/download/json \
  -H "Authorization: Bearer $API_KEY" -o w2_data.json

Tabular Data Generation

# 1. Parse a natural language description into a schema
SCHEMA=$(curl -s -X POST https://symagedocs.ai/api/v1/tabular/parse \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "employee name, department, hire date, salary, city"}' \
  | jq '.data.columns')

echo "Parsed schema: $SCHEMA"

# 2. Generate 5,000 rows from the schema
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/tabular/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"columns\": $SCHEMA, \"quantity\": 5000, \"output_formats\": [\"csv\"]}" \
  | jq -r '.data.job_id')

echo "Job created: $JOB_ID"

# 3. Poll until complete
while true; do
  STATUS=$(curl -s https://symagedocs.ai/api/v1/tabular/$JOB_ID/status \
    -H "Authorization: Bearer $API_KEY" | jq -r '.data.status')
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && echo "Job failed!" && exit 1
  sleep 2
done

# 4. Download CSV
curl https://symagedocs.ai/api/v1/tabular/$JOB_ID/download/csv \
  -H "Authorization: Bearer $API_KEY" -o employees.csv

Event-Driven Generation (Webhooks)

Instead of polling for job completion, register a webhook and let SymageDocs notify you:

# 1. Register a webhook (once, via the web UI or the /api/internal/webhooks endpoint)
curl -X POST https://symagedocs.ai/api/internal/webhooks \
  -H "x-user-id: YOUR_USER_ID" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-server.com/webhooks/symagedocs", "events": ["job.completed", "job.failed"]}'

# Save the signing_secret from the response!

# 2. Generate documents — no need to poll
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 100, "output_formats": ["pdf_typed"]}' \
  | jq -r '.data.job_id')

echo "Job $JOB_ID created — webhook will notify when done"

# 3. Your server receives a POST to /webhooks/symagedocs with:
#    {"event": "job.completed", "data": {"job_id": "...", "status": "completed", ...}}
#    Verify the X-Symagedocs-Signature header, then download results.
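On the receiving side, the payload shape shown in step 3 can be dispatched with a few lines of Python. This is a sketch under assumptions: the function name and the returned action labels are invented for illustration, and signature verification (described in the Webhooks section) is expected to happen before this function is called.

```python
import json


def route_webhook_event(raw_body: bytes):
    """Return (action, job_id) for a webhook body that has already been verified."""
    event = json.loads(raw_body)
    job_id = event.get("data", {}).get("job_id")
    if event.get("event") == "job.completed":
        return ("download", job_id)
    if event.get("event") == "job.failed":
        return ("report_failure", job_id)
    # Unknown event types are ignored rather than treated as errors.
    return ("ignore", job_id)


sample = json.dumps({"event": "job.completed",
                     "data": {"job_id": "demo-job", "status": "completed"}})
print(route_webhook_event(sample.encode("utf-8")))
```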

Reproducible Generation

Use the seed parameter to generate identical output across runs:

curl -X POST https://symagedocs.ai/api/v1/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"form_id": "irs_w2_2025", "quantity": 10, "seed": 12345}'

Running this request again with the same seed and quantity produces the same documents.

Key Rotation

Rotate keys periodically for security. The old key stops working immediately and the new key takes over:

# Rotate via the web UI, or programmatically:
curl -X POST https://symagedocs.ai/api/internal/api-keys/OLD_KEY_ID/rotate \
  -H "x-user-id: YOUR_USER_ID"

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| 401 Invalid API key | Key doesn't exist or was revoked | Create a new key in the web UI |
| 403 Missing required scopes | Key lacks the scope for this endpoint | Create a new key with the needed scopes |
| 404 Job not found | Job doesn't exist or belongs to another user | Check the job ID; keys can only access their own jobs |
| 409 Job is not completed | Trying to download before generation finishes | Poll GET /jobs/{id} until status is completed |
| 429 Rate limit exceeded | Exceeded your tier's per-key request rate | Back off using the X-RateLimit-Remaining header and retry |
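For the 429 case, one common approach is exponential backoff with jitter. The sketch below is an assumption-laden illustration, not SDK behavior: send is a caller-supplied function (hypothetical) that performs one request and returns (status_code, headers, body), and the delay schedule is an arbitrary choice. Canned responses stand in for real HTTP calls.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry send() while it returns 429, doubling the wait after each attempt."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt < max_attempts - 1:
            # Exponential delay plus random jitter to avoid synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Canned responses (made-up values) replace real HTTP calls for this demo.
canned = iter([
    (429, {"X-RateLimit-Remaining": "0"}, b""),
    (200, {"X-RateLimit-Remaining": "9"}, b"ok"),
])
waits = []
status, headers, body = call_with_backoff(lambda: next(canned), sleep=waits.append)
print(status, len(waits))
```

A client can also read X-RateLimit-Remaining on successful responses and slow down before the limit is reached, rather than waiting for a 429.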