SymageDocs API User Manual
The SymageDocs API lets you programmatically generate synthetic documents and identities. This manual covers authentication, endpoints, billing, and common workflows.
Python users: Consider the Python SDK (pip install symagedocs) instead of raw HTTP. It handles authentication, retries, pagination, and polling automatically.
Quick Reference
Base URL: https://symagedocs.ai/api/v1
Auth: Authorization: Bearer sk_live_YOUR_KEY
Rate limit: Per-key, tier-based sustained rate (free: 10 req/s; pro: 30; scale: 100; enterprise: 200) with 3× short burst
Endpoints at a Glance
These are the public Bearer-authenticated endpoints under /api/v1. Webhook and API-key management are session-authenticated and live under /api/internal/... — see Webhooks and API Key Management.
| Method | Endpoint | Scope | Description |
|---|---|---|---|
| GET | /forms | read | List available forms (filter: ?category=tax) |
| GET | /forms/{id} | read | Form details with field definitions |
| POST | /generate | generate | Create an async generation job |
| GET | /jobs | read | List your jobs (paginated: ?limit=50&cursor=...) |
| GET | /jobs/{id} | read | Job status and progress |
| GET | /jobs/{id}/download/{fmt} | read | Download output (bundle, csv, or json) |
| GET | /jobs/{id}/downloads | read | Presigned download URLs for all job output files |
| GET | /jobs/{id}/items | read | List per-item records for a job |
| GET | /jobs/{id}/items/{item_id}/download | read | Presigned URLs for one item's files |
| POST | /jobs/{id}/cancel | generate | Cancel a running job; refund unspent credits |
| POST | /identities | generate | Generate raw identities (JSON only, no forms) |
| GET | /account/balance | account | Credit usage totals |
| GET | /account/usage | account | Usage summary (filter: ?days=30) |
| POST | /tabular/parse | generate | NL description to column schema (LLM-powered) |
| POST | /tabular/generate | generate | Generate tabular data from column schema |
| GET | /tabular/{id}/status | read | Tabular job progress and ETA |
| GET | /tabular/{id}/download/{fmt} | read | Download tabular output (csv or json) |
| GET | /pricing/rates | — | Get current credit rate constants (no auth) |
| POST | /pricing/estimate | — | Estimate credit cost for a job (no auth) |
| GET | /jobs/size-rates | — | Get calibrated bundle-size rate constants (no auth) |
| POST | /jobs/size-estimate | — | Estimate compressed bundle download size (no auth) |
| GET | /health | — | Liveness probe (no auth) |
Typical Workflow (Jobs)
1. GET /forms → pick a form_id
2. POST /generate {form_id, quantity} → get job_id
3. GET /jobs/{job_id} → poll until status = "completed"
4. GET /jobs/{job_id}/download/bundle → download all artifacts as a single zip
— or —
GET /jobs/{job_id}/downloads → get presigned URLs for direct S3 download
Typical Workflow (Per-item training data)
1. GET /forms → pick a form_id
2. POST /generate {form_id, quantity, output_formats: ["pdf_typed", "bio"]}
↑ optional: send `Idempotency-Key: <key>` header for retry-safety
3. GET /jobs/{job_id} → poll until status = "completed"
4. GET /jobs/{job_id}/items → enumerate items (paginated)
5. GET /jobs/{job_id}/items/{id}/download → presigned URLs per item
Typical Tabular Workflow
1. POST /tabular/parse {prompt} → get column schema
2. POST /tabular/generate {columns, qty} → get job_id (status = "processing")
3. GET /tabular/{job_id}/status → poll until status = "completed"
4. GET /tabular/{job_id}/download/csv → download results
# Tabular jobs use status = "processing" while in flight (not "pending"/"generating",
# which apply only to /jobs).
Key Scopes
| Scope | What it allows |
|---|---|
| generate | Create jobs, generate identities, tabular parse/generate |
| read | List forms, view jobs, download output, tabular status/download |
| account | View balance and usage |
Credit Cost (per document)
| Output | Formula |
|---|---|
| Typed PDF | 20 + ceil(max(fields - 25, 0) / 25) * 2 |
| Handwritten PDF | 40 + ceil(max(fields - 25, 0) / 25) * 4 |
| Tabular (CSV/JSON) | ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500)) |
JSON and CSV are always free. Every bundle includes ground_truth/json/ and ground_truth/data.csv at no extra cost. They are not selectable in output_formats.
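The two per-document PDF formulas in the table above can be checked locally with a short sketch. The function names are illustrative and not part of the SDK; `fields` is assumed to be the form's field_count as returned by the forms endpoints.

```python
def _ceil_div(n: int, d: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-n // d)


def typed_pdf_credits(fields: int) -> int:
    # 20 base credits, plus 2 per started block of 25 fields beyond the first 25.
    return 20 + _ceil_div(max(fields - 25, 0), 25) * 2


def handwritten_pdf_credits(fields: int) -> int:
    # 40 base credits, plus 4 per started block of 25 fields beyond the first 25.
    return 40 + _ceil_div(max(fields - 25, 0), 25) * 4


if __name__ == "__main__":
    for n in (10, 25, 26, 60):
        print(n, typed_pdf_credits(n), handwritten_pdf_credits(n))
```

Multiply the per-document figure by the number of documents to approximate a job's cost; POST /pricing/estimate remains the authoritative source.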
Response Envelope
{"data": ..., "meta": {"request_id": "uuid", ...}}
Common Errors
| Code | Meaning |
|---|---|
| 401 | Invalid or revoked API key |
| 402 | Insufficient credits |
| 403 | Key missing required scope |
| 404 | Resource not found |
| 409 | Job not yet completed (for downloads) |
| 429 | Rate limit exceeded |
Quick Start
1. Create an API Key
Log in to the SymageDocs web application and navigate to Account > API Keys. Click Create Key, give it a name, and copy the key immediately — it will not be shown again.
Your key looks like: sk_live_a1b2c3d4e5f6... (56 characters total).
2. Make Your First Request
curl https://symagedocs.ai/api/v1/forms \
-H "Authorization: Bearer sk_live_YOUR_KEY_HERE"
3. Generate a Document
# Create a generation job
curl -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer sk_live_YOUR_KEY_HERE" \
-H "Content-Type: application/json" \
-d '{
"form_id": "irs_w2_2025",
"quantity": 10,
"output_formats": ["pdf_typed"]
}'
# Poll for completion (replace JOB_ID with the job_id from above)
curl https://symagedocs.ai/api/v1/jobs/JOB_ID \
-H "Authorization: Bearer sk_live_YOUR_KEY_HERE"
# Download results when status is "completed" — every artifact in one zip
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/bundle \
-H "Authorization: Bearer sk_live_YOUR_KEY_HERE" \
-o output.zip
Authentication
Every API request must include your key in the Authorization header:
Authorization: Bearer sk_live_YOUR_KEY_HERE
No username or password is needed — your identity is embedded in the key itself.
API Key Scopes
Keys can be created with specific scopes to limit what they can do:
| Scope | Grants access to |
|---|---|
| generate | Create generation jobs, generate raw identities |
| read | List forms, check job status, download outputs |
| account | View credit balance and usage statistics |
By default, new keys have all three scopes. You can create restricted keys (e.g., a read-only key for monitoring) through the web UI or API.
Key Management
You manage keys through the web UI at Account > API Keys, or programmatically:
| Action | Description |
|---|---|
| Create | Generates a new key. The raw key is shown once — copy it immediately. |
| List | Shows all your keys with their prefix, name, scopes, and last-used date. Full keys are never shown. |
| Revoke | Permanently disables a key. Takes effect immediately. Cannot be undone. |
| Rotate | Creates a new key and revokes the old one in a single operation. The new key inherits the old key's name and scopes. |
Rate Limiting
The API enforces a per-key rate limit determined by the key's tier. The sustained rate is the token-bucket refill rate; a fresh client can issue up to 3× the sustained rate as an initial burst before throttling kicks in.
| Tier | Sustained | Burst |
|---|---|---|
| Free | 10 requests/second | 30 requests |
| Pro | 30 requests/second | 90 requests |
| Scale | 100 requests/second | 300 requests |
| Enterprise | 200 requests/second | 600 requests |
If you exceed your tier's limit, the API returns 429 Too Many Requests
with the following response headers so clients can back off intelligently:
| Header | Meaning |
|---|---|
| Retry-After | Seconds to wait before retrying (rounded up to at least 1) |
| X-RateLimit-Limit | The configured sustained rate for your tier (requests/second) |
| X-RateLimit-Remaining | Remaining budget within the current window |
| X-RateLimit-Reset | Unix timestamp (seconds) at which the bucket is fully refilled |
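A client can turn these headers into a sleep interval. The sketch below is one possible approach, not SDK code: the function name, the 60-second cap, and the jittered exponential fallback used when Retry-After is absent are all assumptions.

```python
import random


def retry_delay(headers: dict, attempt: int, cap: float = 60.0) -> float:
    """Seconds to sleep after a 429 response.

    Prefers the server's Retry-After value; otherwise falls back to capped
    exponential backoff with jitter (the fallback policy is an assumption).
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("retry-after")
    if value is not None:
        try:
            return max(1.0, float(value))
        except ValueError:
            pass  # unparsable value: fall through to the backoff below
    return min(cap, 2 ** attempt) * random.uniform(0.5, 1.0)
```

Pass the response headers and a zero-based retry counter, then sleep for the returned number of seconds before reissuing the request.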
Response Format
Success Responses
All endpoints return a consistent envelope:
{
"data": { ... },
"meta": {
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"count": 5,
...
}
}
data contains the response payload (object or array depending on the endpoint). meta contains request metadata. request_id is always present and useful for support inquiries.
Error Responses
Errors follow a standard format:
{
"detail": {
"code": "unknown_form",
"message": "Unknown form: irs_w99_2024",
"status": 400
}
}
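The success envelope and the error format can be handled in one place. The following is a minimal sketch; `ApiError` and `parse_response` are hypothetical names, and the handling of a non-object `detail` value is an assumption rather than documented behavior.

```python
import json


class ApiError(Exception):
    """Hypothetical client-side exception carrying the documented error fields."""

    def __init__(self, code: str, message: str, status: int):
        super().__init__(f"{status} {code}: {message}")
        self.code = code
        self.message = message
        self.status = status


def parse_response(status: int, body: str):
    """Return (data, meta) for a success envelope, or raise ApiError."""
    payload = json.loads(body)
    if 200 <= status < 300:
        return payload["data"], payload.get("meta", {})
    detail = payload.get("detail", {})
    if not isinstance(detail, dict):
        # Assumption: some errors may carry a plain-string detail.
        detail = {"message": str(detail)}
    raise ApiError(
        code=detail.get("code", "unknown_error"),
        message=detail.get("message", ""),
        status=detail.get("status", status),
    )
```

Callers can then branch on `err.code` (for example `unknown_form`) instead of matching on message text.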
HTTP Status Codes
| Code | Meaning |
|---|---|
| 200 | Success |
| 400 | Bad request (invalid parameters) |
| 401 | Authentication failed (invalid or revoked key) |
| 402 | Insufficient credits |
| 403 | Forbidden (missing required scope) |
| 404 | Resource not found |
| 409 | Conflict (e.g., downloading from an incomplete job) |
| 429 | Rate limit exceeded |
Endpoints
Forms Catalog
List Forms
GET /api/v1/forms
GET /api/v1/forms?category=tax
Returns all available forms. Optionally filter by category.
Scope required: read
Response fields per form:
| Field | Type | Description |
|---|---|---|
| id | string | Unique form identifier (use this in generation requests) |
| name | string | Human-readable form name |
| category | string | Form category (e.g., tax, healthcare, business) |
| year | integer | Tax year or form version year |
| family | string | Form family (e.g., 1040, w2) |
| entity_type | string | Entity type (individual, business, healthcare) |
| credit_cost | integer | Credits per document (at default PDF output) |
| field_count | integer | Number of fields on the form |
Example:
curl "https://symagedocs.ai/api/v1/forms?category=tax" \
-H "Authorization: Bearer sk_live_YOUR_KEY"
Get Form Details
GET /api/v1/forms/{form_id}
Returns full details for a specific form, including all field definitions.
Scope required: read
Additional response fields:
| Field | Type | Description |
|---|---|---|
| version | string | Form definition version |
| page_count | integer | Number of pages in the form |
| fields | array | List of field definitions |
| fields[].id | string | Field identifier |
| fields[].label | string | Human-readable field label |
| fields[].type | string | Field type (e.g., text, enum, ssn, currency) |
| fields[].page | integer | Page number the field appears on |
Example:
curl https://symagedocs.ai/api/v1/forms/irs_w2_2025 \
-H "Authorization: Bearer sk_live_YOUR_KEY"
Generation
Create a Generation Job
POST /api/v1/generate
Creates an asynchronous generation job. The job runs in the background — poll the job status endpoint to track progress.
Scope required: generate
Request body:
Provide exactly one of form_id or form_ids (not both).
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| form_id | string | conditional | — | Single form to generate. Mutually exclusive with form_ids. |
| form_ids | string[] | conditional | — | List of form IDs to generate coherently for the same identity (e.g., W-2 + 1040 with matching wages, SSN, and name). Mutually exclusive with form_id. |
| quantity | integer | no | 1 | Number of identities to generate (1–10,000,000). Each identity produces one document per form. Large jobs are routed to the bulk worker and may take minutes-to-hours; poll /jobs/{job_id} for status. |
| output_formats | string[] | no | ["pdf_typed"] | Output formats (see below) |
| config | object | no | null | Generation configuration overrides |
| seed | integer | no | null | Seed for reproducible output |
| ink_color | string | no | "blue" | Ink color for handwritten output: black, blue, or red |
| ink_color_distribution | object | no | null | Distribution of ink colors across documents. Keys are color names, values are integer weights that must sum to 100. Example: {"blue": 60, "black": 30, "red": 10}. Overrides ink_color when provided. |
| writer_consistency | string | no | "per_document" | Writer consistency: per_document or per_field |
| realism_level | string | no | "high" | Deprecated. Accepted for backward compatibility; ignored. All handwritten output is rendered at high realism. |
| webhook_url | string | no | null | URL to receive job.completed/job.failed webhook notifications (max 2,048 chars). See Webhooks for details. |
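Some of these constraints can be checked before a request is sent. The sketch below builds a request body and enforces the "exactly one of form_id or form_ids" rule, the quantity range, and the ink-weight sum. `build_generate_body` is a hypothetical helper, and restricting distribution keys to black, blue, and red is an assumption based on the ink_color values listed above.

```python
from typing import Optional, Sequence

ALLOWED_INK_COLORS = {"black", "blue", "red"}  # assumed to match ink_color values


def build_generate_body(
    form_id: Optional[str] = None,
    form_ids: Optional[Sequence[str]] = None,
    quantity: int = 1,
    ink_color_distribution: Optional[dict] = None,
) -> dict:
    """Build a /generate request body, checking documented constraints locally."""
    # Exactly one of form_id / form_ids must be supplied.
    if (form_id is None) == (form_ids is None):
        raise ValueError("provide exactly one of form_id or form_ids")
    if not 1 <= quantity <= 10_000_000:
        raise ValueError("quantity must be between 1 and 10,000,000")

    body: dict = {"quantity": quantity}
    if form_id is not None:
        body["form_id"] = form_id
    else:
        body["form_ids"] = list(form_ids)

    if ink_color_distribution is not None:
        unknown = set(ink_color_distribution) - ALLOWED_INK_COLORS
        if unknown:
            raise ValueError(f"unknown ink colors: {sorted(unknown)}")
        if sum(ink_color_distribution.values()) != 100:
            raise ValueError("ink color weights must sum to 100")
        body["ink_color_distribution"] = dict(ink_color_distribution)
    return body
```

Local checks like these only save a round trip; the server remains the final authority on what it accepts.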
Output formats:
| Format | Description |
|---|---|
| pdf_typed | Filled PDF with typed text |
| pdf_handwritten | Filled PDF with handwritten-style text (not available for all forms) |
| png_typed | Per-page PNG rasterizations of pdf_typed (requires pdf_typed) |
| png_handwritten | Per-page PNG rasterizations of pdf_handwritten (requires pdf_handwritten) |
JSON, CSV, and FUNSD are always included. Every bundle automatically contains ground_truth/json/ (per-document structured JSON), ground_truth/data.csv (aggregated tabular CSV), and per-page FUNSD annotations regardless of which formats you request. Do not include "json", "csv", or "funsd" in output_formats; the API will return 400 code=invalid_output_format if you do.
PNG output requires its paired PDF. png_typed must be requested together with pdf_typed, and png_handwritten must be requested together with pdf_handwritten. PNG always travels with its paired PDF; PNG-only output is not a supported configuration. Submitting png_typed without pdf_typed (or png_handwritten without pdf_handwritten) is rejected by the API with 400 code=missing_pdf_dependency. Cross-surface pairs do not satisfy the dependency: for example, ["png_typed", "pdf_handwritten"] is still rejected because typed PNG needs typed PDF specifically.
ML training formats (gated behind the ml-output-formats-enabled feature flag — requesting any of these without the flag returns 400 code=ml_formats_disabled):
| Format | Description |
|---|---|
| bio | Token classification (B/I/O) labels |
| yolo | YOLOv8 object detection labels |
| coco | COCO JSON object detection annotations |
| donut | Donut OCR-free metadata.jsonl + per-page images |
ML formats require a render format. All ML training formats (bio, yolo, coco, donut) derive word-level annotations from the renderer pipeline. You must include at least one render format (pdf_typed, pdf_handwritten, or png_typed) in the same request. Submitting ML formats without a render format returns 400 code=missing_render_dependency. Example: ["bio", "donut"] alone is invalid; use ["pdf_typed", "bio", "donut"].
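The format rules above can be mirrored in a client-side pre-check. This is a sketch under assumptions: `check_output_formats` is a hypothetical helper, the order in which the server applies its checks is not documented here, and the ml-output-formats-enabled feature flag cannot be inspected from the client, so it is not modeled.

```python
from typing import List, Optional

ALWAYS_INCLUDED = {"json", "csv", "funsd"}
PNG_REQUIRES = {"png_typed": "pdf_typed", "png_handwritten": "pdf_handwritten"}
ML_FORMATS = {"bio", "yolo", "coco", "donut"}
RENDER_FORMATS = {"pdf_typed", "pdf_handwritten", "png_typed"}


def check_output_formats(formats: List[str]) -> Optional[str]:
    """Return the documented error code a request would trigger, or None."""
    requested = set(formats)
    # Ground-truth formats are always bundled and must not be requested.
    if requested & ALWAYS_INCLUDED:
        return "invalid_output_format"
    # Each PNG surface needs the PDF of the same surface.
    for png, pdf in PNG_REQUIRES.items():
        if png in requested and pdf not in requested:
            return "missing_pdf_dependency"
    # ML annotation formats need at least one render format alongside them.
    if (requested & ML_FORMATS) and not (requested & RENDER_FORMATS):
        return "missing_render_dependency"
    return None
```

For instance, `check_output_formats(["bio", "donut"])` reports the render dependency, while `["pdf_typed", "bio", "donut"]` passes.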
You can request multiple formats in a single job (e.g., ["pdf_typed", "bio", "yolo"]). Regardless of how many formats you request, the completed job is downloaded as a single bundle zip containing every produced artifact — including JSON and CSV ground truth, which are always present.
Multi-surface jobs (["pdf_typed", "pdf_handwritten"]) run in parallel by default (STAX-1707). Before 2026-05-15, such jobs were forced through a sequential code path as a defensive workaround. Bulk jobs now use the full bulk-worker core count regardless of how many PDF surfaces were requested. Output bundles separate the two surfaces into pdfs/typed/ and pdfs/handwritten/ (plus images/{typed,handwritten}/ and the per-format annotations/<format>/{typed,handwritten}/ trees); a typed and a handwritten render of the same logical document share one identity (see ADR-058 for the architectural commitment).
Response:
{
"data": {
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"credits_required": 2200
},
"meta": {
"request_id": "...",
"credits_charged": 2200
}
}
Example — single form:
curl -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"form_id": "irs_w2_2025",
"quantity": 50,
"output_formats": ["pdf_typed"],
"seed": 12345
}'
Example — multi-form (coherent generation):
Generate W-2 and 1040 documents for the same identities. Each identity's W-2 wages will match the 1040 line 1 wages, and SSN/name/address fields are consistent across both forms.
curl -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"form_ids": ["irs_w2_2025", "irs_f1040_2024"],
"quantity": 10,
"output_formats": ["pdf_typed"],
"seed": 42
}'
Note: Calling /generate twice with the same seed and different form_id values does not produce coherent data, because the identity biasing differs per form. Use form_ids to get matching documents.
Coherence Mode (ML ablation control)
Multi-form jobs accept a coherence_mode flag inside config that
controls how identities and field values relate across the forms of
a single item_index:
- coherent (default): one identity per item_index; every form in that item is filled from it. Cross-form correlations (same SSN on W-2 and 1040, same wages on W-2 box 1 and 1040 line 1) are preserved. Use this for realistic tax-prep datasets and any production workload.
- shuffled: identities are still generated one per item_index, but each field's final value vector is permuted across items with a seed derived from (seed, form_id, field_id). Marginal distributions and spatial layout stay valid; within-image correlations are broken. Use this for ML ablation experiments that need to measure how much a model leans on cross-field correlations inside one document.
- random: each (item_index, form_id) pair gets its own independently generated identity, so W-2 and 1040 in the same item will usually carry different SSNs, names, and addresses. Use this as the strongest ablation baseline, or to stress-test downstream code that should not assume identity continuity across forms.
Reproducibility: passing the same (form_ids, quantity, seed, output_formats, coherence_mode) produces byte-identical output for
all three modes. The chosen mode is recorded in manifest.json under
dataset_info.generation_params.coherence_mode so dataset consumers
can branch on it.
curl -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"form_ids": ["irs_w2_2025", "irs_f1040_2024"],
"quantity": 100,
"seed": 42,
"output_formats": ["pdf_typed"],
"config": {"coherence_mode": "shuffled"}
}'
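To build intuition for the shuffled mode, the sketch below permutes one field's values across items using a seed derived from (seed, form_id, field_id). It is an illustration of the idea only, not the service's implementation: the hashing scheme, the function name, and the field IDs box_1 and line_1 are assumptions.

```python
import hashlib
import random
from typing import List


def shuffle_field(values: List[str], seed: int, form_id: str, field_id: str) -> List[str]:
    """Permute one field's values across items, deterministically per triple."""
    # Derive a per-field seed by hashing the triple (hashing scheme is assumed).
    digest = hashlib.sha256(f"{seed}:{form_id}:{field_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    permuted = list(values)
    rng.shuffle(permuted)
    return permuted


wages = ["52000", "61000", "48000", "75000"]
w2 = shuffle_field(wages, 42, "irs_w2_2025", "box_1")
f1040 = shuffle_field(wages, 42, "irs_f1040_2024", "line_1")
# Each form still holds the same set of wage values (marginals preserved).
assert sorted(w2) == sorted(f1040) == sorted(wages)
# Re-running with the same inputs reproduces the same permutation.
assert w2 == shuffle_field(wages, 42, "irs_w2_2025", "box_1")
```

Because each field is permuted with its own seed, the value in a given item's W-2 generally no longer lines up with that item's 1040, which is the correlation break the mode is designed to produce.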
List Jobs
GET /api/v1/jobs
GET /api/v1/jobs?limit=20&status=completed
Returns your generation jobs, newest first. Supports cursor-based pagination.
Scope required: read
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
limit | integer | 50 | Results per page (1–100) |
cursor | string | — | Pagination cursor from a previous response |
status | string | — | Filter by status: pending, generating, completed, failed |
Pagination:
The response meta includes:
- has_more: true if more results exist beyond this page
- next_cursor: pass this as the cursor parameter to get the next page
# First page
curl "https://symagedocs.ai/api/v1/jobs?limit=10" \
-H "Authorization: Bearer sk_live_YOUR_KEY"
# Next page (using next_cursor from the previous response; URL-encode it,
# because a literal "+" in a query string is typically decoded as a space)
curl "https://symagedocs.ai/api/v1/jobs?limit=10&cursor=2026-03-24T10:00:00%2B00:00" \
-H "Authorization: Bearer sk_live_YOUR_KEY"
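Cursor pagination reduces to a small loop. In the sketch below, `fetch_page` is a caller-supplied function (hypothetical) that performs GET /jobs with the given cursor and returns the decoded envelope; the sketch assumes data is an array of job objects.

```python
from typing import Callable, Iterator, Optional


def iter_jobs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every job across pages.

    fetch_page(cursor) performs GET /jobs?cursor=... (cursor is None for the
    first page) and returns the decoded response envelope.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["data"]
        meta = page.get("meta", {})
        if not meta.get("has_more"):
            return
        cursor = meta["next_cursor"]
```

Keeping the HTTP call behind `fetch_page` lets the same loop be exercised with canned pages in tests and with a real client in production.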
Get Job Status
GET /api/v1/jobs/{job_id}
Returns detailed status for a specific job.
Scope required: read
Response fields:
| Field | Type | Description |
|---|---|---|
| job_id | string | Job identifier |
| status | string | pending, generating, completed, or failed |
| progress | float | Completion progress (0.0 to 1.0) |
| form_id | string | Form used for generation |
| form_name | string | Human-readable form name |
| quantity | integer | Number of documents requested |
| output_formats | string[] | Requested output formats |
| credits_required | integer | Total credits for this job |
| credits_charged | integer | Credits actually charged |
| seed | integer | Seed used (null if random) |
| error | string | Error message if job failed (null otherwise) |
| created_at | string | ISO 8601 creation timestamp |
| completed_at | string | ISO 8601 completion timestamp (null if not done) |
| generation_time_ms | integer | Total generation time in milliseconds (null if not completed) |
| worker_count | integer | Number of workers used for this job |
| peak_workers | integer | Peak concurrent workers during generation |
| throughput_docs_per_min | float | Generation throughput in documents per minute |
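Polling this endpoint until the job settles can be sketched as follows. `get_status` is a caller-supplied function (hypothetical) that performs GET /jobs/{job_id} and returns the data object; the interval and timeout defaults are arbitrary, and the lowercase canceled and expired values are assumed from the cancel endpoint's description.

```python
import time
from typing import Callable

# completed/failed come from the status field above; canceled/expired are assumed.
TERMINAL = {"completed", "failed", "canceled", "expired"}


def wait_for_job(
    get_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status and return its final record."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_status()
        if job["status"] in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        sleep(interval)
```

The caller should still inspect the returned status, since a failed job ends the loop just as a completed one does.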
Download Job Output
GET /api/v1/jobs/{job_id}/download/{format}
Downloads the generated output in the specified format. The job must be in completed status.
Scope required: read
Path parameters:
| Parameter | Values | Description |
|---|---|---|
| format | bundle, csv, json | One format per request. Per-format downloads (pdf_typed, pdf_handwritten, coco, bio, yolo, funsd, donut, paperlives, hf_datasets, coco_layout, funsd_layout) have been retired in favor of bundle. Requesting any of those returns 400 code=invalid_format. |
Response: Binary file download.
- bundle returns a single ZIP archive containing every artifact the job produced (PDFs, PNGs, JSON, CSV, and any ML annotations that were requested).
- json returns the aggregated structured data as JSON.
- csv returns the aggregated data as a CSV file.
For per-file access (e.g., to fetch a single PDF), use GET /jobs/{job_id}/downloads to obtain presigned S3 URLs.
Example:
# Download every artifact as a single ZIP
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/bundle \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-o output.zip
# Download JSON data
curl https://symagedocs.ai/api/v1/jobs/JOB_ID/download/json \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-o data.json
Presigned Downloads
Get Job Download URLs
GET /api/v1/jobs/{job_id}/downloads
Returns presigned S3 URLs for all output files of a completed job. Files can be downloaded directly from S3 without proxying through the server.
Scope required: read
Response:
{
"data": {
"job_id": "550e8400-...",
"files": [
{
"filename": "irs_w2_2025_abc123.pdf",
"url": "https://s3.amazonaws.com/...presigned",
"content_type": "application/pdf",
"item_index": 0
}
],
"expires_in": 3600
}
}
URLs expire after 1 hour. Request new URLs if needed.
Cancel a Job
POST /api/v1/jobs/{job_id}/cancel
Cancels a running job. Items already rendered when the cancel takes effect remain on storage and can be downloaded via the bundle endpoint. Idempotent: cancelling a job that's already CANCELING / CANCELED returns 200 with the current state. Cancelling a terminal-but-not-canceled job (COMPLETED / FAILED / EXPIRED) returns 409.
Scope required: generate
Response: standard envelope with {"job_id", "status"} where status is either canceling (worker is draining its current item) or canceled (job was still PENDING and skipped straight to terminal). The worker observes the cancel flag at the per-item boundary; refunding of unspent credits happens automatically.
Idempotency-Key Header
POST /api/v1/generate accepts an optional Idempotency-Key header. When the same (api-key user, key) pair is seen again within 24 hours the original job_id is returned and credits are not debited a second time. Useful for safe network retries.
curl -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Idempotency-Key: my-retry-safe-key-001" \
-H "Content-Type: application/json" \
-d '{"form_id": "irs_w2_2025", "quantity": 100}'
The replay response sets meta.idempotent_replay = true and meta.credits_charged = 0.
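One way to make retries reuse the same key is to derive it from the request body. This is a sketch of that idea, not a documented convention: the helper name, the `gen-` prefix, and the 32-character hash length are assumptions, and any server-side limits on key format are not stated here.

```python
import hashlib
import json


def idempotency_key(payload: dict, prefix: str = "gen") -> str:
    """Derive a stable key from the request body so a retry reuses the same key."""
    # Canonical JSON (sorted keys, fixed separators) makes the hash independent
    # of the order in which the payload's keys were written.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return f"{prefix}-{hashlib.sha256(canonical.encode()).hexdigest()[:32]}"
```

A purely content-derived key means two intentionally identical requests within 24 hours would be treated as a replay; include a caller-chosen nonce in the hashed payload if distinct jobs are wanted.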
Per-Item Endpoints
Per-item training-data access lives on the /jobs surface:
| Endpoint | Description |
|---|---|
| GET /api/v1/jobs/{job_id}/items | Cursor-paginated list of per-item records. Each item carries presigned download URLs. |
| GET /api/v1/jobs/{job_id}/items/{item_id}/download | Presigned URLs for a single item's files. |
Raw Identities
Generate Identities
POST /api/v1/identities
Generates synthetic identities as raw JSON — no form rendering, no PDF output. Useful for testing, schema exploration, or when you need identity data without filled documents.
Scope required: generate
Request body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| quantity | integer | no | 1 | Number of identities (1–10,000) |
| config | object | no | null | Generation configuration |
| seed | integer | no | null | Seed for reproducibility |
Response: The standard envelope with data containing an array of identity objects. Each identity has fields like first_name, last_name, ssn, dob, address, employer information, and more.
Example:
curl -X POST https://symagedocs.ai/api/v1/identities \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"quantity": 5, "seed": 42}'
Account
Get Credit Balance
GET /api/v1/account/balance
Returns your credit usage totals.
Scope required: account
Response fields:
| Field | Type | Description |
|---|---|---|
| credits_used | integer | Total credits charged across all completed jobs |
| credits_allocated | integer | Total credits reserved across all jobs (including pending) |
Get Usage Summary
GET /api/v1/account/usage?days=30
Returns a usage summary for the specified period.
Scope required: account
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| days | integer | 30 | Number of days to summarize (1–365) |
Response fields:
| Field | Type | Description |
|---|---|---|
| period_days | integer | Period covered |
| total_jobs | integer | Number of jobs created |
| total_items_generated | integer | Total documents generated |
| total_credits_used | integer | Total credits charged |
| jobs_by_status | object | Job count by status (completed, failed, etc.) |
Webhooks
Webhooks let you receive real-time HTTP notifications when events occur — no polling required. Register a URL, and SymageDocs will POST a signed JSON payload whenever a subscribed event fires.
Events
| Event | Fires when |
|---|---|
| job.completed | A generation job finishes successfully (includes presigned download URLs) |
| job.failed | A generation job fails |
Inline Registration via webhook_url
Instead of pre-registering webhooks, pass a webhook_url parameter when creating a job:
curl -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"form_id": "irs_w2_2025",
"quantity": 10,
"webhook_url": "https://your-server.com/hooks"
}'
A temporary webhook subscription is auto-created for job.completed and job.failed.
Register a Webhook
POST /api/internal/webhooks
Auth: Session (web UI) — uses x-user-id header, not API key auth.
Request body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | yes | — | HTTPS endpoint to receive events (max 2,048 chars) |
| events | string[] | no | all four events | Event types to subscribe to |
| description | string | no | null | Optional label (max 500 chars) |
Response:
{
"webhook_id": "550e8400-...",
"signing_secret": "whsec_a1b2c3d4e5f6...",
"secret_prefix": "whsec_a1b2c3d4",
"url": "https://example.com/hook",
"events": ["job.completed", "job.failed"],
"description": null,
"created_at": "2026-03-26T12:00:00+00:00"
}
Important: The signing_secret is shown once at creation. Store it securely; you'll need it to verify payload signatures.
List Webhooks
GET /api/internal/webhooks
Returns all active webhooks for the current user. The full signing secret is never included — only the secret_prefix for identification.
Update a Webhook
PATCH /api/internal/webhooks/{webhook_id}
Update a webhook's URL, subscribed events, description, or active status. All fields are optional — only include what you want to change.
Request body:
| Field | Type | Description |
|---|---|---|
| url | string | New endpoint URL |
| events | string[] | New event subscriptions |
| description | string | New description |
| is_active | boolean | Enable or disable delivery |
Disable a Webhook
DELETE /api/internal/webhooks/{webhook_id}
Soft-deletes the webhook. It stops receiving events immediately but remains in the database for audit purposes.
Payload Format
Every webhook delivery is a POST request with a JSON body:
{
"event": "job.completed",
"timestamp": "2026-03-30T12:05:00+00:00",
"webhook_id": "550e8400-...",
"data": {
"job_id": "661e9500-...",
"status": "completed",
"form_id": "irs_w2_2025",
"quantity": 100,
"credits_charged": 2000,
"completed_at": "2026-03-30T12:05:00+00:00",
"download_urls": [
{ "filename": "irs_w2_2025_abc123.pdf", "url": "https://s3...presigned" },
{ "filename": "irs_w2_2025_abc123.json", "url": "https://s3...presigned" }
]
}
}
Headers included with every delivery:
| Header | Description |
|---|---|
| X-Symagedocs-Signature | HMAC-SHA256 hex digest of the raw body |
| X-Symagedocs-Event | Event type (e.g., job.completed) |
| X-Symagedocs-Delivery | Unique delivery ID for deduplication |
Verifying Signatures
Verify the X-Symagedocs-Signature header to ensure the payload came from SymageDocs and wasn't tampered with:
# Python
import hashlib, hmac

def verify_webhook(body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(expected, signature)

// Node.js
const crypto = require("crypto");

function verifyWebhook(body, signature, secret) {
  const expected = Buffer.from(crypto.createHmac("sha256", secret).update(body).digest("hex"));
  const received = Buffer.from(signature);
  // timingSafeEqual throws when lengths differ, so compare lengths first.
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}
Retry Policy
Failed deliveries (non-2xx response or network error) are retried up to 3 times with exponential backoff:
| Attempt | Delay |
|---|---|
| 1st retry | 10 seconds |
| 2nd retry | 60 seconds |
| 3rd retry | 300 seconds (5 minutes) |
After all retries are exhausted, the delivery is marked as failed. Your webhook remains active and will continue receiving future events.
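Because deliveries may be retried, a receiver should tolerate seeing the same event more than once. The sketch below combines signature verification with deduplication on X-Symagedocs-Delivery. It assumes a retried delivery carries the same delivery ID as the original attempt; the function name and the in-memory set are illustrative.

```python
import hashlib
import hmac

_seen_deliveries = set()  # in-memory only; a real service would persist this


def handle_delivery(body: bytes, headers: dict, secret: str) -> str:
    """Return "rejected", "duplicate", or "processed" for one webhook delivery."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    received = headers.get("X-Symagedocs-Signature", "")
    # Compare as bytes so an unexpected non-ASCII header value cannot raise.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        return "rejected"
    delivery_id = headers.get("X-Symagedocs-Delivery")
    if delivery_id in _seen_deliveries:
        return "duplicate"  # assumed to be a retry of something already handled
    _seen_deliveries.add(delivery_id)
    # Act on the parsed event here (for example, enqueue the download URLs).
    return "processed"
```

Respond with a 2xx status for both "processed" and "duplicate" so the sender stops retrying; keep the handler fast and push slow work to a queue.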
Tabular Generation
The tabular API generates structured CSV/JSON data from either a natural-language description or an explicit column schema. It supports 50+ column types including identity-derived fields (names, SSNs, addresses) and synthetic fields (random numbers, enums, patterns).
Parse NL Description
POST /api/v1/tabular/parse
Converts a natural-language description into a structured column schema using an LLM with heuristic fallback.
Scope required: generate
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
| prompt | string | yes | Natural language description of desired columns (1–2,000 chars) |
Example:
curl -X POST https://symagedocs.ai/api/v1/tabular/parse \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "first name, last name, age, city, state, annual salary"}'
Response:
{
"data": {
"columns": [
{ "name": "first_name", "type": "text_first_name", "constraints": {} },
{ "name": "last_name", "type": "text_last_name", "constraints": {} },
{ "name": "age", "type": "identity_age", "constraints": {} },
{ "name": "city", "type": "address_city", "constraints": {} },
{ "name": "state", "type": "address_state", "constraints": {} },
{
"name": "annual_salary",
"type": "number_integer",
"constraints": { "min": 20000, "max": 200000 }
}
],
"parser_type": "llm",
"model": "claude-haiku-4-5-20251001"
},
"meta": { "request_id": "...", "count": 6 }
}
You can modify the returned columns (rename, remove, adjust constraints) before passing them to the generate endpoint.
Generate Tabular Data
POST /api/v1/tabular/generate
Creates an async tabular generation job from an explicit column schema. Poll the status endpoint for progress.
Scope required: generate
Request body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | object[] | yes | — | Column definitions (1–50 columns, see below) |
| quantity | integer | no | 100 | Number of rows (1–10,000) |
| output_formats | string[] | no | ["csv"] | Output formats: csv, json |
| seed | integer | no | null | Seed for reproducibility |
Column definition:
Each column is an object with:
| Field | Type | Description |
|---|---|---|
| name | string | Column header name |
| type | string | Column type (see common types below) |
| constraints | object | Type-specific parameters (min, max, values, weights, etc.) |
Common column types:
| Type | Description | Constraints |
|---|---|---|
| text_first_name | First name from synthetic identity | — |
| text_last_name | Last name | — |
| text_full_name | Full name | — |
| ssn | SSN (###-##-####) | — |
| identity_age | Age in years | — |
| identity_gender | Gender (M/F) | — |
| address_city | City | — |
| address_state | State abbreviation | — |
| address_zip | ZIP code | — |
| phone | Phone number | — |
| email | Email address | — |
| number_integer | Random integer | min, max, distribution_type, distribution_params |
| number_decimal | Random decimal | min, max, decimals |
| custom_enum | Weighted random choice | values: [...], weights: [...] |
| custom_mask | Pattern-based string | mask (A=upper, a=lower, 9=digit, X=any) |
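The limits in the tables above can be checked on the client before a request is sent. The following Python sketch assembles a request body and rejects values outside the documented ranges. The helper name build_tabular_request is hypothetical and not part of the SDK; the mask value and the scale of the weights are illustrative assumptions. Only the field names, defaults, and ranges come from this section.

```python
import json

MAX_COLUMNS = 50          # documented limit: 1-50 columns
MAX_QUANTITY = 10_000     # documented limit: 1-10,000 rows
ALLOWED_FORMATS = {"csv", "json"}


def build_tabular_request(columns, quantity=100, output_formats=("csv",), seed=None):
    """Return a JSON-serializable body for POST /tabular/generate (hypothetical helper)."""
    if not 1 <= len(columns) <= MAX_COLUMNS:
        raise ValueError(f"expected 1-{MAX_COLUMNS} columns, got {len(columns)}")
    if not 1 <= quantity <= MAX_QUANTITY:
        raise ValueError(f"quantity must be 1-{MAX_QUANTITY}, got {quantity}")
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")
    for col in columns:
        missing = {"name", "type"} - set(col)
        if missing:
            raise ValueError(f"column {col!r} is missing {sorted(missing)}")
    body = {
        "columns": [
            {"name": c["name"], "type": c["type"], "constraints": c.get("constraints", {})}
            for c in columns
        ],
        "quantity": quantity,
        "output_formats": list(output_formats),
    }
    if seed is not None:
        body["seed"] = seed  # omit the key entirely to keep the server default (null)
    return body


body = build_tabular_request(
    [
        # Mask letters follow the table above; "AA9999" is an illustrative value.
        {"name": "employee_id", "type": "custom_mask", "constraints": {"mask": "AA9999"}},
        # The weight scale (fractions summing to 1) is an assumption.
        {"name": "plan", "type": "custom_enum",
         "constraints": {"values": ["basic", "plus"], "weights": [0.7, 0.3]}},
    ],
    quantity=250,
)
print(json.dumps(body, indent=2))
```

The server remains the authority on validation; a local check like this only saves a round trip for obviously malformed requests.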
Example:
curl -X POST https://symagedocs.ai/api/v1/tabular/generate \
-H "Authorization: Bearer sk_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"columns": [
{"name": "first_name", "type": "text_first_name", "constraints": {}},
{"name": "last_name", "type": "text_last_name", "constraints": {}},
{"name": "age", "type": "identity_age", "constraints": {}},
{"name": "salary", "type": "number_integer", "constraints": {"min": 30000, "max": 150000}}
],
"quantity": 1000,
"output_formats": ["csv", "json"]
}'
Response:
{
"data": {
"job_id": "550e8400-...",
"status": "processing",
"credits_required": 20
},
"meta": { "request_id": "...", "credits_charged": 20 }
}
Credits: The base rate is 1 credit per 50 rows (0.02 credits/row). A column surcharge applies only when the schema has more than 10 columns: every 500 "extra cells" (extra cells = rows × (columns − 10)) cost 1 additional credit. Final cost = ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500)). The example above uses 4 columns and 1,000 rows: base = 1,000/50 = 20 credits, no surcharge (4 columns does not exceed the 10-column threshold), total = 20 credits.
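The formula above can be reproduced locally to budget a job before submitting it. This is a minimal Python sketch of that arithmetic; the function name tabular_credits is hypothetical, and the constants are copied from the formula rather than fetched from the API.

```python
ROWS_PER_CREDIT = 50          # base rate: 1 credit per 50 rows
EXTRA_COLUMN_THRESHOLD = 10   # surcharge applies above this many columns
EXTRA_CELLS_PER_CREDIT = 500  # 1 credit per 500 extra cells


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def tabular_credits(rows, cols):
    """Credits for a tabular job, following the formula stated above."""
    extra_cells = rows * max(cols - EXTRA_COLUMN_THRESHOLD, 0)
    surcharge = ceil_div(extra_cells, EXTRA_CELLS_PER_CREDIT)
    # The surcharge is already an integer, so
    # ceil(rows/50 + surcharge) == ceil(rows/50) + surcharge.
    return ceil_div(rows, ROWS_PER_CREDIT) + surcharge


print(tabular_credits(1000, 4))    # 20: the worked example above
print(tabular_credits(1000, 20))   # 40: 20 base + 20 surcharge
print(tabular_credits(1001, 12))   # 26: 21 base + 5 surcharge
```

Both terms round up, so a job one row past a 50-row boundary pays for the next full credit.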
Get Tabular Job Status
GET /api/v1/tabular/{job_id}/status
Returns progress and ETA for a tabular generation job.
Scope required: read
Response fields:
| Field | Type | Description |
|---|---|---|
| job_id | string | Job identifier |
| status | string | processing, completed, or failed |
| rows_generated | integer | Rows completed so far |
| total_rows | integer | Total rows requested |
| percent | float | Completion percentage (0.0–100.0) |
| estimated_seconds_remaining | float | ETA in seconds (null if unavailable) |
| error | string | Error message if failed (null otherwise) |
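A polling loop over these fields needs a stop condition for both terminal statuses and a timeout so it cannot spin forever. The Python sketch below shows one way to structure it. The function wait_for_tabular_job and its fetch_status callable are hypothetical: the callable is assumed to perform the GET request and return the fields from the table above as a dict (unwrapping any response envelope first).

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_tabular_job(fetch_status, poll_interval=2.0, timeout=600.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status, then return the last payload.

    fetch_status is a caller-supplied callable (an assumption of this sketch)
    that calls GET /tabular/{job_id}/status and returns the status fields.
    """
    deadline = clock() + timeout
    while True:
        payload = fetch_status()
        if payload["status"] in TERMINAL_STATUSES:
            return payload
        if clock() >= deadline:
            raise TimeoutError(f"job still {payload['status']} after {timeout} s")
        sleep(poll_interval)


# Demo with canned responses instead of real HTTP calls.
canned = iter([
    {"job_id": "demo", "status": "processing", "rows_generated": 400,
     "total_rows": 1000, "percent": 40.0,
     "estimated_seconds_remaining": 1.5, "error": None},
    {"job_id": "demo", "status": "completed", "rows_generated": 1000,
     "total_rows": 1000, "percent": 100.0,
     "estimated_seconds_remaining": None, "error": None},
])
final = wait_for_tabular_job(lambda: next(canned), sleep=lambda seconds: None)
print(final["status"], final["percent"])  # completed 100.0
```

The caller should inspect the returned status: a failed job returns normally here, with the reason in the error field.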
Download Tabular Output
GET /api/v1/tabular/{job_id}/download/{format}
Downloads the generated tabular data. The job must be in completed status. Data expires after 48 hours.
Scope required: read
Path parameters:
| Parameter | Values | Description |
|---|---|---|
| format | csv, json | One format per request. |
Response: File download (text/csv or application/json).
Errors:
| Code | Meaning |
|---|---|
| 400 | Invalid format (not csv or json) |
| 404 | Job not found |
| 409 | Job not yet completed |
| 410 | Data expired (older than 48 hours) |
Pricing
These endpoints require no authentication and can be used to preview costs before creating jobs.
Get Pricing Rates
GET /api/v1/pricing/rates
Returns the current credit rate constants used for cost calculation.
Authentication: None required.
Response:
{
"csv": {
"rows_per_credit": 50,
"extra_column_threshold": 10,
"extra_cells_per_credit": 500
},
"pdf_typed": {
"base_credits": 20,
"field_threshold": 25,
"surcharge_per_band": 2,
"band_size": 25
},
"pdf_handwritten": {
"base_credits": 40,
"field_threshold": 25,
"surcharge_per_band": 4,
"band_size": 25
},
"degradation_multipliers": { ... },
"rounding": {"free_threshold": 0, "minimum_credits": 1, "method": "ceil"},
"bundling": "PDF output includes CSV/JSON free"
}
Estimate Credit Cost
POST /api/v1/pricing/estimate
Estimates the credit cost for a job without creating it.
Authentication: None required.
Request body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
field_count | integer | yes | — | Number of fields/columns on the form |
output_formats | string[] | yes | — | Output formats (e.g., ["pdf_typed"]) |
record_count | integer | yes | — | Number of records/rows to generate |
degradation_profile | string | no | null | Degradation profile to apply |
Response:
{
"credits": 2200
}
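Because the estimate endpoint needs no authentication, a request can be built with the standard library alone. The Python sketch below constructs the request without sending it; build_estimate_request is a hypothetical helper name, and the field names are taken from the request table above.

```python
import json
import urllib.request

BASE_URL = "https://symagedocs.ai/api/v1"


def build_estimate_request(field_count, output_formats, record_count,
                           degradation_profile=None):
    """Build (but do not send) a POST /pricing/estimate request."""
    payload = {
        "field_count": field_count,
        "output_formats": list(output_formats),
        "record_count": record_count,
    }
    if degradation_profile is not None:
        payload["degradation_profile"] = degradation_profile
    return urllib.request.Request(
        f"{BASE_URL}/pricing/estimate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # no Authorization header needed
        method="POST",
    )


req = build_estimate_request(field_count=50, output_formats=["pdf_typed"], record_count=100)
print(req.get_method(), req.full_url)

# To send it and read the documented "credits" field:
# with urllib.request.urlopen(req) as resp:
#     credits = json.load(resp)["credits"]
```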
Bundle Size Estimation
These endpoints require no authentication and can be used to preview the compressed download size of a job before submitting it. The estimates are calibrated against real staging bundles (see calibration_sample_count and calibrated_at in the response). Expected accuracy is +/- 30% for 80% of jobs.
Get Bundle-Size Rate Constants
GET /api/v1/jobs/size-rates
Returns the calibrated per-format byte constants and tier thresholds used by the size estimator.
Authentication: None required.
Response:
{
"rates": {
"pdf_typed_bytes_per_file": 518426,
"pdf_handwritten_bytes_per_file": 2240759,
"png_typed_bytes_per_file": 1549775,
"png_handwritten_bytes_per_file": 549089,
"funsd_bytes_per_file": 5093,
"bio_bytes_per_instance": 43661,
"yolo_bytes_per_file": 1960,
"coco_bytes_per_surface": 55508,
"donut_bytes_per_row": 2703,
"ground_truth_json_bytes_per_file": 1984,
"ground_truth_csv_bytes_per_job": 4698,
"bundle_overhead_bytes_per_job": 2757,
"funsd_emit_rate": 0.858,
"bio_emit_rate": 1.0
},
"tiers": {
"small_max_bytes": 104857600,
"medium_max_bytes": 1073741824
},
"expected_accuracy": "+/- 30%",
"calibration_sample_count": 296,
"calibrated_at": "2026-05-08"
}
Estimate Bundle Download Size
POST /api/v1/jobs/size-estimate
Estimates the compressed zip download size for a job based on its forms, page counts, output formats, and record count.
Authentication: None required.
Request body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
field_counts | integer[] | yes | — | Number of fields on each form (one entry per form) |
page_counts | integer[] | yes | — | Number of pages on each form (must match field_counts length) |
output_formats | string[] | no | [] | Requested output format tokens (e.g., ["pdf_typed", "bio"]) |
record_count | integer | yes | — | Number of identities to generate (>= 0) |
field_counts and page_counts must have the same length (>= 1). A length mismatch returns 422.
Unknown or rejected format tokens (e.g. "pdf_unicorn") return 400 with code=invalid_output_format.
Response:
{
"total_bytes": 5255252,
"human_readable": "~5.0 MB",
"tier": "small",
"breakdown": {
"pdf_typed": 5184260,
"pdf_handwritten": 0,
"png_typed": 0,
"png_handwritten": 0,
"funsd": 43697,
"bio": 0,
"yolo": 0,
"coco": 0,
"donut": 0,
"ground_truth": 19840,
"overhead": 7455
},
"expected_accuracy": "+/- 30%",
"calibration_sample_count": 296,
"calibrated_at": "2026-05-08",
"notes": []
}
Tier values:
| Tier | Range |
|---|---|
| small | < 100 MB |
| medium | 100 MB to < 1 GB |
| large | >= 1 GB |
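The tier boundaries can be applied to any byte count, which helps when checking whether the stated +/- 30% accuracy could push a job into the next tier. This Python sketch uses the thresholds from the size-rates response; the function names are hypothetical, and treating each threshold as an exclusive upper bound is an assumption read from the table above.

```python
SMALL_MAX_BYTES = 104_857_600      # "small_max_bytes" (100 MiB)
MEDIUM_MAX_BYTES = 1_073_741_824   # "medium_max_bytes" (1 GiB)


def size_tier(total_bytes):
    """Map a byte count to small / medium / large per the tier table."""
    if total_bytes < SMALL_MAX_BYTES:
        return "small"
    if total_bytes < MEDIUM_MAX_BYTES:
        return "medium"
    return "large"


def accuracy_band(total_bytes, percent=30):
    """Return (low, high) byte counts for an estimate with +/- percent accuracy."""
    low = total_bytes * (100 - percent) // 100
    high = total_bytes * (100 + percent) // 100
    return low, high


estimate = 5_255_252  # total_bytes from the example response above
low, high = accuracy_band(estimate)
print(size_tier(estimate), size_tier(low), size_tier(high))  # small small small
```

When the low and high ends of the band land in different tiers, it is safer to plan storage and transfer for the larger one.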
Notes field: When record_count is 0, notes includes "record_count = 0; size shown is per-job overhead floor." and total_bytes reflects only the fixed per-job overhead.
FUNSD behavior: FUNSD ground-truth files are always included in the bundle when any render format (pdf_typed, pdf_handwritten, png_typed, png_handwritten) is requested. The funsd breakdown entry will be non-zero whenever a render format is present, regardless of whether funsd appears in output_formats.
API Key Management
Manage API keys programmatically. All key management endpoints use session authentication (x-user-id header from the web UI), not Bearer token auth.
Create an API Key
POST /api/internal/api-keys
Creates a new API key. The raw key is returned once — store it securely.
Request body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | yes | — | Key label (1–200 chars) |
| scopes | string[] | no | ["generate", "read", "account"] | Scopes for this key |
Response:
{
"key_id": "abc-123",
"api_key": "sk_live_a1b2c3d4e5f6...",
"prefix": "sk_live_a1b2c3d4",
"name": "My Key",
"scopes": ["generate", "read", "account"],
"created_at": "2026-03-26T12:00:00+00:00"
}
List API Keys
GET /api/internal/api-keys
Returns all API keys for the current user. Full keys are never included — only the prefix for identification.
Response fields per key:
| Field | Type | Description |
|---|---|---|
| key_id | string | Key identifier |
| prefix | string | First 16 characters of the key |
| name | string | Key label |
| scopes | string[] | Granted scopes |
| tier | string | Rate limit tier |
| last_used_at | string | Last usage timestamp (null if never used) |
| created_at | string | Creation timestamp |
| is_active | boolean | Whether the key is active |
Revoke an API Key
DELETE /api/internal/api-keys/{key_id}
Permanently disables a key. Takes effect immediately and cannot be undone.
Rotate an API Key
POST /api/internal/api-keys/{key_id}/rotate
Creates a new key and revokes the old one atomically. The new key inherits the old key's name and scopes.
Response:
{
"old_key_id": "abc-123",
"new_key": {
"key_id": "def-456",
"api_key": "sk_live_x9y8z7w6...",
"prefix": "sk_live_x9y8z7w6",
"name": "My Key",
"scopes": ["generate", "read", "account"],
"created_at": "2026-04-01T12:00:00+00:00"
}
}
Training Data (ML)
/generate jobs can emit labeled training data for document AI models. Pass label_scheme inside the request config, request the ML output formats you need (bio, coco, donut, yolo) alongside a render format (pdf_typed/pdf_handwritten), and pull per-item annotations from the /jobs/{id}/items surface.
Label schemes:
| Scheme | Description |
|---|---|
| field_id | Labels use form field IDs (e.g., employee_ssn) |
| semantic_concept | Labels use semantic concepts (e.g., social_security_number) |
| nist3 | Labels use NIST Type-3 categories |
| field_type | Labels use field types (e.g., ssn, currency, text) |
Workflow:
1. POST /generate {form_id, quantity, output_formats: ["pdf_typed", "bio", "coco"],
config: {label_scheme: "nist3"}}
2. GET /jobs/{job_id} → poll until status = "completed"
3. GET /jobs/{job_id}/items → enumerate items (paginated)
4. GET /jobs/{job_id}/items/{item_id}/download → presigned URLs (PDF, JSON, COCO, BIO)
For a complete working example, see the Python SDK's train_kie_model.py example, which demonstrates creating a job with NIST3 labels and iterating over training examples with BIO labels and spatial annotations.
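Step 1 of the workflow combines three rules from this section: a known label scheme, at least one ML output format, and a render format alongside it. The Python sketch below assembles such a /generate body and checks those rules locally. The helper build_training_job is hypothetical and is not how the SDK does it; the allowed values are the ones listed above.

```python
LABEL_SCHEMES = {"field_id", "semantic_concept", "nist3", "field_type"}
RENDER_FORMATS = {"pdf_typed", "pdf_handwritten"}
ML_FORMATS = {"bio", "coco", "donut", "yolo"}


def build_training_job(form_id, quantity, ml_formats, label_scheme,
                       render_format="pdf_typed"):
    """Return a request body for POST /generate that emits labeled training data."""
    if label_scheme not in LABEL_SCHEMES:
        raise ValueError(f"unknown label_scheme: {label_scheme!r}")
    if render_format not in RENDER_FORMATS:
        raise ValueError(f"unknown render format: {render_format!r}")
    ml_formats = list(ml_formats)
    if not ml_formats:
        raise ValueError("request at least one ML output format")
    unknown = set(ml_formats) - ML_FORMATS
    if unknown:
        raise ValueError(f"unknown ML formats: {sorted(unknown)}")
    return {
        "form_id": form_id,
        "quantity": quantity,
        # ML formats are requested alongside a render format.
        "output_formats": [render_format, *ml_formats],
        "config": {"label_scheme": label_scheme},
    }


job = build_training_job("irs_w2_2025", 100, ["bio", "coco"], label_scheme="nist3")
print(job["output_formats"])  # ['pdf_typed', 'bio', 'coco']
```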
Anti-pattern: list-valued fields in semantic_concept scheme. When two or more form fields share the same semantic_concept, the Donut exporter collapses their values into a JSON array, and the BIO exporter emits the same B/I tag pair for every token, destroying positional information. This is an anti-pattern. If a form has multiple fixed-position slots for what is conceptually the same data (e.g., the four W-2 Box 12 rows), each slot must have its own per-slot concept (e.g., tax.box12a.code, tax.box12b.code, …).
Donut consumer conventions
Each row in metadata.jsonl has a ground_truth field containing a JSON string of the form {"gt_parse": {...}}. The consumer — typically a HuggingFace Donut training script — is responsible for serializing gt_parse into Donut's tagged-token sequence. Two conventions apply.
Nested list values. When multiple form fields resolve to the same leaf path in the gt_parse tree (e.g., per-slot W-2 Box 12 amounts that share a parent path), DonutExporter collects their values into a JSON array at that leaf. The canonical Donut recursive flattener (used in clovaai/donut's token2json round-trip) emits one sibling tagged block per list element:
// gt_parse input (list-valued leaf)
{ "box12": { "amount": ["1200.00", "450.00"] } }
// Donut token sequence (sibling blocks per element)
<s_box12><s_amount>1200.00</s_amount><s_amount>450.00</s_amount></s_box12>
This is correct Donut convention. A consumer should not assume scalar values at every leaf.
Task-prefix wrapping token. Donut requires a per-task start token (e.g., <s_w2>, <s_invoice>) wrapping every gt_parse sequence. symagedocs does not emit this token — metadata.jsonl rows contain only the bare gt_parse object. The consumer must prepend the task token before serializing the sequence. Derive it from the request's form_id (e.g., irs_w2_2025 → <s_irs_w2_2025>, or <s_w2> for a coarser scheme). The token must also be added to the tokenizer's additional_special_tokens and the model's resized embeddings — standard Donut fine-tuning setup.
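The two conventions above can be combined in a small consumer-side serializer. The Python sketch below is an illustration only: it follows the sibling-block convention as described in this section and is not copied from any Donut implementation, so compare its output with the flattener in your own training script, since implementations differ in how they order keys and serialize lists. The function names are hypothetical, and keys are emitted in insertion order as an assumption.

```python
import json


def flatten_gt_parse(node):
    """Serialize a gt_parse subtree into tagged tokens.

    A list value yields one sibling <s_key>...</s_key> block per element,
    matching the convention described above.
    """
    if isinstance(node, dict):
        parts = []
        for key, value in node.items():
            items = value if isinstance(value, list) else [value]
            for item in items:
                parts.append(f"<s_{key}>{flatten_gt_parse(item)}</s_{key}>")
        return "".join(parts)
    return str(node)


def row_to_sequence(metadata_row, task_token):
    """Prepend the task token to the flattened gt_parse of one metadata.jsonl row."""
    gt_parse = json.loads(metadata_row["ground_truth"])["gt_parse"]
    return task_token + flatten_gt_parse(gt_parse)


# ground_truth is a JSON string, as stated above.
row = {"ground_truth": json.dumps(
    {"gt_parse": {"box12": {"amount": ["1200.00", "450.00"]}}})}
print(row_to_sequence(row, "<s_irs_w2_2025>"))
# <s_irs_w2_2025><s_box12><s_amount>1200.00</s_amount><s_amount>450.00</s_amount></s_box12>
```

Every tag this produces, including the task token, still has to be registered with the tokenizer as described above.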
Credit Pricing
Credits are charged when a generation job is created. The cost depends on the form's field count, the output format, and the quantity.
Pricing Formulas
Typed PDF (standard):
cost_per_document = 20 + ceil(max(field_count - 25, 0) / 25) * 2
total_credits = cost_per_document * quantity
Handwritten PDF:
cost_per_document = 40 + ceil(max(field_count - 25, 0) / 25) * 4
total_credits = cost_per_document * quantity
JSON and CSV ground truth: Every bundle automatically includes ground_truth/json/ (per-document structured JSON) and ground_truth/data.csv (aggregated tabular CSV) at no additional cost. These are not selectable in output_formats — passing "json" or "csv" there returns 400 code=invalid_output_format.
Examples
| Form | Fields | Quantity | Formats | Credits |
|---|---|---|---|---|
| W-2 | 20 | 100 | pdf_typed | 2,000 |
| 1040 | 50 | 100 | pdf_typed | 2,200 |
| 1040 | 50 | 10 | pdf_typed | 220 |
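The rows in this table follow directly from the two PDF formulas. As a minimal Python sketch of that arithmetic (the function name pdf_credits is hypothetical, and degradation multipliers are deliberately left out):

```python
import math

FIELD_THRESHOLD = 25  # fields included in the base price
BAND_SIZE = 25        # each started band of extra fields adds a surcharge


def pdf_credits(field_count, quantity, handwritten=False):
    """Total credits for typed or handwritten PDF output, per the formulas above."""
    base, per_band = (40, 4) if handwritten else (20, 2)
    bands = math.ceil(max(field_count - FIELD_THRESHOLD, 0) / BAND_SIZE)
    return (base + bands * per_band) * quantity


print(pdf_credits(20, 100))                   # 2000 (W-2 row)
print(pdf_credits(50, 100))                   # 2200 (1040 row)
print(pdf_credits(50, 10, handwritten=True))  # 440
```

A form with 26 fields already starts the first surcharge band, so it costs the same per document as one with 50 fields.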
Tabular data: Base rate is 1 credit per 50 rows. A surcharge of 1 credit per 500 extra cells applies when columns exceed 10 (extra cells = rows × (columns − 10)). Final cost = ceil(rows/50 + ceil(rows × max(cols−10, 0) / 500)).
| Data | Columns | Rows | Credits |
|---|---|---|---|
| Tabular | 5 | 1,000 | 20 |
| Tabular | 20 | 1,000 | 40 |
| Tabular | 10 | 5,000 | 100 |
| Tabular | 50 | 10,000 | 1,000 |
Use the form catalog's credit_cost field to see the per-document cost for each form at default (typed PDF) output.
Common Workflows
Batch Document Generation
# 1. Find the form you need
curl -s "https://symagedocs.ai/api/v1/forms?category=tax" \
-H "Authorization: Bearer $API_KEY" | jq '.data[] | {id, name, credit_cost}'
# 2. Create a generation job
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"form_id": "irs_w2_2025", "quantity": 100, "output_formats": ["pdf_typed"]}' \
| jq -r '.data.job_id')
echo "Job created: $JOB_ID"
# 3. Poll until complete
while true; do
STATUS=$(curl -s https://symagedocs.ai/api/v1/jobs/$JOB_ID \
-H "Authorization: Bearer $API_KEY" | jq -r '.data.status')
echo "Status: $STATUS"
[ "$STATUS" = "completed" ] && break
[ "$STATUS" = "failed" ] && echo "Job failed!" && exit 1
sleep 5
done
# 4. Download results — every artifact in one zip
curl https://symagedocs.ai/api/v1/jobs/$JOB_ID/download/bundle \
-H "Authorization: Bearer $API_KEY" -o w2_bundle.zip
# Or just the structured data
curl https://symagedocs.ai/api/v1/jobs/$JOB_ID/download/json \
-H "Authorization: Bearer $API_KEY" -o w2_data.json
Tabular Data Generation
# 1. Parse a natural language description into a schema
SCHEMA=$(curl -s -X POST https://symagedocs.ai/api/v1/tabular/parse \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "employee name, department, hire date, salary, city"}' \
| jq '.data.columns')
echo "Parsed schema: $SCHEMA"
# 2. Generate 5,000 rows from the schema
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/tabular/generate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d "{\"columns\": $SCHEMA, \"quantity\": 5000, \"output_formats\": [\"csv\"]}" \
| jq -r '.data.job_id')
echo "Job created: $JOB_ID"
# 3. Poll until complete
while true; do
STATUS=$(curl -s https://symagedocs.ai/api/v1/tabular/$JOB_ID/status \
-H "Authorization: Bearer $API_KEY" | jq -r '.data.status')
echo "Status: $STATUS"
[ "$STATUS" = "completed" ] && break
[ "$STATUS" = "failed" ] && echo "Job failed!" && exit 1
sleep 2
done
# 4. Download CSV
curl https://symagedocs.ai/api/v1/tabular/$JOB_ID/download/csv \
-H "Authorization: Bearer $API_KEY" -o employees.csv
Event-Driven Generation (Webhooks)
Instead of polling for job completion, register a webhook and let SymageDocs notify you:
# 1. Register a webhook (once, via the web UI or the /api/internal/webhooks endpoint)
curl -X POST https://symagedocs.ai/api/internal/webhooks \
-H "x-user-id: YOUR_USER_ID" \
-H "Content-Type: application/json" \
-d '{"url": "https://your-server.com/webhooks/symagedocs", "events": ["job.completed", "job.failed"]}'
# Save the signing_secret from the response!
# 2. Generate documents — no need to poll
JOB_ID=$(curl -s -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"form_id": "irs_w2_2025", "quantity": 100, "output_formats": ["pdf_typed"]}' \
| jq -r '.data.job_id')
echo "Job $JOB_ID created — webhook will notify when done"
# 3. Your server receives a POST to /webhooks/symagedocs with:
# {"event": "job.completed", "data": {"job_id": "...", "status": "completed", ...}}
# Verify the X-Symagedocs-Signature header, then download results.
Reproducible Generation
Use the seed parameter to generate identical output across runs:
curl -X POST https://symagedocs.ai/api/v1/generate \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"form_id": "irs_w2_2025", "quantity": 10, "seed": 12345}'
Running this request again with the same seed and quantity produces the same documents.
Key Rotation
Rotate keys periodically for security. The old key stops working immediately and the new key takes over:
# Rotate via the web UI, or programmatically:
curl -X POST https://symagedocs.ai/api/internal/api-keys/OLD_KEY_ID/rotate \
-H "x-user-id: YOUR_USER_ID"
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| 401 Invalid API key | Key doesn't exist or was revoked | Create a new key in the web UI |
| 403 Missing required scopes | Key lacks the scope for this endpoint | Create a new key with the needed scopes |
| 404 Job not found | Job doesn't exist or belongs to another user | Check the job ID; keys can only access their own jobs |
| 409 Job is not completed | Trying to download before generation finishes | Poll GET /jobs/{id} until status is completed |
| 429 Rate limit exceeded | Exceeded your tier's per-key request rate | Back off using the X-RateLimit-Remaining header and retry |
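For the 429 case, a generic retry wrapper is often enough for clients that do not use the SDK. The Python sketch below applies exponential backoff with jitter. It is a general technique rather than the SDK's retry logic, and it does not interpret X-RateLimit-Remaining. The names call_with_backoff and send are hypothetical: send is assumed to perform one HTTP request and return a (status_code, headers, body) tuple.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Retry send() while it returns 429, waiting longer after each attempt."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Exponential backoff capped at max_delay, scaled by random jitter.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay * rng())
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Demo with canned responses; jitter is fixed at 1.0 so the delay is predictable.
canned = iter([
    (429, {"X-RateLimit-Remaining": "0"}, b""),
    (200, {"X-RateLimit-Remaining": "9"}, b"{}"),
])
delays = []
status, headers, body = call_with_backoff(
    lambda: next(canned), sleep=delays.append, rng=lambda: 1.0)
print(status, delays)  # 200 [0.5]
```

Jitter spreads retries from multiple workers sharing one key, so they do not all hit the limit again at the same moment.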