SymageDocs Python SDK

Generate synthetic documents, identities, and tabular datasets for testing, ML training, and compliance.

Installation

pip install symagedocs

For progress bars during long jobs:

pip install "symagedocs[progress]"

Quick Start

from symagedocs import Client

client = Client(api_key="sk_live_...")

# List available forms
forms = client.forms.list()
for f in forms:
    print(f"{f.id}: {f.name} ({f.credit_cost} credits)")

# Generate 100 W-2 documents
# JSON ground truth and CSV are always included in the bundle — no need to request them.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    output_formats=["pdf_typed"],
    # Augmentation knobs. `degradation_profile` affects credit cost —
    # `scanned`/`faxed` add 20%, `photographed` 30%, `mixed` 25% (`clean` = no surcharge).
    # `coherence_mode` controls cross-form identity correlation in multi-form jobs.
    degradation_profile="scanned",
    coherence_mode="coherent",
)
result = client.generate.wait(job.job_id)  # polls until complete
client.generate.download(job.job_id, "bundle", "./w2_documents.zip")

# Per-item training data
job = client.generate.create(
    form_id="irs_w2_2025",
    quantity=10,
    output_formats=["pdf_typed", "bio"],
    idempotency_key="my-retry-safe-key",
)
client.generate.wait(job.job_id)
for example in client.generate.iter_training_examples(job.job_id, format="bio"):
    print(example.item_id, len(example.bio.tokens))

# Generate tabular data from a description
schema = client.tabular.parse("name, age, SSN, city, state, annual income")
tab_job = client.tabular.generate(columns=schema.columns, quantity=5000)
client.tabular.wait(tab_job.job_id)
client.tabular.download(tab_job.job_id, "csv", "./dataset.csv")

# Check credit balance
balance = client.account.balance()
print(f"Credits used: {balance.credits_used}")

Authentication

Get your API key at symagedocs.ai/account?tab=api.

# Pass directly
client = Client(api_key="sk_live_...")

# Or set environment variable
# export SYMAGEDOCS_API_KEY=sk_live_...
client = Client()  # reads from env
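
The lookup order shown above (explicit argument first, then the SYMAGEDOCS_API_KEY environment variable) can be sketched as a small stand-alone function. The helper name resolve_api_key and the error raised when no key is found are assumptions for illustration; the SDK may resolve keys differently.

```python
import os

ENV_VAR = "SYMAGEDOCS_API_KEY"  # variable name taken from the section above


def resolve_api_key(api_key=None, environ=None):
    """Return the explicit key if given, else fall back to the environment.

    Hypothetical helper, not part of the SDK.
    """
    env = os.environ if environ is None else environ
    key = api_key or env.get(ENV_VAR)
    if not key:
        raise ValueError(f"No API key: pass api_key=... or set {ENV_VAR}")
    return key
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.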

Async Support

import asyncio

from symagedocs import AsyncClient

async def main():
    async with AsyncClient(api_key="sk_live_...") as client:
        forms = await client.forms.list()
        job = await client.generate.create("irs_w2_2025", quantity=10)
        result = await client.generate.wait(job.job_id)

asyncio.run(main())

Configuration

client = Client(
    api_key="sk_live_...",
    base_url="https://symagedocs.ai",  # custom server
    timeout=30.0,                       # request timeout (seconds)
    max_retries=3,                      # retry on 429/5xx
)
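
A setting like max_retries=3 on 429/5xx is commonly implemented with exponential backoff. The sketch below shows that general pattern only. The delay schedule, the function names, and the send callable are assumptions, and the SDK's own retry logic may differ.

```python
import time


def is_retryable(status):
    """429 and any 5xx are the retryable statuses named in the README."""
    return status == 429 or 500 <= status <= 599


def call_with_retries(send, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    send is a hypothetical zero-argument callable returning an HTTP status code.
    """
    attempt = 0
    while True:
        status = send()
        if not is_retryable(status) or attempt >= max_retries:
            return status
        sleep(base_delay * (2 ** attempt))  # assumed schedule: 0.5s, 1s, 2s, ...
        attempt += 1
```

With max_retries=3 the request is sent at most four times: the initial attempt plus three retries.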

Method Reference

Forms

| Method | Description |
| --- | --- |
| forms.list(category=None) | List available forms, optionally filtered by category |
| forms.get(form_id) | Get detailed form info including field definitions |

Generation

| Method | Description |
| --- | --- |
| generate.create(form_id=None, *, form_ids=None, quantity=1, output_formats=["pdf_typed"], config=None, seed=None, webhook_url=None, ink_color=None, ink_color_distribution=None, writer_consistency=None, degradation_profile=None, coherence_mode=None, idempotency_key=None) | Create an async generation job. Pass either form_id (single form) or form_ids (coherent multi-form generation across the same identity). ink_color_distribution, when set, must be a per-color weight map summing to exactly 100. degradation_profile and coherence_mode are typed kwargs for values that can also be passed inside config={...} (see the Augmentation knobs section). idempotency_key attaches an Idempotency-Key header so retries within 24 hours return the original job_id and don't double-charge. The deprecated realism_level API field is intentionally not exposed; call the REST API directly if you need it. |
| generate.list_jobs(limit=50, cursor=None, status=None) | List generation jobs (cursor-paginated) |
| generate.get_job(job_id) | Get full job status and progress |
| generate.list_downloads(job_id) | List per-artifact presigned download URLs for a completed job |
| generate.download(job_id, format, path) | Download job output to a local file. Also allowed for jobs that reached a terminal state without completing (CANCELED / FAILED / EXPIRED), so partial output is recoverable. |
| generate.wait(job_id, poll_interval=3.0) | Poll until the job completes or fails |
| generate.cancel(job_id) | Cancel a running job. Idempotent. Items rendered before the cancel was observed remain downloadable via download(format="bundle"). |
| generate.list_items(job_id, limit=50, cursor=None) | List per-item records for a job. Cursor-paginated; each item carries its presigned download URLs. |
| generate.download_item(job_id, item_id) | Get presigned S3 URLs for one item's files. |
| generate.get_bio_labels(job_id, item_id) | Client-side helper: fetches the item's _bio.json sidecar and returns a parsed BioDataset. |
| generate.get_word_annotations(job_id, item_id) | Client-side helper: fetches the item's _words.json sidecar and returns parsed WordAnnotations. |
| generate.iter_training_examples(job_id, format="bio") | Client-side helper: iterates all items, yielding training examples in the chosen format: "bio" (default), "funsd", or "donut". |
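
list_jobs and list_items are cursor-paginated. A generic loop for draining such an endpoint might look like the sketch below. The page shape (a dict with "items" and "next_cursor" keys) and the fetch_page callable are assumptions; check the actual return type of these methods before adapting it.

```python
def iter_all(fetch_page, limit=50):
    """Yield every record from a cursor-paginated endpoint.

    fetch_page(limit=..., cursor=...) is a hypothetical callable returning
    {"items": [...], "next_cursor": <str or None>}.
    """
    cursor = None
    while True:
        page = fetch_page(limit=limit, cursor=cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            break


# Stand-in for a real endpoint: three records served `limit` at a time.
_RECORDS = ["item_1", "item_2", "item_3"]


def fake_fetch(limit, cursor):
    start = int(cursor or 0)
    end = start + limit
    next_cursor = str(end) if end < len(_RECORDS) else None
    return {"items": _RECORDS[start:end], "next_cursor": next_cursor}
```

Calling list(iter_all(fake_fetch, limit=2)) walks both pages and collects all three records in order.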

client.generation alias: client.generation and client.generate refer to the same resource, so use whichever name you prefer.
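
The rule that ink_color_distribution must sum to exactly 100 can be checked locally before submitting a job. The function below is a client-side pre-check sketch, not part of the SDK. The color names in the usage note are placeholders, and the non-negativity check is an extra assumption beyond what the reference states.

```python
def check_ink_distribution(weights):
    """Raise ValueError unless the weights are non-negative and sum to 100."""
    if not weights:
        raise ValueError("distribution is empty")
    negative = [color for color, w in weights.items() if w < 0]
    if negative:
        raise ValueError(f"negative weight for: {negative}")
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected exactly 100")
    return weights
```

For example, check_ink_distribution({"black": 70, "blue": 30}) passes, while {"black": 70, "blue": 20} raises because the weights sum to 90.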

Identities

| Method | Description |
| --- | --- |
| identities.generate(quantity=1, config=None, seed=None) | Generate raw synthetic identities as JSON |

Tabular

| Method | Description |
| --- | --- |
| tabular.parse(prompt) | Convert natural language to a column schema (LLM-powered) |
| tabular.generate(columns, quantity=100, output_formats=["csv"], seed=None) | Create a tabular generation job |
| tabular.status(job_id) | Get tabular job progress and ETA |
| tabular.download(job_id, format, path) | Download tabular output to a local file |
| tabular.wait(job_id, poll_interval=2.0) | Poll until the tabular job completes or fails |
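
The wait helpers poll at a fixed poll_interval until the job reaches a terminal state. A minimal sketch of that pattern follows. The status strings, the timeout parameter, and the get_status callable are assumptions rather than the SDK's implementation.

```python
import time

# Assumed status names; the README mentions CANCELED / FAILED / EXPIRED.
TERMINAL = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}


def wait_for(get_status, poll_interval=3.0, timeout=600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or time runs out.

    get_status is a hypothetical zero-argument callable returning a string.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        sleep(poll_interval)
```

Injecting sleep and clock makes the loop testable without real delays; in normal use the defaults apply.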

Account

| Method | Description |
| --- | --- |
| account.balance() | Get credit balance (credits_used, credits_allocated) |
| account.usage(days=30) | Get usage summary for the specified period |

Pricing

The pricing endpoints are public/unauthenticated on the backend, but the SDK still requires an API key at construction time for consistency; the auth header is sent and ignored by these routes.

| Method | Description |
| --- | --- |
| pricing.rates() | Get the current credit rate constants (CSV per-row rate, PDF base + surcharge bands, multipliers, …) |
| pricing.estimate(*, field_count, output_formats, record_count, degradation_profile=None) | Estimate the credit cost of a hypothetical job before submitting it |

Health

| Method | Description |
| --- | --- |
| client.health() | Lightweight reachability probe (GET /api/v1/health). Returns the parsed JSON body. Works on both Client and AsyncClient. |

Augmentation knobs

Two of the most-used keys in the freeform config={...} dict on generate.create are also exposed as typed kwargs:

  • degradation_profile: Literal["clean", "scanned", "faxed", "photographed", "mixed"] | None
  • coherence_mode: Literal["coherent", "shuffled", "random"] | None

Why bother? Two reasons:

  1. degradation_profile affects credit cost. Non-clean profiles need extra rendering work (rasterization, noise, paper warp), so the billing engine applies a multiplier: scanned/faxed are billed at 1.2×, mixed at 1.25×, and photographed at 1.3×. A typo in the freeform config={...} form silently falls back to the default 1.0× multiplier, so you don't get the degradation you asked for and nothing flags the mistake unless you happen to notice the missing artifacts. The typed kwarg form catches typos at type-check time.
  2. Pre-flight validation. The Literal types fence off unknown values at edit time in any IDE that supports type checking. The backend also rejects unknown values with 400 for both knobs, so even untyped callers get a fast failure — but the typed form catches the mistake before the network round-trip.
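
To make the multipliers concrete, here is a small worked calculation. The multiplier values come from the list above; the base cost of 100 credits is a made-up figure, and any rounding the billing engine applies is not modeled.

```python
from decimal import Decimal

# Multipliers as stated in this section.
MULTIPLIERS = {
    "clean": Decimal("1.0"),
    "scanned": Decimal("1.2"),
    "faxed": Decimal("1.2"),
    "mixed": Decimal("1.25"),
    "photographed": Decimal("1.3"),
}


def with_surcharge(base_credits, profile="clean"):
    """Apply the degradation multiplier to a (hypothetical) base credit cost."""
    return Decimal(base_credits) * MULTIPLIERS[profile]


# Print the cost of a job with a hypothetical base of 100 credits per profile.
for profile in MULTIPLIERS:
    print(profile, with_surcharge(100, profile))
```

Under these assumptions a 100-credit job costs 120 credits as scanned or faxed, 125 as mixed, and 130 as photographed. A misspelled profile raises KeyError here, which is the kind of early failure the typed kwargs aim for.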

The SDK exports the canonical value tuples too:

from symagedocs import DEGRADATION_PROFILES, COHERENCE_MODES

assert "scanned" in DEGRADATION_PROFILES
assert "coherent" in COHERENCE_MODES

If you pass a value via both forms (e.g. config={"degradation_profile": "X"} AND degradation_profile="Y"), the value in config wins and a RuntimeWarning is emitted so the conflict isn't silent.

# Typed kwarg form — recommended.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    degradation_profile="scanned",   # billed at 1.2× — see above
    coherence_mode="coherent",
)

# Equivalent freeform form — still supported, but typos cost money.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    config={"degradation_profile": "scanned", "coherence_mode": "coherent"},
)
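
The precedence rule described above (the config value wins over the typed kwarg, with a RuntimeWarning) can be illustrated with a small stand-alone function. This is a sketch of the documented behavior, not the SDK source. merge_knob is a hypothetical name, and warning only when the two values differ is an assumption.

```python
import warnings


def merge_knob(name, kwarg_value, config):
    """Resolve one knob given both the typed kwarg and the config dict."""
    config = dict(config or {})
    if name in config:
        if kwarg_value is not None and kwarg_value != config[name]:
            warnings.warn(
                f"{name} given twice; using config value {config[name]!r}",
                RuntimeWarning,
                stacklevel=2,
            )
        return config[name]  # config takes precedence, per the README
    return kwarg_value
```

Passing the knob in only one place avoids the warning entirely, which is the simplest way to keep calls unambiguous.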

Error Handling

The SDK raises typed exceptions for API errors and retries automatically on 429 and 5xx:

from symagedocs import Client, AuthenticationError, RateLimitError, NotFoundError

client = Client()  # reads SYMAGEDOCS_API_KEY from the environment

try:
    forms = client.forms.list()
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    # Reaches your code only if the SDK's automatic retries did not succeed.
    print("Too many requests")
except NotFoundError:
    print("Resource not found")

All error classes:

| Exception | HTTP Code | Description |
| --- | --- | --- |
| SymageDocsError | - | Base exception for all SDK errors |
| AuthenticationError | 401 | Invalid or revoked API key |
| PermissionDeniedError | 403 | Key missing required scope |
| NotFoundError | 404 | Resource not found |
| ValidationError | 400 | Invalid request parameters |
| InsufficientCreditsError | 402 | Not enough credits for the operation |
| ConflictError | 409 | Resource in unexpected state (e.g., downloading incomplete job) |
| RateLimitError | 429 | Rate limit exceeded (SDK retries automatically) |
| ServerError | 5xx | Server-side error (SDK retries automatically) |
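
The table above implies a mapping from HTTP status to exception class. The sketch below reproduces that mapping with local stand-in classes so it runs without the SDK installed. The flat class hierarchy and the fallback to SymageDocsError for unlisted codes are assumptions.

```python
# Local stand-ins mirroring the exception names in the table above.
class SymageDocsError(Exception): pass
class ValidationError(SymageDocsError): pass
class AuthenticationError(SymageDocsError): pass
class InsufficientCreditsError(SymageDocsError): pass
class PermissionDeniedError(SymageDocsError): pass
class NotFoundError(SymageDocsError): pass
class ConflictError(SymageDocsError): pass
class RateLimitError(SymageDocsError): pass
class ServerError(SymageDocsError): pass

_BY_STATUS = {
    400: ValidationError,
    401: AuthenticationError,
    402: InsufficientCreditsError,
    403: PermissionDeniedError,
    404: NotFoundError,
    409: ConflictError,
    429: RateLimitError,
}


def exception_for_status(status):
    """Return the exception class the table associates with an HTTP status."""
    if status in _BY_STATUS:
        return _BY_STATUS[status]
    if 500 <= status <= 599:
        return ServerError
    return SymageDocsError  # assumed fallback for codes the table does not list
```

Because every class derives from SymageDocsError in this sketch, a single except SymageDocsError clause would catch all of them.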

Examples

See examples/ in the downloaded SDK for complete working scripts:

  • list_forms.py — Browse available forms and credit costs
  • generate_w2s.py — Full pipeline: create job, wait, download PDF + JSON
  • tabular_dataset.py — Parse NL description, generate 5k rows, download CSV
  • train_kie_model.py — Create job with NIST3 labels, iterate training examples with BIO labels and spatial annotations

License

MIT
