v0.1.0 — Now on PyPI

Synthetic data that
understands your schema

SynthForge combines diffusion models, Gaussian copulas, and LLM-powered intelligence to generate production-grade synthetic tabular data from a 2,500-row sample. Detect PII. Preserve correlations. Ship as a pip-installable library.

$ pip install synthforge
98.5% Quality Score
6 Synthesizers
5-Layer Evaluation
100+ LLM Providers

The Approach

Five-stage pipeline, one line of code

Most synthetic data tools treat generation as a black box. SynthForge decomposes it into five observable, configurable stages — each one improvable independently.

01
Profile
Auto-detect schema types. LLM infers column semantics, relationships, and business rules from names + sample values.
02
Detect
3-layer PII detection: regex heuristics → Microsoft Presidio NER → LLM catches non-obvious patterns. MNPI flagging for financial data.
03
Fit
Auto-select synthesizer by data type and hardware. Reversible transforms handle nulls, outliers, mixed types. Constraints baked in via CAG.
04
Generate
Batch synthesis at configurable scale (1K–10M rows). PII columns auto-replaced with Faker. Min/max enforced from original.
05
Evaluate
5-layer quality gate: diagnostics, KS/TV/correlation/C2ST fidelity, TSTR ML utility, MIA privacy, LLM semantic validation.
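The five stages above can be sketched as plain, composable functions over a shared context. This is an illustrative sketch of the staged-pipeline pattern, not SynthForge's internal API; every name and signature below is hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the five-stage decomposition: each stage is an
# independent, inspectable function that reads and updates a shared context.
@dataclass
class Context:
    data: list                                   # input rows (stand-in for a DataFrame)
    schema: dict = field(default_factory=dict)
    pii_columns: list = field(default_factory=list)
    synthetic: list = field(default_factory=list)
    report: dict = field(default_factory=dict)

def profile(ctx):   # 01: infer schema types from sample values
    ctx.schema = {k: type(v).__name__ for k, v in ctx.data[0].items()}
    return ctx

def detect(ctx):    # 02: flag likely-PII column names (regex layer only, stubbed)
    ctx.pii_columns = [c for c in ctx.schema if c in {"email", "ssn", "name"}]
    return ctx

def fit(ctx):       # 03: pick a synthesizer (stubbed to the CPU default)
    ctx.report["synthesizer"] = "gaussian_copula"
    return ctx

def generate(ctx):  # 04: emit rows, masking PII columns with a placeholder
    ctx.synthetic = [{c: ("<faker>" if c in ctx.pii_columns else r[c])
                      for c in ctx.schema} for r in ctx.data]
    return ctx

def evaluate(ctx):  # 05: trivial pass/fail quality gate
    ctx.report["passed"] = len(ctx.synthetic) == len(ctx.data)
    return ctx

def run(rows):
    ctx = Context(data=rows)
    for stage in (profile, detect, fit, generate, evaluate):
        ctx = stage(ctx)   # each stage is observable and swappable in isolation
    return ctx
```

Because each stage only touches the context, any one of them can be replaced or instrumented without rewriting the rest, which is the point of decomposing generation instead of treating it as a black box.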

Generation Engines

Six models, from seconds to state-of-the-art

Auto-strategy engine selects the optimal model based on your data characteristics and hardware. CUDA enforced for neural models — no accidental 3-hour CPU training runs.

Gaussian Copula
Default · CPU
Fits marginal distributions per column + Gaussian correlation structure. Trains in seconds. No GPU needed. Best for numerical data.
CPU · Seconds
CTGAN
NeurIPS 2019 · GPU
Conditional GAN with mode-specific normalization and training-by-sampling. Best for imbalanced categoricals and high-cardinality columns.
CUDA · Minutes
TVAE
NeurIPS 2019 · GPU
Tabular Variational Autoencoder. More stable than CTGAN, fewer hyperparameters. Strong default for mixed-type data.
CUDA · Minutes
TabDDPM
ICML 2023 · GPU
Denoising diffusion with dual noise processes — Gaussian for continuous, multinomial for categorical. Decisive quality leap over GANs.
CUDA · 10min
TabSyn
ICLR 2024 Oral · GPU
Latent diffusion: a VAE encodes mixed types into a unified latent space, then score-based diffusion models that space. 86% better marginals, 93% faster sampling than TabDDPM.
CUDA · SOTA
GReaT
ICLR 2023 · GPU
Fine-tunes GPT-2 on text-serialized rows. Leverages pretrained semantic knowledge. Supports conditional generation without retraining.
LLM · Hours
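The default engine's core idea fits in a few lines of NumPy/SciPy: map each column to normal scores through its empirical CDF, capture dependence with a Gaussian correlation matrix, then sample correlated normals and invert back through each column's quantiles. A minimal sketch of the textbook Gaussian copula, not SynthForge's implementation:

```python
import numpy as np
from scipy import stats

def fit_sample_gaussian_copula(X, n_samples, seed=0):
    """Minimal Gaussian-copula sketch (illustrative, not SynthForge's code):
    empirical marginals + Gaussian correlation, then sample and invert."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1) Per-column normal scores from empirical ranks (the "marginal fit")
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    Z = stats.norm.ppf(ranks / (n + 1))
    # 2) Gaussian correlation structure across columns
    corr = np.corrcoef(Z, rowvar=False)
    # 3) Sample correlated normals, push through the normal CDF, and invert
    #    each column via its empirical quantile function
    Z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    U = stats.norm.cdf(Z_new)
    return np.column_stack([np.quantile(X[:, j], U[:, j]) for j in range(d)])
```

Because the inverse step uses empirical quantiles, every sampled value stays within the original column's min/max, which is why this family of models trains in seconds on CPU and needs no GPU.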

The Differentiator

LLM intelligence at every stage

No existing library systematically uses LLMs for schema understanding, privacy detection, and semantic validation. SynthForge does, through any provider — Claude, GPT, Ollama, vLLM — via LiteLLM.

Schema Enrichment
Infers that "fname" means first_name, "amt" means currency, and that city-state-zip form a hierarchical group. Maps columns to Faker providers for realistic replacement.
PII Detection
Layer 1: regex on column names. Layer 2: Presidio NER on values. Layer 3: LLM catches non-obvious PII, such as a column named "cust_ref" that actually contains SSNs, or quasi-identifiers that together re-identify individuals.
MNPI Detection
Flags material non-public information in financial data: unreleased earnings, M&A deal values, strategic plans. Classifies risk level (low/medium/high/critical) per column.
Semantic Validation
LLM-as-judge pattern: reviews batches of synthetic rows for impossible combinations — a 5-year-old with a PhD, Japanese names with Mexican zip codes, shipping dates before order dates.
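Layer 1 of the PII stack (regex on column names) can be sketched as a small pattern table. The patterns below are illustrative assumptions, not SynthForge's actual rule set; layers 2 and 3 (Presidio NER and the LLM pass) sit on top of this kind of cheap first filter.

```python
import re

# Illustrative column-name heuristics (assumed patterns, not SynthForge's rules).
PII_NAME_PATTERNS = {
    "first_name": re.compile(r"(^|_)(first_?name|fname)($|_)", re.I),
    "ssn":        re.compile(r"(^|_)(ssn|social_?security)($|_)", re.I),
    "email":      re.compile(r"(^|_)e?mail($|_)", re.I),
    "phone":      re.compile(r"(^|_)(phone|mobile|tel)($|_)", re.I),
}

def detect_pii_columns(columns):
    """Return {column: pii_type} for names matching a heuristic pattern."""
    hits = {}
    for col in columns:
        for pii_type, pattern in PII_NAME_PATTERNS.items():
            if pattern.search(col):
                hits[col] = pii_type
                break
    return hits
```

A name-based layer is fast but blind to mislabeled columns (the "cust_ref" full of SSNs case), which is exactly the gap the value-level NER and LLM layers exist to close.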

Quality Benchmarks

Evaluation is first-class, not an afterthought

Every generate() call can return a quality report. Built-in pass/fail thresholds configurable per use case. MIA-based privacy replaces the discredited DCR metric.

| Dataset | Score | KS Compl. | Correlation | C2ST | TV Compl. |
|---|---|---|---|---|---|
| Sensor (numerical) | 98.5% | 0.992 | 0.987 | 0.973 | n/a |
| HR (mixed + PII) | 78.3% | 0.967 | 0.984 | 0.856 | 0.739 |
| Financial (complex) | 75.7% | 0.496 | 0.983 | 0.811 | n/a |
| E-commerce (categorical) | 73.2% | n/a | 0.755 | 0.865 | 0.682 |

All benchmark scores above use the default Gaussian Copula on CPU. Quality increases substantially with TabSyn or TabDDPM on GPU.
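The KS Complement column is the standard fidelity metric: one minus the two-sample Kolmogorov–Smirnov statistic per numerical column, averaged, so 1.0 means identical marginals. A sketch under that assumption (not SynthForge's exact evaluator):

```python
import numpy as np
from scipy import stats

def ks_complement(real, synthetic):
    """Mean per-column (1 - KS statistic) between real and synthetic data."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    scores = []
    for j in range(real.shape[1]):
        # ks_2samp compares the two empirical CDFs of column j
        ks_stat, _ = stats.ks_2samp(real[:, j], synthetic[:, j])
        scores.append(1.0 - ks_stat)
    return float(np.mean(scores))
```

A score near 1.0 means the synthetic marginals are indistinguishable from the real ones under the KS test; a drifted or collapsed column drags the average down immediately.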

Usage

Three lines to production

Python

import pandas as pd

from synthforge import SynthForge

# Load a 2,500-row sample from Redshift / any warehouse
df = pd.read_csv("production_sample.csv")

# One line: profile → fit → generate
forge = SynthForge(llm_provider="anthropic", llm_model="claude-sonnet-4-20250514")
synthetic = forge.fit_generate(df, num_rows=100_000)

# Quality report with pass/fail gates
report = forge.evaluate(df, synthetic)
print(report.summary())  # → Overall: 98.34% PASS

Brief

The elevator pitch

Copy-ready description
SynthForge is a pip-installable Python library for generating high-fidelity synthetic tabular data from small production samples. It combines six generation backends — from fast Gaussian Copula (seconds, CPU) to state-of-the-art TabSyn latent diffusion (ICLR 2024) and TabDDPM denoising diffusion (ICML 2023) — with an LLM-augmented pipeline that automatically detects PII and MNPI, infers column semantics, and validates generated data for logical consistency. The library auto-selects the optimal synthesizer based on data characteristics (numerical, categorical, time-series, mixed) and hardware (CUDA enforced for neural models), supports configurable scale from thousands to millions of rows via batch generation, and ships with a 5-layer evaluation pipeline covering statistical fidelity, ML utility, and privacy metrics. LLM integration is provider-agnostic via LiteLLM, supporting Claude, OpenAI, Ollama, and 100+ providers.

Architecture

Modular, extensible, typed

Structure

synthforge/
├── forge.py              # Orchestrator — public API
├── config.py             # Pydantic v2 models
├── metadata.py           # Schema detection + semantic types
├── synthesizers/
│   ├── gaussian_copula   # Fast default (CPU)
│   ├── ctgan + tvae      # GAN/VAE (GPU, CUDA enforced)
│   ├── tabddpm           # Diffusion (ICML 2023)
│   ├── tabsyn            # Latent diffusion (ICLR 2024)
│   └── great             # LLM fine-tuning (ICLR 2023)
├── transforms/           # Reversible: Num/Cat/DateTime/Bool
├── constraints/          # CAG: Inequality/Positive/Range
├── llm/                  # Schema/PII/MNPI/Validator via LiteLLM
├── evaluation/           # KS/TV/Corr/C2ST/TSTR/MIA
└── strategies/           # Auto-select by data type + hardware