v0.1.0 — Now on PyPI

Synthetic data that
understands your schema

SynthForge combines diffusion models, Gaussian copulas, and LLM-powered intelligence to generate production-grade synthetic tabular data from a 2,500-row sample. Detect PII. Preserve correlations. Ship as a pip-installable library.

$ pip install synthforge
98.5% Quality Score
6 Synthesizers
5-Layer Evaluation
100+ LLM Providers

The Approach

Five-stage pipeline, one line of code

Most synthetic data tools treat generation as a black box. SynthForge decomposes it into five observable, configurable stages — each one improvable independently.

01
Profile
Auto-detect schema types. LLM infers column semantics, relationships, and business rules from names + sample values.
02
Detect
3-layer PII detection: regex heuristics → Microsoft Presidio NER → LLM catches non-obvious patterns. MNPI flagging for financial data.
03
Fit
Auto-select synthesizer by data type and hardware. Reversible transforms handle nulls, outliers, mixed types. Constraints baked in via CAG.
04
Generate
Batch synthesis at configurable scale (1K–10M rows). PII columns auto-replaced with Faker. Min/max enforced from original.
05
Evaluate
5-layer quality gate: diagnostics, KS/TV/correlation/C2ST fidelity, TSTR ML utility, MIA privacy, LLM semantic validation.
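The five stages above can be sketched as plain, composable functions over a shared context. This is an illustrative sketch of the staged-pipeline pattern, not SynthForge's internal API; every name and signature below is hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the five-stage decomposition: each stage is an
# independent, inspectable function that reads and updates a shared context.
@dataclass
class Context:
    data: list                                   # input rows (stand-in for a DataFrame)
    schema: dict = field(default_factory=dict)
    pii_columns: list = field(default_factory=list)
    synthetic: list = field(default_factory=list)
    report: dict = field(default_factory=dict)

def profile(ctx):   # 01: infer schema types from sample values
    ctx.schema = {k: type(v).__name__ for k, v in ctx.data[0].items()}
    return ctx

def detect(ctx):    # 02: flag likely-PII column names (regex layer only, stubbed)
    ctx.pii_columns = [c for c in ctx.schema if c in {"email", "ssn", "name"}]
    return ctx

def fit(ctx):       # 03: pick a synthesizer (stubbed to the CPU default)
    ctx.report["synthesizer"] = "gaussian_copula"
    return ctx

def generate(ctx):  # 04: emit rows, masking PII columns with a placeholder
    ctx.synthetic = [{c: ("<faker>" if c in ctx.pii_columns else r[c])
                      for c in ctx.schema} for r in ctx.data]
    return ctx

def evaluate(ctx):  # 05: trivial pass/fail quality gate
    ctx.report["passed"] = len(ctx.synthetic) == len(ctx.data)
    return ctx

def run(rows):
    ctx = Context(data=rows)
    for stage in (profile, detect, fit, generate, evaluate):
        ctx = stage(ctx)   # each stage is observable and swappable in isolation
    return ctx
```

Because each stage only touches the context, any one of them can be replaced or instrumented without rewriting the rest, which is the point of decomposing generation instead of treating it as a black box.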

Generation Engines

Six models, from seconds to state-of-the-art

Auto-strategy engine selects the optimal model based on your data characteristics and hardware. CUDA enforced for neural models — no accidental 3-hour CPU training runs.

Gaussian Copula
Default · CPU
Fits marginal distributions per column + Gaussian correlation structure. Trains in seconds. No GPU needed. Best for numerical data.
CPU · Seconds
CTGAN
NeurIPS 2019 · GPU
Conditional GAN with mode-specific normalization and training-by-sampling. Best for imbalanced categoricals and high-cardinality columns.
CUDA · Minutes
TVAE
NeurIPS 2019 · GPU
Tabular Variational Autoencoder. More stable than CTGAN, fewer hyperparameters. Strong default for mixed-type data.
CUDA · Minutes
TabDDPM
ICML 2023 · GPU
Denoising diffusion with dual noise processes — Gaussian for continuous, multinomial for categorical. Decisive quality leap over GANs.
CUDA · 10min
TabSyn
ICLR 2024 Oral · GPU
Latent diffusion: a VAE encodes mixed types into a unified latent space, then score-based diffusion models that space. 86% better marginals, 93% faster sampling than TabDDPM.
CUDA · SOTA
GReaT
ICLR 2023 · GPU
Fine-tunes GPT-2 on text-serialized rows. Leverages pretrained semantic knowledge. Supports conditional generation without retraining.
LLM · Hours
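The default engine's core idea fits in a few lines of NumPy/SciPy: map each column to normal scores through its empirical CDF, capture dependence with a Gaussian correlation matrix, then sample correlated normals and invert back through each column's quantiles. A minimal sketch of the textbook Gaussian copula, not SynthForge's implementation:

```python
import numpy as np
from scipy import stats

def fit_sample_gaussian_copula(X, n_samples, seed=0):
    """Minimal Gaussian-copula sketch (illustrative, not SynthForge's code):
    empirical marginals + Gaussian correlation, then sample and invert."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1) Per-column normal scores from empirical ranks (the "marginal fit")
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    Z = stats.norm.ppf(ranks / (n + 1))
    # 2) Gaussian correlation structure across columns
    corr = np.corrcoef(Z, rowvar=False)
    # 3) Sample correlated normals, push through the normal CDF, and invert
    #    each column via its empirical quantile function
    Z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    U = stats.norm.cdf(Z_new)
    return np.column_stack([np.quantile(X[:, j], U[:, j]) for j in range(d)])
```

Because the inverse step uses empirical quantiles, every sampled value stays within the original column's min/max, which is why this family of models trains in seconds on CPU and needs no GPU.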

The Differentiator

LLM intelligence at every stage

No existing library systematically uses LLMs for schema understanding, privacy detection, and semantic validation. SynthForge does, through any provider — Claude, GPT, Ollama, vLLM — via LiteLLM.

Schema Enrichment
Infers that "fname" means first_name, "amt" means currency, and that city-state-zip form a hierarchical group. Maps columns to Faker providers for realistic replacement.
PII Detection
Layer 1: regex on column names. Layer 2: Presidio NER on values. Layer 3: LLM catches non-obvious PII, such as a column named "cust_ref" that actually contains SSNs, or quasi-identifiers that together re-identify individuals.
MNPI Detection
Flags material non-public information in financial data: unreleased earnings, M&A deal values, strategic plans. Classifies risk level (low/medium/high/critical) per column.
Semantic Validation
LLM-as-judge pattern: reviews batches of synthetic rows for impossible combinations — a 5-year-old with a PhD, Japanese names with Mexican zip codes, shipping dates before order dates.
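Layer 1 of the PII stack (regex on column names) can be sketched as a small pattern table. The patterns below are illustrative assumptions, not SynthForge's actual rule set; layers 2 and 3 (Presidio NER and the LLM pass) sit on top of this kind of cheap first filter.

```python
import re

# Illustrative column-name heuristics (assumed patterns, not SynthForge's rules).
PII_NAME_PATTERNS = {
    "first_name": re.compile(r"(^|_)(first_?name|fname)($|_)", re.I),
    "ssn":        re.compile(r"(^|_)(ssn|social_?security)($|_)", re.I),
    "email":      re.compile(r"(^|_)e?mail($|_)", re.I),
    "phone":      re.compile(r"(^|_)(phone|mobile|tel)($|_)", re.I),
}

def detect_pii_columns(columns):
    """Return {column: pii_type} for names matching a heuristic pattern."""
    hits = {}
    for col in columns:
        for pii_type, pattern in PII_NAME_PATTERNS.items():
            if pattern.search(col):
                hits[col] = pii_type
                break
    return hits
```

A name-based layer is fast but blind to mislabeled columns (the "cust_ref" full of SSNs case), which is exactly the gap the value-level NER and LLM layers exist to close.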

Quality Benchmarks

Evaluation is first-class, not an afterthought

Every generate() call can return a quality report. Built-in pass/fail thresholds configurable per use case. MIA-based privacy replaces the discredited DCR metric.

| Dataset | Score | KS Compl. | Correlation | C2ST | TV Compl. |
|---|---|---|---|---|---|
| Sensor (numerical) | 98.5% | 0.992 | 0.987 | 0.973 | n/a |
| HR (mixed + PII) | 78.3% | 0.967 | 0.984 | 0.856 | 0.739 |
| Financial (complex) | 75.7% | 0.496 | 0.983 | 0.811 | n/a |
| E-commerce (categorical) | 73.2% | n/a | 0.755 | 0.865 | 0.682 |

All benchmark scores above use the default Gaussian Copula on CPU. Quality increases substantially with TabSyn or TabDDPM on GPU.
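The KS Complement column is the standard fidelity metric: one minus the two-sample Kolmogorov–Smirnov statistic per numerical column, averaged, so 1.0 means identical marginals. A sketch under that assumption (not SynthForge's exact evaluator):

```python
import numpy as np
from scipy import stats

def ks_complement(real, synthetic):
    """Mean per-column (1 - KS statistic) between real and synthetic data."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    scores = []
    for j in range(real.shape[1]):
        # ks_2samp compares the two empirical CDFs of column j
        ks_stat, _ = stats.ks_2samp(real[:, j], synthetic[:, j])
        scores.append(1.0 - ks_stat)
    return float(np.mean(scores))
```

A score near 1.0 means the synthetic marginals are indistinguishable from the real ones under the KS test; a drifted or collapsed column drags the average down immediately.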

Usage

Three lines to production

Python

import pandas as pd

from synthforge import SynthForge

# Load a 2,500-row sample from Redshift / any warehouse
df = pd.read_csv("production_sample.csv")

# One line: profile → fit → generate
forge = SynthForge(llm_provider="anthropic", llm_model="claude-sonnet-4-20250514")
synthetic = forge.fit_generate(df, num_rows=100_000)

# Quality report with pass/fail gates
report = forge.evaluate(df, synthetic)
print(report.summary())  # → Overall: 98.34% PASS

Brief

The elevator pitch

Copy-ready description
SynthForge is a pip-installable Python library for generating high-fidelity synthetic tabular data from small production samples. It combines six generation backends — from fast Gaussian Copula (seconds, CPU) to state-of-the-art TabSyn latent diffusion (ICLR 2024) and TabDDPM denoising diffusion (ICML 2023) — with an LLM-augmented pipeline that automatically detects PII and MNPI, infers column semantics, and validates generated data for logical consistency. The library auto-selects the optimal synthesizer based on data characteristics (numerical, categorical, time-series, mixed) and hardware (CUDA enforced for neural models), supports configurable scale from thousands to millions of rows via batch generation, and ships with a 5-layer evaluation pipeline covering statistical fidelity, ML utility, and privacy metrics. LLM integration is provider-agnostic via LiteLLM, supporting Claude, OpenAI, Ollama, and 100+ providers.

Architecture

Modular, extensible, typed

Structure

synthforge/
├── forge.py              # Orchestrator — public API
├── config.py             # Pydantic v2 models
├── metadata.py           # Schema detection + semantic types
├── synthesizers/
│   ├── gaussian_copula   # Fast default (CPU)
│   ├── ctgan + tvae      # GAN/VAE (GPU, CUDA enforced)
│   ├── tabddpm           # Diffusion (ICML 2023)
│   ├── tabsyn            # Latent diffusion (ICLR 2024)
│   └── great             # LLM fine-tuning (ICLR 2023)
├── transforms/           # Reversible: Num/Cat/DateTime/Bool
├── constraints/          # CAG: Inequality/Positive/Range
├── llm/                  # Schema/PII/MNPI/Validator via LiteLLM
├── evaluation/           # KS/TV/Corr/C2ST/TSTR/MIA
└── strategies/           # Auto-select by data type + hardware