Simulatte Credibility — Research Program

Results measured against reality.

12 studies. 11 countries. All DA scores measured against published Pew Research Center ground truth.

US Population Accuracy
95.3%
Pew American Trends Panel ground truth.
Calibrated · 81.9% holdout
Europe — 9 Countries
93.33%
Mean calibrated DA across 9 European populations. All 9 above 91%.
Mean calibrated · 68.57% mean holdout
India v2 — Peak
97.61%
Program peak. First company to replicate India's political landscape at population scale.
95.87% holdout · Holdout also above 91%
01 — United States

95.3% calibrated.
81.9% holdout.

The 10 calibrated questions are measured against Pew American Trends Panel ground truth. The holdout score of 81.9% was earned on 5 questions the system never saw during calibration, with zero topic anchors.

Calibrated DA: 95.3%
Questions tested: 15
Holdout DA: 81.9%
Variance: ±0.00pp

Ground truth: Pew American Trends Panel, Waves 119–130, 2022–2023. 40 personas · WorldviewAnchor architecture. Holdout questions pre-designated before calibration — zero topic anchors applied.

[Chart: Calibrated accuracy, sprint convergence. Y-axis from 25% to 100% with the 91% DA threshold marked. Calibrated DA is +4.3pp above the 91% threshold; holdout (unseen) DA is 81.9%.]
02 — Europe Benchmark v2

9 nations. All above 91%.

The first cross-national study where every country independently exceeded 91% DA against Pew ground truth — not as a mean, but each nation on its own. Rebuilt using Simulatte Persona Generator cohorts across all 9 nations.

Mean calibrated DA: 93.33%
Mean holdout DA: 68.57%
Countries above 91%: 9 of 9
Peak country: Italy (95.48%)
Variance: ±0.00pp (all 9 countries)
Ground truth: Pew Global Attitudes, Spring 2024

The calibration-to-holdout gap varies substantially across countries: the Netherlands (81.47% holdout) and Poland (79.31%) show strong generalisation, while Hungary (55.92%) and Spain (61.07%) show where worldview transfer still falls short. Every country's scores vary by ±0.00pp across 3 replications.

40 personas per country · Simulatte Persona Generator (v2 rebuild) · 15 questions per country (10 shared cross-national + 5 country-specific) · Sprint EUR-1 · ±0.00pp variance.
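The gap and threshold figures in this section are plain arithmetic on the reported scores. As a minimal sketch (the `EUROPE` table is copied from the per-country numbers on this page; `summarise` and the field names are hypothetical helpers, not part of any published Simulatte tooling), the per-country gap, the margin above 91%, and the two means can be recomputed like this:

```python
# Calibrated and holdout DA (%) per country, copied from the figures on this page.
EUROPE = {
    "Italy": (95.48, 63.10),
    "Poland": (94.55, 79.31),
    "Netherlands": (94.41, 81.47),
    "United Kingdom": (94.00, 63.03),
    "Greece": (93.93, 69.53),
    "Sweden": (93.37, 69.78),
    "Hungary": (91.47, 55.92),
    "Spain": (91.45, 61.07),
    "France": (91.33, 73.96),
}
THRESHOLD = 91.0  # the 91% DA benchmark reference


def summarise(scores):
    """Return per-country gap and margin plus both means, rounded to 2 dp."""
    rows = {
        country: {
            "gap_pp": round(cal - hold, 2),          # calibration-to-holdout gap
            "margin_pp": round(cal - THRESHOLD, 2),  # calibrated DA above 91%
        }
        for country, (cal, hold) in scores.items()
    }
    n = len(scores)
    mean_cal = round(sum(cal for cal, _ in scores.values()) / n, 2)
    mean_hold = round(sum(hold for _, hold in scores.values()) / n, 2)
    return rows, mean_cal, mean_hold


rows, mean_cal, mean_hold = summarise(EUROPE)
# List countries from smallest to largest gap.
for country, r in sorted(rows.items(), key=lambda kv: kv[1]["gap_pp"]):
    print(f"{country:<15} gap {r['gap_pp']:>5.2f}pp  margin +{r['margin_pp']:.2f}pp")
print(f"mean calibrated {mean_cal:.2f}%  mean holdout {mean_hold:.2f}%")
```

Sorting by gap puts the Netherlands (12.94pp) first and Hungary (35.55pp) last, which matches the generalisation ranking described above.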

Country          Calibrated DA   Above 91%   Holdout DA
Italy            95.48%          +4.48pp     63.10%
Poland           94.55%          +3.55pp     79.31%
Netherlands      94.41%          +3.41pp     81.47%
United Kingdom   94.00%          +3.00pp     63.03%
Greece           93.93%          +2.93pp     69.53%
Sweden           93.37%          +2.37pp     69.78%
Hungary          91.47%          +0.47pp     55.92%
Spain            91.45%          +0.45pp     61.07%
France           91.33%          +0.33pp     73.96%
03 — India v2 — Program Peak

97.61% calibrated.
95.87% holdout.

The first study in the program where holdout DA — earned on questions never seen during calibration, with zero topic anchors — also exceeds 91% DA. The calibration-to-holdout gap is 1.74pp, down from 13.4pp in the US study.

Calibrated DA: 97.61% (+6.61pp above 91%)
Holdout DA: 95.87% (holdout also exceeds 91%)
The LLM generalises from worldview alone, without topic-specific anchors.
Calibration → Holdout gap: 1.74pp (down from 13.4pp in the USA study; smallest in the program)
Personas: 80 (india_general · DEEP tier · Persona Generator)
Top calibrated questions: 99.0% (India global power & women's rights)
Top holdout questions: 98.85% (Strong leader) · 98.5% (Climate threat)

Ground truth: Pew Global Attitudes Survey 2023 + CSDS-Lokniti NES (N ≈ 2,044–3,281 per question). Sprint IND-1 · ±0.00pp variance · 3 replications.
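The page reports a "±pp" variance over 3 replications but does not define the statistic. A minimal sketch, assuming it is the half-range of the DA scores across runs (one plausible reading; the function name and the non-zero example values are hypothetical):

```python
def replication_spread_pp(da_runs):
    """Half the range of DA scores (percentage points) across replications.

    Assumption: the source does not say how its variance figure is computed;
    half-range is used here only as one plausible reading of a "±x pp" value.
    """
    if len(da_runs) < 2:
        raise ValueError("need at least two replications")
    return (max(da_runs) - min(da_runs)) / 2


# Three identical runs, consistent with a reported ±0.00pp.
identical = [97.61, 97.61, 97.61]
print(f"±{replication_spread_pp(identical):.2f}pp")  # ±0.00pp

# Hypothetical runs that disagree, for contrast only.
varied = [81.0, 82.0, 83.0]
print(f"±{replication_spread_pp(varied):.2f}pp")  # ±1.00pp
```

Under this reading, ±0.00pp requires every replication to land on the same score to two decimal places.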

All Results

Every study. Every number.

12 completed studies. All scores are measured against published Pew Research Center ground truth. All holdout questions were pre-designated before calibration, with zero topic anchors applied.

Study                  Calibrated DA     Holdout DA
PEW USA v2             95.3% ±0.00pp     81.9% ±0.87pp
PEW India v2 ★ Peak    97.61% ±0.00pp    95.87% ±0.00pp
Europe: Italy          95.48% ±0.00pp    63.10% ±0.00pp
Europe: Poland         94.55% ±0.00pp    79.31% ±0.00pp
Europe: Netherlands    94.41% ±0.00pp    81.47% ±0.00pp
Europe: UK             94.00% ±0.00pp    63.03% ±0.00pp
Europe: Greece         93.93% ±0.00pp    69.53% ±0.00pp
Europe: Sweden         93.37% ±0.00pp    69.78% ±0.00pp
Europe: Hungary        91.47% ±0.00pp    55.92% ±0.00pp
Europe: Spain          91.45% ±0.00pp    61.07% ±0.00pp
Europe: France         91.33% ±0.00pp    73.96% ±0.00pp
PEW Germany (1C)       91.3%             76.5%

★ India v2 is the only study where holdout DA (95.87%) also exceeds 91%. DA = 1 − TVD = 1 − Σ|realᵢ − simᵢ| / 2.
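The DA formula above can be applied to a single question as follows. This is a minimal sketch, not Simulatte's published code: the function name, the dictionary input format, and the example shares are all assumptions made for illustration, and the page does not say how per-question DA is aggregated into a study-level score.

```python
def distribution_accuracy(real, sim):
    """DA = 1 - TVD, where TVD = sum(|real_i - sim_i|) / 2.

    `real` and `sim` map each response option to its share of respondents,
    expressed as fractions that sum to 1.
    """
    for name, dist in (("real", real), ("sim", sim)):
        total = sum(dist.values())
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"{name} shares sum to {total}, expected 1.0")
    # An option missing from one distribution counts as a share of 0 there.
    options = set(real) | set(sim)
    tvd = sum(abs(real.get(o, 0.0) - sim.get(o, 0.0)) for o in options) / 2
    return 1.0 - tvd


# Hypothetical question with made-up shares, for illustration only.
real = {"agree": 0.50, "neutral": 0.20, "disagree": 0.30}
sim = {"agree": 0.45, "neutral": 0.25, "disagree": 0.30}
da = distribution_accuracy(real, sim)
print(f"DA = {da:.1%}")  # DA = 95.0%
```

Here the synthetic population puts 5 points too few on "agree" and 5 too many on "neutral", so the absolute differences sum to 0.10, TVD is 0.05, and DA is 95%.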

Methodology

How we measure accuracy.

Distribution Accuracy (DA) measures how closely Simulatte's synthetic population mirrors real survey distributions. Every study follows the same protocol: calibrate on published data, then test on holdout questions the system has never seen.

01
Distribution Accuracy
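DA = 1 − TVD. Total Variation Distance is half the sum of absolute differences between the synthetic and real response shares, which equals the largest gap in probability that the two distributions assign to any set of response options. A DA of 95% means a TVD of 5pp: 5% of the synthetic responses would have to move to other options to match ground truth exactly.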
02
Benchmark Reference
91% DA is the natural self-inconsistency floor implied by survey test-retest literature — the point at which a simulation is matching the Pew sample within the noise floor of the data itself.
03
Holdout Protocol
Questions are split before calibration. Holdout questions receive zero topic anchors. Holdout DA measures pure worldview transfer — generalisation to unseen topics.
04
WorldviewAnchor
Each persona carries a structured worldview — values, priorities, ideological lean — derived from real typology data. The LLM conditions on worldview, not demographic stereotypes.
05
Replication Variance
Every study runs 3 times. A variance of ±0.00pp means the replications produced identical distributions regardless of LLM sampling randomness. Per-study variance is listed in the results table.
06
Open Audit
All study configurations, sprint runners, question sets, persona manifests, and raw outputs are published on GitHub. Every number is independently reproducible.
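The Holdout Protocol described above (split first, then calibrate, with no topic anchors on holdout questions) can be expressed as a guard in code. This is a sketch under stated assumptions: `QuestionSplit`, `designate_split`, `attach_topic_anchors`, and the question IDs are hypothetical names, not taken from the published sprint runners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionSplit:
    """Immutable record of which questions are calibration vs holdout."""
    calibration: frozenset
    holdout: frozenset


def designate_split(question_ids, holdout_ids):
    """Fix the holdout set before any calibration takes place."""
    questions = frozenset(question_ids)
    holdout = frozenset(holdout_ids)
    unknown = holdout - questions
    if unknown:
        raise ValueError(f"unknown holdout questions: {sorted(unknown)}")
    return QuestionSplit(calibration=questions - holdout, holdout=holdout)


def attach_topic_anchors(split, anchors):
    """Return the anchors to apply, refusing any that target a holdout question."""
    leaked = set(anchors) & split.holdout
    if leaked:
        raise ValueError(f"topic anchors on holdout questions: {sorted(leaked)}")
    return {q: a for q, a in anchors.items() if q in split.calibration}


# Hypothetical 15-question study: 10 calibration questions, 5 holdout.
questions = [f"q{i:02d}" for i in range(1, 16)]
split = designate_split(questions, holdout_ids=questions[10:])
anchors = {q: f"anchor for {q}" for q in questions[:10]}
applied = attach_topic_anchors(split, anchors)
print(len(split.calibration), len(split.holdout), len(applied))  # 10 5 10
```

Freezing the split object and failing loudly on a leaked anchor makes the "zero topic anchors" condition checkable rather than a convention.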