Every prediction we make is logged. Every outcome is scored. Honest receipts for a discipline that lives or dies by them.
How well our calibrated populations replicate known attitude and behaviour distributions on benchmark surveys.
What we have shipped: 88.7% on Pew US benchmarks, 85.3% on Pew India (the first system to replicate India at population scale), 4.9× closer to the human self-consistency ceiling than the average frontier LLM across 5,878 SHA-256 verified API calls.
Whether a decision we predicted to succeed or fail in market actually did.
What we have shipped: structure for tracking. Data populates with each shipped engagement and Blind Replay. Distribution accuracy is necessary but not sufficient — outcome accuracy is the discipline's load-bearing claim, and it must be earned with every engagement.
A Blind Replay is a forecast made on a public 2024–25 consumer launch where the outcome is now known. We publish the prediction, the actual outcome, and the verdict — hit, miss, or partial. The empty seats are the point.
A platform that doesn't publish a scope statement is selling intuition, not engineering. Here's where calibration is live, partial, or absent.
● Live · ◐ Partial · ○ Not yet calibrated