The Experiment That Could Kill It
Description
The fastest way to respect an idea is to try to disprove it as cheaply as possible. For criterion sortition that test exists: it borrows a method psychologists have used for decades, and the first version could be run within a year. Everything rests on one empirical bet: that there are real human capacities you can raise only by genuinely acquiring them, where you can't fake the score without doing the work. The quantity to measure is the gap between how much a gaming regime raises a test score and how much the same effort raises real, out-of-sample performance. That gap is the part of the score gain that doesn't show up in reality, and you want it near zero. Psychometricians already do this: Daniel Koretz built score-inflation auditing to catch inflated school-test scores by comparing a high-stakes score against a low-stakes audit of the same skill; the divergence is the inflation. The clean version is a randomized trial (one group left alone, one genuinely trained, one coached only to beat the test) plus a bounty paid to whoever finds the most effective way to game the test, because gaming is something optimizers do, not something average students do. Notably, almost no one has actually randomized test-prep conditions before. Start with forecasting, because we know the most there: Tetlock's tournaments show that training which raises the score also raises real resolved accuracy, so the leak looks small. Reuse existing forecasting platforms, run the groups over one cycle of real questions, and put a number on it. That takes months, not years. And deploy small: not in parliaments but in oversight roles such as auditors and inspectors (Bagg's case is that random selection is best aimed at watching power), paired with the oldest accountability tools we have, the Athenian practice of scrutinizing randomly chosen officials before they served and auditing them afterward, with real penalties. The honest end is not a verdict but a gate: one measurement decides whether the lead merit is real. Does the score move without the skill? If it does, drop it; if it doesn't, you have a foundation and you build outward, one measured capacity at a time.
- Auditing for Score Inflation Using Self-Monitoring Assessments (Koretz), Educational Assessment: https://www.tandfonline.com/doi/abs/10.1080/10627197.2016.1236674
- Teaching to the test: coaching effects, and the random-assignment gap, JESLA: https://euroslajournal.org/articles/10.22599/jesla.74
- The Good Judgment Project, Wikipedia: https://en.wikipedia.org/wiki/The_Good_Judgment_Project
- Sortition as Anti-Corruption: Popular Oversight against Elite Capture (Bagg), AJPS: https://onlinelibrary.wiley.com/doi/full/10.1111/ajps.12704
- Athenian democracy (dokimasia and euthyna), Wikipedia: https://en.wikipedia.org/wiki/Athenian_democracy
Script
Cold open
The fastest way to respect an idea is to try to disprove it on purpose, as cheaply as you can. For this one, that test exists. It borrows a method psychologists have used for decades, and you could run the first version of it this year.
Frame
Everything we've built rests on one empirical bet: that there are real human capacities you can raise only by genuinely acquiring them — where you can't fake the score without doing the work. If that's true for even a few capacities, the menu has an anchor. If it's true for none, the whole idea collapses — and far better to learn that from a small study than from a large failure.
What's the single number you'd actually measure?
So here's the number to measure. Take one candidate merit. See how much a gaming regime raises the test — and how much that same effort raises real, out-of-sample performance. The gap between those two, the part of the score that goes up while reality doesn't, is what you want close to zero. And psychometricians already do this. Daniel Koretz built the method to catch inflated school-test scores: put the high-stakes score next to a low-stakes audit of the very same skill. Where they diverge, that's the inflation. That's the leak.
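One way to make that number concrete is the sketch below. It assumes both gains are expressed in the same units (here, standard deviations of the starting scores), so that the gain on the targeted test and the gain on the low-stakes audit can be subtracted. The function names and every figure are illustrative assumptions, not taken from Koretz's work.

```python
from statistics import mean, stdev


def standardized_gain(before, after):
    """Mean gain from 'before' to 'after', in units of the 'before' standard deviation."""
    return (mean(after) - mean(before)) / stdev(before)


def leak(test_before, test_after, audit_before, audit_after):
    """Gain on the targeted test that does not show up on the audit (hypothetical helper)."""
    return (standardized_gain(test_before, test_after)
            - standardized_gain(audit_before, audit_after))


# Made-up scores, purely for illustration.
test_before = [50, 55, 60, 65, 70]
test_after = [60, 65, 70, 75, 80]     # targeted test rises by 10 points
audit_before = [48, 54, 60, 66, 72]
audit_after = [50, 56, 62, 68, 74]    # audit of the same skill rises by 2 points

print(round(leak(test_before, test_after, audit_before, audit_after), 3))
```

A value near zero would mean the score gain is matched by a gain on the audit. A large positive value means the score moved and the skill mostly did not.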
How do you measure it cleanly?
How do you measure it cleanly? A trial. Three groups: one left alone, one genuinely trained in the skill, one coached only to beat the test. And then a bounty — real money to anyone who can find the most effective way to game it — because gaming is something optimizers do, not something average students do. Here's the surprising part: for all the decades of arguing about coaching, almost nobody has actually randomized this. If the gaming group's score jumps but their real performance doesn't, you've measured the leak. If you can't raise the score without raising the skill — that merit is solid.
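The three-group comparison can be sketched as follows. This is a minimal illustration, assuming each participant has a test score and a separate real-performance measure on a comparable scale. The arm names, the data layout, and the numbers are all hypothetical, and a real analysis would also need uncertainty estimates.

```python
from statistics import mean


def arm_effects(data, control="control"):
    """Compare each treated arm with the control arm.

    data maps an arm name to a list of (test_score, real_performance) pairs.
    Returns, per treated arm, the lift on the test, the lift on real
    performance, and the leak (test lift minus real lift).
    """
    base_test = mean(t for t, _ in data[control])
    base_real = mean(r for _, r in data[control])
    effects = {}
    for arm, rows in data.items():
        if arm == control:
            continue
        test_lift = mean(t for t, _ in rows) - base_test
        real_lift = mean(r for _, r in rows) - base_real
        effects[arm] = {
            "test_lift": test_lift,
            "real_lift": real_lift,
            "leak": test_lift - real_lift,
        }
    return effects


# Made-up data: (test score, real out-of-sample performance) per participant.
data = {
    "control": [(50, 50), (52, 51), (48, 49)],
    "trained": [(60, 59), (62, 61), (58, 60)],
    "coached": [(61, 51), (59, 50), (60, 52)],
}

for arm, result in arm_effects(data).items():
    print(arm, result)
```

In this invented example the trained group gains equally on both measures, while the coached group gains on the test and barely at all in real performance. That second pattern is what a leak would look like.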
Where do you start, and why there?
Where do you start? Forecasting — because we already know the most there. Tetlock's tournaments found that training which lifts the score also lifts real, resolved accuracy on questions nobody had seen. The leak, for forecasting, looks small. So reuse the forecasting platforms that already exist, run the three groups across one cycle of real questions, and put an actual number on it. This is cheap. Months, not years.
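For forecasting, "real, resolved accuracy" needs a scoring rule. Tournaments of this kind are commonly scored with the Brier score. The sketch below assumes binary questions and the simple form that ranges from 0 (perfect) to 1 (confidently wrong every time). The function name and the example forecasts are illustrative.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Made-up forecasts on three resolved questions (1 = the event happened).
forecasts = [0.9, 0.2, 0.7]
outcomes = [1, 0, 1]
print(round(brier_score(forecasts, outcomes), 4))
```

Running each of the three groups through the same set of real questions and comparing their average Brier scores would give the real-performance side of the comparison.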
And when you first put it into the world, where?
And when you first put it into the world — don't reach for parliaments. Use it where the cost of a mistake is low and the fit is best: oversight. Auditors, inspectors — the people who watch power rather than wield it, which is exactly where Samuel Bagg argues random selection belongs. Then pair it with the oldest accountability tools we have. Athens scrutinized its randomly chosen officials before they served, and audited them after — with real penalties. Small. Reversible. Watched.
Turn
So the honest end of all this isn't a verdict. It's a gate. One measurement decides whether the lead merit is real: does the score move without the skill? If it does — drop it. If it doesn't, you've got a foundation, and you build outward, one measured capacity at a time. Notice what this refuses to do. It does not ask you to believe the idea. It asks you to run the single test that could destroy it, and then to act on whatever comes back. The whole design was arranged, from the start, so that one experiment can falsify its first claim.
Closer
We started at a wall. You cannot build a definition of merit that can't be gamed. We end somewhere narrower, and far more useful — a specific, testable bet: that you can make gaming expensive, and the cost of trying worth paying, with a cheap experiment that tells you whether the bet holds. Not a system to believe in. A question precise enough to answer. Which is the only kind worth asking.