Measure More Things
Description
Once you've watched a single-number measure of merit get gamed, the instinct is to measure more: the whole person, not one score. This episode argues that the richer fixed definition fails for a precise reason: not that the dimensions are wrong, but that all of them stand still. Holistic, multi-factor admissions are the richest fixed definition we've built, and they produced a multi-front arms race (essay consultants, curated activities, the manufactured well-rounded applicant), because each added dimension is one more thing the resourced can prepare for, and coaching reliably inflates scores. Measuring judgment rather than knowledge helps only modestly: situational-judgment tests predict performance with useful, generalizable, but unspectacular validity (around 0.34), and the moment one gates a prize, the prep industry appears. The most hopeful fix, that cognitive diversity itself out-reasons narrow expertise (Landemore), rests on the 'diversity trumps ability' theorem, which mathematicians have shown holds only under fragile assumptions, so the hope is real but the math is not load-bearing. What every failed fix shares is visible in the structure of Goodhart's law (its regressional, extremal, causal, and adversarial forms, including that the ablest gamers exploit a proxy best): each assumes a target that holds still long enough to be studied. The missing move is therefore not a better or wider definition of merit but a definition that does not sit still, one you cannot study in advance because it isn't fixed until the moment it is used.
- Teaching to the test: coaching effects on proficiency scores. JESLA. https://euroslajournal.org/articles/10.22599/jesla.74
- Situational judgement test validity for selection: a systematic review and meta-analysis. Medical Education. https://asmepublications.onlinelibrary.wiley.com/doi/full/10.1111/medu.14201
- Deliberation, Cognitive Diversity, and Democratic Inclusiveness (Landemore). Synthese. https://link.springer.com/article/10.1007/s11229-012-0062-6
- Fatal errors and misuse of mathematics in the Hong-Page Theorem. arXiv. https://arxiv.org/html/2307.04709
- The Problem with Metrics / Goodhart failure taxonomy. arXiv. https://arxiv.org/pdf/2002.08512
Script
Cold open
The obvious repair, once you've watched merit get gamed, is to measure more. Don't rank people on one number — look at the whole person. The grades, and the essay, and the interview, the leadership, the character. It feels like the grown-up answer. It is also how you get a coaching industry for each new thing you decided to measure.
Frame
Last time we hit a wall: any filter for competence can be gamed, and the only un-gameable selection is the one that filters for nothing. The natural response is to make the filter richer: more dimensions, more nuance, harder to fake. This episode is about why that doesn't work. And the reason is precise. It isn't that we picked the wrong dimensions. It's that all of them sit still.
What happens when you measure the whole person?
Take the richest fixed definition of merit we've actually built: holistic admissions. Don't reduce a person to a score; weigh everything. And the result was an arms race on every front at once. Essay consultants. Curated extracurriculars. The carefully manufactured well-rounded teenager. Coaching reliably lifts the measured score without lifting the thing it was supposed to measure. Each dimension you add is one more thing the people with resources can prepare for, and the people without them can't. The whole person became a whole-person checklist.
Doesn't measuring judgment instead of knowledge fix it?
Fine, measure judgment, not just knowledge. Surely that's harder to fake. Partly true. Situational-judgment tests do predict real performance, and the effect holds across settings, but its size is modest: a correlation of about a third, on a scale where one would be perfect prediction. Useful. Not un-gameable. And the moment a test like that becomes the gate, the preparation grows up around it, the same way it always does. A better dimension is still a dimension with a prize sitting on it.
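To put "about a third" in concrete terms, here is a minimal Python sketch. It assumes the figure is a correlation of roughly 0.34 (the value quoted in the episode description) and that test scores and performance are both standardized and linearly related. The function names are hypothetical, chosen only for illustration.

```python
# Illustrative arithmetic only. r = 0.34 is the validity figure quoted in the
# episode description; the function names below are hypothetical.


def variance_explained(r: float) -> float:
    """Share of the variance in performance that the test accounts for (r squared)."""
    return r * r


def expected_performance_z(test_z: float, r: float) -> float:
    """Expected performance, in standard deviations from the mean, for a
    candidate who scores test_z standard deviations from the mean on the test.
    Assumes standardized scores and a linear relationship."""
    return r * test_z


if __name__ == "__main__":
    r = 0.34
    print(f"variance explained: {variance_explained(r):.1%}")
    print(f"expected performance at +2 SD on the test: "
          f"{expected_performance_z(2.0, r):+.2f} SD")
```

Under these assumptions, a correlation of 0.34 accounts for roughly 12 percent of the variance in performance, and a candidate two standard deviations above the mean on the test is expected to land about 0.68 standard deviations above the mean in performance. That is the sense in which the test is useful but leaves most of the outcome unexplained.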
What about the hope that diversity itself out-reasons expertise?
Here's the most hopeful version. Maybe variety itself is the fix — a wide, mixed group out-reasons a narrow room of experts. Helene Landemore makes exactly that case for opening selection up. It may well be right. But the theorem people quote to prove it — 'diversity trumps ability' — has been picked apart by mathematicians, who showed it holds only under fragile, convenient assumptions. The hope is real. The proof isn't. Don't build the floor out of it.
What do all these fixes secretly share?
So step back and ask what all of these fixes share. Goodhart's law isn't one failure — it's a catalogue: the proxy ignores what it left out, it snaps at the extremes, and, worst, the ablest people game it best. Now notice the thread running through every entry. Each one assumes a target that holds still long enough to be studied. Holistic, multi-factor, judgment, diversity — every one of them, fixed. And you can always learn a target that stands still.
Turn
That's the thing the failures have been pointing at the whole time. The problem was never that our definition of merit was too narrow. We widened it — as far as it would go. The problem is that it stands still. A fixed definition, however rich, is a target. And a target that doesn't move can be studied, prepared for, and captured by whoever can afford the studying. Every repair so far has changed what we measure. Not one of them changed the single property the gamers actually depend on — that you can know, ahead of time, what's going to count.
Closer
Which finally puts a name to the missing move. Not a better definition of merit. Not more dimensions bolted on. A definition that doesn't sit still — one you can't study in advance, because it isn't fixed until the moment it's used. It sounds like a dodge, or a trick. It is neither. It's the one property none of the fixes had. Next: what it actually means to let the definition move.