Research, analysis, and perspectives on AI capability.
Same model, different harness, fifteen points of benchmark. Opus 4.7 launched yesterday and landed twelve points below Opus 4.6-on-a-better-harness on the same benchmark. The AI industry has learned that the scaffolding around a model does most of the work. It has not yet noticed the same is true of the humans.
If a harness around the human is what's missing, the engineering question is what one actually looks like. Item response theory has been building exactly this kind of scaffolding since the 1970s. The mathematics doesn't care whether the thing being assessed is a nurse, a child, or Claude Opus 4.7.