Anthropic trains Claude copies to teach a stronger model — can AI learn to align AI?

April 14, 2026

What the team built

Anthropic has run an experiment that reads like a meta-science project: give several copies of a large language model the tools of a researcher and see whether they can invent alignment techniques. The company started with nine copies of Claude Opus 4.6, each placed in its own sandbox and given access to a shared forum, code storage, and a remote scoring server that returned a "performance gap recovered" (PGR) metric. The task they worked on is deliberately simple and a bit clever: a weaker model acts as the teacher, offering guidance to a stronger base model, and the question is whether the stronger model can interpret and amplify that weaker signal into real gains.

The experiment and the stakes

This is an exercise in "weak-to-strong supervision," a concrete proxy for the thorny problem of scalable oversight: how do we keep AI aligned as it outpaces human expertise? Anthropic asked each Automated Alignment Researcher (AAR) to propose, run, and analyze its own experiments with minimal direction; prompts nudged them toward different approaches (interpretability, data reweighting, etc.) but didn’t script the work. The PGR score measures how much of the performance gap between the weak teacher and the strong upper bound an AAR’s method recovers: a score of 0 means the stronger model does no better than its weak teacher, and a score of 1 means the gap is fully closed. Simple on paper. High stakes in practice. If models can meaningfully bootstrap alignment research, that changes the tempo of safety work, for better or worse.
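To make the metric concrete, here is a minimal sketch of how a PGR score is usually computed in weak-to-strong work: the improvement over the weak teacher divided by the full gap to the strong upper bound. It assumes Anthropic's metric follows that common definition; the function name and the accuracy figures are hypothetical and chosen only to illustrate the arithmetic.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised strong model.

    weak            -- score of the weak teacher on its own
    weak_to_strong  -- score of the strong model trained on the weak teacher's signal
    strong_ceiling  -- score of the strong model trained with ground-truth supervision
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # Without a positive gap there is nothing to recover and the ratio is undefined.
        raise ValueError("strong ceiling must exceed the weak teacher's score")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, not results from the experiment.
weak_acc = 0.60
ceiling_acc = 0.80
for w2s_acc in (0.60, 0.70, 0.80):
    pgr = performance_gap_recovered(weak_acc, w2s_acc, ceiling_acc)
    print(f"weak-to-strong accuracy {w2s_acc:.2f} gives PGR {pgr:.2f}")
```

Under this definition a method can also score below 0 (the strong model ends up worse than its teacher) or above 1 (it beats the ground-truth-trained ceiling), so the sketch deliberately does not clamp the result.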

Early takeaways: cautious, not celebratory

Anthropic reports that its AARs autonomously generated ideas and ran experiments that in some cases improved PGR. Some strategies the Claudes discovered produced measurable uplift; other trials fell flat or simply replicated the teacher’s limitations. The headline: AI can accelerate aspects of alignment research, but it doesn’t magically solve scalable oversight. It is like watching a promising intern find a trick that saves hours of grunt work: useful, but not a replacement for judgment.

Why you should care

Why does this matter beyond lab curiosity? Because frontier models are already writing vast swaths of code and shaping systems humans must later check. If future models are much smarter than us, we’ll need techniques that let weaker signals (including human feedback) be amplified reliably. Anthropic’s work is an early, practical probe of that idea. It’s a start, not the finish line. Still, the image is striking: machines nudging machines toward our goals. Exciting? Absolutely. A little nerve-wracking? You tell me.

Sources: anthropic.com