Qwen’s tiny local model beat Anthropic’s Opus at the “pelican on a bicycle” test — and folks are surprised

A silly benchmark, serious implications
It has been reported that Simon Willison’s long-running joke benchmark — asking models to draw a pelican riding a bicycle — favored Alibaba’s Qwen3.6-35B-A3B over Anthropic’s new Claude Opus 4.7. Willison allegedly ran a 20.9GB quantized Qwen3.6-35B-A3B-UD-Q4_K_S.gguf model by Unsloth on a MacBook Pro M5 via LM Studio (with the llm-lmstudio plugin) and got a cleaner bicycle frame than he did from Opus 4.7. Short and sweet: Qwen drew the bike right. Opus, not so much.
A rematch and a surprise encore
Willison reportedly tried Opus again with thinking_level set to max; it didn’t close the gap. He then burned a “secret backup” test — an SVG flamingo on a unicycle — and again gave the nod to Qwen, partly because of a cheeky SVG comment in the output (). These are clearly small, idiosyncratic tasks. Still, they’re oddly revealing. A 21GB quantized model running locally producing better art than a state-of-the-art cloud model? That’s a headline-grabber.
Why this matters (and why to take it with salt)
The pelican benchmark started as a gag. But Willison notes there’s been a loose correlation between how pleasing the pelicans looked and a model’s general usefulness. The correlation held through several releases — until today. He also cautions that he “very much doubt[s]” a 21GB quantized Qwen is truly more capable overall than Anthropic’s latest proprietary Opus. Fair point. Anecdote ≠ benchmark suite.
Bigger picture: progress on laptops, questions about evaluation
Still — this is a reminder that quantized, efficient models running on consumer hardware are getting surprisingly competent. Cute SVGs aside, the episode raises a pragmatic question: how should we evaluate models when tiny local setups sometimes outdraw cloud behemoths? Is it just luck, or are we witnessing steady wins for optimization and accessibility? Either way, the pelican joke keeps flying — and it’s become unexpectedly good at flapping wings in the right direction.
Sources: simonwillison.net, Hacker News
Comments