The M×N problem of tool calling and open-source models

April 14, 2026
An east Asian woman expressing frustration during a phone call at the workplace.
Photo by Felicity Tai on Pexels

One smooth API — until you switch the engine

Tool calling with closed-source models feels like magic. You hand the API a list of functions, the model invokes them, and you get back tidy JSON. No muss, no fuss. Then you plug in an open model and the magic unravels. Suddenly the wire format — the exact token marks and serialization the model emits — matters. Different models choose different conventions. Different token vocabularies, different boundary markers, different ways to encode arguments. The result: the same semantic action can look completely different across models.

A Tower of Babel in model output

Want an example? The same logical call — search(query="GPU") — can be wrapped as a channel-commentary envelope in one family, fenced triple-quoted JSON in another, or as XML-like arg_key/arg_value pairs in a third. In short: incompatible wire formats. That means every runtime or inference engine that wants to support N models ends up writing M custom parsers — M engines × N model formats. Parsers must not just find markers; they must survive decoder quirks, reasoning tokens leaking into arguments, and end-of-generation collisions. Painful. Predictable. Exhausting.

Bugs, PRs, and an endless maintenance treadmill

It has been reported that these are not hypothetical headaches but real maintenance drama: reasoning tokens stripped too early, parser-visible leaks, and dedicated implementations replacing attempted generic autoparsers. Allegedly, projects like vLLM and llama.cpp have had to ship model-specific workarounds to keep things functioning. The pace of model releases — and the fact that wire format is a training-time, unconstrained choice — means a long tail of tricky, per-model bugs. Generic heuristics help with the common cases, sure, but they can’t close the gap where the nastiest failures hide.

Two teams, one truth — and no shared map

Here’s the kicker: the same format knowledge is needed in two places. Grammar engines must constrain generation so the model emits the right envelope at the right time. Output parsers must reverse that process and produce clean API responses. Different teams, different repos, different release schedules. No single source of truth. So what’s the fix? A shared convention or model-published grammar metadata would help — or developers will keep reinventing parsers and nursing brittle glue. Is that where we want the open-model future to go? It’s a maintenance nightmare waiting to happen, unless the community decides to speak the same language.

Sources: thetypicalset.com, Hacker News