From 6.75% to 99.8%: What Type-Constrained Verification Delivers

Thiago Victorino

Qwen 3.5 can call functions correctly on its first attempt 6.75% of the time.

That number should end any conversation about deploying LLM function calling without verification. But it does not end the conversation about function calling itself. Because with a type-constrained harness wrapping the same model, that number becomes 99.8%.

Not a different model. Not fine-tuning. Not a larger context window. A compiler.

We have documented the 42% to 78% performance swing from harness differences on general coding tasks. We have named the discipline and traced its lineage. This article extends both with a specific, measurable claim: for function calling, the harness tax is negative. The verification layer costs less than the failures it prevents.

The AutoBe Data

Jeongho Nam and the Wrtn Technologies team ran five Qwen models through a function calling benchmark. The baseline results are brutal. Qwen 3.5 at 6.75% first-try success. The other models were not much better.

Their solution was not prompt engineering. It was not few-shot examples. It was a TypeScript type system that acts simultaneously as schema definition, runtime validator, and structured prompt.

The type is the schema. The type is the validator. The type is the prompt. One artifact, three roles.

When the LLM generates a function call, the harness compiles it against the type schema. If it fails compilation, the error message goes back to the model as structured feedback. The model tries again with the compiler output as context. This loop runs until the call compiles or a retry limit is hit.

The result across five Qwen models: 99.8%+ success.

This is not an incremental improvement. It is a category change. The model went from unusable to production-grade without any change to its weights.

Why This Works Better Than Prompt Engineering

The instinct when function calling fails is to add more instructions. More examples. Longer system prompts. Detailed parameter descriptions. This approach has a ceiling, and the ceiling is low.

NESTFUL, published at EMNLP 2025, tested GPT-4o on nested tool calls. Success rate: 28%. JSONSchemaBench at ICLR 2025 tested frontier models on complex JSON schema generation. Coverage: 3% to 41%. These are the best models available, with the best prompts their creators could write. The failure rates are still catastrophic for production use.

The reason is that prompt engineering treats the model as the entire system. You are asking a probabilistic text generator to produce deterministically correct structured output. Sometimes it does. Mostly it does not.

Type-constrained verification treats the model as one component in a pipeline. The model proposes. The compiler verifies. Failed proposals get structured feedback. The model proposes again. Each iteration narrows the error space because the compiler output is specific: “parameter X expected type Y, received type Z.”

Compare that to a prompt instruction: “make sure all parameters are correctly typed.” One is actionable. The other is a wish.

The Pretext Pattern

Nikola Balic’s analysis of Pretext, a TypeScript text measurement tool built by Cheng Lou, surfaces a complementary principle: architectural constraints as verification.

Pretext separates computation into two phases. The prepare() function can be expensive. It accesses the DOM, measures fonts, does layout calculations. The layout() function must be arithmetic only. No DOM access. No side effects. Pure computation on numbers.

This is not a suggestion. It is enforced by the type system. If layout() tries to access the browser, it will not compile.
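A minimal sketch of the idea, with illustrative names rather than Pretext's actual API: prepare() may touch the environment, while layout()'s signature admits only plain numbers, so the phase boundary is visible in the types.

```typescript
// Phase boundary expressed as a type: layout only ever sees these numbers.
type Measured = { charWidth: number; lineHeight: number };

// Phase 1: may be expensive and may touch the environment (DOM, fonts, ...).
// The measurement function is injected, so this sketch runs without a browser.
function prepare(measureChar: () => number): Measured {
  const w = measureChar();
  return { charWidth: w, lineHeight: 1.4 * w };
}

// Phase 2: arithmetic only. Numbers in, numbers out. In a DOM-free build,
// a version of this function that reached for `document` would not compile.
function layout(m: Measured, text: string): { width: number; height: number } {
  return { width: text.length * m.charWidth, height: m.lineHeight };
}
```

The enforcement comes from what the signature excludes: layout() receives no handle to the browser, so there is nothing to misuse.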

The constraint creates a verification layer that is invisible at runtime but absolute at build time. You cannot ship code that violates the architectural boundary. The types prevent it.

Balic’s observation is precise: “The AI does not make the engineering rigorous. The loop does.” The constrain-measure-isolate-classify-test cycle produces reliable output regardless of whether a human or an LLM writes the code. The rigor lives in the system, not in the author.

This is the same principle as AutoBe’s function calling harness, applied to a different domain. Types as constraints. Compilers as validators. Structured feedback as the correction mechanism.

The Harness Tax Is Negative

There is a persistent belief that adding verification layers slows down development. You have to write the types. You have to build the compiler integration. You have to handle the feedback loop. This is the “harness tax,” and teams treat it as overhead.

The math does not support this.

At 6.75% first-try success, you need roughly 15 attempts per successful function call. Each attempt costs tokens, latency, and often requires human review when the output looks plausible but is subtly wrong. The cost of failure is not just the retry. It is the debugging when a malformed function call produces a downstream error three steps later.

At 99.8%, you need roughly 1.002 attempts per successful call. The verification loop adds a compilation step and occasionally a second model call. The net cost is lower than the baseline, not higher.
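The arithmetic behind those two numbers is the expected value of a geometric distribution, assuming retries are independent:

```typescript
// Expected attempts per successful call: E[attempts] = 1 / p,
// where p is the probability of success on a single try.
const attemptsAt = (p: number): number => 1 / p;

const baseline = attemptsAt(0.0675); // ~14.8 attempts per success
const verified = attemptsAt(0.998);  // ~1.002 attempts per success
```

Independence is a simplifying assumption; with structured feedback, later attempts are actually more likely to succeed, which makes the verified case look even better than this model suggests.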

The tax is negative. You pay less with verification than without it. The investment in types, schemas, and compilers returns more than it costs on the first function call, not the hundredth.

This explains why the organizations with the most sophisticated harnesses are also the fastest. As we explored in What Is an Agent Harness?, the harness is not a brake. It is the road surface. Better surface, higher speed, fewer crashes.

What This Means for Agent Architecture

If you are building agents that call functions, APIs, or tools, the evidence points to a specific architectural recommendation: put a type system between the model and the outside world.

Not a regex validator. Not a JSON schema checker that runs after generation. A type system that participates in the generation loop. The model generates. The type system checks. Failures produce structured error messages. The model regenerates with those errors as context.

This is what generator-evaluator loops look like when the evaluator is a compiler. The pattern is not new. The application to LLM function calling is.

Three implementation principles emerge from the data.

First, types should be the single source of truth. Do not maintain separate schema definitions, validation rules, and prompt descriptions. Derive all three from the type. When the type changes, everything changes. When the type is correct, everything is correct.
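The single-source-of-truth principle can be sketched as follows. This is illustrative, not AutoBe's implementation: one field spec, with both the runtime validator and the prompt text derived from it, so the two cannot drift apart.

```typescript
// One artifact: a field spec. Validator and prompt are both derived from it.
type FieldSpec = {
  name: string;
  type: "string" | "number";
  description: string;
};

const spec: FieldSpec[] = [
  { name: "query", type: "string", description: "search text" },
  { name: "limit", type: "number", description: "max results" },
];

// Derived validator: structured errors in the compiler's voice.
function validate(args: Record<string, unknown>): string[] {
  return spec
    .filter((f) => typeof args[f.name] !== f.type)
    .map(
      (f) =>
        `parameter ${f.name}: expected ${f.type}, received ${typeof args[f.name]}`,
    );
}

// Derived prompt fragment: the same spec, rendered for the model.
const promptLines = spec.map((f) => `- ${f.name} (${f.type}): ${f.description}`);
```

Change the spec and both the validation rules and the prompt update together; there is no second definition to forget.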

Second, compiler errors are better prompts than natural language instructions. “Expected string, received number at parameter config.timeout” gives the model more actionable information than “please ensure all parameters match their expected types.” Precision beats eloquence in feedback loops.

Third, the retry budget should be small. AutoBe’s data shows most corrections happen in one or two iterations. If the model cannot produce a valid call in three tries, additional attempts have diminishing returns. Fail fast, escalate early.

The Uncomfortable Implication

The 6.75% baseline means that most function calling in production today is failing silently. Organizations shipping LLM-powered tools without type-constrained verification are operating at single-digit reliability. Some of those failures are caught by downstream error handling. Many are not.

The fix is known. The fix is measured. The fix works across multiple model families. And the fix costs less than the problem it solves.

Building the harness is not the expensive choice. Skipping it is.


This analysis synthesizes Function Calling Harness: From 6.75% to 100% (Jeongho Nam, Wrtn Technologies, March 2026), What Pretext Reinforced About AI Loops (Nikola Balic, March 2026), NESTFUL benchmark results (EMNLP 2025), and JSONSchemaBench coverage data (ICLR 2025).

Victorino Group builds the verification and governance layers that turn probabilistic model output into production-grade reliability. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
