AI Reads Text. It Guesses Charts.

Thiago Victorino

A team at Mercor gave frontier AI models 25 tasks pulled from real financial documents. Investor decks, earnings reports, analyst presentations. The kind of material that hits an executive’s desk before a capital allocation decision.

When the tasks involved reading text, the models performed about how you would expect. Gemini 3.1 Pro scored 80%. Claude Opus 4.6 scored 76%. GPT-5.4 came in at 72%. Reasonable. Not perfect, but within the range where a human reviewer could catch errors and maintain a productive workflow.

Then they switched to charts.

The same models, the same documents, the same financial questions. Gemini dropped to 64%. Claude dropped to 56%. GPT dropped to 56%. Across the board, accuracy fell 16 to 20 percentage points the moment the information moved from text to a visual format.

A sanity check confirmed this was not memorization. When models were tested with no document at all (just parametric knowledge), accuracy hovered between 0% and 4%. The models were genuinely reading the documents. They were reading the text portions competently. And they were guessing at the charts.

The Failure Taxonomy

Mercor’s researchers categorized the failures into three types, and the distribution tells you more than the aggregate numbers.

The first type is visual extraction errors. The model looks at a bar chart and reads a value of $4.2 billion when the actual value is $3.8 billion. The reasoning that follows is correct. The arithmetic is sound. But the input was wrong, so the output is wrong. This is the most common failure mode, and the most insidious, because everything downstream of the extraction looks right. The logic checks out. The conclusion follows from the premises. The premise itself is fabricated from a misread axis label.

The second type is reasoning failures. The model extracts the correct values from the chart but applies the wrong formula, misidentifies what is being compared, or draws an inference the data does not support. These are easier to catch because the math visibly breaks. A human reviewer who knows what the right answer should look like will spot the error.

The third type compounds both. The model misreads the chart and then reasons incorrectly about the misread values. These produce outputs that are wrong in ways that resist detection, because neither the input nor the logic maps to anything a reviewer would expect.

The researchers put it plainly: “The model knew exactly how to compute the final value but couldn’t reliably read the value off the chart.”

This is not a limitation of one model or one provider. Every frontier model tested showed the same pattern. The wall is not vendor-specific. It is structural.

Why This Matters for Finance

Consider what financial documents actually look like. An investor deck is not a text file. It is a designed artifact. Revenue trends live in line charts. Market share sits in pie charts. Competitive positioning is conveyed through quadrant diagrams. Capital structure is visualized in waterfall charts. The text on the slides provides context, but the data lives in the visuals.

When we analyzed CEMEX’s Luca agent, we noted that its 82% accuracy on financial analysis should give executives pause. One in five analytical outputs contained errors. The Mercor data adds a dimension to that concern. If a financial AI agent can read text at 76% accuracy but charts at 56%, the question becomes: what percentage of the queries it handles involve visual data? Because the aggregate accuracy number obscures a bimodal distribution. The agent is competent at one type of input and unreliable at another.

Most financial documents mix both types freely. A single slide might have three bullet points of text and two charts. The AI reads the bullets well and fumbles the charts. The resulting analysis blends accurate extraction with fabricated numbers, and the blend is seamless. Nothing in the output flags which parts came from text (relatively reliable) and which came from charts (coin-flip territory).
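The way an aggregate number hides this bimodal distribution is easy to make concrete. A minimal sketch, using the text and chart accuracies reported in the study; the query mix is a hypothetical assumption, not something Mercor measured:

```python
# Accuracies from the Mercor results cited above (Claude's scores).
# The chart_share values below are hypothetical workload mixes.
TEXT_ACCURACY = 0.76
CHART_ACCURACY = 0.56

def aggregate_accuracy(chart_share: float) -> float:
    """Blended accuracy when `chart_share` of queries involve charts."""
    return (1 - chart_share) * TEXT_ACCURACY + chart_share * CHART_ACCURACY

# The same model reports a very different "accuracy" depending on the mix.
for share in (0.0, 0.25, 0.5, 0.75):
    print(f"chart share {share:.0%}: aggregate accuracy {aggregate_accuracy(share):.1%}")
```

A dashboard showing only the blended number cannot tell you whether you are looking at a reliable text workload or a chart-heavy one wearing a respectable average.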

This is the domain competence wall. Not a gradual decline in capability as tasks get harder. A discrete drop when the input format changes. The model does not get slightly worse at charts. It loses a fifth of its accuracy in one step.

The Infrastructure Corollary

The competence wall is not just about accuracy. It shows up in infrastructure too.

Gergely Orosz published a detailed analysis of GitHub’s reliability this month. The numbers are stark. GitHub has been running at roughly 90% availability. That is one nine. Industry standard for developer infrastructure is four nines (99.99%). GitHub is delivering one.

Three outage days per thirty. Not minutes of degradation. Days.
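The arithmetic behind "one nine versus four nines" is worth spelling out, since the gap is larger than the percentages suggest. A quick sketch of the downtime each availability level implies over a 30-day window:

```python
def downtime_per_30_days(availability: float) -> float:
    """Expected downtime in hours over a 30-day window."""
    return (1 - availability) * 30 * 24

# One nine vs four nines: the gap the article describes.
print(f"90%    availability: {downtime_per_30_days(0.90):.1f} hours down (~3 days)")
print(f"99.99% availability: {downtime_per_30_days(0.9999):.2f} hours down (~4 minutes)")
```

Ninety percent sounds close to a hundred. In downtime terms it is three days a month instead of four minutes.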

The cause is not a mystery. Claude Code contributions to GitHub increased sixfold in three months. A project called Pierre Computer sustained peaks above 15,000 repository creations per minute, generating over nine million repositories in thirty days. Agent traffic is overwhelming infrastructure that was designed for human developers typing at human speeds.

As we documented in The Agent Operations Paradox, the promise of AI agents creating more output runs headfirst into the reality that infrastructure was not built for agent-scale throughput. GitHub’s degradation is that paradox made concrete. More agents produced more code, which produced more load, which produced less reliability for everyone, including the agents themselves.

The connection to the Mercor data is this: both findings describe the same phenomenon at different layers of the stack. At the application layer, AI models hit a competence wall when inputs move from text to visual formats. At the infrastructure layer, platforms hit a capacity wall when users move from humans to agents. Both walls were invisible in benchmarks and demos. Both became obvious in production.

The Benchmark Problem, Again

Mercor’s study used 25 tasks with 50 evaluations per model. That is directionally useful but not statistically powerful. The confidence intervals are wide. A skeptic could argue that 25 financial tasks do not represent the full distribution of real-world financial analysis.

The skeptic would be right about sample size and wrong about signal. Twenty-five tasks is small. A 16-to-20 percentage point drop that appears consistently across three competing models from three different companies is not random variation. It is a structural finding. The specific numbers will shift with larger samples. The direction will not.
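How wide are those confidence intervals? A rough sketch using a Wilson score interval on the reported 56% chart accuracy over 50 evaluations (the interval method is our choice for illustration, not something the study specifies):

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 56% chart accuracy over 50 evaluations: a wide interval, as the text notes.
lo, hi = wilson_interval(0.56, 50)
print(f"chart accuracy 56% (n=50): 95% CI roughly {lo:.0%} to {hi:.0%}")
```

The interval spans tens of percentage points, which is exactly why the specific numbers will move with larger samples while a consistent cross-model drop of 16 to 20 points still registers as signal.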

Mercor also has a conflict of interest worth naming. They operate an AI talent marketplace. Research showing AI fails at specialized tasks supports their business model (human experts remain necessary). This does not invalidate the data. The methodology is transparent, the models are named, and anyone with API access can reproduce the experiment. But the incentive structure deserves acknowledgment. As we explored in The Benchmark Paradox, who designs the test shapes what the test reveals.

GitHub’s availability numbers carry a similar caveat. Gergely Orosz derived the 90% figure from public incident data and status page records. GitHub has not published an official SLA number. The derived figure is an estimate, not a measurement. It could be worse. It could be slightly better. The pattern of frequent, significant outages is not in dispute.

What the Wall Means for Governance

Organizations deploying AI in finance (or any domain where visual data matters) need to confront a specific question: does our governance framework account for input-format-dependent accuracy?

Most do not. Most AI governance frameworks treat accuracy as a single number. The model is 80% accurate, or 76%, or whatever the aggregate score says. The Mercor data shows that aggregate accuracy is misleading when the model’s performance varies dramatically by input type.

A governance framework that accounts for this would look different in three ways.

First, it would classify inputs before routing them. Text-heavy queries go through one validation path. Chart-heavy queries go through a stricter one, with mandatory human review of extracted values. This is not a technical limitation to work around. It is an operational discipline to build.
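A minimal sketch of what that routing discipline could look like in code. Everything here is a hypothetical illustration: the input classifier, the path names, and the review policies are assumptions, not part of the Mercor study or any shipped framework:

```python
from enum import Enum

class InputType(Enum):
    TEXT = "text"
    CHART = "chart"
    MIXED = "mixed"

# Hypothetical policy: anything containing a chart takes the stricter path.
def validation_path(input_type: InputType) -> str:
    """Route chart-bearing inputs to mandatory human review of extracted values."""
    if input_type is InputType.TEXT:
        return "standard validation, sampled human review"
    return "strict validation, mandatory human review of extracted values"

print(validation_path(InputType.CHART))
```

The point is not the classifier, which is the easy part. The point is that the routing decision exists at all, so chart-derived numbers never reach a decision-maker without a human having checked the extraction.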

Second, it would measure accuracy by input type, not in aggregate. A dashboard showing “model accuracy: 74%” is less useful than one showing “text accuracy: 76%, chart accuracy: 56%, mixed accuracy: 64%.” The disaggregated view reveals where human oversight is non-negotiable versus where automation can run with lighter supervision.

Third, it would design for the failure mode, not just the failure rate. Visual extraction errors produce plausible-looking wrong answers. The governance response to plausible-looking wrong answers is fundamentally different from the response to obviously broken outputs. You need domain experts reviewing the numbers, not just format validators checking the output structure.

The infrastructure side demands its own governance layer. If your AI agents depend on GitHub for code storage and CI/CD, a platform running at 90% availability means your agent workforce has a 10% failure rate before it writes a single line of code. That is not an edge case to plan for. That is a baseline operating condition to build around.
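The compounding is multiplicative: platform failures and model failures stack. A sketch using the article's 90% platform availability figure; the agent task-success rate is a hypothetical assumption:

```python
# Even a capable agent inherits its platform's failure rate.
# 0.90 platform availability follows the article; 0.76 task success is assumed.
def effective_success_rate(task_success: float, platform_availability: float) -> float:
    """Probability a run both reaches the platform and completes correctly."""
    return task_success * platform_availability

print(f"{effective_success_rate(0.76, 0.90):.1%}")
```

An agent that succeeds 76% of the time on a platform that is up 90% of the time delivers closer to two-thirds end to end, before any other dependency is counted.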

The Uncomfortable Pattern

The pattern across both findings is consistent. AI capabilities degrade at boundaries. The boundary between text and visual. The boundary between human-scale traffic and agent-scale traffic. The boundary between benchmark conditions and production conditions.

These are not the same boundary, but they share a property: they are invisible from the inside. If you only test AI on text, you never discover the chart problem. If you only measure GitHub uptime during low-traffic hours, you never discover the agent-traffic problem. If you only evaluate models on benchmarks, you never discover the production accuracy problem.

The competence wall is not a bug to be fixed in the next model release. It is a characteristic of the technology at this stage of maturity. Models that process language are fundamentally better at text than at visual data, for the same reason that humans, as visual processors, find a chart easier to read than a dense numerical table. The architecture has strengths, and those strengths create corresponding blind spots.

The organizations that will deploy AI effectively (in finance, in infrastructure management, in any domain with mixed-format data) are the ones that map these blind spots before they deploy. Not after the first wrong number reaches a board presentation. Not after the third outage disrupts a release cycle. Before.

The wall is already there. The question is whether you find it on your terms or on production’s terms.


This analysis synthesizes AI can’t read an investor deck (April 2026) and Does GitHub still merit “top Git platform” status? (April 2026).

Victorino Group helps enterprises govern AI where it meets domain complexity. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
