The Data Behind the Model: How Chart-Reasoning Datasets Are Exposing What AI Cannot Understand

Multimodal AI is expanding at a pace that outstrips the infrastructure built to support it. The global multimodal AI market, valued at roughly $2.5 billion in 2025, is projected to exceed $42 billion by 2034, with enterprises across finance, healthcare, and scientific research racing to deploy systems that can reason across text, images, and structured data simultaneously. Yet one of the most common visual formats in professional decision-making, the chart, remains among the least understood by the models tasked with interpreting it. Chart and data-visual comprehension sits in a stubborn blind spot within multimodal AI, not because of algorithmic failure, but because the training data combining image, code, table, and chain-of-thought reasoning at scale has not previously existed. That absence has quietly constrained AI performance in the very domains where visual data matters most.

This is the gap Zihan Wang has spent years working to close. Wang is the Co-founder and Chief Research Officer at Abaka AI and the Founder and Lead Director of the 2077AI open research foundation, a researcher whose work sits at the intersection of data creation, evaluation design, and public benchmarks. Through cross-institutional collaboration with MIT and the MIT-IBM Watson AI Lab, his research has produced open datasets and benchmarks now integrated into the training pipelines of major commercial foundation models, including IBM Granite Vision 4.0 and Microsoft’s Phi model family.

We spoke with Wang about why chart understanding has lagged behind other multimodal capabilities, how the absence of rigorous training data has shaped the field’s blind spots, and what it takes to build evaluation infrastructure that commercial AI systems actually rely on.

Chart comprehension seems like a basic capability for AI. Why has it remained so difficult for multimodal models?

The difficulty is less about the models themselves and more about what they were trained on.

The global AI training dataset market surpassed $3.5 billion in 2025, yet most of that investment flows toward text and natural image data. Image and video datasets account for over 40% of the training data market, but chart-specific data, the kind that pairs a rendered visualization with its underlying data table, source code, and reasoning chain, has been almost entirely absent at scale. Charts combine spatial layout, symbolic notation, color encoding, and numerical relationships into a single image. To interpret one correctly, a model needs to do more than recognize objects. It needs to parse axes, infer relationships between visual elements, extract precise values, and sometimes reverse-engineer the logic that produced the visualization in the first place.

Most training corpora treat charts as flat images with captions. That teaches a model to describe a chart in general terms, but not to reason about what the chart actually contains. The gap between describing and understanding is where most failures happen, and it shows up immediately when you move from curated benchmarks to real enterprise documents where chart layouts are messy, inconsistent, and embedded in complex page structures.

You’ve described this as a training data problem, not an architecture problem. Can you expand on that distinction?

Architecture improvements are necessary, but they are insufficient without the right data underneath them.

The Document AI market is projected to grow from $14.66 billion in 2025 to nearly $28 billion by 2030, driven by the demand for systems that can process visually complex, mixed-content documents. Multimodal document types, those combining text, tables, charts, and images on a single page, are the fastest-growing segment of that market. Yet the training infrastructure behind these systems has not kept pace. Most vision-language models learn chart interpretation from relatively small, homogeneous datasets that cover a narrow range of chart types and plotting conventions. When those models encounter charts rendered in unfamiliar styles, or charts embedded within dense financial reports or scientific papers, accuracy degrades quickly.

“The bottleneck was never the model’s capacity to learn. It was the absence of data that could teach it what chart reasoning actually looks like,” Wang explains. “If you train a model on captioned chart images, it learns to caption. If you train it on aligned code, data tables, rendered images, and reasoning chains together, it learns to reason. We had to build that second kind of dataset from scratch, because it did not exist.”

Your chart-reasoning dataset has been integrated into commercial foundation models. How did that come about?

It came from building something the field needed and releasing it openly.

Private AI investment hit a record $581 billion in 2025, more than doubling the prior year. Yet only a fraction of that capital reaches data creation and evaluation infrastructure. Most of it flows to compute, models, and applications. That imbalance creates a situation where well-funded teams are training sophisticated architectures on thin, undertested data layers. Through Abaka AI and 2077AI, and in cross-institutional collaboration with MIT, the MIT-IBM Watson AI Lab, and IBM Research, we developed ChartNet, a chart-reasoning dataset accepted to CVPR 2026 that contains 1.7 million diverse chart samples spanning 24 chart types and six plotting libraries. Each sample aligns five components: the plotting code that generated the chart, the rendered image, the underlying data table, a natural language summary, and question-answer pairs. That level of alignment gives models a deeply cross-modal view of what a chart means, not just what it looks like.
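To make the five-way alignment concrete, here is a minimal Python sketch of what one aligned sample could look like. It is an illustration only: the class name, field names, file path, and the `missing_components` helper are all hypothetical and are not taken from ChartNet's published schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ChartSample:
    """Hypothetical record layout; field names are illustrative, not ChartNet's schema."""
    plotting_code: str                      # source code that rendered the chart
    image_path: str                         # location of the rendered image
    data_table: List[Dict[str, object]]     # underlying rows behind the chart
    summary: str                            # natural language summary
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)


def missing_components(sample: ChartSample) -> List[str]:
    """Return the names of any of the five components that are empty."""
    parts = {
        "plotting_code": sample.plotting_code,
        "image_path": sample.image_path,
        "data_table": sample.data_table,
        "summary": sample.summary,
        "qa_pairs": sample.qa_pairs,
    }
    return [name for name, value in parts.items() if not value]


# A made-up example record with all five components present.
sample = ChartSample(
    plotting_code="import matplotlib.pyplot as plt\nplt.bar(['Q1', 'Q2'], [120, 150])",
    image_path="charts/example_0001.png",
    data_table=[{"quarter": "Q1", "revenue": 120}, {"quarter": "Q2", "revenue": 150}],
    summary="Revenue rose from 120 in Q1 to 150 in Q2.",
    qa_pairs=[("Which quarter had higher revenue?", "Q2")],
)
```

A check like `missing_components(sample)` is one simple way to confirm that a record carries every component before it is used for training. A captions-only corpus would, under this layout, populate only the image and the summary.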

The dataset was designed to be open, stress-tested in public, and immediately usable by other research teams. Its impact has extended well beyond the original project: the work was featured by MIT News, incorporated into IBM’s Granite Vision 4.0 and 4.1 training pipeline, and has informed infrastructure used by Microsoft’s Phi model family. That kind of third-party adoption is the strongest form of validation in this field. It shows that openly published research infrastructure, built through academic and industry collaboration, can directly influence how major commercial models learn to process visual data.

“When a company like IBM uses your dataset to train a flagship model, that tells you the work addressed a real gap,” Wang notes. “They had the resources to build their own data pipeline. They chose to build on ours because the alignment between code, image, and data table was something that did not exist elsewhere at that scale. That’s the difference between creating a dataset and creating research infrastructure.”

Beyond charts, how does this work connect to the broader challenge of AI evaluation?

Chart reasoning is one piece of a larger evaluation problem that the field still underestimates.

The business intelligence market is valued at nearly $35 billion in 2025, and more than 78% of global enterprises have implemented at least one BI or analytics platform. These organizations generate enormous volumes of charts, dashboards, and visual reports daily. When an AI system misreads a chart in a financial report or a clinical dashboard, the downstream consequences are not academic. They affect decisions, risk assessments, and resource allocation.

2077AI has published over 20 peer-reviewed papers at top-tier venues including CVPR and NeurIPS, turning evaluation design into a sustained research agenda rather than a one-time effort. Projects like OmniDocBench were designed around the messiness of real documents: diverse layouts, ambiguous structure, and multimodal signals that do not resolve cleanly. The chart-reasoning work extends that philosophy to a specific and underserved domain.

“Chart understanding is not a perception problem,” Wang says. “It is a reasoning problem. A system that can read the numbers off a chart is not the same as one that understands what those numbers mean in context.”

Looking forward, where does the field go from here on visual AI understanding?

The field is shifting from performance races to evidence-based deployment, and visual understanding will be a proving ground for that shift.

AI is expected to contribute trillions of dollars to global economic output over the next decade, but that upside is conditional. The data visualization tools market alone is projected to reach over $18 billion by 2030, and every one of those tools generates visual outputs that downstream AI systems will be expected to interpret. High-stakes domains such as finance, medicine, infrastructure, and scientific research will not rely on systems they cannot audit or explain. If a model cannot reliably extract the correct value from a bar chart in a quarterly earnings report, no amount of conversational fluency will compensate for that failure. By keeping our benchmarks and datasets open, we are inviting the kind of scrutiny that makes progress real. Weak assumptions get challenged. Edge cases get surfaced. Overfitting becomes visible. That pressure is uncomfortable, but it is how evaluation becomes scientific infrastructure rather than marketing collateral.

“The future of multimodal AI will not be decided solely by what models can generate,” Wang reflects. “It will be decided by what the field can prove. Chart reasoning is a clear test case. Either a model can extract the right number from a visualization and explain why, or it cannot. There is no partial credit in a financial audit or a clinical decision. That is the standard the field is moving toward, and the data we build today is what will determine whether models can meet it.”

Copyright © 2026 TechBullion. All Rights Reserved.
