Trusting your LLM: AI’s handling of complex statistical data

Mar 18, 2026
Andrés Nigenda, Dallas Dotter, and Mike Burns

Large language models (LLMs) are rapidly being integrated into analytic workflows across government, health, education, and workforce systems. But as organizations look to deploy artificial intelligence (AI) for policy-relevant analysis, a foundational question remains:

Can these systems generate results that leaders can rely on for funding, oversight, and program decisions?

At Mathematica, we set out to answer that question by building an evaluation framework that tests LLM performance in tool-enabled, multistep analytic workflows using complex survey data.

For agency, foundation, and program leaders considering AI-assisted analytics, understanding whether these tools generate defensible, policy-relevant estimates is not merely theoretical. AI tools directly affect real-world decisions. Funding allocations, compliance reporting, and assessments of program effectiveness increasingly depend on the integrity of analytic outputs.

Why existing benchmarks fall short

Most public LLM benchmarks assess general reasoning, coding ability, or knowledge retrieval. Few evaluate whether a model can:

  • Filter and transform microdata
  • Apply survey weights correctly
  • Compute valid population estimates
  • Conduct causal inference tasks
  • Produce reproducible analytic outputs

These are core competencies in statistical practice, especially in policy analysis. Without testing these capabilities directly, organizations risk overestimating model readiness for real-world analytic applications.
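
To make the first three competencies concrete, here is a minimal sketch of a weighted population estimate computed from ACS PUMS microdata in Python with pandas. The variable names (PWGTP for the person weight, AGEP for age, ESR for employment status) follow the ACS PUMS data dictionary; the file path and the specific estimate are illustrative assumptions.

import pandas as pd

# Load a person-level PUMS extract (hypothetical file path).
pums = pd.read_csv("acs_pums_person.csv", usecols=["PWGTP", "AGEP", "ESR"])

# Filter and transform the microdata: keep working-age adults (25-54).
working_age = pums[(pums["AGEP"] >= 25) & (pums["AGEP"] <= 54)]

# Apply the survey weight: each respondent counts PWGTP times, not once.
# Skipping this step is exactly the kind of error that invalidates an
# otherwise plausible-looking estimate.
employed = working_age["ESR"].isin([1, 2])  # civilian employed codes
rate = (employed * working_age["PWGTP"]).sum() / working_age["PWGTP"].sum()
print(f"Weighted employment rate, ages 25-54: {rate:.1%}")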

Our approach

We developed a cloud-native evaluation framework designed specifically for applied statistical tasks that use complex survey data.

Using the American Community Survey Public Use Microdata Sample, we:

  • Built 35 validated analytic tasks spanning weighted population estimates and causal inference methods commonly used in policy analysis
  • Compared model outputs to validated benchmark estimates for each of those tasks
  • Evaluated performance under two conditions:
    • Reasoning only (no data access, no code execution)
    • Code-execution enabled (models could analyze data directly)
  • Executed all runs in a secure, serverless cloud environment

This design enabled us to test not only whether models could reason about statistics but whether they could execute end-to-end analytic workflows correctly.
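
As one illustration of how such a comparison might be scored, the sketch below marks a model's estimate correct when it falls within a relative tolerance of the analyst-validated benchmark. The tolerance, task structure, and numbers are assumptions for illustration, not the actual grading rules used in the study.

from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    benchmark: float       # validated estimate computed in advance
    model_estimate: float  # what the LLM returned for the task

def is_correct(result: TaskResult, rel_tol: float = 0.01) -> bool:
    """Score a task as correct if the model's estimate falls within
    rel_tol (here 1 percent) of the validated benchmark."""
    return abs(result.model_estimate - result.benchmark) <= rel_tol * abs(result.benchmark)

# Illustrative task with made-up numbers.
r = TaskResult("weighted_median_income", benchmark=74580.0, model_estimate=74990.0)
print(is_correct(r))  # True: within 1 percent of the benchmark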

What we found

1. Code execution is not optional; it’s foundational.

Across models, performance improved dramatically when LLMs were allowed to access data and execute code.

When limited to reasoning alone, error rates were substantial, often large enough to change the conclusions a decision maker might draw. In fact, reasoning-only workflows functioned primarily as data-leakage checks rather than as a viable analytic approach.

When code execution was enabled, accuracy increased significantly. For example, ChatGPT 5.1 achieved 100 percent accuracy on causal inference tasks under code execution (though not for population estimates), compared with dramatically higher error rates under reasoning-only constraints.

The implication is clear: LLMs performing statistical analysis must be embedded in executable, tool-enabled workflows.

Our evaluation focused on LLMs operating in tool-enabled environments, where models access data, generate and run code, and complete multistep analyses. Emerging AI systems are increasingly designed to operate this way, moving beyond simple chat-based interactions.
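
A minimal sketch of such a tool-enabled loop appears below: the model drafts code, a sandbox executes it, and the output feeds the next step. The ask_model and run_in_sandbox functions are hypothetical stubs standing in for a real LLM API and an isolated execution service; this is not the architecture used in the study.

def ask_model(transcript: str) -> str:
    # Placeholder: a real implementation would call an LLM API with the
    # running transcript and return the code the model proposes next.
    return "print('FINAL ESTIMATE: 0.62')"

def run_in_sandbox(code: str) -> str:
    # Placeholder: a real implementation would execute the code in an
    # isolated, resource-limited environment and capture its output.
    import io, contextlib
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # only acceptable inside a true sandbox
    return buf.getvalue()

def run_analysis(task_prompt: str, max_steps: int = 5) -> str:
    # Iterate: propose code, execute it, append the result, repeat.
    transcript = task_prompt
    for _ in range(max_steps):
        code = ask_model(transcript)
        output = run_in_sandbox(code)
        transcript += f"\nCode:\n{code}\nOutput:\n{output}"
        if "FINAL ESTIMATE" in output:  # the model signals completion
            return output.strip()
    return "Did not converge within the step budget"

print(run_analysis("Estimate the weighted employment rate for ages 25-54."))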

But autonomy does not equal rigor. Even sophisticated agent workflows do not guarantee methodological correctness. Our findings reinforce that structured evaluation frameworks—grounded in real data, controlled tool use, and benchmark validation—remain essential for assessing analytic reliability on complex, domain-specific tasks.

2. Statistical competence varies meaningfully across models.

Even with code execution enabled, differences persisted:

  • On descriptive tasks, model accuracy ranged from 75 percent to 92 percent.
  • On causal inference tasks—often more complex—the spread was even wider.

Small differences in reasoning behavior translated into large differences in analytic outcomes. Causal inference proved particularly sensitive to model capability, reinforcing our finding that advanced statistical reasoning remains uneven across models.
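
To show why causal tasks are less forgiving, here is an illustrative example of the kind of reasoning they require: a naive difference in means is biased by a confounder, and only a correctly specified adjustment recovers the true effect. The data are simulated here; the benchmark tasks use real survey data and methods common in policy analysis.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000

# A confounder drives both treatment take-up and the outcome.
confounder = rng.normal(size=n)
treated = (rng.random(n) < 1.0 / (1.0 + np.exp(-confounder))).astype(float)
outcome = 2.0 * treated + 1.5 * confounder + rng.normal(size=n)  # true effect = 2.0

# Naive difference in means overstates the effect.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Regression adjustment for the confounder recovers it.
X = sm.add_constant(np.column_stack([treated, confounder]))
fit = sm.OLS(outcome, X).fit()
print(f"Naive: {naive:.2f}  Adjusted: {fit.params[1]:.2f}  Truth: 2.00")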

3. Data governance and documentation matter.

Our evaluation also yielded an operational insight: Poorly labeled variables, ambiguous metadata, and weak documentation degraded performance across models.

AI systems do not overcome weak data governance. They amplify it.

Organizations seeking trustworthy AI-assisted analysis must treat data documentation, metadata clarity, and reproducibility standards as first-order priorities.
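
One concrete form that first-order priority can take is a machine-readable codebook shipped alongside the microdata, so variable meanings are unambiguous to analysts and models alike. The sketch below is a hypothetical illustration; field names and labels are patterned on the ACS PUMS data dictionary.

# A hypothetical machine-readable codebook. Names and labels are
# illustrative, patterned on the ACS PUMS data dictionary.
codebook = {
    "AGEP": {"label": "Age in years", "range": (0, 99)},
    "ESR": {
        "label": "Employment status recode",
        "values": {1: "Civilian employed, at work",
                   2: "Civilian employed, with a job but not at work",
                   3: "Unemployed"},
    },
    "PWGTP": {"label": "Person weight", "note": "Required for population estimates"},
}

def describe(var: str) -> str:
    # Return a description an LLM (or a human analyst) can rely on.
    entry = codebook.get(var)
    return f"{var}: {entry['label']}" if entry else f"{var}: UNDOCUMENTED"

print(describe("ESR"))
print(describe("MYSTERY_COL"))  # undocumented columns degrade performance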

What this means for decision makers

As AI tools move closer to operational deployment, structured validation becomes a necessity for governance.

For organizations considering LLM-powered analytic workflows, three principles stand out:

  1. Tool invocation and data access are prerequisites for trust.
    Stand-alone reasoning is insufficient for applied statistical work.
  2. Model choice matters, especially for causal analysis.
    Differences in statistical competence remain substantial.
  3. Structured evaluation frameworks are essential.
    Real data, controlled tooling, reproducible execution, and benchmark comparisons enable evidence-based assessment of analytic readiness.

For organizations evaluating AI adoption, validation should be built into implementation plans, not added after deployment.

In short, AI-assisted analysis should be tested, not assumed.

What’s next

We are expanding this framework to:

  • Develop prompting templates for multistep causal inference tasks (see the sketch after this list)
  • Automate evaluation workflows for scalable testing across data sets and models
  • Further advance evidence-based methods for responsible AI deployment
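
As a hypothetical illustration of the first item, a prompting template for a multistep causal task might parameterize the dataset, filters, weights, and method so every run follows the same disciplined sequence. The actual templates under development are not published; every name and step below is an assumption.

# A hypothetical prompting template for a multistep causal inference
# task; all field names and the step structure are illustrative.
TEMPLATE = """You are analyzing {dataset}.
Step 1: Load the person-level file and apply this filter: {filter_spec}.
Step 2: Use the survey weight column {weight_col} for all estimates.
Step 3: Estimate the effect of {treatment} on {outcome} using {method},
adjusting for {controls}.
Report the point estimate, its standard error, and the exact code you ran."""

prompt = TEMPLATE.format(
    dataset="ACS PUMS 2022",
    filter_spec="ages 25 to 54",
    weight_col="PWGTP",
    treatment="program participation",
    outcome="annual earnings",
    method="regression adjustment",
    controls="age, education, and state",
)
print(prompt)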

As modernization efforts accelerate, organizations need more than enthusiasm about AI. They need rigorous, transparent evaluation of analytic capability.

At Mathematica, we remain committed to helping clients apply AI responsibly—as described in our AI Principles and Position Statement—to strengthen data systems, advance analytics, and improve public well-being.

As organizations move from AI exploration to implementation, structured evaluation will be central to maintaining trust, transparency, and accountability. Mathematica helps agencies and foundations build that validation into their AI strategies from the outset.

Trust in AI isn’t built on promises. It’s built on evidence.

Visit Mathematica’s AI in Action page to learn more about our approach—or contact us to discuss how structured evaluation can strengthen your AI strategy.

 

About the Authors

Andrés Nigenda

Dallas Dotter

Mike Burns
Senior Director, Communications and Public Affairs