Trusting Your LLM
Key Findings
- Code-execution workflows substantially improved accuracy, with performance varying by model family.
- Reasoning-only workflows produced large errors and served primarily as a data-leakage check.
- Statistical competence differed meaningfully across models, particularly for causal inference tasks.
- Clear metadata, structured prompts, and controlled workflows were critical to achieving reliable results.
This paper introduces a cloud-native framework for evaluating how accurately and transparently large language models (LLMs) perform applied statistical analysis on complex, survey-based data. Most public benchmarks assess reasoning on text-based tasks and do not test a model’s ability to filter microdata, apply survey weights, or generate valid population estimates—core requirements in policy research. To address this gap, we developed and tested a structured evaluation process using 35 validated descriptive and causal questions based on the American Community Survey (ACS) Public Use Microdata Sample.
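For concreteness, the sketch below shows the kind of operation these questions require: filtering the PUMS person-level microdata and summing person weights to produce a population estimate rather than a raw row count. The file name and the age cutoff are illustrative assumptions, and this is not one of the paper's 35 benchmark questions; PWGTP (person weight) and AGEP (age) are standard ACS PUMS person-file variables.

```python
import pandas as pd

# Illustrative sketch: the file name and the age cutoff are assumptions;
# PWGTP (person weight) and AGEP (age) are standard ACS PUMS person-file columns.
persons = pd.read_csv("psam_p_us.csv", usecols=["AGEP", "PWGTP"])

# Filter the microdata to the subpopulation of interest (here, adults 65 and older).
seniors = persons[persons["AGEP"] >= 65]

# A valid population estimate sums the survey weights, not the row count.
weighted_total = seniors["PWGTP"].sum()
unweighted_rows = len(seniors)

print(f"Estimated persons 65 and older: {weighted_total:,.0f}")
print(f"Unweighted sample rows: {unweighted_rows:,}")
```

Skipping the weight and reporting the row count is exactly the kind of error that separates a plausible-sounding answer from a valid population estimate.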
We compared model performance under two workflows: reasoning-only and code execution. In the code-execution setting, models generated and ran Python code against real data in a secure, serverless environment, enabling iterative refinement and reproducible analysis. Results show that data access and tool invocation substantially improve accuracy, particularly for descriptive tasks, while causal inference performance varies meaningfully across models. Findings underscore that trustworthy AI-driven analytics depend not only on model capability but also on metadata quality, prompt design, and controlled analytic workflows.
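As a rough illustration of how the two workflows can be compared, the sketch below scores a workflow by the share of questions it answers within a relative tolerance of a validated reference value. The Question structure, the get_estimate interface, and the 1 percent tolerance are assumptions made for exposition, not the paper's actual scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Question:
    prompt: str        # natural-language statistical question
    reference: float   # validated answer computed from the ACS PUMS

def is_correct(estimate: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Count an estimate as correct if it falls within rel_tol of the reference."""
    return abs(estimate - reference) <= rel_tol * abs(reference)

def accuracy(questions: Sequence[Question],
             get_estimate: Callable[[str], float]) -> float:
    """Fraction of questions answered within tolerance.

    `get_estimate` is whichever workflow is under test: a reasoning-only call
    that returns the model's number directly, or a pipeline that extracts the
    model's generated Python and executes it against the real microdata in a
    sandboxed environment.
    """
    hits = [is_correct(get_estimate(q.prompt), q.reference) for q in questions]
    return sum(hits) / len(hits)
```

Injecting the workflow as a callable keeps the scoring identical across the reasoning-only and code-execution conditions, so differences in accuracy reflect the workflow rather than the evaluation harness.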