Trusting Your LLM
Key Findings
- Code-execution workflows substantially improved accuracy, with performance varying by model family.
- Reasoning-only workflows produced large errors and served primarily as a data-leakage check.
- Statistical competence differed meaningfully across models, particularly for causal inference tasks.
- Clear metadata, structured prompts, and controlled workflows were critical to achieving reliable results.
This paper introduces a cloud-native framework for evaluating how accurately and transparently large language models (LLMs) perform applied statistical analysis on complex, survey-based data. Most public benchmarks assess reasoning on text-based tasks and do not test a model’s ability to filter microdata, apply survey weights, or generate valid population estimates—core requirements in policy research. To address this gap, we developed and tested a structured evaluation process using 35 validated descriptive and causal questions based on the American Community Survey (ACS) Public Use Microdata Sample.
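For concreteness, the sketch below shows the kind of operation these questions require: filtering the PUMS person-level microdata and summing person weights to produce a population estimate rather than a raw row count. The file name and the age cutoff are illustrative assumptions, and this is not one of the paper's 35 benchmark questions; PWGTP (person weight) and AGEP (age) are standard ACS PUMS person-file variables.

```python
import pandas as pd

# Illustrative sketch: the file name and the age cutoff are assumptions;
# PWGTP (person weight) and AGEP (age) are standard ACS PUMS person-file columns.
persons = pd.read_csv("psam_p_us.csv", usecols=["AGEP", "PWGTP"])

# Filter the microdata to the subpopulation of interest (here, adults 65 and older).
seniors = persons[persons["AGEP"] >= 65]

# A valid population estimate sums the survey weights, not the row count.
weighted_total = seniors["PWGTP"].sum()
unweighted_rows = len(seniors)

print(f"Estimated persons 65 and older: {weighted_total:,.0f}")
print(f"Unweighted sample rows: {unweighted_rows:,}")
```

Skipping the weight and reporting the row count is exactly the kind of error that separates a plausible-sounding answer from a valid population estimate.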
We compared model performance under two workflows: reasoning-only and code execution. In the code-execution setting, models generated and ran Python code against real data in a secure, serverless environment, enabling iterative refinement and reproducible analysis. Results show that data access and tool invocation substantially improve accuracy, particularly for descriptive tasks, while causal inference performance varies meaningfully across models. Findings underscore that trustworthy AI-driven analytics depend not only on model capability but also on metadata quality, prompt design, and controlled analytic workflows.
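As a rough illustration of how the two workflows can be compared, the sketch below scores a workflow by the share of questions it answers within a relative tolerance of a validated reference value. The Question structure, the get_estimate interface, and the 1 percent tolerance are assumptions made for exposition, not the paper's actual scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Question:
    prompt: str        # natural-language statistical question
    reference: float   # validated answer computed from the ACS PUMS

def is_correct(estimate: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Count an estimate as correct if it falls within rel_tol of the reference."""
    return abs(estimate - reference) <= rel_tol * abs(reference)

def accuracy(questions: Sequence[Question],
             get_estimate: Callable[[str], float]) -> float:
    """Fraction of questions answered within tolerance.

    `get_estimate` is whichever workflow is under test: a reasoning-only call
    that returns the model's number directly, or a pipeline that extracts the
    model's generated Python and executes it against the real microdata in a
    sandboxed environment.
    """
    hits = [is_correct(get_estimate(q.prompt), q.reference) for q in questions]
    return sum(hits) / len(hits)
```

Injecting the workflow as a callable keeps the scoring identical across the reasoning-only and code-execution conditions, so differences in accuracy reflect the workflow rather than the evaluation harness.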