As technological advances have made complex computational processes easier to perform, some policy researchers are exploring an approach to statistical analysis that could provide greater clarity about which programs or policies work, and for whom. On the latest episode of On the Evidence, Mariel Finucane, John Deke, and Tim Day discuss why there’s growing demand for alternative ways of assessing the impact of policies, and how a modernized, evidence-informed update to Bayes’s Rule can help decision makers assess whether a policy or program works and whether it would work for certain groups of interest.
Mariel Finucane is a principal statistician at Mathematica who has served as the lead Bayesian statistician on a number of large-scale evaluations for the Centers for Medicare & Medicaid Services.
John Deke is an economist and senior fellow at Mathematica whose work has focused primarily on impact evaluations in K-12 education.
Tim Day is a health services researcher at the Center for Medicare & Medicaid Innovation, where he works on evaluations of new approaches to paying for and delivering health care that aim to improve quality while reducing expenditures.
Listen to the full interview below.
A version of the full episode with closed captioning is also available on Mathematica’s YouTube channel here.
So, I think a type of question that Bayes can really help us answer better is the question of what works for whom. So, not just is a program working overall on average, but really, what particular subgroups of beneficiaries or subgroups of schools or subgroups of primary care practices are benefiting the most?
I am J.B. Wogan from Mathematica and welcome back to On the Evidence, a show that examines what we know about today’s most urgent challenges and how we can make progress in addressing them. On today’s episode, we’re going to talk about a shift in the way that some statisticians approach policy analysis, which opens new doors in terms of the questions we can answer when trying to understand which policies and programs work and for whom.
I’m going to start with the hypothetical example that two of my guests for this episode, John Deke and Mariel Finucane, outlined in a 2019 issue brief for the Office of Planning, Research and Evaluation in the Administration for Children and Families. John is an economist by training, who serves in a senior fellow role at Mathematica. Mariel is a principal statistician at Mathematica. I’ll link to their brief for anyone who is interested in learning more.
In the example, John and Mariel ask readers to imagine that a federal grant program funds 100 locally-developed intervention programs to reduce drug dependency, a very plausible scenario. And the fact that it’s focused on drug dependency is not really important. We could be talking about a federal grant program to increase early childhood literacy or to reduce child food insecurity. The point is some kind of federal initiative is funding 100 local programs with the same basic, laudable social objective. Hopefully you’re with me so far.
So, now, in the hypothetical example, we select one of these local programs at random and we evaluate it. Again, this is fairly common. Mathematica conducts these kinds of evaluations, as do many other organizations. And in this hypothetical example, we take a fairly standard approach to designing the study, which is to make it large enough that we would have an 80-percent probability of detecting the desired impact of seven percentage points. And in this study, we would declare an impact estimate to be “statistically significant” at a pretty standard threshold, which is if the p value was less than 0.05.
Now, if you have not taken a statistics class or it’s been a long time since you have, the phrases “statistically significant” and “p value” may not mean much to you, but suffice to say that, in many policy research publications, the way we report whether a program worked or not is by reporting first the estimated size and direction of impact, and then reporting whether it was statistically significant at a given p value threshold.
So, to get back to the hypothetical study of a grant-funded program to reduce drug dependency, John and Mariel lay out four possible scenarios for this evaluation. One is that the program is truly effective and the impact estimate is statistically significant. Yes. That’s what everyone would hope would happen. The next most desirable scenario is that the program is truly ineffective and the impact estimate is not statistically significant. It’s unfortunate that the program doesn’t work, but it’s good that the analysis tells us it doesn’t work.
But there are two other less ideal possible scenarios. One is that the program is truly effective but the impact estimate is not statistically significant. Another is that the program is truly not effective but the impact estimate is statistically significant. Another way of describing those latter two scenarios is a false negative and a false positive. In the hypothetical example, we have some information that policymakers in the real world don’t have, which is that we know 90 of the 100 programs in truth have zero impact, and that ten of the 100 programs do, in fact, reduce drug dependency by the desired seven percentage points.
With that information, John and Mariel tell us that there is a 38-percent probability that the true effect is actually zero when the impact estimate is statistically significant. In the brief, they show how they arrive at that number. It’s not particularly complex, especially with the handy graphic they use that involves counting color-coded marbles, but it’s probably not worth the time it would take to explain here. The point is there’s a 38-percent chance the program is ineffective when the statistically significant impact estimate might lead you to think otherwise. And I wanted to start there because my non-statistician’s takeaway from this brief is that people like me are often drawing the wrong conclusions or, at the very least, drawing conclusions with too much confidence based on a shaky understanding of what the terms “statistical significance” and “p values” mean. And there are times when alternative methods of analyzing the data could provide more useful information to policymakers, the media, and the general public.
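For readers who want to see the arithmetic behind the marble counting, here is a minimal sketch using only the numbers given in the example (100 programs, 10 truly effective, 80 percent power, a 0.05 significance threshold). The function name is hypothetical, and rounding the 4.5 expected false positives up to 5 whole marbles is an assumption about how a figure of 38 percent could be reached, not a description of the brief's own calculation.

```python
def share_of_significant_with_zero_effect(n_effective, n_ineffective, power, alpha):
    """Among statistically significant findings, the share that come from
    programs whose true impact is zero."""
    true_positives = n_effective * power      # effective programs flagged as significant
    false_positives = n_ineffective * alpha   # ineffective programs flagged as significant
    return false_positives / (false_positives + true_positives)


# Numbers from the hypothetical example in the episode.
exact = share_of_significant_with_zero_effect(
    n_effective=10, n_ineffective=90, power=0.80, alpha=0.05
)

# Assumption: counting whole marbles, with 4.5 false positives rounded up to 5.
whole_marbles = 5 / (5 + 8)

print(f"Exact expected counts: {exact:.0%}")
print(f"Whole-marble counts:   {whole_marbles:.0%}")
```

Either way, the share of significant findings that are false positives is far above the 5 percent that the significance threshold might seem to suggest.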
One of these alternatives is the application of Bayes’ rule to policy research, which will be at the heart of today’s discussion. Along with John and Mariel from Mathematica, we’re lucky enough to have as our third guest Tim Day, a health services researcher at the Center for Medicare and Medicaid Services Innovation Center. All three have been exploring this general topic of how to make research results more intuitive and useful for policymakers. I hope you enjoy the conversation.
Tim, I’m hoping you can start us off with some definitions and context. What is Bayes’ rule and why should it matter to people who care about and want to know if a policy or program works?
Yeah, great question. So, Bayes’ rule is fundamentally a way of relating our prior beliefs to new evidence we receive. And if we were to look at it as a formula, it’s essentially expressing the relationship between a series of probabilities. And there are a few specific probabilities that I pull out as important terms we might talk about later.
So, the first one is what we call the posterior probability, and this is the main thing we’re trying to estimate in a Bayesian analysis. And fundamentally, this is the probability of the event we want to measure given our new evidence that we observe. So, we’re running a new test and we want to see how likely is this event to happen, and we observe some set of data and this is our probability given our new evidence.
One of the key inputs in Bayes’ rule is called the prior probability, and this is the probability of our event based on what we know before we look at the data. And so it’s looking at – think about prior evidence before we’ve even run the tests and making some estimate of the probability of observing our event prior to the data. So, Bayes’ rule is a way of relating those sort of three major things, the prior probability based on the evidence, the data we observe, and then the posterior probability. So, we’re pulling that all together and giving us a new estimate of the probability of our event, given our new evidence.
So, why is that important in policy? It sounds, like, very abstract, but I promise it’s not. Fundamentally, this sort of mirrors how we, as humans, make inferences about the world. So, we all walk around with some set of prior beliefs. We make new observations and we update those beliefs to, you know, take into account our priors that we’ve had before as well as the new data that we observe in the world. And policy analysis works in a very similar way.
Every policy we test has some set of prior evidence. It may be weak, may be strong. There may be a lot of evidence, may not be a lot of evidence, but it has some set of prior evidence. And in order to better understand the effect of any given policy, we like to run a test or we have a natural experiment of a policy, and we try to figure out how well that test worked, how well the intervention works. We use Bayes’ rule to come up with an estimate of the effect of that test given what we already knew and given what we observed in our test. Hopefully that’s a somewhat understandable introduction. And I don’t know if John or Mariel have additions to sort of clean that up.
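Written as a formula, the relationship Tim describes can be sketched as follows. The event labels ("effective" and "data") are notation chosen here for illustration, not terms from the episode.

```latex
\underbrace{P(\text{effective} \mid \text{data})}_{\text{posterior probability}}
  = \frac{P(\text{data} \mid \text{effective}) \;
          \overbrace{P(\text{effective})}^{\text{prior probability}}}
         {P(\text{data})}
```

The left-hand side is the posterior probability, the probability of the event given the new evidence. The prior probability on the right is what was known before looking at the data, and the remaining terms describe how likely the observed data would be.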
Yeah, I think that’s a great explanation, Tim. And I saw a really perfect example just yesterday actually. There’s this amazing economist at Brown University. Her name is Emily Oster and she writes this fabulous newsletter. And yesterday’s newsletter happened to be about this new surprising research finding that’s been getting a lot of press lately. It’s this new, high-quality randomized trial that shows, surprisingly, that enrolling your kids in pre-K actually might hurt their future test scores rather than helping.
And in this newsletter, she was helping us try to understand what’s the best way to think about a new finding like that. She pointed out that we have this tendency to kind of treat this new piece of evidence, as you were saying, Tim, as if it were, like, the first or the only or the best because it’s new information that we have on the topic, but that, of course, is not the case; right? And she very compellingly said, I think, that it can be helpful – it is helpful to situate that surprising finding into a broader context, what you were calling the “prior,” into some broader evidence base. And in this case, it turns out we actually have a very good, broad, robust prior evidence base about the effects of pre-K enrollment. And that evidence base, when she combined it with this new data, allowed her, as you were saying, to make an estimate of the posterior probability.
So, after seeing the data and combining it with this prior evidence, she was left feeling pretty skeptical of this surprising new finding, pretty skeptical that pre-K is actually a bad thing. And, you know, as she was writing about this, she didn’t formally invoke Bayes’ rule. She didn’t do any, like, math in the newsletter or anything, but there was definitely this very Bayesian flavor to her thinking, and it left her feeling, you know, like the true probability that pre-K is a bad thing is probably pretty low.
And I think if we, like, take it back to J.B.’s example from the intro, it’s really the exact same kind of thinking that needs to happen; right? It’s like, in that example, we were wondering, what’s the probability that this statistically significant finding – that’s our new piece of evidence – reflects a meaningful true effect on drug dependency. And in that case, the prior evidence that was really important to bring into the picture was that it’s very hard to move the needle on drug dependency. In this case, we know that only ten percent of interventions that try to move the needle on drug dependency actually work in the example. So, it’s like we take that prior evidence, we put it in the context of this rather flimsy new statistically significant finding, and then, using Bayes’ rule, we can figure out that there’s actually a pretty low probability that we’ve actually moved the needle.
So, I think, just to wrap up, like, coming back to J.B.’s original question, why should it matter, why should Bayes matter to somebody who wants to know whether programs work or not, the reason is that Bayes’ rule should matter because it helps us take our new findings and put them into this nice, sturdy contextual evidence base and have more confidence as a result in our end conclusions.
That’s terrific. Thank you, Tim and Mariel. John – well actually, I’ll just add one thing before I turn to John. I know that Emily Oster is a big fan of Bayes’ rule. She actually has a newsletter that is titled Bayes Rule is my Faves Rule. And she actually talks about, you know, what happens when you have new evidence that comes about and how you think about that in the larger context of a body of literature and when it can move – when it can change your priors, when it probably won’t change your priors much. But that’s a really excellent way of putting some meat on the bones in terms of what that would look like in practice, and very recently. John, is there anything that we’ve left on the table there that you want to talk about?
Well, those are great answers. They’re very thorough answers and I think that covers a lot of ground. I would – so, I guess I’ll just take it a slightly different tack to answering the question and I’ll just try to boil it down to what I think the core of it is. I think Bayes’ rule is a tool that allows us to calculate probabilities about things we cannot see given everything that we can see. That is the crux of Bayes’ rule. And the challenge of Bayesian modeling is to come up with a sensible way to connect the things we can’t see to the things we do see. And that’s what a Bayesian model really does is it – it’s a mathematical bridge between what we see and what we can’t see, and then Bayes’ rule is a tool that lets us calculate probabilities about what we can’t see given what we can.
And one example that I often use is the probabilities you see reported on the FiveThirtyEight website. So, for example, Super Bowl is coming up. They are reporting a 68-percent probability that the Rams are going to win that Super Bowl. The winner of the Super Bowl is something we can’t see right now. We don’t know that. That’s uncertain, but what we do know is we know the performance of the Rams and the Bengals over this past season. We know how other teams who have performed similarly during seasons do in the Super Bowl. And so, given all of that information that we see, we can calculate a probability about something we can’t see. I think that’s the crux of it.
John, Mariel, in the example that I mentioned in my intro, your example from the brief, you conclude that there would be a 38-percent probability that a statistically significant finding is a false positive. I think this also means that there is a 62-percent probability that a statistically significant finding is a true positive, that the program is truly effective at reducing drug dependency when the impact estimate from the study is statistically significant. And so, thinking about that number, 62 percent is better than 50 percent or ten percent or zero percent, but it’s not 95 percent or 99 percent.
So, could you talk a little bit about how consumers of policy research should think about a study that arrives at this kind of result. If you’re a policymaker or an agency director that’s actually tasked with deciding whether to replicate or expand or defund a program that has been under evaluation, what’s the threshold for making that kind of decision. And Tim, I’d love for you to weigh in on this question as well.
I’d be happy to take a first shot at it and then hand off to my colleagues here. So, I think there are a couple of parts to that question that are really important. One is, as we saw in the example, p less than .05 doesn’t mean that there is a five-percent probability of a false positive. It could be much higher. And we really can’t just say, “Well, let’s change the cutoff on the p value in order to arrive at a five-percent false positive rate.” We can’t do that in a universal way because, in order to know the false positive rate, you need that missing piece of information that was provided in the example, which is the proportion of interventions that are truly effective. That’s that prior evidence. And if you don’t have that, you can’t calculate the false positive rate. So, there’s no way to change the cutoff on a p value in a universal way in order to arrive at a desired false positive rate. You have to have Bayes’ rule and the prior evidence to figure that out.
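John's point that the same p value cutoff implies different false positive rates can be sketched numerically. The function below is illustrative (its name and the list of prior shares are assumptions); it holds the study design from the example fixed at 80 percent power and a 0.05 threshold, and varies only the share of interventions that are truly effective.

```python
def false_positive_share(prior_effective, power=0.80, alpha=0.05):
    """Probability that a statistically significant finding comes from a
    program with zero true impact, given the share of programs that work."""
    true_pos = prior_effective * power
    false_pos = (1.0 - prior_effective) * alpha
    return false_pos / (true_pos + false_pos)


# Hypothetical range of prior evidence about how often interventions work.
for prior in (0.05, 0.10, 0.25, 0.50):
    share = false_positive_share(prior)
    print(f"{prior:.0%} of programs effective -> "
          f"{share:.1%} of significant findings are false positives")
```

With the cutoff held fixed, the false positive share moves with the prior evidence, which is why no single cutoff can deliver a chosen false positive rate in every setting.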
But then the second part of the question is, okay, p less than .05, it’s not telling us the false positive rate. There’s no way we can get it from that, but what is the false positive rate that we should accept? And it’s not at all clear that we should insist upon a false positive rate of just five percent or, conversely, being 95 percent sure that something worked. What level of confidence we need really depends on the decision that we’re trying to make.
And if I went back to that football analogy, I’m not going to never bet on a football game if there’s less than a 95-percent chance at being right. Like, if somebody offers me even money on that game, even though there’s just a 68-percent chance that the Rams will win, I’ll take an even money bet on that game. If I do that over and over again, I’m going to make money. And so, in a lot of ways, you can leave money on the table if you’re going to insist on 95-percent confidence to drive decision making. But it really depends on the nature of the decision and you need to customize that cutoff to the objectives that a decisionmaker is trying to make.
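The betting analogy comes down to a short expected-value calculation. This is a minimal sketch assuming a one-unit stake and a true even-money payout with no bookmaker's fee; the function name is hypothetical.

```python
def expected_profit_per_bet(p_win, stake=1.0):
    """Expected profit from an even-money bet: win the stake with
    probability p_win, lose the stake otherwise."""
    return p_win * stake - (1.0 - p_win) * stake


profit = expected_profit_per_bet(0.68)
print(f"Expected profit per 1-unit bet at a 68% win probability: {profit:.2f}")
```

Any win probability above 50 percent gives a positive expected profit at even money, so holding out for 95 percent certainty would mean passing up bets that pay off on average.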
Mariel, I saw you nodding your head. Is there anything you would add to John’s answer there?
No, I think that sounds great. I actually call this the John Deke rule of statistical inference. He’s like, you have to take the context of the decision into account in order to decide what level of confidence is really required. So, I can’t say it better than him.
Okay. I mean, so it sounds to me like – so, the context matters. The stakes would matter. And it sounds like it’s up to the decisionmaker a little bit as well, right, what they would – is a high enough probability to move forward with something or to not move forward with something?
Yeah, you know, having briefed a fair number of policymakers and talking them through results, I think you really do see this in practice, where the contextual information for them does carry a lot of weight. And so, you know, in both Mariel and John’s example here, when we’re bringing them results from an area that is very well researched and, you know, where there’s a high chance of success based on prior research, policymakers will think about that and may be much more readily accepting of a favorable finding. But in cases where there’s a lot less research or, you know, in the drug dependency example where we know from prior research that it’s pretty hard to do this sort of thing, they do naturally bring in some skepticism of results.
You know, I think, with that said, what we’re asking them to do when we present them with traditional results in terms of a p value is sort of naturally build their own heuristic and think about their prior results. And I think where we move in a Bayesian framework is being able to build some sense of that prior evidence based on actual research, hard numbers, and give them a probability that sort of builds in the context already and then they can further sort of refine based on their own risk tolerance and their sense of the importance of the policy.
Okay. Terrific. Mariel, here’s where I’m going to ask the “why now” question. Bayesian methods have been around for a long time. Why are we talking about them now? And John and Tim, please jump in here as well.
Yeah, that’s a great question. I’m happy to take the first crack. I think there are a lot of reasons why we’re having this conversation right now. I think one that’s maybe most obvious is that we are in desperate need of a replacement for p values. I think all of the examples we’ve already talked about show why p values are failing us. And that line of thinking has, in fact, moved along so far that, in 2016, the American Statistical Association put out a totally unprecedented memo on the widespread misinterpretation and misuse of p values and statistical significance. This is like a professional society that’s never taken a stance like this on an issue before. That 2016 memo was followed up in 2019 by a commentary in the wonderful journal Nature that had over 800 researchers sign onto it. And they went so far as to say that research fields broadly should completely abandon the use of statistical significance. It’s a really big deal. So, there’s widespread agreement, we got to get rid of the p values. That’s one reason.
Another reason is that Bayesian computation has really come an amazingly long way in recent years. So, it used to be that conducting these analyses in practice was an extremely computationally expensive undertaking and now, with better hardware and also with better software algorithms for fitting these models, we can do it practically much more easily.
A third reason has a conceptual kind of flavor to it. So, I think, historically, there’s been hesitancy, rightly so, around using Bayesian methods to inform high-stakes decisions, like policy decisions, for example. And that hesitancy I think stemmed from the fact that we’ve often thought of Bayesian methods kind of in the same breath as the sort of subjective, squishy personal belief. So, like, in the pre-K example, I might have, you know, sent my kiddos to a wonderful pre-K around the corner from my school and, therefore, have, like, positive vibes about pre-K; right? But that is, of course, not relevant for making policy decisions; right?
And the conceptual step forward that we’ve made recently is really due in large part to a fantastic professor at Columbia. His name is Andrew Gelman. He’s, I believe, the kind of preeminent Bayesian thinker of our time. And he has really moved the field forward in terms of reframing this focus on personal squishy belief towards a focus on hard, cold prior evidence. So, it’s not like, “I sent my kids to a good pre-K that I have warm feelings about.” It’s like, “I have read the literature and digested it and meta-analyzed it, and I really understand rigorously what the context is for this new finding.”
So, one last reason I think that we’re really talking about Bayes right now, and this one I think perhaps is the most important of all, is that there is an urgent need to understand better the effect of our programs and policies on equity outcomes. And I am, of course, not the first person to say that. President Biden, on his very first day in office, on his inauguration day, signed an executive order on advancing racial equity and really urged across federal agencies for folks to recognize and work towards redressing inequities in their programs and policies.
And then at Boston University there is an amazing professor named Ibram X. Kendi who’s, like, taken this idea even a step further and proposed a constitutional amendment that would block any policy that could add to racial inequality. And he has this fabulous line of thinking where he points out that when we propose a new program, like a new tax bill, for example, we, of course, take it to the Congressional Budget Office and they tell us what would the effect of this tax bill be on the American economy; right? And he points out that, while that’s very natural and something we routinely do and think very rigorously and analytically about, we don’t – historically, we haven’t given the same level of rigor and analytic thought to the question of what would this tax bill do to equity in the United States. Like, would it widen the income gap? Would it widen the racial and ethnic wealth gap in our country, or is it something that could potentially help to close those kinds of gaps?
And I want to be very clear that I don’t think these kinds of huge, systemic historical problems have an easy statistical solution, like, by any stretch of the imagination, but I do think it’s important to realize that, as we’ve been saying, understanding impacts is really much better done with Bayes than with p values. And especially when we’re thinking about equity, when we’re kind of trying to look inside the black box and figure out what works for whom and what is the effect of this program on specific racial and ethnic groups, that’s really where p values fail especially, and where Bayesian-flavored analyses can be particularly helpful.
That’s interesting about Kendi’s analogy where there’s a “do no harm” approach to the deficit, could there be a new “do no harm” approach to racial inequities or inequities at large. Tim and John, anything that you were thinking about as Mariel was talking? Anything you would like to add about the “why now” question?
I can add just a little something from my perspective. I agree with everything she said. And one thing that resonates particularly for me is the aspect of it that’s conceptual and the contributions of Gelman changing the story of Bayes from “I had a prior belief, I learned something new, I changed my belief.” That was the old Bayes story. The new story is, which I think is a much better story, “There was prior evidence, we have new evidence, and now we have an updated understanding of the probability that something works.”
And especially in our area, in our field of program evaluation and informing policy decisions, taking personal beliefs regarding the answers to research questions out of it is absolutely essential. Nobody wants researchers injecting their personal beliefs into the answers to critically important questions about policy. Our beliefs don’t matter. Decisionmakers’ beliefs may matter. Voters’ beliefs may matter, but our beliefs don’t. And so, we really need to be evidence based. So, that, I think, is huge.
And I think I’m probably the oldest person on this call. I was in grad school in the previous century, which makes me feel a little bit like Grandpa Simpson, but when I was being trained, I remember reading an appendix to what at the time was an old econometrics textbook from the 1980s. And in that appendix, it was talking about Bayesian method because you never talked about Bayes in the main part of an econometrics textbook. You always put it in an appendix if you mentioned it at all. And what it said was something along the lines of, “Well, you know, with modern computers” – in the 1980s – “we now have the technology to estimate Bayes. You know, we can do this now, but, conceptually, like, where does this prior belief stuff come from? That’s really weird.” You know, they didn’t use that language exactly, but it, you know, in a very sort of academic way, kind of cast a lot of doubt on the validity of these prior beliefs. Where would they come from? Whose beliefs are they? So, like, that conceptual change I think is just – for me, it was a game-changer. That’s what enabled me to, you know, in mid-career, change horses from the frequentist horse over to the Bayesian horse. That was what enabled me to do it personally. So, I think that’s just a huge issue, for me at least.
That’s really interesting. And just in case listeners aren’t familiar with the phrase “frequentist,” a frequentist is – when we were talking about null hypothesis significance testing or p values, that’s what we’re talking about is the frequentist approach.
Yeah. Well, and I almost kick myself for using the term because I don’t actually like the term, but it is commonly used. So, that’s why I used it, but I actually don’t love the term. But you’re right, the better way to refer to it is the null hypothesis significance testing framework.
Okay. Well, which I am purely pulling from your brief. So, it’s ultimately credit back to you. The other thing that I’m noticing as you’re talking is it sounds like part of what made you gravitate to Bayes later in your career is this shift towards a more evidence-based approach. So, it used to be that it was, you know, consistent with Mathematica’s mission about trying to bring evidence to bear to policy decisions that would improve people’s lives, that, you know, before there would have been this reliance on squishy prior beliefs, and now there’s a kind of an evidence-based way of approaching Bayesian methods.
[John Deke] Yeah, that’s right. For me, it was the combination of really having to confront, as that example you mentioned in the intro illustrates, to really having to confront how wrong statistical significance can be relative to what we want it to be. That combined with the realization that we don’t have to use personal beliefs in order to use Bayes’ rule, that it can be an evidence-based exercise, it was the combination of those two things that really made me rethink a lot of things.
Okay. Mariel, I want to pick up on something that you mentioned earlier when you were talking about the equity implications, the possibilities with Bayes. So, my layman’s understanding is that Bayesian methods allow researchers to answer more questions and different kinds of questions than the more common approach of null hypothesis significance testing. So, could you go into that in a little more detail? What sorts of questions can you pursue and what might that mean for evaluating policies and programs through that equity lens?
Yeah, that’s an awesome question. I’m glad you’re bringing us back to that because I was too quick to brush past that important point. So, I think a type of question that Bayes can really help us answer better is the question of what works for whom. So, not just, is a program working overall on average, but really what particular subgroups of beneficiaries or subgroups of schools or subgroups of primary care practices are benefiting the most? And I mentioned just briefly that these types of analyses, these subgroup analyses are a place where p values fail especially flagrantly. And the reason is really twofold. It’s really two sides of one coin almost why p values are a poor tool in that context.
The first reason is that, when we’re interested in understanding what works for whom, we always have smaller sample sizes than when we’re analyzing an overall effect. So, for example, if we’re interested in the effect of an intervention on women only, then we’re going to have a smaller sample size because there’s fewer women in the study sample than there are people in the study sample; right? And when sample sizes go down, we have less precision to estimate effects. So, for folks who like standard errors and uncertainty intervals, those standard errors, those uncertainty intervals get wider. And that can make it harder to pick up a true effect, if there is a true effect there to be found. So, in that way, when we’re doing subgroup analyses using the standard approach, we have a higher risk of missing a true effect that’s really there, of having a false negative finding.
But what’s really kind of pernicious is that the opposite is also true. When we use standard p value-based methods for subgroup analysis, we also run the risk of increasing the false positive rate where we think there’s something going on in a particular subgroup when, in fact, there’s nothing going on at all. And the reason that that happens is that in the null hypothesis significance testing framework, the more subgroups we look at, the more risk we run of finding some fluky finding just by chance, of thinking there’s something going on when, in fact, there’s not.
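The way fluky findings accumulate across subgroups can be illustrated with a short calculation. This sketch assumes every subgroup has zero true effect and that the subgroup tests are independent, which is a simplification; the function name and the subgroup counts are hypothetical.

```python
def chance_of_any_false_positive(n_subgroups, alpha=0.05):
    """Probability that at least one of n independent tests comes out
    statistically significant when every true effect is zero."""
    return 1.0 - (1.0 - alpha) ** n_subgroups


# Hypothetical numbers of subgroups examined in one study.
for k in (1, 5, 10, 20):
    print(f"{k:>2} subgroups: {chance_of_any_false_positive(k):.0%} "
          f"chance of at least one spurious significant finding")
```

Under these assumptions, the chance of at least one spurious finding climbs quickly as more subgroups are examined, even though each individual test still uses the 0.05 threshold.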
So, there’s these, like, twin problems of false positives and false negatives both when we use the traditional approach for understanding what works for whom. And what’s kind of magical and almost seems too good to be true is that Bayes can address both of these problems simultaneously. So, by bringing in this broader evidence base that we’ve been talking about, we can effectively increase the amount of evidence available for making subgroup estimates. That’s going to shrink our uncertainty intervals, shrink our standard errors, or, you know, increase our level of precision, our confidence. So, that’s going to prevent us from missing true effects that are there.
And then also, at the very same time, what bringing in this outside evidence is going to do is make us more skeptical of really weird, fluky findings in a given subgroup, where we analyze some small number of people and, just by chance, they all happen to have really crazy outcomes. Bayes is going to help us kind of slow our roll there and be like, “Hey, that’s probably just noise. That’s probably not real what’s going on in that subgroup.” And it’s going to rein those crazy findings back into a more plausible range. And in so doing, it’s going to simultaneously address this other issue of false positive findings. So, I think the take-home message is, when we want to do analyses of the impacts of programs on equity outcomes, we’re often asking questions around what works for whom, and that’s really, really hard to do well unless you take a Bayesian approach.
And, Mariel, to build on that, I think one of the key things you said is that sort of reining in of extreme values and things like that is not done based on the researcher’s judgment. It’s not like, you know, you’re producing a bunch of discrete estimates of subgroups and then you have to make some professional judgment about, “Oh, this seems totally improbable.” This is really data-driven; right? It’s to what extent is the data overcoming, you know, our priors and telling us there’s really something different about this subgroup. I think that’s a really critical piece to emphasize in, you know, getting back to John’s point about we’re not bringing in our own beliefs as researchers and sort of picking and choosing which values seem extreme or not. It’s the data speaking and coming through in this model.
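The two effects described above, tighter uncertainty and skepticism toward fluky subgroup results, can be sketched with a simple normal prior combined with a normal estimate. This is a minimal illustration under stated assumptions, not the model used in any evaluation discussed here: the prior mean and spread are fixed by hand, whereas a real analysis would typically learn them from a broader evidence base. All numbers and the function name `shrink` are hypothetical.

```python
def shrink(estimate, std_error, prior_mean, prior_sd):
    """Posterior mean and sd when a normal estimate is combined with a normal prior."""
    data_precision = 1.0 / std_error ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_var = 1.0 / (data_precision + prior_precision)
    # Precision-weighted average of the subgroup estimate and the prior mean.
    post_mean = post_var * (data_precision * estimate + prior_precision * prior_mean)
    return post_mean, post_var ** 0.5


# Hypothetical prior from outside evidence: effects centered on 0 with sd 2.
large_subgroup = shrink(estimate=3.0, std_error=1.0, prior_mean=0.0, prior_sd=2.0)
small_subgroup = shrink(estimate=10.0, std_error=5.0, prior_mean=0.0, prior_sd=2.0)

print("large subgroup (mean, sd):", large_subgroup)
print("small subgroup (mean, sd):", small_subgroup)
```

In this sketch the precisely estimated subgroup barely moves, while the noisy estimate of 10 is pulled most of the way back toward the prior, and in both cases the posterior spread is narrower than the original standard error.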
Totally agree. Great point, Tim.
Yeah, I think that is a great point, and it kind of inspires a thought. There is a role for researchers’ professional judgement in Bayesian modeling, and the role of judgment is really in what questions are we going to ask of the data, what questions do we think are important to ask of the data. So, our role is generating hypotheses in the scientific sense, not in the null hypothesis significance testing sense. Our role is to ask questions of the data, but our role is not to put our thumb on the scale regarding the answers to those questions.
And one of the great things about Bayes is it gives us the freedom to ask a lot of questions. I think one of the big impediments of null hypothesis significance testing for equity research is this fear of spurious findings that Mariel referenced, this fear that if we get a statistically significant finding we’re going to make a big deal out of it. We know people make a big deal out of it, but we also know they shouldn’t, and so we’re afraid. We’re afraid to look at more outcomes. We’re afraid to look at more subgroups because this, you know, rickety, old framework that we’re using, the statistical significance, it can’t handle it. It just can’t handle it.
And so, researchers end up in this very, if you think about it, bizarre situation where they tell people, “No, I can’t look at that additional outcome that members of a community feel is really important to look at because, if I start looking at too many outcomes, I’m going to get a spurious finding,” or “I can’t look at those subgroups that folks think we really need to look at because those subgroups have been disadvantaged. I can’t do it because I could have a spurious finding.” And so, we’re literally afraid to learn. We are literally afraid to look at questions – at more questions. And it’s not because we’re afraid of the answers. It’s because we’re afraid of the weaknesses of our statistics.
And so, it’s not really magic. It’s just a better framework. So, like, Bayes enables us to explore a much broader set of questions without running the risk of these spurious findings because, instead of putting on blinders and looking at every outcome and every subgroup separately and ignoring everything else in the world, instead of doing that, we put everything into the context of everything else we’re looking at, and that gives us the ability to make better assessments about uncertainty, about the probability that there’s a real difference. So, yeah, I mean, it’s – it might be kind of weird and surprising to think that statistics could make such a difference for studying equity, but I think it really can, and I think it’s just we were really handcuffed by that framework of statistical significance.
So, Tim, we’ve just talked about the case to be made for Bayesian to help answer questions about equity. I’m curious, at the Centers for Medicare and Medicaid Services, are you seeing an interest in using Bayesian in that way? Are people in the administration drawing a connection between Bayes and the potential for asking and answering questions about equity?
Yeah. So, I guess to start, you know, I caveat that I can only really speak for my own narrow experience at the agency, of course.
Of course, yeah.
And I think we’re early days in rolling out Bayesian approaches in our evaluations, such that our policymakers wouldn’t know to ask for Bayes; right? With that said, you know, advancing equity is now a core pillar of CMS and we are – you know, our leadership is intent on answering those sorts of questions of, are our programs having disparate effects, to what extent are we reducing the underlying disparities that are so prevalent in our system right now. And I have no doubt that as we wade into answering those questions, Bayesian tools are going to be essential to really sort of teasing out the answers to those questions because of all the advantages that Mariel and John have laid out.
With that said, you know, I think, as we started to sort of dip our toes into Bayesian methods six or eight years ago, I was surprised but also encouraged to learn that other parts of the agency were already using Bayesian methods. So, for example, to better estimate hospital-level mortality, they use a Bayesian approach that sort of borrows strength across groups to pull in some of those outliers. So, you can imagine if you have a very small hospital and you’re trying to count a rare event, like a death, you can get some very funny-looking results, where there might be a lot of hospitals with no deaths in a year and a hospital, by chance, has a couple deaths, and that could end up with a very high mortality rate if you’re not careful in analyzing those data. So, that’s one example.
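The small-hospital problem can be illustrated with a beta-binomial sketch, in which each hospital's raw death rate is blended with a prior that acts like extra "pseudo-patients" drawn from the overall population. This is only an assumed toy version of borrowing strength, not the agency's actual method; the prior of 2 deaths per 100 patients and the hospital counts are made up for illustration.

```python
def shrunken_rate(deaths, patients, prior_deaths, prior_patients):
    """Posterior mean death rate under a Beta(prior_deaths,
    prior_patients - prior_deaths) prior and a binomial likelihood."""
    return (deaths + prior_deaths) / (patients + prior_patients)


# Hypothetical prior: experience equivalent to 2 deaths among 100 patients.
PRIOR_DEATHS, PRIOR_PATIENTS = 2, 100

for name, deaths, patients in [("small hospital", 2, 20), ("large hospital", 200, 2000)]:
    raw = deaths / patients
    adjusted = shrunken_rate(deaths, patients, PRIOR_DEATHS, PRIOR_PATIENTS)
    print(f"{name}: raw {raw:.3f}, adjusted {adjusted:.3f}")
```

Both hypothetical hospitals have the same raw rate of 10 percent, but the small hospital's estimate is pulled strongly toward the prior because two deaths among 20 patients is weak evidence, while the large hospital's estimate hardly changes.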
But focusing specifically on the Innovation Center, we’ve, as I said, recently started to incorporate these techniques into a handful of our evaluations. And one example that comes to mind is our center’s first test of a primary care model called the Comprehensive Primary Care Initiative. And so, under this model, we paid about 500 primary care practices extra money in exchange for them providing more advanced care services, like care management. And the test of this model was to see whether these new supports for these practices improved the quality of care for patients and reduced cost for Medicare. That was our main outcome.
And so, as we looked at the overall results for this model, we were initially focused on did the model save money, and we ended up in sort of a gray area where we had a favorable effect estimate but we failed to reject the null. And so, we’re not really sure what to do here. We’re seeing some evidence that the model might have saved some money, but failing to reject the null puts us in a situation where we can’t say it was statistically significant. We also can’t say that the model didn’t work at all.
And so, one area that we wanted to look at was to sort of disaggregate these effects, to sort of peel back the onion, and we had done this test in seven distinct geographic regions. So, it was natural to wonder would the results vary by region, might there be some regions for whom the intervention worked particularly well and other regions where it didn’t work as well. And so, we had estimated these subgroups traditionally using null hypothesis significance testing, and it was pretty noisy; right? We had less sample in any given region and so there was lots of overlap in confidence intervals, and it was really hard to tease out is one region truly different from the other, or are they fairly similar.
Moving to a Bayesian analysis made the differences by region much clearer. We were able to produce results in a way that was far more intuitive. We essentially created sort of a red, yellow, green stop light you can imagine, where red is the probability that a given region lost money under the model, yellow is the probability that it saved some amount of money up to our investment, and green is the probability of savings over and above our investment, that would be a net savings. And it was a good example because it showed that the Bayesian estimates gave a more precise estimate for any given subgroup, so we were able to sort of see the differences by region, but it also gave us a really intuitive red, yellow, green framework of no savings, some savings, net savings. So, it did start to show that there was one region that had a high probability of a net loss, another region had a high probability of net savings, and some mix of other results in there.
The other reason why I really like this example is because it shows you both what Bayes can do and what it doesn’t do for you. So, here, we got a result that was really interesting and told us something about how the model might be working, but it doesn’t tell you the why; right? We still have to think about what might drive those differences by region and we still have to worry about the same concerns we’d have in a traditional subgroup analysis: is there some unmeasured bias that might be driving one region’s results separately from another? So, I really want to stress that, as helpful as Bayes is, it’s not a silver bullet. It doesn’t solve sort of some of the core issues with doing social science, but it really can move the needle in terms of communicating results and getting to those subgroups.
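The red, yellow, green summary can be sketched as three slices of a posterior distribution for gross savings. The code below assumes, purely for illustration, that a region's posterior is approximately normal; the posterior mean, spread, and investment amount are hypothetical, and the function name `stoplight` is not from the evaluation itself.

```python
from statistics import NormalDist


def stoplight(post_mean, post_sd, investment):
    """Split a normal posterior for gross savings into three probabilities.
    Assumes investment is a non-negative amount in the same units as savings."""
    posterior = NormalDist(post_mean, post_sd)
    p_loss = posterior.cdf(0.0)              # red: the region lost money
    p_net = 1.0 - posterior.cdf(investment)  # green: savings exceed the investment
    p_some = 1.0 - p_loss - p_net            # yellow: savings between zero and the investment
    return {"red": p_loss, "yellow": p_some, "green": p_net}


# Hypothetical region: posterior savings of 10 (sd 10) against an investment of 15.
for color, prob in stoplight(post_mean=10.0, post_sd=10.0, investment=15.0).items():
    print(f"{color}: {prob:.2f}")
```

The three probabilities always sum to one, which is part of what makes this presentation easy to read: every region gets a complete answer rather than a yes or no on significance.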
John, Mariel, beyond the work that we’re doing with Tim and with CMMI and CMS, I was wondering, where else are you seeing increased interest in the use of Bayesian methods in federal policy research and in what form does that interest take?
Well, I could take a first shot at that, J.B. So, we’re seeing a lot of interest across multiple agencies. We’ve seen interest from IES, the Institute of Education Sciences at the Department of Education. We’ve seen interest in multiple agencies within the Department of Health and Human Services, for example, OPRE, which is the Office of Planning, Research and Evaluation, as well as the Office of Population Affairs, OPA. Really, all the agencies, I think, that are in this space of evaluating programs, that have traditionally funded impact evaluations to understand what the effects of interventions are, they’ve all had to contend with this ASA statement on statistical significance. They’re all trying to figure out what to do about it. And so, while different people, different agencies are at different stages, I think there’s pretty widespread interest in answering the question that they need to answer, which is what’s the probability this thing worked?
You know, I suppose another approach you could take is, instead of answering that question, you could just more clearly explain to people what a p value is or what statistical significance is. And I’ve sometimes heard folks suggest that we do exactly that, but that’s kind of backwards. You know, it’s kind of like, you know, I’ve got a hammer, the world is full of screws, but I love my hammer, so I’m just going to go pounding away on those screws no matter what, and that’s really backwards. We need to use the right tool for the right job.
You know, the mistake people made in misinterpreting p values was to think that the p value gave them the answer to the question that they really needed to answer. The mistake was not wanting to answer that question. People are right to want to know what the probability is that an intervention worked. That’s a very appropriate thing to want to know. So, you shouldn’t give that up. Instead, you should give up the thing that wasn’t answering your question. And I think that’s the realization that’s coming to a lot of agencies. So, I think the interest is there and it’s growing as people realize this more and more.
So, most of this conversation is focusing, I think appropriately, on the benefits of Bayesian methods, but I am conscious of the fact that it’s not the standard way of doing policy analysis today. And so, I want to hear, what are the concerns or objections you’ve encountered about using Bayesian methods and, to the extent that you’re able to address them, do you see Bayesian methods as a complementary approach to this traditional null hypothesis significance testing approach or is it something that’s in competition with it and may eventually replace it?
Yeah, I have a few thoughts. I mean, first, it connects I think directly back to what I said earlier about concerns about priors. I think that’s, you know, as John and Mariel have said, that’s where a lot of the concern starts and where, you know, getting over those concerns is what has accelerated the use of Bayesian approaches. And I think there is a lot of education that needs to be done to explain what a prior is and then to, as I said, to overcome some of the concerns about putting a thumb on the scale to be really clear about what your prior is and make sure that you have buy-in.
And I think one of the ways that we also do that is by testing the sensitivity of our results to our priors. And so, you know, if there is some uncertainty around what the right prior is, as I think there often is, you know, it’s not a huge lift to then think about specifying several different priors. And you might have one that you favor and you think is the most accurate, but if you find that the result you get is very, very sensitive to your priors, that should tell you something about how you should communicate your results, and so that's a really important consideration.
I think, you know, the other thing is, we've spent so much time in this era of null hypothesis significance testing that some of the audience for our results have learned to be dependent on it; right? They've taken their stats courses. They know that p values are important. When they look at a table of results, they sort of point their finger down the row and look for the results that have the little asterisk next to them to signify significance, and in sorting through a world of pages and pages of results, that's a really comforting thing, to be able to say, okay, this is the important result. Everything else I can sort of not worry about. This is the result I need to focus on, and everything else maybe is important, but I'm not really sure.
I think in moving into a Bayesian world, it is giving much more nuanced results, which is both a benefit but also a challenge to communicate what is important, because statistical significance alone is not important. You can have a result that’s not policy relevant that happens to be significant. You can have policy relevant results that aren't significant. And so now we're really faced with removing that crutch of having this hard and fast rule, which we never should have treated as a hard and fast rule, to a more nuanced set of results that we, as the researchers, need to figure out how to communicate really clearly and concisely to policymakers who honestly don't have a ton of time to sort through pages and pages of results.
Tim, let me just ask one follow up there, because one of the reasons I wanted to ask this question is because I had seen a publication in which you were listed as a co-author, where I think you had helped create a dashboard that presented results in both a frequentist, the null hypothesis testing approach, and the Bayesian, so you were offering the results in both ways. And I wonder if that might be, at least in the near future, something we will see more of, where consumers of policy research will be able to see the results presented in both frameworks?
Yeah, I think that there are folks who will always feel more comfortable when they're able to see a p value and a standard error. And I think, as researchers, we may want to give them all the data that they want. You know, it's on them to make their decision. They're informed in a certain way, and so I do think that, certainly in the near term, Bayesian results could function as a complement and not a replacement to traditional p values. But with that said, you know, I think we have consumers of our results who think much more probabilistically already.
Our Office of the Actuary runs simulation models day in and day out, and they're thinking about probability. They're not thinking about what's the probability of this event happening if I assume the null hypothesis is true. That's just not the way they operate and so, for them, you know, I think Bayesian results are kind of a natural thing to think about. And so, I think we are in a world where both can co-exist, but we really need to sort of explain the limitations of the p value, as sort of the ASA had pointed out as we continue to provide them.
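A prior sensitivity check of the kind described here can be sketched by rerunning the same calculation under several candidate priors and comparing the headline probability. The example assumes a normal estimate and normal priors centered on zero; the three prior widths, the estimate of 2 with standard error 1.5, and the function name are all hypothetical choices for illustration.

```python
from statistics import NormalDist


def prob_effect_positive(estimate, std_error, prior_mean, prior_sd):
    """Posterior probability that the true effect is above zero,
    combining a normal estimate with a normal prior."""
    data_precision = 1.0 / std_error ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * estimate + prior_precision * prior_mean)
    return 1.0 - NormalDist(post_mean, post_var ** 0.5).cdf(0.0)


# Hypothetical candidate priors, from skeptical (narrow) to diffuse (wide).
candidate_priors = {"skeptical": 1.0, "moderate": 3.0, "diffuse": 10.0}

for label, prior_sd in candidate_priors.items():
    p = prob_effect_positive(estimate=2.0, std_error=1.5, prior_mean=0.0, prior_sd=prior_sd)
    print(f"{label:>9} prior (sd={prior_sd}): P(effect > 0) = {p:.2f}")
```

If the probabilities differ a lot across reasonable priors, that is a signal to report the range rather than a single number; if they are close, the conclusion does not hinge on the choice of prior.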
I agree with everything Tim said. I think it all makes sense. You know, I think, like he said, one of the big impediments has been this mis-conceptualization of priors and probabilities in terms of beliefs. But we can overcome that. I agree that we're going to be in a period of time where we're probably going to be reporting both statistical significance and Bayesian probabilities, because we need to have our reports maintain some compatibility with the installed base of readers. People expect it to be a certain way, and so you need to respect that.
Longer term, I think that Bayes will probably completely replace null hypothesis significance testing. I think that makes sense longer term. One small technical thing is, I think standard errors will actually always be interesting and important to report. I don't see those ever going away. Those are really important. It's more the p value that I think will eventually fade from the scene.
I would just add one more thing in terms of concerns that people raise about Bayesian methods. It's actually a very common question we receive, which is, okay, I'm conducting an evaluation or I'm reading an evaluation report about an intervention that seems really unique. This intervention hasn't been studied before, or this intervention model hasn't been studied before, and so there's really no prior evidence regarding this particular approach to addressing a problem or trying to affect an outcome, so what prior evidence can I bring to bear here, what can I do? And that's a very common question. And the answer to that question is that we can always take whatever the intervention is and place it in a broader context, and so long as we're clear about that and we're clear about the interpretation of the probability statement we're making, I think that's totally okay. So, I'll give you an example.
The federal government of the United States has funded many efforts to improve the outcomes of citizens. Every intervention that any agency is likely to evaluate falls within that broad population of efforts by the federal government to improve the outcomes of citizens. And so we can look across all of those efforts and all of those effect estimates in order to see what's the probability that this new thing that we're evaluating is one of the winners? What's the probability that it is among those efforts that were effective? That's a very broad framing, but it is a useful framing, I think, and I think it's an intuitive framing, and it's actually something we do all the time.
I remember way before I learned anything about Bayesian methods, I was reviewing some output from a research assistant at Mathematica, and there were all these significance stars everywhere, and I was very skeptical. And I said to the research assistant, look, it is far more common, in my experience, for research assistants to make programming mistakes than it is for interventions to have all of these positive effects. And the research assistant went back and they found, oh, yes, I did make programming mistakes. There weren't so many statistical significance stars.
So, in making that statement, I was drawing on a very broad base of experience of effects from lots of different evaluations for lots of different agencies, and that informed my assessment that this is not likely a real effect, and I was right. Now, I didn't even know about Bayesian methods beyond the appendix of an econometrics textbook from the 1980s. I didn't know anything about it, but that was the thinking that I used. And I think that that framing is something we can use in interpreting these sort of novel interventions that seem very different from everything else.
And, John, I think that raises an important point that maybe we haven't touched on enough, is when we're talking about probabilities, we're mostly talking about probability distributions; right? And so, when you think about the probability based on the prior evidence, we're not talking about one point estimate. We're talking about characterizing the distribution of the probability. And so, in the case of an intervention that doesn't have a lot of prior evidence, we would expect that it would be a very wide distribution. We wouldn’t have a very strong prior guess about how likely that is to work. And that's one of the nice things about being very transparent with your priors is you can get a sense, you know, based on the evidence, are you making a very strong guess about how effective something might be or are you using a fairly broad prior that doesn't put a lot of weight on any one potential outcome?
Yeah, that's a great point, Tim, and it really relates to another trap that's very easy to fall into, which is thinking that something either works or it doesn't, and there's actually a distribution of effects. And that's actually a thing that makes that example from the intro kind of a tricky example, because, in that example, we imagined that there are only two possibilities, an effect of zero, or an effect of I think it was like seven percentage points or something like that. And to most readers of that brief, I bet that example seemed very natural. They probably don't question it, because that's how you're trained to think under significance testing. You're trained to think yes or no. But that's not the way things actually work. There's actually a whole distribution, as you say, of effects, and some of those effects are negative. So, that is something that comes up over and over again, and I fall into this trap myself all the time of switching into this binary thinking mode when it's actually a continuum. It's a super good point.
So, we've anticipated my last question a little bit. I think we’ve talked a little bit about educating, making sure people are more literate around what the Bayesian results mean, what the standard approach when we do have those outputs, what those mean. But I'd love to get into this a little bit more. As our ability to make use of Bayesian methods has improved over the last, say, two decades, I'm wondering what you would say is next. Where is there still room for improvement and, you know, in terms of the people and policy aspect of this, what would be the implications for people by making policies and programs more effective?
So, I think we’ve spent a lot of time so far talking about how Bayes can give us an answer to research questions that we've always had kind of top of mind, you know. What's the probability that this intervention works on average? What's the probability that it works in a particular subgroup? Those are pretty well worn and important research questions in places where I think Bayes really has a lot of value to add.
I think as I look forward into the future, one thing that I'm very excited about is Bayes opening the door to us asking brand new types of research questions that we really weren't able to even tackle before. And one of those, I think, that is particularly important and exciting in my mind comes back to this idea of Bayes letting us, as John was saying, kind of ask more questions without being scared that, because of our rickety old tools, any answer we're going to get is nonsense. And the one I'm thinking about in particular here has to do with data-driven subgroup finding.
So, historically, traditionally, because of the kind of confines of statistical significance testing, it's been absolutely necessary to prespecify which subgroups we're going to look at before we see any data. So, you know, in Tim's world, if we have this intervention trying to improve the quality of health care provided, we might prespecify. We're going to see what the effect is in small primary care practices and big primary care practices. We're also going to check what's the effect in rural practices and in urban practices. And maybe we would have one or two others, but that would really be it. We would cut ourselves off there.
But there are very cool new Bayesian methods. I'm thinking in particular of a method by Richard Hahn and his colleagues called “Bayesian Causal Forests,” and it's one that we work on a lot here at Mathematica, where we're thrilled to be kind of extending the method further for the policy research context. And that's a method that really lets you, instead of prespecifying which subgroups you're going to look at, go and ask the data, tell me where is this thing most effective. Given everything we know about these primary care practices, where do the pockets of success seem to be? That question is absolutely not possible to ask without these nice skeptical priors that rein in implausible estimates and, in general, are just really skeptical of fluky findings, and I think it's really an exciting direction to be able to move in.
Tim, what would your answer be to what's next? Where is there room for improvement and what might that mean for people in policy?
Yeah, I mean, I think the first obvious place we've talked already about is in that area of disparities and improving equity, particularly in health care, where we have such an inequitable system, and I think we've touched on that enough to say that it's going to be an important tool to get those fine-grained subgroups to really understand how we move the needle on this important issue.
The other area that I'll touch on briefly is, you know, I think Bayesian methods could be part of the answer for realizing the promise of rapid-cycle evaluation. So, this is something that we’ve talked about a lot, and we’ve sort of, at least here at CMS, we’ve moved from a situation where we run an intervention and then, you know, within five years of it ending, we'll have some answer about did it work or did it not. And our answer to moving to a rapid-cycle framework was to say, well, we want to test it more frequently. We want to see how it's going as we run the intervention and not wait until the end, so that we know if something's going wrong, we can tweak it. If something's going really well, maybe we can scale it. And it turns out that that's really difficult, for a variety of reasons. But one of the ways that I think Bayes can improve that is by giving us those more fine-grained answers on for whom is the model working, not just is it working overall, and giving us not just an overall thumbs up, thumbs down, are we seeing statistically significant results, but is this more likely than not to be effective and sort of answering questions like that, that might not be good enough evidence to say we're going to make this program permanent, but it might be good enough evidence to say we should tweak our test to try something a little bit different. So, I think Bayesian methods could play a really important role in rapid-cycle testing.
John, I'll leave you with the final word here.
Well, I'm going to build a little bit on what Tim said there. I think that's a great point. I think Bayes has the potential to help us replace a vicious circle with a virtuous cycle. So, the vicious circle we referred to earlier is the fear to learn, the fear to test new things with small samples because we might get the statistically significant result that we really think is spurious, but we have no way to understand that or communicate it, and so that vicious circle led us to not ask questions, not collect data, because we were afraid of what to do with it. And then when you don't have the data there, you can't analyze it. So, people aren't collecting data. They're not asking questions. And that really limits what you can do on the analysis side.
But I think we're going to end up now, moving forward, potentially, with a virtuous cycle, in which people are encouraged to do small studies, you know, whether that's a school district looking at their own evaluation of two math curricula they might want to adopt, or whatever the example may be, because now they can place their small study in a broader context. They're not just putting on the blinders and looking at that, and they can contribute to a larger evidence base. And from that larger evidence base, we can do more sophisticated analyses. We can start to move into trying to figure out not only what's likely to have worked in the past but what is likely to work in the future.
Today, if you're going to the What Works Clearinghouse in the field of education, you're getting information about the effects of programs as they were implemented in a specific context in the past. But what you really want to know is, what's the probability this is going to work for me in my context. That's something we can't give a good answer to now, and it's largely because we don’t have the data. You know, we have the tools, but we don't have the data. But before we didn't have the tools, so there was no point in collecting the data. But now that we've got the Bayesian tools, there is a point to collecting the data, and I think it's going to just build, where we're going to collect more data, we're going to start to see the really interesting things that we can learn from that data with these tools, which will then inspire the collection of more data, so on and so forth. So that's what I'm really hoping to see over the rest of my career, I guess.
Thanks again to my guests, John Deke, Mariel Finucane, and Tim Day. In the show notes, I'll include links to any of the research that we discussed in this episode. As always, I want to thank you for listening to another episode of On the Evidence, the Mathematica podcast.
There are a few ways you can keep up to date with future episodes. You can subscribe on Apple Podcasts, Spotify, Stitcher, Google Podcasts, YouTube, or anywhere else that you find podcasts. You can also follow us on Twitter. I'm @jbwogan. Mathematica is @mathematicanow.
Read a brief about using a Bayesian framework for interpreting findings from impact evaluations prepared by Mariel Finucane and John Deke for the Office of Planning, Research and Evaluation at the Administration for Children and Families.
Read a paper co-authored by Mariel Finucane that compares Bayesian methods with the traditional frequentist approach to estimate the effects of a Centers for Medicare & Medicaid Services demonstration on Medicare spending.
Read a paper co-authored by Tim Day describing an experiment to provide evidence that would be useful to policymakers and other decision makers through an interactive data visualization dashboard, presenting results from both frequentist and Bayesian analyses.
Read Emily Oster’s newsletter article about why and how she applies Bayes’s Rule to interpret new evidence in the context of existing evidence, including a recent study about the effects of a preschool program in Tennessee on future student test scores.