Princeton’s Matthew Salganik Discusses the Evolving Intersection of Data and Social Science, Ethics, and Why He’s Trying to Predict the Future

Jan 18, 2023
On the Evidence: A Mathematica Podcast, with Matthew Salganik

Matthew Salganik’s book, Bit by Bit, explores the merging worlds of computer science and social science for timely, policy-relevant research in the 21st century. In the book, he shows how traditional research techniques in the social sciences can sometimes be combined with digital tools and big data to generate high-quality evidence on a larger scale, in less time, and at a much lower cost. On the five-year anniversary of his book’s release, Salganik spoke with On the Evidence about the book’s legacy and the evolution of the field of computational social science since he first taught a course on the subject in 2007.

Salganik is a professor of sociology at Princeton University and a member of Mathematica’s Board of Directors. He is also a co-founder of the Summer Institutes for Computational Social Science (SICSS), a global program which offers free trainings to data scientists and social scientists who wish to draw from both disciplines to generate fresh insights about public policy. In 2020, Mathematica and Howard University partnered to host the first SICSS at a historically Black college or university and the first to orient its coursework around anti-Black racism and inequity.

Our interview covers the following topics:

  • The changing relationship between the social science and data science communities
  • The motivation for establishing SICSS
  • The role decision makers have in maximizing the potential of computational social science
  • The role policymakers could play in ensuring that computational social science research is conducted in an ethical manner
  • Why he is writing a new book about predicting the future

Listen to the full interview.

[J.B. WOGAN]

I’m J.B. Wogan from Mathematica and welcome back to On the Evidence, a show that examines what we know about today’s most urgent challenges, and how we can make progress in addressing them.

On this episode, we’re talking with Matt Salganik about the merging worlds of computer science and social science to produce timely, policy-relevant research in the 21st century. Last month was the five-year anniversary of Matt’s book, Bit by Bit, in which he shows how traditional research techniques in the social sciences can sometimes be combined with big data and other digital tools to generate high-quality evidence on a larger scale, in less time, and at a much lower cost.

Matt is a professor of sociology at Princeton University and a member of Mathematica’s Board of Directors. He is also a co-founder of the Summer Institutes for Computational Social Science, a global program that offers free trainings to data scientists and social scientists who wish to draw from both disciplines to generate fresh insights about public policy.

In 2020, Mathematica and Howard University partnered to host the first Summer Institute for Computational Social Science at a historically Black college or university; it was also the first to orient its coursework around anti-Black racism and inequities. Long-time listeners of the show will remember a previous episode in 2021 that focused on the summer institute at Howard University, and I’ll include a link to that episode in our show notes.

To kick off this episode, I’ve asked Mathematica’s President and CEO, Paul Decker, to help me contextualize the importance of Matt’s book to organizations like Mathematica that collect and analyze data and to the people that we partner with in government, philanthropy, and the private sector, who use those data to make better decisions and ultimately improve public well-being.

Paul became CEO of Mathematica in 2007, the first year that Matt taught a course on web-based social science research, a course that would eventually become the Bit by Bit book. And, as Paul recalls, it was around that time that evidence-generating organizations like Mathematica started to recognize the importance of data science and new analytical methods related to data science.

[PAUL DECKER]

And so one of the challenges I was faced with in leading Mathematica was how to think about data science in a social science research context. And my sense was that data science was adding something unique and productive to the picture, that data science’s ability or focus on leveraging technology, as well as the democratization of data sources, being able to rely on big data, being able to rely on administrative records much more productively than in the past in the policy research field really opened up opportunities in the field for data science.

[J.B. WOGAN]

So around the time that Matt was teaching his course on social science and big data, Paul was developing a vision for how Mathematica, an organization steeped in traditional social science research, could begin to incorporate data science into its work. Paul knew that the integration of social science and data science would eventually happen across the industry. It was the future of policy research.

[PAUL DECKER]

But there wasn't anybody really articulating what the integration might look like.

[J.B. WOGAN]

Then Paul heard about Matt’s book and had a chance to talk with Matt about his ideas.

[PAUL DECKER]

It was a discovery of somebody who had thought more carefully about this integration. And I really liked the way in which Matt framed it as a combination of two different perspectives that bring different strengths to the table, because that's really what I had in my head when I thought about how we build capabilities at Mathematica over time to be able to provide the best possible support for our clients that would build on the strengths of both traditions.

[J.B. WOGAN]

When Matt’s book was published, it anticipated that the integration of data science into the social sciences would change policy research, so I asked Paul: How has it changed Mathematica’s work?

[PAUL DECKER]

Those methods and capabilities are helping us leverage data in ways that we just weren't able to do in the past. So it allows us to access a broader sense, a broader set of data sources. Different structures of data, including unstructured data, whether those data come in the form of video files, audio files, text files, all of those can be analyzed in ways that we couldn't analyze them in the past. So that just allows us a much richer approach to our work so we can deliver better on our mission of helping drive better decision making based on the evidence that we uncover.

[J.B. WOGAN]

Something that Paul has mentioned to me in the past, and that I wanted to ask him about in the context of Matt’s book, is that the data revolution, which involves Big Data and data science, represented an important and sustainable shift in the field of policy research in the ways that other methodological innovations have not. I was curious why, in Paul’s opinion, the trends Matt explores in his book had more staying power. 

[PAUL DECKER]

When we're talking about the value added by accessing big data, I think the value is inherent in the name. Big data means we're introducing new data or new information to the estimation process as opposed in the past where often we would see methodological fads that were based on different ways of manipulating data statistically without adding any new data to the picture.

So in a lot of those cases, I think given that context, it wasn't surprising that the new methodology didn't consistently generate different or better results, in a lot of cases from estimation. But when you're introducing new information, as in the case of big data, that new information means you can generate improved estimates.

[J.B. WOGAN]

OK, hopefully that contextual framing from Paul is helpful in thinking about why the rise of computational social science is such an important development for organizations that conduct policy research and the people who rely on evidence from policy research to make decisions. We’re now going to switch to my interview with Matt about the five-year anniversary of his book, Bit by Bit. This interview was recorded in December of 2022. A full transcript of the episode is available on the blog post associated with this episode at Mathematica.org. I’ll also add links to related resources, including a free online version of Matt’s book, in the episode show notes.

[J.B. WOGAN]

So, let’s see, I think where it would make sense to start is to give listeners just a sense of what “Bit by Bit” is about and what your goals were around writing the book in the first place.

[MATT SALGANIK]

Sure. “Bit by Bit” is for social scientists that want to do more data science, data scientists that want to do more social science, and anyone interested in the hybrid of these two fields. And I spend time in each of these fields, and I can see that people in them have a lot to learn from the other field, and also a lot to contribute. And so, it’s like if you have two friends that you know would hit it off, but they’ve never had a chance to meet each other, if they have a party to introduce themselves, that’s what this book is. This is my party to introduce social science and data science to each other.

[J.B. WOGAN]

Okay. And in the book you talk about part of this as being a course that you were teaching too; right? It’s documenting stuff you’re kind of already doing in the course, and also a way of supporting that course going forward, so that’s another aspect of why you wrote it.

[MATT SALGANIK]

Yeah, absolutely. So this book started -- grew out of a course that I was teaching at Princeton, starting in 2007, which was my first year there. I believe that was before the term “computational social science” became popular. At the time, I was calling the course something else, called “Web-Based Social Research.” And what was happening is that I was assigning papers -- this was a course for graduate students. I would assign them individual papers, and then during the class, we would talk about some of the broader issues and trying to think about how those papers fit together, how they tie in to core ideas from social science, how they tie in to core ideas from data science. And I found that there was a lot of stuff that’s really just missing if you read paper by paper, just kind of connective tissue, big picture. And so part of the book is to write all that down so that people can have that perspective as they approach the field.

[J.B. WOGAN]

Okay. And we’re talking of the five-year anniversary of the book, which is even farther back than the original, 2007.

[MATT SALGANIK]

Yes.

[J.B. WOGAN]

We’re at, what is it, 13, 15. So it would be 15 years since you started the course. I was wondering, has the course changed? Would somebody who took that course in 2007 notice big differences today? Did the book actually help change anything, update anything, or were there certain major changes that you would flag?

[MATT SALGANIK]

I think the book aged well, and the course aged well, largely because of the design of them. I was trying very hard, when I wrote this book, to have it make sense in five years, because there’s a lot of books in computational social science that I think do not age well. And the books that don’t age well, tend to be much more specific, and so I think the key to staying relevant is to abstract away some of the details, and so I don’t spend a ton of time in the book talking about specific platforms, because those are constantly changing, new platforms are constantly being created, new technologies are being created.

But I think if you move one level of abstraction up, there are a lot of commonalities and things that make sense today, things that made sense five years ago, things that make sense five years from now. So I think that part of the book has really aged well. That’s not to say there are not changes. So, sometimes, in fact, the things that are old are new again.

So, my first year I taught the course, I had a whole week devoted to virtual worlds, things like Second Life, and there was a lot of deep thinking at the time that Second Life would be a future place for social science research to occur. That did not turn out to be the case; however, now we are very excited about the Metaverse, which has some of the same characteristics as Second Life. And, more generally, I think the value of seeing this over this longer historical arc is to see that there are things that come and go and there are things that remain, and focusing on the things that remain is likely to be the most helpful for the future.

[J.B. WOGAN]

That’s so funny, I wonder how many of our listeners will know or remember what Second Life is. I remember it because it was either the first or one of the first topics I was actually paid to write about as a journalist. I was freelancing for a newspaper that, like a lot of places, doesn’t exist anymore. But it was the Jewish Transcript News in Seattle, and they wanted someone to write a feature story about the thriving Jewish community on Second Life and people’s different experiences with sort of exploring their Jewish identity in this alternate universe. But, yeah, I could see how in some way now, this book, and even now in just the Metaverse, even just social media was a version of a second life. It wasn’t the Second Life.

[MATT SALGANIK]

That’s true. One other thing about the history of the course, all the syllabi from the course are posted on my website, so you could include those in the show notes if people want to review the syllabus from 2007, and review it from the last time I taught the course.

[J.B. WOGAN]

Okay. Okay, that’s great. I can add that to the show notes. So, five years later, I’m curious what the impact of the book has been, and, also, how you think about the impact? Like how would you define or measure the impact of the book since it came out.

[MATT SALGANIK]

It’s interesting. I think I’ll talk, I guess, a little bit about my personal experience first, and then we can talk a little bit more about how to think about measuring the effect. Personally, it’s been incredibly rewarding when people come up to me and say, “Oh, I read your book, and it really helped me.” Also, when you hear that from people all over the world -- I was just in Holland giving a talk, and someone there said they were new to the field of computational social science. One of their colleagues gave them this book, and they loved it. It really helped them get started in the field. And, to me, that was just so cool, because it’s people that I never even imagined when I was writing it.

How would you measure impact of the book? I think, in some ways, like, it got a very positive review in Science when it came out. To me, that’s awesome. It got a number of very positive reviews from people that I respect a lot. It’s been translated into five languages now: Chinese, Japanese, Korean, Italian, and Turkish. And if any of your listeners know other languages that they would like the book to be translated into, please contact me so we can try to help make that happen. It’s been used in courses around the world.

[J.B. WOGAN]

That’s the question I was going to ask.

[MATT SALGANIK]

It’s been used in lots of research. So it’s been used in courses around the world, all different kinds of courses, graduate courses and undergraduate courses, computer science and social science. I have a list of courses that have used the book on the book’s website, so, there, I have open-sourced my own teaching materials related to the book, and encourage other people who use the book in their class to upload their teaching materials so that anyone who is thinking of using it in a course has access to see what other people have done as they’re designing their own course. So you can see there very widespread adoption, social science, data science, undergrads, grad students, interns, U.S., outside of the U.S., it’s amazing.

[J.B. WOGAN]

Do you know if it’s being used as a resource, a training resource for private organizations? I know it’s obviously something that people are aware of at Mathematica. But like you mentioned the book, you have your own experience with private companies like Facebook and Google. Like are there private companies like that that this is part of the required reading, or at least some recommended reading for people that are onboarding?

[MATT SALGANIK]

Sure. I know that Facebook Data Science Team read it for their book club a few years ago. I know I’ve given talks at a number of companies, and so I’m really happy for anyone to read it and learn from it.

[J.B. WOGAN]

Okay.

[MATT SALGANIK]

And I do think, also, the lessons from the book are very relevant to anyone, whether they work in a company, a government, or university, an NGO.

[J.B. WOGAN]

One of the things that you talk about right from the outset -- I think comes up throughout the book -- and you mentioned it in our podcast here too, about sort of creating a dinner party environment, where people, two different communities can socialize and learn about each other. So, I was wondering, do you think that those two communities, those two different sides understand each other better now, and to the extent that there are still gaps, you know, what are you thinking about in terms of ways to bridge those gaps?

[MATT SALGANIK]

Sure. I definitely think the two communities understand each other better now than they did before. I think this was a natural evolution that was going to happen. So, at the beginning of a lot of research, new kinds of research, there’s just a lot of uncertainty, people don’t know. Over time, there have been a lot of things that have helped the community form and develop some shared language, so I’m thinking of, for example, there is now a conference, IC2S2, the International Conference on Computational Social Science. There’s also the Summer Institutes in Computational Social Science, which is a training program, and there are many other universities that now offer these courses. So, at the time “Bit by Bit” was published, these courses were relatively rare, and now they’re becoming increasingly common, which I think is a great thing.

Can I go back one? I also forgot one thing about impact. “Bit by Bit” recently received the AAPOR Book Award, which I was very happy about and honored to receive. It’s a community that I drew from a lot in my work, and to be recognized by that community was very special. And one thing that’s especially nice about that award is that the book has to be published for at least three years, and so it’s like an award where they want to wait and see how the book has aged before they give it, and so it’s very nice that they had thought that it had aged well.

[J.B. WOGAN]

And you were very intentional about that. You were talking about abstraction, and you like -- rather than teaching people how to use the Twitter API, you’re giving some guidance around using big data, knowing that it may not be the Twitter API in the future. It may be some other manifestation of the data.

[MATT SALGANIK]

As we’re talking, Twitter is undergoing some very major changes very quickly, so you never know what’s going to happen.

[J.B. WOGAN]

Yeah.

[MATT SALGANIK]

I do know that, you know, no matter what Twitter becomes, and no matter if Twitter goes out of business and gets replaced by something else, I think the principles of the book will still likely apply to the new Twitter, or whatever Twitter becomes.

[J.B. WOGAN]

When you were talking about impacts, I wondered if you would bring up the Summer Institutes for Computational Social Science. Is that another way to think about -- like do you see a direct causal relationship, that the book comes out and then these institutes start to proliferate, or is there -- clarify for me what the relationship is between the SICSS, including the ones that Mathematica sponsors with Howard, and the book.

[MATT SALGANIK]

Sure. They’re definitely related. So, the Summer Institutes in Computational Social Science is something that I started with a colleague named Chris Bail, with funding from the Russell Sage Foundation, and we started our first one in 2017 at Princeton, so right about the time that “Bit by Bit” was finished, and the goal with the program was similar to the goal of “Bit by Bit.” It was to help people learn about computational social science, help build a community in this area, and so we had 30 people, graduate students, post-docs, and junior faculty from the social sciences and data sciences come to Princeton, and it was amazing. They learned so much.

We ended up with something like 300 applications, so we realized we weren’t able to provide this opportunity to everyone that we wanted, so we decided to live stream the Summer Institute, the first one. Then, during the Summer Institute, we quickly realized that more than the live stream, there’s the community aspect of being together with these other people. That was really valuable. And so we started the next year, in 2018, having partner locations that were run by alums of the Summer Institute at their home university or organization.

And so, in 2018, I think we had something like seven. In 2019, I think we had something like 11, and then it keeps growing. Through Covid, we continued. I think this past year we had about 30 all over the world, including one that was run by Howard University, in Washington, and Mathematica, and that was a very special partner location that -- yeah, that was a very special partner location.

[J.B. WOGAN]

In terms of impact, there was one thing I meant to ask, which was about reader responses. You mentioned there were a couple of people who have just come up to you to say thank you, I read the book. [inaudible]. Are there any specific stories -- like, I’m curious. I think you mentioned you had worked at the Census. Have people from the U.S. Census Bureau read your book, and are they applying ideas from it, or are there any interesting stories you’ve heard about how people have been influenced or are applying ideas from the book?

[MATT SALGANIK]

Some people from the Census have certainly heard about it, because I gave a talk at the Census Bureau about the book a few years ago. In terms of, like, specific things, I think maybe I don’t have any particular stories. I think the citation count is maybe one way of looking at academic impact, and I don’t have that. I’m looking at that right now. Yeah, it’s been cited 720 times, which is great.

[J.B. WOGAN]

Yeah.

[MATT SALGANIK]

And I hope each of those has a story behind it.

[J.B. WOGAN]

So, do you think –

[MATT SALGANIK]

I can also talk about the process of publishing it with the Open Review tool kit.

[J.B. WOGAN]

Yeah, that would be good. And, also, I want to clarify, so while I have a hard copy and people can buy a hard copy, a free version is available online to look at; right?

[MATT SALGANIK]

That’s correct; anyone can look at and contribute to. And so, what we did, the book is about social research in the digital age, and I wanted to publish the book in a way that’s fitting of the digital age. And so what happened is we created a process called the open review, so you’re very familiar with the concept of peer review, which is, while I submitted my manuscript to the press, they sent it to experts to be reviewed to make sure that it was appropriate, and to collect feedback on it, and then I took those comments from those experts and improved the manuscript.

While it was going through peer review, it went through a parallel open-review process, where I posted the entire text of the book online, and anyone could read it and annotate any particular part that they wanted. And that was incredibly helpful for getting good feedback. The feedback from the two processes was quite different. The feedback from the peer review was often about bigger-picture issues about the book from experts, like I think it could have had a chapter about text [inaudible], things like that.

The open review comments were much more specific, and they were not always from experts. So they would say, “This sentence is confusing to me.” And that’s actually incredibly helpful feedback. It's different feedback, but it’s very, very helpful, and so through combining the feedback that you get from the open review with the feedback you get from the peer review, you can produce a better book.

The open review also allows for comments that are incredibly specific by what you could call micro experts. So, for example, the book is full of examples that illustrate different points about computational social science, and sometimes the authors of those papers would write in an annotation about how I had described their paper. So, incredible expertise that you can’t really get in peer review either, because there’s hundreds of examples in the book, and so you can’t send it to every one of those people. So, the open review process leads to these better books.

It also leads to increased access to knowledge, so the book is available now for anyone who wants to read it, and it also leads to higher sales. This was designed to be done in a way that’s friendly to academic publishing, because that needs to be a sustainable business, and so, for example, we were able to collect e-mail addresses of people who were interested in receiving an e-mail when the book was available for purchase, then we e-mailed all those people on the launch day and we had a big launch, and it became the number-one best seller on Amazon in the category of Social Science Research Methodology.

But so, again, the goal of this -- and then I got a grant from the Sloan Foundation to open source all the software that we did for the Open Review tool kit so that other people can do this for their own book. And, again, the goals are better books, higher sales, increased access to knowledge, and the software is all open source, and other people can use it as they wish. I think open access to books is a next frontier in the open-access movement. There’s a lot of fighting and action related to open access to academic journals. There’s still a lot of work to be done, but there is progress there, and I think open knowledge around books is a next frontier that we should start working toward.

[J.B. WOGAN]

The higher sales element is interesting. I think it’s a little counterintuitive. Before you explained the mechanism of how you actually were able to use it, I think people assumed that there would have been a tradeoff there in terms of greater access to information, but maybe lower revenue as a result.

[MATT SALGANIK]

No. I think that’s an illusion. Plenty of companies make a lot of money by being free online; right? The challenge is to get something in return when you make something free. And so we do get something in return. The biggest thing is we get eyeballs of people who are interested in buying the book. One of the biggest challenges of selling books is to make sure that people are interested in it and know about it. It’s a very good mechanism for that. You will also see – you had mentioned you looked at it online. You saw it, there was an ad for the book that shows up. After you read it online, you’ll see it after the book; right? This is a way to sell books to people who are interested.

So, I think, you know, people think there’s a tension between access and sales, and I don’t think there needs to be that. If it’s designed well, it can be something that moves us forward in both directions. That’s one of the things that’s exciting about the digital age, is that things that people assume are tradeoffs, the nature of those tradeoffs can be quite different. I mean, this is part of the theme of the book. Like a lot of the things that we have done in the past as social scientists were, in part, because of the constraints that existed. As those constraints changed, we can do different things. Sometimes that allows us to move into a different part of the design space that was just basically not possible before.

[J.B. WOGAN]

Yeah, for me, I think I read the preface, maybe the introduction online, and then I got a sense of the style of writing and, also, the length of the book and recognized that I would be able to read it. As a former English major and journalist [inaudible] background, I would be able to read it; and, secondly, that it would be hard for me to stay focused on reading a book online of this length, and it would make more sense for me to get a hard copy, an analog version for me to do a deeper dive and close the laptop. I don’t know if I’m unusual among readers in that sense.

[MATT SALGANIK]

Yeah, I think that’s a very common path, and I hope people, when they read it, they enjoy it. They get excited about it, and then that moves them to want to engage with it in a physical way. I also am glad to hear that you found the book accessible. This is one of the really important things about the book. I wanted it to be accessible to both social scientists and data scientists, and so that forces you to write in a way that -- I tried to think that anyone that can read the New York Times, that’s the audience I’m looking for for this book. And so, in some ways, having a dual audience makes the writing difficult, but in other ways, that forces you to be accessible in your writing, and so I’m glad to hear that you thought that came through.

[J.B. WOGAN]

Yeah, it definitely did. I wanted to ask about, you introduced the idea of ready-made and custom-made policy research techniques. I’m going to butcher the pronunciation, there’s the Duchamp and Michelangelo; right?

[MATT SALGANIK]

Yes.

[J.B. WOGAN]

There is Michelangelo’s David.

[MATT SALGANIK]

Yes.

[J.B. WOGAN]

And there is the Toilet, you said?

[MATT SALGANIK]

Yes. Fountain is what he named it.

[J.B. WOGAN]

Fountain.

[MATT SALGANIK]

But, yeah, it is a urinal.

[J.B. WOGAN]

Okay. Thank you. And so, one of those -- the David is custom made, and the Fountain is ready made. Could you explain a little bit more about how those are sort of metaphors for policy research. And I was curious, are there any new evocative examples of combining those techniques that have come out since the book first published?

[MATT SALGANIK]

Sure. So, one way I like to explain it, I think data science sometimes is like a urinal, but not just any urinal, it’s a very special urinal. So, Fountain by Duchamp is a beautiful piece of art that changed what people think art could be, and how Duchamp did that is through very creative repurposing. He saw this urinal that was created for one purpose, and he said, no, I’m going to use it for a different purpose. This urinal is art. And that is a very creative thing, and it can change people’s perspective about what’s possible.

Research using data science often has a very similar flavor, because data, sometimes, is created for one purpose. That purpose is usually not research. It could be helping a company run. It could be administering laws. This kind of data is created for one purpose, and then researchers can see it and repurpose it and use it for something else. And so I think a lot of the best data science work has this characteristic of very creative uses of data that were created for other purposes.

So Google Flu Trends is one example that many of your listeners may be familiar with. Google Flu Trends had this great idea that people are typing things into the Google search engine, and they perhaps could use that to measure the prevalence of influenza, of the flu. And so this was a very creative repurposing of the search data, and very exciting. Then, the problem with repurposed data, however, is that it wasn’t created for the purposes of research, and so it doesn’t have some of the good properties that we like.

And Google Flu Trends, for example, broke down in its accuracy over time as Google started to change the way its search engine worked, because Google is making those changes not because they were trying to maximize the accuracy of Flu Trends, but because they were trying to run a search engine company; right? And so like a lot of these platforms, they are constantly changing, so this is one of the properties of big data sources that I talked about in Chapter 2, drift and algorithmic confounding, ways that the creation of the algorithm changes the kinds of data that people create, which can change the kinds of inferences that you can draw.

So, this is an example of both the beauty and excitement of ready-made data sources, as well as my risks and concerns, and so I contrast that to the work that social scientists are more familiar with, where we create the data for the purposes of our research, and so that means it has very different properties, and that I associate more with artists like Michelangelo making David. He didn’t look for something that kind of looked like David. He spent three years making David, and that’s a model that social scientists are more accustomed to. And I think, increasingly, either of these pure approaches is limited, so if you are used to working with ready-made data, you’ll start to realize that there are really serious problems, often, with this kind of data, serious limitations. The data was not created to answer your question, and so you’re going to have to start investing more in custom-made data.

Likewise, if you’re used to using custom-made data, it becomes increasingly impossible to ignore all the ready-made data in the world. Like you might say -- let’s say you’re a health researcher, you might say, “I only believe in randomized controlled trials” or something like that, and when there is no electronic medical records, maybe this is a somewhat plausible position. As the amount of electronic records and other kinds of digital traces of health behavior increases, if your goal is to learn about health, it becomes increasingly hard to ignore all of that other data. That doesn’t mean you should embrace it without understanding it. It doesn’t mean you should not -- it means, like, you have to take advantage of both of these things. And we’re seeing much more integration of ready-made and custom-made data since the book has been written, and it’s a turn that I think will likely continue.

[J.B. WOGAN]

Were there any cool new sexy examples that you’ve seen, like -- okay. Google Flu Trends is a good example.

[MATT SALGANIK]

Yeah, it’s good, because it’s both cool and sexy, and it failed in the end.

[J.B. WOGAN]

Yeah.

[MATT SALGANIK]

And they shut it down. It’s hard. It’s really hard to do well.

[J.B. WOGAN]

Yeah. I imagine there has to be an analogy of different things like that. During the pandemic, people were trying to use ready-made datasets to get new insights about Covid-19.

[MATT SALGANIK]

Absolutely. I’m sure there were. I think the one that I believe I had heard about is using data from some companies about people’s mobility to try and track how that was affected by lockdowns and other kinds of restrictions. I don’t know that much about that work though.

[J.B. WOGAN]

Okay. I was wondering, you know, so five years have passed, people often issue new editions of their books. If you were to be issuing a new edition, if you were writing an update, are there chapters you would add? Are there topics you wish you had covered or you would love to cover in a new edition?

[MATT SALGANIK]

There’s a lot of stuff that I did not include, but I feel like those decisions have aged well, because, subsequently, people have written books entirely dedicated to this topic. So, for example, a big topic that is not in my book is working with texts as data. This comes up a lot in computational social science. People are very excited about working with this new form of data. It raises special statistical issues. I decided to leave it out, because it’s a subject of a whole other book, and now one of my colleagues, Brandon Stewart, has written a fantastic book on this topic called “Text As Data,” which he wrote with Justin Grimmer and Molly Roberts. So I think we’re seeing more.

Another thing that’s completely absent from the book is anything about programming. This is another thing where we see lots of other books written about these topics. So I think of this book existing in kind of the constellation of other books, and not trying to do everything all on its own.

I think a couple things that I see, one kind of very narrow thing that I see is missing is, I talk some, actually, in Chapter 2, Chapter 3, Chapter 5, and several of the chapters about using machine learning inside of the research process, where you kind of do machine learning as an intermediate step; for example, label text or label images. And I think it has now become clear that some of the statistical and technical issues with that are a little more complicated than was clear at the beginning. So I would say that’s not really a new topic, but I would like to add in some of what’s been learned subsequently about how to do that carefully.

The other big thing that I think is missing is -- so that’s a little thing, and the big thing is more about how technology is impacting society. So, within computational social science, some work is about using new data or new techniques to answer old questions, but some of it is about answering new questions about the world. So, for example, social media changes the way politics happens. Social media may have impacted the mental health of teens. What are those kind of impacts? How do we measure them? How do we study them? How do we make decisions about what we want technology to be? So, how should technology be regulated? What kinds of transparency requirements should exist? This is a whole different topic, very interesting topic. But many of the techniques that are described in the book are also very relevant for these other questions about understanding and improving the relationship between technology and society.

So, rather than think about a new edition of this, I’m actually writing a new book about a different topic. It’s about predicting the future. So, this year, I’m on sabbatical at the Institute for Advanced Study. And subsequent to writing “Bit by Bit”, some of my work has been about using machine learning and AI techniques to predict life outcomes for individual people. So, given some information about someone, how well can we predict what will happen to them in the future? This is a core scientific question. It’s also very important for policy-makers who are considering using predictive models to help improve the allocation of services to the people they’re trying to help. And my research so far has found that this has been not nearly as effective as expected, and so now I’m trying to understand, in general, different ways that we can predict the future, and what are the limits to what it is that we can possibly do, and what it is that we should be doing.

[J.B. WOGAN]

That’s interesting. So, the area where I was wondering if there might be an update was around equity, like some of the subject matter that has become the theme of the Howard-Mathematica SICSS, and it’s not to say it’s not in there in a way, because I think a lot of the discussion in the ethics chapter could apply, but it’s not using the same kind of language of today around algorithmic bias. I don’t know if the phrase “algorithmic bias” appears. Maybe it does. But, yeah, I was curious if there would be a new edition or a section that would have maybe a chapter on the dangers of AI, machine learning, exacerbating inequities or sort of using these?

[MATT SALGANIK]

Yeah, this is what I meant when I talked about the intersection between technology and society. So, this is largely a book about research using computational social science. And I think there’s a whole new set of considerations that gets introduced once you start deploying this technology in society. So, it’s one thing to study predictability, you know, in an abstract academic setting, and then it’s another thing to try to use predictions in a real delivery of social services. And there is a lot that needs to be understood better about how we deploy these technologies in the world. It’s not nearly as simple as, oh, here is a textbook about random forest, so I can run a random forest model, now I’ll just deploy this into a real government organization. That, to me, seems like a recipe for disaster.

This stuff needs to be done very carefully. There needs to be a lot of thinking about all the things around the predictive model, because, often, that is where some of the most important problems can arise, not in the predictive model itself. Also, it needs to be understood, some of these concerns about equity, justice, differential impact and so on, all of these things are outside of what you would get in a traditional machine learning textbook.

Now, there’s a lot of advancement in this area now. It’s very exciting. One of my colleagues, Arvind Narayanan, has a new book coming out about fairness and machine learning, which I think will be excellent. It’s co-authored with Moritz Hardt and Solon Barocas.

[J.B. WOGAN]

Okay. And do you have a working title for your book, and a sense of when it might come out, or is it too nascent at this point? Okay.

[MATT SALGANIK]

Too nascent at this point. But that is definitely what I am most excited about going forward, is using some of the ideas in “Bit by Bit” to think about predicting the future, and to think about the challenges that arise when you do computational social science in the world. It's one thing to do it in an academic setting, it’s another thing to do it in the world, and there’s a whole new set of issues that need to be considered very carefully.

I’m glad you mentioned the ethics chapter. So, the book has a lot about ethics. Ethics runs through all of the chapters. There’s a very long chapter about ethics that was probably the hardest chapter for me to write. Just personally, it was very hard to write, because -- I don’t know -- I’m not trained as an ethicist, and it’s the first time I’ve ever had to write about something like that. But that chapter is the one where I think I’ve gotten the most positive feedback, where people have told me that that was very, very helpful to them.

Because a lot of the work that’s written about ethics now is not written by empirical researchers. It’s written by people who specialize in ethics, and that has a place. But I think there’s also a place for people who do empirical research that want to think about ethics. I hope this chapter can be helpful to them as well. It’s a little bit of a different perspective of someone who is used to doing this work, not just writing about this work.

[J.B. WOGAN]

Maybe it would be worth -- yeah, I really enjoyed that chapter. I don’t know if you did this, but it stood out to me as a chapter that could have been an essay on its own, or an op-ed somewhere, because you’re putting forward a thesis, and you’re calling for a principle-based approach, as opposed to other approaches, I think it was ad hoc and --

[MATT SALGANIK]

Rules-based.

[J.B. WOGAN]

What is it called? Rules-based.

[MATT SALGANIK]

Rules-based. So, the data scientists generally use an ad hoc approach, where they often think about it, sometimes quite carefully, but in the absence of any kind of framework. And then many social scientists follow a rules-based approach, like, I just do what the IRB allows me to do. And I think both of those approaches are incomplete. And so in the chapter about ethics I put forward a principles-based approach. When I was writing this, I thought, oh, I’ll have to figure out what are the principles that researchers should follow. And I started reading around, and I’m like, wow, there is no way I can figure out what are the principles. And, in fact, I don’t have to, because people have already done that; right? And so I draw extensively on two other documents. The first is the Belmont Report, which was written in the early 1970s, published in the early 1970s, about research ethics in the social and medical sciences.

And then the other document is something called the Menlo Report, which was published much later and is much more about ethics in computer science research, which brings up a number of different kinds of issues. And so from those two reports, the Belmont Report and the Menlo Report, I’m able to put forward four principles that I think should guide researchers, and then show how those four principles themselves can be derived from two larger ethical frameworks. And so none of that is new philosophically, but I think seeing it organized like that, I’ve heard many people say it’s helpful to them. I know it’s been helpful to me. As I think through new research problems using those principles and frameworks to kind of guide my thinking has proven to be very helpful.

[J.B. WOGAN]

Am I remembering correctly, the Belmont Report is something that your former board member colleague, Pat King, had co-authored, or is that –

[MATT SALGANIK]

Yes. Yes, Pat King. We did a little bit. I didn’t know that Pat King had worked on the Belmont Report until -- somehow I found out once around the board, and I asked her about it, and she had some wonderful stories about, like, rich and deep and serious discussions that they had. It’s an amazing document. The actual Belmont Report is incredibly short. It’s something like 20 pages. But then there’s two appendices that are like -- I don’t know -- a thousand pages each, or something, of very detailed commentary and debate and discussion, and so it just is a really -- I think those 20 pages are a very, very powerful document that has a lot of staying power.

[J.B. WOGAN]

It’s so funny that she contributed to writing something that influenced your book, and you ended up working together. But the timing didn’t work out. It wasn’t as if you were able to consult her as a colleague while you were --

[MATT SALGANIK]

Yes.

[J.B. WOGAN]

-- writing the book.

[MATT SALGANIK]

Yes.

[J.B. WOGAN]

So, okay, I just referenced that you are a board member of Mathematica, and I will mention that in my intro as well.

[MATT SALGANIK]

Sure.

[J.B. WOGAN]

I was curious -- I honestly don’t know -- how have you, or have you incorporated ideas in “Bit by Bit” in your role as a Mathematica board member. Like are there ways in which you’re able to probe the company’s leadership around their use of data science or advocate for the use of data science? I don’t know exactly. I’m sure that they would benefit, or they do benefit from your expertise and perspective in this area, but I wasn’t sure exactly at Mathematica.

[MATT SALGANIK]

Sure. So, in our role, as members of the board, our job is to provide kind of high-level oversight and guidance to the company. And so it’s not the case -- like we never do code review or talk about specific details of specific projects, as much as I would personally like to do that. So, the way it enters my role in the board, I think, comes from me thinking a lot about what’s the difference between the form of something and the soul of something. And I want to explain what I mean by that.

So, like, let’s take survey research. Survey research looks a certain way. It looked a certain way in 1950. When we think about survey research today, we should think what about survey research is just constrained by the technologies of the day, and what about survey research is fundamental to the core of its essence? So what’s the form versus what’s the soul? The soul is what’s really important, and what we need to keep. The form is what needs to change, in fact, not just should change, but needs to change.

And so, in general, I’m trying to probe and ask people, like, what about this is the core soul that needs to stay the same? What are you trying to do and preserve in a fundamental way, and what can you change in order to make it better, faster, cheaper? Because if you change and you destroy the soul, that’s a bad change; right? But if you change the form and preserve the soul, that actually can create a lot of new opportunities. So, trying to think really hard about why are things the way they are now, how much of those are real constraints about the soul, and how much of those are artificial constraints that have grown historically based on the limits of our technology?

[J.B. WOGAN]

Okay. Okay. Yeah, I’m thinking about in the book you give the example of photography and cinematography of, I think, a photograph or 20 photographs or what we have today with HD, just like very seamless high resolution film today. I’m trying to imagine, trying to think about what that would mean for Mathematica in 10 or 20 years, in terms of the techniques we use to answer certain questions, and how maybe the goals might be the same, in terms of the questions that we’re trying to answer, but ways in which we’re answering these questions could be completely different.

[MATT SALGANIK]

Absolutely. So I think you’ll see increasing moves to ready-made data. You’ll need to figure out what are the appropriate safeguards and controls to make sure that is done in a responsible way. You’ll see, I think, a lot more automated or quasi-automated decision-making. I think there’s a lot of things that are also hard for us to imagine right now. I mean, that’s the other thing is, in the book I talk a lot about comparing not just what something is today, but what it will become.

So, I try to think a lot about, like, oftentimes if you see the first version of something, it’s not very good. It’s expensive. It’s complicated. It’s clunky. But the really important thing is to look not just at what it is now but where it will be in the future. And so I try to think a lot about, okay, forget about this today, what’s this going to look like five years from now, what’s this going to look like ten years from now, and how do we need to be behaving today to take advantage of those opportunities?

[J.B. WOGAN]

I want to shift gears, and the last question I wanted to end on was about policy-makers. Maybe I would throw in decision-makers more broadly in policy and programs, but not necessarily the two camps we’ve talked about already. Not the data scientists, not the social scientists, but people who might be users of the kind of major takeaways from social science research or computational science research -- computational social science research, I should say.

[MATT SALGANIK]

Sure. Yeah.

[J.B. WOGAN]

So, like I could imagine -- like I wondered if there’s a CliffsNotes version for decision-makers. But what should those readers and those listeners know about the evolving field of computational social science, and what do you think their role should be? What role do they have to play in midwifing the field and ensuring that it’s practiced in a principle-based ethical manner?

[MATT SALGANIK]

Sure. I think they have an important role and a huge responsibility. No, I think, so, in one way, the book is for people who are doing social science or computational social science. In another way, I think the book will be very helpful for anyone who is consuming computational social science produced by other people. So, if you have data scientists reporting to you, if you have social scientists reporting to you, how do you know how to evaluate their work? How do you know what questions you should be asking them? How do you make sure they are behaving in an ethical way that is consistent with the values of your organization? So, all of the things in the book are helpful for those things as well.

There’s nothing in the book about how to actually do something, about how to actually program something up. It’s all really a series of ideas that you can carry around in your head and pull out when you see something. So, you see something, it reminds you of something from the book, you can pull it out, ask a question about that, make a suggestion based on that. And so, I’d like to think that it would be very, very helpful to anyone who is consuming this kind of research to help them ask the right questions, be realistic about what is possible and what is not possible. So I hope it will be helpful to everyone.

[J.B. WOGAN]

My spidey sense was going off a little bit when I was reading about how the laws and regulations are always a little bit behind the tech and technology, and I wondered if there might be someone, a legislative staffer or in Congress who might be saying to themselves, okay, well, that’s always going to be reality, but what can we do to try to be a little bit more in sync with how social science in the digital age is evolving and try to anticipate some of the problems that Matt is bringing up in his book? Do you think that’s another potential audience that should be hit here?

[MATT SALGANIK]

Absolutely. And so I’ve just spent -- the three previous years, I was a director of a center at Princeton that tries to do just this. It’s called the Center for Information Technology Policy. It’s joint between Princeton’s School of Engineering and Public Policy, and we work to understand and improve the relationship between technology and society. And we take both of those verbs very seriously. So, understand, at the center, we do research that helps advance the state of the art in topics like machine learning, privacy and security, digital platforms, and so on.

And then improve; we engage directly with policy-makers to help improve society based on this understanding. So there’s stuff that happens at the center that looks a lot like what happens at a think tank. There’s stuff that happens that looks a lot like what happens at a university center. We think it’s one of the only places really doing both of these things together. And so, by working at both of these layers, I think we’re able to do stuff that is hard to do other places. So, we can ensure that the results of the research get put into the policy-making process very quickly. Also, we can make sure that the problems that policy-makers are facing get put into the research community very quickly. And so we have done a lot of different things to ensure that kind of back and forth, and that flow, and I think that work is incredibly important.

And if there are any members of Congress listening, they should definitely check out our website, citp.princeton.edu, and we would be happy to kind of try to help them, because we realize that these are really, really hard problems. A lot of it is empirically unknown; right? Like what is happening online right now? We actually don’t know. Separate from what effects this is having on people, you don’t even know very, very basic questions about the online ecosystem. So, there’s basic empirical questions that need to be done. There’s basic policy evaluation that needs to be done. If we change the laws, if we change the regulations, what will the effects of that be? And then there’s also core moral questions that need to be addressed. What kind of society do we want to live in? And I think all three of those issues interact, and those are the kinds of issues that we try to address at the center, and those are the kinds of issues that will be addressed some in my next book about prediction.

[J.B. WOGAN]

I think that’s an excellent place to wrap up this conversation. Matt, thank you so much for being generous with your time today, and talking about “Bit by Bit,” and where you’re going from here.

[MATT SALGANIK]

Fantastic. Thank you so much. It was a pleasure talking to you.

[J.B. WOGAN]

I’m going to stop record.

[J.B. WOGAN]

Thanks to my guests for this episode, Paul Decker and Matt Salganik. A full transcript of the episode is available on the blog post associated with this episode at Mathematica.org. Related resources, including a free online version of Matt’s book, are available in the episode show notes. This episode was produced by my colleague, the inimitable Rick Stoddard. As always, thank you for listening to On the Evidence, the Mathematica podcast. Stay up to date on future episodes by subscribing on Apple Podcasts, Spotify, Stitcher, GoodPods, or wherever you listen to podcasts.

Show notes

Read a free online version of Salganik’s book, Bit by Bit.

Attend a virtual book talk on February 7th about Bit by Bit, including a Q&A with the author, hosted by the Washington, D.C. chapter of the American Association for Public Opinion Research. 

Listen to a previous On the Evidence episode about the Howard-Mathematica SICSS, which features Salganik.

Learn more about the Howard-Mathematica SICSS.

Watch Salganik give a TEDx talk at Princeton University about the tension between ready-made data (big data) and custom-made data (with which social scientists usually work).

Learn more about Salganik and his appointment in 2018 as a member of Mathematica’s Board of Directors.

About the Author

J.B. Wogan

Senior Strategic Communications Specialist