Rahul Bhargava: How stories move

As well as being a research specialist at the MIT Center for Civic Media, Rahul does a lot of facilitation work. His original background is in robotics and education, and here he talks to us about his research into how information and stories move in today's information ecosystems.

He gives tips on how to work with Big Data using new tools, explains why the notion of “completeness” in a story is a myth, and how he uses his “smell test” to weed out unreliable information.

Rahul, how do you work with information?

In a few ways. One is that I try to understand what's out there and understand what it's saying. We particularly work with information at the Center for Civic Media. We care a lot about news and how information flows. How does a story go from a blog to a mainstream newspaper to television? We look at that question and we try to study it quantitatively. So, we don't just look at the story of what someone thinks is happening. We say, "Alright, how can we actually measure this? How can we go out and get all the mentions of some story that happened in the last month? How can we combine that with all of the mentions on social media? How can we combine that with what people are saying in the blogs?" When you pull all that together, you can start to get a sense of how information moves. And we call that the “media ecosystem”.

So, how does information move through that network of media sources? We think that matters because which stories get told really matters. They shape our perceptions and they shape how we understand the world. So, if we can try to understand which stories get told and which ones don't, we can have a better sense of how our world is being constructed.

What has changed in that ecosystem in the last five to ten years, from the point of view of an activist?

It's been fascinating in the last couple of years, because we see the flow of how information and stories move changing every few months. Something new happens, something exciting. And that's a change in both the physical landscape of what's going on (say the protests in Cairo, or revolutions) and how that interconnects with the virtual world of information on-line. We talk a lot about how there used to be a division, where we saw the virtual and the real world, and that's sort of a false division right now. You put those two together and they're overlaid, and that's been driven by a lot of technologies.

You can point to mobile technology, social technology. You have a virtual world in your pocket. So, actually, it's no longer virtual. We see that changing all the time, and we're trying to keep up and understand how a story gets told. One person might say - "oh, everybody is on Facebook." Well, that might be true in their context, in their country, but it might be completely different in a different context, in a different country. And that keeps on changing.

What is over-hyped about information?

The idea that information is going to magically give you insight is something that I think is over-hyped. The idea that if you have all the information, you will immediately understand what's happening, or even that it's enough to understand what's happening, is a myth. What I say is better is combining qualitative and quantitative: putting together the story of what's happening with the measurable items of what's happening. But even if you can stitch that together to understand the high level and the ground truth, it's still not clear that having more information will help you understand something. It might just be that you need to be part of it to understand even one segment of it, because we've come to a world where information flows more smoothly. There is less friction. And in such a world, it's harder and harder to understand the completeness of a situation.

Are you saying that information can't speak for itself?

I wouldn't think so, because it has no story to tell on its own. Any time you use it, any observation of it, reduces it to one story. Even if that story tries to cover a lot, even the act of recording that information is in a sense reductivist. It reduces it to one narrative or multiples. It's just not full coverage. I think that there are multiple valid stories that information can tell and that we can pull out of it. But the idea that there is completeness in the information, I think it's a myth.

From your point of view, what is more important in terms of advocacy work: enabling people by giving them a better entry point to information, or curating a story for them that will reveal connections between the chunks of information later?

I approach advocacy a lot like I approach education. I was trained in a certain model of thinking about learning. And the way I think about learning is that it's something where you want to have a really low entry point and a really high ceiling. People can enter a field without very much in their way. They can start to engage, and they can play with it. But then if they want to dig deeper, if they want to open up a box and find out what's inside, they can keep going. I think advocacy needs to have a similar approach. I have no problem with a simple story being told, as long as there's some way to continue that story, or the investigation into that story, which to me is the curating. Creating these multiple layers so that someone can start at the top... dig one deeper... dig one deeper. If they want to, if that's the path they choose.

On the same note, how would you use this method, going from small steps to big steps, to create influence?

I think the question of how you influence someone by using information-based advocacy is a huge area we're digging into right now. I think actually we're getting better at it as time goes on. We're trying to understand how we start to learn new tools. And we can start by not being dismissive of simple messages. Because curation is editing and, if you want to have influence, with some audiences you need to start with a simple message. But I also say that you don't need to be scared of complexity. So - start simple, but don't be afraid of getting to a complex story. That's one model we can use for having influence. Identify your audience really well, understand who they are, and then address them at multiple layers, and at multiple points.

What are the biggest ethical concerns when it comes to using and re-using information in advocacy work?

The biggest concern is actually excitement. We see something neat, we get excited about it and we want to do something with it. So we jump on it, we say - "this is great! Let's use this, I'm sure this could influence this audience that I really care about. People are gathering next week, and I can use this in that setting, and it'll make a great splash." But we don't actually dig into understanding where that story came from, what the full narrative of it is, where this evidence came from - the traditional questions of who funded it? What are the holes in it? Really understanding where it came from is a huge ethical concern in curation.

If we're going to use other people's information to help fill out our story, which we should, then we need to really understand the dilemma they had in crafting their information and collecting it. Whenever we're collecting information there are always holes in it, and there are always assumptions and biases. To use someone else's information you need to dig deep to understand it. And even though it's super exciting, you need to spend that time, and catch your breath. And then go out and make your awesome point and try to influence someone.

On the one hand you would definitely describe yourself as a promoter of Open Data, Creative Commons licensing, and overall transparency in data information, which is a mindset nowadays also being heavily promoted by government. The problem is that these same governments are using every possible channel to intercept and collect data. When we look at the latest leaks... How do you approach the risks of promoting total openness while this channel is hijacked and used against people?

That's a good question. I think the risks of Open Data are something that any information worker thinks about every day. There is no simple solution. The question is - what is your default position? If your default position is that everything's open, then you need to understand and own the risks that you're bringing to other people. There's a great example in a US state recently where people didn't like the policies of a governor. If you can collect enough signatures in this state, you can actually have the Governor's position recalled, and have a new election. It turns out those signatures are open data. The recall effort failed and that data was opened, so the people that supported the Governor pulled the signatures and addresses of the people that wanted this person out of power, and started posting it on Twitter. What they also did, to make it into a sort of retribution, was they combined that information with open data about police reports. So Bob Jones tried to recall the Governor, and it happened that he has a ticket, on this date. The idea was to make it look like only criminals would support this recall effort.

There are tons of stories like that about the risks of Open Data. They're always there, but we're never going to think of all of them. So we need to take measured efforts to try to protect data when we're releasing it. That doesn't mean our default should be - "don't release it." But protection matters even more now, with lots of research and a lot of computer science algorithms showing that anonymised data is seldom actually anonymous. There's a stat that I heard recently at a conference, where someone who works on privacy concerns told me that if you look at the ZIP Code in the US, the postal code, the date of birth, and the gender, that uniquely identifies 80% of the US population. That's a huge number, and that's in a lot of data sets that the government certainly puts out.
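Editor's sketch: one rough way to check this kind of risk before releasing a data set is to count how many rows are unique on the combination of fields Rahul lists. This is a minimal illustration, not a method described in the interview. The field names (`zip`, `dob`, `gender`), the function name `unique_fraction`, and the sample rows are all made up for the example.

```python
from collections import Counter


def unique_fraction(records, keys=("zip", "dob", "gender")):
    """Return the fraction of records whose combination of `keys` appears exactly once.

    `records` is a list of dicts; `keys` are hypothetical field names.
    """
    if not records:
        return 0.0
    # Count how often each combination of the chosen fields occurs.
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    # A record is "unique" if nobody else shares its combination.
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Made-up sample rows, for illustration only.
people = [
    {"zip": "02139", "dob": "1980-01-01", "gender": "F"},
    {"zip": "02139", "dob": "1980-01-01", "gender": "F"},
    {"zip": "02139", "dob": "1975-06-30", "gender": "M"},
    {"zip": "02143", "dob": "1990-12-12", "gender": "F"},
]

print(unique_fraction(people))  # 0.5: two of the four rows are one of a kind
```

A high fraction suggests that removing names alone would not make the data anonymous, and that releasing it in aggregate, or with coarser fields, may be the safer choice.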

You have similar examples in a corporate context. You have the Netflix release of data, and lots of examples like that. We have a large set of examples where we think we've released something clean and it turns out it isn't clean. And in fact computer scientists are just getting better at that, de-anonymising data, because it's a fun challenge. At the same time, we have a lot of data that we can release in aggregate. So, there's certainly a default position that I take, which is this. If we're working on a data-centred project, I want to release that data in aggregate at least, because I know that I'm only going to be able to glean a certain set of the insights from it. I know that I live in a network of researchers that will come to some new understanding by using that data. And I know it's kinda like cartoon super heroes, where if one person has one super power and another has another, by your powers combined you're going to be even stronger. That's a primary motivation for a lot of our academic work in releasing data. Now I don't work in a lot of sensitive areas, so I don't have a lot of data where someone's being put at risk. I don't actually have a ton of experience with that and I don't need to handle those ethical concerns.

What would be your advice to people who are entering the field of Big Data? Social network data may be close to real life for some people, but far from real life for others. What should their concerns be? What should they be starting with?

People starting out in the field of Big Data and social data need to think a lot about what subset of the population they're seeing. It's very easy to get super excited about these giant sets of data, because we've never had anything that big before. Twitter certainly isn't reality, but it's the closest data set that we've ever had for reality, and that's meaningful. But we tend to get over excited about that. We say - "this is really relatively easy to get and tells me a ton of what's going on in the world". And you forget along the way in your excitement that it's a proxy; it's a representation of one slice or a couple of slices of the world. It can be informative, but it's certainly hard to make normative statements. It's hard to say this is what everyone is doing or what everyone thinks.

If I was talking to someone new entering this field I would say - "hey, just remember to get outside every once in a while and to check your truth". If you think you've found something, take a look at it for a sec and try to identify it with something totally different. I call it the “smell test”. If something smells a little too nice, then it's probably not right. Or if something smells a little funny and it seems like it fits too easily, then you probably want to find another way to check it.

That is one of the risks of some of the tools we have. For instance, Excel is a tool for data analysis but it has no way to create rules to check that your math is right. In programming, there's an approach called test-driven development. If you're doing statistical analysis, you write your test on some test data where you know what the output of your algorithm should be. An Excel cell can contain an algorithm, but you have no way to test it, to make sure you wrote it correctly and that it's operating on the right fields.
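Editor's sketch: the idea of testing a statistical routine on data where the answer is already known might look like the following. The `median` function and the test values are illustrative choices, not something from the interview.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median() needs at least one value")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd count: the middle element is the median.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


def test_median_on_known_data():
    # Small inputs whose medians can be worked out by hand.
    assert median([3, 1, 2]) == 2
    assert median([4, 1, 3, 2]) == 2.5
    assert median([7]) == 7


test_median_on_known_data()
print("all checks passed")
```

The tests cover the odd and even cases separately, because those are the two branches where a slip is most likely. That is the kind of check a formula typed into a spreadsheet cell never gets.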

So I think the tools actually carry a lot of built-in beliefs and a lot of built-in systems that you need to be aware of for your analysis. The tools are such a big part of that analysis. We can't work on printed out tables anymore. It's just too big, and so much of the analysis we're doing is algorithmically driven. We need the brain of the computer to help us.

You use two interesting terms. One is 'smell' and the other is 'instinct'. Can you elaborate on these as a scientific matter?

One of the ways I do talk about analysing data is I think about the smell of something. This is something I learned early on in my programming days, where if something works the first time, that should smell funny to you. Because nothing ever works the first time. I think that approach is really about engaging all of your senses rather than getting caught up in the excitement of something. I use the word 'smell' to talk about it because it's something that's very relatable. It's a sort of fuzzy thing that's about your instincts that you can't quite quantify, but that you need to keep in mind because we're humans. We have all these gut instincts and feelings that govern how we interact with things. And that's true of information and information analysis and advocacy as well. If you're creating something and it sort of feels like the story's a bit too simple (I would call that “smelling a little funny”) then you want to check it. And the way to check it is not, of course, to sit and smell your screen. The way to check it is to test against other people. Try it on another data set if it's an algorithm. Ask a couple of your colleagues for feedback if you're trying out a visualisation. That's the way you verify your instincts, it's sort of a smell test.

You mentioned visualisation. It often seems like visualisation is a necessity now, as it's become a popular way of expanding research output. Some people are very excited about it. What's your stand? What's the role of visualisation in presenting research and influencing?

When I think about sharing information with different audiences, I am definitely a believer that text and the visual go together. That's how we interpret the world at this point. That said, I tend not to use visualisation as a word, as much as the word 'presentation'. I like presentation over visualisation. I say that because I think visualisation has gotten a little bit overloaded to mean fancy presentations of information. It doesn't have to be that, but that's what a lot of people mean when they use it. They want something that's really well-designed, and they want something that is impressive in a 'shock and awe' sense. I try to fight against that, because that's a barrier for a lot of people. They say - "I don't have the capacity to do that."

What I say is - "hey, you can think about how you present your information, and that could be in a data sculpture, or it could be in an interactive game. All of that is information presentation. Or it could be in a traditional bar chart. That might be the appropriate technique for your audience, for sharing that information." I think I tend to use those terms also, because when I'm working with novices in the field of information presentation, they're often intimidated by the professionalisation of the graphics, because it looks really fancy. There's been a bunch of research that shows how, when you show something that's very formal and presented with a nice clean interface, it tends to bestow authority on you.

That research is in Western cultures, but I believe it's true of other places as well. When you present something very formally, people assume that you're right. Which might be your goal. But if your goal is to engage someone in a conversation, to get talking to each other, then often a more informal presentation can help you get there. It doesn't necessarily get you there, but it can help get you there. I find there are other types of presentations, like using the power of art. For example, if you want to talk about recycling rates you can make a bar chart, or you could build a sculpture out of recycled materials. I really am attracted to those examples. I think they show a way that's within reach for a lot of other types of organisations that don't have the capacity to do fancy visualisations. And by using the word 'presentation' I think we can bring those people into the conversation. Because all of it is about conveying an information-driven story.

What are the most exciting projects you're currently working on?

I get excited about a couple of things. One of my main interests right now is digging into large sets of data that are about information, that are about people and stories and trying to understand what they say. To be concrete, we have tons of news reports and closed captions from TV screens. In our research centre we're really digging in to try to understand how stories move through media. And that to me is very exciting, because I think these stories that move through social and mass media define a lot of our world. They're the lens through which we see and understand how our society changes. That gets me super-excited. Any chance we have to get one little peek, one little insight of what's going on and how a story moves, is super-interesting to me.

On the flip side I'm also very excited about making change and impact in a concrete way. I think that digital technologies enable that. When we build tools to enable new ways of interaction or expression online, it's seldom the tool that's the hard part. It's the social process around the tool, it's how people use it. So I get super-excited when I get the chance to go out into a group of people that have helped me build something and get to use it with them and start that feedback loop of learning why and how this tool might help and where it doesn't. That to me is super-exciting. It's the rubber hitting the road, as we say. It's getting real and has an opportunity for real impact. It's not laboratory software or hardware that you parachute in. It's the chance to build something and use it together. It's this great learning journey that I really enjoy.

Recently I started working on food rescue in the neighbourhood I live in, which is very exciting because it's a town I care about. And by food rescue I mean the ability to take excess food (say the food they haven't sold at a farmers' market, or a grocery store that doesn't want to sell bread the next day), to take that food and give it to people in need. It's re-distribution, saying - "hey, this food is gonna be thrown away, it's excess, let's take it to a food pantry." Now that's a logistical problem. And technology is really good at solving logistical problems around planning and coordinating. So there are some people in Boulder, Colorado, that have built a great software system that does just that. I e-mailed them and I said - "can I build on this for what we're doing in our town?" And they said - "yes! Great, we love it! We have nine towns using this, we do a lot of pick-ups by bicycle." It doesn't work as well in Somerville for the winter, when there's four feet of snow everywhere... But I was able to grab some open source code, to work with the group in my town, to decide what their needs were. We did some great exercises around defining who all the players were, bringing that local perspective to the software itself. Because just like a lot of public health interventions, something that works in one town isn't necessarily going to work in another town. The same is true of software. This food rescue software might work in Boulder, but we might need to tweak it, either the software or how we use it, for Somerville. That's an activist example.

Give us an example of a project you have been working on at MIT recently.

Analytically, in my research, I get to work with a lot of super smart students. These are students that are coming into the MIT Media Lab for Masters Degrees. I'm a research specialist there and I get to collaborate with those students. Recently we decided to dig into this investigation, to try to understand what happened with the Trayvon Martin story in the US. This was a case in the United States where there was a young African-American male who was shot in Florida. That happens every day, unfortunately. It's a normal occurrence that someone's shot by someone else and dies.

This story somehow got to the point where President Obama was talking about how, if he had a son, he would look like Trayvon Martin. How did that happen? How did this story go from a local story to the national level? We wanted to understand it, because people were saying that this is one of the first great digital, on-line, grassroots stories we have. The bloggers and the change.org petition all bubbled it up to the national media attention. Now we said - "Okay, we need to dig into that. That's great, that would be awesome if it's true. How can we figure out if this is accurate or not? Do we have enough data? Can we collect enough information that we can actually make a statement about the accuracy of whether this was a grassroots movement that brought this one story up to the national setting?"

We grabbed as much data as we could. We looked at the Media Cloud project from the Berkman Center. We looked at the closed captions from TV news reports that archive.org has. We were able to get the times of all the signatures for the change.org people. We were able to look at Google searches over time. We just opened it up and grabbed everything we could and then we plotted it all. We asked - how often was this story mentioned in all of these media? Then we looked at that over time to try to get a sense of what happened. That led to a bar chart with multiple bar sets over time. That wasn't enough. So then we went out and talked to people, which was done by two of my colleagues, Matt Stempeck and Erhardt Graeff, who led the investigation. And when we looked at it, the principal research scientist in our group, Ethan Zuckerman, said - "how do we dig into this? How can we keep on adding layers to this?" Adding another layer is to understand the qualitative experience of what was happening. When was the press conference? What are the words being used? What's the framing around the story?
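Editor's sketch: the counting step described here, how often a story is mentioned in each medium per day, could be approximated along these lines. This is not the team's actual code. The function name `mentions_per_day`, the `(source, day, text)` item format, and the sample items are placeholders invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def mentions_per_day(items, term):
    """Count, per source and per day, the items whose text mentions `term`.

    `items` is an iterable of (source, day, text) tuples.
    Returns a dict mapping source -> Counter mapping day -> number of items.
    """
    counts = defaultdict(Counter)
    needle = term.lower()
    for source, day, text in items:
        # Case-insensitive substring match; each item counts at most once.
        if needle in text.lower():
            counts[source][day] += 1
    return counts


# Placeholder items, not real media data.
items = [
    ("blogs", date(2024, 1, 1), "A post about the example story"),
    ("tv", date(2024, 1, 1), "Weather update"),
    ("tv", date(2024, 1, 2), "Example story: press conference today"),
    ("tv", date(2024, 1, 2), "More on the example story"),
]

counts = mentions_per_day(items, "example story")
for source in sorted(counts):
    for day in sorted(counts[source]):
        print(source, day.isoformat(), counts[source][day])
```

Each source's per-day counts would become one set of bars in a chart like the one Rahul mentions. A simple substring match is a crude stand-in for whatever matching the real analysis used, so the output is only a starting point for the qualitative layer he goes on to describe.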

We started to see that there was a pattern. In fact, when we found out about who the network of actors were it turned out that it was a very well orchestrated press campaign that led to that national attention. Certainly there was bubbling and rumbling in the grassroots media, but the thing that really made it take off was a well-orchestrated press campaign by the team that was working with the family. That's an old media story. What's interesting is that we believe, from our analysis, that it stayed in the media longer because of the grassroots support for the story. Some of it was controversial, some of it was re-framing. Once something gets really popular in the US we have the left and the right, who want their own understanding and their own way of telling the story of this boy being shot.

In this case, a lot of the left was saying - "hey, this is about gun laws, this is about other issues that we care about, that we want to mobilise on." On the right, people were saying - "hey, this kid wasn't really a saint. He was a drug dealer." They're pulling anything they can. I'm not saying any of that stuff is true, but they were pulling all these threads and we can analytically see the story changing by looking at the words they're using in telling it. So we did that analysis and over time, we were able to create this narrative of how the story progressed. We believe that's helping us understand how other stories can move through media. In this particular example we have a deeper understanding and a deeper conversation about what exactly happened from a media analysis point of view and what might happen next time. And if you're an activist, we can understand how you can actually game the system.

In terms of storytelling, what is the role of negative data or negative information? By that I mean gaps in information or missing information...

When we're investigating things in a research setting we end up spending a lot of time trying to gather as much data as we can. That's true in an advocacy setting as well. And you always end up with gaps in that information. You always end up with pieces that are missing, either a whole type of data from a certain source, or just a piece in a larger data set where you're maybe missing a year. And that's incredibly frustrating. You sit there and you say - "if I just had this, this data would be perfect. I could make the best story." But sometimes you're able to say - "I'm missing this. I can make a better story out of this. I can make a story about what I'm missing, and I can make that part of my point."

I find that's more true in advocacy than research. In research, people really hammer on you if you don't have completeness, if you don't show a full story. There's this academic sense of asking questions. Now, that doesn't slow us down. We still try to paint as large a picture as we can, even if one section's missing. But we tend not to highlight that section that's missing as much. We don't say - "this part's missing, we're making assumptions about it." We generally say - "this part's missing, so we can't speak to that part of the story."

We often focus on the rational value of information and data, however we know that to influence people we also have to speak to beliefs and to emotions. How do you balance that?

Whenever you're looking at information and making a story out of it you have to consider the emotional components of it. Even if you're coming to it from an analytical point of view as a researcher, you still need to think about the emotional impact of what you're saying with that information. If I've got some giant data set and I'm analysing it, and I say that Q over P equals R, that still is some fundamentally real statement about that data. It means that there are more people living per unit square mile in Texas than somewhere else. It still means something.

While academics tend not to focus on the emotional impact of the message they're telling or saying, it's still there. I think the role that advocates play when they're doing this information advocacy is looking at that information and highlighting the emotional component. They're still starting from a story that they see in the information. And they're turning that story into an emotional appeal that has information in it.

That varies. There are certainly some places where the presentation is the information. But there are other places where you have the presentation of an information-based story that is just the emotion. The evocative image, as I call it, the picture you look at, or the graphic you just can't turn away from. You're thunderstruck, just left with your mouth open. And those are so compelling.

But I also like to think about it as a trajectory. The idea that behind that layer of emotional reaction, the gut reaction, there's nothing wrong with trying to be evocative even where you want to be able to tell a slightly more information-based story. There's a deeper layer beyond that, and a deeper layer beyond that. So when I think about these as layers, then I have a way to understand that the emotion isn't disconnected from the rational version of the story. It's part and parcel of the same thing and there are multiple ways you can go with it. In my setting, a research setting, we don't spend a lot of time focusing on emotion. We try to talk about the methods we're using for getting the story out of the information and the stories we find. But we're not trying to convince someone of something. We're not trying to cause behaviour change, so we tend not to focus on the emotional part. We're simply trying to understand the system.

But on the other hand, you know that your research and findings will be used by different groups to prove different points of view. What's most disappointing about that? Do you often see information being twisted by people who would like to make a totally different story out of it?

Whenever you're doing research and you release it in a paper or a talk or a blog post, people always have different understandings of it. Especially when it's information-based research, when you have some data set. It's always a double-edged sword - if it cuts one way, it can cut the other. Someone can take what you're talking about and spin it to mean something else, and you have no control over that. That's part of releasing your ideas into the world. That's how ideas work. They propagate from one person to another and sometimes it's a game of telephone, where the next person hears something totally different and the next person hears something totally different and the third person, they're speaking a different language completely. At other times, it's very clear and their story goes out and other people can confirm it, which is what we care about often in the research setting.

Part of research is the idea that we create a larger base of understanding that other people can draw on. That's part of its goal, its definition, and part of the philosophical and spiritual reasons I like research. This very utopian idea that we're increasing understanding in the world, which can certainly be criticised, is still an idea that underlies a lot of this work. And that builds with it the idea that people can use it for whatever they want. So it's a danger that's built-in and you can feel disappointment or anger when someone uses it for something else, but you can't really get that upset about it. That's part of what you're doing. You can also fight back. If someone's putting out an understanding you disagree with, you can double down and reinforce your understanding of it and call more people to you.

So it becomes, in essence, a conversation. That's particularly nice, because there's so much awareness now. In this day and age we have a much better chance of understanding and hearing about when someone else is using the information. It gets back to me. So I get a chance to respond, which is super exciting because that's often a great conversation. It can reveal something I missed in my information, that I got wrong - it's a benefit too.

There's a whole generation of people learning to collaborate using new media, which creates space for a fairly autonomous zone with no control, as long as it's established in a certain way. We've also recently seen a lot of movements trying to build up zones in the real or physical world. Many of these people are for the first time experiencing a freedom of information in the virtual world that they haven't been able to in the physical world. Like in Egypt, or Tunisia. But in all of these cases the transfer of the virtual into the physical was suppressed. Do you see that there's a possibility that societies can learn something about how to govern from the virtual world?

When we look at the Internet and the way that communities exist on the Internet, the first thing we see is a lot of over-hyping. The idea that a lot of these behaviours are new. And that tends to be a myth. For instance the idea of self-organising communities - there's a long history of self-organising communities, but traditionally they've been minority communities that have been suppressed by autocratic powers. So the first thing we say is - "hey, a lot of this Internet stuff isn't new, it's just in a new medium and it's a lot more people."

Now the fact that something isn't new doesn't mean it's bad at all, we just need to acknowledge what came before and to acknowledge that there's a history of those in power not allowing participatory models to take over. There's a long history of that being suppressed. I think some of what we see now is when people try to take models they see on the Internet and bring them into new states or existing states during moments of reinvention of a revolution. I think that there's an opportunity to learn from past failures.

Whether a government should work like some part of the Internet works is a great question that I can't answer. It's just too difficult. We have so many examples of how societies govern themselves on-line and off-line, that there's no way to say there's a right answer for that. I also don't have a lot of experience in those crisis communities so I haven't seen how those play out. But I think that in an academic sense, from a research point of view, when we look at trying to move systems of governance from the Internet to the real world, we also think about the opposite. Moving systems of governance from the real world to the Internet. And again, in the same sense as there are a lot of systems of governance on the Internet that aren't new, there are a lot of things existing in the real world that might be useful for the Internet.

We think about this as a two-way flow. That's how I like to think about it. And both places can learn from each other because frankly, it's not like running a country is a solved problem. We've got a lot of work to do and anything that we can try out is great. This is one place where the Internet actually shines. It's a better place to be experimental with systems of keeping people together and communicating and making decisions. It's just easier to try stuff out. That's because of the medium and how nimble it is, and the fact that it can leave all baggage behind. Now, you can argue about that in different platforms, but there are ways to leave a lot of baggage behind when someone goes to a virtual space that frees them to act in a different way and to govern themselves in a different way.
