Michael Kreil is an open data activist, data scientist and data journalist who works at OpenDataCity, a Berlin-based data-journalism agency that focuses on telling stories with open data. In this interview we talk about three of their projects: re:log, Malte Spitz's data retention project and crowdflow.net, all based on the information that we and our devices generate and the traces we leave behind. We discuss the de-anonymisation of data, creating tools for metadata analysis and why metadata might be the wrong word to use.
I have a problem with the term “metadata.” I don't think that this term is precise, because, simply put, the basic idea of metadata is that it's data about data. For example, if I take a photo, I can add data like the camera model, time and geolocation, so, the additional information about when and where the photo was shot is called metadata. But, for example, if I take a lot of photos, I can use the metadata contained in these photos to connect the location in which I took them with the time I took them. The metadata can be used to track me. So, from that point of view, metadata is the data itself, and that’s the interesting aspect, not the photos themselves.
I think that only people who add data to the data can use the term metadata. But, in general, from a public point of view, everything is data, which is usually about persons. So let's stop calling it metadata.
The interesting thing is that metadata seems to be some kind of by-product, yet it can be used to analyse certain behaviours, of a political and social nature, for example. Let’s take something simple as an example, like a phone call. Making a phone call doesn't seem very important. It's hard to analyse the content of one million phone calls or one million photos, because that analysis relies on speech recognition or face detection, both fields that are still developing. But it's pretty easy to analyse the metadata contained within them, because metadata has a simple, standardised format for every phone call: there is the date, the timestamp, the location and the numbers of the caller and the callee. This standard allows us to analyse a huge amount of metadata in one big database. We can ask, for example, whether things happening in the population are reflected in the metadata, such as who has depression or who is committing adultery.
One of my projects is built around all the emails I have sent and received. I found that when I send an email with the same content to two people, it instantly becomes interesting, as it shows some kind of relation between these two people. For example, I would never send the same email to my mother and to a customer, because they have nothing to do with each other. Based purely on that metadata, I can draw a rough social network of the people I'm exchanging emails with. I'm pretty sure that you can get a clear picture of the social network and the social structure of a company just by referring to metadata such as sender and recipient. There is additional information you can use when filtering your email traffic through metadata: who started the conversation, who changed the topic. You could also look at what kind of greetings and goodbyes are being used, which indicate whether the communication is formal or informal, as this reveals the kind of company culture: is it based on friendships or on hierarchies?
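The idea that shared recipients hint at a relationship can be sketched in a few lines of Python. This is only an illustration, not the code behind the project: the message format (dictionaries with a `to` list), the function name `co_recipient_edges` and the example addresses are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def co_recipient_edges(messages):
    """Count how often two addresses are recipients of the same message.

    Each message is a dict with a "to" list (hypothetical format).
    Only recipient metadata is used, never the message body.
    """
    edges = Counter()
    for msg in messages:
        recipients = sorted(set(msg["to"]))
        # Every pair of recipients of one message gets one shared "vote".
        for a, b in combinations(recipients, 2):
            edges[(a, b)] += 1
    return edges

# Made-up example data.
messages = [
    {"to": ["alice@example.org", "bob@example.org"]},
    {"to": ["bob@example.org", "alice@example.org", "carol@example.org"]},
    {"to": ["mother@example.org"]},
]

edges = co_recipient_edges(messages)
for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

Pairs with a high count are candidates for an edge in a rough social graph; an address that only ever receives mail alone, like the mother in this example, stays unconnected.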
Mail network by Michael Kreil
Text and images are very unsuitable for this kind of analysis; only metadata is really suited to it. Still, if you have an algorithm, or an idea about how to get data out of an image or text, a 95% accuracy is enough for data analysis. For example, by analysing images for their colours or for the use of an Instagram filter, you can find out if they were taken in daylight or at night, inside or outside, in winter or summer. When analysing telephone calls, you can ask: are there a lot of pauses in the conversation or are people talking constantly? Also, by analysing the sound of a phone call, you can find out whether it is digital or analogue.
Well, the background story is that, while taking part in re:publica, the digital conference, I met the guys who built the Wi-Fi system for the conference. I learned that, to control this system and avoid long-term crashes, they used a special script that scans every access point every 15 minutes and compiles log files detailing which devices are connected to which access points. So, even in case of a crash, these log files can be used to find out precisely the source of the problem. I asked them to hand me the log files, which they had anonymised by using hashes instead of MAC addresses. Within a couple of days, using this data, we created the first software prototype that demonstrates how a Wi-Fi network can be used to track devices.
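How the re:publica logs were hashed is not described in the interview. As a rough sketch of what replacing MAC addresses with hashes can look like, the following Python uses a keyed hash (HMAC-SHA256) from the standard library. The log row layout `(timestamp, access_point, mac)`, the function names and the secret are assumptions made for this example.

```python
import hashlib
import hmac

def pseudonymise_mac(mac, secret):
    """Replace a MAC address with a keyed hash (assumed scheme, not the original one)."""
    # Normalise spelling so the same device always maps to the same pseudonym.
    normalised = mac.strip().lower().replace("-", ":")
    digest = hmac.new(secret, normalised.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()[:16]

def pseudonymise_log(rows, secret):
    """Rewrite (timestamp, access_point, mac) rows with pseudonyms instead of MACs."""
    return [(ts, ap, pseudonymise_mac(mac, secret)) for ts, ap, mac in rows]

# Made-up example data; the secret would be kept private and discarded afterwards.
SECRET = b"example-secret-not-for-real-use"
rows = [
    (0, "ap-01", "AA-BB-CC-DD-EE-FF"),
    (900, "ap-02", "aa:bb:cc:dd:ee:ff"),
    (900, "ap-02", "11:22:33:44:55:66"),
]
log = pseudonymise_log(rows, SECRET)
```

A keyed hash is used here because the space of possible MAC addresses is small enough that a plain, unkeyed hash could be reversed by trying every address. Even so, the pseudonym still identifies one device across the whole log, which is exactly what makes the tracking described here possible.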
I think that, amongst others, this particular perspective on the conference is really interesting. And there are different ways of looking at this project. Examining the behaviour of the audience by looking at the location and movement of their MAC addresses makes visible how specific sessions and topics of the conference are connected to each other. For example, it can show two sessions sharing 90% of the same audience, or one device moving from one access point to another. There are some more ideas about how to put a conference online, so that it's not just an offline event. Usually, there are hundreds of sessions, and it's impossible to see all of them. One idea is to use a recommendation engine that helps the participants choose sessions they like: "People who saw this session also visited that workshop".
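A "people who saw this session also visited that workshop" recommendation can be sketched with plain set arithmetic. The Python below is a minimal illustration, assuming each session's audience is available as a set of hashed device IDs. The overlap measure (shared devices divided by the smaller audience), the function names and the data are assumptions for the example, not re:log's actual method.

```python
def overlap(a, b):
    """Share of the smaller audience that also attended the other session."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def recommend(session, audiences, top=3):
    """Rank the other sessions by audience overlap with the given session."""
    base = audiences[session]
    scores = {
        other: overlap(base, audience)
        for other, audience in audiences.items()
        if other != session
    }
    # Highest overlap first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda name: (-scores[name], name))[:top]

# Made-up example data: session name -> set of hashed device IDs.
audiences = {
    "Session A": {"d1", "d2", "d3", "d4"},
    "Session B": {"d1", "d2", "d3", "d9"},
    "Workshop C": {"d7", "d8"},
}

best = recommend("Session A", audiences, top=1)
```

With real data, a threshold on the overlap value would be needed so that sessions with almost no shared audience are not recommended at all.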
A lot of people who saw this application wanted to know which dot they were. Actually, we could also add some de-anonymisation functions in there. For example, in this app it's possible to select a group of dots, and I had the idea of creating an intersection of dots by pressing shift. So it’s possible to select all the dots that visited the same session as I did, then select the intersection with all those dots that visited another session, and so on, until you find your dot by following your route of session attendance.
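The narrowing-down by intersection described here can be shown with a small Python sketch. The session names and hashed device IDs are invented for the example, and `candidates` is a hypothetical helper, not a function of the re:log app.

```python
def candidates(attendance, my_sessions):
    """Return the devices that were seen at every session in my_sessions."""
    sets = [attendance[name] for name in my_sessions]
    if not sets:
        return set()
    return set.intersection(*sets)

# Made-up example data: session name -> set of hashed device IDs seen there.
attendance = {
    "keynote": {"h1", "h2", "h3", "h4"},
    "talk-a": {"h2", "h3", "h7"},
    "talk-b": {"h3", "h8"},
}

# Each additional session you remember attending shrinks the candidate set.
step1 = candidates(attendance, ["keynote"])
step2 = candidates(attendance, ["keynote", "talk-a"])
step3 = candidates(attendance, ["keynote", "talk-a", "talk-b"])
```

In this toy data, three remembered sessions are enough to leave a single device. The same procedure works for anyone who knows another person's route through the conference, which is why a hashed ID alone does not guarantee anonymity.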
Screenshot from re:log website
There's also another perspective, and that's one of actually having a surveillance system, because having an overview of 150 access points means having a surveillance system. It was really interesting that a lot of companies tried to contact us, wanting such a system in their conferences, stores and shopping malls. Around 30 companies contacted us just in reference to this project and, in the end, we didn't work with any of them.
You can use this application for two things. Firstly, for collecting, anonymising and publishing data, and using it to connect people and enable them to share conference reviews with each other, as well as to give feedback to the conference managers about the quality of their conference. On the other hand, the application shows what surveillance is about. We are not just talking about conferences; we are talking about the possibilities that companies like Deutsche Telekom have, owning a hugely distributed Wi-Fi network with around 750,000 Wi-Fi hotspots in the whole of Germany. Or, for example, Starbucks Germany has a special contract with British Telecom, so the routers are probably also owned by British Telecom. It would be interesting to find out if British Telecom actually does analyse, or maybe even has a legal obligation imposed by GCHQ to analyse, which MAC addresses visit Starbucks stores in train stations and airports. Perhaps they can track device movements within Germany and throughout Europe. I'm not sure in which countries Starbucks is bound to special contracts with British Telecom. The main difference between this kind of surveillance and the system that we used in re:log is that re:log allowed an opt-in for an active connection with the access points. However, if you own such a system of access points, you can also “listen” to passive communication: if you have a standard Wi-Fi router, you can change the software, enabling it to collect the MAC addresses of devices that pass by. A Wi-Fi network can also be made into a sensor network; it is just the software configuration that makes the difference.
I think 95% of the people really loved it and 5% of the people criticised it. A few even said that it's the same thing as European Data Retention. Our deeds were criticised. When we finished the project, we were concerned about the pros and cons, but in the end, the most important thing for us was to actually make people think about surveillance, and that's why we published it. We're pretty sure that it's not possible to find any private information in there and, especially given the public character of the conference, even if it were possible it wouldn't do any harm. But it was definitely created with the vision of encouraging people to think about it and understand what such a surveillance system is all about.
That's a good question. A conference is a very public space. I still think that the main purpose of this conference is actually to make people get together. A lot of people have name tags.
I think if we could somehow add names or labels to the dots in re:log, and I'd like to point out that we wouldn't do it, the creepiness would definitely be stronger. But I think it wouldn't affect people’s personal lives on a big scale, at least in comparison to the data that Google, Facebook and Apple or your mobile phone service providers are collecting on a daily basis. There are Wi-Fi and mobile networks everywhere and there could be some kind of much bigger system behind this, much more dangerous than what we did at re:log.
Not at all. We are facing a certain problem: we're getting the data and trying to find out what’s inside of it, trying to find algorithms and writing software that actually shows what's inside. So it's more like a translation process. And, at all times, we think about "translating the data", making it easier for the people to understand, but we never actually think about putting it in a specific context. So, for example, if I just invert the colours, making it a black application rather than having a white background, maybe the feedback would be totally different. Maybe that's the one thing that nobody really thinks about: that every software has a dual use. Maybe we should make a second web page where we duplicate every project and just invert the colours, to make this problem of duality visible.
The inverted interface of re:log by Michael Kreil
For example, we made this data retention visualisation with Malte Spitz. It shows that data retention means that you can follow every person for six months and analyse what they are doing. But this is also very interesting for the police! Probably nobody thought about that before, but maybe the police understood the potential of data retention and really started to campaign for it. So, perhaps we even helped sell more surveillance software and created more arguments for the police to actually want to own these possibilities. Before, it was just an abstract idea and now they're saying: "Well, that’s something we need". And that's the problem with software: it can be used in both ways, for good and for bad.
It was actually my first data journalism project. Before that, I created data visualisations using unemployment data and stuff like that. Then I met Lorenz Matzat. He had gotten the data from Malte Spitz and asked me what I would do with it. There was this Excel sheet with, I'm not sure, 36,000 lines of whatever in there, and there was no tool at all to have a look inside. You could make a simple map, using just the geo-location data, but you wouldn't see the aspect of time. You wouldn't see the movement. So, I wrote a small prototype, just a simple map with a moving dot. This was actually the basis of the application that went online a few weeks later.
Especially in journalism, you don't have time at all. So, it really sucks the life out of you. There's a difference if you have a day-to-day job. With data journalism, either you have to publish something in three days because it is a hot topic, or you have time to dig in and finalise the project during the following weeks; I definitely prefer the second type of project.
Years later, Balthasar Glättli also wanted an analysis of his data. In the end, he didn't just give me his telephone data, he gave me everything else that is collected by the data retention in Switzerland. So, there was also the metadata of every email and the owners of telephone numbers or email addresses. He was really open about his private metadata. I'm really honoured that Balthasar trusted me on that level.
Malte told me that when he gave his data to us, he had the idea of visualising just one work day. But we built this web app that actually shows six months of his life. He was really perturbed about that. But the more intimidating this application is, the more important it is to publish it. The more it hurts, the more important it is to actually publish it. That, I think, is the main reason why Malte and Balthasar actually published so much private data about themselves. Additionally, Balthasar had a few problems, because he's also on the Defence Committee of the National Council. His metadata contained the location of a secret hideout that he visited. It was secret, but his phone provider collected Balthasar’s locations and, once this data was published, some journalists found the hideout and published it. It was too late to remove it. It’s an interesting thing that when cellphones are tracked all the time, you should actually think constantly about when to switch off the cellphone in your pocket.
No, not at all. Sorry, not at all, anymore. I think you can even generate more metadata from existing data. Just think about living near a tourist sight or something like that: how many tourists took holiday pictures and published them on Facebook? How many of these pictures contain your face? Then, one day, somebody will create a facial recognition algorithm and gather even more metadata about you.
OpenDataCity has a special team of really high-level nerds: experts on hardware, servers, software development, web design, user experience and so on. I contribute the more mathematical view on the data. But usually a project is done by just one person, who is chief and developer, and the others help him or her. So, it's not like a group project. Usually, it's a single person and a lot of help. That definitely makes it faster than having a big team and a lot of meetings.
There are no rules. Just start. Share ideas, show projects and learn from others. Always learn new stuff! I think the most important toolbox is lodged in your mind, so it's more in your head than in your code.
A lot of stuff is on GitHub already.
Also, the more complex a software project becomes, the more work you have to put into it, and that work grows exponentially. So, keep it simple and make it fast. It's much easier to write software, throw it away and start over again quickly than to have this huge generic system that tries to do everything. It doesn't make sense. It's just too much work. You'd get this huge software system with a thousand dependencies and, in the end, it's really hard to innovate, to get new stuff in there or, in the worst case, to change the concept. Almost every piece of software that we have published is not generic but is used for only one case. So, keep it simple and get a prototype in under three days.
I've worked a lot with metadata and created many projects around it, so it's more of a gradual learning curve. However, there was one thing that I was really surprised about. I published data that I thought was anonymised, but, two or three months later, I was able to de-anonymise it. This happened with crowdflow.net, a project about consolidated.db, the database that records all of your movements on iPhones and that was backed up unencrypted on PCs and Macs. Crowdflow started with somebody publishing their consolidated.db. I combined this person’s database with mine, then cross-referenced the geo-positions. It seemed that we had visited the same events and were just 100 metres apart from each other. So, I thought: “Well, there is actually a social aspect in that. So, maybe I should start up a small webpage and try to collect as many of these consolidated.db's as possible.” I gave it a deadline of just 24 hours and, if I wasn't finished by then, I would stop. So, in one day I started the webpage, created a Java application that extracts the data, and added a way to upload your consolidated.db, if you wanted to. In the first round 30 people sent me their data. Then I found out that you can see all the Wi-Fi access points in Berlin. So, I created some nice maps and more people sent in their data. Then I made some further analysis based on MAC addresses: what kind of Wi-Fi routers people have at home, for example. This encouraged more people to give me their data. I had something like 1500 data donations from people who had sent me the geo-location of their iPhone. I created this big database of all the Wi-Fi routers in Germany, Europe and America.
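Cross-referencing two location histories to find moments when two people were about 100 metres apart can be sketched as follows. This is an illustration under stated assumptions, not the crowdflow code: each trace is taken to be a list of `(unix_timestamp, latitude, longitude)` tuples, distances use the standard haversine formula, and the thresholds and coordinates are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in metres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def close_encounters(trace_a, trace_b, max_m=100, max_s=900):
    """Pairs of timestamps where both traces are near each other in space and time."""
    hits = []
    for ta, lat_a, lon_a in trace_a:
        for tb, lat_b, lon_b in trace_b:
            near_in_time = abs(ta - tb) <= max_s
            if near_in_time and haversine_m(lat_a, lon_a, lat_b, lon_b) <= max_m:
                hits.append((ta, tb))
    return hits

# Made-up example traces (timestamp in seconds, latitude, longitude).
trace_a = [(1000, 52.5200, 13.4050), (5000, 52.5300, 13.4050)]
trace_b = [(1300, 52.5205, 13.4050), (5100, 52.5000, 13.4050)]

encounters = close_encounters(trace_a, trace_b)
```

The nested loop compares every point with every other point, which is fine for two people but would need a spatial index or time-sorted sweep for the roughly 1500 donated datasets mentioned here.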
I published an anonymised database of Wi-Fi routers and I thought: “Great, that’s nice”. Then, a few months later, I was in a small village near Berlin, where my parents live. I looked at the map and, according to it, there were seven Wi-Fi routers in my parents’ garden. It didn't make any sense; they only have one router. So I checked every MAC address and found out that all the Wi-Fi routers in a radius of about 50 to 100 metres were somehow concentrated in the database at my father's house. It didn't make sense. Why would Apple do that? Then I realised what was happening: the iPhones were trying to triangulate the position of Wi-Fi routers. But, in small villages with a low iPhone-per-square-mile ratio, these iPhones couldn’t triangulate the routers, because they receive the routers’ signals from just one point: the point where the iPhone owner lives. So Apple makes the assumption that all the routers in your neighbourhood are on your estate.
So I started to write an application to find these small groups of dots, and, in the end, I had a database with the locations of iPhone owners. It's interesting that you can use this data to de-anonymise iPhone users, even though none of them has chosen to donate their data. So, the breach was not in people giving us the data; the breach was in the data going to Apple, from Apple to the iPhone users, and from the iPhone users to us. So, it's a really indirect hack. In the end, nobody cares because, come on, whether I have an iPhone or not is no problem to find out anyway. In the end, I presented it at conferences, trying to explain the problem with metadata. It's interesting that you can't control how information is published anymore. You can wait until a new piece of software or technology is developed, or develop a new algorithm yourself, and then you can read even more out of a public dataset than before.
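The search for these small groups of dots can be sketched as a grouping of routers by position. The Python below is only an illustration of the idea, not the application described here: the row format `(router_id, latitude, longitude)`, the rounding precision and the minimum group size are assumptions chosen for the example.

```python
from collections import defaultdict

def stacked_routers(routers, precision=4, min_size=3):
    """Find positions where several routers are recorded at (almost) the same point.

    Coordinates are rounded to `precision` decimal places; four places is
    roughly ten metres of latitude. Groups smaller than `min_size` are dropped.
    """
    groups = defaultdict(list)
    for router_id, lat, lon in routers:
        key = (round(lat, precision), round(lon, precision))
        groups[key].append(router_id)
    return {pos: ids for pos, ids in groups.items() if len(ids) >= min_size}

# Made-up example data: three routers stacked on one point, two spread out.
routers = [
    ("r1", 52.4, 13.1),
    ("r2", 52.4, 13.1),
    ("r3", 52.4, 13.1),
    ("r4", 52.5201, 13.4049),
    ("r5", 52.5312, 13.3847),
]

suspicious = stacked_routers(routers)
```

Each position returned is a candidate for the home of a single iPhone owner, following the explanation that routers in sparsely covered areas end up recorded at the one point from which they were observed. In a dense city, stacked positions are far less likely, so a real analysis would probably also take the local router density into account.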
It’s increasingly difficult to control your data and information. We can't prevent it, but at least we would like to understand more about it.