A selection of topics, approaches and ideas for those who are thinking of getting better at working with data and want to see how others struggled before them.
"Decoding Data" is the first in a series of Guides that Exposing the Invisible launched in 2016. It is based on materials that we produced while working on Exposing the Invisible, along with new material we commissioned. Decoding Data should give you a good sense of work in progress; we don't want to claim any uniqueness or special significance for the material presented, but it is a solid selection of topics, approaches and ideas - a great companion for those who are thinking of getting better at working with data and want to see how others struggled before them. For us, taking on the entire series was a sort of internal call. We are sitting on so much great content that, given a different structure and a little new content here and there, it could be a decent contribution to what everyone else is doing - trying to improve the way we work with data, images and sound, and to express respect for our sources and contributors.
This guide is not comprehensive, but it is very condensed. Yes, it might come across as a mixture of styles and organisational principles, but we don't know anyone who reads guides from A to Z (except we actually do), and the format should not take you long to find your own way through to the places that are of most interest to you.
What’s different about this Guide, yet another guide about data?
In one sentence we would say: it focuses on the efficacy of data where real possibilities exist, and it definitely does not fetishize its power.
The Guide in numbers: 11 people from 8 countries contributed to the writing of this Guide | 20 countries were featured in 38 case studies | It is 142 A4 pages of 12pt font along with 76 images | We recommend 64 tools along with 89 resources | Metadata is mentioned 159 times, privacy 39 times and data is mentioned 798 times...
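Counts like these can be reproduced with a few lines of code. Below is a minimal sketch that counts whole-word, case-insensitive mentions of each term in a text; the sample string is invented for illustration, and in practice you would read the guide's full text from a file instead:

```python
import re
from collections import Counter

def term_counts(text, terms):
    """Count case-insensitive, whole-word occurrences of each term in text."""
    counts = Counter()
    for term in terms:
        # \b word boundaries stop 'data' from also matching inside 'metadata'
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

sample = ("Metadata matters: metadata describes data, "
          "and data without metadata is hard to trust.")
print(term_counts(sample, ["metadata", "data"]))  # metadata: 3, data: 2
```

The same function run over the full guide would reproduce the figures quoted above.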
Image created by John Bumstead
This chapter will look at three ways to 'find' data: firstly, governments releasing data through Freedom of Information requests or Open Data initiatives; secondly, whistle-blowers and institutions leaking information, and using databases to find data; and lastly, when you can't access data directly and have to find creative approaches to create it.
Before we begin
There are many directions you can take when deciding what data you need to explore, and/or to expand your investigation. Before we begin to talk about where to find that data, it is important to consider what you need that data to do.
We'll begin with five considerations which we have found helpful in deciding what kind of data we need and where to source it.
- How the data specifically aligns with your aims
This relates to what you are trying to achieve, who your audiences are and the sorts of data outputs which are likely to be useful to them.
What sort of data do they need, find credible and can you collect it? How are you going to get the data to them and what do you hope they will do with it? How are you going to know that the data is useful in advancing your goals? Thinking about how you will use the data in advance will have a significant impact on how you collect it as well as on the level of detail of that data. Being clear about this from the outset will save you from having to go back and fill in holes.
- Recycling existing sets of data
There is likely to be an 'ecosystem' of both advocacy and official organisations and groups already collecting and publishing information about your issue.
Could you improve how existing information is used by adding value to it and producing a new analysis or providing it to new audiences in interesting new formats or services? Could you achieve your aims by establishing a partnership with such a group? There may, however, be good reasons to proceed independently even if others are already working on the issue. Those reasons may include creating critical and alternative information resources, fostering the resilience of the advocacy sector as a whole by doubling up, and building your own skills and capabilities.
- Research methods and supporting technologies
What sort of method are you going to use to gather the data for your investigation?
Documentary and investigatory methods common in human rights work, for example, rely on reporting on or interviewing victims and survivors of incidents that might constitute human rights violations. However, using this sort of material as the source of a meaningful analysis requires an understanding of statistical sampling, content analysis and a range of technologies.
- Risks that you and others involved in your project will face
Because much advocacy focuses on sensitive, taboo or politically charged topics, it is essential to be sensitive to the kinds of risks that collecting data may involve.
What measures are you going to take to protect the identities of interviewees and the substance of the material they give you? If you store information on a computer, what measures will you take to ensure it is available to the right people, and doesn't fall into the wrong hands?
- The scope and sustainable growth of your initiative
Try to sketch out the scope, scale, geographic range and comprehensiveness of the data collection initiative that you are planning.
Will the scale at which you are working allow you to cover the issue usefully? Are you allowing enough time and resources to test things out? Are you running a 'one-off' project or a longer term initiative? If you are planning for growth, what sorts of systems and people management challenges do you think you will encounter as your initiative scales up?
Hopefully these questions can help you decide what kind of data you might need for your investigation. We will now look at where this information can be found.
Digging for Data
The release of data through Access to Information laws, whistle-blowers and technical and creative innovation has made more and more data publicly available. Governments and international institutions are publishing more, and a growing collection of initiatives around the world are making it easier for us to use this data in a meaningful way. This increased amount of available data is being used by investigators, artists, activists and technologists to expose abuses and excesses of power, clarify how tax money is spent and examine how public services perform.
The opposite is however also true. Data remains as hard as ever to find on issues that are opaque, such as corporate and national security issues. In the majority of countries, not least those that are closed and repressive, Freedom of Information legislation remains a seemingly unattainable goal. The data that many activists have is often found, leaked or discovered through tenacious and risky investigations that take unusual forms. We are living in a Quantified Society where everyone is generating data on a constant basis. This aspect is extremely important to Exposing the Invisible, as it creates new channels to explore and options for cross-examining prevailing narratives. There are places and people who are neither connected to the internet nor have access to technology. Unfortunately, their livelihoods and records are also affected by the Quantified Society, as various actors race to be the first to connect and quantify these people and places, often with very little concern for privacy, consent, human rights and other political, social, economic and cultural consequences. This means that we are not relying only on leaked data; the data is out there, even if we struggle to understand the multifaceted character of the data-driven society we embrace.
M.C. McGrath describes these three different ways of getting data in his interview with Exposing the Invisible.
“There are three main ways that people and groups can get data from governments and organisations. One of them is governments releasing them themselves, either through FOIA requests or open data initiatives, which is great and people have done a lot of great work around, but the governments can also choose what to release and spin it. So that's important, but in some ways it's the weakest way.
Then, there's leaking documents. So whistle-blowers releasing documents, giving them to journalists and the media and them being released that way.
The third way is just taking advantage of the data that people and institutions leak accidentally themselves. The powerful thing about that is that people don't explicitly decide to release it, not even a whistle-blower explicitly decides to release this information. It's up to people to collect and make sense of it on their own. It doesn't rely on any other entity, except for the people who are accidentally releasing it, which will always continue to happen in some fashion.”
This chapter will look at three ways to 'dig' for data:
The data is there but you have to ask for it: Governments releasing data through Freedom of Information requests or open data initiatives.
The data is there but you have to search for it: Whistle-blowers and institutions leaking information, using databases to find data.
The data isn't there and you have to create it: Finding data from unexpected sources.
The data is there but you have to ask for it: Governments releasing data through Freedom of Information requests or Open Data initiatives
How much information and data do you think the public bodies in your country create? Governments have long published at least some sorts of data, often through national statistics offices, or through various different thematic websites. However, the current scale and nature of data publication by some governments is very different to the scale and nature of even a few years ago.
There are two complementary activist 'movements' working specifically in this area. The first is the ‘Access to Information’ movement, also referred to as Freedom of Information (FOI). The second is the ‘Open Data’ movement. Access to information activists put pressure on governments to enact and implement laws enabling people to ask questions of any official body that is part of or controlled by the state, and receive prompt and thorough answers. They draw on the idea that information produced using tax money is owned by the tax-paying public, and should be made available to them without restriction. As public bodies respond to people's queries and pro-actively publish the information they create, people are able to see, better understand and scrutinize the workings of the public bodies they fund. Access to information is seen as necessary for effective participation in public life; a tool to redress one sort of imbalance between people and the powerful institutions that govern them.
Open Data activists build on these ideas and concern themselves with the re-use of data and information released by public bodies. This follows on from two important changes created by the internet:
a collapse in the costs of sharing any kind of information, and the methods by which information can be shared and consumed; and,
the fact that 'digitally native' people everywhere are creating and consuming information on the internet. Many people use online forums, social media and blogs as a key part of their lives, using it to learn and form opinions and seek advice. Other, more technical groups ‘mash up’ data – putting it online, showing it on maps, making it searchable - to try and show interesting or new things.
The mixture of technological accessibility and connectivity at ever lower cost, together with regulations forcing institutions to share publicly funded data, has led to the idea of open data rising in prominence over the past few years. Open data is possible because these institutions already use information and communication technologies to gather and analyse the data - and if you already have the data in human- and machine-readable formats, why would you not make it open? It can then be verified and reused in different contexts. Open data is beginning to be experimented with across a spread of governmental and civic activities, often with interesting results. The impact of these experiments will take longer to determine, and a common objection from advocacy groups is that the availability of more data doesn't automatically translate into more effective services. Open data and FOI are not ends in themselves; they are far from perfect, and there is an art to using them effectively.
In her September 2015 paper entitled Data Science for Good: What Problems Fit?, Julia Koschinsky addresses the challenge of figuring out not only how to gain new and actionable insights from data, but also how to translate those insights into impacts. She analyses open data initiatives that are considered effective and identifies the types of problems where data science techniques can add value. She defines 'effective' initiatives as those that are “widely assumed to generate new and actionable insights and have social impacts.” She divides 72 case studies into four main categories:
improving data infrastructure by combining data with higher temporal and spatial resolution and automating data analysis to enable more rapid and locally specific responses
predicting risk to help target prevention services
matching supply and demand more efficiently through near real-time predictions for optimised resource allocation, and
using administrative data to assess causes, effectiveness and impact. In almost all cases, the insights that are generated are based on an automated process, localised, in near real-time and disaggregated.
The case study below looks at another effective case, one not featured in Koschinsky's research, but one that fits into more than one of her categories of open data initiatives with a social impact - most closely, matching supply and demand for optimised resource allocation.
Transparent Chennai and the Right to Information Act
Lydia Medland, previously of Access Info Europe, a European non-governmental organisation which defends the right to information in the service of human rights, says of India, which implemented the Right to Information (RTI) Act in 2005, that it is:
“...a good example of where RTI has taken off in a big way. There is a lot of civil society activity around RTIs. Just asking the question sends the signal that the community are wanting accountability that isn't happening. People want to know simple things, like who is the person in charge of issuing passports, or food ration cards, and that sends the message and results in action; or something like what is the attendance record of the school teacher, and then the school teacher starts coming.”
The impact of the RTIs that Medland talks about is borne out by the chilling statistic that between 2010 and 2012, more than 100 RTI activists in India were killed, harassed or assaulted.
Transparent Chennai is a good example of how RTI requests are being used as a tactic to get around the absence of open data and open data infrastructure. The data made available through an Indian RTI request is, however, not in a digital format, not available online and not free (though the cost is negligible). Still, the work of Transparent Chennai highlights how consistently asking questions and getting answers is key to investigation and can lead to unexpected discoveries.
Transparent Chennai is an advocacy initiative in Chennai, India that can, oddly, credit its successful work with data to a lack of toilets. Transparent Chennai provides citizens with information about public services in the city for them to use in advocacy with the government. They found that the women in the urban slum communities they worked with needed to know where their public toilets were, and wanted more of them close to home and workplaces. Transparent Chennai realised that none of this information was easily available - who decided where toilets were built, where they actually were, who used them, where the funds for them came from and so on. Mining the city's labyrinthine bureaucracy, they found that information about these services was scattered across different zonal offices, which they had to physically visit to access - after, of course, waiting for hours and having to make repeat visits.
They found that these numbers weren't even accurate: direct requests for information from zonal offices put the number of public toilets at 572 (for a city where at least 10,000 houses have no toilet!), but that number climbed to over 750 when Transparent Chennai filed RTI requests. Next, they had volunteers map the locations of these toilets. They found that the toilets had not been built where they were most needed (near slums, near informal markets and so on); that they were unevenly distributed, tending to cluster in areas that weren't residential; and that there was corruption - city councillors awarded contracts for building toilets were pocketing the money rather than getting toilets built.
Most interestingly, they found that many newer urban slums didn't get toilets because they just didn't exist on maps and city plans. The city of Chennai hadn't officially documented a ‘slum’ since 1985, even though many had grown up with the arrival of migrant labour from across the country. So if a slum didn't technically exist on paper, how could it merit a toilet? Transparent Chennai was able to show that urban planning was extremely inefficient and didn't fulfil the requirements of the most marginalised communities. Furthermore, digging through existing data revealed that the population figures the city was using in its planning were incorrect; the actual population of the urban slums was 70% greater than the city was accounting for. Transparent Chennai has since moved from toilets to public transport, road safety, sanitation and housing; they aggregate information from a variety of sources and make it available to other human rights activists and community organisers who work with the urban poor.
This example provides a new picture of how governments, businesses and advocacy groups could (and perhaps should) function in the internet age. These new trends in getting and using public information have been developed using a combination of re-thinking ideas about transparency and a redefinition of the methods and ways that data and technologies can be put to use in advocacy.
There are many people all over the world working on Open Data initiatives, and trying to establish and use Freedom of Information Laws. Below is a list of resources to start investigating the data which may be available in your country.
Further resources for this section:
Beyond Access: Open Government Data and the Right to (Re)use Public Information (download pdf), a report by the Access Info Europe and Open Knowledge Foundation, gives a good overview of the state of access to information and open data movements around the world.
Freedom of Information Advocates Network (FOIAnet): who is in the access to information movement in your country?
Find public data that might be useful to you by looking at DataCatalogs.org.
Watch Lydia Medland in her interview with Exposing the Invisible 'From Freedom of Information to Genuine Accountability.'
The data is there but you have to search for it: Whistle-blowers and institutions leaking information and using databases to find data
Whistle-blowing has become much more high-profile in recent years. In this section, we will look at the information that can be made available through the acts of whistle-blowers. This data is often put into searchable databases hosted by organisations such as WikiLeaks where people can go through huge amounts of indexed and searchable data on a diverse set of topics.
These aren't the only databases that we will explore. Company databases, international and national finance databases and worldwide registries are also a well-utilised source of information for those wanting to investigate issues of corruption and abuse of power.
Whistle-blowers and institutions leaking information
Whistle-blowing is a crucial source of intelligence to help us identify government, company and individual wrongdoing. We will look at three examples all of which deal with classified information that has been made public, all of which have some relationship to WikiLeaks and all of which have been taken and made into searchable databases open to the public. The difference between these examples, however, is the level at which they happened and the types of confidential information they leaked. TuniLeaks looks at government leaks, Hacking Team leaks focused on a company leaking information and Transparency Toolkit focuses on information that individuals working in the intelligence sector are leaking. These examples also deal with different types of confidential information: 1) public but classified, 2) private but secret, and 3) presumed hidden but actually accessible.
1) Public but classified, TuniLeaks: In November 2010, WikiLeaks began releasing a quarter of a million leaked internal memoranda (‘cables’) sent between United States embassies and the State Department. The cables cover over 40 years of confidential reporting, opinion and analysis by US officials about diplomatic relations, human rights, corruption, politics and events in nearly every country of the world. Immediately after the WikiLeaks release, Nawaat de Tunisie - an independent news website run by a collective of Tunisian bloggers and digital activists - started looking through the cables for what they could reveal about the Tunisian dictatorship of Zine El Abidine Ben Ali. Nawaat set up Tunileaks to pull together the cables from the US Embassy in Tunis, translate them from English into French and then spread the content widely across the Tunisian internet.
Tunileaks was put online days before a remarkable chain of events was set in motion. For years, the Tunisian regime had been successful at suppressing public dissent about its corruption and human rights abuses. In mid-December 2010, citizen-made videos and reports about protests, sparked by the suicide of a young man in response to the dire economic and political situation, began spreading across social media. These videos and reports were picked up and re-broadcast on television and online by the Al Jazeera news network. In under a month, the dictatorship had fallen. In a piece written in 2014, Sami Ben Gharbia, co-founder of nawaat.org, talks of the impact that Tunileaks had:
“In a chat with a British journalist this year, Ben Ali’s propaganda minister Oussama Romdhani confessed that “Tunileaks was the coup de grâce, the thing that broke the Ben Ali system.” It wasn’t the information about corruption and cronyism; Tunisians didn’t need Tunileaks to tell them their country was corrupt. Tunisians had been gossiping and joking about the corruption for years. What was different was the psychological effect of an establishment confronted so publicly with its own ugly image. It was that the government knew that all people knew, inside and outside the country, how corrupt and authoritarian it was. And the one telling the story wasn’t a dissident or a political conspirator. It was the U.S. State Department, a supposed ally.”
Tunileaks illustrates two useful ideas. First, it shows the value of keeping an eye on external resources for information that could be brought into play; sometimes we are too narrow about where we look for relevant information. Second, Nawaat successfully repackaged existing information to make it accessible in a timely way to audiences who would never otherwise have been able to access it.
2) Private but secret, Hacking Team: On July 8, 2015, it was revealed that the Italian technology company Hacking Team had suffered a breach and that their internal email database was now available online. Hacking Team is one of many companies that make and sell surveillance technologies and products.
Over 400 gigabytes of internal emails (more than 1 million), source code, invoices and documents from the Hacking Team hack are now in a searchable archive at WikiLeaks; we recommend that you install and use the Tor Browser before searching (see Chapter 8: Protecting Data for more information). Both WikiLeaks and Transparency Toolkit published this database, which reveals details about the company's operations, contacts and communications with governments and companies around the world.
The information came from leaked material; whether the data breach was caused by an internal informant or an external hack is unknown, or at least unpublished. These leaks are of interest to journalists, NGOs, researchers and investigators who wish to analyse their content, as they offer rare insights into the capabilities and practices of this very secretive company. Searching through this database revealed a number of findings.
Details of Hacking Team's client list and business model. Hacking Team sold their surveillance technologies to a number of governments and regimes with poor human rights records, including ones criticised for aggressively monitoring the activities of activists, lawyers and journalists. Hacking Team was found to have sold surveillance technologies to the governments of Sudan, Ethiopia, Bahrain, Egypt, Kazakhstan and Saudi Arabia. The governments of Bahrain, Egypt and Morocco have also invested in surveillance technologies. Before these leaks were published, Hacking Team had explicitly denied working with numerous repressive governments. In 2013, Reporters Without Borders named Hacking Team as one of the 'corporate enemies of the internet'. Hacking Team responded with the following statement: “Hacking Team goes to great lengths to assure that our software is not sold to governments that are blacklisted by the EU, the USA, NATO and similar international organisations or any ‘repressive’ regime.”
The devil is in the detail. As yet, it has not been possible to verify the veracity of the documents. However, these leaks offer a rare opportunity to look into the inner workings of a company like Hacking Team. Share Lab, the Investigative Data Reporting Lab, took this opportunity and published an investigation that focused on what could be learned from examining the company’s metadata. They noted that there is an ongoing conversation about the significance of metadata, and that these leaks contained a wealth of it. So they undertook an experiment and tested different methodologies on the metadata available. The goal of the research was not to conclude anything about Hacking Team's activities, but to use Hacking Team as a case study in how metadata analysis can be performed and what can be learned from it, and to inform both a scientific and a 'popular' audience about the real importance of metadata for our privacy. The Lab hopes that others will be inspired to use similar techniques in their own research and find new connections and leads based on metadata.
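Share Lab's actual methodology is described in their report; purely as an illustration of the kind of metadata that email headers carry, the sketch below uses Python's standard email module on an invented message (the addresses, date and subject are made up). Even without reading a single body, headers like these reveal who talked to whom, when, and about what:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# An invented raw email; a real leak archive contains many thousands of these.
raw = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 06 Jul 2015 09:30:00 +0200
Subject: Quarterly invoices
Message-ID: <abc123@example.com>

The body is often less revealing than the headers above it.
"""

msg = message_from_string(raw)
# Extract a metadata record: sender, recipient, timestamp, subject.
record = {
    "from": msg["From"],
    "to": msg["To"],
    "when": parsedate_to_datetime(msg["Date"]).isoformat(),
    "subject": msg["Subject"],
}
print(record)
```

Run over a whole archive, records like this can be aggregated into timelines and communication graphs, which is the broad kind of analysis Share Lab's report demonstrates.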
What technologies Hacking Team was selling and working with. Through the leaked material we learned that Hacking Team sells Remote Control System (RCS) software to law enforcement and national security agencies. This can be described as 'offensive' rather than 'defensive' hacking: RCS software allows these agencies to target computers and mobile devices and install backdoors. Gaining insight into what technologies companies like Hacking Team are selling is usually difficult, if not impossible. This, however, is something that the next project focuses on.
3) Presumed hidden but in reality public, ICWATCH: ICWATCH is a project created by Transparency Toolkit, which provides a set of tools to collect data from various open data sources. ICWATCH is a database of an estimated 27,000 LinkedIn CVs of people working in the intelligence sector. This database can be used to find information about the intelligence community, surveillance programmes and other information that is presumed private but has been posted publicly via the professional networking platform LinkedIn.
Exposing the Invisible recently interviewed M.C. McGrath, the creator of both ICWATCH and Transparency Toolkit. In his own words he explains how this information was presumed hidden but, in reality, was public:
“ICWATCH is a searchable collection of currently just LinkedIn profiles of people involved in the intelligence community. LinkedIn profiles, because many people mention things about their work and their job history on LinkedIn, so they say, "Oh, I know how to use Microsoft Word and XKeyscore", just in their skills on their LinkedIn profile, and sometimes they also mention unknown code words and define them.
In ICWATCH we have quite a bit of data, about 27,000 profiles of people involved in the intelligence community, primarily the US intelligence community but also some people around the world. These range from people who are saying that they work as a contractor or maybe mention some interesting terms, to people who are listing tons of secret code words on their profiles, sometimes with helpful descriptions of what the code words are. We've collected them all in one place, and made software so that anyone can search through them to better understand surveillance programmes, or which companies help with which programmes, or the career paths of people in the intelligence industry. We want to understand both the details of the programmes themselves as well as the people involved. Institutions are made up of people, and being able to understand why people get involved and, if people leave the intelligence community, why they leave and what pushes them to do so, is important for understanding how we can reform mass surveillance.”
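As an illustration of the basic idea - not of Transparency Toolkit's actual software - the sketch below filters a handful of invented profile snippets for known code words. The profiles, names and skills here are all made up; only "XKeyscore" echoes the example McGrath himself gives:

```python
# Invented CV snippets; ICWATCH itself holds around 27,000 real profiles.
profiles = [
    {"name": "Analyst A", "skills": "Microsoft Word, XKeyscore, SIGINT analysis"},
    {"name": "Analyst B", "skills": "Project management, Excel"},
    {"name": "Analyst C", "skills": "SIGINT collection, database design"},
]

# Known code words to search for, compared case-insensitively.
code_words = {"xkeyscore", "sigint"}

def matches(profile, words):
    """Return the sorted code words that appear in a profile's skills text."""
    skills = profile["skills"].lower()
    return sorted(w for w in words if w in skills)

for p in profiles:
    hits = matches(p, code_words)
    if hits:
        print(p["name"], hits)
```

Collected at scale, hits like these are what make it possible to map which companies and careers touch which surveillance programmes.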
'Hiding in the Open', five short videos by Exposing the Invisible featuring M.C. McGrath
Explore the data on ICWATCH by looking at the ICWATCH database hosted by Wikileaks
Metadata Investigation: Inside Hacking Team by Share Lab
An interview with Mari Bastashevski, featured on Exposing the Invisible, the day after the data breach at Hacking Team
Using databases to find information
There is a lot of publicly available information that can be found in worldwide registries and international or national databases. The next example looks at how these databases can be used to find information that is available if you know where to look.
The Organised Crime and Corruption Reporting Project and the Investigative Dashboard
Many of us suspect that there is a separate, parallel, privileged hidden system that enables wrongdoers to not only benefit from their wrongdoing but also to move money freely around or hide their funds from taxes or from the public’s view. This hidden system was always thought to be an impenetrable network of complex connections between institutions and individuals using legal gaps, loopholes and liberal regulations, such as offshore company registers, bank accounts and the availability of anonymously owned entities and companies, to manoeuvre within it.
Regardless of how complex these systems are and what levels of anonymity are established and maintained, money leaves a trail. These trails are generated when funds move between people and institutions, and since this occurs within digital systems, it is possible to trace them back to those who send the money and those who receive it.
The Investigative Dashboard is an initiative of the Organised Crime and Corruption Reporting Project (OCCRP), an international network of investigators and journalists whose aim is to make business transparent and open and to expose crime. By 'following the money' through a range of investigative strategies and processes, the Organised Crime and Corruption Reporting Project is able to show how and where organised criminal networks and corrupt dictators hide their wealth. Paul Radu, who runs the OCCRP, says:
“Organised crime is very creative and is good at hiding itself; organised crime uses complex business structures and corporate structures, and there are intersections between governments and corporations too. There is always an interface between the world of organised crime and the real world. It is an underground activity but it must have a public interface because they involve people. We act at that interface. We see organised crime and corruption as a puzzle to be solved.“
Solving this puzzle usually requires a good deal of digging and fishing around through databases and records to peel back layers upon layers of fake companies that serve as fronts for criminals to secretly privatise their assets. Companies can be registered and owned in multiple locations around the world, and uncovering who the beneficiaries really are means exposing the details of every single company-within-a-company, a lot like matryoshkas, nested Russian dolls. Most offshore tax havens – the Cayman Islands, the state of Delaware in the US, the Bahamas, Panama, Switzerland etc. – are popular with the criminally wealthy because of the ease of setting up companies without having to provide too much paperwork. Many banks around the world have a KYC, or Know Your Customer, standard, which means that detailed information about customers must be collected; banks in offshore banking hubs, however, waive this standard. So, a corrupt president can put their kickbacks into an account for a fake company registered in the names of proxies (sometimes these are people whose identities are stolen and used without their knowledge, as fronts). Many of the banks in countries like Switzerland and the Bahamas are highly secretive and do not readily divulge information. A lot of investigative efforts have been stalled by the sheer difficulty of gaining access to information, so investigative journalists have to be extremely resourceful and inventive in following trails.
Recently, the Investigative Dashboard team was surprised when Panama, long known for its secrecy, opened up its company registry database. However, it quickly became clear that the information was not easy to access directly: you had to know the exact name of every front company to search the database. Working with hackers who were able to 'scrape' the database and re-index it, OCCRP added functionalities – such as searching by the name of a company director – which make it easier for investigators to find information.
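The re-indexing that made a scraped registry searchable by person can be pictured as building an inverted index over the scraped records. A minimal sketch in Python – the company names, directors and field layout below are invented for illustration, not the real registry's schema:

```python
from collections import defaultdict

# A few made-up records, shaped like rows scraped from a company
# registry that only lets you search by company name.
companies = [
    {"name": "Alpha Holdings S.A.", "directors": ["J. Perez", "M. Chen"]},
    {"name": "Beta Trading Corp.",  "directors": ["M. Chen"]},
    {"name": "Gamma Invest Ltd.",   "directors": ["A. Okafor"]},
]

# Re-index so the lookup runs the other way: from a person's name
# to every company they direct.
by_director = defaultdict(list)
for company in companies:
    for director in company["directors"]:
        by_director[director].append(company["name"])

# by_director["M. Chen"] now lists both companies she directs.
```

The hard work, of course, is in the scraping that produces such records and in cleaning up variant spellings of names; the index itself is the easy part.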
To access a list of worldwide registers, have a look here.
Frequently used by investigative reporters for diverse investigations, the Panama registry of companies is a great tool for journalists and activists interested in issues pertaining to corruption and tax avoidance.
LittleSis is a free database of who-knows-who of business and government. They define themselves as the grass-roots watchdog network connecting the dots between the world's most powerful people and organisations.
OpenCorporates is a database which aims to gather information on all the companies in the world. The database currently offers information on 50 million companies in 65 different jurisdictions. Information that can be found on OpenCorporates includes a company's incorporation date, its registered addresses and its registry page, as well as a list of directors and officers.
TheyRule is a website offering interactive visualisations of the biggest companies in the US, helping you see who has the power in each company and what the ties are between individuals at the top of corporations. TheyRule also provides data relating to various institutions and foundations, shedding light on who is hiding behind lobbies and think tanks in America.
Create a ‘map’ using the white space where you can place the company, institution or person you want to obtain information on. Explore the different options in the left hand menu to show the connections between companies, institutions, boards of directors and people. You can work on different visualisations at the same time. Individuals who sit as directors of more than one board are represented with a fat belly, which gets bigger according to the number of boards. For each ‘item’ (person, company and institution), you can find out what information is publicly available by clicking on ‘research’. TheyRule partners with organisations such as LittleSis, Corpwatch and Democracy Now! to provide in-depth and accurate data.
Our Currency Is Information, a film by Exposing the Invisible that features Paul Radu and OCCRP's work.
Treasure Islands: Tax Havens and the Men who Stole the World by Nicholas Shaxson.
Data dark zones: When you can't look at data directly
“...if you're interested in an industrial plant and you think that there are environmental crimes being committed there, you're going to have a very hard time turning up at the front door and knocking on the plant door and seeing if they'll let you come in and tell you whatever crimes they are doing. But what you can do is assume that if they're dealing with toxic chemicals and there's a good chance they have a bad safety record, so what you can do is go to the local fire department and ask if there are any documented incidences of a hazmat response. In other words, have there been any instances where you've been called to anything to do with hazardous waste. So you start to build up evidence around the thing that you are looking at when you can't look at it directly.
Trevor Paglen, co-author of Torture Taxis: On the Trail of the CIA's Rendition Flights.
Open data resources may have little to offer directly on many of the contentious global issues – state-sponsored violence, conflict, human rights violations, environmental degradation and resource transparency - particularly as they play out at the global level, or in transition or repressive parts of the world. In such places, it may be impossible or even dangerous to ask a local authority or a company to release data. Yet this has not stopped activists experimenting with these methods in 'data dark zones', for example, by not waiting for information to be released but instead finding it themselves, creating their own resources or working with leaked information.
In this section we look at examples of techniques for directly recording and collecting information. Often we know that data around an issue exists, but not how to collect it. 'Collecting' here refers to the direct recording of information, something that is central to projects that directly count instances of something, like rights violations, for example.
Here are some examples of groups and individuals who could be useful starting points when considering innovative ways of finding data.
Find leads in other places: Many different groups collect and publish data about the same thing, but do it in different ways with different approaches, standards and technologies. For example, governments publish information about companies in different ways: in a globalised world, this makes it difficult to track the activity of companies and the individuals associated with them. An interesting approach to solving this is to look at OpenCorporates which pulls together data about the registration and ownership of companies from around the world. OpenCorporates does the heavy lifting of making corporate information easier to access, meaning that others researching companies don't have to. The International Aid Transparency Initiative (IATI) does something similar in creating a standard that governments and international organisations can use to publish data about development aid spending, enabling it to be aggregated and compared.
In recent years, a highly secretive industry has grown in creating and selling technologies that can be used to intercept emails and website use, hack online user accounts and track their owner's behaviour and location through internet and mobile use. The risks that activists and journalists around the world face as a result of digital surveillance by repressive regimes have also grown, some might say in lockstep with the market for these technologies.
However, it has long been difficult to gather evidence of systematic connections that would help activists exert pressure on companies and governments to comply with human rights standards in the export and use of these technologies.
Researchers from Privacy International (PI), a human rights group based in the UK, managed to attend a number of surveillance industry conferences. By collecting many of the product marketing materials freely distributed at the ISS World conferences, they were able to identify which companies were offering what services. Privacy International and a consortium of activist and journalist groups released this information as The Spy Files (a similar set of information was also released by the Wall Street Journal as a searchable dataset called the Surveillance Catalogue).
Through further data-gathering activities, PI were able to obtain lists of the companies and government agencies that attended the same ISS conferences. They published this in the form of a Surveillance Who's Who, which gives leads to public agencies in over 100 countries that have shown an active interest in surveillance technologies. This data has been wired into other public data resources and services about public spending and company information. By publishing the data on an open platform, PI raise the issue and make it available for others to conduct analysis and investigation. Others then have the opportunity to fill in the gaps in the existing dataset, improving the resource for everyone interested in the issue.
Let rejection be your proof: Mari Bastashevski is an artist, researcher, writer and investigator. She focuses on issues of systemic failure, international conflict profiteering and the information vacuum that surrounds these issues. Bastashevski discusses using photography and the process of rejection in her work, enabling her subjects to define their own perimeter of secrecy, whether legally established or imaginary.
“Usually, I would either request permission to take a photograph in advance via e-mail, or during an interview with a company or a governmental institution. Usually, the permission would then be denied. This is a fairly standard system of request and rejection, where I become the requester and they have the power to play a rejecter. The rejection itself is very interesting and it varies from silence to an email very diligently composed by a public relations executive. The latter is especially true of Western European companies.
What I try to do next is to ask them where exactly the photograph is rejected and where the border of rejection ends, defining the distance. For me, this is the disruptive element that forces the requester and the rejecter out of our established positions. It also allows ‘the photographed’ to draw their own perimeter of secrecy, be it legally established or imaginary. This element of the work is very much a performance. One in which both the photographer and the photographed stumble and look a bit surprised.
This specific approach was inspired by a Swiss defence contracts broker, BT International, whom I met back in 2011.
At the time, I located the company’s office in a countryside setting surrounded by a rural idyll and decided I might as well ring the bell straight away. The director, who was there alone, asked me to take off my shoes and invited me in for tea. So we ended up sitting there drinking tea and we talked for a while. He answered some of my questions and he did not answer some of my other questions as he went into the traditional ideological discussion of, "Well, if we're not going to sell it, someone else will and there are bad guys," and all of these things that you hear over and over again. All the while I can see cows grazing right outside the window, and remember how this very small company is responsible for brokering a lot of very serious deals around the world and I am trying to figure out how to compute the two. And when I asked him whether I could take a photograph he of course said “No”, immediately explaining that I could only take a photograph from a legal distance outside the perimeter of the house, and that was the end of it, so I put on my shoes and photographed the cows.”
Riedbach, Switzerland. View from BT International head office, from the series State Business (Chapter I)
Extrapolate from unexpected sources: The Public Laboratory for Open Technology and Science (Public Lab) develops and shares open-source, do-it-yourself tools for communities to collect data about environmental pollution and contamination. In Brooklyn in the United States, Public Lab and its collaborators used balloon-mapping to identify zones of contamination in the Gowanus Canal. While the Gowanus Canal has been widely recognised as in need of a clean-up, and has state funds to do so, balloon-mapping allows local communities to monitor this process and collect 'shadow' data. In the Czech Republic, they supported activist groups to monitor illegal logging in the Sumava National Park. The most basic aerial balloon-mapping kits they have developed involve just a camera, a bottle, balloons and rubber bands. Aerial photography is usually restricted to governments with the technology to launch satellites, but this sort of 'grass-roots mapping' allows communities to influence how they name and own their territories. Coupled with more advanced technologies like near-infrared cameras and thermal photography, these simple tools can serve as powerful forms of data collection that can be used for investigation.
Another example of extrapolating from unexpected sources is the work of James Bridle and his project ‘Seamless Transitions’. Exposing the Invisible spoke to him recently about this project, and he describes how he found visuals to depict the deportation of migrants who had, for example, overstayed their visas – a process that happens late at night, using hidden infrastructure that was impossible to document. By combining a range of techniques such as first-hand accounts, aeroplane-spotting websites and working with an architectural studio to visualise these hidden spaces, he was able to generate a rough outline of the process.
“One of the things that I've noticed and that a lot of my work turns up is making the invisible visible. It could happen in a number of ways and one of the simplest ways is simply providing images where they didn't exist before. This is a thing that's going on but you don't see pictures of it in the newspapers because it happens within this kind of protected sphere. It happens within private space. One of the prime things they do is that they privatise things so it's not the government’s responsibility to tell you about it and they don't have to provide images of it. I wanted to fill in the gaps in that imagery and effectively use the same way of thinking about technology as I'd used in the investigation to do image-making. So I worked with people who tend to do architectural visualisation, who worked with architects, who produce plans for buildings and nice luxury apartments, who are very adept at rendering and making images of initially, imaginary un-built spaces.
But we did investigative work to get the floor plans and the planning documents and the eye-witness accounts and what few photographs there were of these places from various times, in order to build full 3D models of them, so that we could then essentially do tours of them. And we did that, not just for the airport terminal that I visited, this private terminal at Stansted airport, but also for the detention centre, where many of those people were held, which is again kind of privately-run space.”
A 3D render of the deportation centre from James Bridle's Seamless Transitions project
Undertaking new forms of measurement: In a discussion paper written in July 2015 entitled Democratising the Data Revolution, Jonathan Gray writes about recent projects that have undertaken new forms of measurement by counting what has not been counted. "Recently there have been several data journalism projects which highlight gaps in what is officially counted. The Migrant Files is an open database containing information on over 29,000 people who died on their way to Europe since 2000, collated from publicly available sources. It was created by a network of journalists (coordinated by J++) who were concerned that this data was not being systematically collected by European institutions. In a similar vein, ‘The Counted’ project by The Guardian records information about deaths in police custody in the US, explicitly in response to the lack of official data collection on this topic. Both projects featured a team of journalists and data experts who collected dispersed information from many sources, correlated and verified this information and then published it as a new body of evidence. This new body of evidence not only exposed the sheer numbers of people dying but it also highlighted the lack of a consistent system of cross-country monitoring and a total lack of accountability of existing institutions and their methodologies."
Another example of collecting dispersed information from many sources is the work of Thomas van Linge, a 19-year-old Dutch student, who maps out the territorial control of Iraq, Libya and Syria as it evolves and posts these maps to his Twitter account. The maps are shared with his 25,000 followers and are often cited by major news organisations as accurate depictions of who controls which areas in these countries. He usually creates them in Google Earth from sources gathered on social media platforms such as Twitter, Facebook and YouTube, as well as from personal contacts in the region. He estimates that he uses over 1,100 sources for his Syrian maps to verify claims of territorial control.
In an interview with Newsweek in June 2015 he said:
“I want to inform people mostly and show people the rebel dynamics in the country. I also want to inform journalists who want to go to the region which regions are definitely no-go zones, which regions are the most dangerous, and also to show strategic developments through time.”
He goes on to describe his motivations for creating these maps:
“I hadn’t really considered it at the time, but I was annoyed by other maps that didn’t make the distinction between rebels and ISIS groups of areas, which were still at the time intertwined.”
These five approaches are some of many that investigators are using to find or create data that is not outwardly available. In the next two chapters we will look at techniques for collecting this data and approaches to analysing it.
Image created by John Bumstead
This chapter introduces four ways to think about the data you have or have found, and how you can help others get the most out of it. The second section takes a practical approach to gaining access to data that is inaccessible with three tutorials.
The song ‘Tom's Diner’ by Suzanne Vega was the first song ever to be compressed into the .mp3 format in the early 90s. Karlheinz Brandenburg was based at the Fraunhofer Institute when he developed the .mp3 format, which dramatically improved our ability to share and store music on our computers.
The same institute developed a technology that collected evidence from hundreds of millions of scraps of paper hastily torn up by the Stasi after the fall of the Berlin Wall. The Stasi, more formally known as the Ministry for State Security, was the official state security service of the German Democratic Republic from 1950 to 1990, renowned as an extremely effective and repressive secret police agency. Notoriously passionate about the use of paper, the Stasi documented everything they learned about the people they kept tabs on. After the fall of the Berlin wall in 1989, the Stasi knew that the tide had turned and that they would be subject to investigation for spying on the population. They deployed poor-quality shredders to destroy the paper records they had amassed; when the shredders broke down, they ripped up the documents by hand and stuffed them into sacks. Before these sacks could be destroyed, the Stasi headquarters were surrounded and the sacks were acquired by the new federal authority on the Stasi archives.
Decades later, the Fraunhofer Institute invented a technology called the ePuzzler, which made it possible to match fragments and reconstruct entire Stasi documents. First, every single piece of torn paper had to be ironed out and then scanned. Data about the size, shape and notation on the paper are digitally recorded during the scan. The ePuzzler then uses a mathematical formula to match information about the size and shape of each piece against some six hundred million others.
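The Fraunhofer Institute has not published the ePuzzler's internals, but one ingredient of shape-based matching can be sketched in miniature. Suppose, purely for illustration, that each scanned fragment is reduced to a crude 'tear profile' of depths along its left and right edges (the numbers below are invented); two pieces fit when one edge mirrors the other:

```python
# Invented tear profiles: positive numbers are bumps, negative are notches.
fragments = {
    "A": {"right": [2, 5, 3, 4], "left": [0, 0, 0, 0]},
    "B": {"right": [1, 1, 2, 1], "left": [-2, -5, -3, -4]},
    "C": {"right": [0, 3, 0, 3], "left": [-1, -2, -1, -1]},
}

def fit_score(right_profile, left_profile):
    # A perfect tear match means the two profiles are mirror images:
    # every bump on one edge fills a notch on the other, so each pair
    # sums to zero. The score is the total leftover mismatch.
    return sum(abs(r + l) for r, l in zip(right_profile, left_profile))

def best_match(name):
    # The fragment whose left edge best complements `name`'s right edge.
    candidates = [(fit_score(fragments[name]["right"], other["left"]), label)
                  for label, other in fragments.items() if label != name]
    return min(candidates)[1]
```

The real system matches on far more features – paper size, colour, handwriting, typeface – across six hundred million pieces, which is where the mathematics and engineering become genuinely hard.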
Image showing unsorted fragments of various sizes. Source: BStU/Jüngert
The creation of ePuzzler is interesting for a number of reasons. In 2007, the Fraunhofer Institute faced the seemingly impossible task of matching millions of pieces of paper in order to investigate the Stasi and their activities, from 1950 all the way to 1989. Rather than relying on a technology that was already in existence, they created a new technology to address this challenge. This case study also demonstrates that while collecting information, you might not know when that information will become useful. When investigators were collecting these endless bags of shredded information in the 1990s, they could not have known that 20 years later, scientists would invent a technique that could automatically piece this information back together.
As digital technologies have made the sharing, storing and discovery of information more accessible, there are new possibilities for working with data. Nevertheless, the data we need to build an investigation is rarely organised in a way, or available in a format, that we can use immediately. We often have to search around the internet to find the data we need; at other times, it has already been gathered but we have to ask many different people for it.
Open data advocates argue that public bodies should not only release information and data with modern online habits in mind, but they should do it in a way that removes technical, financial and legal obstacles to any sort of re-use. In practice, this means designing methods and standards for releasing different sorts of information in ways that anticipate but don't preclude what people might want to do with it. These include making sure information is released in digital formats that can be used in commonly used desktop tools like spreadsheets, and using common standards to enable linking between datasets. This important technical work removes practical obstacles to the potential held within the data, helping to realise the action that the calls for access suggest is necessary.
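The first of those steps – releasing data in formats that common desktop tools can open – can be as simple as writing records out as CSV instead of burying them in a report. A minimal sketch in Python, with invented figures:

```python
import csv
import io

# Invented records standing in for data that would otherwise be
# published only as a table inside a PDF report.
records = [
    {"country": "Kenya", "year": 2014, "aid_usd": 2660000},
    {"country": "Nepal", "year": 2014, "aid_usd": 870000},
]

# Write the records as CSV: any spreadsheet tool can open this, and
# other datasets can be joined on the shared column names.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["country", "year", "aid_usd"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

Using a common column vocabulary (here, country and year) is what allows datasets released by different publishers to be aggregated and compared, which is the point of standards like IATI.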
This chapter is split into two parts.
The first part, Making data useful online, introduces four ways to think about the data you have or have found, and how you can help others get the most out of it.
The second section, How to access data when it's inaccessible, takes a practical approach to gaining access to data with three tutorials:
Making data useful online
Data can be a place to find new stories, particularly if you can create a way of showing raw data that enables the public to help ‘trawl’ through it for interesting things.
In 2011, Tactical Technology Collective interviewed Dr Sushant Sinha from Indian Kanoon on a project that demonstrates this.
“If you would have asked me in 2008 whether I was an activist I would say 'no'. I was a pure tech guy at that time. Now I think I have a role of providing free access to law in India.” Dr Sushant Sinha laughs quietly when we ask him if creating IndianKanoon.org, a free, daily updated, online search engine of 2 million of India's laws and court judgements, is activism. He sees his work as solving an annoying problem. “A lot of other Indian websites that try to provide legal information make no connection amongst the documents, so judgements don't refer to one another. You can't find a link to these judgements. As a result what happens is that people are confused by this complete jargon.”
Sinha, a software engineer with Yahoo! India in his day job, started taking an interest in law in 2005, spending time on the growing number of law blogs that appeared in India around that time. But he was unable to quickly find sources mentioned in the blogs or understand what a case was about. “The frustration was that I did not have the legal background. In 2005 or 2006 the Supreme Court of India started putting each judgement online. So I started reading them and I was like 'oh man, there is too much jargon'. But then an idea struck me. Let's suppose these people know that these sections are important, so why can't computers automatically discover it?”
To advance his own understanding of the law Sinha used his skills as a computer scientist to bring together around 30,000 judgements from the Supreme Court of India published on its official website. His computer programmes 'read' through each judgement, picking out citations of sections of the legal code and references to other cases the Supreme Court had decided. They then link them all together making the legal documents dramatically easier to search, browse through and understand.
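Sinha's actual programmes are far more sophisticated, but the core linking step – spotting citations in a judgement's text and connecting judgements to each other – can be sketched on a tiny invented corpus (the excerpts below are made up for illustration):

```python
import re
from collections import defaultdict

# A miniature, invented corpus; the real texts come from court websites.
judgements = {
    "Kesavananda v. State (1973)":
        "The petitioner relied on Section 21 and on Golak Nath v. State (1967).",
    "Golak Nath v. State (1967)":
        "The bench considered Section 21 and Section 13.",
}

section_pat = re.compile(r"Section\s+\d+")
case_names = list(judgements)

links = {}                            # judgement -> what it cites
sections_index = defaultdict(list)    # section -> judgements citing it
for name, text in judgements.items():
    cited_sections = sorted(set(section_pat.findall(text)))
    cited_cases = [other for other in case_names
                   if other != name and other in text]
    links[name] = {"sections": cited_sections, "cases": cited_cases}
    for section in cited_sections:
        sections_index[section].append(name)
```

With the index built, each citation can be rendered as a hyperlink, which is essentially what makes the documents browsable rather than walls of jargon.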
He didn't stop there. In early 2008 he decided to put his work online as a simple to use search engine. The High Courts of India also publish the outcomes of court cases each day, so Sinha began to include them in the search engine. His programmes – called 'crawlers' or 'scrapers' – automatically visit these websites each day to look for new material, downloading what they find and adding it to the search engine.
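In outline, a crawler of this kind fetches a listing page, pulls out the document links and keeps only those it has not seen before. A toy version using Python's standard-library HTML parser, with invented HTML standing in for a real court website:

```python
from html.parser import HTMLParser

# Invented listing page; a daily crawler would download the real one.
listing_html = """
<ul>
  <li><a href="/doc/101/">State v. Rao</a></li>
  <li><a href="/doc/102/">Union v. Mehta</a></li>
</ul>
"""

class LinkCollector(HTMLParser):
    """Collect hrefs that look like judgement pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/doc/"):
                self.links.append(href)

collector = LinkCollector()
collector.feed(listing_html)

# Only download what is new since the last visit.
already_seen = {"/doc/101/"}
new_docs = [href for href in collector.links if href not in already_seen]
```

Running this every day against the same page, and remembering what has already been downloaded, is all 'automatically visiting websites to look for new material' amounts to.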
Screenshot taken from the Indian Kanoon website
Not everyone has been happy. As the court judgements in Indian Kanoon are also indexed in Google and other search engines, many people involved in court cases are finding their names appearing in search engine results for the first time. Some have pleaded with Sinha to remove their names, effectively asking him to change the content of original, already public court documents. From the data supply side, Sinha notes that Indian Kanoon's focus on ease of use shows how the interests and capabilities of the IT companies running the court systems get in the way of a useful, responsive service for the end users. As for civil society groups working to improve access to information about the law, he believes they don't see it as their role to work on this sort of technical project: “Civic activists in India tend to follow this route: file a public interest litigation in the concerned court about it. They know what it takes. I have no idea about how to take that battle - so whatever sort of battle I can fight, I'm fighting.”
If you want to use these techniques in your area of work, below are four ways to think about the data you have or have found, and how you can help others get the most out of it too.
- Find a public service that really should be better, or try to create a completely new one if it doesn't yet exist: some of the first, most interesting and influential open data initiatives were created by people frustrated by a public service that wasn't working as well as it should have been. Two of the best examples are The Public Whip and They Work for You websites. Together, these create an ‘accessible’ version of the official transcript of the UK parliament. The developers of these websites were frustrated that it was not possible to see how Members of Parliament had voted because this data was buried in strange places in the official transcript. They applied a technique called web-scraping (which we discuss later in this chapter) to collect this information, and built a user-friendly website to display it. Work in the area of online parliamentary informatics has rapidly taken off globally in the last few years, as you can see from the long list of sites here.

Frequently used by investigative reporters for diverse investigations, the Panama Registry of Companies is a great tool for journalists and activists interested in issues pertaining to corruption and tax avoidance. With details on directors, registration and subscribers, it is very similar to the chamber of commerce in many countries – a place where you can check the names associated with (and occasionally ownership of) all companies and non-profits ever registered. The website allows you to browse through the database after you have registered (registration is free). However, to access the data you must know the name of the company you want information on. To access a list of worldwide registers, have a look here as a starting point.
To circumvent this issue, Dan O'Huiginn, a hacker working with the Organised Crime and Corruption Reporting Project, scraped the official database to create a new one that allows you to enter the name of a person and see what companies he or she is affiliated with. Paul Radu gives us the example of the Azerbaijani president's daughters – after entering their names in this search engine, he discovered that they owned several companies registered in Panama. The Investigative Dashboard has produced two video tutorials to help you navigate through both interfaces: click here for the one on the official Panama registry of companies website and here to learn more about using Dan O'Huiginn's scraped version.
- 'De-fragment' an area of knowledge by pulling it all together: Many different groups collect and publish data about the same thing, but do it in different ways with different approaches, standards and technologies. For example, different governments publish information about companies in different ways: in a globalised world, this makes it difficult to track the activity of companies and the individuals associated with them. An interesting approach to solving this is to use OpenCorporates which pulls together data about the registration and ownership of companies from around the world. OpenCorporates does the heavy lifting of making corporate information easier to access, meaning that others researching companies don't have to. As mentioned above, the International Aid Transparency Initiative (IATI) does something similar in creating a standard that governments and international organisations can use to publish data about development aid spending, enabling it to be aggregated and compared.
- Find and tell stories with data: the release of data through access to information laws, and technical innovation from the open data movement has also affected news and investigative journalism. Initiatives like Pro Publica have added to the existing skills of journalists by developing better technology tools for collecting, analysing and showing data. This means that sometimes the data is the story; at others, the publication of visualisations made from the data extends reader interest in a story through revealing different aspects of it or angles to it. Finally, data can be a place to find new stories, particularly if you can create a way of showing raw data that allows the public to help ‘trawl’ through it for interesting things.
- Publish information in ways that are native to the internet: advocacy groups are beginning to adapt to internet-native ways of publishing information online. The idea is to encourage others to use it by making it easy to search, explore, re-use and contextualise. Rather than thinking of your data as a table in a report, think of it as a service to others: what else could they do with it that you can't? The wave of open data portals, such as Open Data Kenya, goes even further by providing tools for mapping and quantitative analysis. The Open Knowledge Foundation has created a guide to realising this sort of technical openness.
Increasingly, online data sources are used not only by specialists, researchers, academics and journalists, but also by concerned members of the public who have surprised data publishers with the level of their curiosity and the massive efforts they are willing to undertake in investigating raw data in order to form their own opinions.
To learn a bit more about this way of thinking about data, see the resources below:
Agitagogo: a Civic Hacktivism Abecedary by Tony Bowden, containing tips on how to build public service websites that aren't dreadful.
The Open Data Manual: the Open Knowledge Foundation's guide to the social, technical and legal aspects of open data.
Data Driven Journalism – a handy resource to find out about technologies and journalism.
Wikileaks data journalism: how we handled the data – a practical look at the process used by the Guardian to turn a big dump of data into news stories.
How to access data when it's inaccessible
In many cases, data will not be as freely available and easy to re-use as it could be. Researching an issue or requesting information can result in stacks of paper or thousands of digital files. These can be overwhelming, difficult to make sense of quickly, and it can be hard to know how to proceed. Do you just start flicking through the documents with a pen and paper, or would a more systematic approach be better? What technologies could be helpful?
Activists and journalists have been collaborating with technologists on a range of potentially useful approaches to overcoming situations where the format gets in the way of the information. In this section, we suggest some practical starting points for collecting and working with data and look at three basic robust technologies:
1. Accessing information locked in PDF spreadsheets
Deep dive on scraping and parsing: reverse engineering a digital document to make the data in it more useful.
Most of us have come across information 'locked' in a PDF document. It is possible to unpick these documents and to ‘reverse engineer’ them to make their content easier to work with and analyse. This can be done using a technique called scraping and parsing. In the example below, we look at data produced by a single organisation in Zimbabwe, but the ideas and techniques are applicable to anywhere that a digital publication format gets in the way of using the data inside it. The idea applies equally to extracting data from a website.
The Zimbabwe Peace Project (ZPP), a Zimbabwe-based organisation, documents political violence, human rights violations and the politicised use of the emergency food distribution system. They have a nationwide network of field monitors who submit thousands of incident reports every month, covering both urban and rural Zimbabwe. Between 2004 and 2007, the ZPP released comprehensive reports detailing the violence occurring in the country. The reports are dense PDFs and Microsoft Word documents that are digests of incidents, unique in their comprehensiveness. As documents, they are also pretty inaccessible and their formats get in the way of seeing what actually happened and gauging how the situation has changed over those years. It is hard to do anything with the data, such as search, filter, count or plot it on maps as it is locked inside the PDFs. What can we do about this?
All documents are arranged in a particular, pre-defined way. Whether they are reports or web pages, they will have a structure that includes:
different types of data, such as text, numbers and dates.
text styles like headings, paragraphs and bullet points.
a predictable layout, such as a heading, a sub-heading, then two paragraphs, another heading, and so on.
Here's a single page from one of the ZPP's reports about political violence in Zimbabwe in 2007 (PDF).
The structure of the above page can be broken down as follows:
|   | How does it appear in the report? | What is it really? | What type of data is it? |
|---|-----------------------------------|--------------------|--------------------------|
| 1 | Northern Region | Heading 1 | Geographic area (Region) |
| 2 | Harare Metropolitan | Heading 2 | Geographic area (District) |
| 3 | Budiriro | Heading 3 | Geographic area (Constituency) |
| 4 | A date | Heading 4 | Date (of incident) |
| 5 | Paragraph | Text | Text describing an incident |
| 4 | A date | Heading 4 | Date (of incident) |
| 5 | Paragraph | Text | Text describing an incident |
This structure repeats itself across the full document. You can see a regular, predictable pattern in the layout if you zoom out of the report and look at 16 pages at once.
So there's lots of data in there, but we can't get at it. The report is very informative, containing the details of hundreds of incidents of politically-motivated violence. However, it has some limitations. For example, without going through the report and counting them yourself, it is impossible to find out what incidents happened on any specific day across Zimbabwe. This is because the information is not structured to make it easy for you to find this out. It is written in a narrative form, and is contained in a format that makes it hard to search.
To tackle this problem, the format of the information has to change to allow for more effective searching. Try to imagine this report as a spreadsheet.
| Geographic area (Region) | Geographic area (District) | Geographic area (Constituency) | Date of incident | Incident |
|--------------------------|----------------------------|--------------------------------|------------------|----------|
| Northern Region | Harare Metropolitan | Budiriro | 4 Sep 2007 | At Budiriro 2 flats, it is alleged that TS, an MDC youth, was assaulted by four Zimbabwe National Army soldiers for supporting the opposition party. |
| Northern Region | Harare Metropolitan | Budiriro | 9 Sep 2007 | In Budiriro 2, it is alleged that three youths, SR, EM and DN, were harassed and threatened by Zanu PF youths, accused of organising an MDC meeting. |
| Northern Region | Harare Metropolitan | Budiriro | 11 Sep 2007 | Along Willowvale Road, it is alleged that AM, a youth, who was criticising the ruling party President RGM in a commuter omnibus to town, was harassed and ordered to drop at a police road block by two police officers who were in the same commuter omnibus. |
A spreadsheet created in something like Open Office Calc or Microsoft Excel allows this information to be sorted, filtered and counted, which helps us explore it more easily.
However, making this spreadsheet from the original ZPP reports would require lots of cutting and pasting – costing us time that we don't have. So what can we do? If you can read it, a computer might be able to read it as well. Thankfully, documents that are created by computers can usually be ‘read’ by computers too. With a little technical work, a report like the one in our example can be turned from an inaccessible PDF into a searchable and sortable spreadsheet. This is a form of machine data conversion. Knowing how this works can change how you see a big pile of digital documents. The computer programs that are used to convert data in this way are called scraper-parsers. They grab data from one place (scraping) and turn it into what we want it to be by filtering (parsing) it.
Scraper-parsers are automatic, super-fast copying and pasting programs that follow the rules we give them. The computer program doesn't ‘read’ the report like we do; it looks for the structure of the document, which as we saw above is quite easy to identify. We can then tell it what to do based on the elements, styles and layouts it encounters. Using the ZPP reports, our aim is to create a spreadsheet of the violent incidents, including when and where they happened. We would give the scraper the following rules:
Rule 1: If you see a heading that is a) at the top of a page, b) in bold capitals, you shall assume this is a Geographical Area (Region) and print what you find in Column 1 of the spreadsheet.
Rule 2: After seeing a Geographical Area (Region), you shall assume that until you see another heading at the top of the page in bold capitals that is different from the previous one, you are looking at things that have happened in that geographical region.
Rule 3: Until then, whenever you see a paragraph of text that has one line on top of it, and one line beneath it, that is preceded by a date in the form ‘Day Month Year’, these are incidents that happened in this geographical region, so you will copy them to the column called ‘Incident’.
Once the rules are set, the scraper-parser can be run. It will go through this 100-page document very quickly, pulling out the data you have told it to. The scraper might not get it right the first time, and there will be errors. The point is that you can improve a scraper-parser, run it hundreds of times and check by hand what it has put in your spreadsheet, and it will still be faster than trying to re-type the content yourself.
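These rules translate almost directly into code. Below is a minimal sketch of such a scraper-parser in Python, assuming the report has already been converted to plain text (for instance with a tool like pdftotext) and using a simplified, hypothetical fragment in place of a real ZPP report:

```python
import csv
import re

# A hypothetical stand-in for text extracted from one of the reports:
# region headings are in capitals, and incidents follow a date line.
report_text = """NORTHERN REGION
4 September 2007
At Budiriro 2 flats, it is alleged that TS, an MDC youth, was assaulted.
9 September 2007
In Budiriro 2, it is alleged that three youths were harassed.
"""

DATE = re.compile(r"^\d{1,2} [A-Z][a-z]+ \d{4}$")  # e.g. "4 September 2007"

rows = []
region = None
date = None
for line in report_text.splitlines():
    line = line.strip()
    if not line:
        continue
    if line.isupper():          # Rule 1: capitals mark a new region heading
        region = line.title()   # Rule 2: it applies until the next heading
    elif DATE.match(line):      # Rule 3: a date line introduces an incident
        date = line
    elif region and date:
        rows.append([region, date, line])

# Copy the parsed data into a spreadsheet-friendly CSV file.
with open("incidents.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Region", "Date of incident", "Incident"])
    writer.writerows(rows)
```

A real report would need more rules (districts, constituencies, incidents spanning several paragraphs), but the improve-and-re-run loop described above stays the same.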
Scraper-parsers have to be written specially for each document because the rules will be different, though the task is the same. In most cases, however, this is not a major challenge for a programmer; the challenge for you is to understand that it is possible, and to clearly explain what you need!
Dull repetitive tasks are precisely what computers are made for. You might think that it is not worth writing a scraper-parser for a one-off task. However, what if you have hundreds of documents, all with the same format, all containing information you want? In the Zimbabwe example, there are 38 reports produced over nearly 10 years. Each is dense, and in total they contain data on over 25,000 incidents of political violence. The format gets in the way of being able to use this data.
A scraper-parser can:
Go through all 38 documents you tell it to, whether on your computer, or on the internet (scraper-parsers can browse the internet as well).
Pull out the data that you tell it to, based on the rules that you write for it.
Copy all that data into a single spreadsheet.
A scraper-parser can also:
Check the website where the ZPP publishes its reports each day and, if there is a new one, download it and email you to let you know, before adding it to the list of reports it ‘reads’ into your spreadsheet.
Include new columns for the date the report was published, and the page number where the incident was recorded in the report (so you can check the data has come across properly).
Change the format of every date for you e.g. from 27 September 2004 to 27/09/2004.
Automatically turn the spreadsheet into an online spreadsheet (like Google Spreadsheets) that can be shared freely online, and update it when data from a new report becomes available.
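To give a sense of how little code some of these steps take, the date conversion mentioned above can be sketched in a few lines of Python (assuming the dates appear in English 'Day Month Year' form):

```python
from datetime import datetime

def reformat_date(text):
    # Parse e.g. "27 September 2004" and re-emit it as "27/09/2004".
    return datetime.strptime(text, "%d %B %Y").strftime("%d/%m/%Y")

print(reformat_date("27 September 2004"))  # 27/09/2004
```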
Scraping and parsing can be technical, but if you are trying to extract data that is already organised in a table, then it is much easier and there are tools available that can help you. To unlock data from more complicated layouts, you may need to get a programmer involved. Below is a list of further resources that can help deepen your understanding of this technique so you can give it a try yourself:
School of Data offers various courses on scraping: how to scrape data from websites without using code, how to scrape data from tables in PDFs, and more advanced tutorials for programmers.
Scraperwiki is a tool to unlock data held in PDFs, tweets and websites. Great for technical and non-technical users alike.
Pro Publica produced a guide on how they used scrapers to collect data to show the connections between pharmaceutical companies and doctors in the US.
2. Turning websites into spreadsheets with web-scraping
Information useful for investigations can be found hidden within websites. Online databases are often spread across many pages, forcing you to click 'next', 'next', 'next' endlessly just to see their contents, let alone analyse them. When you find this information, you often want to collect it quickly, run queries on it, or see how it has changed over time. Doing this manually can take an off-puttingly large amount of time. However, it can be done automatically using a technique called web scraping, which lets us turn information stored on websites into more usable formats such as spreadsheets.
By running a script or piece of software against a website, you effectively scrape out the information you have deemed important and gain the ability to ask questions about it. Questions such as: when did prices rise in this emerging market? Who was the main donor for a certain political campaign?
By using web scraping you can:
- Collect content from a website;
- Transform the extracted content into spreadsheets;
- Check whether and how the content has been changed over time.
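To illustrate the idea without relying on any particular service, here is a minimal web scraper using only Python's standard library. The HTML fragment stands in for one downloaded page of an online database; against a real site you would fetch each 'next' page (e.g. with urllib) and feed it to the same parser:

```python
from html.parser import HTMLParser

# A hypothetical stand-in for one page of an online database; in practice
# you would download each page and feed it to the same parser.
page = """
<table>
  <tr><td>Donor A</td><td>10,000</td></tr>
  <tr><td>Donor B</td><td>25,000</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)  # [['Donor A', '10,000'], ['Donor B', '25,000']]
```

The resulting rows can be written straight into a spreadsheet, which is exactly the transformation described above.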
Examples of web-scraping
The Transparency Toolkit team extracted more than 27,000 resumés from the LinkedIn website to collect and analyse data on people working in the intelligence community.
Mostre.me: this website collected open data from the Brazilian Government's Department of Culture using web scraping tools, and turned the content into data visualisations to support public consultation on cultural projects funded by the Government, exposing their status of execution and ownership. The official Government website is obliged to make this information public, but it was unusable and had not been updated for more than 7 years. By scraping and analysing the data, an independent group of people enabled the website to reveal how public funding in the cultural area is being used in Brazil.
HURIDOCS created the tool Caselaw Analyzer to collect and analyse data from the European Council of Ministers' website. The project aimed to understand why the European Court of Human Rights marks certain cases as "important", and to find patterns in the actual enforcement of judgments.
There are a number of web-based services that offer free software for scraping websites. These services differ by operating system: Kimono.io, for example, is available to Mac and Linux users, while Import.io is only available to Windows users.
For open-source alternatives, try the scrapy.org tool, available for Linux, Mac OS and Windows. Scrapy is a good alternative, but requires a little more familiarity with the terminal; we recommend reading its documentation first.
The Scraperwiki tool is also a popular alternative, and is free for journalists.
Things to watch out for:
It's important to note that many web-scraping tools are hosted on third-party servers. This means that the information you gather could be stored in "another person's house". Read more on why this might be a problem in the Protect section of this Guide. It is also worth noting that, much as with information stored within PDFs, hiding information within websites can be a deliberate tactic used by those who do not want the information to be investigated more thoroughly.
Take care if you are working with sensitive information or issues, asking the following questions could be a helpful start:
- Could using this scraping service reveal the purpose of my investigation to others?
- Am I putting myself or others in danger by asking these questions?
The legality of web-scraping
We would suggest the following steps as a good place to start:
- Check if the content being scraped is copyrighted
- Be sure that by using scraping tools you will not interfere with the website's services and capacity
- Try not to gather sensitive user information
If you are unsure about the legality of your action, it is always worth checking first with a lawyer.
3. From piles of paper to drives of data
Paperwork is a fact of life whatever you do. Whilst this is changing, not all information that might be useful to us is 'born digital' or exists in a digital format. For a range of reasons, paper can still be a better solution for whoever is trying to capture or transfer information. Confronting a mountain of paper that you know or hope contains information relevant to the issue you are working on can be intimidating and discouraging. This section proposes some rules of thumb and a process to help you overcome such challenges.
In response to the terrorist attacks in New York on September 11, 2001, the US Government intensified its interrogation of those foreign nationals it suspected of involvement in terrorism. To do this, the Central Intelligence Agency (CIA) set up a programme of ‘extraordinary rendition’, through which its operatives apprehended people in one country and took them to countries where torture was routinely used, such as Egypt, for interrogation. The programme clearly violates a range of international Human Rights and humanitarian laws.
A decade later, Human Rights lawyers continue to seek redress for those people abducted, detained without due process and tortured. Over the years, they have identified the planes which transported prisoners, the dates and routes of flights, as well as the companies running them. They have pulled in data from many sources including national and international aviation bodies.
In a court in New York State in 2007, a legal battle related to payment for services rendered arose between Richmor Aviation, a company whose planes had been contracted for use in the rendition program, and Sportsflight Air Ltd, a small firm which had been involved in brokering some of Richmor's services for the government. The Human Rights organisation Reprieve learned of it in 2011, almost accidentally. Crofton Black, an investigator at Reprieve, says that the court transcript and discovery documentation from this case became a treasure trove of information about the extraordinary rendition programme:
“We were very struck by the level of detail in the documents. There was a stratum of information that hadn't really been publicly available before. What it shows you is a microcosm of the way that the program was running between 2002 and 2005. There were phone bills, lists of numbers that were called by the renditions crews during their missions. There were catering receipts, records of ground handling services, references to many different service providers in different countries who provided the logistical framework for the missions.”
The 1,700 pages of hard copy court documentation were couriered to Reprieve. To start making sense of the material, volunteers at Reprieve first scanned it in and made a PDF out of it. They then quickly skim-read it to identify the types of documents they had, bookmarking the key blocks of information. To help pull out the topics discussed in the material, a technologist used optical character recognition (OCR) on the material and created a searchable index of all the words used.
However, the most useful information contained in the invoices for services couldn't be picked up reliably using OCR, and had to be extracted by hand. Over a few weeks, Reprieve's team manually pulled this data out of the invoices into a spreadsheet. They made a first pass over the material, creating a simple data structure, which they then expanded to include more detailed information about different flights. By picking apart this paper trail, Reprieve's investigators pieced together dozens of trips, using the invoices to evidence where the plane had stopped and which companies had provided services to suspected rendition flights.
This data has served to fill holes in numerous different cases, and analysis of it has been made available to journalists and legal teams worldwide. “The million dollar question in all this stuff is which of these flights had a prisoner on it, and who was it? So, that's one thing these documents won't tell you, of course. But the spreadsheet is a fantastic analytical tool. If we hear about a prisoner who was transferred on a particular date, but they don't know where, we can look at that date and see if it matches anything in this,” says Black.
He has only one regret. “Optical character recognition is still quite poor. If these documents OCR'd properly then it would have been a different ball game from day one”, he explains. Looking at a sample of what OCR produced from the scanned documents, it is easy to see what he means:
u 1'I::CC:.1 ... eu. (>04t Ollicc Box 179
-OIdChlJthBITI. NewYoric. 'Z130
re:/epllane:(618) 794-9600 Nlghr:
(518) 794-7977 FAX:(61B/794-7437
It's often difficult to gauge the amount of time and effort it will take to bring an information dump like this into a form where its value can be seen, let alone exploited. At some point, working with the information in an ad-hoc way, by hand or using basic but well-understood technologies may become impractical. It may create a diminishing return over time, for example, if useful information wasn't pulled out of the source material first time round. On the other hand, the alternative approaches that experiment with emerging technologies (like OCR) or use a more systematic approach can seem difficult to justify: they may add costs, or seem like overkill for just a box of paper.
Mari Bastashevski is an artist, researcher, writer and investigator. We spoke to her in September 2015 about her work around issues of systemic failure and international conflict and one of her cases where she glued shreds of paper back together:
“In the case with the shredded documents, left behind by a Ukrainian oligarch, Kurchenko, after the fall of Yanukovych regime, the documents presented no informational value and I doubt they would ever lead to justice in a country where the judicial system is bankrupt. At the same time, these documents are incredibly interesting objects, historically and culturally. In shredding these, the individuals, removed from power, betrayed themselves: their act revealed guilt. The labour of the citizens into re-gluing these shreds back together on coloured paper is a graceful act of reclaiming agency. It may prove entirely futile, but it’s poetry.”
Whatever approach is taken, investigations of complex and concealed systems of Human Rights abuses are about adding layer upon layer of information from different sources. This example shows the importance to investigators of being able to quickly respond to the availability of new information resources, breaking down whatever form they come in and linking it authoritatively to what is already known. Digitisation is a key skill in this.
Digitising printed materials
Digitisation is the process of moving this information from analogue to digital formats, which can be analysed using computers. It is not a single process but a set of steps, which we will look at in turn.
Before you begin, get prepared:
Be clear about what you want to achieve and why: the decision to digitise reflects a balance of motives. One of the key reasons activists and investigative journalists digitise hard copy material is security. A digital archive can be duplicated and kept safe from deliberate destruction by adversaries, or degradation due to the rigours (temperature, humidity, light) of the environment they are working in. Other drivers include concerns about the sheer scale of the materials, both as a physical storage problem and a challenge to getting at useful information quickly.
Know what you're dealing with: do some work to ascertain the scale, shape and scope of the material you have. Do you have a room of paper documenting years of work, or a folder or two? Is it a one-off initiative, or something you'll have to do every day? Count the current number of physical paper sheets or images and the number of individual documents, creating a breakdown of the different sorts of documents you have. If you think that additional material will appear, try to anticipate how regularly and in what sorts of quantities before you start.
You should also thumb through the material and identify the different sorts of information in those documents that you think is likely to be important, as well as look for documents that might be missing. This will help you decide which information is a priority to pull out of the material and will guide the design of your data capture processes. This scoping work is critical to designing and organising the digitisation process: estimating how long it will take, how much investment in technologies and labour may be required, how much it will cost, and ultimately whether it's worth doing yourself, or at all. Digitisation may be better contracted out to a specialised company, though you will have to assess whether this is both secure and affordable.
- Test the water before leaping in: after deciding on the route you want to take, design a draft process and test it out on a small sample of the material you have. Such ‘dry runs’ expose and test your assumptions and ideas, and can help you identify problems that may be hard to correct later.
Step 1: Digital imaging of hard copy materials
This is the technical process of moving hard copy material into a digital format. There are a range of different aspects to this process:
Organisation: even if you only have a small amount of material, you should create a scan plan to quantify the amount of work. This lists out the hard copy that you have, and is used to decide which materials to scan and when, and should tell you what's been done and remains to be done. Scan plans can help manage your time, and ensure that you haven't forgotten anything.
Hardware: you will need a computer, and will have to obtain a scanner. Scanners designed for home use often don't cost much but are not designed for even moderately heavy, professional use. Ideally, the scanner you choose should have an automatic document feeder enabling it to scan loose sheets one after the other, and a duplex function so it can automatically copy both sides of a sheet of paper. Try to establish a scanner's duty cycle, which indicates how quickly it can scan and how long it can be used continually before disaster strikes. Where possible, try to test out scanners or cameras before you buy them: they may appear a good match, but may be tedious to use, or have a terrible build or usability flaw that only reveals itself during heavy use. These may include unreliable software, overheating, and badly designed feeders that jam or don't pick up sheets.
If you have a lot of books you need to scan, then it will be a painful process using a flatbed scanner and it may be worth searching your network for someone with a book scanner. Smartphone cameras are very high quality and a number of apps have been developed to scan documents, though they don't (yet) seem geared toward high-scale requirements. The project Memory of the World wrote an excellent book-scanning manual in 2014 that offers detailed instructions on how to digitise paper books into e-books.
Software: you will need driver and scanning software. Most off-the-shelf scanners are either ‘plug and play’ or come with driver software to install on your computer to control the scanner. You will also need software to manage the scanning and processing of the resulting digital images. There are commercial options such as Adobe Acrobat, and open source alternatives like XSANE and ScanTailor. These enable you to define the scan quality, which includes amongst other things DPI (dots per inch), resolution, colour and file format. ScanTips has excellent guidance about all of these topics.
Digital storage: After materials are made digital, they will need to be stored safely. You will have to consider how to keep the files safe from corruption or unauthorised modification on the digital storage media you are using. This means having a back-up plan and making sure that they are only accessible to the people who need to use them. Some media may need a large amount of storage space, so it is important to plan ahead to ensure that you don't run out of space, and that you have ample space for back-ups as well. Go to Chapter 8 of this Guide for more information.
Quality assurance and 'chain of custody': after a document has been scanned, there are three things you need to do. First, check that it accurately matches the original hard-copy. Second, process the scan to improve its quality and organise it in a way that fits your needs. For example, where you have scanned a double page spread into a single page, software like ScanTailor can split the image into two pages. Third, decide what to do with the original hard copy. Do you need to be able to show others that your digital versions are perfect copies of the originals? For example, in a legal process digital copies of documents may not be acceptable as evidence. In these cases, you will have to think about a digital 'chain of custody' that can be used to show how the physical and digital materials have been handled.
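One common way to support such a digital chain of custody is to record a cryptographic hash of every file as soon as it is scanned: if a file is later modified, even by one byte, its hash changes. A minimal sketch, assuming the scans sit in a local folder (the folder and naming here are hypothetical):

```python
import hashlib
from pathlib import Path

def fingerprint(folder):
    # Record a SHA-256 hash for every scanned file in the folder.
    # Keep the resulting list somewhere safe (signed, or printed and
    # dated) so you can later show the scans have not been altered.
    hashes = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            hashes[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes
```

Re-running fingerprint() later and comparing against the stored list shows immediately whether any file has been corrupted or tampered with.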
Step 2: Organising, indexing and contextualising digital files
After you scan in the hard copy, you will then have to organise and catalogue it digitally, making the material easier to find, sort and relate to other materials.
Organising raw files: scanned files will appear on your computer in image formats like .TIFF or perhaps .PDF. Most scanning software automatically gives each scan a filename, such as DSCR23453.TIFF – you should change these to a file-naming scheme that makes sense to you. Most digital files also contain technical metadata describing the size, creation date, date of last modification and so on. Some files also have special sorts of metadata. For example, pictures taken with smart phones or digital cameras with GPS devices may contain metadata about the location where the picture was created (software like ExifTool or MediaInfo can help you find this data). This automatically-created metadata is useful for organising digital files. File browsers such as Windows Explorer, or Nautilus on Linux, should be adequate for organising, filtering and searching large collections of digital files using this sort of technical metadata.
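As a sketch of what such a file-naming scheme might look like, the snippet below renames raw scans using the modification date from each file's technical metadata; the prefix and pattern are hypothetical, so adapt them to your own scheme:

```python
import datetime
from pathlib import Path

def rename_scans(folder, prefix="scan-batch"):
    # Turn DSCR23453.TIFF-style names into e.g. scan-batch_2015-09-14_001.tiff,
    # using each file's last-modification date from its technical metadata.
    for i, path in enumerate(sorted(Path(folder).glob("*.TIFF")), start=1):
        stamp = datetime.date.fromtimestamp(path.stat().st_mtime)
        path.rename(path.with_name(f"{prefix}_{stamp}_{i:03d}.tiff"))
```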
Cataloguing files: beyond managing the raw files themselves, you can also create new sorts of metadata that describe what is in a document or file. You have to define what sorts of metadata you think it is important to add. Whilst quite a heavy read, the Dublin Core site has a thorough description of what metadata is, and has some useful ideas you can adapt to your own needs. Metadata could include terms that indicate who created the material, what it's about, the events that it relates to, people mentioned in it and so on. This sort of data is particularly important for visual material like videos and images, which often contain a small snapshot that can't really be understood without knowing the surrounding context. WITNESS has written a thorough guide to managing and cataloguing video and audio materials about Human Rights.
Step 3: Extracting content from the materials
Whilst digitisation has extensive value, for investigators its purpose is often to better understand the information within the material itself.
Automated content extraction: a document that has been scanned in remains an image, which means the text in it cannot easily be 'read' in the same way that a document created in a word processor can. It is possible to use Optical Character Recognition (OCR) (such as Tesseract or some of the tools built into commercial imaging software) to find and extract text from images. However, prepare yourself for disappointment: even where OCR software is used on scans of typed material that is plainly laid out, it is fiddly to use, erratic in its output and always requires a human eye to ensure accuracy.
'Old school' content extraction: realistically, extracting content from the materials is likely to be a manual process, which means finding, reading and hand-typing actual information contained in a digital document and entering it into something like a database or spreadsheet.
Quick notes on digitising other media
In this section, we have focused on the challenges of paper. However, video and audio tapes, photographs and maps also regularly appear as resources in most kinds of investigative work. Here are some tips and links should you need to digitise these sorts of media:
Video and audio: Physical media – like tapes or DVDs – throw up three particular challenges: time-based media is generally more difficult to manage, digital versions require a lot more storage space, and preserving original physical versions over the long term is complex. When digitising, try to capture at the highest quality possible. When digitising older or damaged media, it may be better to work with a specialised third party you can trust rather than trying to do this yourself. For a brief overview of digitising video, the TAPE project has some useful guidance and resources. For a very detailed practical guide, see the guide to digitising moving images in the Consortium of Academic and Research Libraries in Illinois (CARLI) Guidelines to the Creation of Digital Collections.
Video analysis tool: Robert Ochshorn created InterLace <video> navigation, examples of which you can see here when used for the web documentary ‘Montage Interdit’ by Eyal Sivan and for the project ‘Right to the City’. See his other experiments here and Robert's personal website here. The code for Robert's projects can be found on his GitHub account.
Maps: Moving printed maps into a digital format first requires the map to be scanned. Depending on the size of the map, you may have to scan parts of it using a flatbed scanner and create a range of smaller tiles. The alternative is to locate a wide-format scanner. After creating a digital image of your map, it will need to be geo-referenced and rectified. This means finding where your map sits on an existing, accurate digital map such as OpenStreetMap. Finally, the scan can be uploaded to an online mapping service so it can be viewed online. MapWarper (a tutorial video) is an online geo-referencing system that does this. If you don't want to upload your materials to a server, geo-referencing of maps can also be done using desktop GIS software such as QGIS (here's a basic guide to geo-referencing, and a tutorial video).
- The Quartz guide to bad data: An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
Image created by John Bumstead
In this chapter, we focus on how to ready the data you have found and collected for use in your investigation. We will introduce techniques around structuring, categorising and standardising your data and examine the five signs that you might have grown out of your spreadsheet and require the properties of a database. Once we have introduced these techniques, we will look at how to join the dots and tell stories with data.
Activists, investigators, journalists and artists need to find data and join the dots together themselves: only rarely will the complete picture land on their desks. In a recent interview with artist, researcher, writer and investigator Mari Bastashevski, Exposing the Invisible asked how she contextualises gigabytes of data:
“Whatever it is that I’m doing with data I’d always bring it back into the analogue world, and then take the analogue back online. There are two primary categories of documents that are vital to my work process: useful (the footnote) and useless (the object). The first informs and maps the narrative, the second is the raw material, the fertiliser to the narrative.”
Structuring, categorising and standardising your data
Investigators need to seize the opportunities that are presented and often have to work with very limited resources. To store and make sense of the data they collect, they often use cheap, ubiquitous software like the spreadsheet programs that come pre-installed on most computers.
Collecting data to monitor specific activities or to document events can seem exciting at first, especially if you have a vision of how this information may contribute to a debate. In order for it to be useful however, it is essential that the data is well-organised and designed so that it can be pulled together, analysed and presented in a meaningful way. You will need to know about things which may at first seem challenging: how to standardise information, how to enter information so that it can be collated later and how to work with data in a group. We discuss each of these areas in the sections below.
Data has to be entered consistently. If it is not, then it is harder to search, count, sort and filter accurately. Here's a simple example: the same country entered as 'Cambodia' in one row and 'cambodia' in another.
The problem here is easy to spot, but it is one that is recreated daily in one form or another when data is collected. One way of reducing these errors is to standardise how data is entered. This means making choices about how it can be represented consistently. Think of all the ways you can describe something; it is a process we do quite naturally, and there are many different, equally plausible and accurate ways to do it.
- Dates and times: Thursday November 1, 01 November 1976, November 1st 1976, 19761101 are all ways of representing the same date.
- Names: The naming of people and things is very complicated and varies across different geographical areas and cultures. Do you use 'United Nations' or ‘UN’? Do you use a person's first and second names in different columns or in the same column? Is there a commonly accepted naming protocol for surnames?
- Places: The location of a thing or an event is commonly recorded data. But how specific do you need to be when describing geographical data? You can be precise, using latitude and longitude; or general, using a country's administrative geography (town, city, district), electoral geography (ward, constituency) or operational geographies, such as the area covered by a police station.
The key challenge of standardising data is to make a choice and then stick to it. It will save an enormous amount of time and frustration.
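Once a standard is chosen, enforcing it can be partially automated. Here is a minimal sketch in Python that collapses several of the date formats mentioned above into one canonical ISO form; the function name and the list of accepted formats are illustrative assumptions, not part of the guide:

```python
from datetime import datetime
import re

# The formats we choose to accept -- an illustrative, not exhaustive, list
FORMATS = ["%d %B %Y", "%B %d %Y", "%Y%m%d", "%d/%m/%Y"]

def normalise_date(raw):
    """Return a date string in one agreed standard: ISO 8601 (YYYY-MM-DD)."""
    # Strip ordinal suffixes ('1st' -> '1') and stray commas before parsing
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", raw.strip()).replace(",", "")
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")
```

With this, '01 November 1976', 'November 1st 1976' and '19761101' all become '1976-11-01', so sorting and filtering behave predictably.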
The next challenge is to be clear about how, in practice, to apply standards to the data you are collecting. For example: our source of information tells us that 600 people attended a demonstration and we want to create an entry in a spreadsheet. We have categories for 'small', 'medium' and 'large'. How do we decide which term best describes the size of the demonstration?
When you look at your spreadsheet you need to be able to know that every time you see a demonstration described as 'small' it means the same thing. Design a set of rules to let everyone working on the data know that:
- Small = between 0 and 99 participants
- Medium = between 100 and 499 participants
- Large = between 500 and 999 participants
Everyone entering the data needs to follow the same rules each time.
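Rules like these are easiest to follow when they live in one shared place rather than in each person's head. A sketch of the rules above as a single Python function (the 'very large' fallback is our assumption, since the rules above stop at 999):

```python
def demonstration_size(participants):
    """Apply the agreed size rules to a participant count."""
    if participants < 0:
        raise ValueError("participant count cannot be negative")
    if participants <= 99:
        return "small"
    if participants <= 499:
        return "medium"
    if participants <= 999:
        return "large"
    return "very large"  # assumption: the agreed rules stop at 999
```

A demonstration of 600 people is then always recorded as 'large', no matter who enters the data.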
The challenge is tougher with certain types of data that involve evaluating something and making a judgement about it. With more complicated issues that don't break down to a set of numbers, you need to find 'baskets' in which to fit a variety of different sorts of factual information. You need to be sure that simplifying them will still be useful to you later.
Imagine that field monitors for a human rights organisation have interviewed a victim of serious physical mistreatment by the police, and that we have categories for 'Torture', 'Inhuman and Degrading Treatment' and 'Grievous Bodily Harm'. What term best describes what happened? In this case, these terms have legal meanings in international and domestic law. To increase the consistency with which the terms are applied, you could develop a guide sheet explaining each term and each type of situation, setting out the information and evidence required to differentiate between them, and providing examples.
If you want to compare your data to others, consider whether they have used the same sorts of data, and whether they have applied the same rules to their data. This will be covered later in Working together and sticking to standards and structures. This is a serious concern: projects can fail because different country groups collect data differently, making regional or global comparisons a waste of time and resulting in the need to start the initiative from scratch.
Standardising your data against an external resource is also useful. For example, if place names in your data can be recognised by Google Maps (using an automated technique called geo-coding), it is much simpler to create a map in that service. Deciding too late that you want to make a Google Map and then having to go back through all your data and re-enter place names that Google Maps does recognise can be a tedious exercise.
Another form of standardisation relates to the structure of how you record your data. For example, a good rule of thumb when entering data is to put one piece of data in one field or cell. Then your spreadsheet can sort and filter it easily for you. Here are some examples:
Scenario 1: A human rights organisation documents where people were harassed by police in Phnom Penh, Cambodia
Problematic data entry:
| Date and Time | Place |
| --- | --- |
| 01/11/1976 at four thirty in the morning | At the lake in Phnom Penh |
Better data entry:
| Date | Time | City | Location |
| --- | --- | --- | --- |
| 11/1/76 | 04:30 | Phnom Penh | Lake Boeung Kak |
Scenario 2: A research organisation documents the gender of detainees at prisons
Problematic data entry:
| Facility Name | Demographics |
| --- | --- |
| Alcatraz Island | Adult males (DAM), adult females (DAF) |
Better data entry:
| Facility Name | Demographic 1 | Demographic 2 |
| --- | --- | --- |
| Alcatraz Island | Adult males (DAM) | Adult females (DAF) |
The problem in Scenario 2 is not solved by adding another column of data. It would be better to create a new unique category called “DAMF”, to be used when a facility holds both adult male and adult female prisoners.
The structure of your data also affects your ability to count different things in it. An issue often experienced by spreadsheet users is that they structure their data around the wrong thing. For example:
| Customer | Order |
| --- | --- |
| Diana | Burger, french fries |
| Philip | French fries, salad |
Here there is more than one sort of data in each cell. This may allow you to count how many customers you had, but it makes it harder to tell how many portions of french fries you sold, or how many orders were placed. There are only two entries here but what if there were thousands each week?
A better way to organise this would be one dish per row:

| Customer | Dish |
| --- | --- |
| Diana | Burger |
| Diana | French fries |
| Philip | French fries |
| Philip | Salad |
In geeky language, these sorts of issues are all about a concept called normalisation. They are very common and reflect the difficulty of trying to squash quite complicated information into a single table and keep it usable. Restructuring the data in a normalised way initially means more work and makes the table less readable, but it enables the spreadsheet to count and rearrange the data, allowing proper analysis later. With the first entry method, you are still unable to answer some important questions, such as the overall number of orders that were made. Where there are big problems of 'normalisation', it might be time to move the data to a different sort of tool, like a database.
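With one dish per row, the questions the flattened table could not answer become one-liners. A sketch using only Python's standard library, with the order data mirroring the restaurant example above:

```python
from collections import Counter

# One row per dish ordered -- the 'normalised' structure
orders = [
    ("Diana", "Burger"),
    ("Diana", "French fries"),
    ("Philip", "French fries"),
    ("Philip", "Salad"),
]

portions = Counter(dish for _, dish in orders)    # portions sold per dish
customers = {customer for customer, _ in orders}  # distinct customers
total_orders = len(orders)                        # total orders placed
```

Here `portions["French fries"]` is 2, a count that was awkward to extract when both dishes sat in one cell.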
Thinking through the structure and the standards of the information before you start will be of great benefit later on. By standardising the way you enter data, you have a better chance of spotting where connections are made and where relationships and patterns exist. By structuring your data in this way, you ensure that you do not overlook opportunities for useful analysis.
- Knight Digital Media Centre's excellent guidance and tips for working with spreadsheets.
- Intermediate Data Analysis for Human Rights, by Herb Spirer. Though the context is Human Rights, the learning points are generally applicable to a range of different topics.
- HURIDOCS Events Standard Formats and Microthesauri. The most comprehensive explanation of shaping and structuring information about Human Rights violations and a good general overview of informatics concepts and data systems.
- Follow the Money by the Centre for Responsive Politics and Larry Makinson. An excellent guide to digging into political campaign finance in the United States, with a guide to collecting and standardising data.
- Also from Knight Digital Media Centre: a tutorial on 'cleaning' data, which involves finding and fixing problems with data entry, standardisation and structure.
Working together and sticking to standards and structures
Working in a team to manage data can increase a group's ability to take on a project which may otherwise prove too unwieldy or too time-consuming. It can also increase the value of the data by putting it in the hands of more people. However, working in a group can also add to the complexity of the work and can increase data errors. It also has an effect on the privacy and confidentiality of information, requiring you to consider who has access and how to safely transfer files. Here are some tips about where errors can occur and some ideas for detecting and mitigating them.
Tracking data entry errors in teams
Everyone makes errors, even NASA, and there are hundreds of ways that errors can be introduced into spreadsheets. Data management can be mundane and repetitive. The more people enter or use data in a spreadsheet, the higher the likelihood of errors being introduced. You can create simple processes to identify basic errors in data entry. Here are some examples:
- If one of your fields contains dates, sort it to show the earliest dates and check whether any fall in the distant past (for example, the year 201 instead of the year 2011).
- If you are using a set of standard terms in a cell, like country names, people working on the data may not enter them consistently. For example, a user might make a typing error, entering 'cambodia' rather than 'Cambodia'. Most spreadsheets can list the unique values contained in any column: the two spellings will be treated as different values, so you can see that an error has been made.
- If every row of data should have a piece of information in it, an empty cell may indicate that someone has forgotten to enter a piece of data. You can ask the spreadsheet to count any empty cells in a row and highlight the row in red if it finds one.
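The same three checks can also be run outside the spreadsheet, for example on an exported copy of the data. A minimal Python sketch (the sample rows and field names are invented for illustration):

```python
# Invented sample data: the second row contains all three kinds of error
rows = [
    {"date": "2011-05-02", "country": "Cambodia", "place": "Phnom Penh"},
    {"date": "0201-05-02", "country": "cambodia", "place": ""},
]

# 1. Dates in the distant past (the year 201 typed instead of 2011)
suspicious_dates = [r["date"] for r in rows if r["date"] < "1900"]

# 2. Unique values: 'cambodia' and 'Cambodia' show up separately
unique_countries = sorted({r["country"] for r in rows})

# 3. Empty cells that should contain data
incomplete_rows = [r for r in rows if any(v == "" for v in r.values())]
```

Running checks like these regularly, rather than once at the end, keeps errors from accumulating as more people touch the data.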
Growing out of spreadsheets
It makes perfect sense that a large number of activists use spreadsheets to organise data. But far fewer consider switching to databases even when the problems they face using spreadsheets become more apparent.
Five signs you might have grown out of your spreadsheet:
- You start colour coding things in the spreadsheet and have created little 'hacks' (like adding 'AAA' or '!!!!' to a row of data to ensure it appears at the top) to find data.
- You scroll around a lot to find and edit information, or perhaps you have bought a bigger computer monitor so you can see more data on screen.
- Different people need to enter data into the spreadsheet, so you spend time emailing it around and copying-and-pasting data into a 'master' spreadsheet.
- You regularly have to reformat data to fit the needs of different tools that make charts, maps or graphs.
- You create multiple spreadsheets to keep count of data in other spreadsheets.
If you are doing any of the above, it is time to start thinking of a different type of tool.
Spreadsheets are a great 'Do It Yourself' data tool, widely used to record, analyse and create simple visualisations of data. They were designed as digital ledgers for book-keeping and accounting, but the grid format and ability to re-arrange data means they are also suitable for other uses, to an extent. Nearly everyone who can use a computer can 'sketch' with the simple and intuitive interface, piecing together columns and rows to create a basic model of some issue or thing they want to record data about. Spreadsheets don't require much technical knowledge to get started and come installed on most computers, so you can get up and running quickly.
One of the large appeals – and perhaps simultaneously a pitfall - of using a spreadsheet is that it can be made to look like a written document. A spreadsheet can be given a beginning, an end, a title, some authoring information and a date of publication. It can be constructed like a narrative, containing a mix of numbers and text, having elements of place, time, protagonists, locations, costs, consequences and outcomes. Structuring data in this way – by intuitive, narrative and visual logic – can work well for simple projects, but if your initiative grows or you want to use data in different ways, problems will soon emerge.
In this sense, a spreadsheet is a compromise tool: the method of storing information is the same as the means of looking at and working with that information. At some point, these two needs can't be reconciled, and one gets in the way of the other. Making data legible to the eye in a spreadsheet means making it far less useful analytically; the reverse makes the data largely unreadable and hence, less useful.
A database, however, separates the two: the way data is stored has far less influence over how it can be displayed. In fact, the way data is stored is often completely hidden from the user, enabling abstract, complex ways of storing data that in effect gives the user more power over it. A key benefit of a database is the ability to feature multiple tables of data and the technology to stitch them together to retrieve answers to specific questions.
This is how data might look in a database rather than a spreadsheet:
| Customer Name | Customer ID |
| --- | --- |
We also wanted to know who took the order:
And in what booth:
Behind the scenes, we can tell the database how these sorts of information are related. We can then ask it to create another table that combines data from 'Customers', 'Dishes', 'Waiter' and ‘Booth’, in addition to other information we need to know about an order:
| Customer ID | Dish Code | Waiter | Booth | Time | Order Number |
| --- | --- | --- | --- | --- | --- |
This lets us see, for instance, how many orders Benito, or any waiter, took. Freed from the need to be legible to the eye, the complexity has increased dramatically and it is clearly hard to track all the different sorts of data in the table by eye alone. In this example, there are still only two customers, sitting at two different tables, ordering three different dishes. Imagine trying to manage this data if there were hundreds of customers every day, ordering from a large menu. Databases have better powers of storage and retrieval of data. Using something called Structured Query Language (SQL), we can ask the database questions. For example, we can ask it to tell us:
- How many orders were placed in a day, or week or month?
- How many portions of each dish were sold, and during which parts of the day?
- The average size of each dining party.
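Questions like these map directly onto SQL. A sketch using Python's built-in sqlite3 module; the column names follow the order table above, but the table name, dish codes and sample rows (including the waiter 'Aretha') are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (
    order_number INTEGER, customer_id INTEGER, dish_code TEXT,
    waiter TEXT, booth INTEGER, time TEXT
);
INSERT INTO orders VALUES
    (1, 1, 'BRG', 'Benito', 4, '12:05'),
    (2, 1, 'FRF', 'Benito', 4, '12:05'),
    (3, 2, 'FRF', 'Aretha', 7, '12:30'),
    (4, 2, 'SLD', 'Aretha', 7, '12:30');
""")

# How many orders did each waiter take?
per_waiter = db.execute(
    "SELECT waiter, COUNT(*) FROM orders GROUP BY waiter ORDER BY waiter"
).fetchall()

# How many portions of each dish were sold?
per_dish = db.execute(
    "SELECT dish_code, COUNT(*) FROM orders GROUP BY dish_code"
).fetchall()
```

The same stored data answers both questions without any restructuring, which is exactly the flexibility the spreadsheet struggled to provide.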
The database allows for a flexibility in how we can use the data that is harder to achieve with the spreadsheet. This flexibility creates opportunities for you and your audiences to use, present, publish and access data in ways that can serve your campaigning aims.
Moving work from a spreadsheet to a database is a challenge at many levels:
- The world of databases is full of different technical choices: database platforms, programming languages, interface types and so on that can be very intimidating for newcomers.
- It is unlikely that you will be able to do it yourself. You will need to work with technical people like information architects, programmers and interaction designers. Knowing who to trust, and what to expect from these people is hard, and getting it wrong can be costly.
- You need to ensure a database solves the right problem in your work and that you have the resources to use it sustainably.
- But perhaps most importantly, moving from a spreadsheet to a database represents a huge attitudinal shift away from making do with the things in front of you, and challenging established ways of working. For activists and journalists who make data central to their work, the question is: how much do the limitations of the tool waste your time, under-use the data or hold you back?
If you are interested in this area and want to move your ideas forward, the resources below contain more discussion about databases, how they are created and what it takes for an organisation to commission them:
- LASA Knowledge Base: Databases (2002, updated 2006)
- Should Nonprofit Agencies Build or Buy a Database? by Tom Battin, 2002
- The Mythical Man-Month: Essays on Software Engineering, Frederick P. Brooks, 1995
Telling stories with data
Combining narrative with facts to trace a storyline through data is difficult. Below is an extract from our book Visualising Information for Advocacy, which can be read online here, that looks at how to tell stories with data.
Many advocates get nervous at the very idea of telling stories with data. It goes against the principles of working empirically, and we have to be careful not to fall into the trap of shoehorning facts into a pre-defined narrative. Creating a narrative for a campaign from data is a careful balancing act that involves working continually on four fronts at once:
- What is the point? What is it that we want the audience to understand and why?
- Work outwards from the data: be clear about what the data tells you. Consider whether the data needs to be simplified, contextualised or completed with other data to make your point.
- Design our information: how will we bring our story together in rough data? How can we frame it in succinct and compelling ways without misleading or over-generalising?
- Finding visual stories: what visual devices can be used to present the information in an engaging way? How can the visual design help organise and give meaning to the data?
Whether you are telling stories directly with your data or just trying to understand the data you have, data visualisation tools can be extremely helpful. When you are working with large amounts of information, being able to use data visualisation tools can help you to find and understand the stories in your data.
In analysing the social networks surrounding ‘La Familia Michoacana’ – a drug cartel based in the Mexican state of Michoacán – Eduardo Salcedo-Albaran and Luis Jorge Garay-Salamanca used visualisation tools to map out the nodes in these networks and their connections. The data is based on the statements of witnesses. By examining judicial files, the team were able to pull out the details of individuals and their relationships, and this enabled them to trace the extensive social network interacting with La Familia Michoacana. The resulting spherical network map shows the relationships between ‘narco-traffickers’ (mostly the bigger and more central nodes), and how these central nodes are connected to public servants and politicians. The map details 284 agents by name and depicts 880 social relationships.
Network graph of ‘La Familia Michoacana' by Eduardo Salcedo-Albaran
The result is a complex overview of the high levels of bribery and coercion, and the density of the network that allows the cartel to function. Visual representation serves in this example as an effective way to use the data to tell a story. It gives us an impression of the strength of the cartel and how it works, and also allows those involved in the study to get a bird’s-eye view.
- Visualising Information for Advocacy, written by Tactical Technology Collective; the PDF is available here.
Image created by John Bumstead
If everything is a network, nothing is a network
By Mushon Zer-Aviv
Let’s take a sheet of paper, draw a few points (we’ll call them nodes), connect them with lines (we’ll call them edges) and there we go, we have ourselves a network. Right? Well, yes. And no. I mean, that’s not the whole story.
Stuck in traffic
A few years ago when I designed the maps for Waze Mobile, the GPS navigation company, we faced a complicated challenge. Essentially a street map is a network, with every junction being a node, and the roads being the edges. So, much like the network we’ve just drawn, the map did a pretty good job of visualising the geographical layout of nodes and edges but this was not enough. While the main purpose of the app is to tell you where to go, it also tries to visualise the traffic around you to better inform you of the traffic jams you have avoided and to prepare you for those you will have to drive through. As soon as we start asking what’s happening on the roads between every junction - the specific traffic on that segment (we’ll call that the flow), things start getting complicated, fast.
Traffic is a very dynamic type of flow, it can be light, it can slow down, it can be heavy and it can come to a complete standstill. Some roads are two-way while others flow only in one direction. We had to indicate which directions are drivable as well as visualising the traffic travelling in each of these directions. But wait, there’s more. What if the road has multiple lanes and the traffic jam is only for those turning left, while we want to turn right and our lane is clear? And what if we're travelling on a very long road and the traffic slows down only on a limited segment?
It gets even trickier. On Earth Day in 1990, New York City's Transportation Commissioner decided to close 42nd Street, a major road connecting Grand Central Train Station and Times Square. "Many predicted it would be doomsday," Commissioner Lucius J. Riccio told the New York Times. But surprisingly, not only did closing the street not result in bad traffic, it actually improved it. New York City drivers would usually flock to 42nd Street, the widest street around, slowing traffic for everybody while the smaller streets around it remained largely empty. By taking the 42nd Street option off the table, the flow of traffic actually improved. This surprising result could be attributed to what network theory calls Braess's Paradox, which states that adding edges to a network will not necessarily result in a better flow and may sometimes even lead to congestion.
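The textbook version of Braess's Paradox can be worked through with simple arithmetic. In the standard teaching example (the numbers below are that example, not Midtown data), 4,000 drivers travel from Start to End over two routes; each route has one fixed 45-minute segment and one congestible segment that takes N/100 minutes when N drivers use it:

```python
drivers = 4000

# Without the shortcut: drivers split evenly across the two routes,
# so each congestible segment carries 2,000 cars
on_each_route = drivers // 2
time_before = on_each_route / 100 + 45          # 20 + 45 = 65 minutes

# Add a zero-cost shortcut joining the two congestible segments.
# Chaining them is now always individually faster, so every driver
# takes it -- and both congestible segments carry all 4,000 cars
time_after = drivers / 100 + 0 + drivers / 100  # 40 + 0 + 40 = 80 minutes

assert time_after > time_before  # adding a road made every trip slower
```

Each driver is behaving rationally at every step, yet the extra edge pushes the whole network into a worse equilibrium, which is exactly what the 42nd Street closure reversed.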
Now let’s go back to the network we drew earlier. What can we say about its flows? Nothing really, hmm? Well, that’s a problem. Especially when in recent decades, we’ve come to think of more and more aspects of our lives in terms of networks.
Everything’s connected, man…
In 1964, RAND scientist Paul Baran drew this famous diagram showing three possible network topologies and their levels of vulnerability in case of a nuclear attack. The distributed network was chosen for the military communication networks that set the foundation for the TCP/IP protocol, which is at the core of the internet we use today. Though most of our online interactions today take place over proprietary centralised networks and infrastructure, the myth of the internet’s birth and the popularity of social networking platforms have captured our imagination and led us to conceive of society and power relations in terms of a distributed network. More recently, advances in biotechnology, neuroscience and machine learning have popularised experimental research into neural networks, the cutting edge of artificial intelligence, alongside science-fiction-like post-humanist visions.
There’s something very attractive about seeing everything as connected; it serves a basic need to rationalise everything in terms of cause and effect. It offers the mechanics of countless feedback loops that, if we could only count them all, would allow us to uncover ‘the big picture’. There’s also something extremely wonderful about the aesthetics of the network diagram, its volume, its physics, its emergence, its power. It is like the rendering of a hidden truth suddenly emerging before our eyes and taking us behind the scenes of everything we want to see through it.
And indeed if we like, we can think of almost everything in terms of networks. Every place, institution, being, object, word, concept, cell or atom could be a node and as soon as we find any possible way to connect them with a few other nodes, we’ve got ourselves a network. The network is a very flexible and abstract model and can even allow us to overlay other networks and create super-networks and then run network mathematics on them to further analyse the inner workings of the complex emergent system we just discovered.
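At its simplest, this kind of 'network mathematics' starts from nothing more than an edge list. A toy sketch (the nodes and edges are invented) computing node degree, the crudest measure of how central a node is:

```python
from collections import Counter

# A toy undirected network as an edge list: pairs of connected nodes
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Degree = number of edges touching each node
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1
```

Here node C touches three edges and D only one, so C would be drawn bigger and more central in most layouts; richer measures (betweenness, clustering) build on the same edge-list representation.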
It doesn’t take too many nodes and edges for a network to become very complex, and for us to get utterly trapped in its web. A whole sub-category in the science and design of data visualisation is devoted to network graph drawings, trying to come up with new graph layouts that would untangle the network and extend cognition. There are tree layouts and force-directed graph layouts, there are arc diagrams, radial layouts and globe layouts… all attempting to better understand and differentiate between the nodes and the edges on a two-dimensional plane.
But while these layouts may help communicate the network’s structure, they do little to expose the flow and more importantly, the protocol that governs it. Alex Galloway and Eugene Thacker defined protocols as all the conventional rules and standards that govern relationships within networks. They argued that “If networks are the structures that connect people, then protocols are the rules that make sure the connections actually work.”
If we can’t see the flow or understand the rules of the network embedded in its protocol, we can indeed allow ourselves to imagine the endless possibilities of connectivity. It is a part of what makes these network diagrams so inspiring and gets some of us excited about distributed networks and imagine them as the people’s answer to the centralised power of states and corporations. But it is also what makes networks so misleading. New York drivers on that Earth Day in 1990 had a simplistic model of the Midtown traffic flow, one that expects the largest road to be the fastest. Today a GPS navigation system might implement a protocol that would help these drivers circumvent this human error. But if we’re subject to the rules of an opaque protocol, it begs the question, who’s driving?
The network investigation model
They Rule, by Josh On
Today, as networks become the leading model for power and control, visualisation of complex networks has an important political role. In 2001, one of the pioneering websites to assume that role was Josh On’s TheyRule.net, a network diagram authoring tool which created graphs of America’s most powerful company boards and their members in an attempt to identify possible conflicts of interests in the echelons of power. TheyRule, later continued by LittleSis and other investigative activist tools like OCCRP’s VIS (Visual Investigative Scenarios), followed an investigative network model developed by law enforcement and investigative journalism and popularised by police dramas like The Wire.
According to the networked investigation model, the information puzzle can be solved by collecting and connecting the missing links. Data becomes the currency and the network is the model that structures it. The larger the network we draw, the more possibility we have of navigating within it. And when it grows faster than we can comprehend, network analysis algorithms step in to replace the human investigator. Hence every bit that can be captured is captured, in the hope that it might help unravel the network structure beneath.
At this point, it would not be unreasonable for your inner Edward Snowden to be sighing in silent frustration. The NSA and other governmental and commercial spying enterprises are using precisely the same networked investigation model that helps do-gooders go after corrupt politicians and organised crime, though their motives, their objectives and, specifically, their resources are very different. The iconic image of scruffy detectives in smoky rooms connecting the dots between notes and blurry photographs pinned on a cork board has been infinitely extended to such a degree that it is now the cork board itself that does the pinning and connecting. The larger the cork board gets, the smarter it becomes and the more connections it can draw. Both governmental and non-governmental investigators often have to untangle huge associative webs of connections, following dead-ends, bordering on flimsy conspiracy theories, and like those New York drivers on Earth Day, getting caught in bad traffic caused by a misreading of the flow.
A new political role for network visualisation
Aspirational networking has become a managerial ideology with endless possibilities for intelligence and control. As we’ve seen, drawing dots and lines is easy, but the result quickly gets hard to comprehend. What we’re left with is the aspirational networking ideology, in which everything is connected (man) and nothing is impossible, as long as you don’t expect people to be able to make sense of it. Networks are not evil, they’re just largely misunderstood. This makes the task of visualising networks even more crucial for us puny nodes who still seek to maintain some political agency.
But how do we do this?
First, we should extend the core network terminology beyond nodes and edges to also include flows and (as per Galloway and Thacker) protocols. But can flow and protocol be visualised? Yes, they can, and they already are. The following are just a few examples.
Volume: not all flows are created equal
Decades of research into visual perception have proven that position is the most legible visual property we can use to map any type of data, and hence most network layouts are differentiated by the formation of nodes and the data they represent. As long as the flow is not the question, a network diagram can prove quite useful. For example, social media analysis often uses a network diagram to map ‘@mention interaction’ on Twitter or the viral spread of a message. In these cases, the actual content of the messages (the flow) is only secondary to the question of how they spread, and the Twitter networking protocol is simplified and familiar enough to not require any extra visual encoding.
Credit: Gilad Lotan, Betaworks
While most network layouts focus on formation, clustering and categorisation of nodes, some networks try to differentiate between edges and to visualise their dynamics. Investigative journalists and civic society organisations that ‘follow the money’ often use networks to map not just general connections but to specifically signify the volume and directionality of the financial flow. As long as the flow is standardised and comparable (like money is) encoding volume and directionality, if carefully handled, can make for a pretty insightful network diagram.
Follow The Money By Andrew Ross Sorkin, The New York Times, 2008
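The idea of encoding volume and directionality can be reduced to a simple computation: given a list of directed, weighted transactions, sum what flows into and out of each node. Those totals are what the diagram then encodes visually: edge thickness for amount, arrowheads for direction, node size for throughput. A minimal sketch in Python, with entirely invented 'follow the money' records:

```python
from collections import defaultdict

# Hypothetical transaction records: (source, target, amount).
transactions = [
    ("ShellCo A", "Foundation", 250_000),
    ("ShellCo B", "Foundation", 120_000),
    ("Foundation", "Consultant", 300_000),
    ("Foundation", "Charity", 50_000),
]

def flow_totals(edges):
    """Total volume flowing out of and into each node."""
    outflow, inflow = defaultdict(int), defaultdict(int)
    for src, dst, amount in edges:
        outflow[src] += amount
        inflow[dst] += amount
    return dict(outflow), dict(inflow)

outflow, inflow = flow_totals(transactions)
# The Foundation receives 370,000 and pays out 350,000: a pass-through node.
```

Because money is standardised and comparable, these aggregates remain meaningful; the same computation over incomparable flows (emails, favours, affections) would produce numbers without insight, which is the point made below.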
The traffic maps we worked on at Waze fall into this category as well. We presented the flow of cars, and at the end of the day what mattered most was how much time it took for a car to get from point A to point B. While designing the map, I tried to subtly indicate the difference between travelling in busy urban areas versus choosing a more scenic route, going through parks and past lakes, but this was not something that the network could easily visualise. Neither could we simply visualise how potentially dangerous one path might be in comparison to another, or how fuel-efficient it would be to take this turn or the next.
When the flows themselves are incomparable the problem becomes even bigger, as the big picture loses its meaning when the collected relations fail to present an aggregated insight. When Google decided to turn all our email contacts into friends in one of their many failed attempts at social networking, it was due to a simplistic interpretation of their users’ social graphs. Your classmate, your sister, your boss, your student, your landlord, your client, your lover and some salesperson might all be nodes in your social graph, but the flows of your correspondence with them are incomparable and hence mean nothing to you in the aggregate. The same can be said about most attempts to visualise the social graph, which could never offer you much more than the basic narcissistic pleasure of seeing your image portrayed at the centre of your social life.
Networks need narrative
Contemporary technology always serves as the popular metaphor for thinking. Whether it was the steam engine, or the computer, humans like to view their brains as machines and thinking as a sophisticated technological process. As neural networks are becoming a leading model for researching and modelling logic and data processing, the network is becoming the visual metaphor for thinking itself.
But when we visually examine a ‘mind map’ (a network diagram representing thought) our eyes wander, trying to grab a node to start from and follow through the complicated stream of consciousness. Reading is a linear process and there’s not much we can read from a non-linear aggregated view of thought. Mind maps often serve as note-taking tools, but their final result is not as valuable as the documentation process itself. A mind map that visualises its authoring process would be much more readable and useful when shared or inspected after the fact.
We experience life as a narrative, not as a map and certainly not as networks. A network diagram rarely represents static relations. Narrating a flow through the nodes in the network is a useful way of examining it, whether as an example of its dynamics or as a way of highlighting specific insights. How was the network constructed? How should it be read? If documenting a narrative through the nodes and edges helps explain the flow and even the protocol, it could become an essential feature in the diagram.
Yet we might find that not every network has a story to tell, or that not every story is worth telling. For that matter, not every network might be worth networking.
Directionality: an implicit protocol
If life and reading are experienced linearly, direction implies both a narrative and a protocol. A tree layout, for example, represents networks with an explicit hierarchy. Nodes can diverge only in one direction and their flows conform to the protocol embedded in their structure. Family trees visually represent the genealogical flow and maintain its protocol.
Mapping time onto the network can also serve to suggest a reading direction, an explicit flow and sometimes teach us about the protocol. For example, the distributed version control model of networked collaboration on software projects (like the one used by Git and Github) is heavily based on time as an organising principle. In that sense, Github’s network diagram models not only the development of the code but the network’s very dynamics of collaboration. And like text, like code and like time, it can be read in sequence from one side to the other.
Screenshot of the oBudget.org project’s Github collaboration network
Visualising algorithms: a humanistic call to action
As data grows ‘bigger’, and computer algorithms become more complex, more control is moved behind the scenes. Data is processed through computer networks following opaque protocols and what finally gets presented to us as the final visualisation is usually only the tip of the iceberg. There are stories processed in these huge rule-based systems of automation, but they are not the stories we tell, they are the stories that tell us.
Visualisation is for humans. Computers don’t need anything to be visualised to them, and in any case, are not programmed to visually interpret it the way we can. We’ve been using visualisation mainly to understand data, but more recently there is a growing need for visualisation to assume its humanistic role and visualise algorithms.
Visualising algorithms is still a small fringe in the visualisation world. It is mostly academic and so far has mainly served an internal maths and computer science discourse. But the potential for visualising network protocols is huge. Rather than aestheticise the opaque wonders of abstract networks, a humanistic visualisation could educate us about the protocols that govern us and potentially even provide us with the means to adjust them.
Only some of our proposed solutions for visualising the Waze traffic flows were ever put to the test. There’s only so much you can visualise when you want the driver’s eyes on the road rather than on the screen, interpreting nuanced network visualisations. But the trend towards self-driving cars, led by Waze’s new owner, aims to take the human factor out of the equation completely.
While the question of whether humans should drive their own cars is up for debate, I would strongly argue against the wider trend driving us away from our agency in relation to technology at large. It is quite mind-boggling to think that network algorithms do not see points connected by lines, while we cannot even imagine networks without them. As abstract, rudimentary and confusing as they may be, networks are an essential construct of our 21st century lives and we need the conceptual and technological tools to be able to analyse them.
Once we acknowledge the anatomy of the network as more than the formation of nodes and edges and their layout, we can use them carefully, bearing in mind that:
Not visualising the flow implies that the layout of nodes and edges alone is enough to tell the whole story.
By presenting a finite inventory of nodes and edges, we might be implying that what’s presented in front of us is the full network and no other nodes or links are involved.
A network is an extremely flexible and abstract model, and wandering through its nodes and edges might quickly lead you in circles, following dead-ends or developing dubious conspiracy theories. Handle with care.
Networks need narrative, both as a layer of annotation and as a way to present exemplary network flow.
Directionality is important and can be a useful way to lay out the flow and even the protocol of some networks.
Time is an organising principle in our lives and could sometimes serve a similar role in the visual representation of a network.
Algorithm visualisation is the next frontier in network diagrams and for data visualisation at large. This is a call for humanistic agency in complex systems.
Finally, before we rush to join the dots and think of everything in terms of networks, we should really ask what makes a network model necessary in this case? Do we want to examine the relationships of the nodes? To compare the capacity of the edges? Can we really analyse the intricacies of the flows? And are we able to analyse the network’s protocols? And if we can, can we affect them?
If everything is a network, nothing is a network. But if this thing is a network, this is why you should care.
The Exploit - A Theory of Networks / Alex Galloway and Eugene Thacker
Visualizing Complexity / Manuel Lima
Visualizing Algorithms / Mike Bostock
Image created by John Bumstead
This chapter will look at how metadata has been used to expose, protect and verify abuses and excesses of power. We will then focus on exactly what metadata is contained within what format and introduce tools to extract, strip and add metadata.
Metadata can be understood as a modern version of traditional book cataloguing. The small cards stacked in library drawers provide the title of the book, publication date, author(s) and location on the library shelves. Similarly in the digital sphere, a digital image may contain information about the camera that took the image, the date and time of the image, and often the geographic coordinates of where it was taken. Such multimedia-related metadata is also known as EXIF data, which stands for Exchangeable Image File Format.
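One concrete example of what EXIF data contains: GPS coordinates are stored as degree/minute/second values plus a hemisphere reference, and converting them to decimal degrees is the first step in placing a photo on a map. A minimal sketch (the conversion follows the EXIF convention; the sample values are invented):

```python
def dms_to_decimal(dms, ref):
    """Convert an EXIF-style (degrees, minutes, seconds) tuple plus a
    hemisphere reference ('N'/'S'/'E'/'W') to signed decimal degrees."""
    degrees, minutes, seconds = dms
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -decimal if ref in ("S", "W") else decimal

# Invented sample resembling what an EXIF reader returns for the GPS tags.
lat = dms_to_decimal((52, 31, 12.3), "N")
lon = dms_to_decimal((13, 24, 36.0), "E")
# (lat, lon) can now be dropped straight onto any web map.
```

Tools like exiftool or image libraries expose these tags directly; the point here is only how little arithmetic separates an uploaded photo from a pin on a map.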
The Australian National Data Service provides the following definition: “Metadata can actually be applied to anything. It is possible to describe a file on a computer in exactly the same way as one would describe a piece of art on a wall, a person on a job, or a place on a map. The only differences are in the content of the metadata, and how it refers to the “thing” being described.” Metadata is structured information that describes, explains, locates or otherwise simplifies the retrieval, usage or management of an information resource. Metadata is often called data about data or information about information.
In an interview with Exposing the Invisible, Smári McCarthy, head of the technology team on the Organized Crime and Corruption Reporting Project, says that “every information source has metadata, sometimes it is very explicit, created as part of the documentation process of creating the data. PDF files, images, word documents, all have some metadata associated with them unless it has been intentionally scrubbed.”
To illustrate this point, McCarthy describes a small chip contained within all digital cameras. These chips, known as Charge-Coupled Device (CCD) chips, are basically light-sensitive circuits, and each comes with minor factory flaws unique to that individual chip. This idiosyncrasy means that the data contained within all images taken with that device, data one would usually ignore and which is invisible to the human eye, becomes a digital ‘fingerprint’ identifying all images taken with that particular CCD chip. This highlights the near-omnipresence of metadata as well as the possibilities of working with it. McCarthy calls metadata “a best friend, it helps with searching, it helps with indexing and with understanding the context of the information.” But even metadata enthusiasts like McCarthy admit that metadata can also become “a worst enemy”, and thus understanding it is crucial not only for people working with metadata, but also for the wider network of individuals and groups working on sensitive information.
The possibilities of using metadata are multiple and varied. The Australian National Data Service points out that:
“metadata generally has little value on its own. Metadata is information that adds value to other information. A piece of metadata like a place or person’s name is only useful when it is applied to something like a photograph or a clinical sample. There are, though, counter-examples, like gene sequence annotations and text transcripts of audio, where the metadata does have its own value, and can be seen as useful data in its own right. It’s not always obvious when this might happen. A set of whaling records (information about whale kills in the 18th century) ended up becoming input for a project on the changing size of the Antarctic ice sheet in the 20th century.”
Michael Kreil, an open data activist, data scientist and data journalist working at OpenDataCity, a Berlin-based data-journalism agency which specialises in telling stories with open data, says:
“metadata seems to be some kind of a by-product, yet it can be used to analyse certain behaviours, of political and social nature, for example. Let’s take something simple as an example, like a phone call. Making a phone call doesn't seem very important. It's hard to analyse one million phone calls or one million photos, with the analysis being based on speech recognition or face detection, both fields still being in a state of technological development. But it's pretty easy to analyse the metadata contained within them, because metadata has a simple, standardised format for every phone call: there is the date, the timestamp, the location and numbers of the caller and the callee. This standard allows us to analyse a huge amount of metadata in one big database. For example: are there patterns in the population, represented in the metadata, that show who has depression or who is committing adultery?”
Often, this type of metadata is more valuable than the content of the phone call itself. This metadata provides information about networks, their scale, frequently visited locations and far more besides. There is currently no online communication method which does not leave metadata traces throughout or at some crucial point of the communication process.
Activists, experts, investigative journalists and human rights defenders are increasingly taking an interest in metadata, as are governments and corporations. Using metadata has proven helpful in various cases in fighting corruption, or as a weapon to crackdown on dissidents and human rights defenders.
It is important to understand how metadata works and how to use it as a tool. It is also vital to know how to protect oneself and one’s work in relation to the metadata we generate. Whether exposed, stripped, added or verified, standalone or cross-referenced with data found through other sources (conventional or not), metadata is key to today’s investigative journalism and human rights advocacy, especially when it comes to documentation, to image and video activism, and to evidence collection.
Metadata is a powerful tool to expose and provide evidence. In 2009, data scientist Michael Kreil created Tell-all Telephone, a project that generated a visualisation of six months of German Green Party politician Malte Spitz's telephone data. Michael Kreil told Exposing the Invisible that he had received “an Excel sheet with 36,000 lines of whatever in there, and there was no tool at all to have a look inside. You could make a simple map, using just the geolocation data, but you wouldn't see the aspect of time. You wouldn't see the movement. So, I wrote a small prototype, just a simple map with a moving dot. This was actually the basis of the application that went online a few weeks later.”
Exposing the individual... and everyone else
The data provided revealed much about Spitz’s behaviour: when he was walking down the street, when he was on a train, as well as his whereabouts during his private time. Some information was not provided by Spitz’s telecommunication company, such as the phone numbers he called or texted, or those of the people who contacted him. Had it been included, it would have been easy not only to identify Spitz’s social and political circles and reveal much about him, but also to reveal personally identifiable information about the people with whom he is in contact. Kreil and Spitz were not granted access to this information, but the telecommunication company does have access to it, which means the authorities can also acquire it.
Kreil also used publicly available information, like Spitz’s online behaviour, appointments announced on the party’s website as well as his tweets, to corroborate the data provided by the telecommunication company. By combining all of this data, it was possible for Kreil to pinpoint Spitz’s movements even further, and the result provided a thorough analysis of Spitz’s life and political activities. Kreil hoped to demonstrate how metadata can be used not only to track an individual’s every move, but also to reveal how (meta)data retention can expose an individual’s whole social and political network.
Image from the Tell-all Telephone website
More than meets the eye
Illinois Republican congressman Aaron Schock was known as the 'most photogenic' congressman due, in part, to his Instagram account, which featured him in eccentric and zany poses in exotic locations. He posted pictures of himself jumping into snow banks, on sandy beaches and in various private planes. The attention his photos attracted led to questions about where Schock’s public-office-related business trips stopped and his holidays started. The Associated Press (AP) began an investigation which extracted the geolocation data from the photos Schock posted and tagged with his location on his Instagram account, and then compared it to the travel expenses he was charging to his campaign. The AP analysed his travel expenses, his flight records of airport stopovers and the data extracted from his Instagram account, and found that taxpayers' money and campaign funds had been spent on private plane flights. It wasn't only Schock's Instagram that was revealing. The account of a former Schock intern showing an image from a Katy Perry concert with the tag-line "You can't say no when your boss invites you. Danced my butt off," was connected to a \$1,928 invoice paid to the ticket service StubHub.com, listed as a “fund-raising event” in Schock's expenses. The AP published their findings on February 24, 2015, and on March 17, 2015 Schock announced his resignation from Congress.
Image of Aaron Schock from his, now closed, Instagram account
In most cases, it is necessary to employ a variety of software, tools and resources to make sense of the extracted metadata and extract meaningful information. A good example of these creative investigation techniques using metadata is the case of Dmitry Peskov, Putin's spokesperson. Peskov was questioned about his income as a state official when he was spotted wearing an 18-carat gold Richard Mille watch, worth almost £400,000. The watch was visible on his wrist in a photo posted from his wedding. During the ensuing controversy, Peskov stated that the watch was a gift from his new wife, a claim which was later refuted by a photograph on his daughter’s Instagram account. There, a photo posted by his daughter months before the wedding showed Peskov wearing the same watch.
Images found of Peskov's watch at his wedding, and the watch in question, the Richard Mille RM 52-01.
Peskov was hit with another scandal when rumours emerged about his spending during his honeymoon, which he spent with friends and family aboard the Maltese Falcon, one of the 25 most expensive yachts in the world. The weekly rent of the Maltese Falcon far exceeds the politician’s declared economic means. Bellingcat reports that anti-corruption activist Aleksey Navalny, who broke the news about the watch with the help of other activists and supporters, took up the investigation regarding the yacht. By using the yacht's website, yacht-spotting websites and Instagram photos from Peskov's daughter and one of Peskov's friends, they were able to cast reasonable doubt on Peskov's denial of personally renting the Maltese Falcon. Peskov's friend had posted photos of two yachts on his Instagram profile, and by using VesselFinder, Navalny's team managed to place the two yachts in the same area as the Maltese Falcon at the same time. They also matched a small portion of a door that appeared in a photo Peskov's daughter posted of herself on Instagram to a video of the Maltese Falcon showing the same door with two distinctive marks.
A lot of attention is focused on the metadata that can be extracted from images or from communications. However, text files can be equally useful for an investigation, or pose an equal threat, as images. In 2005, the former prime minister of Lebanon, Rafik Hariri, was killed along with 21 others. Though the United Nations investigators used metadata to investigate the assassination of Hariri, looking through communication metadata they had received from telecommunications companies, they did not pay attention to the metadata they themselves left behind. When their long-awaited report on Syria's suspected involvement in the assassination, known as the Mehlis Report, was published, it caused a stir not only for its findings but for what a deeper look into its metadata revealed. The metadata recorded the editing changes along with the exact times they were made. The key changes included the deletion of names of officials allegedly involved in the assassination, including Bashar al-Assad’s brother and brother-in-law. This jeopardised not only the individuals whose names had been deleted, and the various governments and international bodies involved in a gravely destabilised region, but the United Nations and the Mehlis team too. The incident was considered extremely serious and led to the UN issuing a response to the concerns regarding the deletion.
There are many tools available that can be used to reveal the metadata in files and images, though as can be seen in the case studies, in most cases a wider investigation is required to make sense of the metadata. See the section on tools for information and description of the tools.
Metadata is a double-edged sword: it can be extremely useful for investigating social justice and corruption cases, but it is also used to troll and doxx. Human rights defenders, women, female journalists and LGBTIQ individuals vocal on social media are all prime targets. The increased use of smartphones in protests and mobilisations worldwide has expanded the risks of sharing one’s location or whereabouts at a certain time, and one’s identity can be determined through mobile phone tracking using the images posted. The geolocation data available in images can be used to track anyone and anything, including endangered species. In a South African reserve, visitors were advised not to disclose the whereabouts of the animals they spotted and to switch off the geotag function on their phones and social media platforms, as poachers and hunters were using this information posted online to locate animals.
Image taken by Eleni de Wet in South Africa and posted on her Twitter account on 4 May 2012.
Metadata: Vice & the fugitive
In 2012, millionaire and controversial computer programmer and developer John McAfee, founder of McAfee Virus Protection, was arrested based on metadata found on an image posted by the media company, Vice. Vice journalists gained exclusive access to McAfee and accompanied him on his escape from an investigation in Belize regarding the murder of one of his neighbours. Vice not only posted the image, but bragged about their scoop by reporting on the time they spent with McAfee. When the image was posted with its metadata revealing where it was taken in addition to Vice’s publishing information on when they had seen him, it was simple to determine McAfee’s whereabouts. Though the image was most probably sent from the person who took it in Belize to Vice offices to be later uploaded on their website, it still retained the metadata of where McAfee was. The Vice journalists in question should arguably have known how to better protect their sources, as well as the subject of their reports, leading Vice to issue an official statement about the event.
Image by Robert King taken from the article "We are with John McAfee Right Now Suckers", published on Vice on December 3 2012.
One might assume that persons operating in high-risk areas and industries and taking part in high-risk activities would be more careful about revealing their whereabouts, but this was not the case for Michelle Obama or US soldiers in Iraq. In 2007, insurgents in Iraq used geotags from images shared online by US soldiers to attack and destroy several US AH-64 Apache helicopters. Michelle Obama's Instagram photos were geotagged revealing either her whereabouts when taking the images, or the whereabouts of the person managing her account. In both cases, this could and did pose a serious security threat not foreseen by those posting the images.
Image taken from Michelle Obama's Instagram published on Fusion
Metadata can be and has been used to curtail freedom of speech and intimidate people online. For example, it has been used to doxx, the practice of publishing private information to target individuals for their political views or personal lives. It has been used to target women activists online, women game developers, human rights activists and journalists, among others. Managing metadata correctly is crucial for an individual with a high profile, especially on social media, and for those who engage in political activities or lead their lives in ways that counter the mainstream or the status quo. The manual “Zen and the Art of Making Tech Work For You" discusses this particular aspect of metadata, with recommendations and resources on the topic written from a gender and tech perspective.
A project by OpenDataCity also highlights how metadata can be used to put people in danger, often unwittingly.
“Years later, Balthasar Glättli (a Swiss politician) also wanted an analysis of his data. In the end, he didn't just give me his telephone data, he gave me everything else that is collected by the data retention in Switzerland. Additionally, Balthasar had a few problems, because he's also in the Defence Committee of the National Council. In his metadata was the location of a secret hideout that he visited. It was secret, but his phone provider collected Balthasar’s locations and, by publishing this data, some journalists found the hideout and published it. It was too late to remove it. It’s an interesting thing that when cellphones are tracked all the time, you should, actually, constantly think about when to switch off your cellphone in your pocket.”
Metadata also takes centre stage in the discussion around intellectual property, especially for artists. Some websites, like Facebook, strip out the metadata to minimise the size of the files (metadata occupies file space) and to protect the privacy of their users. This has been a point of contention for people asserting the intellectual property of their work. Many photographers, for example, need to keep the metadata in their photos, especially in this age of mass sharing online without crediting; here, the metadata provides a guarantee that the artist is assigned the credit they are entitled to for their work. Flickr, on the other hand, retains and shares the metadata, and though users can deactivate this feature, many are not aware it exists. On Flickr, a simple click on ‘show EXIF’ under an image reveals a lot of details which the users themselves might not be aware they are sharing publicly.
Various tools can be used to remove the metadata from files and images, and there is always the option of tweaking the settings of the device or platform used to stop the registry of certain metadata. But to minimise the risks, it is recommended that one always double check what metadata is being shared (using the tools recommended in the Expose section), and then strip away any data left there and not intended for sharing. See the section on tools for information and description of the tools.
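For day-to-day work the dedicated tools mentioned above are the right choice, but it can help to see what "stripping" actually means at the byte level. A JPEG file is a sequence of marker segments, and EXIF data lives in APP1 segments that can simply be dropped while everything else is copied through. A simplified sketch (it ignores some rarer marker types a production tool would handle):

```python
def strip_app1(jpeg: bytes) -> bytes:
    """Drop APP1 (EXIF/XMP) segments from a JPEG byte stream."""
    assert jpeg[:2] == b"\xff\xd8", "not a JPEG"
    out = bytearray(b"\xff\xd8")
    i = 2
    while i < len(jpeg):
        if jpeg[i] != 0xFF:
            out += jpeg[i:]          # unexpected data: copy the rest verbatim
            break
        marker = jpeg[i + 1]
        if marker == 0xDA:           # Start of Scan: image data follows, copy all
            out += jpeg[i:]
            break
        length = int.from_bytes(jpeg[i + 2:i + 4], "big")
        if marker != 0xE1:           # keep every segment except APP1
            out += jpeg[i:i + 2 + length]
        i += 2 + length
    return bytes(out)

# A tiny synthetic JPEG-like stream: header, APP1 (EXIF), a quantisation
# table segment, then Start of Scan. Real files are longer but shaped alike.
app1 = b"\xff\xe1" + (12).to_bytes(2, "big") + b"Exif\x00\x00data"
dqt  = b"\xff\xdb" + (4).to_bytes(2, "big") + b"\x01\x02"
sos  = b"\xff\xda" + (4).to_bytes(2, "big") + b"\x00\x00" + b"scan"
clean = strip_app1(b"\xff\xd8" + app1 + dqt + sos)
```

The image data is untouched; only the descriptive segment disappears, which is why stripped photos look identical but reveal nothing about where or when they were taken.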
Metadata can also be used to verify information and evidence by 'proving' that a certain event took place at the time and place it was said to have taken place. In recent years, and with the viral spread of social media videos and images, verification has proven key to political participation, not just as a tool to prove something has happened at that time and place, but also to refute the spread of false videos and images that can discredit movements for social justice. In the Verification Handbook for Investigative Reporting, Christoph Koettl from Amnesty International explains how metadata helped verify the participation of the Nigerian army in extrajudicial killings.
We explored this in more detail in our interview with Harlo Holmes, the former technical lead on CameraV, and with a tool review of CameraV, a mobile App that enables users to verify photographs and videos in order for them to be able to be used as part of additional evidence in a court of law.
“CameraV, which began its life as a mobile App named InformaCam, was created by The Guardian Project and WITNESS. It's a way of adding a whole lot of extra metadata to a photograph or video in order to verify its authenticity. It's a piece of software that does two things. Firstly it describes the who, what, when, where, why and how of images and video and secondly it establishes a chain of custody that could be pointed to in a court of law. The App captures a lot of metadata at the time the image is shot including not only geo-location information (which has always been standard), but corroborating data such as visible WiFi networks, cell tower IDs and bluetooth signals from others in the area. It has additional information such as light meter values, that can go towards corroborating a story where you might want to tell what time of the day it is.
All of that data is then cryptographically signed by a key that only your device is capable of generating, encrypted to a trusted destination of your choice and sent off over proxy to a secure repository hosted by a number of places such as Global Leaks, or even Google Drive. Once received, the data contained within the image can be verified via a number of fingerprinting techniques so the submitter, maintaining their anonymity if they want to, is still uniquely identifiable to the receiver. Once ingested by a receiver, all this information can then be indexable and searchable.”
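The chain of custody Holmes describes rests on standard cryptographic building blocks. This is not CameraV's actual scheme (CameraV uses per-device PGP keys), but a minimal stdlib sketch of the same idea: fingerprint the media, sign the fingerprint together with the metadata using a secret held only by the device, and let the receiver verify both integrity and origin:

```python
import hashlib
import hmac
import json

# Stand-in for a key that only this device can generate (assumption).
DEVICE_KEY = b"secret-held-only-by-this-device"

def sign_capture(media: bytes, metadata: dict) -> dict:
    """Bundle a media fingerprint with its metadata and sign the bundle."""
    bundle = {
        "sha256": hashlib.sha256(media).hexdigest(),
        "metadata": metadata,
    }
    payload = json.dumps(bundle, sort_keys=True).encode()
    bundle["signature"] = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return bundle

def verify_capture(media: bytes, bundle: dict) -> bool:
    """Recompute both the signature and the fingerprint on receipt."""
    claimed = dict(bundle)
    signature = claimed.pop("signature")
    payload = json.dumps(claimed, sort_keys=True).encode()
    ok_sig = hmac.compare_digest(
        signature, hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    )
    return ok_sig and claimed["sha256"] == hashlib.sha256(media).hexdigest()
```

Any change to the pixels breaks the fingerprint, and any change to the metadata breaks the signature, which is why Holmes can say below that manually forging such a bundle is very difficult as long as the device itself is trustworthy.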
This raises a question regarding the forging and insertion of metadata. Looking at CameraV, for instance, Harlo Holmes talks about this issue and raises an important point about the trustworthiness of the device used:
“Technically speaking, it’s very difficult for those things to be manually forged. If someone took the metadata bundle and changed a couple of parameters or data-points - what they ultimately send to us in order to trick us would not verify with PGP, and each instance of the App has its own signing key. That said, I do realise that devices need to be trustworthy. This is an issue beyond CameraV: any App that uses digital metadata and embeds it into a photograph or video is going to have to be a trustworthy device.”
Holmes elaborates on the importance of this trust by explaining that the
“verification in CameraV works the same way as with PGP. Key parties exist because human trust is important. CameraV easily allows you to export your public key from the App. If you give this key to someone when they're in the room with you, and compare fingerprints, then you trust that person's data more than if a random person just emailed you their public key unsolicited. If organisations want to earnestly and effectively use the App in a data-gathering campaign, some sort of human-based onboarding is necessary.”
Another useful tool for the purposes of verification is eyeWitness, a tool that allows users to capture photos or videos through their mobile camera App “with embedded metadata showing where and when the image was taken and verifying that the image has not been altered. The images and accompanying verification data are encrypted and stored in a secure gallery within the App. The user then submits this information directly from the App to a storage database maintained by the eyeWitness organisation, creating a trusted chain of custody. The eyeWitness storage database functions as a virtual evidence locker, safeguarding the original, encrypted footage for future legal proceedings.”
In addition to that, the eyeWitness team includes an expert legal team who will analyse the received images and identify the appropriate authorities, including international, regional or national courts, in order to investigate further. In some cases, eyeWitness will bring situations to the attention of the media or other advocacy organisations to prompt international action.
Multiple tools and workarounds can be used to verify metadata in files and images; experts and enthusiasts are constantly coming up with new ways to verify information. It is also important to note that verification is not always completed simply by using an App, but may in some cases require cross-referencing the data with other sources and undertaking creative investigative approaches.
Working with metadata
To better understand how to work with and around metadata, it is important to know in practical terms what is generally meant when metadata is mentioned. Below is a list of the metadata that may be stored along with different types of data:
- The location (latitude and longitude coordinates) where the photo was taken if a GPS-enabled device, such as a smartphone, is used.
- Camera settings, such as ISO speed, shutter speed, focal length, aperture, white balance, lens type, etc. (note that some cameras also include the location coordinates).
- Make and model of the camera or smartphone.
- Date and time the photo was taken.
- Name of the program used to edit the photo.
- Author’s name, usually the name assigned when the program used to create the file was first installed.
- Version and name of the program used to create the file
- Title of the document
- Certain keywords
- Date and time of file creation / last modification
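As a small worked example of the location metadata listed above: EXIF stores GPS coordinates as degree/minute/second values plus a hemisphere reference, while most mapping tools expect decimal degrees. A minimal converter (the coordinates below are made-up illustrative values):

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds + hemisphere to decimal degrees."""
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -decimal if ref in ("S", "W") else decimal

# 52° 31' 12" N, 13° 24' 36" E (illustrative values only):
print(round(dms_to_decimal(52, 31, 12, "N"), 4))  # 52.52
print(round(dms_to_decimal(13, 24, 36, "E"), 4))  # 13.41
```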
Depending on the program used to create the document, the data may include:
- The names of all the different authors
- Lines of text and comments that have been deleted in previous versions of the document
- Creation and modification dates.
Metadata in video files can be divided into two categories:
- Automatically generated metadata: creation date, size, format, codecs, duration, location.
- Manually added metadata: information about the footage, text transcriptions, tags, further information, notes to editors, etc.
* Recommended reading: a thorough overview of video metadata and how to work with it, from WITNESS.
Audio metadata is similar to video metadata but is more widely used, especially to register ownership of the file. In addition, it can include:
- Creation date, size, format, codecs, duration and a set of manually added data such as tags, artist information, artwork, comments, track number on albums, genre, etc.
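To see what audio metadata looks like in practice, here is a minimal reader for the old ID3v1 tag, which many MP3 files carry in their last 128 bytes (title, artist, album and year at fixed offsets). A synthetic tag is built in the example so no real MP3 is needed; note that modern files often use the richer ID3v2 format, which this sketch does not handle.

```python
def read_id3v1(data):
    """Parse an ID3v1 tag (last 128 bytes of many MP3s), or return None."""
    tag = data[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return None
    field = lambda raw: raw.split(b"\x00")[0].decode("latin-1").strip()
    return {
        "title": field(tag[3:33]),
        "artist": field(tag[33:63]),
        "album": field(tag[63:93]),
        "year": field(tag[93:97]),
    }

# Build a synthetic file: stand-in audio bytes plus a 128-byte ID3v1 tag.
fake = b"\x00" * 100
fake += (b"TAG" + b"Song title".ljust(30, b"\x00")
         + b"Artist name".ljust(30, b"\x00")
         + b"Album name".ljust(30, b"\x00")
         + b"1999" + b"\x00" * 31)
print(read_id3v1(fake))
# {'title': 'Song title', 'artist': 'Artist name', 'album': 'Album name', 'year': '1999'}
```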
Metadata in communication depends on the type of communication used (e.g. email, mobile phone, smartphone). In general, it can reveal the following (if no tools are used to hide the metadata):
- IDs of the sender and the receiver
- Date and time of the communication
- Mode of communication, etc.
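For email specifically, this communication metadata travels in the message headers, which Python's standard library can parse directly. A small self-contained example (the addresses and date are placeholders):

```python
from email import message_from_string

# A minimal raw email; the header block is where the metadata lives.
raw = """\
From: alice@example.com
To: bob@example.com
Date: Tue, 01 Mar 2016 10:15:00 +0000
Subject: meeting

Body text here.
"""

msg = message_from_string(raw)
print(msg["From"])   # alice@example.com
print(msg["To"])     # bob@example.com
print(msg["Date"])   # Tue, 01 Mar 2016 10:15:00 +0000
```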
There are various ways to extract metadata from files. The options vary according to operating system, ranging from command-line tools and plug-ins to desktop applications and in-browser tools.
Disclaimer: When using online platforms to extract metadata, it is important to keep digital privacy and security in mind. There is not enough information available to guarantee the confidentiality of the process. These platforms might track your online behaviour, store your data, or share it with third parties or the authorities.
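One way to reduce that exposure is to inspect files locally. Even without a dedicated tool, filesystem metadata such as size and modification time can be read with a few lines of standard-library Python (a throwaway file is created here so the example is self-contained):

```python
import os
import tempfile
import time

# Create a throwaway file so the example is self-contained;
# with a real file, point `path` at it instead.
with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as f:
    f.write(b"example bytes")
    path = f.name

info = os.stat(path)
print("size (bytes):", info.st_size)   # size (bytes): 13
print("last modified:", time.ctime(info.st_mtime))
os.remove(path)
```

Note that these timestamps come from the filesystem, not from the file's embedded metadata, so they change whenever the file is copied or edited.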
There are various ways to reveal or look at metadata; the methods and tools are detailed later in this chapter. Some tools read a file's built-in metadata (as Photoshop does, for example), which means they display the data in their own format. Others produce more detailed output.
Though metadata can be removed or altered after a file is created, it is sensible to consider certain elements before creating the file. For example, it may be advisable to change the settings on your phone, use a certain App, modify user details on the software used, etc. Below are two examples of using a smartphone’s camera.
Fig. A: Photo taken with an Android phone using CyanogenMod. Does not show the geolocation or the type of phone used.
Fig. B: Photo taken with an iPhone. Notice the extra details revealed including address, type of phone, type of camera and program used.
There are various tools for viewing metadata, and the choice of tool may depend on the objective. In addition to software that includes a metadata feature (such as Photoshop, Adobe Acrobat, etc.), below is a list of tools to view metadata.
Disclaimer: Please note that to extract and edit metadata, some online platforms might track your online behaviour, store your data or share it with third parties or the authorities. It is important to keep digital privacy and security in mind. There is not enough information available to guarantee the confidentiality of the process.
Compatibility: Windows, Mac OS, and Linux
Proprietary status: Free and open source
This tool comes highly recommended, though it might require some effort to navigate since it is command-line based. However, it is quite comprehensive in the file formats it covers and the output it provides. ExifTool allows the user to read, write and edit metadata. The tool's website provides information, downloads and workarounds.
Compatibility: Online, no compatibility issues
Type: Use online through a browser
This is an online tool based on Phil Harvey's ExifTool, with the option of uploading an image or using the URL of an image online. It also offers a button that can be added to Mozilla Firefox or Safari, providing a shortcut for faster extraction of metadata.
Compatibility: Online, no compatibility issues
Type: Use online through a browser
This is an online tool based on Phil Harvey’s ExifTool. It has direct access to DropBox, Flickr and Google Drive. A user can log in from the Exifer website and edit their images directly from there. Exifer has a privacy disclaimer stating that: “pictures will be temporary downloaded just to let you edit them. The temp files will be deleted as soon as you'll refresh the home page of this site, or automatically after 15 minutes from the download time.”
Compatibility: Online, no compatibility issues
Type: Use online through a browser
Compatibility: Android mobile phones
Proprietary status: Free and open source
Type: Mobile phones
CameraV is a mobile App created by The Guardian Project and WITNESS. The V in the App's name stands for verification and it was created to add a large amount of extra metadata to a photograph or video in order to verify its authenticity. This piece of software does two things. First it describes the who, what, when, where, why and how of images and video. Secondly, it establishes a chain of custody that could be pointed to in a court of law.
Compatibility: Linux, Mac OS
Proprietary status: Free and open source
Just as the title suggests, this script allows the extraction of geolocation metadata from a batch of images. It can be a valuable time-saver when processing large numbers of images. The script, written by Exposing the Invisible team members, should be placed in a file called geobatch.rb and run in the folder containing the images.
Compatibility: Mac OS
Type: Mobile Phone
TrashEXIF is an iPhone App that allows users to strip all metadata from images or to control which metadata should be removed or kept. The App also allows for presetting a protocol to be applied to all images taken.
There are various ways to remove metadata from files. Here are a few suggestions taken from the Security in-a-Box toolkit.
You can prevent a specific kind of metadata, like GPS location, from being captured by:
- Switching off wireless and GPS location (under location services) and mobile data (this can be found under data manager -> data delivery).
- Making sure that the tag-location setting in the photo App is turned off when taking a photo.
Using tools like Metanull (for Windows), you can ensure that all metadata is removed before you share it. This tool is discussed in detail below.
Note: Some files like DOCs and PDFs can hold image files within them. If you do not exercise the necessary caution, you can scrub the metadata on the document that is holding the image, but the metadata for the embedded image will be retained! Using Metanull before adding the image to the DOC will remove all metadata from it beforehand.
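As an illustration of what scrubbers like Metanull do under the hood, the sketch below drops APP1 segments (where EXIF metadata lives) from a JPEG byte stream. It assumes a well-formed file and handles only the common case; the synthetic bytes are made up for the demo, and this is not a replacement for a tested scrubbing tool.

```python
import struct

def strip_exif(jpeg):
    """Drop APP1 segments (EXIF) from a well-formed JPEG byte stream."""
    assert jpeg[:2] == b"\xff\xd8", "not a JPEG"
    out = bytearray(b"\xff\xd8")
    i = 2
    while i < len(jpeg):
        marker = jpeg[i:i + 2]
        if marker == b"\xff\xda":          # start-of-scan: image data follows
            out += jpeg[i:]
            break
        (length,) = struct.unpack(">H", jpeg[i + 2:i + 4])
        if marker != b"\xff\xe1":          # keep every segment except APP1
            out += jpeg[i:i + 2 + length]
        i += 2 + length
    return bytes(out)

# Synthetic JPEG: SOI + APP1 ("Exif" payload) + APP0 + start-of-scan.
app1 = b"\xff\xe1" + struct.pack(">H", 8) + b"Exif\x00\x00"
app0 = b"\xff\xe0" + struct.pack(">H", 4) + b"\x01\x02"
fake = b"\xff\xd8" + app1 + app0 + b"\xff\xda" + b"image data"

cleaned = strip_exif(fake)
print(b"Exif" in fake, b"Exif" in cleaned)  # True False
```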
Removing metadata from documents and other files
As noted above, other commonly used file types such as Portable Document Files (PDFs) or word processing documents created by applications such as Microsoft Office or LibreOffice contain metadata which may include:
the username of the person who created a document
the name of the person who most recently edited and saved a document
the date when a document was created and modified.
In some cases, your document might also contain additional personally identifiable information such as addresses, email addresses, government ID, IP addresses or unique identifiers associated with personally identifiable information in another program on your computer.
Some of this information is easily accessible by viewing the file properties (which can be accessed by right-clicking the file icon and selecting properties). Other information or hidden data requires specific software to be viewed. In any case, depending on your context, this information might put you at risk if you are working and exchanging sensitive information.
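Because .docx and .odt files are just ZIP archives, you can inspect this document metadata yourself with the standard library: in a .docx it sits in docProps/core.xml. The example builds a minimal stand-in archive in memory so it is self-contained; the names are placeholders, and with a real file you would open it by path instead.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {"dc": "http://purl.org/dc/elements/1.1/",
      "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"}

# Minimal stand-in for a .docx: a ZIP containing only docProps/core.xml.
CORE = ('<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        '<dc:creator>A. Author</dc:creator>'
        '<cp:lastModifiedBy>B. Editor</cp:lastModifiedBy>'
        '</cp:coreProperties>') % (NS["cp"], NS["dc"])

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("docProps/core.xml", CORE)

with zipfile.ZipFile(buf) as z:   # works the same on a real .docx path
    root = ET.fromstring(z.read("docProps/core.xml"))
creator = root.findtext("dc:creator", namespaces=NS)
editor = root.findtext("cp:lastModifiedBy", namespaces=NS)
print(creator, "/", editor)       # A. Author / B. Editor
```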
Removing metadata from PDF files
Windows or MAC OS users can use programs such as Adobe Acrobat XI Pro (for which a trial version is available) to remove or edit the hidden data from PDF files.
Opening any PDF file with Acrobat will allow you to edit the metadata by going to the File menu and then selecting Properties. Here, you can modify the document author's name, title, subject, keywords and any additional metadata. You can remove information about the creation time, modification time, the type of device used to create the file, and other hidden data you don't see by going to the Tools menu, then Protection, and selecting Remove hidden information.
For GNU/Linux users, PDF MOD is a free and open source tool to edit and remove metadata from PDF files. However, it doesn't remove the creation or modification time, nor does it remove the type of device used for creating the PDF.
Removing metadata from LibreOffice documents
In LibreOffice documents, the metadata can be viewed by selecting the File menu, then Properties. Under the General tab, you can click Reset to reset the general user data, such as total editing time and revision number. You can also make sure that the Apply user data checkbox on this screen is unchecked, so that the name of the creator is removed. When you are finished, go to the Description and Custom Properties tabs to clear any data there that you don't want to appear. Finally, click on the Security tab and uncheck the Record changes box, if it is not unchecked by default.
Note: If you use the Versions feature, you can delete older versions of the document which may be stored there by going to the File menu, then Versions. If you use the Changes feature, go to the Edit menu, then Changes, and accept or reject the changes to clear the data relating to modifications made to the document, if you no longer need this information.
Other strategies for scrubbing metadata
Some file types contain more metadata than others, so if you don't want to play around with software and the formatting of a file doesn't matter, you can convert files from formats that contain a lot of metadata (such as .DOC and .JPEG) to ones that don't (such as .TXT and .PNG).
Avoid using your real name, address, company or organisation name when registering copies of software such as Microsoft Office, Open Office, Libre Office, Adobe Acrobat and others. If you must give a name or address, use a fake one.
- Metadata Investigation: Inside Hacking Team by Share Lab
- Verification Handbook for Investigative Reporting, in particular chapter 7
- Making Data Speak, Smari McCarthy's Exposing the Invisible interview on metadata
Image created by John Bumstead
Misinformation can spread at a feverish pace and in this chapter we will address the essential verification questions of where, what, why and who as well as introduce a number of tools and techniques to assist investigators with verifying data that they find online.
This chapter takes the Syrian conflict as its starting point. The conflict is the most documented war in history, with a range of documentation efforts under way: from tracking the extent of damage to Syrian archaeological sites to listing abuses against women and missing persons. Each of these efforts draws information from a range of media sources such as text documents, photographs and videos.
This is not just documentation that shapes public opinion about Syria and makes the truth of the Syrian reality accessible; it is equally important documentation for the future. When the Syrian regime is held accountable for its crimes, a lot of this documentation has the potential to serve as evidence and be presented in an international court of law. However, for this information to be used in courts of law or for a source to be trusted, the content must be verified.
In media coverage of emergencies, information often spreads quickly and without others checking its veracity before sharing. The prevalence of User Generated Content (UGC), content that is generated from tweets, digital images, blogs, chats, forum discussions and so on, also means that more and more people are documenting human rights violations and images of war and disaster. Newsrooms often have tight deadlines and can prioritise speed over accuracy, leading to the spread of images and videos that have been taken out of context or digitally enhanced and text documents that contain misinformation.
Journalists from Syria and other countries have put a lot of effort into verifying digital content related to the conflict, as there have been many proven incidents of fabricated content. One example is Abdulaziz Alotaibi's posting on Instagram of an image of a child sleeping between his parents' graves, presented as a depiction of a Syrian child who had just lost his family. It went viral on social media, with people, including politicians, discussing it, and some news agencies even used it to write breaking news stories.
Image taken by Abdulaziz Alotaibi
No effort was made to verify this image and no-one asked questions like:
Where was the photo taken?
What date was it taken?
Why was it taken? What is the story behind it?
When Alotaibi saw that the image had been used in the wrong context, he released another photo to show that this was actually an art project. The photo was not taken in a graveyard and the child was his nephew.
Image taken by Abdulaziz Alotaibi
Before this chapter begins, we would like to highlight two important factors: there is no single, magic tool that can be used for verification purposes, and using online tools could compromise you if you do not take specific precautions.
One Tool to Rule Them All
Later in the chapter we will introduce various tools that could assist you in working out the veracity of online content. However, these tools are often useful only in combination with other tools and in many cases it is helpful to speak to sources on the ground. But make sure you communicate with them securely so you don't put them or yourself at risk, especially if you are investigating sensitive issues.
One Tool to Find Them
Many tools and services featured below belong to private companies and are closed source. This means that when uploading content to these services you will not be able to control how these companies use it nor who they share it with.
Some of these services do not facilitate a secure connection to the internet, which might put you at risk if you are on a public Wi-Fi network. Also, your location will be accessible to these sites if you do not take measures to obscure it. At the end of this chapter we will delve deeper into using open source tools for verification and how you can protect yourself while carrying out an online investigation.
The Who, What and When
In 2014 a video was published on YouTube that featured a child being shot by the Syrian regime. It was watched over eight million times. The video was then cross-posted on BBC Trending alongside an article saying that the video was most probably not a fake.
A few days later a Norwegian film director stated that he was the one who staged the video showing a “Syrian hero boy” under gunfire. It was shot in Malta in May 2014, not Syria. The video was picked up by news agencies and social media activists to spread information about the suffering of children in war. Unfortunately, this had a negative political impact overall. Releasing this fake video and spreading it on social media made it easier for war criminals to dismiss credible images of abuse by saying most videos online are fake.
In both incidents above it was difficult to find the original source of the content. It becomes even harder when content is downloaded from social media websites such as Facebook, YouTube, Instagram and Twitter and uploaded again on the same platform from different users accounts and channels or uploaded onto different platforms.
News aggregators like ShaamNetwork S.N.N 'scrape' content from original uploaders onto their own YouTube channel, making it harder to find the source that created or first shared the content. The photo below shows how they scraped content from a media centre in Daraya (Damascus suburbs) showing a helicopter dropping bombs.
The same thing happened when the U.S. Senate Intelligence Committee released a playlist of 13 videos that had originally appeared on YouTube which they had used to look for evidence related to the 2013 chemical weapons attack on Damascus Suburbs in Syria.
A number of these videos were taken from the YouTube channel of a well-known Syrian media aggregator, ShaamNetwork S.N.N, which regularly republishes videos from other people's channels. Félim McMahon of Storyful was able to discover the original versions of these videos by using a range of different verification techniques, including checking the original upload date of the videos and examining their profiles to assess whether they looked real or fake. This is a very good example of how verified videos can be used to strengthen the investigation of an incident.
One of the key issues in verification is confirming the Who, What, When and Where:
- Source: Who uploaded the content?
- Provenance: Is this the original piece of content?
- Date: When was this content captured?
- Location: Where was this content captured?
Identifying the original source is essential when verifying digital content. Human rights investigators must confirm the authenticity of any information or content they find online via social media websites and other platforms, as it can be easily fabricated. For example, it is very easy to fake a tweet using this website, which can then be shared as a picture.
The image above can then be shared on twitter, creating the appearance of it being an authentic tweet. Another approach to spread misleading information is to retweet fake information such as: (Good news! RT@PresidentSY I'm announcing my retirement from politics).
In this section we will introduce a number of tools and techniques to verify that the person or organisation that you believe has uploaded or shared the content you want to verify is in fact the individual or group you believe they are.
First, a few questions to think about when checking an account to confirm it as the original source:
Has the account holder been reliable in the past?
Where is the uploader based?
Are the descriptions of videos and photos consistent and mostly from a specific location?
Is their logo consistent across the videos?
Does the uploader 'scrape' videos/photos, or do they upload only user-generated content?
How long have these accounts been active? How active are they?
What information do affiliated accounts contain that indicates the recent location, activity, reliability and bias or agenda of the account holder?
Once you have some answers to the questions above, like the name of the uploader from his or her YouTube channel or websites linked to the uploader's social media accounts, you can use tools to get more information about the source.
Verification technique: Check the check-mark
Facebook, Twitter and YouTube have a way of verifying personal profiles or pages through blue ticks added to them. Hover over the blue tick and you will see the text “verified account” pop up; if it is not there, then it is not a verified account. Since those who spread misleading information can also add a blue verification check mark to the cover photo of a faked account, here are a few steps to check the authenticity of an account:
Twitter verified account
Facebook verified account
YouTube verified account
However, we can't depend on these official verification programmes, as they are not available to all users. As a result, most of the time we end up checking profiles or pages that do not have a blue tick at all.
Verification technique: Delving into their profiles
Check the details available on the profile to confirm that it is original and not fake by looking at the following:
Are there any websites linked to this profile?
View the previous pictures and videos.
If they share content, where do they usually post about/from?
How many followers, friends or subscribers do they have?
Who are they following?
For example, let’s say that someone shared a YouTube video on a specific platform about a human rights violation incident. The first thing we need to do is to go to the user's YouTube profile. In the case below, you will see that his name is Yasser Al-Doumani. He has been uploading daily videos about human rights violations in Syria which are all located in the Damascus suburbs. We understand from this that he is a Syrian journalist, most probably based in the suburbs of Damascus.
When we check out the 'about' page on his YouTube profile we can see a number of important pieces of information:
Website links: There are two linked URLs to the Facebook pages of a coordination group in the Damascus suburbs which usually do media work. The description says that this YouTube channel is dedicated to Douma Coordination group.
Number of subscribers: He has 590 subscribers.
Joining date: He joined on 1 January 2014.
Profile views: His profile has 281,169 views.
All this information provides more clarity as to whether the account is fake or not. In this case, the account is authentic. Indicators of a possibly fake account include a recent joining date, few profile views, a low number of subscribers and no websites linked in the 'about' section.
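These indicators can be combined into a rough screening heuristic. The thresholds below are arbitrary assumptions for illustration, not an established standard, and no score replaces human judgement and cross-referencing with other sources.

```python
from datetime import date

def authenticity_signals(joined, profile_views, subscribers, linked_websites, today):
    """Rough screening signals; thresholds are illustrative assumptions only."""
    signals = {
        "established_account": (today - joined).days > 365,
        "has_audience": subscribers > 100,
        "has_history": profile_views > 10000,
        "externally_linked": len(linked_websites) > 0,
    }
    # If almost no signals hold, the account deserves extra scrutiny.
    signals["possibly_fake"] = sum(signals.values()) <= 1
    return signals

# Figures from the example profile above (website list is a placeholder):
result = authenticity_signals(
    joined=date(2014, 1, 1),
    profile_views=281169,
    subscribers=590,
    linked_websites=["https://facebook.com/example-coordination-page"],
    today=date(2016, 6, 1),
)
print(result["possibly_fake"])  # False
```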
You can check the original person who uploaded a video on YouTube. If you come across a specific video and you want to get to the original uploader of this video, you need to use the filter to sort by upload date as shown below. In this case, we got a video from social media showing alleged chemical attacks in Idlib province in Syria. By typing the title of the video in the search and sorting by 'Upload date' we get to the account of the original uploader, which is 'Sarmeen Coordination Group'.
As we mentioned earlier, you also need to check for websites linked to their channel and the number of subscribers and viewers to make sure that this is not a fake channel.
When checked, Sarmeen Coordination Group had 2,074 subscribers, over a million views and around 3,000 followers on Twitter. They have also been sharing visual evidence from the location for the last four years. Taken together, this allows us to confidently conclude that Sarmeen Coordination Group was the original source of the video.
Verification technique: Bot identification
On Twitter there are many fake accounts called ‘bots’ created to spread information, or sometimes to spy on people by following them. Most bots, and other unreliable sources, will use stolen photographs of other people as their avatars. For example, the Twitter bot @LusDgrm166 seen highlighted in red in the image below:
A quick investigation of the account's avatar reveals that the Twitter account is not likely to be operated by a human. Indeed, all the accounts on the screenshot above are likely not operated by a human. They are tweeting about Syria with the hashtag #NaturalHealing and the content of their tweets is taken from Wikipedia and other pages.
After copying the URL of the avatar of @LusDgrm166 or downloading the avatar image, paste the link/photo into a Google reverse image search to find similar pictures elsewhere online. As you can see in the image of the search results below, the image has been used by many twitter users as their profile picture.
Verification technique: The internet's phonebook
If you have the name or username of the person who uploaded the content to YouTube, Facebook, Twitter, etc., you can run their name through a service called Webmii to find more information about them from websites, news outlets and social media accounts.
Most importantly, contact the source directly to get information verified when possible. Make sure to ask how they know the specific information. They might be able to send you additional photos and videos to help you verify specific incidents. Always connect to sources securely so that you don't put them at risk.
For example, you have found visual evidence on YouTube, but you don't know the uploader. You don't know if he's the original source, and you don't know if he's located in the area where the footage was taken.
You can get more information about this source by running his name on Webmii as below.
You will find most of the photos or videos that he has uploaded online.
You will also find other digital content from different social media accounts that he shared online, or that others shared with him.
This will give you a better understanding of whether this source is reliable or not, especially if you find information that answers the questions we mentioned above for verifying a source.
In our example, the person works in a local media centre in a city called Al-Safira. He has been doing this for years from the same location which gives him more credibility.
You can use a different technique to check the source in cases where you can't find the name of the person who uploaded the content through social media platforms. This technique is often used when verifying information that has been uploaded onto a website rather than social media.
Verification technique: Who is? Using whois services
If you are looking at a website that contains information that you would like to verify but the website doesn't include the name of person who runs it, or if you want to find more information about the person who runs the website such as their location or phone number, then you can use a number of online services that provide this information.
When you register a domain name with a domain provider they will usually ask you for a number of identifying details including:
Name of the person registering the website
Email address (the email below was obfuscated, but usually you can find a clear address such as email@example.com)
Many websites offer a service to view this registration data, such as https://who.is/. Most domain registration websites offer this service as well.
Below are the results of running whois on a website such as www.example.org.
Note: Some domain providers offer services to keep this information from being public, in other instances individuals purposely obscure personally identifying information for privacy reasons.
In this section we will look at three different verification approaches that can help you to determine 'the What'. These are the provenance (whether this piece of content is the original or a duplication of a previously posted piece of content), the date the content was captured and the location where it was captured. Finding the answers to these questions will help you to establish the veracity of the content.
Verifying visual evidence
If you are looking into visual evidence such as a photo or video, you need to investigate whether it is the original content, how it has been used and whether modified copies exist.
Verification technique: Reverse image search
Use reverse image search tools such as TinEye or Google Reverse Image Search to find out if the image you are looking at has been posted online previously.
Note: Make sure to read the Protecting Data chapter on how to securely carry out online investigations before using this tool.
How TinEye works:
Go to TinEye's website.
Upload the image you want to search for. We will use the photo taken by the Norwegian film director from the earlier example.
Sort the results by 'Oldest', which will lead us to the people who used this image first, or to the originator of the image. You can also sort by 'Biggest Image', because sometimes the originator will be the one uploading a high-quality version of the image. In the case below, we see that this image was first used by a Norwegian online news website.
Verification technique: EXIF data
Every image has metadata attached to it (read more in the chapter on Metadata), which can include details about the type of device the image was taken on, camera settings, date and location information. There are various free tools that will analyse a photograph's metadata and compression information, allowing further verification of an image's veracity. It is also possible to verify the date and location of an image if they are included in its metadata.
FotoForensics is a toolbox that provides a suite of photo analysis tools. The public site permits viewing metadata, visualizing the JPEG error level potential and identifying the last-saved JPEG quality. FotoForensics does not draw any conclusions. Nothing says "it's photoshopped" or "it's real". The website highlights artifacts that may not be otherwise visible in the picture.
You can upload the image you want to analyse for metadata. This works best if you get a raw photo from the source. You won't get the same results when analysing images taken from social media networks such as Facebook, Twitter and Instagram because they strip most metadata when the images are uploaded to their platforms.
You can upload the image on the FotoForensics website or enter the URL of where the image is stored online and then click on 'Metadata' to see the information below the image. In this case, the image uploaded was taken from MMC marramedia center in Syria which is run by activists in that area. The photo shows remains of the weapons used recently in Idlib.
The results are not always clear, however, and depend on the copy of the file uploaded.
For example, a JPEG that has been resized, recompressed or changed from the original file will have much less reliable data than the original full-resolution image recorded by a camera.
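Because social media platforms re-encode uploads and strip most metadata, a useful first check is whether a JPEG file still contains an EXIF segment at all. The stdlib sketch below walks the JPEG's internal markers to detect the EXIF (APP1) segment; it is an illustration only and does not decode the metadata fields themselves, so use a dedicated tool for full extraction.

```python
def has_exif(jpeg_bytes):
    """Return True if a JPEG byte string still contains an EXIF (APP1) segment.

    A minimal walk over the JPEG marker structure; it only detects whether
    the metadata segment is present, it does not decode the EXIF fields.
    """
    if jpeg_bytes[:2] != b"\xff\xd8":           # missing SOI marker: not a JPEG
        return False
    i = 2
    while i + 4 <= len(jpeg_bytes) and jpeg_bytes[i] == 0xFF:
        marker = jpeg_bytes[i + 1]
        if marker == 0xDA:                      # start of scan: no more headers
            break
        length = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        payload = jpeg_bytes[i + 4:i + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True                         # APP1 segment with EXIF header
        i += 2 + length
    return False

# A fabricated minimal JPEG header, just to exercise the function:
app1 = b"Exif\x00\x00" + b"\x00" * 8
sample = b"\xff\xd8" + b"\xff\xe1" + (len(app1) + 2).to_bytes(2, "big") + app1
print(has_exif(sample))   # the same image stripped of APP1 would return False
```

If this returns False for a file that supposedly came straight from a camera, treat that as a signal that you are not looking at the original.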
However, if an image was edited in Photoshop by the originator, it doesn't necessarily mean that it was manipulated. Izitru is a tool that can help you figure out whether an image has been modified: upload an image and it will indicate whether it is an unmodified original or has been edited.
In this image below, Izitru indicates that it is not the original image and that the image has been edited.
There is another technique for analysing an image, called error level analysis (ELA). It is a forensic method for identifying portions of an image with a different level of compression. You can do this through fotoforensics.com as well.
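The idea behind ELA can be illustrated without any image libraries. Real ELA recompresses a JPEG and maps where the recompression error differs across the image; since Python's standard library has no JPEG codec, the sketch below simulates lossy compression with coarse quantisation. It demonstrates the principle only, on a toy one-dimensional "image"; for real analysis use a service like fotoforensics.com.

```python
def quantise(pixels, step=16):
    """Simulate one pass of lossy compression by rounding pixel values."""
    return [round(p / step) * step for p in pixels]

def error_levels(pixels, step=16):
    """Per-pixel difference between an image and its recompressed copy.

    In real ELA the whole image has already been through JPEG once, so
    recompression changes it very little; a region pasted in later encodes
    differently and stands out with larger errors.
    """
    return [abs(p - q) for p, q in zip(pixels, quantise(pixels, step))]

# An "original" region that already went through compression once...
original = quantise([23, 100, 180, 211], step=16)
# ...with a freshly pasted, never-compressed region spliced onto the end:
tampered = original + [37, 141]

errors = error_levels(tampered)
print(errors)  # near-zero over the original pixels, larger over the splice
```

The uniform low error over the untouched region versus the larger error over the spliced pixels is exactly the contrast an ELA tool highlights visually.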
Verification technique: Metadata as verification
There are applications that can automatically capture the date of a video or photo as well as other important details such as GPS data. One such application is CameraV, which we introduced in the Metadata chapter of this guide.
Verifying video evidence
There are no services for reverse video search equivalent to Google Reverse Image Search or TinEye, so it is not as easy to verify the provenance and original source of videos. However, there are ways to carry out a reverse-verification of a video to see whether it has been used and shared in the past. This requires you to capture a screenshot of the video at an important moment to get the best results (the most opportune moment to capture the screenshot is when an incident happens). Alternatively, you can capture a screenshot of the video thumbnail, as it could have been used previously on YouTube or other video hosting services. Then run the captured screenshot through TinEye or Google Reverse Image Search as we did earlier in this chapter.
Amnesty International created a tool that will help you implement this technique which you can find here.
Enter the YouTube URL that you are interested in, as seen in the example below, then select ‘Go to get Thumbnails’ which you can then run a reverse image search on.
In this case, the video was uploaded by 'Abu Shadi AlSafrany', who works in a local media centre in a city called Al-Safira. Its thumbnails had not been shared online before, which means that the video had not been used in different countries or contexts. Once you verify that a video is unique, it is important to then verify the location and the date of the incident to make sure that the video is not fabricated. Reverse image searches do not always reveal duplications of videos or photos, so there is a need to carry out other forms of verification as well.
Verification technique: Confirming the date
Verifying the date is one of the most important elements of verification. The key questions to consider when finding visual evidence online are:
When was the content created?
When did the incident happen?
This is made easier when people in the video mention the date of when the event happened or show newspapers or write the date on a piece of paper and show it to the camera such as in the example below.
In the Syrian context, the original uploaders of videos on YouTube will most of the time write the exact date of the incident in the video title. In most cases this is the correct date, especially if you are looking at a video from a vetted, verified source.
This does not often happen in other contexts, however, and even in Syria it was a challenge to verify the date of some events without adding other ways to confirm it such as looking at the weather during the event or connecting with the person who took the original footage and obtaining raw photo/video with metadata attached that verified the date of the event.
Hundreds of videos were uploaded to YouTube by media activists in Syria during the chemical attack that happened in Damascus on 21 August 2013. The uploaded videos were accused of being fabricated because the YouTube videos uploaded by activists were dated 20/08/2013, while activists were saying that the attack happened on 21/08/2013. This happened because YouTube time-stamps videos according to Pacific Standard Time (PST) rather than, in this case, Eastern European Time (EET); something important to be aware of.
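You can reproduce this discrepancy with Python's standard zoneinfo module. The upload hour below is illustrative, not a claim about any particular video; the point is that an early-morning local timestamp in Damascus on 21 August falls on 20 August in Pacific time.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Local (Damascus) time shortly after the attack; the exact hour here is
# illustrative, not a claim about when any particular video was uploaded.
upload_local = datetime(2013, 8, 21, 3, 0, tzinfo=ZoneInfo("Asia/Damascus"))

# The same instant expressed in YouTube's Pacific time stamp:
upload_pacific = upload_local.astimezone(ZoneInfo("America/Los_Angeles"))

print(upload_local.date())    # 2013-08-21
print(upload_pacific.date())  # 2013-08-20
```

Whenever a platform's timestamp appears to contradict a witness account, convert between the platform's time zone and the local one before concluding anything about fabrication.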
Checking the weather (if possible) from a photo/video is another helpful way to confirm the date of the event.
Below is a video posted on Al-Aan TV claiming that clashes had stopped in a few areas in Syria because of the snow.
To verify this, enter the location and the same date posted on YouTube into the website WolframAlpha, as shown below, to see whether the weather was indeed snowy.
As you can see above, you can verify that the date posted on YouTube is likely to be correct based on the similar weather conditions.
Verification technique: Confirming the location
The process of geolocating visual evidence is essential to verifying whether the evidence you find is in the location it claims to be. Mapping platforms such as OpenStreetMap, Google Earth, Google Maps, Wikimapia and Panoramio will help you to locate those materials when possible. The key for geolocation is to collect as many images as possible and use them all to verify, as it is harder to verify an incident from only one piece of footage.
Some details to consider for confirming the location:
License/number plates on vehicles
Landmarks such as schools, hospitals, religious places, towers, etc.
Type of clothing
Identifiable shops or buildings in the photo
The type of the terrain/environment in the shot
Verification technique: License/number plates on vehicles
In this case, we wanted to identify the location of a suspected ISIS member. For many years she had posted pictures from Guyana, South America, on a social media account. To confirm that she was actually in Guyana, we looked at images where number plates are clearly visible. The white car in this picture carries the number BMM-5356.
When we searched for this number on Wikipedia, we found the page below, which confirmed that this plate format matches vehicle registration plates in Guyana.
Verification technique: Looking at Landmarks
Looking at landmarks such as schools, hospitals, towers and religious buildings is very helpful when you are trying to geolocate visual evidence. Mapping platforms such as Wikimapia, Google Earth, Panoramio and Google Maps are tagged with thousands of photos that can be used to geolocate your evidence.
A search for 'schools' on Wikimapia returns all schools in the area, as shown below.
Panoramio works in a different way; it shows you all photos in a specific area that are tagged to Google Maps.
With both services you can find photos that will help you geolocate your evidence as demonstrated below. You can see a screenshot from Panoramio of a photo of a shop found through the site. The photo includes the phone number of the shop and a small board with the full address on it.
Once you find the suspected location of your evidence, Google Earth can be very helpful to confirm if this is indeed the actual location.
Use Google Earth to:
Look at structures
Look at terrain
View Google satellite image history
Verification technique: Looking at structures
Below is an image of a mosque which was captured by an activist who claims that this mosque is located in Jisr al-Shughur, Idlib. We located the mosque on Google Earth and compared the structure in the satellite image to the image provided by the activist to make sure that it is actually located in the claimed location. In this case we looked at the building's black windows and its structure. For more information about this technique, see the work of Eliot Higgins and Bellingcat, who have a series of detailed case studies and tutorials on it.
Watch From My Point of View, a documentary by Exposing the Invisible that features Eliot Higgins and his early work.
Verification technique: Look at terrain
Verify location by checking out the terrain of the claimed location in satellite imagery.
Below is one of thousands of leaked photos depicting violations of human rights in Syrian prisons. The leaked photo was geolocated by looking at the satellite image below; note the terrain, which shows the hill with communication towers.
Verification technique: Using Google satellite image history
Below are images of Aleppo in 2010 and in 2013. You can clearly see the damaged areas. This can help you geolocate streets in the area, and even a specific attack, since you can compare the images from before and after the damage.
As you can see from the above examples, technology has changed how we find and deal with sources and information as witnesses and activists share events in text, photos and videos on social media and blogs in real time. This can help human rights investigators verify events that are happening through visual evidence by using different techniques and tools. Remember to read our chapter on how to use verification tools as securely as possible before using any of them.
How to carry out an online investigation as securely as possible
Using online search tools and investigation techniques can be very helpful to verify user generated digital content such as photos and videos as we saw earlier in this chapter. But there are security issues related to this that you need to consider before carrying out your online investigation.
Some important questions are:
- How sensitive is the investigation you are carrying out? Are you going to be at risk if anyone knows that you are working on this investigation? Will other people involved in this investigation be at risk too?
- How sensitive are the videos and photos you are dealing with and verifying? Would it put you at risk if someone saw that you have it? Is it safe to carry these materials with you while you are travelling and moving around?
- Do you know the security situation and the Terms of Service of the online tools that you are using to verify your content? Do you know if they would share the uploaded content with other parties?
- Do you connect securely to the internet while carrying out an online investigation so that people connected to the same Wi-Fi network won't be able to see what you are looking at?
- Do you hide your location while carrying out an online investigation so the sites you are visiting won't be able to collect information that could identify you personally later on?
These are some of the questions that you need to think about when using online tools and cloud-based services to conduct an investigation. Below we will go through basic steps on how to carry out an online investigation as securely as possible. We will also go through open source tools that you can use for investigation instead of commercial software. By open source, we mean tools that let you review how they are built, so that you, or a technical expert you know, can check whether they violate your security and privacy at any point. Commercial software doesn't let you do this, so you won't be able to tell whether it respects your privacy or is secure to use.
Basic steps for investigating online securely:
Step 1: Connecting securely to the internet:
Make sure that you are browsing sites securely, both the ones you are investigating and those you are just reviewing. You can do this by encrypting your communication with these sites when possible through SSL (Secure Sockets Layer).
Note: Some sites don't support secure communication, so people can see the websites you visit and the information you send (log-in details, text, photos, video etc.). This can be very risky if you are working on a public Wi-Fi network.
Step 2: Hiding your identity with Tor:
You leave behind many traces while you look at websites, social media sites and use online verification tools to carry out your investigation. Most of the websites that you visit collect information about you such as your location through your IP address, your browser fingerprint, the device you are using to access the internet (mobile, tablet or computer), the unique number for your device called the MAC address, the websites you visited online, how long you stayed on a specific page and more. The Me and My Shadow project has more details about these traces.
All this collected information can create a profile of you which makes you identifiable. It's important to make sure that you are browsing the internet securely and anonymously if you don't want:
hackers connected to the same Wi-Fi network as you to see what you are doing,
the websites you are visiting to collect information about you, or
your internet service provider to see what you are doing online.
You can do this by installing and using the Tor browser bundle when doing an online investigation. Learn how to install and use this tool here. You can also use Tails which is an operating system that allows you to remain anonymous on the internet by default.
Once you install this tool you can use services such as Pipl or Webmii to verify sources, who.is to help you verify the source if it's mentioned on the website registration page and FotoForensics without divulging your identity or your investigation.
Step 3: Use safer tools for verification and investigation when possible:
Below is a list of open source tools that can be used for safer online investigation as alternatives to their closed source counterparts.
Confirming sources of information
You can use Maltego instead of cloud-based services like Pipl and Webmii to get more information, such as social media accounts, related websites, phone numbers and email addresses of a specific source, in order to verify who they really are. Maltego is a program available for Windows, Mac and Linux that can be used to collect and visually aggregate information posted on the internet, which can be helpful for an online investigation. Once mastered, this tool is extremely useful, but it is fairly complicated for a first-time user, so be prepared to invest some time into it.
Consider ExifTool as an alternative to FotoForensics, which was presented earlier as a tool to review the EXIF data of a photo. With it you will be able to extract metadata from photos, such as the device used to take the photo, the date, the location and the last programme used to edit the photo (if it has been edited). By using ExifTool you won't need to upload your photo to a cloud-based service that you don't trust with your data: everything is done locally on your computer, with no 3rd party involved. However, there are downsides to ExifTool as well. First, it doesn't support error level analysis, and second, it is a command-line application, so it doesn't have a graphical interface. That said, it is easy to install and the basic commands are simple to learn.
Google Maps and Google Earth are among the most used tools to verify locations and geolocate incidents. Everything you do on Google Earth and Google Maps is connected to your Gmail once you sign-in, which means that it is possible to know that you are looking at specific locations to verify. Use a separate Gmail account from your personal one if you want to use Google Earth and Google Maps for your investigation without exposing your real email. This will make it harder to identify you as a person working on a specific investigation.
Reverse image search
Unfortunately, there are no open source tools currently available that rival the functionality of TinEye or Google Reverse Image Search.
Step 4: Communicate securely with your sources so you don't put them or yourself at risk:
It's very easy to put yourself and your sources of information at risk while communicating with them about your online investigation, especially if you are investigating a sensitive event. Make sure that you communicate with them securely and that you share data securely too. There are easy-to-use tools that you can install, such as Signal for safer communication and miniLock for secure file sharing. Read more about secure communication here.
Step 5: Always back up information:
Make sure to back up relevant content you find, sites you visit, social media accounts and conversations you have with sources of information. Visual evidence such as photos and videos that you find online can disappear from the internet for many reasons. Back up everything you find online so you can access it later for verification and analysis even if it's no longer online.
You can use cloud-based archiving services to back up websites and social media accounts.
The archives will provide you with an ID for every page you submit. Copy these IDs to a file on your local computer so you can return to the pages you want to analyse later on. You can encrypt this file on your computer with a tool called VeraCrypt so that no one other than you will be able to open it. You can find more info about how to install and use VeraCrypt here.
Note: Make sure that you use the above online archives over the Tor browser to hide your location.
Resources and Tools
- The Verification handbook
- Witness Media Lab - Verification section that include verification tools
- The Observers guide to verifying photos by France 24
- Arab Citizen Media - Guide on fact checking
- Arab Citizen Media - Guide on verification tools
- GlobalVoices - Verifying social media content guide
- Storyful verification process with case studies here and here
- Checkdesk for collaborative verification, an open source tool
- FB graph search tool, especially to locate photos that you cannot find on their timeline
- Extracting images from a webpage with this tool
- This tool allows you to tweet from ‘anywhere’
- Find out when new satellite imagery is available with this website
- Tracking geotagged tweets per account
- Geolocating Skype users using a Skype ID
- Collecting, sourcing and organising content on Facebook and Instagram with this website
Image created by John Bumstead
This chapter is split into two parts. The first part focuses on the differences between private and public data and procedures to determine how sensitive the data you are working with is. In the second part we'll suggest a number of free, open-source and easy-to-use tools that can help you gather, store and move your sensitive information in a secure way.
Working on critical investigations that involve handling and sharing data and information is not only difficult, but at times risky. Security is an essential part of this work, not only because of the risks you accept yourself but because those you work with are also put at risk. Activists should assume that their work will be visibly successful and prepare for the security risks this will undoubtedly bring; good security practices should therefore be adopted from the beginning and maintained throughout the process, rather than as an afterthought.
The relationship between privacy and security can be described as follows: privacy concerns a very specific kind of information (personally identifying information), which must be protected by security (for example, procedures and technologies). Security is also used to protect other kinds of data, like financial data, contracts and other information vital for running businesses.
The differences between private and public data
Private data is any kind of data that identifies an individual. This can be a name, birthdate, postal address, online user account or a private bank account number. But it can also be a record of people’s movements in a city or their sexual or religious preferences.
Conducting business using new media and digital communication allows for greater ease and efficiency when organising, mobilising, communicating and sharing. However, doing so leaves exposed our primary assets: people, networks and knowledge. Protecting people, networks and knowledge is thus essential in order to have an impact.
The German Chaos Computer Club defines the issue of private versus public data this way: “Protect private data, exploit public data”. This can be interpreted as: all data that identifies persons who have no power over society should be protected, and all data which is public or concerns persons who hold power over society must be utilized to the fullest for the benefit of society. Public data also includes all data that has been produced with public money, such as a national curriculum, legal texts and the data collected by tax and statistics authorities. Making such data transparent creates opportunities to hold these powers accountable. However, transparency alone is not enough. This information must be used to keep powers in check.
Laws, treaties and other legal constraints
Michael Poulshock maintains a continuously updated list of the data protection laws that govern what can and cannot be done with someone's personal data in countries around the world. It is important to note that many countries do not have data protection laws and the contents of national laws can vary greatly in regards to how long data can be retained and how these laws are enforced.
“In the European Union, there's the Data Protection Directive and the proposed General Data Protection Regulation, as well as numerous other treaties, regulations, and court opinions governing personal data. In the U.S., there are dozens of federal regulations pertaining to personal data, such as the Health Information Portability and Accountability Act, the Online Privacy Protection Act, and the Electronic Communications Privacy Act. Additionally, many U.S. states have their own laws, and common law principles such as tort and libel apply. In Australia, there's the Australian Federal Privacy Act. In Canada, the Personal Information Protection and Electronic Documents Act. The UK has the Data Protection Act of 1998. France, Switzerland, India, Nigeria, and Pakistan all have laws that apply to personal data. And that's just scratching the surface.”
The privacy protection laws of many countries can also be applied to organisations and are usually enforced in private law or by a domestic Data Protection Supervisor. Whether or not such laws are currently strictly enforced, it is good practice to protect data relating to activists' supporters. A database of donors, such as one kept in CiviCRM, is vulnerable to theft or confiscation in a police raid, which can subsequently lead to malicious publication and the exposure of the financial supporters of an organisation. If there is danger of such a police raid, it might be better not to have a supporter database at all (a practice known as data minimisation).
Anonymisation of data
No data exists in a vacuum. Even if a piece of information is not itself directly personally identifying information (PII), it can become PII when combined with other data. Any data point can be cross-referenced and refined to accurately identify persons. A name like John Smith is probably not PII, but combining it with a unique place of birth can be enough to make it PII: probably few John Smiths were born in Ulaanbaatar, Mongolia. The more data we add, the more accurate the identification. This also applies to all kinds of anonymisation techniques, as demonstrated by the Netflix and AOL de-anonymisation results. The problem with anonymisation is that the more thorough the anonymisation, the less useful the data becomes. In short, anonymisation doesn't work: either the data is so anonymised that it is useless, or it can be de-anonymised. The seminal paper on this is Paul Ohm's "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization".
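The linkage attack behind such de-anonymisation results can be sketched in a few lines. The data below is entirely synthetic and the field names are hypothetical; the point is that a pair of seemingly harmless attributes can act as a quasi-identifier once a public dataset shares them.

```python
# A toy linkage attack on entirely synthetic data: the "anonymised" records
# carry no names, but the (birth_year, postcode) pair acts as a
# quasi-identifier that a public register resolves back to a person.

anonymised_records = [
    {"birth_year": 1975, "postcode": "1011", "diagnosis": "asthma"},
    {"birth_year": 1982, "postcode": "9712", "diagnosis": "diabetes"},
]

public_register = [
    {"name": "J. Smith", "birth_year": 1975, "postcode": "1011"},
    {"name": "A. Jones", "birth_year": 1990, "postcode": "1011"},
]

def reidentify(records, register):
    """Link records to names wherever the quasi-identifier pair is unique."""
    hits = []
    for rec in records:
        matches = [p["name"] for p in register
                   if (p["birth_year"], p["postcode"])
                   == (rec["birth_year"], rec["postcode"])]
        if len(matches) == 1:            # a unique match means re-identification
            hits.append((matches[0], rec["diagnosis"]))
    return hits

print(reidentify(anonymised_records, public_register))
```

The more attributes the two datasets share, the more records resolve to a unique person, which is exactly why stripping names alone does not anonymise a dataset.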
Data retention and 3rd parties
In some cases, laws mandate that organisations retain PII. Examples include requirements that hotels keep information about their guests and that internet service providers maintain user information. In the case of internet service providers, it makes sense to practice data minimisation and not keep any logs. Many organisations collect statistics on web visitors which they then surrender to Google Analytics, giving Google, the US government and possibly other law enforcement agencies access to this data. More privacy-conscious organisations might use a free software solution like Piwik; even still, this data is vulnerable to malware, hacking attacks on the server and seizure by law enforcement agencies. Websites with Facebook 'like' buttons, advertising banners and pictures from 3rd-party image or video hosters (like Flickr, Imgur, and YouTube) leak even more PII. When visitor information is handled by such a 3rd party company, it is accessible to the law enforcement agencies of that company's country as well.
The architecture and business model of the internet is one which enables multiple third parties to constantly collect, process, aggregate, share, sell and store data in various countries around the world. This means that while our data might initially be collected in one country, it might end up travelling to multiple other countries before it is ultimately stored in a final country --only to then be shared again with parties located in other countries. In other words, it is very difficult to pinpoint the precise location of our data at any given moment, which in turn makes its regulation and protection even harder. For more information about this look at trackography.org, a project that illustrates who tracks us when we browse the internet.
Image from the Trackography project that looks at who tracks you when you read news online.
Privacy Impact Assessments
How do you know how sensitive the information you are collecting is? The first thing to think about is: what information are you collecting? When working on sensitive issues, one logical approach is simply not to collect information you don't need. For example, if collecting evidence on corruption that won't be used in a court case, is it necessary to collect the names of individual sources or would a more anonymous description suffice? The question of what information to leave out for the sake of security can be a difficult one, especially as we often don't know what we are looking for when initially investigating an issue. It can be better to collect a broad swathe of data in order to later identify the subset that's interesting to us or that tells a compelling story.
It is useful to carry out a Privacy Impact Assessment to help make some of these decisions. The steps below can aid in assessing the privacy risks associated with your investigation.
Some of the key steps in a Privacy Impact Assessment
- Identify all of the personal information related to a programme or service and then look at how it will be used.
Apply this four-part test for the necessity and proportionality of highly intrusive initiatives or technologies:
- Is the measure demonstrably necessary to meet a specific need?
- Is it likely to be effective in meeting that need?
- Is the loss of privacy proportional to the need?
- Is there a less privacy-invasive way of achieving the same end?
Apply the ten privacy principles set out below:
- Accountability: There must be someone in charge of making sure privacy policies and practices are followed.
- Identifying purposes: Data subjects must be told why their personal information is being collected at or before the time of collection.
- Consent: Data subjects must give their consent to the collection, use and disclosure of their personal information.
- Limiting collection: Only information that is required should be collected.
- Limiting use, disclosure and retention: Personal information can only be used or disclosed for the purpose for which it was collected. Further consent is required for any other purposes. Personal information should be kept only as long as necessary.
- Accuracy: You must make every effort to reduce the risk that incorrect personal information is used or disclosed.
- Safeguards: You must protect personal information from loss or theft. You must create safeguards to prevent unauthorized access, disclosure, copying, use or modification.
- Openness: You must make your privacy policies readily available to the data subjects.
- Individual access: Data subjects have the right to ask to see any of their personal information held by you and they have the right to know who the information has been given to. They can challenge the accuracy of personal information and ask for corrections.
- Challenging compliance: Data subjects must be able to challenge your privacy practices.
- Map where personal data is sent after it is collected
- Identify privacy risks and the level of those risks.
- Find ways to eliminate privacy risks or reduce them to an acceptable level.
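The "limiting collection" principle above can also be enforced in any data-gathering script. The sketch below drops everything that is not on an explicit allow-list before a record is stored; the field names and allow-list are hypothetical, chosen only to illustrate the idea.

```python
# A sketch of the "limiting collection" principle: before storing a record,
# keep only the fields the investigation actually needs. The allow-list and
# field names here are hypothetical.

ALLOWED_FIELDS = {"date", "location", "incident_type"}

def minimise(record, allowed=ALLOWED_FIELDS):
    """Drop everything that is not explicitly needed, including any PII."""
    return {k: v for k, v in record.items() if k in allowed}

raw = {
    "date": "2015-03-02",
    "location": "city centre",
    "incident_type": "demolition",
    "source_name": "a local resident",   # PII we chose not to retain
    "source_phone": "000-0000",
}

print(minimise(raw))
```

Minimising at the moment of collection is safer than deleting later, because data that was never stored cannot be leaked, seized or subpoenaed.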
In this section we will suggest a number of free, open-source and easy-to-use tools that can help you gather, store and move your sensitive information in a secure way.
Securing data in motion
Investigations are not often carried out in isolation. We collaborate with many people remotely and send emails, chat and transfer files with a number of individuals and organisations. How are you storing information as you gather it? Is it stored statically on your computer or other devices, or is it uploaded to the internet, communicated via e-mail or otherwise transferred? It is essential to take precautions to store and send data securely, even as you gather it. This section looks at how to secure data when it is in motion and how to move your sensitive information in a secure way.
Searching within browsers:
Mozilla Firefox is an open-source web browser which enables add-ons to be installed for greater security.
To circumvent censorship and maintain anonymity while browsing, you can use the Tor Browser* or a trustworthy Virtual Private Network such as Riseup VPN.
- For a secure, open-source alternative to Skype or Viber, consider one of the many clients which enable “Off the Record” chat encryption such as Pidgin, Adium, or Jitsi, which also facilitates encrypted voice and video chat.
- Use PGP encryption* for the most secure email communication. A number of programmes can help you do this, including the Enigmail add-on for Mozilla Thunderbird and gpg4usb.
In the field:
What information are you carrying with you in the field? Do you need to carry your laptop with all your personal and professional information on it?
What pictures are on your camera or phone?
How great is the risk of theft or confiscation? Try to only bring what's necessary or take a laptop that has minimal information on it.
Disclaimer: Cryptography might be illegal or not possible in different countries. Before downloading or using these tools you should first check what is appropriate in your specific circumstances and consider all risks related to your choice.
Securing data at rest
Distinguishing sensitive information from non-sensitive information necessarily means that some people shouldn't have access to the sensitive part, wherever it is stored. Think about whether it is really necessary for everyone in the organisation to have access to your research. This is true for both printed and electronic material: just as sensitive printed documents should be separated from public documents and kept in a locked room or container, sensitive electronic documents should be stored securely, ideally using a file encryption programme. This ensures that even if the material is stored online, it can only be decrypted and accessed with a password. You can also use such a programme to create an encrypted backup of your sensitive material, ensuring that if you lose the backup, it won't be accessible to whoever finds it. Finally, if you have sensitive information that is no longer of use to you, you should securely destroy it. Mix paper documents with non-sensitive material and shred them together. Electronic documents need a similar treatment: it is not enough to simply put them into your recycle bin and empty it, since those files can easily be recovered. The only way to securely delete files is to use a program which overwrites them with randomly generated material.
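To make the overwrite-before-delete idea concrete, here is a minimal Python sketch (the function name is ours). It is illustrative only: on SSDs and journaling filesystems the original blocks may survive, which is exactly why dedicated tools such as Eraser exist.

```python
import os

def overwrite_and_delete(path, passes=3):
    """Overwrite a file with random bytes before removing it.

    Illustrative sketch only: on SSDs and journaling filesystems
    copies of the original blocks may survive, so do not rely on
    this for genuinely sensitive material - use a dedicated tool.
    """
    length = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(length))  # random data, not zeros
            f.flush()
            os.fsync(f.fileno())  # push the overwrite to disk
    os.remove(path)
```

The key contrast with "emptying the recycle bin" is that the file's contents are replaced on disk before the filesystem entry is removed, so simple recovery tools find only random bytes.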
Have a back-up plan:
Considering the multitude of threats to the well-being of our data – from power outages to malware infection, hardware failure and theft – the loss of some or all of it is ultimately more a certainty than a risk, regardless of issues of security or safety. The problem can be solved quite simply by putting a regular personal or organisational back-up strategy in place.
It may help to begin with mapping out what information you are storing (including originals and copies) and where it is stored (such as on a server, hard-drive, flash drive, online storage, etc.).
The question “How many days of work can I afford to lose?” may help guide you as to how regularly you should back up your information.
Remember to store your backups in a separate physical location to your master copies, especially as protection against theft or investigation.
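One habit that complements the mapping and scheduling above is verifying that a backup actually matches its master copy. A minimal sketch in Python, using checksums (the function names are ours):

```python
import hashlib

def sha256_of(path):
    """Hash a file in chunks so large backups don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_is_intact(master_path, backup_path):
    """A backup is only useful if it still matches the master copy."""
    return sha256_of(master_path) == sha256_of(backup_path)
```

Running a check like this after each backup run catches silent corruption or interrupted copies before you need the backup in an emergency.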
'Cloud storage', or storing originals or backups of files online, is becoming increasingly popular. While it does facilitate the sharing and remote access of files, there are a number of important things to keep in mind. First, if the cloud is the only place you are storing your backup, it will be useless should you lose internet connectivity. Second, how secure or vulnerable is the connection between your computer and the cloud servers (for example, does it use HTTPS?) and how much are you willing to trust the provider of the service not to hand over your information to adversaries should a legal or other type of request be made? In general, it's not recommended to store sensitive information online unless it is encrypted. Only store such information on your own servers or with a trusted provider.
- Encryption programmes such as VeraCrypt allow you to create encrypted file containers, encrypt the entire hard-drive of your computer, and much more. To recover files you previously deleted we recommend Recuva, whereas to make sure they are gone for good, you can use Eraser.
Protecting your sources:
What information are you storing about your sources or others involved in your study and the nature of their involvement? Is it truly necessary, and could it put them at risk?
Consider the places this information may be available like membership records, email inboxes and text messages. Delete anything that's not strictly necessary.
Have a strong passphrase:
- While all of the above tools are very strong from a technological perspective, many of them will only be as strong as the password or passphrase you use to interact with them. For easy and very effective ways of creating and maintaining stronger passwords, see Security in-a-Box. For a great free, open-source tool for storing and generating passwords, we recommend Keepass.
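The diceware-style approach recommended by resources like Security in-a-Box can be sketched in a few lines of Python. The sample wordlist here is purely illustrative; real diceware lists contain thousands of words, which is what makes the resulting passphrases hard to brute-force.

```python
import secrets

def make_passphrase(wordlist, words=6, sep="-"):
    """Pick random words with a cryptographically secure RNG.

    A chain of common words is easier to remember and, given a large
    enough wordlist, harder to guess than a short 'complex' password.
    """
    return sep.join(secrets.choice(wordlist) for _ in range(words))

# Illustrative only - a real wordlist has thousands of entries.
SAMPLE_WORDS = ["coral", "lantern", "orbit", "pepper", "quilt",
                "river", "saddle", "tundra", "velvet", "walnut"]
```

Note the use of the `secrets` module rather than `random`: passphrase generation needs a cryptographically secure source of randomness.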
Although social networking websites offer a powerful tool for collaborating on, publishing and spreading your research, there are a few points to keep in mind when taking advantage of this medium:
Read the End User Licence Agreement and privacy and data use policies carefully. How are your content and personal data treated? Who are they shared with?
Who owns the intellectual property you upload to their servers? Might you be granting the owners of the social network or online platform ownership of your research?
Where are the servers of the social network located, and to which jurisdiction are they subject? Could they be pressured to hand over your information?
You may wish to use social networking sites as an interactive publishing platform to solicit feedback, debate, disseminate results or gather information. You should consider whether you want these activities linked to your personal Facebook account or if it is best to create a fake account with different details which you access through anonymity tools in order to better protect yourself.
Keep in mind that everything you do on a social networking site leaves traces behind which can be handed over should a legal request be made and which can easily be compromised. You should consider whether the risk of giving away your own details, your working process, your sensitive information or your contacts and sources is really worth it.
Similar considerations should be made when using other publishing tools, such as WordPress, Blogger, or YouTube. Be aware of the information you are handing over to these service providers and the traces you are leaving behind. If anonymity is important to you, take the necessary measures to ensure it.
If you intend to publish images or documents in PDF or Office formats, be aware of metadata: the data contained within files, including information about the owner of the hardware or software (e.g. the authors of or collaborators on a document), the camera used and the location where a picture was taken. You may wish to reduce the amount of identifying metadata in your files by limiting the information about yourself that you provide when registering hardware and software products, and by adjusting the settings (such as GPS tagging) on your digital camera or smartphone.
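Image metadata is not hidden in any deep sense; it sits in plain sight in the file. A crude Python sketch (the function name is ours) that spots whether a JPEG carries an EXIF segment near its start:

```python
def jpeg_has_exif_segment(path):
    """Heuristic check for an EXIF (APP1) segment at the start of a JPEG.

    Real metadata covers far more (XMP, IPTC, maker notes...); this
    sketch only demonstrates that the information is readable by
    anyone who receives the file. Use a proper metadata viewer before
    publishing anything sensitive.
    """
    with open(path, "rb") as f:
        header = f.read(64)
    # JPEG files start with the SOI marker FF D8; EXIF data lives in
    # an APP1 segment tagged with the ASCII string "Exif".
    return header[:2] == b"\xff\xd8" and b"Exif" in header
```

A check like this can be a quick pre-publication tripwire, but a full metadata viewer or stripper should be the final word.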
Remember that when you register a website, and especially if you pay with a credit card, this information may be linked to you. If avoiding this is paramount, consider getting someone else to register the domain for you or think about using Tor's hidden services to host your site. All these decisions are about finding the right balance; we have to be aware that using anonymous services may also attract suspicion.
Working with online tools:
Many tools that are used for investigations are available in your browser, otherwise known as cloud-based tools. When you use verification tools such as Google's reverse image search and metadata viewers such as Fotoforensics, you are accepting a trade-off between easily accessible and useful tools and the privacy of your information. There are a number of questions you should ask to decide whether the trade-off is worth it:
Do you give up ownership of your information when you use these tools and online services?
Do you know which country your data and visualisations are located in, and what laws govern the use of your information?
Who else has access to the information that you put into these services? For instance, if you use the reverse image search tool TinEye, remember that this means you are uploading the image to their servers.
What happens to your information when you stop using these tools?
What happens to your data if the company who owns the online tool decides to stop providing this service to you?
Would interception of your use of these tools put yourself and your organisation in physical danger?
Data we leave behind while working with data
Carrying out research on the internet is an increasingly important aspect of data gathering. However, unbeknownst to many users, our online activities and communications are not anonymous or private by default. Many website administrators can trace visitors back to their very homes, and email and chat service providers can transmit our communications with as little privacy as a postcard. When we use off-the-shelf tools and social networking tools to work with others, we can lose control of our data, and the connections between us and others can be immediately accessible.
Your browsing history, cookies and temporary internet files provide a rich cache of information about your online research. Make sure they are either not collected, or delete them after each use.
Keep your visits to sensitive websites anonymous by using circumvention and anonymity tools such as a VPN, a proxy or Tor. Be sure you are aware of how anonymity tools are treated in your own context; in some cases using such a tool can itself raise a red flag, so research this for your specific environment first.
If you are soliciting information from the public via your own website, be sure that you provide them with a Secure Socket Layer (SSL, also known as HTTPS) connection over which to communicate.
Similarly, only communicate via e-mail providers which provide an SSL connection.
Encrypt your emails for the most secure email communication possible. Encrypting emails ensures they can only be read by their intended recipients; you can do this using PGP encryption. Again, be aware that in some countries the use of encryption may be legally restricted or outlawed, and the use of encrypted communications may attract attention to your activities. Analyse the environment you are working in first: if you are being monitored, will the sudden appearance of encrypted email raise suspicion or possible legal consequences?
Many common VOIP providers such as Skype and Viber do not provide any evidence that your communications are secured and may hand over logs of your chat and call history if requested by authorities. Remember that these tools store your call history by default, which may be a problem if your computer is confiscated. You can ensure your chats are secure by using open-source chat clients which facilitate encryption.
Even though it's tempting to use social networking tools because of their large reach, think carefully about if this is really the best way. If you don't want something to be public and accessible to everyone, avoid sending it over Facebook or Twitter.
Networks of associations: communication metadata
Communication metadata is often more valuable than the content of the phone call. This metadata provides information about who knows whom, the scale of networks, frequently visited locations and much more. All online communication leaves metadata traces behind throughout the communication process. As a general rule, limit the sensitive information you share over mobile phone calls and text messages.
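How much a bare call log reveals can be shown in a few lines of Python. The names and numbers below are invented for illustration; the point is that the social graph emerges without a single word of content:

```python
from collections import Counter

def contact_frequencies(call_log):
    """Count who contacted whom, given only metadata records of the
    form (caller, callee, duration_seconds) - never any content."""
    return Counter((caller, callee) for caller, callee, _ in call_log)
```

From even a short log, the strongest ties stand out immediately, which is why "it's only metadata" is a weak reassurance.

```python
# Hypothetical log: the pattern, not the content, is the disclosure.
log = [("alice", "bob", 320), ("alice", "bob", 45),
       ("alice", "clinic", 600), ("dave", "bob", 15)]
```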
- Me and My Shadow website
- Security in-a-box website
- For more information on communication metadata, read the interview with Michael Kreil, An Honest Picture of Metadata.
- The Crypto Law Survey website, which maps out existing and proposed laws and regulations on cryptography, is a good place to start.
- Paul Ohm's seminal paper, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymisation.”
Image created by John Bumstead
This chapter focuses on the ethics of data collection and the publishing of that data. We will look at two types of data collection: the data that you collect and the data that others collect that you use as part of your investigation. The final part of the chapter looks at the ethical considerations concerning publishing data, specifically focusing on reporting on research and monitoring impact.
This guide has introduced many different examples of the impacts that evidence and data can have. If we believe in the power of evidence and data and recognise them as the real currency of our work, then we have to examine the possible consequences of that very currency being used to undermine trust, destroy reputations and compromise the security of individuals, or of our enabling such a situation. Data is never neutral: sometimes that is in its nature, and sometimes its neutrality depends on the ways in which we treat it.
We or others often label data as 'sensitive' as a catch-all term for data that, for various reasons, either intuitive to us or because we were warned by others, might be sensitive. At the same time, we are often left with no methodology for judging what would make a compromise of that sensitivity fatal, and for whom. Such a compromise could happen in a variety of ways, intentional or otherwise: interested individuals, groups or institutions may invest time and resources in observing what is happening and who is involved, or even, in some cases, in preventing the information from reaching the public; sources may leak information; investigators may even leak their sources' information. The biggest compromise of all could be the loss of the data itself, because its sensitivity derives from its uniqueness, or simply the fact that it might not be verified, so that releasing it in raw form could lead to the wrong conclusions being drawn.
Depending on your context, there are different strategies you can adopt. It's important to consider the way in which you handle your data; how it's gathered, stored and shared and the ways you choose to publish your evidence. In some situations this could mean working with data and sources very privately, in other cases it may be that working in an open, public and transparent manner might be better. Or, indeed, a mix of both strategies may be appropriate, once you have defined which information and actions are 'sensitive' and which can be 'public'.
There are no automatically right or wrong answers for developing a strategy on how to handle your information and the security of your work when gathering information. Whatever policy is adopted, the important thing is that the choices you make are based on a good understanding of the context and the potential and future risks, as your investigation will become part of or even shape the current landscape. This is particularly true if you are looking into something that is, or could be controversial, investigating events or actions that are not widely documented or if you are gathering information that could provide new insights.
Working with data you have collected
When collecting data online or offline, there is a tendency, shared by many individuals as well as technology corporations, to collect everything, just in case it becomes useful in the future. Before thinking about collecting data for a new dataset, it's important to ask yourself, “What is the minimum amount of data I need to collect?” There is no single right answer to this question. What is important is to spend time really thinking about what you need.
There are no set rules on how to go about doing data investigations in an ethical way as assessing risk is always context specific. However, there are several things to consider:
Study risk assessment and safety planning: Before starting to collect data, it is important to think about online and offline risks that collecting data may pose both to you and others involved in the investigation as well as those who are giving their data. To learn more about how to do this, we recommend looking at the Protecting Data chapter in this guide.
Verbal informed consent: Whether collecting qualitative (e.g. through interviews) or quantitative (e.g. through surveys) data, it’s important that respondents consent to their stories being part of the dataset. Keep in mind that people can’t consent without knowing what they are consenting to. To address this, make sure to tell people about the research purpose before beginning to collect data. Those involved should be told the expected duration of their participation as well as any potential foreseeable risks, discomforts or benefits to participating in the research. Make sure to clearly explain how paper and digital records will be handled and maintained and also be clear about who will have access to them. Finally, talk to them about the intended uses of the research findings and offer contact details to them if they should have any questions or concerns after the data collection.
Social media: Data investigators conducting primary desk-based research are increasingly scraping user generated data from social media. We feel data can only be used if it’s collected in an ethical manner. To obtain user generated content without first obtaining the consent of users might be unethical. But what if end users have consented to their data being used and shared when they signed the Terms of Service (ToS) or End User License Agreements (EULAs) of a particular platform? What if end-users didn’t read or understand what they were agreeing to? Oftentimes ToS and EULAs are written in a complex and unclear way. As data investigators, we need to recognise that the extent to which users are actively able to consent often becomes questionable. This doesn’t mean it shouldn’t be done, however.
Take the case of the Brazilian campaign “Virtual Racism, Real Consequences”, which takes hate speech comments from Twitter and Facebook and uses a geo-location tool to find out the location of the person who posted the comment. Organisers want to raise the question: does a comment on the internet cause less damage than a direct offence? After taking a screenshot of the comment, the activists rent billboard space and post it for the public to see, taking the precaution of pixelating the user's name and profile picture to maintain the poster's anonymity. Users post hate speech voluntarily, but don't necessarily consent to having their words and images posted publicly on billboards.
An image of one of the billboards in Brazil created by Virtual Racism, Real Consequences
Supportive interview style: Sometimes interviews are difficult for those being interviewed, particularly when they bring up violence and structural repression. Be sure your questioning style is open and supportive, and ask your sources how they feel about the issues being raised in the interview. Your job as an interviewer includes upholding a psychologically safe space for them.
The manual Zen and the Art of Making Tech Work For You discusses recommendations and resources on creating safe spaces during interviews both online and offline.
Ongoing operational and logistical planning on location: Related to a supportive interview style is ensuring that those being interviewed feel physically safe while being interviewed. Be sure to ask the people you talk to where and at what time they feel most comfortable talking and be flexible about changing the location spontaneously to adapt to their shifting schedules or security concerns.
Confidential and anonymous data collection practices: The security and safety of those involved in research projects, and of the information they share, is crucial. Think about how you document information during and after interviews, even if you are not using a recording device. Make sure to code interviews so as not to reveal the names of those being interviewed, and keep all digital recording devices on you at all times to protect against an operational security breach. Additionally, once you as a data investigator have a dataset, it's important to think about how you hold and share that data. There is a tension between promoting transparency by publishing datasets and the fact that data can often be de-anonymised, meaning that publishing datasets could expose people to additional online or offline risks. To learn more about this, have a look at the Protecting Data chapter.
Maintaining consistency of data entry: Whether intentionally or not, defining tagging systems, indexing systems and putting data in context inherently builds in the personal biases of the data investigator. Data entry is interpretive: it represents a judgement made by the person entering the data. For example, two people may differ on whether someone is 'happy' or 'very happy' about something, or on something more serious, such as whether a human rights violation involves 'moderate violence' or 'severe violence'. It is important to have clear guidance to help people make these choices, and processes to check that everyone entering data is applying that guidance in the same way. You can address this by having:
A 'double entry' system where the same data is entered separately by two people. Where differences arise, the data is flagged as problematic.
Regular 'levelling' meetings where the people working on the data discuss ambiguous entries and agree on how they should be recorded.
A single person entering a particular field of data that requires some specialised knowledge.
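The 'double entry' check in particular is easy to automate. A minimal Python sketch (function name and sample codes are ours, for illustration):

```python
def flag_disagreements(coder_a, coder_b):
    """Compare the same records coded independently by two people.

    Each argument maps a record ID to the code that person entered;
    any record where the two codes differ is flagged for review.
    """
    return sorted(record for record in coder_a
                  if record in coder_b and coder_a[record] != coder_b[record])
```

Flagged records then feed directly into the levelling meetings described above, so disagreements are resolved by discussion rather than silently averaged away.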
Working with data others have collected
On 13 October 2015, Frontex, the European border agency, posted on Twitter that “the total number of migrants who crossed the EU’s external borders in the first nine months of this year rose to more than 710,000”, adding that this figure was much higher than the 282,000 recorded in all of 2014.
Tweet posted by @Frontex on 13 October 2015
After researchers requested clarification about its methodology, Frontex explained that it had counted an individual migrant each time they crossed an external border. “Irregular border crossings may be attempted by the same person several times in different locations at the external border,” the agency stated. “This means that a larger number of the people who were counted when they arrived in Greece were again counted when entering the EU for the second time through Hungary or Croatia.” News agencies, however, were unlikely to share this methodological note in their reporting. As a result, coverage of the Frontex report vastly overstated the number of people who actually entered the EU.
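The double-counting problem can be made concrete with a toy sketch. The person IDs below are hypothetical; real crossing records carry no such convenient key, which is precisely why the raw count overstates the number of people:

```python
def crossings_vs_people(records):
    """Each record is (person_id, entry_point).

    Counting records counts crossings (the Frontex figure);
    counting distinct person_ids counts people. The gap between
    the two is the double-counting the agency later explained.
    """
    return len(records), len({person for person, _ in records})
```

Whenever you inherit a dataset, asking "what is the unit of counting in each row?" is one of the cheapest and most revealing checks you can run.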
Those doing data investigations don’t always have the time, resources or skills to conduct their own research or create new datasets. Instead, what we see most often are investigations based on existing datasets compiled by other people, which may or may not be stripped of personally identifiable information or be published at an aggregate level.
What’s important to consider when using datasets compiled by others is that the indexing, categorisation and aggregation of datasets is not neutral and the datasets may not depict the situation you presume they are. Be critical of the data you are using when conducting data investigations.
Reporting on Research
Sparked by the killing of Mark Duggan by the Metropolitan Police, riots swept several London neighbourhoods and towns across the United Kingdom over five days in August 2011. Over 3,100 people were arrested, of whom over 1,000 were charged. By some estimates, the widespread looting, arson and destruction of private and police property tied to the rioting accounted for losses of over £200 million.
In the aftermath of the unrest, a team of researchers from The Guardian and the London School of Economics conducted a major research study to gain a better understanding of who those involved were, whether they engaged in violence, arson, attacks on police or looting, why they took to the streets, and how they felt about what they did and what occurred.
Researchers contacted over 1,000 individuals who had been arrested and charged during the time of the rioting. Overwhelmingly, though, researchers spoke with participants they found through word of mouth—people who participated in the riots but were not charged.
Image taken by Kerim Okten/EPA of the third day of the disturbances
This presented an ethical challenge: at the time research was conducted police were continuing to arrest and charge individuals they suspected of participating in the rioting. On the one hand, those involved wanted to share the stories of why they participated with researchers. On the other, researchers did not want to expose the research participants to risk of prosecution because of participating in the project.
In this case researchers did in fact go through a formal ethical review board, but we’re highlighting this research because they managed these ethical issues in an interesting way. Although researchers did collect basic demographic data in order to identify the backgrounds of those who participated, they told participants about the research they were carrying out and assured them that the findings would remain anonymous.
Researchers also made sure that those participating in research felt as comfortable as possible. They often conducted interviews at the homes of participants or at youth centres, cafés, coffee shops and fast-food restaurants—places where those being interviewed felt comfortable and that wouldn’t expose them to additional risk.
Researchers presented findings globally in a series of online articles and reports published in The Guardian. Seeing the need to 'give back' to those who participated in the research, but also to create a space where those who did not agree with the research findings could voice their concerns, a series of town hall-style debates was organised in the seven communities most affected by the rioting. The debates were attended by more than 600 people.
At the conclusion of projects it’s often important to present research findings to those who were involved in the research process. A workshop setting, for instance, can provide a space for reflection both for researchers and research participants as well as a space for feedback by participants to researchers. However, in some cases sharing research findings with those who participated may potentially increase the online and offline risks of those who participated. It is important to consider the context when determining how to go about doing your outreach efforts.
Video for Change, a global network of video advocacy projects, has developed an ethical framework and set of principles within which it becomes possible to evaluate the complex kind of work NGOs do. They have found that most NGOs tend to determine impact by focusing on outreach measures rather than, say, the process and practices of video advocacy, participation and engagement with communities and accountability. Video for Change finds this is due in part to the ability of new technologies and practices to measure the reception and audience of their work, including statistics on how many people have viewed the piece, where viewers are located, what type of device viewers are viewing the material on, how viewers came across the material and more.
Relying only on quantitative indicators for impact assessment is dangerous from a funding perspective, but it is also dangerous from a data investigation perspective. Without quantitative data, however, we are faced with a series of questions: How do we, as investigators, know that what we are doing is working?
How can we determine whether our research, as an intervention, is having an effect (positive or negative) on the communities we are partnering with? And who do we, as human rights defenders, hold ourselves accountable to?
Uptake in terms of scale and scope of research and materials is important in determining the success of a particular project. However, alternative measures of impact assessment are needed as well. In their ethical guidelines, Video for Change asks, what if “the people the video was meant to support were inadvertently harmed or hampered by making it or the community the video sought to support were against its release and felt it unjustly portrayed or represented them?” This element of advocacy-based practitioner research requires a strong ethical component.
For networks like Video for Change, relationships with communities are not kept at an objective distance; the strength of their work comes from participation, sharing and engagement. Thus, accountability to the communities they work with is both a part of their ethical practice as well as a gauge by which they measure their impact.
- The Engine Room runs the Responsible Data Forum, a collaborative effort to develop useful tools and strategies for dealing with the ethical, security and privacy challenges facing data-driven advocacy. They have a lot of great resources on how to use data responsibly, from data retention to data visualisation.
- Gabriella Coleman, Coding Freedom: The Ethics and Aesthetics of Hacking, read the PDF here.
- Jacob Metcalf. Ethics codes: History, context, and challenges, read the PDF here.
Decoding Data was produced by the Exposing the Invisible team of Tactical Technology Collective. All chapters were written and rewritten by a large team of contributors. The Networks chapter is the exception as the entire chapter was written by Mushon Zer-Aviv. We would like to thank: Fernanda Shirakawa, Gabi Sobliye, Jeff Deutch, Hadi Al Khatib, Leil-Zahra Mortada, Marek Tuszynski, Stf. Previously published content that was re-published in this Guide was compiled and written by Marek Tuszynski, Maya Indria Ganesh, Stephanie Hankey and Tom Longley.
Many thanks to John Bumstead who supplied all the Chapter images for this Guide, these images and others can be viewed here.
This guide would not have been possible without the support of the Omidyar Network.
The contents of Decoding Data have been chosen with the aim of providing advocates with a selection of tools and strategies chosen by practitioners in the field. Despite the fact that they have been chosen with the needs of this audience in mind, there is of course no ‘one size fits all’ solution. The legality, appropriateness and relevance of these tools will vary from one situation to another. In providing these tools, we advise you to select and implement them with a common sense approach. If you have any questions about appropriate use within your specific context or country, please seek the advice of a trusted local technical expert or request more information at firstname.lastname@example.org.
The tools and strategies referenced herein are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. In no event shall the Tactical Technology Collective or any agent or representative thereof be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption), however caused under any theory of liability, arising in any way out of the use of or inability to make use of this software, even if advised of the possibility of such damage.
Decoding Data, a product of Tactical Technology Collective, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All chapter images were created by John Bumstead and are not licensed under a Creative Commons License.