A deep dive into scraping and parsing: reverse engineering a digital document to make the data inside it more useful.
(This post was first published by Tactical Tech in our Data & Design How-Tos on drawingbynumbers.org)
Most of us have come across information 'locked' in a PDF document. It is possible to unpick these documents and "reverse engineer" them to make their content easier to work with and analyse, using a technique called scraping and parsing. In the example below we look at data produced by a single organisation in Zimbabwe, but the ideas and techniques apply anywhere a digital publication format gets in the way of using the data inside it. The same idea applies equally to extracting data from a website.
Zimbabwe Peace Project (ZPP), a Zimbabwe-based organisation, documents political violence, human rights violations and the politicised use of the emergency food distribution system. They have a nationwide network of field monitors who submit thousands of incident reports every month, covering both urban and rural Zimbabwe. Between 2004 and 2007, ZPP released comprehensive reports detailing the violence occurring in the country. The reports are dense PDFs and Microsoft Word documents, digests of incidents that are unique in their comprehensiveness. As documents, however, they are pretty inaccessible and get in the way of trying to see what happened and how the situation changed over those years. With the data locked inside PDFs, it is hard to do anything else with it, such as search, filter, count or plot it on maps. What can we do about this?
All documents are arranged in a particular, pre-defined way. Whether they are reports or web pages they will have a structure that includes:
- different types of data, such as text, numbers and dates.
- text styles like headings, paragraphs and bullet points.
- a predictable layout such as a heading, a sub-heading, then two paragraphs, another heading, and so on.
Here's a single page from one of ZPP's reports about political violence in Zimbabwe in 2007 (PDF).
The structure of the above page can be broken down as follows:
No. | How does it appear in the report? | What is it really? | What type of data is it?
---|---|---|---
1 | Northern Region | Heading 1 | Geographic area (Region)
2 | Harare Metropolitan | Heading 2 | Geographic area (District)
3 | Budiriro | Heading 3 | Geographic area (Constituency)
4 | A date | Heading 4 | Date (of incident)
5 | Paragraph | Text | Text describing an incident
4 | A date | Heading 4 | Date (of incident)
5 | Paragraph | Text | Text describing an incident
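Inside a scraper-parser, this mapping from layout to meaning becomes an explicit rule. As a minimal sketch (the style names below are placeholders, not read from the actual PDF), it could be written as a simple lookup table in Python:

```python
# Hypothetical lookup from the report's styles to what they actually mean.
# A real script would detect the styles from font size, boldness or
# position on the page rather than from named styles.
STYLE_MEANINGS = {
    "Heading 1": "Geographic area (Region)",
    "Heading 2": "Geographic area (District)",
    "Heading 3": "Geographic area (Constituency)",
    "Heading 4": "Date (of incident)",
    "Text": "Text describing an incident",
}
```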
This structure repeats itself across the full document. You can see a regular, predictable pattern in the layout if you zoom out of the report and look at 16 pages at once:
So there's lots of data there, but we can't get at it. The report is very informative, containing the details of hundreds of incidents of politically-motivated violence. However, it has some limitations. For example, without going through the report and counting them yourself, it is impossible to find out which incidents happened on any specific day across Zimbabwe. The information is not structured to make that easy: it is written in narrative form and contained in a format that is hard to search.
To do something about this, the format of the information has to change to allow it to be searched better. Try to imagine this report as a spreadsheet:
Geographic area (Region) | Geographic area (District) | Geographic area (Constituency) | Date of incident | Incident
---|---|---|---|---
Northern Region | Harare Metropolitan | Budiriro | 4 Sep 2007 | At Budiriro 2 flats, it is alleged that TS, an MDC youth, was assaulted by four Zimbabwe National Army soldiers for supporting the opposition party.
Northern Region | Harare Metropolitan | Budiriro | 9 Sep 2007 | In Budiriro 2, it is alleged that three youths, SR, EM and DN, were harassed and threatened by Zanu PF youths, accused of organising an MDC meeting.
Northern Region | Harare Metropolitan | Budiriro | 11 Sep 2007 | Along Willowvale Road, it is alleged that AM, a youth, who was criticising the ruling party President RGM in a commuter omnibus to town, was harassed and ordered to drop at a police road block by two police officers who were in the same commuter omnibus.
A spreadsheet created in something like OpenOffice Calc or Microsoft Excel enables this information to be sorted, filtered and counted, which helps us explore it more easily.
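To see why this matters, here is a minimal sketch of the kind of question the spreadsheet can answer instantly and the PDF cannot. It assumes the incidents have been saved to a CSV file with the columns shown above; the filename is hypothetical:

```python
import csv

# Load the incidents from a CSV file (hypothetical filename) whose
# columns match the spreadsheet sketched above.
with open("incidents.csv", newline="") as f:
    incidents = list(csv.DictReader(f))

# Something the narrative PDF makes nearly impossible: count everything
# that happened on one specific day, across the whole country.
on_day = [row for row in incidents if row["Date of incident"] == "9 Sep 2007"]
print(len(on_day), "incident(s) on 9 Sep 2007")
for row in on_day:
    print(row["Geographic area (Constituency)"], "-", row["Incident"])
```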
However, making this spreadsheet from the original ZPP reports would require lots of cutting and pasting – time that we don't have. So what can help us? If you can read it, a computer might be able to read it as well. Thankfully, documents that are created by computers can usually be "read" by computers.
With a little technical work, a report like the one in our example can be turned from an inaccessible PDF into a searchable and sortable spreadsheet. This is a form of machine data conversion. Knowing how this works can change how you see a big pile of digital documents. The computer programs that are used to convert data in this way are called scraper-parsers. They grab data from one place (scraping) and turn it into what we want it to be through filtering (parsing).
Scraper-parsers are automatic, super-fast copying and pasting programs that follow the rules we give them. The computer program doesn't "read" the report like we do, but it looks for the structure of the document, which as we saw above is quite easy to identify. We can then tell it what to do based on the elements, styles and layouts it encounters. Using the ZPP reports, our aim is to create a spreadsheet of the violent incidents, including when and where they happened. We would give the scraper the following rules (a code sketch after the list shows what they might look like in practice):
- Rule 1: If you see a heading that is a) at the top of a page and b) in bold capitals, assume it is a Geographical area (Region) and put what you find in Column 1 of the spreadsheet.
- Rule 2: After seeing a Geographical area (Region), assume that everything that follows happened in that region, until you see another heading at the top of a page in bold capitals that is different from the previous one.
- Rule 3: Until then, whenever you see a paragraph of text with one blank line above it and one beneath it, preceded by a date in the form "Day Month Year", treat it as an incident that happened in that region and copy it to the column called "Incident".
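To make this concrete, here is what those rules might look like as a minimal Python sketch. It assumes the PDF has already been converted to plain text (for example with a tool such as pdftotext), that "bold capitals" survive in the text as ALL CAPS, and that dates sit on their own lines; the filename and date format are assumptions, not details taken from the real ZPP reports.

```python
import csv
import re

# A line that looks like a date, e.g. "4 Sep 2007" (assumed format).
DATE_LINE = re.compile(r"^\d{1,2} [A-Z][a-z]+ \d{4}$")

def parse_report(lines):
    """Apply Rules 1-3 line by line, collecting one row per incident."""
    region = None   # Rule 2: the current region "sticks" until replaced
    date = None
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.isupper() and line != region:  # Rule 1: a new region heading
            region = line
            date = None
        elif DATE_LINE.match(line):            # Rule 3: a date starts an incident
            date = line
        elif region and date:                  # Rule 3: the paragraph beneath it
            rows.append({"Region": region, "Date": date, "Incident": line})
    return rows

# A real script would add similar rules for the District and Constituency
# headings, and join paragraphs that run over several lines; this sketch
# keeps only the Region to stay short.
with open("zpp_report_sep_2007.txt") as f:     # hypothetical filename
    incidents = parse_report(f)

with open("incidents.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Region", "Date", "Incident"])
    writer.writeheader()
    writer.writerows(incidents)
```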
Once the rules are set, the scraper-parser can be run. Very quickly it will have gone through this 100-page document pulling out the data you have told it to. The scraper might not get it right the first time, and there will be errors. The point is that you can improve a scraper-parser, run it hundreds of times and check by hand what it has put in your spreadsheet, and it will still be faster than trying to re-type the content yourself.
Scraper-parsers have to be written specially for each document because the rules will differ, even though the task is the same. In most cases, however, this is not a major challenge for a programmer; the challenge is for you to understand that it is possible, and to explain clearly what you want!
Dull, repetitive stuff is really what computers live for, so it makes them happy. You might think it is not worth writing a scraper-parser for a one-off task. However, what if you have hundreds of documents, all with the same format, all containing information you want? In the Zimbabwe example, there are 38 reports produced over nearly 10 years. Each is dense, and together they contain data on over 25,000 incidents of political violence. The format gets in the way of being able to use this data.
A scraper-parser can:
- Go through all 38 documents you tell it to, whether on your computer or on the internet (scraper-parsers can browse the internet as well).
- Pull out the data that you tell it to, based on the rules that you make for it.
- Copy all that data into a single spreadsheet (a rough sketch of these steps follows below).
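Continuing the earlier sketch, those steps might look something like this. The folder name is hypothetical, and parse_report() is the function defined above:

```python
import csv
import glob

# Run the same rules over every report and pool the results in one CSV.
all_rows = []
for path in sorted(glob.glob("zpp_reports/*.txt")):  # hypothetical folder
    with open(path) as f:
        all_rows.extend(parse_report(f))             # the sketch shown earlier

with open("all_incidents.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Region", "Date", "Incident"])
    writer.writeheader()
    writer.writerows(all_rows)
```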
A scraper-parser can also:
- Check each day on the website where ZPP publishes its reports and, if there is a new one, download it, email you to let you know, and add it to the list of reports it "reads" into your spreadsheet.
- Include new columns for the date the report was published, and the page number where the incident was recorded in the report (so you can check the data has come across properly).
- Change the format of every date for you, e.g. from 27 September 2004 to 27/09/2004 (sketched in code after this list).
- Automatically turn the spreadsheet into an online spreadsheet (like Google Spreadsheets) that can be shared freely online, and update it when data from a new report becomes available.
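The date conversion, for example, takes only a few lines with Python's standard library. The input format is an assumption about how dates appear in the reports:

```python
from datetime import datetime

# Turn "27 September 2004" into "27/09/2004"; the input format
# "%d %B %Y" is an assumed convention, not taken from the reports.
def reformat_date(text):
    return datetime.strptime(text, "%d %B %Y").strftime("%d/%m/%Y")

print(reformat_date("27 September 2004"))  # prints 27/09/2004
```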
Scraping and parsing can be technical, but if the data you are trying to extract is already organised in a table, the job is much easier and there are tools that can help you. To unlock data from more complicated layouts, you may need to get a programmer involved. Here are further resources that can help deepen your understanding of this technique, and help you give it a try yourself:
- School of Data offers various courses on scraping: how to scrape data from websites without using code, how to scrape data from tables in PDFs, and more advanced tutorials for programmers.
- Scraperwiki is a tool to unlock data held in PDFs, tweets and websites. Great for technical and non-technical users alike.
- ProPublica produced a [guide on how they used scrapers](http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data) to collect data showing the connections between pharmaceutical companies and doctors in the US.