
Deep dive on scraping and parsing: reverse engineering a digital document to make the data in it more useful

(This post was first published by Tactical Tech in our Data & Design How-Tos on drawingbynumbers.org)

Most of us have come across information 'locked' in a PDF document. It is possible to unpick these documents and "reverse engineer" them to make their content easier to work with and analyse, using a technique called scraping and parsing. In the example below we look at data produced by a single organisation in Zimbabwe, but the ideas and techniques apply wherever a digital publication format gets in the way of using the data inside it, including extracting data from a website.

Zimbabwe Peace Project (ZPP), a Zimbabwe-based organisation, documents political violence, human rights violations and the politicised use of the emergency food distribution system. They have a nationwide network of field monitors who submit thousands of incident reports every month, covering both urban and rural Zimbabwe. Between 2004 and 2007, ZPP released comprehensive reports detailing the violence occurring in the country. The reports are dense PDFs and Microsoft Word documents that are digests of incidents, unique in their comprehensiveness. As documents, though, they are pretty inaccessible and get in the way of seeing what happened and how the situation changed over those years. With the data locked inside PDFs, it is hard to do anything else with it, such as search, filter, count or plot it on maps. What can we do about this?

All documents are arranged in a particular, pre-defined way. Whether they are reports or web pages, they will have a structure built from elements such as headings of different levels and paragraphs of body text.

Here's a single page from one of ZPP's reports about political violence in Zimbabwe in 2007 (PDF).

The structure of the above page can be broken down as follows:


# | Text in the report  | How does it appear in the report? | What type of data is it?
1 | Northern Region     | Heading 1                         | Geographic area (Region)
2 | Harare Metropolitan | Heading 2                         | Geographic area (District)
3 | Budiriro            | Heading 3                         | Geographic area (Constituency)
4 | A date              | Heading 4                         | Date (of incident)
5 | A paragraph         | Paragraph text                    | Text describing an incident
4 | A date              | Heading 4                         | Date (of incident)
5 | A paragraph         | Paragraph text                    | Text describing an incident

This structure repeats itself across the full document. You can see a regular, predictable pattern in the layout if you zoom out of the report and look at 16 pages at once:

So there's lots of data there, but we can't get at it. The report is very informative, containing the details of hundreds of incidents of politically-motivated violence. However, it has some limitations. For example, without going through the report and counting them yourself, it is impossible to find out which incidents happened on any specific day across Zimbabwe. This is because the information is not structured to make that easy: it is written in narrative form, and contained in a format that is hard to search.

To do something about this, the format of the information has to change to allow it to be searched better. Try to imagine this report as a spreadsheet:

Geographic area (Region) | Geographic area (District) | Geographic area (Constituency) | Date of incident | Incident
Northern Region | Harare Metropolitan | Budiriro | 4 Sep 2007 | At Budiriro 2 flats, it is alleged that TS, an MDC youth, was assaulted by four Zimbabwe National Army soldiers for supporting the opposition party.
Northern Region | Harare Metropolitan | Budiriro | 9 Sep 2007 | In Budiriro 2, it is alleged that three youths, SR, EM and DN, were harassed and threatened by Zanu PF youths, accused of organising an MDC meeting.
Northern Region | Harare Metropolitan | Budiriro | 11 Sep 2007 | Along Willowvale Road, it is alleged that AM, a youth, who was criticising the ruling party President RGM in a commuter omnibus to town, was harassed and ordered to drop at a police road block by two police officers who were in the same commuter omnibus.

A spreadsheet created in something like OpenOffice Calc or Microsoft Excel enables this information to be sorted, filtered and counted, which helps us explore it more easily.
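
To see why the spreadsheet form matters, here is a minimal sketch of the kind of question it answers in a few lines. It uses Python's standard csv module on a few rows paraphrased from the incidents above; in practice the CSV file would be the output of the scraper-parser described below.

```python
import csv
import io
from collections import Counter

# A few rows of the imagined spreadsheet, paraphrased from the report.
# In practice this text would come from a file produced by the scraper-parser.
CSV_TEXT = """\
region,district,constituency,date,incident
Northern Region,Harare Metropolitan,Budiriro,2007-09-04,Alleged assault of an MDC youth by soldiers
Northern Region,Harare Metropolitan,Budiriro,2007-09-09,Three youths allegedly harassed by Zanu PF youths
Northern Region,Harare Metropolitan,Budiriro,2007-09-11,Youth allegedly harassed at a police road block
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Count incidents per date -- exactly the question the PDF makes hard to answer.
per_date = Counter(row["date"] for row in rows)
print(per_date["2007-09-04"])  # 1

# Filter incidents down to a single constituency.
budiriro = [r for r in rows if r["constituency"] == "Budiriro"]
print(len(budiriro))  # 3
```

Once the data is in rows and columns, counting and filtering are one-liners; the same questions asked of the PDF require reading the whole document.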

However making this spreadsheet from the original ZPP reports would require lots of cutting and pasting – time that we don't have. So what can help us? If you can read it, a computer might be able to read it as well. Thankfully, documents that are created by computers can usually be “read” by computers.

With a little technical work, a report like the one in our example can be turned from an inaccessible PDF into a searchable and sortable spreadsheet. This is a form of machine data conversion. Knowing how this works can change how you see a big pile of digital documents. The computer programs that are used to convert data in this way are called scraper-parsers. They grab data from one place (scraping) and turn it into what we want it to be through filtering (parsing).

Scraper-parsers are automatic, super-fast copying and pasting programs that follow the rules we give them. The computer program doesn't "read" the report as we do, but it looks for the structure of the document, which as we saw above is quite easy to identify. We can then tell it what to do based on the elements, styles and layouts it encounters. Using the ZPP reports, our aim is to create a spreadsheet of the violent incidents, including when and where they happened. We would give the scraper the following rules:

- when you meet a Heading 1, remember it as the current Region
- when you meet a Heading 2, remember it as the current District
- when you meet a Heading 3, remember it as the current Constituency
- when you meet a Heading 4, remember it as the date of the incidents that follow
- when you meet a paragraph of body text, write a new row to the spreadsheet containing the current Region, District, Constituency and date, followed by the paragraph itself
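
A minimal sketch of such a scraper-parser, assuming an earlier step has already extracted the document as a list of (style, text) pairs; the style names and the shortened sample lines here are illustrative, not the output of any particular tool.

```python
import csv
import sys

# Assumed input: (style, text) pairs extracted from the report.
# The heading structure mirrors the page breakdown shown earlier.
LINES = [
    ("Heading 1", "Northern Region"),
    ("Heading 2", "Harare Metropolitan"),
    ("Heading 3", "Budiriro"),
    ("Heading 4", "4 Sep 2007"),
    ("Paragraph", "At Budiriro 2 flats, it is alleged that ..."),
    ("Heading 4", "9 Sep 2007"),
    ("Paragraph", "In Budiriro 2, it is alleged that ..."),
]

# Each heading level maps to one spreadsheet column.
STYLE_TO_COLUMN = {
    "Heading 1": "region",
    "Heading 2": "district",
    "Heading 3": "constituency",
    "Heading 4": "date",
}

context = {}  # the most recent region/district/constituency/date seen
rows = []
for style, text in LINES:
    if style in STYLE_TO_COLUMN:
        # A heading updates the current context...
        context[STYLE_TO_COLUMN[style]] = text
    else:
        # ...and each paragraph is one incident: emit a row with that context.
        rows.append([context.get(k, "") for k in
                     ("region", "district", "constituency", "date")] + [text])

writer = csv.writer(sys.stdout)
writer.writerow(["Region", "District", "Constituency", "Date", "Incident"])
writer.writerows(rows)
```

The whole trick is in the context dictionary: headings change it, paragraphs copy it out, and the repeating structure of the document does the rest.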

Once the rules are set, the scraper-parser can be run. Very quickly it will have gone through this 100-page document, pulling out the data you have told it to. The scraper might not get it right the first time, and there will be errors. The point is that you can improve a scraper-parser, run it hundreds of times and check by hand what it has put in your spreadsheet, and it will still be faster than re-typing the content yourself.

Scraper-parsers have to be written specially for each document because the rules will be different, though the task is the same. However, in most cases it is not a major challenge for a programmer; the challenge is for you to understand that it is possible, and to explain clearly what you want!

Dull, repetitive stuff is really what computers live for, so it makes them happy. You might think that it is not worth writing a scraper-parser for a one-off task. However, what if you have hundreds of documents, all with the same format, all containing information you want? In the Zimbabwe example, there are 38 reports produced over nearly 10 years. Each is dense, and in total they contain data on over 25,000 incidents of political violence. The format gets in the way of being able to use this data.
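
Because all the reports share one layout, the same rules can be run over a whole folder of files. A hedged sketch of that batch step: parse() here is a stand-in that just counts lines beginning with a marker word, and the file names are invented; a real version would contain the heading-based rules described above.

```python
import tempfile
from pathlib import Path

def parse(path: Path) -> int:
    """Stand-in for the real scraper-parser: count one line per incident."""
    return sum(1 for line in path.read_text().splitlines()
               if line.startswith("Incident"))

# Simulate a folder of reports with two tiny invented files.
with tempfile.TemporaryDirectory() as d:
    folder = Path(d)
    (folder / "zpp_report_2004_01.txt").write_text("Incident a\nIncident b\n")
    (folder / "zpp_report_2004_02.txt").write_text("Incident c\n")

    # The batch step: apply the same parser to every matching file.
    total = sum(parse(p) for p in sorted(folder.glob("zpp_report_*.txt")))

print(total)  # 3
```

The loop is trivial, which is the point: once one report can be parsed, 38 reports cost almost nothing extra.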
A scraper-parser can:

- copy and paste far faster, and more consistently, than a person working by hand
- apply the same set of rules across a whole document, or across hundreds of documents that share a format
- put what it finds into a structured form, such as a spreadsheet, that can be searched, sorted and counted

A scraper-parser can also:

- be corrected, improved and re-run as many times as needed
- produce output that is easy to check by hand, because it is a spreadsheet you can read

Scraping and parsing can be technical, but if you are trying to extract data that is already organised in a table, it is much easier and there are tools that can help you. To unlock data from more complicated layouts, you may need to get a programmer involved. Here are further resources that can help you deepen your understanding of this technique and try it yourself:
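
As an illustration of the easier, already-in-a-table case: when data sits in an HTML table on a web page, even Python's standard library can pull the rows out, and dedicated tools make it easier still. A sketch, using an invented two-row table:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every cell, row by row, from an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # row currently being built, if inside <tr>
        self._cell = None     # cell text fragments, if inside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Invented sample page fragment; in practice this would be fetched from a website.
HTML = ("<table><tr><th>Date</th><th>Incident</th></tr>"
        "<tr><td>4 Sep 2007</td><td>Alleged assault</td></tr></table>")

scraper = TableScraper()
scraper.feed(HTML)
print(scraper.rows)  # [['Date', 'Incident'], ['4 Sep 2007', 'Alleged assault']]
```

Because HTML already marks rows and cells explicitly, the "rules" here are almost nothing; the hard cases are documents like PDFs, where the structure has to be inferred from styles and layout instead.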