Public data come in handy at times. Meaningful investigations have made use of everything from live flight tracking data to public registries of companies to lobbying disclosures to hashtags created on Twitter, among countless other examples. While we are clearly seeing a proliferation of such information online, it is not always available in a format we can use. Leveraging public data for an investigation typically requires that it be not only machine readable but structured. In other words, a PDF that contains a photograph of a chart drawn on a napkin is less useful than a Microsoft Excel document that contains the actual data presented in that chart.
This can be an issue even in cases where governments are compelled by Freedom of Information (FoI) laws to release data they have collected, maintained or financed. In fact, governments sometimes obfuscate data intentionally to prevent further analysis that might reveal certain details they would rather keep hidden.
Investigators who work with public data face many obstacles, but two of the most common are:
- Horrendous, multi-page HTML structures embedded in a series of consecutive webpages; and
- Pleasantly formatted tables trapped within the unreachable hellscape of a PDF document.
This guide seeks to address the first challenge. It presents a series of steps that can be used to automate the collection of online HTML tables and the transformation of those tables into a more useful format. This process is often called "Web scraping." In the examples discussed below, we will produce comma separated values (CSV) documents ready to be imported by LibreOffice Calc or Microsoft Excel. (For advice on dealing with PDF tables, have a look at this article and watch this space for an upcoming guide on Tabula, a PDF scraping tool.)
The structure of this guide
This guide has four sections. The first discusses our rationale for choosing the tools and techniques covered in the rest of the guide. The second section is a brief primer on HTML structure, CSS selectors (short sequences of keywords that identify one or more specific elements on a webpage) and how to use the Tor Browser's built-in Element Inspector to learn what you need to know for section three. In the third section, we walk through the process of plugging those selectors into Scrapy, pulling down HTML data and saving them as a CSV file. Finally, we present three external web tables that are both more interesting and more complex than the example used previously.
If you don't need convincing, you should probably skip down to the section on "Using Scrapy and Tor Browser to scrape tabular data".
Who is this guide for?
The short answer is, anyone with a Debian GNU/Linux system — be it a computer, a virtual machine or a boot disk — who is willing to spend most of a day learning how to scrape web data reliably, flexibly and privately. And who remains willing even when they find out that less reliable, less flexible and less secure methods are probably less work.
More specifically, the steps below assume you are able to edit a text file and enter commands into a Linux Terminal. These steps are written from the perspective of a Tails user, but we have included tips, where necessary, to make sure they work on any Debian system. Adapting those steps for some other Linux distribution should be quite easy, and making them work on Windows should be possible.
This guide does not require familiarity with the python programming language or with web design concepts like HTML and CSS, though all three make an appearance below. We will explain anything you need to know about these technologies.
Finally, this guide is written for variations of the Firefox Web browser, including the Tor Browser and Iceweasel. (From here on out, we will refer specifically to the Tor Browser, but most of the steps we describe will work just fine on other versions of Firefox.) Because of our commitment to Tails compatibility, we did not look closely at scraping extensions for Chromium, the open-source version of Google's Chrome web browser. So, if you're using Windows or a non-Tails Linux distribution — and if you are not particularly concerned about anonymity — you can either use Firefox or you can have a look at the Web Scraper extension for Chromium. It's a few years old, but it looks promising nonetheless. It is free and open-source software, licensed under the GNU General Public License (GPL) just like Scrapy and the other tools we recommend in this guide.
In defense of The Hard Way
If you have skimmed through the rest of this guide, you might have noticed a startling lack of screenshots. There are a few, certainly, but most of them just show the Tor Browser's built-in Inspector being used to identify a few inscrutable lines of poetry like td.col1 div.resource > a::attr(href). And if that stanza gives you a warm fuzzy feeling, you might consider skipping down to the section on "Using Scrapy and Tor Browser to scrape tabular data". But if it looks a bit intimidating, please bear with us for a few more paragraphs.
Put the question to your favourite search engine, and you will find any number of graphical web scraping tools out there on the Internet. With varying degrees of success, these tools provide a user interface that allows you to:
- Pick and choose from the content available on a webpage by pointing and clicking;
- Download the content you want; and
- Save it in the format of your choosing.
All of which sounds great. But if you start digging into the details and test driving the software, you will find a number of caveats. Some of these tools are commercial, closed source or gone in a puff of smoke.
Some of them have limited functionality, or work for a limited time, until you pay for them. Some of them are cloud-based services run by companies that want to help you scrape data so they can have a copy for themselves. Some of them were written using an outdated and insecure browser extension framework. Some of them only work on very simple tables. Some of them don't know how to click the "next" button. Some of them ping Google analytics. That sort of thing.
And none of this ought to surprise us. It takes work to develop and maintain software like this, and the data industry is a Big Deal these days. Of course, it might not matter for all of us all of the time. Plenty of good investigative work has been done by chaining together half a dozen "15 Day Trial" licenses. But our goal for these guides is to provide solutions that don't require you to make those sorts of trade-offs. And that don't leave you hanging when you find yourself needing to:
- Scrape large quantities of data, incrementally, over multiple sessions;
- Parse complex tables;
- Download binary files like images and PDFs;
- Hide your interest in the data you are scraping; or
- Stay within the legal bounds of your software licenses and user agreements.
Why you might want to scrape with Tails
There are a number of reasons why you might want to use Tails when scraping web data. Even if you intend to release that data — or publish something that would reveal your access to it — you might still want to hide the fact that you or your organisation are digging for information on a particular subject. At least for a while. By requesting relevant pages through Tor, you retain control over the moment at which your involvement becomes public. Until then, you can prevent a wide variety of actors from knowing what you are up to. This includes other people at your internet cafe, your internet service provider (ISP), surveillance agencies, whoever operates the website you are scraping and their ISP, among others.
Tails also helps protect your research, analysis and draft outputs from theft and confiscation. When you enable Persistence on Tails, you are creating a single folder within which you can save data, and that folder's contents are guaranteed to be encrypted. There are other ways to store encrypted data, of course, but few of them make it this difficult to mess up. Tails disks are also small, easy to backup and even easier to throw away. Sticking one in your pocket is sometimes a nice alternative to traveling with a laptop, especially on trips where border crossings, checkpoints and raids might be a cause for concern.
More generally, Tails is an extremely useful tool if you are looking to keep your sensitive investigative work separate from your personal activities online. Even if you only have one laptop, you can effectively compartmentalise your high risk investigations by confining your acquisition and analysis of related data to your Tails system. And even if you are scraping banal data from a public website, it is worth considering whether you should have to make that decision every time you start poking around for a new project. Unless the data you are seeking cannot be accessed through Tor (which does happen), there are very good reasons to err on the side of compartmentalisation.
Using Scrapy and Tor Browser to scrape tabular data
Scraping web data reliably and flexibly often requires two steps. Unless you are willing and able to use an all-in-one graphical scraper, you will typically need to:
- Determine the selectors that identify the content you want (and only the content you want), then
- Use those selectors to configure a tool that is optimised for extracting content from webpages.
In the examples presented below, we will rely on Tor Browser to help us with the first stage and Scrapy to handle the second.
Identifying selectors
Before we go digging for selectors, we will start with a brief introduction to Hyper-Text Markup Language (HTML) and Cascading Style Sheets (CSS). Feel free to skip it. And, if you don't skip it, rest assured that, when you start using these tools for real, your web browser will do most of the heavy lifting.
A remarkably short HTML tutorial
By the time they reach your web browser, most websites arrive as collections of HTML pages. The basic underlying structure of an HTML document looks something like this:
<html>
<body>
<p>Page content...</p>
</body>
</html>
HTML tables are often used to format the content of these pages, especially when presenting data. Here's an example:

Title of Column one | Title of Column two
---|---
Page one, row one, column one | Page one, row one, column two
Page one, row two, column one | Page one, row two, column two

Previous page - Next page
If we add the table and the navigation links above to our simplified HTML page, we end up with the following collection of elements:
<html>
<body>
<table>
<thead>
<tr>
<th><p>Title of Column one</p></th>
<th><p>Title of Column two</p></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p>Page one, row one, column one</p></td>
<td><p>Page one, row one, column two</p></td>
</tr>
<tr class="even">
<td><p>Page one, row two, column one</p></td>
<td><p>Page one, row two, column two</p></td>
</tr>
</tbody>
</table>
<p><a href="/guides/scraping">Previous page</a> - <a href="/guides/scraping-2">Next page</a></p>
</body>
</html>
For our purposes here, the structure is more important than the meaning of the elements themselves, but a few of those "tags" are a little obscure, so:
- <p> and </p> begin and end a paragraph
- <a> begins a clickable link (and </a> ends it)
- The text ("Next page") between the <a> and </a> tags is the thing you click on
- The href="/guides/scraping-2" inside the <a> tag tells your browser where to go when you click it
- <tr> and </tr> begin and end a row in a table
- <td> and </td> begin and end a cell within a table row
- <th> and </th> are just like <td> and </td>, but they are meant for table headings
- <table> and </table> you already figured out
We will discuss how to view the HTML behind a webpage later on, but if you want to have a look now, simply right-click the table above and select "Inspect Element." The actual HTML is slightly more complex than what is shown above, but the similarities should be clear. (You can close the "Inspector" pane by clicking the tiny X in its upper, right-hand corner.)
A relatively short CSS tutorial
In order to make some (but not all) HTML elements of a particular type behave in a particular way, they can be assigned a class, an id or some other "attribute." Here is a slightly different version of the table above:

Title of Column one | Title of Column two
---|---
Page one, row one, column one | Page one, row one, column two
Page one, row two, column one | Page one, row two, column two

Previous page - Next page
And here's what that might look like:
<html>
<body>
<table class="classy">
<thead>
<tr>
<th><p>Title of Column one</p></th>
<th><p>Title of Column two</p></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td class="special"><p>Page one, row one, column one</p></td>
<td><p>Page one, row one, column two</p></td>
</tr>
<tr class="even">
<td class="special"><p>Page one, row two, column one</p></td>
<td><p>Page one, row two, column two</p></td>
</tr>
</tbody>
</table>
<p><a id="prev_page" href="/en/guides/scraping">Previous page</a> - <a id="next_page" href="/en/guides/scraping2">Next page</a></p>
</body>
</html>
Notice the class="special" inside some (but not all) of the <td> tags. At the moment, these cells are not particularly special. They're just grey. And those prev_page and next_page elements look just like any other link. But that's not the point. The point is that we can use these "attributes" to identify certain types of data in a consistent way. When writing CSS rules, web designers rely on this technique to apply graphical styling to webpages. We will use it to extract data.
To do this, we need to determine the CSS Selectors for the content we care about.
Let's start with that id="next_page". There are several different selectors we could use to identify the "Next page" link in the example above. The longest and most specific is:

html > body > p > a#next_page

But, since there's only one html document and it only has one body, we could shorten this whole thing to p > a#next_page. Then again, since there is only one next_page, we could just use a#next_page. Or even #next_page. All of these are perfectly valid selectors.
At this point, you may be wondering what all those symbols mean. As you probably noticed, we put a # at the beginning of an id value when we refer to it. And the > symbol represents a "child" relationship. It can be used to help identify the right-most element as precisely as possible. Looking at the HTML snippet above, you can see that the <a> we want is the child of a <p>, which is the child of a <body>, which is the child of an <html>.
Below are a few examples that might help (but that you can safely ignore if they don't):
- p would select every paragraph in the example above
- th > p would only select the column headings
- td > p would only select the contents of the four main table cells
- body > p would select both "Previous page" and "Next page" links
If we separate two elements with a space, rather than a >, it signifies a "descendant" relationship. This allows us to skip some of the steps in the middle. So body a#next_page is valid, even though it leaves out the p in between, whereas body > a#next_page is invalid (which just means it will fail to "select" anything if used). Descendant relationships cast a wider net. For example, body > p will only select the links at the bottom, whereas body p will select every paragraph on the page.
So what about our class="special" attribute? You can use a class the same way you'd use an ID, but with a full stop (.) instead of a #. So, the following would all select the same data. See if you can figure out what it is:

td.special > p
.special > p
.special p
table td.special > p

Answer: the contents of all of the main table cells in column one.
Understanding which selectors you need
In order to scrape data from an HTML table on a webpage, you will need one selector that identifies all of the rows in that table and one selector for each column. So, to scrape the second (green) table above, you would need three selectors. There are many variations that would work, but here is one possibility:
- Row selector: html body table.classy > tbody > tr
- Column one selector: td.special > p
- Column two selector: td:nth-child(2) > p
We will discuss how to obtain these values shortly, but first:
- Why don't the column selectors begin with html body or at least table.classy tr?
- What does nth-child(2) mean?
- What about tables that are spread over multiple pages?
1. A column selector is applied to a single row, not to the entire page:

In the following section, when we finally start pulling down data, each of your column selectors will be used once for each row identified by your row selector. Because of the way we have written our scraping file, the column selectors should be relative to the row they are in. So they should not begin with tr or anything "above" tr.
2. The nth-child(n) trick:

Depending on the configuration of the table you are scraping, you might not have to use this trick, but it can be quite helpful when dealing with tables that do not include class="whatever" attributes for every column. Just figure out which column you want, and stick that number between the parentheses. So, in this example, td:nth-child(1) is the same as td.special. And td:nth-child(2) means "the second column." We use this method here because we have no other way to match the second column without also matching the first.
3. Scraping a multi-page table:

If you want your scraper to "click the next page link" automatically, you have to tell it where that link is. Both of the example tables above have a next page link beneath them. The link under the green table is easy: a#next_page is all you need. The link under the first table is a little trickier, because it does not have a class, and most pages will include at least one other <a>...</a> link somewhere in their HTML. In this simplified example, html body > p:nth-child(1) > a would work just fine because the link is in the first paragraph that's inside <body> but not inside the <table>.
Determining these selectors is the most difficult part of scraping web data. But don't worry, the Tor Browser has a few features that make it easier than it sounds.
Using the Tor Browser's Inspector to identify the selectors you need
If you are using a Linux system that does not come with the Tor Browser pre-installed, you can get it from the Tor Project website or follow along with the Tor Browser Tool Guide on Security-in-a-Box.
Follow the steps below to practice identifying selectors on a real page.
Step 1: The row selector
Right-click one of the data rows in the table you want to scrape and select Inspect element. We will start by looking at the green table above. This will open the Tor Browser's Inspector, which will display several nearby lines of HTML and highlight the row on which you clicked:
Now hover your mouse pointer over the neighboring rows in the Inspector while keeping an eye on the main browser window. Various elements on the page will be highlighted as you move your pointer around. Find a line of HTML in the Inspector that highlights an entire table row (but that does not highlight the entire table.) Click it once.
The content indicated by the red box in the third screenshot thumbnail above (table.classy > tbody > tr.odd) is almost your row selector. (You might have to click the arrows to see the whole thing.) But remember, you need a selector that matches all of the table rows, not just the one you happened to click. If you get the same value for two consecutive rows, simply copy it down and you're probably done. Otherwise, you might have to do a bit of manual editing. Specifically, you might have to remove a few #id or .class values that narrow your selection down to a particular row, or a subset of rows, rather than matching every row.
Now, you might think you could just right-click something to copy this value. And eventually you will be able to do just that, as Firefox version 53 will implement a "Copy CSS Selector" option. For now, unfortunately, you will have to write it down manually.
In the example above, the full value of that row with the red box in the Inspector is:

html > body > div.container.main > div.row > div.span8.blog > table.classy > tbody > tr.odd

But there is only one table.classy on the page, so we can ignore everything before that. Which leaves us with:

table.classy > tbody > tr.odd

And the second row ends with tr.even instead of tr.odd, so we need to make one more adjustment to obtain our final row selector:

table.classy > tbody > tr
If you haven't already tried it yourself:

- Right-click a data row in the green table above,
- Select Inspect Element,
- Hover your mouse pointer over the neighboring lines of HTML in the Inspector,
- Find one that highlights the entire table row and click it,
- Note the selector in the top row of the Inspector,
- Click the next table data row in the Inspector, and
- If both selectors are identical, write that down. If they are not, try to come up with a more generalised selector and use the information below to test it.
Step 2: Testing your selector
You can test your selectors using another built-in feature of the Tor Browser. This time, we will be looking at the Inspector's right-hand pane, which is highlighted below. If your Inspector only has one window, you can click the tiny "Expand pane" button near its upper, right-hand corner.
Now you can:
- Right-click anywhere in the new pane and select Add new Rule. This will add a block of code like the one shown in the second screenshot thumbnail above. (The highlighted bit will depend on what's currently selected in the Inspector.)
- Delete the highlighted contents, paste in the selector you want to test and press <Enter>.
- Click the nearly-invisible "Show elements" crosshairs just to the right of your new "Rule." This will turn the crosshairs blue and highlight all of the page elements that match your selector.
- If this highlighting includes everything you want (and nothing you don't), then your selector is good. Otherwise, you can modify the selector and try again.
When testing your row selector, in particular, you want all of the table rows to be highlighted.
You can click the crosshairs again to cancel the highlighting. You can close the right-hand Inspector pane by clicking the "Collapse pane" button. These new "Rules" are temporary and will disappear as soon as you reload the page.
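If you prefer working in a Terminal, Scrapy (which we install later in this guide) also ships with an interactive shell that fetches a page and lets you try selectors against it. Something along the following lines should work, although the exact results will depend on the page you load; the URL here is just an example:

# Launched from a Terminal with, for example:
#   torsocks -i scrapy shell 'https://exposingtheinvisible.org/en/guides/scraping'
#
# Scrapy then opens a Python prompt with a response object for that page.
len(response.css('table.classy > tbody > tr'))   # should equal the number of data rows
response.css('td.special').extract()             # should list the column-one cells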
Step 3: The "next page" selector
If the table data you want to scrape is spread over multiple pages, you will have to configure a next page selector as well. Fortunately, determining this selector is usually much easier. Because you only need to select a single element, you can often use the "Copy Unique Selector" option in the Tor Browser's Inspector as shown below.
To do this, simply:
- Right-click the next page link on the first page containing the table you want to scrape,
- Select Inspect Element,
- Hover your mouse pointer over nearby HTML lines in the Inspector until you find the correct <a>...</a> link, and
- Right-click that element in the Inspector and select Copy Unique Selector.
You may be able to use this selector as it is. If you want to make sure, just click the "next page" link manually and follow the same steps on the second page. If the selectors are the same, they should work for all pages that contain table data.
If you have to modify the selector — or if you choose to shorten or simplify it — you can test your alternative using the method described in Step 2, above. Don't worry if your final selector highlights multiple next page links, as long as it doesn't highlight anything else. The Scrapy template we recommend below only pays attention to the first "match."
Try following these steps for the "next page" link just beneath the green table above. The Tor Browser's Copy Unique Selector option should produce the following:
#next_page

(Which is equivalent, in this case, to a#next_page.) And, if you test this selector, you should see that clicking the Show elements crosshairs only highlights this one page element.
Step 4: The column selectors
You will often need to identify several column selectors. The process for doing so is nearly identical to how we found our row selector. The main differences are:
- You will need a selector for each column of data you want to scrape, and
- Your selectors should be relative to the row, so they will not begin with segments like html, table, tr, etc.
For the first column of our green table, the Tor Browser's Inspector gives us:
html > body > div.container.main > div.row > div.span8.blog > table.classy > tbody > tr > td.special > p
If we remove everything up to and including the row (tr) segment, we end up with:

td.special > p
If we test this selector by expanding the Inspector, adding a new "rule" for it and clicking the Show elements crosshairs, as discussed in Step 2, we should see highlighting on both cells in column one.
As mentioned above, the second column requires a bit of cleverness because it has no class. If you test the following selector, though, you will see how we can use the nth-child(n) trick to get the job done:

td:nth-child(2) > p
Something to keep in mind when testing column selectors: as mentioned above, a column selector is relative to its "parent" row. As a result, when you are testing your column selectors, you might occasionally see highlighted content elsewhere on the page. This is fine, as long as nothing else in the table is highlighted. (If you're concerned about this, you can stick your row selector on the front of your column selector and test it that way. In this example, we would use the following:

table.classy > tbody > tr td:nth-child(2) > p

Just be sure not to use this combined, row-plus-column selector later on, when we're actually trying to scrape data.)
Step 5: Selector suffixes
There is one final step before we can start scraping data. The selectors discussed above identify HTML elements. This is fine for row selectors, but we typically want our column selectors to extract the contents of an HTML element. Similarly, we want our next page selectors to match the value of the actual link target (in other words, just /guides/scraping-2 rather than <a href="/guides/scraping-2">Next page</a>).

To achieve this, we add a short "suffix" to column and next page selectors. The most common suffixes are:

- ::text, which extracts the content between the selected HTML tags. We use it here to get the actual table data.
- ::attr(href), which matches the value of the href attribute inside an HTML tag. We use it here to get the "next page" URL so our scraper can load the second page, for example, when it's done with the first.
- ::attr(src), which matches the value of a src attribute. We do not use it here, but it is helpful when scraping tables that include images.
So, in the end, our final selectors are:

- Row selector: table.classy > tbody > tr
- Next page selector: #next_page::attr(href)
- Column one selector: td.special > p::text
- Column two selector: td:nth-child(2) > p::text
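Before moving on, you might want to see what these selectors actually return. Once Scrapy is installed (see the next section), a short Python session like the following minimal sketch will show the suffixes at work; the HTML string is a cut-down copy of the green table and the values shown in the comments are only illustrative:

from scrapy.selector import Selector

html = '''<table class="classy"><tbody>
<tr class="odd"><td class="special"><p>r1 c1</p></td><td><p>r1 c2</p></td></tr>
<tr class="even"><td class="special"><p>r2 c1</p></td><td><p>r2 c2</p></td></tr>
</tbody></table>
<p><a id="next_page" href="/en/guides/scraping2">Next page</a></p>'''

page = Selector(text=html)

print(len(page.css('table.classy > tbody > tr')))         # 2 rows
print(page.css('td.special > p::text').extract())          # ['r1 c1', 'r2 c1']
print(page.css('td:nth-child(2) > p::text').extract())     # ['r1 c2', 'r2 c2']
print(page.css('#next_page::attr(href)').extract_first())  # /en/guides/scraping2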
Configuring Scrapy
Now that we have a basic understanding of HTML, CSS and how to use the Tor Browser's Inspector to distill them into a handful of selectors, it is time to enter those selectors into Scrapy. Scrapy is a free and open-source Web scraping platform written in the Python programming language. In order to use it, we will:
- Install Scrapy;
- Create a small python file called a "spider" and configure it by plugging in:
  - The URL we want to scrape,
  - Our row selector, and
  - Our column selectors;
- Tell Scrapy to run our spider on a single page;
- Check the results and make any necessary changes; and
- Plug in our next page selector and run the spider again.
Step 1: Installing Scrapy
You can install Scrapy on Tails by launching Terminal, entering the command below and providing your passphrase when prompted:
sudo apt-get install python-scrapy
To install software on Tails, you need to set an administration passphrase when you boot your system. If you are already running Tails and you did not do this, you will have to restart your computer. You might also want to configure the "encrypted persistence" feature, which allows you to save data within the /home/amnesia/Persistent folder. Otherwise, anything you save will be gone next time you boot Tails.
You do not have to configure Persistence to use Scrapy, but it makes things easier. Without Persistence, you will have to save your "spider" file on an external storage device and re-install Scrapy each time you want to use it.
Even with Persistence enabled, you will have to reinstall Scrapy each time you restart unless you add a line that says python-scrapy to the following file:

/live/persistence/TailsData_unlocked/live-additional-software.conf
To do this, you can launch Terminal, enter the following command and provide your passphrase when prompted:
sudo gedit /live/persistence/TailsData_unlocked/live-additional-software.conf
This will open the gedit text editor, give it permission to modify system files and load the contents (if any) of the necessary configuration file. Then add the line above (python-scrapy) to the file, click the [Save] button and quit the editor by clicking the X in the upper, right-hand corner. Unless you have edited this file before, while running Tails with Persistence, the file will be blank when you first open it.
On Debian GNU/Linux systems other than Tails, you will need to install Tor, torsocks and Scrapy, which you can do by launching Terminal, entering the following command and providing your passphrase when prompted.
sudo apt-get install tor torsocks python-scrapy
Step 2: Creating and configuring your spider file
There are many different ways to use Scrapy. The simplest is to create a single spider file that contains your selectors and some standard python code. You can copy the content of a generic spider file from here.
You will need to paste this code into a new file in the /home/amnesia/Persistent folder so you can edit it and add your selectors. To do so, launch Terminal and enter the command below:
gedit /home/amnesia/Persistent/spider-template.py
Paste in the contents from here, and click the [Save] button.
Now we just have to name our spider, give it a page on which to start scraping and plug in our selectors. All of this can be done by editing lines 7 through 14 of the template:
### User variables
#
start_urls = ['https://some.website.com/some/page']
name = 'spider_template'
row_selector = '.your .row > .selector'
next_selector = 'a.next_page.selector::attr(href)'
column_selectors = {
'some_column': '.contents.of > .some.column::text',
'some_other_column': '.contents.of > a.different.column::attr(href)',
'a_third_column': '.contents.of > .a3rd.column::text'
}
The start_urls, name, row_selector and next_selector variables are pretty straightforward:

- start_urls: Enter the URL of the first page that contains table data you want to scrape. Use https if possible and remember to keep the brackets ([ & ]) around the URL.
- name: This one is pretty unimportant, actually. Name your scraper whatever you like.
- row_selector: Enter the row selector you came up with earlier. If you want to test your spider on the green table above, you can use table.classy > tbody > tr
- next_selector: The first time you try scraping a new table, you should leave this blank. To do so, delete everything between the two ' characters. The resulting line should read next_selector = ''
The column_selectors item is a little different. It is a collection that contains one entry for each column selector. You can set both a key — the text before the colon (:) — and a value for each of those entries. The example keys in the template above are some_column, some_other_column and a_third_column. The text you enter for these keys will provide the column headings of the .csv file you are going to create. The example values are as follows:

.contents.of > .some.column::text
.contents.of > a.different.column::attr(href)
.contents.of > .a3rd.column::text

These selectors are completely made up and pretty much guaranteed to fail on every webpage ever made. You should, of course, replace them with the selectors you identified in the previous section.
If you want to test your spider on the green table above, these lines should look something like the following:
### User variables
#
start_urls = ['https://exposingtheinvisible.org/en/guides/scraping']
name = 'test-spider'
row_selector = 'table.classy > tbody > tr'
next_selector = ''
column_selectors = {
'first': 'td.special > p::text',
'second': 'td:nth-child(2) > p::text'
}
Everything else in this file should remain unchanged. And remember, you are editing python code, so try not to change the formatting. Punctuation marks like quotes (' & "), commas (,), colons (:), braces ({ & }) and brackets ([ & ]) all have special meaning in python. As does the indentation: four spaces before most of these lines and eight spaces before the lines that contain column selectors. Also, lines that begin with a number sign (#) are "comments," which means they will not affect your spider at all. Finally, when adding or removing column selectors, pay attention to the commas (,) at the end of all but the last line, as shown above.
After you have modified the start_urls, name, row_selector and next_selector entries, and added both keys and values for each of your column selectors, you should save your new spider by clicking the [Save] button in gedit.
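For orientation only, here is a heavily simplified sketch of how a spider file like this one might use those variables. It is not the actual template you pasted in, which differs in its details (for example, in how it handles relative links and downloaded files):

import scrapy

class TableSpider(scrapy.Spider):
    # The "User variables" described above, filled in for the green table.
    name = 'test-spider'
    start_urls = ['https://exposingtheinvisible.org/en/guides/scraping']
    row_selector = 'table.classy > tbody > tr'
    next_selector = '#next_page::attr(href)'
    column_selectors = {
        'first': 'td.special > p::text',
        'second': 'td:nth-child(2) > p::text',
    }

    def parse(self, response):
        # Apply each column selector once per row matched by the row selector.
        for row in response.css(self.row_selector):
            yield {key: row.css(selector).extract_first()
                   for key, selector in self.column_selectors.items()}

        # Follow the first "next page" match, if a next page selector is set.
        if self.next_selector:
            next_url = response.css(self.next_selector).extract_first()
            if next_url:
                yield scrapy.Request(response.urljoin(next_url))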
After configuring your spider, you should test it.
Step 3: Testing your spider
If you are using Tails, and if you did not change the location (/home/amnesia/Persistent) or the file name (spider-template.py) of your spider, you can test drive it by launching Terminal and entering the commands below:
cd /home/amnesia/Persistent
torsocks -i scrapy runspider spider-template.py -o extracted_data.csv
These commands include the following elements:
- cd /home/amnesia/Persistent moves to the "Persistence" folder in the Tails home directory, which is where we happened to put our spider
- torsocks -i tells Scrapy to use the Tor anonymity service while scraping
- scrapy runspider spider-template.py tells Scrapy to run your spider
- -o extracted_data.csv provides a name for the output file that will contain the data you scrape. You can name this file whatever you want, but Scrapy will use the three letter file extension at the end (.csv in this case) to determine how it should format those data. You can also output JSON content by using the .json file extension.
While it does its job, Scrapy will display all kinds of information in the Terminal where it is running. This feedback will likely include at least one "ERROR" line related to torsocks:
Unable to resolve. Status reply: 4 (in socks5_recv_resolve_reply() at socks5.c:683)
You can safely ignore this warning (along with most of the other feedback). If everything works, you will find a file called extracted_data.csv in the same directory as your spider. That file should contain all of the data scraped from the HTML table. Our example spider will extract the following from the green table above:
second,first
"Page one, row one, column two","Page one, row one, column one"
"Page one, row two, column two","Page one, row two, column one"
As you can see, the resulting columns may appear out of order, but they will be correctly associated with the key values you set for your column selectors, which make up the first line of data. You can easily re-order the columns once you import your .csv file into a spreadsheet.
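If you would rather fix the column order from the command line instead, a few lines of Python can rewrite the file. This is entirely optional; the sketch below assumes the example file name and column keys used above and writes a re-ordered copy to a new file:

import csv

# Read the scraped file and write a copy with the columns in a fixed order.
with open('extracted_data.csv') as infile:
    rows = list(csv.DictReader(infile))

with open('reordered_data.csv', 'w') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['first', 'second'])
    writer.writeheader()
    writer.writerows(rows)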
Troubleshooting tips:
- If your spider fails, look for lines that include [scrapy] DEBUG:. They might help you figure out what broke.
- If you want to quit Scrapy while it is still running, just hold down the <Ctrl> key and press c while in the Terminal application.
- If the named output file already exists (extracted_data.csv, in this case), Scrapy will append new data to the end of it, rather than replacing it. So, if things don't go according to plan and you want to try again, you should first remove the old file by entering rm extracted_data.csv.
- When Scrapy is done, it will display the following:

[scrapy] INFO: Spider closed (finished)
Step 4: Opening your CSV output in LibreOffice Calc
Follow the steps below to confirm that your spider worked (or, if you've already got what you came for, to begin cleaning and analysing your data):
- Launch LibreOffice Calc by clicking the Applications menu in the upper, left-hand corner of your screen, hovering over the Office sub-menu and clicking LibreOffice Calc.
- In LibreOffice Calc, click File and select Open.
- Navigate to your .csv file and click the [Open] button.
- Configure the following options if they are not already set by default:
  - Character set: Unicode (UTF-8)
  - From row: 1
  - Check the "Comma" box under Separator options
  - Uncheck all other Separator options
  - Select " under Text delimiter
- Click the [OK] button.
Step 5: Running your spider on multiple pages
If everything in your .csv file looks good, and you are ready to try scraping multiple pages, just configure the next page selector in your spider and run it again.
If you are using Tails with Persistence, you can open your spider for editing with the same command we used before:
gedit /home/amnesia/Persistent/spider-template.py
Those first few lines should now look something like:
### User variables
#
start_urls = ['https://exposingtheinvisible.org/en/guides/scraping']
name = 'test-spider'
row_selector = 'table.classy > tbody > tr'
next_selector = '#next_page::attr(href)'
column_selectors = {
'first': 'td.special > p::text',
'second': 'td:nth-child(2) > p::text'
}
The only difference is the addition of an actual selector (instead of '') for the next_selector variable.
Finally, click the [Save] button, remove your old output file and tell Scrapy to run the updated spider:
cd /home/amnesia/Persistent
rm extracted_data.csv
torsocks -i scrapy runspider spider-template.py -o extracted_data.csv
Your .csv output should now include data from the second page:
second,first
"Page one, row one, column two","Page one, row one, column one"
"Page one, row two, column two","Page one, row two, column one"
"Page two, row one, column two","Page one, row one, column one"
"Page two, row two, column two","Page one, row two, column one"
Troubleshooting tips:
- If you are using Tails but have not enabled Persistence, copy your output file somewhere before shutting down your Tails system.
- If you want to keep an old version of your spider while testing out a new one, just make a copy and start working with the new one instead. You can do this by entering the following command in your Terminal:

cp spider-template.py spider-new.py
Of course, you will then use the following, instead of the command shown above, to edit your new file:
gedit /home/amnesia/Persistent/spider-new.py
And to run your new spider, you will use the following:
torsocks -i scrapy runspider spider-new.py -o extracted_data.csv
Real world examples
The sections below cover three multi-page HTML data tables that you might want to scrape for practice. These tables are (obviously) longer than the example above. They are also a bit more complex, so we will point out the key differences and how you can address them when configuring your spider.
For each website, we include sections on:
- What is different about this table;
- The URL, name and selector values for your spider; and
- Other custom settings
Resources for refugees in a Berlin neighborhood
The city of Berlin publishes a list of resources for refugees in various neighborhoods. The listing for Friedrichshain and Kreuzberg includes enough entries to be useful as a sample scraping target.
[Copy and paste the spider code into a new file]
What is different about this table
This table has three interesting characteristics:
- It has a long, complex starting URL;
- The data we want to scrape include an internal link to a page elsewhere on the site; and
- One of the cells we want to scrape contains additional HTML markup.
1. Complex starting URLs:
The URL we will use to scrape this table corresponds to the results page of a search form. It is:
http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/?q=&sprache=--+Alles+--&q_geo=&ipp=5#searchresults
This does not affect the configuration of our spider or the command we will use to run it, but it is worth noting that you will often need to browse through a website — navigating to subpages, using search forms, filtering results, etc. — in order to find the right starting URL.
2. Internal links:

The table column (Träger) that corresponds to the organisation responsible for the resource in that row includes a Mehr… link to a dedicated page about that resource. We want the full URL, but the HTML attribute we will scrape only provides one that is relative to the website's domain. Our spider template will try to handle this automatically if you use 'link': as the key value for that column selector, as shown below. If it does not work properly when scraping some other website, you can just choose a different key for that column selector ('internal_link':, for example) and fix the URL yourself once you get it into a spreadsheet.
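If you do end up with relative paths in your output, Python's standard urljoin function is one way to turn them into full URLs after the fact. A minimal sketch, using a made-up relative path:

try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

base = 'http://www.berlin.de/'
relative_link = '/some/detail/page.html'   # made-up example of a scraped value

print(urljoin(base, relative_link))
# http://www.berlin.de/some/detail/page.html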
3. Internal HTML markup:

That same (Träger) column also includes HTML markup inside the text description of some organisations. As a result, the selector suffix we would normally use (::text) does not work properly. Instead, we have to extract the entire HTML block. (Notice that the organisation column selector below does not include a suffix.) If you are a python programmer, you can fix this quite easily inside your spider. Otherwise, you will have to clean it up in your spreadsheet application.
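If you do decide to clean it up in Python, one option is the remove_tags helper from the w3lib library, which is installed alongside Scrapy. A minimal sketch, applied to a made-up cell value:

from w3lib.html import remove_tags

# A made-up example of a cell scraped with its internal HTML markup intact.
cell = '<p>Some organisation, <strong>Berlin</strong></p>'

print(remove_tags(cell))
# Some organisation, Berlin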
URL, name and selectors
Spoiler alert. Below you will find all of the changes you would need to make in the spider-template.py file to scrape several pages' worth of refugee resources in Friedrichshain and Kreuzberg. Before you continue, though, you should visit the webpage in the Tor Browser and, if necessary, consult the section on Using the Tor Browser's Inspector to identify the selectors you need.
Then try to determine selectors for the following:
- Row selector
- Next page selector
- Column selectors:
  - Supporting organisation
  - Description of the resource being offered
  - The language(s) supported
  - The location of the resource
  - An internal link to a more descriptive page
Corresponding User variables for your spider:
start_urls = ['http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/?q=&sprache=--+Alles+--&q_geo=&ipp=5#searchresults']
name = 'refugee_resources'
row_selector = 'tbody tr'
next_selector = '.pager-item-next > a:nth-child(1)::attr(href)'
item_selectors = {
'organisation': 'td:nth-child(1) > div > p',
'offer': 'td:nth-child(2)::text',
'language': 'td:nth-child(3)::text',
'address': 'td:nth-child(4)::text',
'link': 'td:nth-child(1) > a:nth-child(2)::attr(href)'
}
Other custom settings
To scrape this table, you can use the default collection of custom_settings in the spider template, which is shown below. As mentioned above, the lines beginning with # are comments and will not affect the behaviour of your spider. For each of the two remaining examples, you will uncomment at least one of those lines.
custom_settings = {
# 'DOWNLOAD_DELAY': '30',
# 'DEPTH_LIMIT': '100',
# 'ITEM_PIPELINES': {
# 'scrapy.pipelines.files.FilesPipeline': 1,
# 'scrapy.pipelines.images.ImagesPipeline': 1
# },
# 'IMAGES_STORE': 'media',
# 'IMAGES_THUMBS': { 'small': (50, 50) },
# 'FILES_STORE': 'files',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
'TELNETCONSOLE_ENABLED': False,
'DOWNLOAD_HANDLERS': {'s3': None}
}
Hacker News
In this section, we will use YCombinator's Hacker News as a second "real world" example. Of course, in the real world, one would not need to scrape Hacker News because the website offers both an RSS feed and an application programming interface (API). Abusing Hacker News to practice scraping is something of a tradition, though, and it is a nice example of a multi-page HTML table that is not too complex.
[Copy and paste the spider code into a new file]
What is different about this table
If you restrict your scraping to the first line of each article listed on Hacker News, as we do here, this table is very similar to the tiny green example we've been using up until now. It's just bigger. We introduce one new challenge below, but it is quite straightforward.
Scraping a row attribute:

If you look at a Hacker News article using the Tor Browser's Inspector, you will see that each table row (tr) containing a story has a unique id attribute. IDs like this are often sequential, which can be a useful way to keep track of the order in which content is presented. Even though they are not sequential in this particular table, it might still be useful to scrape them. But most of our column selectors are inside table data (td) elements, and we normally specify their selectors relative to their parent row. So how do we capture an attribute of the row itself? Just specify a suffix that is only a suffix. In this case, as shown below, that would be: ::attr(id).
URL, name and selectors
Spoiler alert. Below you will find all of the changes you would need to make in the spider-template.py file to scrape several pages' worth of articles on Hacker News. Before you continue, though, you should visit the website in the Tor Browser and, if necessary, consult the section on Using the Tor Browser's Inspector to identify the selectors you need.
Then try to determine selectors for the following:
- Row selector
- Next page selector
- Column selectors:
  - Article index
  - Article title
  - Web address for the article itself
Corresponding User variables for your spider:
start_urls = ['https://news.ycombinator.com/news']
name = 'hacker_news'
row_selector = 'table.itemlist > tr.athing'
next_selector = 'a.morelink::attr(href)'
item_selectors = {
"index": "::attr(id)",
"title": "td.title > a.storylink::text",
"external_link": "td.title > a.storylink::attr(href)"
}
Other custom settings
The Hacker News robots.txt file specifies a Crawl-delay of 30 seconds, so we should make sure that our spider does not scrape too quickly. As a result, it may take up to ten minutes to scrape all of the table data.
custom_settings = {
'DOWNLOAD_DELAY': '30',
# 'DEPTH_LIMIT': '100',
# 'ITEM_PIPELINES': {
# 'scrapy.pipelines.files.FilesPipeline': 1,
# 'scrapy.pipelines.images.ImagesPipeline': 1
# },
# 'IMAGES_STORE': 'media',
# 'IMAGES_THUMBS': { 'small': (50, 50) },
# 'FILES_STORE': 'files',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
'TELNETCONSOLE_ENABLED': False,
'DOWNLOAD_HANDLERS': {'s3': None}
}
CCTV cameras in Bremen
The German city of Bremen provides a list of public video cameras, along with a thumbnail image of what each camera can see.
[Copy and paste the spider code into a new file]
What is different about this table
The Bremen CCTV table has three main quirks:
- The page structure makes it unusually difficult to identify a reliable next page selector,
- To solve the next page selector challenge, we have to give our spider a maximum number of pages to scrape, and
- We want to download the actual image file associated with each CCTV camera.
1. Elusive next page selector:
The CSS class used to style the "next arrow" is only applied to the icon, not to the link itself, so we cannot use it for our next page selector. To make matters worse, the row of page navigation links only displays "first page" and "previous page" links after you have left the first page, so our usual :nth-child(n) trick does not work. As a result, we have to get creative to come up with a selector that will work on every page. In the example below, we use a different trick (:nth-last-child(n)) so we can count from the end rather than from the beginning.
2. Specifying a "depth limit":
When we get to the end, the "last page" and "next page" links disappear, which would normally make our scraper "click" on the link for a page that it had already scraped. It would then go back a couple of pages. Until it got to the end (again), at which point it would go back a couple of pages (again). This would create an infinite loop, and our poor spider would end up scraping forever. By setting the DEPTH_LIMIT to 11 in the spider's custom_settings, we can tell it to stop after it reaches the last page.
3. Downloading image files:
This is the first time we are asking our spider to download image files. Scrapy makes this quite easy, but it does require:
- Using the special image_urls key for the corresponding column selector, and
- Uncommenting a few custom_settings by removing the # character at the beginning of the corresponding lines.
URL, name and selectors
Spoiler alert. Below you will find all of the changes you would need to make in the spider-template.py file to scrape several pages' worth of camera locations and image thumbnails. Before you continue, though, you should visit the website in the Tor Browser and, if necessary, consult the section on Using the Tor Browser's Inspector to identify the selectors you need.
Then try to determine selectors for the following:
- Row selector
- Next page selector
- Column selectors:
  - Name of the camera
  - Status of the camera
  - Location of the camera
  - A URL for the camera's current thumbnail image
Corresponding User variables for your spider:
start_urls = ['http://www.standorte-videoueberwachung.bremen.de/sixcms/detail.php?gsid=bremen02.c.734.de']
name = 'bremen_cctv'
row_selector = 'div.cameras_list_item'
next_selector = 'ul.pagination:nth-child(4) > li:nth-last-child(2) > a::attr(href)'
item_selectors = {
'title': '.cameras_title::text',
'status': '.cameras_title .cameras_status_text::text',
'address': '.cameras_address::text',
'image_urls': '.cameras_thumbnail > img::attr(src)'
}
Other custom settings
To address the issues described in the What is different about this table section, we will use the custom settings below.
custom_settings = {
# 'DOWNLOAD_DELAY': '30',
'DEPTH_LIMIT': '11',
'ITEM_PIPELINES': {
# 'scrapy.pipelines.files.FilesPipeline': 1,
'scrapy.pipelines.images.ImagesPipeline': 1
},
'IMAGES_STORE': 'media',
'IMAGES_THUMBS': { 'small': (50, 50) },
# 'FILES_STORE': 'files',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
'TELNETCONSOLE_ENABLED': False,
'DOWNLOAD_HANDLERS': {'s3': None}
}
The DEPTH_LIMIT setting above makes sure that we stop scraping when we get to the last page. The uncommented item in the collection of ITEM_PIPELINES tells Scrapy to download and save all images identified by the image_urls column selector. IMAGES_STORE tells it to put those image files into a folder called media, which it will create automatically. IMAGES_THUMBS tells it to create a very small thumbnail image for each one.
(scrapy.pipelines.files.FilesPipeline, FILES_STORE and the special file_urls column selector key can be used to download other sorts of files from a webpage. This might include PDFs, Microsoft Word documents, audio files, etc. We leave these options disabled above.)
After you run your spider, have a look at your .csv output and explore that media folder to see exactly how this works.
Once you understand the configuration changes needed to make these three Scrapy spiders do their jobs, you should be able to securely and privately scrape structured data, images and files from most HTML tables on the web. As always, there is more work to do. You will still have to clean, analyse, present and explain the data you scrape, but we hope the information above will give you a place to start next time you stumble across an important trove of tabular data that does not have a "download as CSV" link right there next to it.
Published on 29 May 2017.
Follow us @Info_Activism, get in touch at 'eti@tacticaltech.org', or read another of our guides here.