Scraping web data


Public data come in handy at times. Meaningful investigations have made use of everything from live flight tracking data to public registries of companies to lobbying disclosures to hashtags created on Twitter, among countless other examples. While we are clearly seeing a proliferation of such information online, it is not always available in a format we can use. Leveraging public data for an investigation typically requires that it be not only machine readable but structured. In other words, a PDF that contains a photograph of a chart drawn on a napkin is less useful than a Microsoft Excel document that contains the actual data presented in that chart.

This can be an issue even in cases where governments are compelled by Freedom of Information (FoI) laws to release data they have collected, maintained or financed. In fact, governments sometimes obfuscate data intentionally to prevent further analysis that might reveal certain details they would rather keep hidden.

Investigators who work with public data face many obstacles, but two of the most common are:

  1. Horrendous, multi-page HTML structures embedded in a series of consecutive webpages; and
  2. Pleasantly formatted tables trapped within the unreachable hellscape of a PDF document.

This guide seeks to address the first challenge. It presents a series of steps that can be used to automate the collection of online HTML tables and the transformation of those tables into a more useful format. This process is often called "Web scraping." In the examples discussed below, we will produce comma-separated values (CSV) documents ready to be imported by LibreOffice Calc or Microsoft Excel. (For advice on dealing with PDF tables, have a look at this article and watch this space for an upcoming guide on Tabula, a PDF scraping tool.)

The structure of this guide

This guide has four sections. The first discusses our rationale for choosing the tools and techniques covered in the rest of the guide. The second section is a brief primer on HTML structure, CSS selectors and how to use the Tor Browser's built-in Element Inspector to learn what you need to know for section three. (A selector is a short sequence of keywords that identifies one or more specific elements on a webpage.) In the third section, we walk through the process of plugging those selectors into Scrapy, pulling down HTML data and saving them as a CSV file. Finally, we present three external web tables that are both more interesting and more complex than the example used previously.

If you don't need convincing, you should probably skip down to the section on Using Scrapy and Tor Browser to scrape tabular data.

Who is this guide for?

The short answer is, anyone with a Debian GNU/Linux system — be it a computer, a virtual machine or a boot disk — who is willing to spend most of a day learning how to scrape web data reliably, flexibly and privately. And who remains willing even when they find out that less reliable, less flexible and less secure methods are probably less work.

More specifically, the steps below assume you are able to edit a text file and enter commands into a Linux Terminal. These steps are written from the perspective of a Tails user, but we have included tips, where necessary, to make sure they work on any Debian system. Adapting those steps for some other Linux distribution should be quite easy, and making them work on Windows should be possible.

This guide does not require familiarity with the Python programming language or with web design concepts like HTML and CSS, though all three make an appearance below. We will explain anything you need to know about these technologies.

Finally, this guide is written for variations of the Firefox Web browser, including the Tor Browser and Iceweasel. (From here on out, we will refer specifically to the Tor Browser, but most of the steps we describe will work just fine on other versions of Firefox.) Because of our commitment to Tails compatibility, we did not look closely at scraping extensions for Chromium, the open-source version of Google's Chrome web browser. So, if you're using Windows or a non-Tails Linux distribution (and if you are not particularly concerned about anonymity) you can either use Firefox or have a look at the Web Scraper extension for Chromium. It's a few years old, but it looks promising nonetheless. It is free and open-source software, like Scrapy and the other tools we recommend in this guide.

In defense of The Hard Way

If you have skimmed through the rest of this guide, you might have noticed a startling lack of screenshots. There are a few, certainly, but most of them just show the Tor Browser's built-in Inspector being used to identify a few inscrutable lines of poetry like, td.col1 div.resource > a::attr(href). And if that stanza gives you a warm fuzzy feeling, you might consider skipping down to the section on Using Scrapy and Tor Browser to scrape tabular data. But if it looks a bit intimidating, please bear with us for a few more paragraphs.

Put the question to your favourite search engine, and you will find any number of graphical web scraping tools out there on the Internet. With varying degrees of success, these tools provide a user interface that allows you to:

  • Pick and choose from the content available on a webpage by pointing and clicking;
  • Download the content you want; and
  • Save it in the format of your choosing.

All of which sounds great. But if you start digging into the details and test driving the software, you will find a number of caveats. Some of these tools are commercial, closed source or gone in a puff of smoke. Some of them have limited functionality, or work for a limited time, until you pay for them. Some of them are cloud-based services run by companies that want to help you scrape data so they can have a copy for themselves. Some of them were written using an outdated and insecure browser extension framework. Some of them only work on very simple tables. Some of them don't know how to click the "next" button. Some of them ping Google Analytics. That sort of thing.

And none of this ought to surprise us. It takes work to develop and maintain software like this, and the data industry is a Big Deal these days. Of course, it might not matter for all of us all of the time. Plenty of good investigative work has been done by chaining together half a dozen "15 Day Trial" licenses. But our goal for these guides is to provide solutions that don't require you to make those sorts of trade-offs. And that don't leave you hanging when you find yourself needing to:

  • Scrape large quantities of data, incrementally, over multiple sessions;
  • Parse complex tables;
  • Download binary files like images and PDFs;
  • Hide your interest in the data you are scraping; or
  • Stay within the legal bounds of your software licenses and user agreements.

Why you might want to scrape with Tails

There are a number of reasons why you might want to use Tails when scraping web data. Even if you intend to release that data (or publish something that would reveal your access to it), you might still want to hide the fact that you or your organisation are digging for information on a particular subject. At least for a while. By requesting relevant pages through Tor, you retain control over the moment at which your involvement becomes public. Until then, you can prevent a wide variety of actors from knowing what you are up to. This includes other people at your internet cafe, your internet service provider (ISP), surveillance agencies, and whoever operates the website you are scraping and their ISP, among others.

Tails also helps protect your research, analysis and draft outputs from theft and confiscation. When you enable Persistence on Tails, you are creating a single folder within which you can save data, and that folder's contents are guaranteed to be encrypted. There are other ways to store encrypted data, of course, but few of them make it this difficult to mess up. Tails disks are also small, easy to back up and even easier to throw away. Sticking one in your pocket is sometimes a nice alternative to traveling with a laptop, especially on trips where border crossings, checkpoints and raids might be a cause for concern.

More generally, Tails is an extremely useful tool if you are looking to keep your sensitive investigative work separate from your personal activities online. Even if you only have one laptop, you can effectively compartmentalise your high risk investigations by confining your acquisition and analysis of related data to your Tails system. And even if you are scraping banal data from a public website, it is worth considering whether you should have to make that decision every time you start poking around for a new project. Unless the data you are seeking cannot be accessed through Tor (which does happen), there are very good reasons to err on the side of compartmentalisation.

 

Using Scrapy and Tor Browser to scrape tabular data

 

Scraping web data reliably and flexibly often requires two steps. Unless you are willing and able to use an all-in-one graphical scraper, you will typically need to:

  1. Determine the selectors that identify the content you want (and only the content you want), then
  2. Use those selectors to configure a tool that is optimised for extracting content from webpages.

In the examples presented below, we will rely on Tor Browser to help us with the first stage and Scrapy to handle the second.

 

Identifying selectors

Before we go digging for selectors, we will start with a brief introduction to HyperText Markup Language (HTML) and Cascading Style Sheets (CSS). Feel free to skip it. And, if you don't skip it, rest assured that, when you start using these tools for real, your web browser will do most of the heavy lifting.

 

A remarkably short HTML tutorial

By the time they reach your web browser, most websites arrive as collections of HTML pages. The basic underlying structure of an HTML document looks something like this:

<html>
  <body>
    <p>Page content...</p>  
  </body>
</html>

HTML tables are often used to format the content of these pages, especially when presenting data. Here's an example:

Title of Column one            | Title of Column two
Page one, row one, column one  | Page one, row one, column two
Page one, row two, column one  | Page one, row two, column two

Previous page - Next page

If we add the table and the navigation links above to our simplified HTML page, we end up with the following collection of elements:

<html>
  <body>
    <table>
      <thead>
        <tr>
          <th><p>Title of Column one</p></th>
          <th><p>Title of Column two</p></th>
        </tr>
      </thead>
      <tbody>
        <tr class="odd">
          <td><p>Page one, row one, column one</p></td>
          <td><p>Page one, row one, column two</p></td>
        </tr>
        <tr class="even">
          <td><p>Page one, row two, column one</p></td>
          <td><p>Page one, row two, column two</p></td>
        </tr>
      </tbody>
    </table>
    <p><a href="/guides/scraping">Previous page</a> - <a href="/guides/scraping-2">Next page</a></p>
  </body>
</html>

For our purposes here, the structure is more important than the meaning of the elements themselves, but a few of those "tags" are a little obscure, so:

  • <p> and </p> begin and end a paragraph
  • <a> begins a clickable link (and </a> ends it)
  • The text ("Next page") between the <a> and </a> tags is the thing you click on
  • The href="/guides/scraping-2" inside the <a> tag tells your browser where to go when you click it
  • <tr> and </tr> begin and end a row in a table
  • <td> and </td> begin and end a cell within a table row
  • <th> and </th> are just like <td> and </td> but they're meant for table headings
  • <table> and </table> you already figured out

We will discuss how to view the HTML behind a webpage later on, but if you want to have a look now, simply right-click the table above and select "Inspect Element." The actual HTML is slightly more complex than what is shown above, but the similarities should be clear. (You can close the "Inspector" pane by clicking the tiny X in its upper, right-hand corner.)
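If you would rather see that nesting spelled out by a program, here is a minimal sketch that uses Python's built-in html.parser module to print each element indented beneath its parent. It is not part of the Scrapy workflow described later. The PAGE string is a cut-down copy of the example above, and the Outline class is a name we made up for illustration.

```python
from html.parser import HTMLParser

# A cut-down copy of the simplified page shown above.
PAGE = """<html>
  <body>
    <table>
      <tr><th><p>Title of Column one</p></th></tr>
      <tr><td><p>Page one, row one, column one</p></td></tr>
    </table>
    <p><a href="/guides/scraping-2">Next page</a></p>
  </body>
</html>"""


class Outline(HTMLParser):
    """Collects one indented line per opening tag or piece of text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag (with its attributes) at the current depth,
        # then step one level deeper for whatever it contains.
        attr_text = "".join(f' {key}="{value}"' for key, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1

    def handle_data(self, data):
        # Ignore the whitespace between tags; keep real text.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


outline = Outline()
outline.feed(PAGE)
outline.close()
print("\n".join(outline.lines))
```

The deeper a line is indented in the output, the more "parents" that element has, which is exactly the relationship that selectors describe.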

A relatively short CSS tutorial

In order to make some (but not all) HTML elements of a particular type behave in a particular way, they can be assigned a class or an id or some other "attribute." Here is a slightly different version of the table above:

Title of Column one            | Title of Column two
Page one, row one, column one  | Page one, row one, column two
Page one, row two, column one  | Page one, row two, column two

Previous page - Next page

And here's what that might look like:

<html>
  <body>
    <table class="classy">
      <thead>
        <tr>
          <th><p>Title of Column one</p></th>
          <th><p>Title of Column two</p></th>
        </tr>
      </thead>
      <tbody>
        <tr class="odd">
          <td class="special"><p>Page one, row one, column one</p></td>
          <td><p>Page one, row one, column two</p></td>
        </tr>
        <tr class="even">
          <td class="special"><p>Page one, row two, column one</p></td>
          <td><p>Page one, row two, column two</p></td>
        </tr>
      </tbody>
    </table>
    <p><a id="prev_page" href="/guides/scraping">Previous page</a> - <a id="next_page" href="/guides/scraping-2">Next page</a></p>
  </body>
</html>

Notice the class="special" inside some (but not all) of the <td> tags. At the moment, these cells are not particularly special. They're just grey. And those prev_page and next_page elements look just like any other link. But that's not the point. The point is that we can use these "attributes" to identify certain types of data in a consistent way. When writing CSS rules, web designers rely on this technique to apply graphical styling to webpages. We will use it to extract data.

To do this, we need to determine the CSS Selectors for the content we care about.

Let's start with that id="next_page". There are several different selectors we could use to identify the "Next page" link in the example above. The longest and most specific is:

html > body > p > a#next_page

But, since there's only one html document and it only has one body, we could shorten this whole thing to p > a#next_page. Then again, since there is only one next_page, we could just use a#next_page. Or even #next_page. All of these are perfectly valid selectors.

At this point, you may be wondering what all those symbols mean. As you probably noticed, we put a # at the beginning of an id value when we refer to it. And the > symbol represents a "child" relationship. It can be used to help identify the right-most element as precisely as possible. Looking at the HTML snippet above, you can see that the <a> we want is the child of a <p> which is the child of a <body>, which is the child of an <html>.

Below are a few examples that might help (but that you can safely ignore if they don't):

  • p would select every paragraph in the example above
  • th > p would only select the column headings
  • td > p would only select the contents of the four main table cells
  • body > p would only select the paragraph at the bottom, which contains both the "Previous page" and "Next page" links

If we separate two elements with a space, rather than a >, it signifies a "descendant" relationship. This allows us to skip some of the steps in the middle. So body a#next_page is valid, even though it leaves out p in between, whereas body > a#next_page is invalid (which just means it will fail to "select" anything if used). Descendant relationships cast a wider net. For example, body > p will only select the links at the bottom, whereas body p will select every paragraph on the page.
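The difference between "child" and "descendant" can be tried out in a few lines of Python. The sketch below is an approximation rather than real CSS: it uses the path syntax of Python's built-in xml.etree.ElementTree module as a stand-in, where a single / plays the role of > and a double // plays the role of a space. It only works here because our simplified page happens to be well-formed enough for an XML parser, which most real webpages are not.

```python
import xml.etree.ElementTree as ET

# The first simplified page from above, slightly compacted.
PAGE = """<html><body>
  <table>
    <thead><tr>
      <th><p>Title of Column one</p></th>
      <th><p>Title of Column two</p></th>
    </tr></thead>
    <tbody>
      <tr class="odd">
        <td><p>Page one, row one, column one</p></td>
        <td><p>Page one, row one, column two</p></td>
      </tr>
      <tr class="even">
        <td><p>Page one, row two, column one</p></td>
        <td><p>Page one, row two, column two</p></td>
      </tr>
    </tbody>
  </table>
  <p><a href="/guides/scraping">Previous page</a> - <a href="/guides/scraping-2">Next page</a></p>
</body></html>"""

page = ET.fromstring(PAGE)  # "page" is the <html> element

# Stand-in for "body > p": paragraphs that are direct children of <body>.
children = page.findall("./body/p")

# Stand-in for "body p": paragraphs anywhere inside <body>.
descendants = page.findall("./body//p")

# Stand-in for "th > p": only the column headings.
headings = [p.text for p in page.findall(".//th/p")]

# Stand-in for "body > a": no link is a direct child of <body>.
direct_links = page.findall("./body/a")

print(len(children), len(descendants))
print(headings)
print(direct_links)
```

The child version finds a single paragraph (the one holding the links), while the descendant version finds every paragraph on the page, including those inside table cells.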

So what about our class="special" attribute? You can use a class the same way you'd use an id, but with a full stop (.) instead of a #. So, the following would all select the same data. See if you can figure out what it is:

  • td.special > p
  • .special > p
  • .special p
  • table td.special > p

Answer: The contents of all of the main table cells in column one
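To check that answer with a program, here is a rough sketch that again borrows the path syntax of Python's built-in xml.etree.ElementTree module as a stand-in for CSS. The [@class='special'] and [@id='next_page'] predicates are our approximations of .special and #next_page. One caveat: the predicate compares the whole attribute value, so unlike a real CSS class selector it would miss an element written as class="special wide".

```python
import xml.etree.ElementTree as ET

# The second ("classy") example page from above, slightly compacted.
PAGE = """<html><body>
  <table class="classy"><tbody>
    <tr class="odd">
      <td class="special"><p>Page one, row one, column one</p></td>
      <td><p>Page one, row one, column two</p></td>
    </tr>
    <tr class="even">
      <td class="special"><p>Page one, row two, column one</p></td>
      <td><p>Page one, row two, column two</p></td>
    </tr>
  </tbody></table>
  <p><a id="prev_page" href="/guides/scraping">Previous page</a> - <a id="next_page" href="/guides/scraping-2">Next page</a></p>
</body></html>"""

page = ET.fromstring(PAGE)

# Stand-in for "td.special > p": paragraphs directly inside a special cell.
special = [p.text for p in page.findall(".//td[@class='special']/p")]

# Stand-in for "#next_page": whichever element carries that id.
next_link = page.find(".//*[@id='next_page']")

print(special)
print(next_link.tag, next_link.text)
```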

 

Understanding which selectors you need

In order to scrape data from an HTML table on a webpage, you will need one selector that identifies all of the rows in that table and one selector for each column. So, to scrape the second (green) table above, you would need three selectors. There are many variations that would work, but here is one possibility:

  • Row selector: html body table.classy > tbody > tr
  • Column one selector: td.special > p
  • Column two selector: td:nth-child(2) > p

 

We will discuss how to obtain these values shortly, but first:

  1. Why don't the column selectors begin with html body or at least table.classy tr?
  2. What does nth-child(2) mean?
  3. What about tables that are spread over multiple pages?

 

1. A column selector is applied to a single row, not to the entire page:

In the following section, when we finally start pulling down data, each of your column selectors will be used once for each row identified by your row selector. Because of the way we have written our scraping file, the column selectors should be relative to the row they are in. So they should not begin with tr or anything "above" tr.

 

2. The nth-child(n) trick:

Depending on the configuration of the table you are scraping, you might not have to use this trick, but it can be quite helpful when dealing with tables that do not include class="whatever" attributes for every column. Just figure out which column you want, and stick that number between the parentheses. So, in this example, td:nth-child(1) is the same as td.special. And td:nth-child(2) means "the second column." We use this method here because we have no other way to match the second column without also matching the first.
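The "one row selector, then one column selector per row" idea can be sketched in a few lines of Python. This is not the Scrapy spider used later in the guide. It is an approximation built on the standard library's xml.etree.ElementTree module, whose path syntax we use as a stand-in for CSS. The names row_path and column_paths are ours, and td[2] (the second <td> in the row) is only equivalent to td:nth-child(2) because every child of these rows is a <td>.

```python
import xml.etree.ElementTree as ET

# The "classy" example page from above, slightly compacted.
PAGE = """<html><body>
  <table class="classy"><tbody>
    <tr class="odd">
      <td class="special"><p>Page one, row one, column one</p></td>
      <td><p>Page one, row one, column two</p></td>
    </tr>
    <tr class="even">
      <td class="special"><p>Page one, row two, column one</p></td>
      <td><p>Page one, row two, column two</p></td>
    </tr>
  </tbody></table>
</body></html>"""

page = ET.fromstring(PAGE)

# Stand-in for the row selector "table.classy > tbody > tr".
row_path = "./body/table[@class='classy']/tbody/tr"

# Stand-ins for the column selectors. Note that they start at the row,
# not at the top of the page.
column_paths = {
    "column_one": "./td[@class='special']/p",  # like td.special > p
    "column_two": "./td[2]/p",                 # like td:nth-child(2) > p
}

records = []
for row in page.findall(row_path):
    # Each column path is applied once per row, relative to that row.
    records.append(
        {name: row.findtext(path) for name, path in column_paths.items()}
    )

for record in records:
    print(record)
```

Because the column paths are evaluated against each row in turn, they never need to mention the table or the row itself.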

 

3. Scraping a multi-page table:

If you want your scraper to "click the next page link" automatically, you have to tell it where that link is. Both of the example tables above have a next page link beneath them. The link under the green table is easy: a#next_page is all you need. The link under the first table is a little trickier, because it has neither an id nor a class, and most pages will include at least one other <a>...</a> link somewhere in their HTML. In this simplified example, html body > p > a:nth-child(2) would work just fine because the only paragraph directly inside <body> (rather than inside the <table>) holds the two navigation links, and "Next page" is the second of them.
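To make the idea of "clicking next" concrete, here is a small sketch of the loop a scraper runs. Everything in it is hypothetical: the two pages live in a Python dictionary called SITE instead of on the Internet, the example.org addresses are placeholders, and the crawl function is our own illustration rather than anything Scrapy provides. The standard library's urljoin turns a relative link like /guides/scraping-2 into a full address.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# A hypothetical two-page "website" held in memory, keyed by URL.
SITE = {
    "https://example.org/guides/scraping": """<html><body>
      <table><tbody>
        <tr><td><p>Page one, row one</p></td></tr>
      </tbody></table>
      <p><a id="next_page" href="/guides/scraping-2">Next page</a></p>
    </body></html>""",
    "https://example.org/guides/scraping-2": """<html><body>
      <table><tbody>
        <tr><td><p>Page two, row one</p></td></tr>
      </tbody></table>
    </body></html>""",
}


def crawl(start_url):
    """Collect table cells from start_url and every page reachable via 'next'."""
    url, cells, visited = start_url, [], []
    while url in SITE and url not in visited:
        visited.append(url)
        page = ET.fromstring(SITE[url])
        cells += [p.text for p in page.findall(".//td/p")]
        # Stand-in for "a#next_page": find() returns the first match only.
        link = page.find(".//a[@id='next_page']")
        if link is None:
            break  # last page: nothing left to "click"
        # Resolve the relative href against the page we are on.
        url = urljoin(url, link.get("href"))
    return visited, cells


visited, cells = crawl("https://example.org/guides/scraping")
print(visited)
print(cells)
```

The loop stops when a page has no next page link, when the link points somewhere unknown, or when it would revisit a page it has already seen.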

Determining these selectors is the most difficult part of scraping web data. But don't worry, the Tor Browser has a few features that make it easier than it sounds.

 

Using the Tor Browser's Inspector to identify the selectors you need

If you are using a Linux system that does not come with the Tor Browser pre-installed, you can get it from the Tor Project website or follow along with the Tor Browser Tool Guide on Security-in-a-Box.

Follow the steps below to practice identifying selectors on a real page.

Step 1: The row selector

Right-click one of the data rows in the table you want to scrape and select Inspect Element. We will start by looking at the green table above. This will open the Tor Browser's Inspector, which will display several nearby lines of HTML and highlight the row on which you clicked:

[Screenshots: Opening the Inspector; Hovering over a table cell; Hovering over a table row; Selecting a table row]

Now hover your mouse pointer over the neighboring rows in the Inspector while keeping an eye on the main browser window. Various elements on the page will be highlighted as you move your pointer around. Find a line of HTML in the Inspector that highlights an entire table row (but that does not highlight the entire table.) Click it once.

The content indicated by the red box in the third screenshot thumbnail above (table.classy > tbody > tr.odd) is almost your row selector. (You might have to click the Inspector's left or right arrows to see the whole thing.)

But remember, you need a selector that matches all of the table rows, not just the one you happened to click. Click the next row in the Inspector and compare: if you get the same value for two consecutive rows, simply copy it down and you're probably done. Otherwise, you might have to do a bit of manual editing. Specifically, you might have to remove a few #id or .class values that narrow your selection down to a particular row or a subset of rows rather than matching every row.

Now, you might think you could just right-click something to copy this value. And eventually you will be able to do just that, as Firefox version 53 will implement a "Copy CSS Selector" option. For now, unfortunately, you will have to write it down manually.

In the example above, the full value of that row with the red box in the Inspector is:

html > body > div.container.main > div.row > div.span8.blog > table.classy > tbody > tr.odd

But there is only one table.classy on the page, so we can ignore everything before that. Which leaves us with:

table.classy > tbody > tr.odd

And the second row ends with tr.even instead of tr.odd, so we need to make one more adjustment to obtain our final row selector:

table.classy > tbody > tr
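The adjustment we just made by hand follows a simple rule: keep every segment that two consecutive rows share, and strip the .class or #id from any segment where they differ. Here is a minimal sketch of that rule in Python. The generalise function is our own invention for illustration, and it assumes both selectors use " > " between segments, as the Inspector's output does in this example.

```python
import re


def generalise(selector_a, selector_b):
    """Merge two row selectors into one that matches both rows."""
    parts_a = selector_a.split(" > ")
    parts_b = selector_b.split(" > ")
    if len(parts_a) != len(parts_b):
        raise ValueError("selectors have different depths")

    merged = []
    for a, b in zip(parts_a, parts_b):
        if a == b:
            merged.append(a)  # identical segment: keep it as it is
            continue
        # Different segments: keep only the bare tag name they share.
        tag_a = re.split(r"[.#]", a, maxsplit=1)[0]
        tag_b = re.split(r"[.#]", b, maxsplit=1)[0]
        if tag_a != tag_b:
            raise ValueError(f"cannot merge {a!r} and {b!r}")
        merged.append(tag_a or "*")  # "*" if neither segment names a tag
    return " > ".join(merged)


row_selector = generalise(
    "table.classy > tbody > tr.odd",
    "table.classy > tbody > tr.even",
)
print(row_selector)
```

You should still test the result in the Inspector, as described in Step 2, because a selector generalised this way can end up matching more than you intended.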

If you haven't already tried it yourself:

  1. Right-click a data row in the green table above,
  2. Select Inspect Element,
  3. Hover your mouse pointer over the neighboring lines of HTML in the Inspector,
  4. Find one that highlights the entire table row and click it,
  5. Note the selector in the top row of the Inspector,
  6. Click the next table data row in the Inspector, and
  7. If both selectors are identical, write that down. If they are not, try to come up with a more generalised selector and use the information below to test it.

 

Step 2: Testing your selector

You can test your selectors using another built-in feature of the Tor Browser. This time, we will be looking at the Inspector's right-hand pane, which is highlighted below. If your Inspector only has one pane, you can click the tiny "Expand pane" button near its upper, right-hand corner.

[Screenshots: Expanded pane; Adding a new rule; New rule added; Modifying new rule; Showing elements]

Now you can:

  1. Right-click anywhere in the new pane and select Add new Rule. This will add a block of code like the one shown in the second screenshot thumbnail above. (The highlighted bit will depend on what's currently selected in the Inspector.)
  2. Delete the highlighted contents, paste in the selector you want to test and press <Enter>.
  3. Click the nearly-invisible "Show elements" crosshairs just to the right of your new "Rule." This will turn the crosshairs blue and highlight all of the page elements that match your selector.
  4. If this highlighting includes everything you want (and nothing you don't), then your selector is good. Otherwise, you can modify the selector and try again.

When testing your row selector, in particular, you want all of the table rows to be highlighted.

You can click the Show elements icon again to cancel the highlighting. You can close the right-hand Inspector pane by clicking the "Collapse pane" button. These new "Rules" are temporary and will disappear as soon as you reload the page.

Step 3: The "next page" selector

If the table data you want to scrape is spread over multiple pages, you will have to configure a next page selector as well. Fortunately, determining this selector is usually much easier. Because you only need to select a single element, you can often use the "Copy Unique Selector" option in the Tor Browser's Inspector as shown below.

[Screenshots: Inspect next link; Hovering over the whole paragraph; Hovering over the anchor; Copy Unique Selector]

To do this, simply:

  1. Right-click the next page link on the first page containing the table you want to scrape,
  2. Select Inspect Element,
  3. Hover your mouse pointer over nearby HTML lines in the Inspector until you find the correct <a>...</a> link, and
  4. Right-click that element in the Inspector and select Copy Unique Selector.

You may be able to use this selector as it is. If you want to make sure, just click the "next page" link manually and follow the same steps on the second page. If the selectors are the same, they should work for all pages that contain table data.

If you have to modify the selector — or if you choose to shorten or simplify it — you can test your alternative using the method described in Step 2, above. Don't worry if your final selector highlights multiple next page links, as long as it doesn't highlight anything else. The Scrapy template we recommend below only pays attention to the first "match."

Try following these steps for the "next page" link just beneath the green table above. The Tor Browser's Copy Unique Selector option should produce the following:

#next_page

(Which is equivalent, in this case, to a#next_page.) And, if you test this selector, you should see that clicking the Show elements crosshairs only highlights this one page element.

Step 4: The column selectors

You will often need to identify several column selectors. The process for doing so is nearly identical to how we found our row selector. The main differences are:

  • You will need a selector for each column of data you want to scrape,
  • Your selectors should be relative to the row, so they will not begin with segments like html, table, tr, etc.

For the first column of our green table, the Tor Browser's Inspector gives us:

html > body > div.container.main > div.row > div.span8.blog > table.classy > tbody > tr > td.special > p

If we remove everything up to and including the row (tr) segment, we end up with:

td.special > p

If we test this selector by expanding the Inspector, adding a new "rule" for it and clicking the Show elements crosshairs, as discussed in Step 2, we should see highlighting on both cells in column one.

As mentioned above, the second column requires a bit of cleverness because it has no class. If you test the following selector, though, you will see how we can use the nth-child(n) trick to get the job done:

td:nth-child(2) > p

Something to keep in mind when testing column selectors: as mentioned above, a column selector is relative to its "parent" row. As a result, when you are testing your column selectors, you might occasionally see highlighted content elsewhere on the page. This is fine, as long as nothing else in the table is highlighted. (If you're concerned about this, you can stick your row selector on the front of your column selector and test it that way. In this example, we would use the following:

table.classy > tbody > tr td:nth-child(2) > p

Just be sure not to use this combined, row-plus-column selector later on, when we're actually trying to scrape data.)

Step 5: Selector suffixes

There is one final step before we can start scraping data. The selectors discussed above identify HTML elements. This is fine for row selectors, but we typically want our column selectors to extract the contents of an HTML element. Similarly, we want our next page selectors to match the value of the actual link target. In other words, we want just /guides/scraping-2 rather than the following:

<a href="/guides/scraping-2">Next page</a>

To achieve this, we add a short "suffix" to column and next page selectors. The most common suffixes are:

  • ::text, which extracts the content between the selected HTML tags. We use it here to get the actual table data.
  • ::attr(href), which matches the value of the href attribute inside an HTML tag. We use it here to get the "next page" URL so our scraper can load the second page, for example, when it's done with the first.
  • ::attr(src), which matches the value of a src attribute. We do not use it here, but it is helpful when scraping tables that include images.

So, in the end, our final selectors are:

  • Row selector: table.classy > tbody > tr
  • Next page selector: #next_page::attr(href)
  • Column selectors
    • Column one: td.special > p::text
    • Column two: td:nth-child(2) > p::text
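The difference between selecting an element and extracting something from it can be seen in a short Python sketch. It does not use Scrapy. Instead, it leans on the standard library's xml.etree.ElementTree module, where reading an element's .text is roughly what the ::text suffix does and calling .get("href") is roughly what ::attr(href) does. The ROW and LINK fragments are copied from the example page.

```python
import xml.etree.ElementTree as ET

# Two fragments taken from the "classy" example page.
ROW = ET.fromstring(
    '<tr><td class="special"><p>Page one, row one, column one</p></td></tr>'
)
LINK = ET.fromstring('<a id="next_page" href="/guides/scraping-2">Next page</a>')

# Without a suffix, a column selector gives us the whole element...
cell = ROW.find("./td[@class='special']/p")
element_markup = ET.tostring(cell, encoding="unicode")

# ...whereas "::text" asks for what sits between the tags,
cell_text = cell.text

# and "::attr(href)" asks for the value of the href attribute.
next_url = LINK.get("href")

print(element_markup)
print(cell_text)
print(next_url)
```

The first line printed still includes the <p> tags, which is rarely what you want in a spreadsheet. The second and third lines show the values that the suffixes are meant to hand back.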

Configuring Scrapy

Now that we have a basic understanding of HTML, CSS and how to use the Tor Browser's Inspector to distill them into a handful of selectors, it is time to enter those selectors into Scrapy. Scrapy is a free and open-source Web scraping platform written in the Python programming language. In order to use it, we will:

  1. Install Scrapy;
  2. Create a small Python file called a "spider" and configure it by plugging in:
    • The URL we want to scrape,
    • Our row selector, and
    • Our column selectors;
  3. Tell Scrapy to run our spider on a single page;
  4. Check the results and make any necessary changes; and
  5. Plug in our next page selector and run the spider again.

Step 1: Installing Scrapy

You can install Scrapy on Tails by launching Terminal, entering the command below and providing your passphrase when prompted:

sudo apt-get install python-scrapy

To install software on Tails, you need to set an administration passphrase when you boot your system. If you are already running Tails, and you did not do this, you will have to restart your computer. You might also want to configure the "encrypted persistence" feature, which allows you to save data within the /home/amnesia/Persistent folder. Otherwise, anything you save will be gone next time you boot Tails.

You do not have to configure Persistence to use Scrapy, but it makes things easier. Without Persistence, you will have to save your "spider" file on an external storage device and re-install Scrapy each time you want to use it.

Even with Persistence enabled, you will have to reinstall Scrapy each time you restart unless you add a line that says python-scrapy to the following file:

/live/persistence/TailsData_unlocked/live-additional-software.conf

To do this, you can launch Terminal, enter the following command and provide your passphrase when prompted:

sudo gedit /live/persistence/TailsData_unlocked/live-additional-software.conf

This will open the gedit text editor, give it permission to modify system files and load the contents (if any) of the necessary configuration file. Then add the line above (python-scrapy) to the file, click the [Save] button and quit the editor by clicking the X in the upper, right-hand corner. Unless you have edited this file before, while running Tails with Persistence, the file will be blank when you first open it.

On Debian GNU/Linux systems other than Tails, you will need to install Tor, torsocks and Scrapy, which you can do by launching Terminal, entering the following command and providing your passphrase when prompted.

sudo apt-get install tor torsocks python-scrapy

Step 2: Creating and configuring your spider file

There are many different ways to use Scrapy. The simplest is to create a single spider file that contains your selectors and some standard Python code. You can copy the content of a generic spider file from here.

You will need to paste this code into a new file in the /home/amnesia/Persistent folder so you can edit it and add your selectors. To do so, launch Terminal and enter the command below:

gedit /home/amnesia/Persistent/spider-template.py

Paste in the contents from here, and click the [Save] button.

Now we just have to name our spider, give it a page on which to start scraping and plug in our selectors. All of this can be done by editing lines 7 through 14 of the template:

    ### User variables
    # 
    start_urls = ['https://some.website.com/some/page']
    name = 'spider_template'
    row_selector = '.your .row > .selector'
    next_selector = 'a.next_page.selector::attr(href)'
    column_selectors = {
        'some_column': '.contents.of > .some.column::text',
        'some_other_column': '.contents.of > a.different.column::attr(href)', 
        'a_third_column': '.contents.of > .a3rd.column::text'
    }

The start_urls, name, row_selector and next_selector variables are pretty straightforward:

  • start_urls: Enter the URL of the first page that contains table data you want to scrape. Use https if possible and remember to keep the brackets ([ & ]) around the URL.
  • name: This one is pretty unimportant, actually. Name your scraper whatever you like.
  • row_selector: Enter the row selector you came up with earlier. If you want to test your spider on the green table above, you can use table.classy > tbody > tr
  • next_selector: The first time you try scraping a new table, you should leave this blank. To do so, delete everything between the two ' characters. The resulting line should read next_selector = ''

The column_selectors item is a little different. It is a collection that contains one entry for each column selector. You can set both a key — the text before the colon (:) — and a value for each of those entries. The example keys in the template above are some_column, some_other_column and a_third_column. The text you enter for these keys will provide the column headings of the .csv file you are going to create. The example values are as follows:

.contents.of > .some.column::text
.contents.of > a.different.column::attr(href)
.contents.of > .a3rd.column::text

These selectors are completely made up and pretty much guaranteed to fail on every webpage ever made. You should, of course, replace them with the selectors you identified in the previous section.

If you want to test your spider on the green table above, these lines should look something like the following:

    ### User variables
    # 
    start_urls = ['https://exposingtheinvisible.org/guides/scraping']
    name = 'test-spider'
    row_selector = 'table.classy > tbody > tr'
    next_selector = ''
    column_selectors = {
        'first': 'td.special > p::text',
        'second': 'td:nth-child(2) > p::text'
    }

Everything else in this file should remain unchanged. And remember, you are editing Python code, so try not to change the formatting. Punctuation marks like quotes (' & "), commas (,), colons (:), braces ({ & }) and brackets ([ & ]) all have special meaning in Python. So does the indentation: four spaces before most of these lines and eight spaces before the lines that contain column selectors. Also, lines that begin with a number sign (#) are "comments," which means they will not affect your spider at all. Finally, when adding or removing column selectors, pay attention to the commas (,) at the end of all but the last line, as shown above.
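A missing comma or quote is the most common way to break a spider while editing it. As a rough safeguard, you can ask Python to parse the file before you hand it to Scrapy. The sketch below uses the standard ast module. The find_syntax_error helper and the two sample snippets are our own illustrations, not part of the spider template.

```python
import ast


def find_syntax_error(source):
    """Return None if `source` parses as Python, else a short error message."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return "line {}: {}".format(err.lineno, err.msg)
    return None


GOOD = """
column_selectors = {
    'first': 'td.special > p::text',
    'second': 'td:nth-child(2) > p::text'
}
"""

# The same snippet with the comma after the first selector removed
BAD = GOOD.replace("p::text',", "p::text'", 1)

print(find_syntax_error(GOOD))
print(find_syntax_error(BAD))
```

The first call prints None and the second prints an error message. To check a real file, you could pass in its contents with find_syntax_error(open('spider-template.py').read()). Note that this only catches formatting mistakes. It cannot tell you whether your selectors match anything.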

After you have modified the start_urls, name, row_selector and next_selector entries, and added both keys and values for each of your column selectors, you should save your new spider by clicking the [Save] button in gedit.

After configuring your spider, you should test it.

Step 3: Testing your spider

If you are using Tails, and if you did not change the location (/home/amnesia/Persistent) or the file name (spider-template.py) of your spider, you can test drive it by launching Terminal and entering the commands below:

cd /home/amnesia/Persistent
torsocks -i scrapy runspider spider-template.py -o extracted_data.csv

These commands include the following elements:

  • cd /home/amnesia/Persistent moves to the "Persistence" folder in the Tails home directory, which is where we happened to put our spider
  • torsocks -i tells Scrapy to use the Tor anonymity service while scraping
  • scrapy runspider spider-template.py tells Scrapy to run your spider
  • -o extracted_data.csv provides a name for the output file that will contain the data you scrape. You can name this file whatever you want, but Scrapy will use the file extension at the end (.csv in this case) to determine how it should format those data. You can also output JSON content by using the .json file extension.

While it does its job, Scrapy will display all kinds of information in the Terminal where it is running. This feedback will likely include at least one "ERROR" line related to torsocks:

Unable to resolve. Status reply: 4 (in socks5_recv_resolve_reply() at socks5.c:683)

You can safely ignore this error (along with most of the other feedback). If everything works, you will find a file called extracted_data.csv in the same directory as your spider. That file should contain all of the data scraped from the HTML table. Our example spider will extract the following from the green table above:

second,first
"Page one, row one, column two","Page one, row one, column one"
"Page one, row two, column two","Page one, row two, column one"

As you can see, the resulting columns may appear out of order, but they will be correctly associated with the key values you set for your column descriptors, which make up the first line of data. You can easily re-order the columns once you import your .csv file into a spreadsheet.
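If you prefer to fix the column order before opening a spreadsheet, a few lines of Python will do it. The sketch below uses only the standard csv module. The reorder_columns helper is a hypothetical name of ours, and the sample text simply repeats the output shown above.

```python
import csv
import io


def reorder_columns(csv_text, order):
    """Return `csv_text` rewritten with its columns in the given order."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=order, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in order})
    return out.getvalue()


scraped = (
    'second,first\n'
    '"Page one, row one, column two","Page one, row one, column one"\n'
    '"Page one, row two, column two","Page one, row two, column one"\n'
)

print(reorder_columns(scraped, ["first", "second"]))
```

The names you pass in the order list must match the keys you chose for your column selectors.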

Troubleshooting tips:

  • If your spider fails, look for lines that include [scrapy] DEBUG:. They might help you figure out what broke.

  • If you want to quit Scrapy while it is still running, just hold down the <Ctrl> key and press c while in the Terminal application.

  • If the named output file already exists (extracted_data.csv, in this case), Scrapy will append new data to the end of it, rather than replacing it. So, if things don't go according to plan and you want to try again, you should first remove the old file by entering rm extracted_data.csv.

  • When Scrapy is done, it will display the following:

    [scrapy] INFO: Spider closed (finished)

Step 4: Opening your CSV output in LibreOffice Calc

Follow the steps below to confirm that your spider worked (or, if you've already got what you came for, to begin cleaning and analysing your data):

  1. Launch LibreOffice Calc by clicking the Applications menu in the upper, left-hand corner of your screen, hovering over the Office sub-menu and clicking LibreOffice Calc.
  2. In LibreOffice Calc, click File and select Open
  3. Navigate to your .csv file and click the [Open] button
  4. Configure the following options if they are not already set by default:
    • Character set: Unicode (UTF-8)
    • From row: 1
    • Check the "Comma" box under Separator options
    • Uncheck all other Separator options
    • Select " under Text delimiter
  5. Click the [OK] button.
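The import options above (UTF-8, comma separator, double-quote text delimiter) have direct equivalents in Python's csv module, which can be handy for a quick sanity check before you open a large file. The following is a minimal sketch. The check_csv helper is our own invention, and it assumes the whole file fits comfortably in memory.

```python
import csv
import io


def check_csv(data, encoding="utf-8"):
    """Parse CSV bytes and return (header, rows).

    Raises ValueError if any row has a different number of fields
    than the header row.
    """
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=",", quotechar='"')
    header = next(reader)
    rows = []
    for number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError("row {} has {} fields, expected {}".format(
                number, len(row), len(header)))
        rows.append(row)
    return header, rows
```

You could call it with check_csv(open('extracted_data.csv', 'rb').read()). Commas inside quoted cells are kept as part of the cell, just as they are in LibreOffice Calc when the text delimiter is set correctly.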

Step 5: Running your spider on multiple pages

If everything in your .csv file looks good, and you are ready to try scraping multiple pages, just configure the next page selector in your spider and run it again.

If you are using Tails with Persistence, you can open your spider for editing with the same command we used before:

gedit /home/amnesia/Persistent/spider-template.py

Those first few lines should now look something like:

    ### User variables
    # 
    start_urls = ['https://exposingtheinvisible.org/guides/scraping']
    name = 'test-spider'
    row_selector = 'table.classy > tbody > tr'
    next_selector = '#next_page::attr(href)'
    column_selectors = {
        'first': 'td.special > p::text',
        'second': 'td:nth-child(2) > p::text'
    }

The only difference is the addition of an actual selector (instead of '') for the next_selector variable.

Finally, click the [Save] button, remove your old output file and tell Scrapy to run the updated spider:

cd /home/amnesia/Persistent
rm extracted_data.csv
torsocks -i scrapy runspider spider-template.py -o extracted_data.csv

Your .csv output should now include data from the second page:

second,first
"Page one, row one, column two","Page one, row one, column one"
"Page one, row two, column two","Page one, row two, column one"
"Page two, row one, column two","Page two, row one, column one"
"Page two, row two, column two","Page two, row two, column one"

Troubleshooting tips:

  • If you are using Tails but have not enabled Persistence, copy your output file somewhere before shutting down your Tails system.

  • If you want to keep an old version of your spider while testing out a new one, just make a copy and start working with the new one instead. You can do this by entering the following command in your Terminal:

    cp spider-template.py spider-new.py

    Of course, you will then use the following, instead of the command shown above, to edit your new file:

    gedit /home/amnesia/Persistent/spider-new.py

    And to run your new spider, you will use the following:

    torsocks -i scrapy runspider spider-new.py -o extracted_data.csv

Real world examples

The sections below cover three multi-page HTML data tables that you might want to scrape for practice. These tables are (obviously) longer than the example above. They are also a bit more complex, so we will point out the key differences and how you can address them when configuring your spider.

For each website, we include sections on:

  • What is different about this table;
  • The URL, name and selector values for your spider; and
  • Other custom settings

Resources for refugees in a Berlin neighborhood

The city of Berlin publishes a list of resources for refugees in various neighbourhoods. The listing for Friedrichshain and Kreuzberg includes enough entries to be useful as a sample scraping target.

[Copy and paste the spider code into a new file]

What is different about this table

This table has three interesting characteristics:

  1. It has a long, complex starting URL;
  2. The data we want to scrape include an internal link to a page elsewhere on the site; and
  3. One of the cells we want to scrape contains additional HTML markup.

1. Complex starting URLs:

The URL we will use to scrape this table corresponds to the results page of a search form. It is:

http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/?q=&sprache=--+Alles+--&q_geo=&ipp=5#searchresults

This does not affect the configuration of our spider or the command we will use to run it, but it is worth noting that you will often need to browse through a website — navigating to subpages, using search forms, filtering results, etc. — in order to find the right starting URL.

2. Internal links:

The table column (Träger) that corresponds to the organisation responsible for the resource in that row includes a Mehr… link to a dedicated page about that resource. We want the full URL, but the HTML attribute we will scrape only provides one that is relative to the website's domain. Our spider template will try to handle this automatically if you use 'link': as the key value for that column selector, as shown below. If it does not work properly when scraping some other website, you can just choose a different key for that column selector ('internal_link':, for example) and fix the URL yourself once you get it into a spreadsheet.
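If you do end up fixing relative links yourself, Python's standard library can do the joining for you. The sketch below is not taken from the spider template, whose internals we do not reproduce here. The absolute_link helper and the example href values are hypothetical.

```python
from urllib.parse import urljoin

# The page on which the relative links were found
PAGE_URL = ('http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/'
            'fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/')


def absolute_link(page_url, href):
    """Turn a scraped href into a full URL, relative to the page it came from."""
    return urljoin(page_url, href)


# Hypothetical href values, for illustration only
print(absolute_link(PAGE_URL, '/some/internal/page.php'))
print(absolute_link(PAGE_URL, 'https://example.org/elsewhere'))
```

The first call prints http://www.berlin.de/some/internal/page.php. The second prints the URL unchanged because it was already absolute.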

3. Internal HTML markup:

That same (Träger) column also includes HTML markup inside the text description of some organisations. As a result, the selector suffix we would normally use (::text) does not work properly. Instead, we have to extract the entire HTML block. (Notice that the organisation column selector below does not include a suffix.) If you are a Python programmer, you can fix this quite easily inside your spider. Otherwise, you will have to clean it up in your spreadsheet application.
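For readers who want to try the Python route, here is one possible way to strip markup from a scraped cell using only the standard library. It is a sketch rather than the fix built into any particular spider. The strip_markup helper and the sample fragment are our own, and treating p, div and br as whitespace is a simplifying assumption.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Treat block-level tags and line breaks as whitespace
        if tag in ("p", "div", "br"):
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(fragment):
    """Return the plain text of `fragment` with whitespace normalised."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return " ".join("".join(collector.chunks).split())


# Hypothetical cell content, for illustration only
print(strip_markup('<p>Example <strong>organisation</strong> e.V.<br/>Berlin &amp; beyond</p>'))
```

This prints Example organisation e.V. Berlin & beyond. You could apply the same function to every value in the organisation column of your .csv file.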

URL, name and selectors

Spoiler alert. Below you will find all of the changes you would need to make in the spider-template.py file to scrape several pages' worth of refugee resources in Friedrichshain and Kreuzberg. Before you continue, though, you should visit the webpage in the Tor Browser and, if necessary, consult the section on Using the Tor Browser's Inspector to identify the selectors you need. Then try to determine selectors for the following:

  • Row selector
  • Next page selector
  • Column selectors:
    • Supporting organisation
    • Description of the resource being offered
    • The language(s) supported
    • The location of the resource
    • An internal link to a more descriptive page

Corresponding User variables for your spider:

    start_urls = ['http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/?q=&sprache=--+Alles+--&q_geo=&ipp=5#searchresults']
    name = 'refugee_resources'
    row_selector = 'tbody tr' 
    next_selector = '.pager-item-next > a:nth-child(1)::attr(href)' 
    column_selectors = {
        'organisation': 'td:nth-child(1) > div > p', 
        'offer': 'td:nth-child(2)::text', 
        'language': 'td:nth-child(3)::text', 
        'address': 'td:nth-child(4)::text', 
        'link': 'td:nth-child(1) > a:nth-child(2)::attr(href)' 
    }

Other custom settings

To scrape this table, you can use the default collection of custom_settings in the spider template, which is shown below. As mentioned above, the lines beginning with # are comments and will not affect the behaviour of your spider. For each of the two remaining examples, you will uncomment at least one of those lines.

    custom_settings = {
        # 'DOWNLOAD_DELAY': '30',
        # 'DEPTH_LIMIT': '100',
        # 'ITEM_PIPELINES': {
        #     'scrapy.pipelines.files.FilesPipeline': 1,
        #     'scrapy.pipelines.images.ImagesPipeline': 1
        # },
        # 'IMAGES_STORE': 'media',
        # 'IMAGES_THUMBS': { 'small': (50, 50) },
        # 'FILES_STORE': 'files',
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
        'TELNETCONSOLE_ENABLED': False,
        'DOWNLOAD_HANDLERS': {'s3': None}
    }

Hacker News

In this section, we will use Y Combinator's Hacker News as a second "real world" example. Of course, in the real world, one would not need to scrape Hacker News because the website offers both an RSS feed and an application programming interface (API). Abusing Hacker News to practice scraping is something of a tradition, though, and it is a nice example of a multi-page HTML table that is not too complex.

[Copy and paste the spider code into a new file]

What is different about this table

If you restrict your scraping to the first line of each article listed on Hacker News, as we do here, this table is very similar to the tiny green example we've been using up until now. It's just bigger. We introduce one new challenge below, but it is quite straightforward.

Scraping a row attribute:

If you look at a Hacker News article using the Tor Browser's Inspector, you will see that each table row (tr) containing a story has a unique id attribute. IDs like this are often sequential, which can be a useful way to keep track of the order in which content is presented. Even though they are not sequential in this particular table, it might still be useful to scrape them. But most of our column selectors are inside table data (td) elements, and we normally specify their selectors relative to their parent row. So how do we capture an attribute of the row itself? Just use a selector that consists of nothing but a suffix. In this case, as shown below, that would be: ::attr(id).
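To make the idea concrete, the sketch below pulls the id attribute from each matching row using only Python's standard library. It does not use Scrapy's selectors. It simply shows that the value we are after lives on the tr element itself rather than on anything inside it. The RowIdCollector class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class RowIdCollector(HTMLParser):
    """Collect the id attribute of every <tr> that has the given class."""

    def __init__(self, row_class):
        super().__init__()
        self.row_class = row_class
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.row_class in classes and "id" in attributes:
            self.ids.append(attributes["id"])


# Made-up markup that loosely imitates the structure described above
SAMPLE = """
<table class="itemlist">
  <tr class="athing" id="1001"><td class="title"><a class="storylink" href="https://example.org/a">First</a></td></tr>
  <tr class="spacer"></tr>
  <tr class="athing" id="1002"><td class="title"><a class="storylink" href="https://example.org/b">Second</a></td></tr>
</table>
"""

collector = RowIdCollector("athing")
collector.feed(SAMPLE)
print(collector.ids)
```

This prints ['1001', '1002']. The spacer row is skipped because it does not carry the athing class.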

URL, name and selectors

Spoiler alert. Below you will find all of the changes you would need to make in the spider-template.py file to scrape several pages' worth of articles on Hacker News. Before you continue, though, you should visit the website in the Tor Browser and, if necessary, consult the section on Using the Tor Browser's Inspector to identify the selectors you need. Then try to determine selectors for the following:

  • Row selector
  • Next page selector
  • Column selectors:
    • Article index
    • Article title
    • Web address for the article itself

Corresponding User variables for your spider:

    start_urls = ['https://news.ycombinator.com/news']
    name = 'hacker_news'
    row_selector = 'table.itemlist > tr.athing'
    next_selector = 'a.morelink::attr(href)'
    column_selectors = {
        "index": "::attr(id)",
        "title": "td.title > a.storylink::text",
        "external_link":  "td.title > a.storylink::attr(href)"
    }

Other custom settings

The Hacker News robots.txt file specifies a Crawl-delay of 30 seconds, so we should make sure that our spider does not scrape too quickly. As a result, it may take up to ten minutes to scrape all of the table data.

    custom_settings = {
        'DOWNLOAD_DELAY': '30',
        # 'DEPTH_LIMIT': '100',
        # 'ITEM_PIPELINES': {
        #     'scrapy.pipelines.files.FilesPipeline': 1,
        #     'scrapy.pipelines.images.ImagesPipeline': 1
        # },
        # 'IMAGES_STORE': 'media',
        # 'IMAGES_THUMBS': { 'small': (50, 50) },
        # 'FILES_STORE': 'files',
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
        'TELNETCONSOLE_ENABLED': False,
        'DOWNLOAD_HANDLERS': {'s3': None}
    }
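To get a feel for how a download delay translates into waiting time, you can do the arithmetic in a few lines of Python. The sketch below assumes one request per page and ignores the time spent downloading. It also reflects Scrapy's default behaviour of randomising each wait between 0.5 and 1.5 times the configured delay. The crawl_time_seconds helper and the figure of 21 pages are our own assumptions, chosen only to illustrate the calculation.

```python
def crawl_time_seconds(num_pages, download_delay):
    """Estimate (low, nominal, high) waiting time in seconds.

    Assumes one wait between each pair of consecutive page requests,
    with each wait falling between 0.5 and 1.5 times the delay.
    """
    waits = max(num_pages - 1, 0)
    nominal = waits * download_delay
    return nominal * 0.5, nominal, nominal * 1.5


# Hypothetical crawl: 21 pages with a 30 second delay
low, nominal, high = crawl_time_seconds(21, 30)
print(low / 60, nominal / 60, high / 60)
```

For these numbers, the estimate is 5.0, 10.0 and 15.0 minutes respectively.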

CCTV cameras in Bremen

The German city of Bremen provides a list of public video cameras, along with a thumbnail image of what each camera can see.

[Copy and paste the spider code into a new file]

What is different about this table

The Bremen CCTV table has three main quirks:

  1. The page structure makes it unusually difficult to identify a reliable next page selector,
  2. To solve the next page selector challenge, we have to give our spider a maximum number of pages to scrape, and
  3. We want to download the actual image file associated with each CCTV camera.

1. Elusive next page selector:

The CSS class used to style the "next arrow" is only applied to the icon, not to the link itself, so we cannot use it for our next page selector. To make matters worse, the row of page navigation links only displays "first page" and "previous page" links after you have left the first page, so our usual :nth-child(n) trick does not work. As a result, we have to get creative to come up with a selector that will work on every page. In the example below, we use a different trick (:nth-last-child(n)) so we can count from the end rather than from the beginning.
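The difference between counting from the start and counting from the end is easy to see with two plain Python lists. The navigation rows below are made up for illustration and do not reproduce the actual Bremen markup. The two helper functions merely mimic what :nth-child(n) and :nth-last-child(n) do.

```python
def nth_child(items, n):
    """Mimic CSS :nth-child(n): the n-th item counting from the start."""
    return items[n - 1]


def nth_last_child(items, n):
    """Mimic CSS :nth-last-child(n): the n-th item counting from the end."""
    return items[-n]


# Hypothetical navigation rows: extra links appear after the first page
FIRST_PAGE = ["1", "2", "3", "next", "last"]
LATER_PAGE = ["first", "previous", "1", "2", "3", "next", "last"]

print(nth_child(FIRST_PAGE, 4), nth_child(LATER_PAGE, 4))
print(nth_last_child(FIRST_PAGE, 2), nth_last_child(LATER_PAGE, 2))
```

The first line prints next 2, which shows how counting from the start breaks once the extra links appear. The second line prints next next, because the "next" link keeps its position relative to the end.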

2. Specifying a "depth limit":

When we get to the end, the "last page" and "next page" links disappear, which would normally make our scraper "click" on the link for a page that it had already scraped. It would then go back a couple of pages. Until it got to the end (again), at which point it would go back a couple of pages (again). This would create an infinite loop and our poor spider would end up scraping forever. By setting the DEPTH_LIMIT to 11 in the spider's custom_settings, we can tell it to stop after it reaches the last page.
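The toy simulation below illustrates why a depth limit breaks the loop. It is not a model of Scrapy's scheduler. It simply follows a hypothetical "next" link until it has made a fixed number of hops. The five-page site, the follow_next_links helper and the toy_next_page function are all invented for this example.

```python
def follow_next_links(next_page, start, depth_limit):
    """Follow "next" links from `start`, stopping after `depth_limit` hops."""
    visited = [start]
    page = start
    for _ in range(depth_limit):
        page = next_page(page)
        if page is None:
            break
        visited.append(page)
    return visited


LAST_PAGE = 5


def toy_next_page(page):
    # Hypothetical site: on the last page, the selector lands on a link
    # that points two pages back instead of forward.
    return page + 1 if page < LAST_PAGE else LAST_PAGE - 2


print(follow_next_links(toy_next_page, 1, depth_limit=4))
print(follow_next_links(toy_next_page, 1, depth_limit=9))
```

With a limit of 4, the simulation visits pages 1 to 5 and stops. With a limit of 9, it reaches page 5 and then starts revisiting pages 3, 4 and 5. In this toy model, a limit equal to the number of pages minus one stops exactly on the last page.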

3. Downloading image files:

This is the first time we are asking our spider to download image files. Scrapy makes this quite easy, but it does require:

  • Using the special image_urls key for the corresponding column selector, and
  • Uncommenting a few custom_settings by removing the # character at the beginning of the corresponding lines.

URL, name and selectors

Spoiler alert. Below you will find all of the changes you would need to make in the spider-template.py file to scrape several pages' worth of camera locations and image thumbnails. Before you continue, though, you should visit the website in the Tor Browser and, if necessary, consult the section on Using the Tor Browser's Inspector to identify the selectors you need. Then try to determine selectors for the following:

  • Row selector
  • Next page selector
  • Column selectors:
    • Name of the camera
    • Status of the camera
    • Location of the camera
    • A URL for the camera's current thumbnail image

Corresponding User variables for your spider:

    start_urls = ['http://www.standorte-videoueberwachung.bremen.de/sixcms/detail.php?gsid=bremen02.c.734.de']
    name = 'bremen_cctv'
    row_selector = 'div.cameras_list_item' 
    next_selector = 'ul.pagination:nth-child(4) > li:nth-last-child(2) > a::attr(href)' 
    column_selectors = {
        'title': '.cameras_title::text',
        'status': '.cameras_title .cameras_status_text::text',
        'address': '.cameras_address::text',
        'image_urls': '.cameras_thumbnail > img::attr(src)'
    }

Other custom settings

To address the issues described in the What is different about this table section, we will use the custom settings below.

    custom_settings = {
        # 'DOWNLOAD_DELAY': '30',
        'DEPTH_LIMIT': '11',
        'ITEM_PIPELINES': {
        #   'scrapy.pipelines.files.FilesPipeline': 1,
            'scrapy.pipelines.images.ImagesPipeline': 1
        },
        'IMAGES_STORE': 'media',
        'IMAGES_THUMBS': { 'small': (50, 50) },
        # 'FILES_STORE': 'files',
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
        'TELNETCONSOLE_ENABLED': False,
        'DOWNLOAD_HANDLERS': {'s3': None}
    }

The DEPTH_LIMIT setting above makes sure that we stop scraping when we get to the last page. The uncommented item in the collection of ITEM_PIPELINES tells Scrapy to download and save all images identified by the image_urls column selector. IMAGES_STORE tells it to put those image files into a folder called media, which it will create automatically. IMAGES_THUMBS tells it to create a very small thumbnail image for each one.

(scrapy.pipelines.files.FilesPipeline, FILES_STORE and the special file_urls column selector key can be used to download other sorts of files from a webpage. This might include PDFs, Microsoft Word documents, audio files, etc. We leave these options disabled in the settings above.)

After you run your spider, have a look at your .csv output and explore that media folder to see exactly how this works.
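When you explore the media folder, you will notice that the image files do not keep their original names. As far as we know, Scrapy's image pipeline names each file after the SHA-1 hash of the URL it was downloaded from, storing full-size images under full and thumbnails under thumbs, followed by the thumbnail name from IMAGES_THUMBS. The sketch below reproduces that naming scheme so you can match a URL to a file. Treat it as an assumption to verify against your own output, since the scheme may differ between Scrapy versions. The expected_image_paths helper and the example URL are hypothetical.

```python
import hashlib


def expected_image_paths(image_url, thumb_name="small"):
    """Guess where the image pipeline stores an image, relative to IMAGES_STORE.

    Assumes the default naming scheme: the SHA-1 hash of the full image
    URL, with a .jpg extension.
    """
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return {
        "full": "full/{}.jpg".format(digest),
        "thumb": "thumbs/{}/{}.jpg".format(thumb_name, digest),
    }


# Hypothetical image URL, for illustration only
paths = expected_image_paths("http://example.org/cameras/thumbnail-1.jpg")
print(paths["full"])
print(paths["thumb"])
```

The URL you pass in should be the full address that was actually requested, not a relative src value copied from the page.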

 

Once you understand the configuration changes needed to make these three Scrapy spiders do their jobs, you should be able to securely and privately scrape structured data, images and files from most HTML tables on the web. As always, there is more work to do. You will still have to clean, analyse, present and explain the data you scrape, but we hope the information above will give you a place to start next time you stumble across an important trove of tabular data that does not have a "download as CSV" link right there next to it.

 

Published on 29 May 2017. Follow us @seeingsideways, get in touch, or read another of our guides here