Web Archiving and Retrieving Archived Information Online

This workshop introduces participants to methods, tools and tips to find and retrieve historical and ‘lost’ information from websites, as well as to archive and preserve copies of webpages for future reference and evidence that something existed online.

Workshop Overview

Topic: Web archiving and retrieving archived information online: methods, tips and tools.

Aims

  • To introduce participants to hands-on methods and tools of finding and retrieving historical and "lost" information from websites.
  • To introduce participants to methods and tools of archiving and preserving copies of webpages for future reference.
  • To demonstrate the importance of web archives and archival efforts for preserving digital content and providing evidence that something existed online.

Learning Outcomes

  • Apply ways to find and retrieve historical and "lost" information from websites.
  • Apply ways to archive and preserve your own copies of webpages for future reference.

General guidelines for trainers:

  • This workshop can be divided into 30-40 minute long sessions. Breaks are not included in the timeline; you can decide when to allocate them based on your context. Between sessions, you can add a short break or a quick energizer activity.
  • For group activities, divide participants into teams of 3-5 people. Please adapt times allocated to feedback and post-exercise discussions/debriefing based on the number of participants and size of groups. You can also encourage participants to assign various roles when working in groups. These roles can include Facilitator, Note-taker, Timekeeper, Presenter or Artist (if a visual presentation is required).
  • For online workshops, we recommend sharing a timer on the screen during energizers and small group activities.
  • Whenever possible, adapt the workshop examples to the context of your audience.

Mode of delivery: online / in-person workshops

Workshop duration (without breaks): 2 hours and 40 minutes

Number of participants: 6 to 24

Tools

For online workshops:

  • Video-conferencing platform of your choice
  • Online poll and quiz apps (e.g. Slido, Mentimeter, etc.)
  • Whiteboard application (such as Mural, Miro, etc.)

For offline / in-person workshops:

  • computers for each participant or one per team
  • whiteboard
  • flip chart paper
  • markers
  • post-its

Related Exposing the Invisible guide:

Workshop activities and templates, to download:

Learning Activities

Opening (10 minutes)

Workshop introduction

Read Watch Listen | 5 minutes

Instructions

  • Grab attention by posing a question or commenting on a relevant topic, image, etc.

  • Introduce yourself and the goals of the workshop.

  • Optional: introduce the source of the workshop material (Tactical Tech)

  • Inform participants of the workshop agenda.

  • Suggest ground rules for the workshop: how you expect participants to act and react, respect each other, etc. Depending on the time you spend with the workshop participants, if you are running a longer training you could also consider working on a Shared Agreement or commonly agreed Code of Conduct (see suggestions in the ETI Facilitator's Guide.)

Participants' introductions / Icebreaker

Produce | 10 minutes

Instructions

  • Facilitate a round of introductions by asking participants to answer a couple of questions about themselves, their work, their workshop expectations, etc.

  • Alternatively, you can pick an icebreaker exercise that encourages participants to get creative by drawing answers or ideas on an online whiteboard or, if offline, to stand up and perform some tasks. Check the Icebreakers section in the ETI Facilitator's Manual for inspiration.

Introduction to Web Archives (20 min)

Presentation: Why archiving?

Read Watch Listen | 5 minutes

Instructions

  • Conduct a short presentation introducing the need for internet archiving techniques and tools. While you speak, you can invite participants to think of relevant examples.

Suggested script:

  • Sometimes, when you want to verify online information, you'll end up following a trail that leads to broken links or to websites that are no longer available.

  • Other times, you will come across websites with vital information that could add great value to a story, but you will not realize this value until later.

  • When you revisit that website to reference it, you may find that it no longer exists, that the specific webpage you remember has been removed or that the information you need is no longer accessible and has been replaced with new content.

  • You are likely to face all of these challenges at some point during the course of your research and investigations.

Case Study: "Lost" pages

Read Watch Listen | 5 minutes

Instructions

  • Present a situation in which useful public information was removed from a website and could be retrieved by using online archives such as the Internet Archive's Wayback Machine or Archive Today.

Suggested case and script (more details available in the "Retrieving and Archiving Information from Websites" guide):

RESOURCES:

Utility of web archiving tools

Discuss | 10 minutes

Instructions

  • Divide participants into small groups of 3-5 people, and invite them to discuss:

      1. how web archiving tools and services can be useful in their work, and
      2. what the possible use cases are.
  • Suggest that each group assigns a note-taker to write down the answers.

  • Collect and share the main points on a shared (online) whiteboard / (offline) flip chart paper.

Searching and Retrieving Content from Web Archives (50 minutes)

Web archives: search and retrieve

Read Watch Listen | 15 minutes

Instructions

Conduct a short presentation with live online demonstration or slides about the two main web archiving tools / platforms that can be used to find and retrieve past versions of webpages and other online content:

  • the Wayback Machine
  • Archive Today

The Wayback Machine: https://web.archive.org/

Points to address:

  • What the Wayback Machine is, and a short introduction to its host organisation the Internet Archive and its mission.

  • Briefly mention the two main functions of the Wayback Machine:

    1. users can search and retrieve saved historical website contents, and
    2. users can manually archive webpages in the Wayback Machine (more details on how to save webpages in the next session).
  • Demonstrate - preferably live online - how to search and retrieve timelines of past versions of websites / webpages in the Wayback Machine.

  • Explain what the coloured dots symbolize when viewing a calendar and timeline of archived website results (i.e. green and blue dots relate to successful automated archiving attempts and availability of the archived material; red means unsuccessful archival effort.)

  • Explain how the Wayback Machine's technical process of web search and archival works, mentioning:

    • its use of automatic crawlers to decide which websites and webpages to visit and save, and how often (see more technical details here and here),
    • possible limitations it faces (e.g. it does not completely save webpages including interactive features and moving images, website owners can ask for content to be taken down, etc., see more in the ETI Kit's Archiving guide, section on "Limitations of the Wayback Machine" here).
  • Use the opportunity to bring up and describe new terms such as "robots.txt" and "Sitemap.xml" and how they can prove useful for researchers and investigators. Mention that robots.txt is a file that sits on a website and lists portions of the site that should or should not be accessed by crawlers. See an example here.

  • Note that users can create and use a free account with the Internet Archive and the Wayback Machine, which offers certain benefits when saving webpages (more details in the next session).

  • Emphasize that retrieving saved historical webpage data and archiving webpages with the Wayback Machine can help an investigation and provide an easy way to cite research and link to historical content.

  • Warn participants that they should still download and save such evidence in multiple places, because archived content may be removed if website / webpage owners successfully make a case for this with the Internet Archive.

  • Note that using the free services of the Internet Archive and its Wayback Machine can add an element of neutrality and trust for investigations since this is an independent, reputable global platform whose mission is to preserve information on the internet.

Archive Today: https://archive.ph/

Points to address:

  • Archive Today uses a different mechanism from the Wayback Machine to capture webpage snapshots: it archives only at a user's request, that is, manually.

  • It sends your IP address to the website being archived so that the archived material is relevant to the region of the archiver, but it does not save information about the archiver.

  • It is better at storing public social media pages because it does not respect robots.txt (i.e. saving restrictions indicated by website owners).

  • Demonstrate - ideally live online - how to search and retrieve available past versions of websites and webpages saved by users in Archive Today (tips here).

RESOURCES:

Activity: time machine

Collaborate | 10 minutes

Instructions

  • Divide participants into small groups of 3-5 people, give each group a handout (see template in "Resources" below) with the steps for looking up a website or webpage in the Wayback Machine.

  • Participants need to go back in time to find a full historical capture of one or more websites.

  • One example can be:

  • You (the trainer) can replace this example with others as long as historical data from a selected time range is available on Wayback Machine.

RESOURCE:

Tips for Wayback Machine searches

Read Watch Listen | 10 minutes

Instructions

Prepare a short presentation and online demonstration including:

Tips for retrieving archived webpages directly from a browser:

  • The Wayback Machine also allows you to request archived versions of websites / webpages without going through its website's search interface.

  • Instead, you can do so from your own browser by using several correctly formatted web addresses, such as:

    • https://web.archive.org/www.yoursite.com/ (where www.yoursite.com/ is any site you wish to search). The browser will display the latest archived version of the site.

    • Separate the two addresses with an asterisk (*) to see the website archive's calendar view: https://web.archive.org/*/www.yoursite.com/

    • Add an asterisk to the end: https://web.archive.org/*/www.yoursite.com/* to obtain all of the archives available under that domain, not just the homepage. For example, browsing to https://web.archive.org/web/*/cambridgeanalytica.org/* will display a page-by-page listing of all cambridgeanalytica.org pages archived by the Wayback Machine.
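For trainers who want to show how these addresses are put together, the patterns above can be sketched in a few lines of Python. This is only an illustration and not part of the original workshop material: the function names are hypothetical, and the sketch assumes that the "/web/" path segment seen in the cambridgeanalytica.org example can be used for all three patterns.

```python
# Minimal sketch: build the three Wayback Machine address patterns.
# Assumption: the "/web/" segment from the cambridgeanalytica.org example
# is used for every pattern. Function names are hypothetical.

WAYBACK_BASE = "https://web.archive.org/web"


def latest_capture_url(site: str) -> str:
    """Address that should lead to the latest archived version of a site."""
    return f"{WAYBACK_BASE}/{site}"


def calendar_url(site: str) -> str:
    """Address for the calendar view of a site's captures."""
    return f"{WAYBACK_BASE}/*/{site}"


def all_pages_url(site: str) -> str:
    """Address listing every archived page under a domain."""
    return f"{WAYBACK_BASE}/*/{site.rstrip('/')}/*"


if __name__ == "__main__":
    for build in (latest_capture_url, calendar_url, all_pages_url):
        print(build("www.yoursite.com/"))
```

Participants can paste the printed addresses into a browser and compare the results with what they found through the Wayback Machine's search interface.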

Tips about robots.txt files:

  • Emphasize the significance of robots.txt files for archiving and research.

  • Remind participants that robots.txt is a file that sits on a website and lists portions of the site that should or should not be accessed by crawlers. If a website has a robots.txt file, you can view it by adding "/robots.txt" to its domain or subdomain. For example: https://www.google.com/robots.txt

  • Websites can use this file to block crawlers from the Wayback Machine, from search engines like Google or from any other indexing and archiving service. Some possible reasons why website administrators may opt for restrictive robots.txt files:

    • to limit bandwidth costs,

    • to reduce strain on overloaded servers,

    • to protect trademarked images,

    • to keep unfinished websites from showing up in search results,

    • to obscure potentially sensitive content.

  • While the Wayback Machine does not always comply with these restrictions, there are still many websites that its crawlers refuse to archive as a result of robots.txt directives. This could be a reason (among others) why users might not find certain pages archived, or why they might face restrictions when manually archiving some webpages.

  • If users have trouble using the Wayback Machine to view or archive some but not all of the pages on a website, they can check its robots.txt file to see if any portions of the site are "disallowed."
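To make the idea of "disallowed" portions concrete, trainers can show how a robots.txt file is interpreted by a program. The sketch below uses Python's standard urllib.robotparser module on a made-up robots.txt file. The sample file, the crawler name "ExampleBot" and the helper name is_allowed are hypothetical, and "ia_archiver" is used only as an assumed example of a crawler name that a website might choose to block.

```python
# Minimal sketch: check which pages a robots.txt file allows a crawler to visit.
# The robots.txt content below is invented for illustration only.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /

User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    pages = [
        "http://www.yoursite.com/projects",
        "http://www.yoursite.com/private/report.html",
    ]
    for agent in ("ExampleBot", "ia_archiver"):
        for page in pages:
            print(agent, page, is_allowed(SAMPLE_ROBOTS_TXT, agent, page))
```

In this invented file, a generic crawler is kept out of the /private/ folder only, while the crawler named ia_archiver is kept out of the whole site. A real crawler may or may not honour such rules, as noted above.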

Activity: evolution timeline

Produce | 15 minutes

Instructions

[10 minutes] Timeline creation

  • Divide participants into small groups of 3-5 people, suggesting that each group can assign a note-taker and a presenter for their results.

  • Ask groups to choose a website that they know well (or that is relevant to their work) and create a timeline of screenshots showing the evolution of the website over time.

  • The timeline should include at least 4-5 screenshots.

  • Each group then discusses the timeline they created, looking for interesting changes.

[5 minutes] Debriefing / Sharing with larger group

  • Each group presents their timeline and findings to the others (1 minute per group).

Archiving Content (40 minutes)

Archive and preserve

Read Watch Listen | 10 minutes

Instructions

Make a short presentation and live demonstration including:

  • How to archive websites / webpages on demand using the Wayback Machine and Archive Today.

  • Re-emphasize how web archiving can help save and preserve information for an investigation or ensure the accessibility of your own published work.

  • Use a shared screen function or projector (if offline) to demonstrate the process of saving a webpage.

For the Wayback Machine:

  • Start from https://archive.org/web - "Save Page Now"

  • Remind participants of the role of robots.txt in potentially restricting some website content from being archived

  • Note that manually saving a webpage archives only the page you submitted (e.g. "http://www.yoursite.com/projects"), not its outlinks or the rest of the content on that website.

  • To archive an entire website using this method, one will need to submit each page separately or to create a free account with the Internet Archive, which will allow users to access more features, such as saving the outlinks of a webpage.

  • Show how to create and use a free account with the Wayback Machine and what the benefits of having an account are: e.g. users can save outlinks, archive webpages in bulk, or save archived pages in their account for easier storage and retrieval at a later stage.

  • Indicate that Google Sheets can be used to mass archive website pages in case the website has not been automatically crawled by Wayback Machine or if a collection of pages needs to be archived at a specific date. See details and instructions here and here.
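Because each page has to be submitted separately, it can help to prepare the list of pages in advance. The following Python sketch is one possible way to do so and is not part of the original workshop material. It assumes that "Save Page Now" can also be reached through an address of the form https://web.archive.org/save/ followed by the page address, and the function name save_page_now_urls is hypothetical. The sketch only builds the addresses; it does not contact the Internet Archive.

```python
# Minimal sketch: turn a list of page addresses into "Save Page Now" addresses.
# Assumption: the service accepts https://web.archive.org/save/<page address>.
# Nothing is sent over the network here; the addresses are only printed.

SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_urls(page_urls):
    """Return one save address per unique, non-empty page address, in order."""
    seen = set()
    result = []
    for url in page_urls:
        url = url.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        seen.add(url)
        result.append(SAVE_ENDPOINT + url)
    return result


if __name__ == "__main__":
    pages = [
        "http://www.yoursite.com/projects",
        "http://www.yoursite.com/about",
        "http://www.yoursite.com/projects",  # duplicate, ignored
    ]
    for address in save_page_now_urls(pages):
        print(address)
```

Opening each printed address in a browser, one at a time and at a gentle pace, should have the same effect as pasting the page into the "Save Page Now" form.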

For Archive Today:

  • Start from https://archive.ph/ - "My url is alive and I want to archive its content."

  • Pick a public webpage or a public tweet that has not been archived before and try searching for it, then use Archive Today to archive it.

  • After archiving it, search for it with the URL once more and load it to demonstrate the results.

Limitations of online archiving

Read Watch Listen | 5 minutes

Instructions

Make a short presentation with the limitations of the Wayback Machine and Archive Today and the pros and cons of different archiving methods. Emphasize the following points:

  • The Wayback Machine cannot always fully capture and retrieve public social media content, interactive content in Flash or JavaScript, moving images / videos and other similar content. Therefore other solutions may be needed to preserve such content for future reference, for instance: screenshots, regularly saving copies of websites offline.

  • The Wayback Machine automatically archives websites based on ranking and popularity. This process is automated, and you cannot request that an entire website be archived.

  • Archive Today does not automatically crawl and save webpages like the Wayback Machine does. Users need to archive pages manually and make sure that the saved links are preserved for later access and use.

  • Remind participants not to rely solely on online archives, since website owners can sometimes request that archived pages be taken down. Having backups is essential: participants can also save HTML or PDF copies of important webpages to their own devices.
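As an illustration of keeping a backup on your own device, the sketch below stores a copy of a page's HTML together with a small note recording where it came from and when it was saved. It is a minimal example under stated assumptions, not a recommended evidence workflow: the function name, file naming scheme and folder are hypothetical, and the SHA-256 checksum is an optional extra that the workshop text does not mention.

```python
# Minimal sketch: keep a local copy of a webpage's HTML with a small record
# of its source and save time. File names and the checksum are illustrative.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_local_copy(html: bytes, source_url: str, folder: Path) -> Path:
    """Write the HTML and a JSON note side by side; return the HTML path."""
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    page_path = folder / f"capture-{stamp}.html"
    page_path.write_bytes(html)
    note = {
        "source_url": source_url,
        "saved_at_utc": stamp,
        # Checksum lets you confirm later that the file has not changed.
        "sha256": hashlib.sha256(html).hexdigest(),
    }
    page_path.with_suffix(".json").write_text(
        json.dumps(note, indent=2), encoding="utf-8"
    )
    return page_path


if __name__ == "__main__":
    saved = save_local_copy(
        b"<html><body>Example page</body></html>",
        "http://www.yoursite.com/projects",
        Path("local_archive"),
    )
    print("Saved copy to", saved)
```

The HTML passed in here is a placeholder; in practice it would be the page content saved from the browser or downloaded by the participant.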

RESOURCES:

Your own time capsule

Practice | 10 minutes

Instructions

  • Ask participants to make their own time capsules by choosing a website or a webpage that they'd like to save for the future and follow the previous steps to archive it with the Wayback Machine and Archive Today.

  • Participants add their archived links to a shared document.

Other ways to retrieve and archive webpages

Investigate | 15 minutes

Instructions

  • Divide participants into small groups of 3-5 people.

  • Ask each group to investigate other ways of retrieving and archiving webpages and to add them to a shared document.

  • This could include:

    • checking 'cached' versions of webpages,

    • finding other platforms that host archived content from sources such as social media, online libraries, maps, or look for thematic archives hosting documentation of conflicts, human rights abuses, disinformation networks, etc.,

    • finding out what other information is archived in the Internet Archive and make a case for how it can be used to investigate different topics.

NOTE: You can adapt this task and suggestions to the specific needs and context of your participants.

Safety First! (15 minutes)

Digital safety basics

Read Watch Listen | 15 minutes

Instructions:

Make a short presentation covering the most relevant safety tips when conducting research and archiving content online. Suggested script:

  • Access logs: when you direct an archive service to a webpage that interests you, it will visit that webpage and store a copy of it. When it does so, the webpage being archived will automatically add a record to an ongoing "access log". This includes details such as when and by what IP addresses the webpage has been visited. Most archiving services keep access logs.

  • An attentive website administrator or an automated process might realise that a portion of their site has been archived by the Wayback Machine. This might give them clues that someone is investigating a particular piece of content or a person relevant to them. This does not mean that they can trace the investigation back to you but it can provide them with hints.

  • For example, let's suppose that only a handful of IP addresses viewed the archived page on the same day when it was added to the Wayback Machine. It would be easy for a website administrator to figure out that they are being watched from a particular place.

  • Webcite, for example, records the computer operating system and web browser of each user, as well as the domain name of each user's internet service provider (Webcite privacy policy). Archive Today uses your IP address to visit pages you ask it to archive.

  • It is a good idea to activate a Virtual Private Network (VPN) or to use the Tor Browser when working with archiving services.

  • Some services require each user to create an account, to choose a username, to verify an email address, or to associate a social media profile.

  • You should consider establishing a separate set of accounts to use with such services in order to compartmentalise (separate) your investigative work from your personal profile online.

  • You might even want to create a single-use "identity" for a particular investigation, and dispose of it once your research is done. For example, you could create a relatively secure, compartmentalised email account, which you can do quite easily at tutanota.de or protonmail.com.

  • Any small investment of time, before you begin your investigation, can help you limit these kinds of risks.

RESOURCE:

Closure (10 minutes)

Wrap-up Activity: Takeaway Poster

Produce | 5 minutes

Tools / Materials

  • Shared drawing pad / slide / whiteboard (online)
  • Whiteboard / flip-chart paper, post-its, markers (offline)

Instructions

  • Ask participants to create a takeaway poster by sharing their answers to the following question in the shared whiteboard / drawing board:

    • What are your main takeaways from today's workshop?
  • Give participants a few minutes to write and/or draw their thoughts and read the thoughts of others.

Debriefing

  • Review their contributions and highlight some of the points on the board.

Conclusion

Read Watch Listen | 5 minutes

Tools / Materials: No materials needed.

Instructions

  • Wrap up the workshop and sum up its contents.

  • Run a quick review of the session. Each participant says:

    • one thing they found very good about the session and

    • one thing they would improve for the next time

  • You can encourage participants to ask questions or give some final tips.

  • Share contact information if relevant, and any follow-up details.

To keep participants informed about what is going on at all times, trainers can effectively sum up workshop contents following these steps:

    1. [in the introduction] tell participants what is going to happen;
    2. [during each part of the session / workshop] remind them what is happening;
    3. [at the end of the session / workshop] tell them what just happened. In addition, at the end, trainers need to make sure they point out which expectations have been addressed.

Contact Us

Please reach out to us at Exposing the Invisible if you:

  • have any questions about this workshop plan and facilitation guidelines,
  • use this workshop plan and want to share feedback and suggestions that can help to improve it,
  • adapt the workshop plan to a specific context and want to share the results with us,
  • want to suggest new activities, tips or examples that can be added to this workshop,
  • want to share your expertise and collaborate with us on developing and testing new workshops.

Contact: eti@tacticaltech.org (GPG Key / fingerprint: BD30 C622 D030 FCF1 38EC C26D DD04 627E 1411 0C02).

Credits and Licensing

CC BY-SA 4.0

This content is produced by Tactical Tech's Exposing the Invisible project, and licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.

  • Workshop authors: A. Hayder, Laura Ranca, Wael Eskandar
  • Instructional design: A. Hayder
  • Editorial and content: Christy Lange, Laura Ranca, Wael Eskandar
  • Graphic design: Yiorgos Bagakis
  • Website development: Laurent Dellere, Saqib Sohail
  • Project coordination and supervision: Christy Lange, Laura Ranca, Lieke Ploeger, Marek Tuszynski, Safa Ghnaim, Wael Eskandar

This resource has been developed as part of the Collaborative and Investigative Journalism Initiative (CIJI) co-funded by the European Commission under the Pilot Project: "Supporting investigative journalism and media freedom in the EU" (DG CONNECT).

This text reflects the author’s view and the Commission is not responsible for any use that may be made of the information it contains.

More about this topic