/web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")



Warning

This project is no longer actively maintained. It may receive security updates, but we are no longer making major changes or improvements. EDGI no longer makes active use of this toolset and it is hard to re-deploy in other contexts.

  • Looking for tools to monitor websites? Check out our Awesome Website Change Monitoring document or issue #18, which discusses similar projects. (This project is most useful if monitoring several thousand pages in bulk, but in most cases, other existing tools will solve your needs faster and cheaper.)

  • If you have questions about this project or the code, we’re happy to respond! Check out the Get Involved section below for information about contacting EDGI members via Slack or e-mail. You can also file an issue on this repo.

  • We still actively maintain Wayback and web-monitoring-diff. While we built them as part of this project, they are in wider, more generalized use.

EDGI: Web Monitoring Project

As part of EDGI's Website Governance Project, this repository contains tools for monitoring changes to government websites, both environment-related and otherwise. It includes technical tools for:

  • Loading, storing, and analyzing historical snapshots of web pages
  • Providing an API for retrieving and updating data about those snapshots
  • Visualizing and browsing changes between those snapshots through a website
  • Managing the workflow of a team of human analysts who use the above tools to track and publicize information about meaningful changes to government websites

EDGI uses these tools to publish reports that are written about in major publications such as The Atlantic or Vice. Teams at other organizations use parts of this project for similar purposes or to provide comparisons between different versions of public web pages.

This project and its associated efforts are already monitoring tens of thousands of government web pages, but we aspire to a larger impact, eventually monitoring tens of millions or more. Currently, a lot of manual labor goes into reviewing every change, whether or not it is meaningful. Any system will need to emphasize usability of the UI and efficient use of computational resources.

For a combined view of all issues and status, check the project board. This repository is for project-wide documentation and issues.

Project Structure

The technical tooling for Web Monitoring is broken up into several repositories, each named web-monitoring-{name}:

| Repo | Description | Tools Used |
|------|-------------|------------|
| web-monitoring (This Repo!) | Project-wide documentation and issue tracking. | Markdown |
| web-monitoring-db | A database and API that stores metadata about the pages, versions, and changes we track, as well as human annotations about those changes. | Ruby, Rails, PostgreSQL |
| web-monitoring-ui | A web-based UI (built in React) that shows diffs between different versions of the pages we track. It’s built on the API provided by web-monitoring-db. | JavaScript, React |
| web-monitoring-processing | Python-based tools for importing data and for extracting and analyzing data in our database of monitored pages and changes. | Python |
| web-monitoring-diff | Algorithms for diffing web pages in a variety of ways and a web server for providing those diffs via an HTTP API. | Python, Tornado |
| web-monitoring-versionista-scraper | A set of Node.js scripts that extract data from Versionista and load it into web-monitoring-db. It also generates the CSV files that analysts currently use to manage their work on a weekly basis. | Node.js |
| web-monitoring-ops | Server configuration and other deployment information for managing EDGI’s live instance of all these tools. | Kubernetes, Bash, AWS |
| wayback | A Python API to the Internet Archive’s Wayback Machine. It gives you tools to search for and load mementos (historical copies of web pages). | Python |

For more on how all these parts fit together, see ARCHITECTURE.md.

Get Involved

We’d love your help on improving this project! If you are interested in getting involved…

This project is two-part! We rely both on open source code contributors (building this tool) and on volunteer analysts who use the tool to identify and characterize changes to government websites.

Get involved as an analyst

Get involved as a programmer

  • Be sure to check our contributor guidelines
  • Take a look through the repos listed in the Project Structure section and choose one that feels appropriate to your interests and skillset
  • Try to get the repo running on your machine (and if you run into any challenges, please file issues about them!)
  • Find an issue labeled good-first-issue and work to resolve it

Project Overview

Project Goals

The purpose of the system is to enable analysts to quickly review monitored government websites in order to report on meaningful changes. In order to do so, the system, a.k.a. Scanner, does several major tasks:

  1. Interfaces with other archival services (like the Internet Archive) to save snapshots of web pages.
  2. Imports those snapshots and other metadata from archival sources.
  3. Determines which snapshots represent a change from a previous version of the page.
  4. Processes changes to automatically determine a priority or sift out meaningful changes for deeper analysis by humans.
  5. Volunteers and experts work together to further sift out meaningful changes and qualify them for journalists by writing reports.
  6. Journalists build narratives and amplify stories for the wider public.
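
Step 3 above can be illustrated with a minimal sketch. It assumes a hypothetical Snapshot record and compares content hashes between consecutive captures of the same URL. This is only an illustration of the idea, not how web-monitoring-db or web-monitoring-processing actually implement it; the names Snapshot and changed_snapshots are invented for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical record for one capture of a page (not the project's real schema)."""
    url: str
    captured_at: str  # ISO 8601 timestamp, so string sorting matches time order
    body: bytes


def changed_snapshots(snapshots):
    """Return snapshots whose body differs from the previous capture of the same URL.

    The first capture of each URL has nothing to compare against, so it is
    never reported as a change.
    """
    last_hash = {}
    changed = []
    for snap in sorted(snapshots, key=lambda s: (s.url, s.captured_at)):
        digest = hashlib.sha256(snap.body).hexdigest()
        if snap.url in last_hash and last_hash[snap.url] != digest:
            changed.append(snap)
        last_hash[snap.url] = digest
    return changed
```

A byte-for-byte hash flags every difference, including trivial ones, which is why the later steps that prioritize and sift changes are still needed.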

Identifying "Meaningful Changes"

The majority of changes to web pages are not relevant and we want to avoid presenting those irrelevant changes to human analysts. Identifying irrelevant changes in an automated way is not easy, and we expect that analysts will always be involved in a decision about whether some changes are "important" or not.

However, as we expand the number of web pages we monitor, we definitely need to develop tools to reduce the number of pages that analysts must look at.

Some examples of meaningless changes:

  • It's not unusual for a page to have a view counter at the bottom. In this case, the page changes by definition every time you view it.
  • Many sites have "content sliders" or news feeds that update periodically. This change may be "meaningful", in that it's interesting to see news updates. But it's only interesting once, not (as is sometimes seen) 1,000 or 10,000 times.
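
One way to approach the view-counter case is to normalize volatile text before comparing two versions. The sketch below is a hedged illustration only: the VIEW_COUNTER pattern and the function names are assumptions made for this example, and real pages would need patterns tuned to their own markup.

```python
import re

# Hypothetical pattern for counters such as "Views: 1,204" or "Visitors 98765".
VIEW_COUNTER = re.compile(r"\b(?:views?|visitors?|hits)\s*:?\s*[\d,]+", re.IGNORECASE)


def normalize(text: str) -> str:
    """Replace counter text with a placeholder and collapse whitespace."""
    without_counters = VIEW_COUNTER.sub("<counter>", text)
    return " ".join(without_counters.split())


def is_meaningless_change(old: str, new: str) -> bool:
    """True when two versions differ only in content that normalize() removes."""
    return old != new and normalize(old) == normalize(new)
```

For example, two versions that differ only in "Views: 1,204" versus "Views: 1,205" would be treated as a meaningless change, while an edit to the surrounding text would not.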

An example of a meaningful change:

  • In February, we noticed a systematic replacement of the word "impact" with the word "effect" on one website. This change is very interesting because, while "impact" and "effect" have similar meanings, "impact" is a stronger word. So, there is an effort being made to weaken the language on existing sites. Our question is in part: what would we need in order for our tools to flag this change and present it to the analyst as potentially interesting?
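
As a rough sketch of how such a pattern might be surfaced, the code below counts one-for-one word substitutions across many old/new page pairs using Python's standard difflib module. The function names and the min_count threshold are assumptions made for this illustration; this is not the approach used by web-monitoring-diff or any other repository in this project.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def word_replacements(old: str, new: str) -> Counter:
    """Count (old_word, new_word) pairs where words were swapped one-for-one."""
    old_words = re.findall(r"\w+", old.lower())
    new_words = re.findall(r"\w+", new.lower())
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pairs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only consider replacements that keep the same number of words.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(old_words[i1:i2], new_words[j1:j2]))
    return pairs


def systematic_replacements(page_pairs, min_count=3):
    """Return substitutions seen at least min_count times across all page pairs."""
    totals = Counter()
    for old, new in page_pairs:
        totals.update(word_replacements(old, new))
    return [(pair, n) for pair, n in totals.most_common() if n >= min_count]
```

A substitution that recurs across many pages is a candidate for an analyst to look at; deciding whether it is actually meaningful would still need human judgment.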

Sample Data

The example-data folder contains examples of website changes to use for analysis.

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributors

Individuals

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!

Contributions Name
🔢 Chris Amoss
🔢 📋 🤔 Maya Anjur-Dietrich
🔢 Marcy Beck
🔢 📋 🤔 Andrew Bergman
📖 Kelsey Breseman
🔢 Madelaine Britt
🔢 Ed Byrne
🔢 Morgan Currie
🔢 Justin Derry
🔢 📋 🤔 Gretchen Gehrke
🔢 Jon Gobeil
🔢 Pamela Jao
🔢 Sara Johns
🔢 Abby Klionski
🔢 Katherine Kulik
🔢 Aaron Lamelin
🔢 📋 🤔 Rebecca Lave
🔢 Eric Nost
📖 Karna Patel
🔢 Lindsay Poirier
🔢 📋 🤔 Toly Rinberg
🔢 Justin Schell
🔢 Lauren Scott
🤔 🔍 Nick Shapiro
🔢 Miranda Sinnott-Armstrong
🔢 Julia Upfal
🔢 Tyler Wedrosky
🔢 Adam Wizon
🔢 Jacob Wylie

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

Sponsors & Partners

Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:

License & Copyright

Copyright (C) 2017-2020 Environmental Data and Governance Initiative (EDGI)
Web Monitoring documentation in this repository is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.

Software code in other Web Monitoring repositories is generally licensed under the GPL v3 license, but make sure to check each repository’s README for specifics.