This repository is for EDGI Web Monitoring Project documentation and project-wide issue management.
EDGI is already monitoring tens of thousands of pages and will eventually be monitoring tens of millions (or even as many as ~1 billion). Currently there is a lot of manual labour that goes into reviewing all changes, regardless of whether they are meaningful or not. Any system will need to emphasize usability of the UI and efficiency of computational resources.
You can track upcoming releases by exploring our milestones within the GitHub issue queue.
The purpose of the system is to enable analysts to quickly review monitored government websites in order to report on meaningful changes. The Website Monitoring automated system aims to make these changes easy to track, review, and report on.
For more, see the Version Tracking page on the EDGI website and watch this 50-minute Analyst training video.
The best way to get involved is to take a run through our onboarding process, for which we rely on Trello. It's designed to be self-directed, so you can run through it at your own pace. But don't worry -- along the way, it will introduce you to the humans of EDGI's Web Monitoring project! Yay humans!
Also, sign up and join us on Slack! Active discussion is happening in the #webmonitoring channel with various offshoot topics happening on channels prefixed with #webmonitoring.
- Access captured data (starting with HTML, later encompassing more types) from multiple archival sources including Versionista, PageFreezer, and the Internet Archive.
- Compare versions of the same page over time, potentially using multiple different strategies.
- Automatically filter out "nonmeaningful" or repetitive changes: for example, a "Page Last Viewed" timestamp that updates on every visit, or the same news article added to 100 pages from the same website.
- Prioritize the changes most likely to be "meaningful," meaning that some item of importance to fact-based governance was deleted or changed in a harmful way.
- Present changes to human analysts with useful visualizations and statistics to help them differentiate meaningful changes. Each analyst is assigned a "subdomain": a full or partial government domain that has been identified as relevant to fact-based governance.
- Collect annotations from the analysts. Use these to flag changes for special attention from EDGI administrators, and feed them back into the filtering and prioritization process. (That is, use them to train models.)
- Page: a web page crawled over time by one or more services like Internet Archive, Versionista, or PageFreezer.
- Version: a snapshot of a Page at a specific time (saved as HTML, for now).
- Change: two different Versions of the same Page.
- Diff: a representation of a Change. This could be a plain-text diff (as in the UNIX command-line utility) or a richer representation (as in the JSON blobs returned by PageFreezer) that takes into account HTML semantics.
- Annotation: a set of key-value pairs characterizing a given Change, submitted by a human analyst or generated by an automated process. A given Change might be annotated by multiple analysts, creating multiple Annotations per Change.
For more detail, see the Schema section below.
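To make these relationships concrete, here is a minimal, illustrative Python sketch. The class and field names are simplified for explanation and are not the project's actual data model; the real field lists are given in the Schema section.

```python
import difflib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """A web page crawled over time by one or more services."""
    url: str
    versions: List["Version"] = field(default_factory=list)


@dataclass
class Version:
    """A snapshot of a Page at a specific time (HTML, for now)."""
    page: Page
    capture_time: str
    html: str


@dataclass
class Change:
    """Two different Versions of the same Page; may gather many Annotations."""
    from_version: Version
    to_version: Version
    annotations: List[dict] = field(default_factory=list)


def plain_text_diff(change: Change) -> str:
    """One possible Diff: a plain-text line diff of the two Versions."""
    return "\n".join(difflib.unified_diff(
        change.from_version.html.splitlines(),
        change.to_version.html.splitlines(),
    ))
```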
The project is currently divided into several repositories handling complementary aspects of the task. They can be developed and upgraded semi-independently, communicating via agreed-upon interfaces. For additional information, you can contact the active maintainers listed alongside each repo or our Project Manager, @weatherpattern:
- web-monitoring-db (@Mr0grog) A Ruby on Rails app that serves database data via a REST API, serves diffs, and collects human-entered annotations.
- web-monitoring-ui (@lightandluck) React front-end that provides useful views of the diffs. It communicates with the Rails app via JSON.
- web-monitoring-processing (@danielballan) A Python backend that ingests newly captured HTML, computes diffs (for now, by querying PageFreezer), performs prioritization/filtering, and populates the database for the Rails app (a hypothetical sketch of how such a script might talk to the Rails API follows this list).
- web-monitoring-versionista-scraper (@Mr0grog) A set of Node.js scripts used to extract data from Versionista and load it into the database. It also generates the CSV files that analysts currently use in Google Spreadsheets to review changes. This project runs on its own, but in the future may be managed by or merged into web-monitoring-processing.
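These components communicate over agreed-upon JSON interfaces, with the Rails app exposing the REST API. As a rough, hypothetical illustration only (the host, endpoint path, and payload shape below are assumptions, not the actual API; see web-monitoring-db for the real routes), a processing script might submit a newly captured Version like this:

```python
import requests

API_BASE = "https://monitoring-db.example.org/api"  # hypothetical host and path


def submit_version(page_uuid, capture_time, uri, version_hash,
                   source_type, source_metadata):
    """Hypothetical sketch: post a newly captured Version to the Rails app.

    The field names mirror the Version schema described below, but the route
    and payload shape are assumptions, not the actual web-monitoring-db API.
    """
    payload = {
        "page_uuid": page_uuid,
        "capture_time": capture_time,
        "uri": uri,
        "version_hash": version_hash,
        "source_type": source_type,
        "source_metadata": source_metadata,
    }
    response = requests.post(f"{API_BASE}/versions", json=payload)
    response.raise_for_status()
    return response.json()
```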
The software will be deployed on Google Cloud, with each component running in a separate Docker container.
The vast majority of changes to web pages are not relevant to analysts and we want to avoid presenting those irrelevant changes to analysts at all. It is, of course, not trivial to identify "meaningful" changes immediately, and we expect that analysts will always be involved in a decision about whether some changes are "important" or not. However, as we expand from 10⁴ to 10⁷ web pages, we need to drastically reduce the number of pages that analysts look at.
Some examples of meaningless changes (a rough filtering sketch follows this list):
- It's not unusual for a page to have a view counter at the bottom. In that case, the page changes by definition every time you view it.
- Many sites have "content sliders" or news feeds that update periodically. Such a change may be "meaningful", in that it's interesting to see news updates, but it's only interesting once, not (as is sometimes seen) 1,000 or 10,000 times.
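A rough sketch of what filtering such boilerplate might look like (illustrative only; the patterns below are assumptions, and the real filtering lives in web-monitoring-processing) is to normalize away known-noisy fragments before comparing two versions:

```python
import re

# Fragments that change on every visit but carry no meaning.
# These particular patterns are illustrative assumptions, not the project's rules.
NOISE_PATTERNS = [
    re.compile(r"Page Last (Viewed|Updated):?\s*[\d/:\- ]+", re.IGNORECASE),
    re.compile(r"\b\d+\s+visitors?\b", re.IGNORECASE),  # simple view counters
]


def normalize(html):
    """Strip known-noisy fragments so they do not register as a Change."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    return html


def is_meaningless(before_html, after_html):
    """True if two versions differ only in boilerplate such as view counters."""
    return normalize(before_html) == normalize(after_html)
```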
An example of a meaningful change:
- In February, we noticed a systematic replacement of the word "impact" with the word "effect" on one website. This change is very interesting: while "impact" and "effect" have similar meanings, "impact" is a stronger word, so an effort is being made to weaken the language on existing sites. Our question is, in part: what would our tools need in order to flag this kind of change and present it to the analyst as potentially interesting? (A rough sketch of one possible approach follows.)
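One possible way to surface this kind of change (a sketch, not the project's implementation) is to compare the token-level diff of two versions against a watchlist of suspicious substitutions:

```python
import difflib

# A hypothetical watchlist of substitutions that weaken language.
SUSPICIOUS_SUBSTITUTIONS = {("impact", "effect"), ("will", "may")}


def flag_substitutions(before_text, after_text):
    """Return watchlisted (old, new) word pairs that appear as replacements."""
    before_words = before_text.split()
    after_words = after_text.split()
    matcher = difflib.SequenceMatcher(None, before_words, after_words)
    flagged = set()
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "replace":
            continue
        removed = {w.lower().strip(".,;") for w in before_words[i1:i2]}
        added = {w.lower().strip(".,;") for w in after_words[j1:j2]}
        for old, new in SUSPICIOUS_SUBSTITUTIONS:
            if old in removed and new in added:
                flagged.add((old, new))
    return flagged
```

For the example above, `flag_substitutions("the impact of mercury", "the effect of mercury")` returns `{("impact", "effect")}`.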
The `example-data` directory contains examples of website changes:
- `falsepos-...` files are cases any filter should catch
- `truepos...` files are cases of changes we care about
This is a small but illustrative sample. Many more samples will be made available as soon as possible.
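A candidate filter could be sanity-checked against these samples along these lines (a sketch that assumes only the falsepos-/truepos naming described above and a filter callable that judges one sample at a time):

```python
from pathlib import Path


def check_filter(is_meaningless, data_dir="example-data"):
    """Every falsepos-* case should be filtered out; no truepos-* case should be."""
    for sample in sorted(Path(data_dir).iterdir()):
        caught = is_meaningless(sample)  # however the candidate filter consumes a sample
        if sample.name.startswith("falsepos") and not caught:
            print(f"missed boilerplate change: {sample.name}")
        elif sample.name.startswith("truepos") and caught:
            print(f"wrongly filtered a meaningful change: {sample.name}")
```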
This describes the schema of the SQL databases shared by the Rails app in web-monitoring-db and the Python processing backend in web-monitoring-processing. Review the Definition of Terms section above, which corresponds to these tables.
Every table includes:
- uuid: UUID4 unique identifier
- created_at: internal detail of the database
- updated_at: internal detail of the database
in addition to the table-specific fields listed below.
Page:
- url: URL, which may be updated over time if a page is moved
- title: the page's `<title>` tag
- agency: Government agency
- site: A category used to organize Pages, loosely (but not always) corresponding to the subdomain of the URL
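Putting the common fields and the Page-specific fields together, a single Page record might look roughly like this (all values are made up for illustration):

```python
example_page = {
    # common to every table
    "uuid": "4c8e9a2e-1111-2222-3333-444455556666",
    "created_at": "2017-05-01T12:00:00Z",
    "updated_at": "2017-05-01T12:00:00Z",
    # Page-specific fields
    "url": "https://www.epa.gov/climatechange",
    "title": "Climate Change | US EPA",
    "agency": "EPA",
    "site": "EPA - climatechange",
}
```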
Version:
- page_uuid: reference to a Page
- capture_time: when this snapshot of the Page was acquired
- uri: path to stored (HTML) data; could be a filepath, S3 bucket, etc.
- version_hash: sha256 hash of stored data
- source_type: name of source (such as 'Internet Archive')
- source_metadata: JSON blob of extra info particular to the source.
This field is free-form, but we generally expect the following content for a given `source_type`:

For `source_type: 'versionista'`:
- `account`: A string identifying which Versionista account the data came from. This will generally be `versionista1` or `versionista2`.
- `site_id`: ID of the site in Versionista
- `page_id`: ID of the page in Versionista
- `version_id`: ID of the version in Versionista
- `url`: The full URL to view this version in Versionista. You’ll need to be logged into the appropriate Versionista account to make use of it.
- `diff_with_previous_url`: URL to the diff view in Versionista (comparing with the previous version)
- `diff_length`: Length (in characters) of the diff identified by the above `diff_with_previous_url`
- `diff_hash`: SHA-256 hash of the diff identified by the above `diff_with_previous_url`
- `diff_with_first_url`: URL to the diff view in Versionista (comparing with the first recorded version)
- `has_content`: Boolean indicating whether Versionista had raw content for this version. If this is true, the version’s `uri` should have a value (and vice versa).
- `error_code`: If the HTTP status code returned to Versionista when it originally scraped the page was a non-200 (OK) status, this property will be present. Its value is the status code of the response, e.g. `403`, `500`, etc.
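For example, a Version captured from Versionista might be stored roughly like this (every value below is invented for illustration, including the Versionista URL structure; created_at/updated_at are omitted for brevity):

```python
example_version = {
    "uuid": "7d1f0b9c-aaaa-bbbb-cccc-ddddeeeeffff",
    "page_uuid": "4c8e9a2e-1111-2222-3333-444455556666",  # the Page above
    "capture_time": "2017-05-02T03:15:00Z",
    "uri": "s3://some-bucket/versions/7d1f0b9c.html",
    "version_hash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    "source_type": "versionista",
    "source_metadata": {
        "account": "versionista1",
        "site_id": "75322",
        "page_id": "6104255",
        "version_id": "10329340",
        "url": "https://versionista.com/75322/6104255/10329340/",
        "diff_with_previous_url": "https://versionista.com/75322/6104255/10329340:10318219/",
        "diff_length": 421,
        "diff_hash": "f2ca1bb6c7e907d06dafe4687e579fce76b37e4e93b7605022da52e6ccc26fd2",
        "diff_with_first_url": "https://versionista.com/75322/6104255/10329340:9434924/",
        "has_content": True,
        # "error_code" is omitted because the original response was a 200 OK
    },
}
```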
Change:
- uuid_from: reference to the "before" Version
- uuid_to: reference to the "after" Version
- priority: a number between 0 and 1, where 1 is high priority
- current_annotation: a JSON blob providing a materialized reduction of one or more submitted Annotations, with conflicts resolved in a way yet to be determined
Diff:
- change_uuid: reference to a Change that this Diff represents
- uri: path to stored diff data; could be a filepath, S3 bucket, etc.
- diff_hash: sha256 hash of stored diff data
- source_type: name of diffing utility (such as 'PageFreezer')
- source_metadata: JSON blob of extra info particular to the source
Annotation:
- change_uuid: reference to a Change that this Annotation characterizes
- annotation: JSON blob
- author: user id
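To show how these last three tables relate, here is a rough, invented example of one Change with its Diff and one Annotation (created_at/updated_at are omitted, and the annotation keys are assumptions; the real annotation vocabulary is up to the analyst workflow):

```python
example_change = {
    "uuid": "0a1b2c3d-0000-1111-2222-333344445555",
    "uuid_from": "5e6f7a8b-9999-8888-7777-666655554444",  # the "before" Version
    "uuid_to": "7d1f0b9c-aaaa-bbbb-cccc-ddddeeeeffff",    # the "after" Version
    "priority": 0.85,  # closer to 1 means higher priority
    "current_annotation": {"meaningful": True, "category": "language change"},
}

example_diff = {
    "change_uuid": "0a1b2c3d-0000-1111-2222-333344445555",
    "uri": "s3://some-bucket/diffs/0a1b2c3d.json",
    "diff_hash": "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae",
    "source_type": "PageFreezer",
    "source_metadata": {},
}

example_annotation = {
    "change_uuid": "0a1b2c3d-0000-1111-2222-333344445555",
    "annotation": {"meaningful": True, "notes": "'impact' replaced with 'effect'"},
    "author": 42,  # user id of the analyst
}
```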
(This summary omits Users and Invitations, which are implemented in the Rails app.)
For more details see the Python implementation and the Ruby implementation (currently in progress).
Don't forget to check out the "How To Help" section above.
See our contributor guidelines.
This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!
Contributions | Name |
---|---|
🔢 | Chris Amoss |
🔢 📋 🤔 | Maya Anjur-Dietrich |
🔢 | Marcy Beck |
🔢 📋 🤔 | Andrew Bergman |
🔢 | Madelaine Britt |
🔢 | Ed Byrne |
🔢 | Morgan Currie |
🔢 | Justin Derry |
🔢 📋 🤔 | Gretchen Gehrke |
🔢 | Jon Gobeil |
🔢 | Pamela Jao |
🔢 | Sara Johns |
🔢 | Abby Klionski |
🔢 | Katherine Kulik |
🔢 | Aaron Lamelin |
🔢 📋 🤔 | Rebecca Lave |
🔢 | Eric Nost |
🔢 | Lindsay Poirier |
🔢 📋 🤔 | Toly Rinberg |
🔢 | Justin Schell |
🔢 | Lauren Scott |
🔢 | Miranda Sinnott-Armstrong |
🔢 | Julia Upfal |
🔢 | Tyler Wedrosky |
🔢 | Adam Wizon |
🔢 | Jacob Wylie |
(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)
Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:
Copyright (C) 2017 Environmental Data and Governance Initiative (EDGI)
Web Monitoring documentation is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.