edgi-govdata-archiving/web-monitoring-processing

Optionally pre-check known versions in `wm import ia-known-pages`

Closed this issue · 1 comments

Depends on #659, #660.

In the `wm import ia-known-pages` script, we first list all the pages in web-monitoring-db, then search for them in Wayback and import every memento we can find. Because we import data from overlapping time periods each time we run the import script (this works around occasional outages in the Wayback Machine’s indexing), we wind up importing a lot of versions that we already have in web-monitoring-db! That’s not strictly a problem (web-monitoring-db will just ignore data it already has), but it is a waste of bandwidth and resources.

Instead, the import script should first load all the versions in the timeframe we're checking so that, as it gets CDX results, it can check them against the list and skip over versions that are already in the DB.

After loading the list of page URLs in `import_ia_db_urls()`:

def import_ia_db_urls(*, from_date=None, to_date=None, maintainers=None,
                      tags=None, skip_unchanged='resolved-response',
                      url_pattern=None, worker_count=0,
                      unplaybackable_path=None, dry_run=False):
    client = db.Client.from_env()
    logger.info('Loading known pages from web-monitoring-db instance...')
    urls, version_filter = _get_db_page_url_info(client, url_pattern)

We should load all the versions in the timeframe (using the new features in #660) and wrap `version_filter` so that it also rejects versions we already have, e.g.:

if should_precheck:
    logger.info('Pre-loading known versions...')

    # Key each memento by its capture timestamp and URL.
    def memento_key(time, url):
        return f'{time.strftime("%Y%m%d%H%M%S")}|{url}'

    versions = client.list_all_versions(start_date=from_date,
                                        end_date=to_date,
                                        sort='capture_time:asc',
                                        chunk_size=1000)
    known_mementos = {memento_key(v['capture_time'], v['capture_url'])
                      for v in versions}
    _filter = version_filter

    # Skip CDX records already in the DB; defer everything else to the
    # original filter.
    def precheck_filter(cdx_record):
        if memento_key(cdx_record.timestamp, cdx_record.url) in known_mementos:
            return False
        return _filter(cdx_record)

    version_filter = precheck_filter
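
To make the idea easy to try without a database or the Wayback Machine, here is a minimal, self-contained sketch of the same filter-wrapping approach. `CdxRecord`, `make_precheck_filter`, and the sample data are hypothetical stand-ins, not the real client or CDX types; only the key format mirrors the snippet above.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical stand-in for a Wayback CDX record; only the two fields the
# filter reads are modeled here.
CdxRecord = namedtuple('CdxRecord', ['timestamp', 'url'])


def memento_key(time, url):
    """Build a lookup key from a capture time and URL."""
    return f'{time.strftime("%Y%m%d%H%M%S")}|{url}'


def make_precheck_filter(known_versions, base_filter):
    """Wrap base_filter so records matching a known version are skipped."""
    known_mementos = {memento_key(v['capture_time'], v['capture_url'])
                      for v in known_versions}

    def precheck_filter(cdx_record):
        if memento_key(cdx_record.timestamp, cdx_record.url) in known_mementos:
            return False
        return base_filter(cdx_record)

    return precheck_filter


# Made-up sample data standing in for DB versions and CDX search results.
known = [{'capture_time': datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
          'capture_url': 'https://example.gov/'}]
records = [
    CdxRecord(datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
              'https://example.gov/'),
    CdxRecord(datetime(2020, 1, 2, 12, 0, 0, tzinfo=timezone.utc),
              'https://example.gov/'),
]

precheck = make_precheck_filter(known, lambda record: True)
to_import = [r for r in records if precheck(r)]
```

Only the second record survives the filter. Note that a key matches only when both timestamps are expressed in the same timezone and the URLs are identical strings, so any normalization differences between the DB and CDX results would let duplicates through.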

Because this can be a time- and memory-consuming process for large timeframes, we should probably have a CLI option to turn it on (or off, not sure which).
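
One possible shape for that option, sketched with the standard library's `argparse`. The flag names `--precheck` and `--no-precheck`, the `prog` name, and the default of off are all assumptions for illustration; the real `wm` command may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser with a paired on/off flag for pre-checking."""
    parser = argparse.ArgumentParser(prog='wm-precheck-sketch')
    # Both flags write to the same destination, so either default is a
    # one-line change in set_defaults() below.
    parser.add_argument('--precheck', dest='precheck', action='store_true',
                        help='Load known versions first and skip them.')
    parser.add_argument('--no-precheck', dest='precheck',
                        action='store_false',
                        help='Import every memento found (skip pre-check).')
    parser.set_defaults(precheck=False)  # Assumed default: off.
    return parser


# Parse an explicit argument list rather than sys.argv for the demo.
args = build_parser().parse_args(['--precheck'])
should_precheck = args.precheck
```

Offering both flags leaves the on-versus-off default as a separate decision that can be flipped later without breaking scripts that pass a flag explicitly.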

This was solved in #667.