Optionally pre-check known versions in `wm import ia-known-pages`
In the `wm import ia-known-pages` script, we first list all the pages in web-monitoring-db, then search for them in Wayback and import every memento we can find. Because we import data from overlapping time periods each time we run the import script (this works around occasional outages in the Wayback Machine's indexing), we wind up importing a lot of versions that we already have in web-monitoring-db! That's not strictly a problem (web-monitoring-db will just ignore data it already has), but it is a waste of bandwidth and resources.
Instead, the import script should first load all the versions in the timeframe we're checking so that, as it gets CDX results, it can check them against the list and skip over versions that are already in the DB.
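One cheap way to make that check fast is to key each known version by its capture timestamp and URL and keep the keys in a set, so each CDX result costs a single lookup. A minimal, self-contained sketch of the idea (the key format and the example URL are illustrative assumptions, not the project's actual code):

```python
from datetime import datetime

# Illustrative sketch: key each known version as '<14-digit timestamp>|<URL>'
# so an incoming CDX record can be checked with a constant-time set lookup.
def memento_key(time, url):
    return f'{time.strftime("%Y%m%d%H%M%S")}|{url}'

# Pretend these keys came from web-monitoring-db for the timeframe being imported:
known_mementos = {
    memento_key(datetime(2019, 1, 2, 3, 4, 5), 'https://example.gov/page'),
}

# A CDX result with the same capture time and URL should be skipped:
already_imported = memento_key(datetime(2019, 1, 2, 3, 4, 5),
                               'https://example.gov/page') in known_mementos
print(already_imported)  # True
```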
After loading the list of page URLs in `import_ia_db_urls()` (web-monitoring-processing/web_monitoring/cli/cli.py, lines 555 to 561 at 06e3e51):
We should load all the versions in the timeframe (using the new features in #660) and add them to `version_filter`, e.g.:
```python
if should_precheck:
    print('Pre-loading known versions...')
    memento_key = lambda time, url: f'{time.strftime("%Y%m%d%H%M%S")}|{url}'
    versions = client.list_all_versions(start_date=from_date,
                                        end_date=to_date,
                                        sort='capture_time:asc',
                                        chunk_size=1000)
    known_mementos = set(memento_key(v["capture_time"], v["capture_url"])
                         for v in versions)
    _filter = version_filter
    def precheck_filter(cdx_record):
        # Skip records that are already in the DB; otherwise defer to the
        # originally configured filter.
        if memento_key(cdx_record.timestamp, cdx_record.url) in known_mementos:
            return False
        return _filter(cdx_record)
    version_filter = precheck_filter
```
Because this can be a time- and memory-consuming process for large timeframes, we should probably have a CLI option to turn it on (or off, not sure which).
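For illustration, such a switch could look like the sketch below. This uses `argparse` and a hypothetical `--precheck` flag name purely as assumptions; the real `wm` CLI's option parsing and naming may differ.

```python
import argparse

# Hypothetical sketch of the on/off switch; the flag name, prog name, and use
# of argparse are all assumptions for illustration, not the project's code.
parser = argparse.ArgumentParser(prog='wm-import-ia-known-pages')
parser.add_argument('--precheck', dest='should_precheck', action='store_true',
                    default=False,
                    help='Pre-load known versions from web-monitoring-db and '
                         'skip CDX records that are already imported.')

# Off by default; enabled explicitly for large overlapping re-imports:
args = parser.parse_args(['--precheck'])
print(args.should_precheck)  # → True
```

Defaulting it to off keeps small incremental imports cheap, while operators running broad overlapping imports can opt in to the extra up-front load.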