18F/site-scanning

Microrequest Details: Stabilize the scan engine


Goal: The scan engine scans the entire Target URL list as expected. (The simplest way to see this would be for each website to have a fresh timestamp in the data.)

Background: The Site Scanning program operates by iterating through a list of 20-50k public federal .gov websites (e.g. blog.fbi.gov and 18f.gsa.gov), collecting information about each of them, and writing the results to a database.

The engine is supposed to run fresh scans against the full list of websites once a day. However, we've noticed that it stops after some period of time, usually a few days. The lead engineer suspected a memory issue but did not have a chance to resolve it before our engagement ended.
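To make the goal concrete, here is a minimal sketch of the loop described above, together with a freshness check matching the "fresh timestamp" criterion. It is not the engine's actual code: `scan_all`, `stale_targets`, the `scans` table, and the `scan_one` callback are all hypothetical names, and SQLite stands in for whatever database the engine really uses. It also does not address the suspected memory issue; it only shows one way to keep a single failing site from halting the run and to detect targets that were never reached.

```python
import sqlite3
import time
from typing import Callable, Iterable, List


def scan_all(
    targets: Iterable[str],
    scan_one: Callable[[str], str],  # hypothetical per-site scanner
    conn: sqlite3.Connection,
) -> int:
    """Scan every target once; return how many scans raised an error."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans "
        "(url TEXT PRIMARY KEY, result TEXT, scanned_at REAL)"
    )
    failures = 0
    for url in targets:
        try:
            result = scan_one(url)
        except Exception as exc:
            # Record the failure and move on so one bad site
            # does not stop the rest of the list from being scanned.
            result = f"error: {exc}"
            failures += 1
        # The timestamp marks the attempt, whether or not it succeeded.
        conn.execute(
            "INSERT OR REPLACE INTO scans (url, result, scanned_at) "
            "VALUES (?, ?, ?)",
            (url, result, time.time()),
        )
        conn.commit()
    return failures


def stale_targets(
    targets: Iterable[str], conn: sqlite3.Connection, cutoff: float
) -> List[str]:
    """Return targets with no scan recorded at or after `cutoff` (epoch seconds)."""
    rows = conn.execute("SELECT url FROM scans WHERE scanned_at >= ?", (cutoff,))
    fresh = {row[0] for row in rows}
    return [url for url in targets if url not in fresh]
```

Running `stale_targets` against the full Target URL list with a cutoff of 24 hours ago would return an empty list if the daily scan completed, and otherwise list the sites the engine stopped before reaching.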

Details:

The Site Scanning engine is hosted in cloud.gov and we can share access to it anytime.

Links: