
COVID Atlas 1.0 architecture & plans

ryanblock opened this issue

As with any project of meaningful utility and scale, we never know all of its needs up front.

First, we build the thing, and then we see where it takes us. We learn as quickly as possible, adapt, and grow. (Who could have anticipated that governments would publish pandemic case data in PDFs or images? Or require cookies and CSRF tokens just to request a page containing basic public health data?)

The purpose of this document is to discuss the future architecture plans¹ for COVID Atlas.

This issue assumes a semi-large scale refactor.

I know, this can make folks feel uncomfortable. It makes me somewhat uncomfortable. It's also where we are.

A quick spoiler: scrapers may need some updating, but they will be preserved! We love our scrapers. We are not tossing out the scrapers!

Why start fresh

The initial analysis I did of the coronadatascraper codebase seemed promising for an in-flight, gradual refactor into production infrastructure.

After spending the last few weeks in the codebase, however, I surfaced deep underlying architectural flaws that pose significant barriers to resolving core issues in our current processes.

For those who may not be aware of the problems downstream of these issues, they include such fan favorites as: Larry has to stay up until 10pm every night manually releasing the latest data set, which only he knows how to do; unexpected errors can fatally break our entire build; and even minor changes require a large degree of manual verification.

@lazd and I agree these issues are fundamental and must be addressed with seriousness, care, and immediacy.

Second-system syndrome

We must immediately call out a common reason refactors or rewrites may fail: second-system syndrome.

Putting aside the fact that this codebase is only a few weeks old, we still need to be clear about expectations: v1.0 will likely seem like a step back at first; it will do fewer things, and the things it does may be approached differently.

This issue is not a dropbox for every idea we have, or a long-term roadmap for the future. This issue is a plan to get us into robust and stable production infra as soon as possible, and to begin phasing out parts of CDS as quickly as possible.


What we learned from v0 (coronadatascraper) architecture

Over the last few weeks, we learned an enormous amount from coronadatascraper. Below is a summary of a few of those findings that informed this decision, and will continue to inform our architecture moving forward:

Crawling

  • Crawling once daily is insufficient and a catastrophic single point of failure
    • We have witnessed frequent failures for a variety of reasons, and need to be crawling many times per day
  • Crawling many times per day necessitates datetime normalization and understanding source locales
    • Example: the 2020-04-01T00:00:00.000Z crawl for San Francisco, CA must somewhere, at some point, cast its data to 2020-03-31 (see the sketch after this list)
    • If this example is not immediately apparent to you, that's ok! Just take our word for it for the moment
  • Datetime normalization is greatly aided by the increased separation of concerns of crawling and scraping (e.g. logic related to URLs belongs outside of logic related to scraping)
  • No individual crawl failure should ever take down crawling another source; crawls should run independently of other sources
  • The cache should be read-only; humans should not be responsible for maintaining the cache
    • We must remove manual steps prone to human error or other external factors (e.g. whether someone's internet connection is working) and replace them with automation
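
To make the San Francisco example above concrete, here's a minimal sketch of the kind of datetime cast we mean. It's plain Node.js with no dependencies; the per-source timezone string is an assumption that would come from source metadata:

```js
// Cast a UTC crawl timestamp to the source's local calendar date.
// The timezone string is assumed to come from the source's metadata.
function localDateForCrawl (crawlTimestamp, timezone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: timezone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit'
  }).formatToParts(new Date(crawlTimestamp))
  const get = type => parts.find(part => part.type === type).value
  return `${get('year')}-${get('month')}-${get('day')}`
}

// The 2020-04-01T00:00:00.000Z crawl for San Francisco, CA lands on the prior local day
console.log(localDateForCrawl('2020-04-01T00:00:00.000Z', 'America/Los_Angeles')) // '2020-03-31'
```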

Scraping

  • Scrapers will frequently begin to fail through no fault of our own
    • Every day, a half dozen or more scrapers will start to fail due to changes in their sources
    • This is a known phenomenon, and should be expected and accounted for
  • Scrapers often require a very high degree of flexibility due to the absurd variation seen by state and local governments in data publishing
    • URLs change all the time, and we need to be highly flexible with that
    • Some URLs are published daily and are thus date-dependent (example: VA)
    • Some data sources represent multiple timezones (JHU, NYT, TX)
    • Some need to access headers, cookies, and other non-obvious vectors in order to acquire data (RUS)
    • Scrapers need many built-in parsers, including HTML (unformatted and tabular data), CSV, JSON, ArcGIS, etc.
  • Some data is reported with cities subtracted from counties (example: Wayne County - Detroit)
  • Some countries block access to our requests! WTF!
  • Normalizing location names is very difficult, but we have to be extremely good at it in order for other things to work without issue (see more below)
  • Some scrapers may only be able to return large, state-level datasets; these datasets may rely entirely on post-run normalization to be usable
    • Which is another way of saying: scraper devs should not be solely responsible for adding ISO/FIPS IDs to their own metadata; but they are responsible for ensuring their metadata can be identified with ISO/FIPS IDs
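
To illustrate the flexibility described above, here's a hypothetical source definition (CommonJS). Every property name, URL, and column in it is a placeholder, not a commitment to an API:

```js
// A hypothetical source: date-dependent URL, declared timezone, declared parser
// type, and plain scrape output that the pipeline (not the scraper dev)
// later resolves to ISO/FIPS IDs. All names, URLs, and columns are placeholders.
module.exports = {
  country: 'USA',
  state: 'VA',
  timezone: 'America/New_York',
  // Some sources publish a new file every day, so the URL is a function of date
  url: date => `https://example.gov/covid/daily-cases-${date}.csv`,
  type: 'csv', // tells the runner which built-in parser to apply
  scrape (rows) {
    // rows arrive already parsed; return simple objects and nothing else
    return rows.map(row => ({
      county: row.locality,
      cases: parseInt(row.cases, 10)
    }))
  }
}
```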

Data normalization + tagging

  • All data being emitted from scrapers needs to be normalized to ISO and (in the US) FIPS codes
    • This is very important, because location normalization unlocks a large number of other key features
    • This includes GeoJSON, which enables us to plot distance of cases around a location, or the effects of population density
  • Data normalization is a key and essential ingredient in the future efficacy of the system
  • Normalizing our data has a number of ongoing challenges, including:
    • Variations in official casing; see: Dekalb County vs. DeKalb County; Alexandria City vs. Alexandria city
    • Variations in characters; see: LaSalle Parish vs La Salle Parish
    • Varied classifications of locales; see AK: Yakutat City and Borough, Skagway Municipality, Hoonah-Angoon Census Area
    • Some sources do not present results cleanly and uniformly, for example:
      • The state of Utah aggregates the counts of three counties (Uintah, Duchesne, and Daggett) into Tricounty, which requires denormalization
    • Untaggable data (namely: cities) is a nice-to-have, but may only appear in certain API-driven data sets
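
For a sense of what name normalization involves, here's a minimal sketch of fuzzy-matching scraped county names against a canonical ISO/FIPS list. The canonical entries and the matching rule are placeholders; the real normalizer needs far more than this:

```js
// Placeholder canonical list keyed by FIPS
const canonical = [
  { name: 'DeKalb County', fips: '13089' },
  { name: 'La Salle Parish', fips: '22059' }
]

// Collapse casing, punctuation, and internal spacing before comparing
const slug = name => name.toLowerCase().replace(/[^a-z]/g, '')

function findFips (scrapedName) {
  const match = canonical.find(entry => slug(entry.name) === slug(scrapedName))
  return match ? match.fips : null
}

// In reality matching must also scope by state, since county names repeat across states
console.log(findFips('Dekalb County'))  // '13089'
console.log(findFips('LaSalle Parish')) // '22059'
```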

Local workflows

  • Local workflows should have clear success / failure vectors
    • Any backend dev should be able to easily understand and diagnose potential side effects of their changes
    • Any scraper dev should be able to easily understand and diagnose potential issues with their scraper's returned data

Testing

  • We need to employ TDD, and key areas of the codebase (such as scraper functions, caching, etc.) should be completely surrounded in tests
  • Failures should be loud, and we should hear about them frequently
  • Scraper testing will be a particular focus
    • All scrapers will undergo frequent regular tests running out of the cache (read: no manual mocks) and against live data sources to verify integrity
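
As a sketch of what a cache-backed scraper test could look like (the tape dependency and the loadFromCache / runScraper helpers are assumptions, not existing code):

```js
// Feed a real cached payload through the real scraper – no hand-written mocks
const test = require('tape')
const { loadFromCache } = require('./cache')
const { runScraper } = require('./scraper-runner')

test('VA scraper returns normalizable county data', t => {
  loadFromCache('VA', '2020-04-01')
    .then(cached => runScraper('VA', cached))
    .then(results => {
      t.ok(results.length > 0, 'scraper returned at least one record')
      results.forEach(record => {
        t.ok(record.county, 'every record names a county')
        t.ok(Number.isInteger(record.cases), 'case counts are integers')
      })
      t.end()
    })
    .catch(t.end) // any thrown error fails the test loudly
})
```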

Moving towards 1.0 architecture

Prerequisites

  • Node.js 12 – same as today
  • For anything in the backend, we will use CommonJS, not ES modules
    • Node.js continues to have a lot of thrash around ES modules, and it is unclear when it will stabilize
    • This app has been and will continue to be written for Node.js, not the browser
    • Therefore, we will use the tried and true, boring, built-in option
  • Technical decisions will be made favoring:
    • Separation of concerns
    • Functional paradigms and determinism
    • Developer velocity
    • Production readiness
  • Changes should be describable in failing tests
  • Scrapers may need some updating, but they will be preserved!
    • We love our scrapers.
    • We are not tossing out our scrapers!
  • Workloads will run on AWS
    • The cache will be served out of S3
    • All data will be delivered via a database (not a batch scrape job)
    • There will be proper APIs (in addition to or instead of large flat files)

Key processes

References to the "core data pipeline" mean the important, timely information required to publish up-to-date case data to covidatlas.com location views, our API, etc.

Crawling

  • Crawling will become its own dedicated operation
    • This represents step 1/2 in our core data pipeline
  • This operation will have a single responsibility: loading one or more pieces of data (a web page, a CSV file, etc.) from the internet and writing that data through to the cache
    • The cache will be stored in S3, and local workflows will start to copy down or hit the S3 bucket
  • Incomplete crawls – say one of three requested URLs fails – should fail completely
  • Crawling failures should be loud; alert Slack, etc.
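
A minimal sketch of what the crawl operation could look like, assuming the aws-sdk and got packages and a placeholder bucket name; none of these are final choices:

```js
const aws = require('aws-sdk')
const got = require('got')

const s3 = new aws.S3()
const CACHE_BUCKET = 'covidatlas-cache' // placeholder bucket name

// Single responsibility: fetch one URL and write the raw response through to the cache
async function crawl (source) {
  const response = await got(source.url)
  const key = `${source.id}/${new Date().toISOString()}.${source.type}`
  await s3.putObject({ Bucket: CACHE_BUCKET, Key: key, Body: response.body }).promise()
  return key
}

// Each source crawls independently; one failure never blocks another (Node 12.9+ for allSettled)
async function crawlAll (sources) {
  const results = await Promise.allSettled(sources.map(source => crawl(source)))
  results
    .filter(result => result.status === 'rejected')
    .forEach(result => console.error('Crawl failed – be loud, alert Slack:', result.reason))
}
```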

Scraping

  • Scraping will become its own dedicated operation
    • This represents step 2/2 in our core data pipeline
  • Prior to invocation, the scraper-runner will load the latest, freshest data from the cache, parse it, and pass it to the scraper function
    • If the data is not fresh enough (say: a successful crawl has not completed in the last n hours or days), the scrape run will fail
    • If this cannot be accomplished for whatever reason, the scrape run will fail (read: scrape runs do not invoke crawls)
  • A scraper function will be supplied the parsed object(s) it has specified (e.g. CSV) as params
  • The scraper function will return data to the scraper runner, which will then normalize (aka "transform") the locations of its results
    • Non-city-level results (such as counties, states) that cannot be normalized to an ISO and/or FIPS location will fail
  • When a scrape is complete, its output should be a simple JSON blob that stands completely on its own
    • Depending on the context, this result may be written to disk for local workflows / debugging, written to the database, or fired to invoke another event or events
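
Putting the above together, a sketch of the scrape runner's flow for a single source; every helper referenced here (getFreshest, parsers, normalizeLocations) is assumed, not existing code:

```js
const { getFreshest } = require('./cache')
const parsers = require('./parsers')              // csv, json, html, ...
const { normalizeLocations } = require('./normalize')

const MAX_AGE_HOURS = 6 // placeholder freshness threshold

async function runScrape (source) {
  const cached = await getFreshest(source, { maxAgeHours: MAX_AGE_HOURS })
  if (!cached) {
    // Scrape runs never invoke crawls; a stale cache means a loud failure
    throw new Error(`No fresh crawl for ${source.id} in the last ${MAX_AGE_HOURS}h`)
  }

  const parsed = parsers[source.type](cached.body)  // e.g. CSV -> array of rows
  const raw = await source.scrape(parsed)

  // Results that cannot be resolved to ISO/FIPS throw here (city-level excepted)
  const normalized = normalizeLocations(source, raw)

  // Output is a self-contained JSON blob; callers decide where it goes next
  return { source: source.id, date: cached.localDate, locations: normalized }
}
```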

Annotator (updating locations' metadata) ← name needs work

  • What I'm currently calling annotation or tagging (additional name ideas welcome!) is its own dedicated operation
    • It will run periodically / async, and is not part of our core data pipeline
  • This operation will loop over all our locations at a higher level and ensure corresponding location metadata is updated; examples:
    • Associate a location with its GeoJSON
    • Associate a location with population density, hospital beds, etc.
    • (More to come!)
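
A rough sketch of that loop, with the db and metadata lookups as placeholders for whatever stores we land on:

```js
const db = require('./db')
const { geoJsonFor, populationFor } = require('./metadata-sources')

// Periodic, async pass over every known location; not part of the core pipeline
async function annotateAll () {
  const locations = await db.getLocations()
  for (const location of locations) {
    const updates = {
      geometry: await geoJsonFor(location),
      population: await populationFor(location)
      // more to come: hospital beds, density, ratings, ...
    }
    await db.updateLocation(location.id, updates)
  }
}
```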

Metadata updater ← name needs work

  • Updating metadata is its own dedicated operation
    • It will run periodically / async, and is not part of our core data pipeline
  • Metadata updates ensure our sources for metadata tagging are up to date
    • This may include updating and loading various datasets (GeoJSON, population / census data, etc.) into the database for querying during tagging
    • Rating sources
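
A rough sketch, with the dataset names and the loadDataset helper as placeholders:

```js
const { loadDataset } = require('./metadata-sources')

// Placeholder list of upstream reference datasets the annotator queries
const DATASETS = ['geojson-boundaries', 'census-population', 'hospital-beds']

// Periodic, async refresh; not part of the core data pipeline
async function updateMetadata () {
  for (const name of DATASETS) {
    await loadDataset(name) // fetches the upstream file and upserts it into the db
  }
}
```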

Blob publishing (tbd)

  • Any large published datasets that we don't want to make accessible via a dynamic API will be handled by a blob publishing operation
    • It will run periodically / async, and is not part of our core data pipeline

I'm looking forward to your thoughts, questions, feedback, concerns, encouragement, apprehension, and giddiness.

Let's discuss – and expect to see a first cut this week!


¹ Previous planning took place in #236 + #295

Join #v1_arch on slack if you'd like to discuss.

@ryanblock - this issue can likely be closed, thoughts?

Closing this issue, the new architecture is up and running, even though we're still moving over. :-) Cheers all! jz