This repo uses a semi-mechanistic renewal approach to jointly fit COVID-19 hospital admissions data in the US and viral concentrations in wastewater, in order to forecast hospital admissions. See our Model Definition page for a mathematical description of the generative model, and the example vignette to run inference and forecasting on the simulated data provided. In brief, our model builds on EpiNow2, a widely used R and Stan package for Bayesian epidemiological inference. We modify EpiNow2 to add a model for the observed viral RNA concentration in wastewater.
This README is organized into the following sections:
- Our workflow for producing weekly forecasts
- Details on model input data
- A description of our forecasting pipeline
- A guide to installing and running our code
- Details on contributing to this project
- Standard CDCGov open source repository information, notices, and disclaimers
To produce our submissions to the COVID-19 Forecast Hub, we run a forecasting pipeline every Saturday evening at 9:10 pm EST. In addition to pulling the latest data and using it to fit our inference models, the pipeline generates summary figures, produces a report of Markov chain Monte Carlo (MCMC) convergence diagnostics, and performs quality checks on the wastewater data. We inspect these outputs manually to check for data issues or model convergence problems.
We produce forecasts of COVID-19 hospital admissions for the 50 states, Puerto Rico, the District of Columbia (DC), and the United States as a whole. Most forecasts use both wastewater data and hospital admissions data, but if a location has no wastewater data, the wastewater input data are deemed unreliable, or the model fails to converge, we use the hospital admissions-only model instead. If that model is also unreliable, we do not submit a forecast for that location. In all cases, we record our choice and the reason for it in a run-specific `metadata.yaml` file, as follows:
- "States without wastewater data": No wastewater data from the past 90 days were available for these locations, so we necessarily used the hospital admissions-only model for them.
- "States we chose to use hospital admissions only model on": We detected anomalies in reported wastewater values for these locations, or the wastewater model fits for these locations did not pass checks for reliability, so we chose to use the hospital admissions-only model for them.
- "States with insufficient wastewater data": We used the wastewater-informed model for these locations, but the actual wastewater data available for them was likely too sparse to meaningfully inform the forecast.
- "States not forecasted": Both the wastewater-informed and the hospital admissions-only models had issues for these locations, so we did not submit forecasts for them.
Individual archived forecasts and their corresponding `metadata.yaml` files can be found in datestamped subdirectories of the `output/forecasts` directory, e.g. `output/forecasts/2024-02-05`.
We store all data and configuration for the model in the `input` folder.
For real-time production, we pull hospitalization data from the public NHSN dataset on HealthData.gov and store it locally once ingested. For retrospective evaluation on time-stamped datasets, we use the `covidcast` R package.
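For example, a retrospective pull of a time-stamped hospital admissions vintage might look like the following sketch; the data source, signal, and dates shown are illustrative, not our production configuration:

```r
library(covidcast)

# Illustrative sketch: request the daily COVID-19 hospital admissions signal
# as it appeared on a past date ("as_of"), reproducing a real-time vintage.
hosp_vintage <- covidcast_signal(
  data_source = "hhs",
  signal      = "confirmed_admissions_covid_1d",
  start_day   = "2023-11-01",
  end_day     = "2024-01-31",
  geo_type    = "state",
  as_of       = "2024-02-05"
)
```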
We use the NWSS API on the DCIPHER platform (non-public data, requires permission from NWSS to access) to obtain wastewater data at the facility level.
To interact with `covidcast`, HealthData.gov, or DCIPHER/NWSS, we use API keys. `covidcast` and HealthData.gov are public; anyone can request an API key. To access raw wastewater data from NWSS, one must complete a data use agreement; see the NWSS website for details.
Our data pipeline expects users to store these API keys in a local `secrets.yaml` file. See the instructions below for setting up your `secrets.yaml` file in a format the pipeline can parse.
The data (both inputs and outputs) are currently loaded either from within the `input` folder, as shown below, or directly from the APIs described above. This folder also contains a file with state-level population data (`locations.csv`) used by the pipeline.
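As a hypothetical illustration of what a state-level population table of this kind might contain (the actual column layout of `locations.csv` may differ; the columns shown follow the common Forecast Hub convention):

```
abbreviation,location,location_name,population
CA,06,California,{population}
US,US,United States,{population}
```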
Model outputs are written to individual folders after each model run, and the file paths to those outputs are returned by the targets pipeline for use in downstream analysis and plotting. Alongside each pipeline run is a model metadata `.txt` file formatted for COVID-19 Forecast Hub submission, which can be found in the `forecasts` folder.
```
+-- input
    +-- ww_data
        +-- nwss_data
            +-- {date_of_data_pull}.csv
    +-- hosp_data
        +-- vintage_datasets
            +-- {date_of_data_pull}.csv
    +-- config
        +-- {test/prod}
            +-- config-{model_type}-{run_id}.yaml
    +-- saved_pmfs
        +-- generation_interval.csv
        +-- inf_to_hosp.csv
    +-- train_data
        +-- {forecast_date}
            +-- {model_type}
                +-- train_data.csv
    +-- locations.csv
+-- output
    +-- forecasts
        +-- {forecast_date}
            +-- {forecast_date}.tsv
            +-- metadata.yaml
            +-- wastewater_metadata_table.tsv
    +-- raw
        +-- {individual_state}
            +-- {model_type}
                +-- draws
                    +-- {forecast_date}
                        +-- run-on-{date_of_run}-{run_id}-draws.parquet
                +-- quantiles
                    +-- {forecast_date}
                        +-- run-on-{date_of_run}-{run_id}-quantiles.parquet
                +-- parameters
                    +-- {forecast_date}
                        +-- run-on-{date_of_run}-{run_id}-parameters.parquet
                +-- future_hosp_draws
                    +-- {forecast_date}
                        +-- run-on-{date_of_run}-{run_id}-future_hosp_draws.parquet
                +-- stan_objects
                    +-- {forecast_date}
                        +-- run-on-{date_of_run}-{run_id}-{model_name}-{time_stamp}-{unique_id}.csv
    +-- diagnostics
        +-- {forecast_date}
            +-- run-on-{date_of_run}-{run_id}-diagnostics.csv
    +-- figures
        +-- {forecast_date}-run-on-{date_run}
            +-- {individual_state}
                +-- {individual plots of generated quantities + data for all models}
    +-- cleaned
        +-- {forecast_date}-run-on-{date_run}
            +-- external
                +-- {submitted/test}_forecasts
                    +-- {forecast_date}-CDC_CFA-renewal_ww.csv
                +-- {submitted/test}_metadata
                    +-- metadata-CDC_CFA-renewal_ww.yaml
            +-- internal
                +-- {combined quantiles + data for generated quantities}
                +-- diagnostic_report.html
                +-- {pdfs of combined quantile forecasts, hospital admissions forecasts for multiple models, wastewater estimates, R(t), etc.}
+-- pipeline_run_metadata
    +-- test
        +-- {forecast_date}-run-on-{date_run}
            +-- {run_id}
    +-- prod
        +-- {forecast_date}-run-on-{date_run}
            +-- {run_id}
```
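Once a run completes, individual output files can be read directly from these folders. For example, a draws file might be loaded with the `arrow` package; the path below is illustrative, so substitute your own state, model type, dates, and run ID:

```r
library(arrow)

# Illustrative path into the output tree shown above.
draws_path <- file.path(
  "output", "raw", "{individual_state}", "{model_type}", "draws",
  "{forecast_date}", "run-on-{date_of_run}-{run_id}-draws.parquet"
)
draws <- read_parquet(draws_path)
head(draws)
```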
We use a pipeline to pull data, process it, fit models, and generate forecasts formatted for submission to the COVID-19 Forecast Hub. The `_targets.R` script in the project root directory defines the pipeline via the `targets` R package.
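Schematically, a `targets` pipeline of this kind is a list of `tar_target()` calls, each naming a step and the function that produces it. The target and function names below are invented for illustration; see `_targets.R` for the real definitions:

```r
# Schematic sketch of a targets pipeline; target and function
# names are illustrative, not those used in our _targets.R.
library(targets)

tar_option_set(packages = c("cfaforecastrenewalww"))

list(
  tar_target(ww_data, pull_ww_data()),                     # pull wastewater data
  tar_target(hosp_data, pull_hosp_data()),                 # pull hospital admissions
  tar_target(train_data, format_data(ww_data, hosp_data)), # format for Stan
  tar_target(fit, fit_model(train_data)),                  # fit the Stan model
  tar_target(hub_table, make_hub_table(fit))               # format for submission
)
```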
The pipeline does the following, in order:
- Pulls the latest wastewater and hospital admissions data from NWSS and NHSN, respectively.
- Formats the data for ingestion by our Stan models.
- Fits Bayesian renewal models to those data (links below point to the relevant `.stan` source files):
  - A model without wastewater (based only on hospital admissions).
  - A national model using aggregated wastewater concentration data.
  - A model incorporating site-level wastewater concentration data.
- Post-processes model output to produce forecasts and summary figures, including a table formatted for submission to the COVID-19 Forecast Hub.
See our model definition page for further details on the modeling methods and data pre-processing.
To run our code, you will need a working installation of R (version 4.3.0 or later). You can find instructions for installing R on the official R project website.
We do inference from our models using CmdStan (version 2.34.1 or later) via its R interface, `cmdstanr` (version 0.7.1 or later). Open an R session and run the following command to install `cmdstanr`, per that package's official installation guide:
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
`cmdstanr` provides tools for installing CmdStan itself. First check that everything is properly configured by running:
```r
cmdstanr::check_cmdstan_toolchain()
```
You should see the following:
```
The C++ toolchain required for CmdStan is setup properly!
```
If you do, you can then install CmdStan by running:

```r
cmdstanr::install_cmdstan()
```
If installation succeeds, you should see a message like the following:
```
* Finished installing CmdStan to <a filepath on your system>
```
If you run into trouble, consult the official `cmdstanr` website for further installation guides and help.
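You can also confirm which CmdStan version `cmdstanr` found, and where it is installed:

```r
cmdstanr::cmdstan_version()  # installed CmdStan version
cmdstanr::cmdstan_path()     # filepath of the CmdStan installation
```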
Once `cmdstanr` and CmdStan are installed, the next step is to download this repository and install our project package, `cfaforecastrenewalww`. The repository provides the overall structure for running the forecasting analysis; the project package provides tools for specifying and running our models, and installs the other needed dependencies.
Once you have downloaded this repository, navigate to it within an R session and run the following:
```r
install.packages('remotes')
remotes::install_local("cfaforecastrenewalww")
```
If that fails, confirm that your R working directory is indeed the project directory by running R's `getwd()` command.
Installing the project package should take care of almost all dependency installations. Confirm that installation has succeeded by running the following within an R session:
```r
library(cfaforecastrenewalww)
```
To load in the data, you will need to set up a `secrets.yaml` file in the root of the directory with the following format:
```yaml
covidcast_api_key: {key}
NHSN_API_KEY_ID: {key}
NHSN_API_KEY_SECRET: {key}
nwss_data_token: {token}
data_rid: {rid}
```
Directions for obtaining a covidcast API key can be found here.
Directions for obtaining an NHSN key are described in the `forecasttools` repo here. Note that you will need access to the dataset on DCIPHER to obtain the credentials stored as `nwss_data_token` and `data_rid`.
Once you have DCIPHER access, you will need to go here to create a new token. Name it whatever you like, then be sure to copy the one-time token and place it in `secrets.yaml` as your `nwss_data_token`.
The dataset RID is obtained when the data use agreement is approved and the link to the dataset on DCIPHER is provided.
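For reference, a `secrets.yaml` in the format above can be parsed with the `yaml` package. This is a minimal sketch of the idea, not necessarily how the pipeline itself reads the file:

```r
# Minimal sketch: read API credentials from secrets.yaml at the project root.
secrets <- yaml::read_yaml("secrets.yaml")
covidcast_key <- secrets$covidcast_api_key
nwss_token <- secrets$nwss_data_token
```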
To run the pipeline, type the following at a command prompt from the top-level project directory:
```sh
Rscript --vanilla -e "targets::tar_make()"
```
Alternatively, in an interactive R session with your R working directory set to the project root, run:

```r
targets::tar_make()
```
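The `targets` package also provides helpers for inspecting and partially running the pipeline; for example (the target name `fit` here is illustrative):

```r
targets::tar_visnetwork()                # visualize the dependency graph
targets::tar_make(names = any_of("fit")) # build only selected targets
targets::tar_read(fit)                   # read a built target's value
```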
We store our production code on the `prod` branch; refer to the `HEAD` of that branch for the code used to produce our most recent published forecast. To develop new features or fix bugs, create a feature branch off of `prod`. When the feature is ready, make a pull request into `prod`. All tests must pass on a feature branch before its pull request can be merged.
Please see our contributing guidelines and code of conduct for more details.
We want feedback and questions! Feel free to submit an issue here on GitHub, or contact us via this form.
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.
The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.
The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.
You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
Any included source code adapted or reused from another open source project inherits that project's license.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
Please refer to CDC's Template Repository for the standard/template CDCGov README, contribution policy, disclaimer, and code of conduct from which the corresponding documents found in this repository have been derived.