During COVID-19, the unknown course of the pandemic is almost as deadly as the virus itself. COVIDCast aims to address this problem by Predicting to Protect. To do this, COVIDCast utilizes the expert knowledge of epidemiological models and the forecasting power of time series models to predict the next 14 days of new COVID cases and protect governments, hospitals, and people from the coming storm.
The data for this project comes from three sources:
These repositories gathered data from a number of sources all over the world, including the WHO, Johns Hopkins Hospital, and the CDC. I used 6 different CSV files in the master file, combining mobility data (Google), weather data (Google), government restrictions (Google), hospitalizations and tests (OWID), and case counts and epidemiological variables (CovsirPhy). There was some feature overlap between the datasets from each source, which I used to impute missing data across datasets and create a more complete final `master_df`.
🧹 Data Cleaning
Originally 30% of the data was missing. I used a variety of techniques to make this data more manageable:
- Imputation: Missing values in one feature were imputed from the same feature in another dataset after checking for similarity (e.g., `current_hospitalizations`, `fatal`).
- Interpolation: Missing values within a continuous feature with an underlying exponential growth curve were filled using order-2 polynomial interpolation (e.g., `excess_mortality`, `derived_reproduction_rate`).
- Filling: Features that were only updated when their value changed were filled forward, and features whose values were impossible before a certain date were filled with zeroes (e.g., `vaccine_policy` and `new_vaccinations`, respectively).
- Trimming the Horizons: The time series was trimmed to run from 2020-02-15 to 2023-03-22, as many features didn't have values beyond these dates.
- Dropping Columns: Features that didn't have values up to 2023-03-22 were dropped (e.g., the mobility and weather data).
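The cleaning steps above can be sketched with pandas. This is a minimal illustration on toy data, not the project's actual code: the frames `owid` and `covsirphy`, the helper `clean`, and all toy values are assumptions, and only the column names and dates come from the text. The order-2 interpolation step appears only as a comment, because `Series.interpolate(method="polynomial", order=2)` needs SciPy installed.

```python
import numpy as np
import pandas as pd


def clean(primary: pd.DataFrame, secondary: pd.DataFrame,
          start: str = "2020-02-15", end: str = "2023-03-22") -> pd.DataFrame:
    """Toy version of the cleaning steps (hypothetical helper)."""
    # Imputation: fill gaps in `primary` from the same feature in `secondary`.
    df = primary.combine_first(secondary)
    # Interpolation would go here for smooth features, e.g.
    # df["excess_mortality"].interpolate(method="polynomial", order=2)  # needs SciPy
    # Filling: a policy persists until it changes, so carry it forward.
    df["vaccine_policy"] = df["vaccine_policy"].ffill()
    # Filling: missing vaccination counts are treated as zero.
    df["new_vaccinations"] = df["new_vaccinations"].fillna(0)
    # Trimming the horizons to the chosen date window.
    df = df.loc[start:end]
    # Dropping columns that have no value on the final date.
    return df.loc[:, df.iloc[-1].notna()]


# Toy data (assumed values) covering a few days around the start date.
idx = pd.date_range("2020-02-13", "2020-02-19", freq="D")
owid = pd.DataFrame({
    "current_hospitalizations": [1, np.nan, 3, np.nan, 5, 6, 7],
    "vaccine_policy": [0, np.nan, np.nan, 1, np.nan, np.nan, 2],
    "new_vaccinations": [np.nan, np.nan, np.nan, np.nan, 10, 20, 30],
    "mobility": [0.5, 0.4, 0.3, np.nan, np.nan, np.nan, np.nan],
}, index=idx)
covsirphy = pd.DataFrame(
    {"current_hospitalizations": [1, 2, 3, 4, 5, 6, 7]}, index=idx)

master_df = clean(owid, covsirphy, end="2020-02-19")
print(master_df)
```

In this toy run the `mobility` column is dropped because it has no value on the last date, mirroring the "Dropping Columns" step.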
EDA
To properly apply time series models to the data, I had to assess:
- Differencing: Try various differencing orders to achieve stationarity, guided by the PACF and ACF plots.
- PACF and ACF: Inspect the PACF and ACF correlograms to determine the AR and MA orders.
- Seasonality: Apply seasonal decomposition and find the seasonal lag.
- Target Normality: Check the COVID case numbers for normality and apply a Box-Cox transformation to the data.
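To make these checks concrete, here is a small NumPy-only sketch of what seasonal differencing, the sample autocorrelation function, and the Box-Cox transform compute. The synthetic series, the weekly lag of 7, and the fixed `lam=0.5` are illustrative assumptions, not values from the project. In practice one would more likely rely on library routines such as statsmodels' `plot_acf` and `plot_pacf` and SciPy's `boxcox`, which also estimates lambda.

```python
import numpy as np


def difference(y: np.ndarray, lag: int = 1) -> np.ndarray:
    """Return y[t] - y[t - lag]; lag=7 removes a repeating weekly pattern."""
    return y[lag:] - y[:-lag]


def acf(y: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelation for lags 0..max_lag."""
    y = y - y.mean()
    denom = np.dot(y, y)
    return np.array([np.dot(y[: len(y) - k], y[k:]) / denom
                     for k in range(max_lag + 1)])


def box_cox(y: np.ndarray, lam: float) -> np.ndarray:
    """Box-Cox transform; only defined for strictly positive values."""
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam


# Synthetic daily "cases": a linear trend plus a repeating weekly pattern.
t = np.arange(70)
weekly = np.tile([30, 10, 0, -10, -20, -5, -5], 10)
cases = 100 + 2.0 * t + weekly

raw_acf = acf(cases, 14)                  # autocorrelation of the raw series
seasonal_diff = difference(cases, lag=7)  # weekly pattern cancels out
transformed = box_cox(cases, lam=0.5)     # variance-stabilizing transform

print(seasonal_diff[:5])
print(np.round(raw_acf[[1, 7, 14]], 3))
```

After the lag-7 difference only the trend's weekly increment remains, which is the kind of flattening the differencing step aims for.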
Here is my Preprocessing and EDA Presentation
COVIDCast works by taking the spread, death, and recovery parameters estimated by an epidemiological SIRD model and feeding them into the time series models as exogenous variables, giving those models better information about the underlying nature of the disease being predicted.
- SIRD Model: The SIRD (Susceptible, Infected, Recovered, Deceased) model offers insights into real-time disease spread. It estimates the rate of change of each population (susceptible, infected, recovered, and deceased) and computes the basic reproduction number of the disease, \( R_0 \) (pronounced 'R naught'). This was computed at the beginning of the Preprocessing Notebook using the CovsirPhy library.
- ARIMA, SARIMA, SARIMAX models: These models predict future trends using moving averages, linear autoregression, and differencing. They also incorporate seasonality and exogenous variables, making them potent tools for forecasting. However, they don't model nonlinear trends very well, and they require a lot of up-front specification by the forecaster.
- Prophet model: From Facebook's own description, "Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects."
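As a rough illustration of the SIRD dynamics described above, the sketch below integrates the standard SIRD equations with a forward-Euler step. It does not reproduce CovsirPhy's parameter estimation or its parameterization; the names `SIRDParams` and `simulate_sird`, the rates, and the initial populations are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SIRDParams:
    beta: float   # transmission rate per day (assumed value)
    gamma: float  # recovery rate per day (assumed value)
    mu: float     # death rate per day (assumed value)

    @property
    def r0(self) -> float:
        """Basic reproduction number for this SIRD formulation."""
        return self.beta / (self.gamma + self.mu)


def simulate_sird(p: SIRDParams, s0: float, i0: float, rec0: float,
                  d0: float, days: int,
                  dt: float = 0.1) -> List[Tuple[float, float, float, float]]:
    """Forward-Euler integration; returns one (S, I, R, D) tuple per day."""
    n = s0 + i0 + rec0 + d0
    s, i, r, d = float(s0), float(i0), float(rec0), float(d0)
    steps_per_day = round(1 / dt)
    history = [(s, i, r, d)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infected = p.beta * s * i / n * dt
            new_recovered = p.gamma * i * dt
            new_deceased = p.mu * i * dt
            s -= new_infected
            i += new_infected - new_recovered - new_deceased
            r += new_recovered
            d += new_deceased
        history.append((s, i, r, d))
    return history


params = SIRDParams(beta=0.3, gamma=0.1, mu=0.01)
trajectory = simulate_sird(params, s0=999_000, i0=1_000, rec0=0, d0=0, days=14)
print(f"R0 = {params.r0:.2f}")
print("Day 14 (S, I, R, D):", tuple(round(x) for x in trajectory[-1]))
```

In COVIDCast the direction is reversed: the rates are estimated from observed data and then passed to the forecasting models as exogenous features.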
For ARIMA time series models, I wanted to add exogenous variables that carry information about how the target is going to fluctuate in the next time steps. To determine these features, I had to assess each variable's Stationarity, Granger-Causality, Linear Correlation Strength, Multi-Collinearity, and Importance. I found that the 6 variables I had expected to be important based on background knowledge were indeed the best additional regressors for the model. I used autoarima to grid search SARIMAX's orders, and found the order of (3,0,2)(2,1,1)[7] with intercept to be the most predictive in cross-validation.
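For readers less familiar with the notation, the order (3,0,2)(2,1,1)[7] can be written out in backshift form, where \( B y_t = y_{t-1} \). This is the textbook "regression with SARIMA errors" form, given here as an assumption about how the model is structured: libraries differ in exactly where the intercept \( c \) enters, and \( x_t \) stands for the vector of 6 exogenous variables.

```latex
\begin{aligned}
y_t &= \beta^{\top} x_t + u_t, \\
\phi(B)\,\Phi(B^{7})\,(1 - B^{7})\,u_t &= c + \theta(B)\,\Theta(B^{7})\,\varepsilon_t, \\
\phi(B) &= 1 - \phi_1 B - \phi_2 B^{2} - \phi_3 B^{3}, \\
\Phi(B^{7}) &= 1 - \Phi_1 B^{7} - \Phi_2 B^{14}, \\
\theta(B) &= 1 + \theta_1 B + \theta_2 B^{2}, \\
\Theta(B^{7}) &= 1 + \Theta_1 B^{7}.
\end{aligned}
```

The three non-seasonal AR terms and two MA terms correspond to (3,0,2). The two seasonal AR terms, the single seasonal difference \( (1 - B^{7}) \), and the one seasonal MA term correspond to (2,1,1) with a weekly period of 7.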
For Prophet modeling, I grid searched with cross-validation over a range of 3 to 15 of the most important features as determined by Recursive Feature Elimination with LightGBM as my regressor. The best performing model using this technique had 11 exogenous variables. I then grid searched over the hyperparameters to land on the final settings of `changepoint_prior_scale=10`, `seasonality_prior_scale=0.01`, `holidays_prior_scale=10`, and `growth='linear'`.
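The text does not spell out the cross-validation scheme. One common choice for time series is expanding-window (rolling-origin) validation, where each fold trains on everything before its test window. The sketch below shows fold generation and a grid search loop under that assumption. `expanding_window_splits`, `grid_search`, the candidate values, and the placeholder scorer are all hypothetical; a real run would fit Prophet or SARIMAX inside `score_fn` and return its forecast error.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, Iterator, Sequence, Tuple


def expanding_window_splits(n_obs: int, initial: int, horizon: int,
                            step: int) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs; training always precedes testing."""
    end_train = initial
    while end_train + horizon <= n_obs:
        yield range(0, end_train), range(end_train, end_train + horizon)
        end_train += step


def grid_search(grid: Dict[str, Sequence],
                score_fn: Callable[[dict, range, range], float],
                n_obs: int, initial: int, horizon: int,
                step: int) -> Tuple[dict, float]:
    """Return (best_params, best_mean_error) over every grid combination."""
    best_params, best_err = None, float("inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_errors = [
            score_fn(params, train_idx, test_idx)
            for train_idx, test_idx in expanding_window_splits(
                n_obs, initial, horizon, step)
        ]
        err = mean(fold_errors)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err


def dummy_score(params: dict, train_idx: range, test_idx: range) -> float:
    """Placeholder error with an arbitrary minimum; not a real model fit."""
    return (abs(params["changepoint_prior_scale"] - 1)
            + abs(params["seasonality_prior_scale"] - 0.1))


grid = {
    "changepoint_prior_scale": [0.1, 1, 10],
    "seasonality_prior_scale": [0.01, 0.1, 1],
}
splits = list(expanding_window_splits(n_obs=100, initial=60, horizon=14, step=14))
best_params, best_err = grid_search(grid, dummy_score, n_obs=100,
                                    initial=60, horizon=14, step=14)
print(len(splits), best_params, best_err)
```

Keeping the test window strictly after the training window avoids leaking future observations into the fit, which ordinary shuffled k-fold would do.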
The training period was from February 15th, 2020 to March 5th, 2023, and the testing period was the 16 days from March 5th, 2023 to March 21st, 2023. The forecasts and benchmarks below are based on the models' performance during the testing period.
The SARIMAX model adeptly captures the weekly COVID case variations, demonstrating minimal residuals for low case counts. Although it occasionally misses predicting peaks, the observed values still lie within its 95% confidence interval.
The Prophet model seems to capture a long-term trend, even venturing into negative COVID case counts. Efforts to employ a logistic growth curve didn't enhance its accuracy. It struggles with weekly fluctuations and, at times, is directionally incorrect. Given its current state, it's not recommended for predicting COVID cases.
Comparing the testing scores, the SARIMAX model with 6 exogenous variables and a time series order of (3,0,2)(2,1,1)[7] with an intercept demonstrates the best understanding of daily COVID case trends, with superior scores across all metrics.
Future endeavors for this project include adapting the target variable to deaths, hospitalizations, or weekly COVID case averages to assess the forecast accuracy. Also, literature on COVID prediction suggests that an RNN with LSTM layers can achieve a state-of-the-art sMAPE of 5%. I would like to recreate this RNN model and explore whether integrating the SIRD model parameters further enhances its predictive power.
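For reference, sMAPE (symmetric mean absolute percentage error) can be computed as in the sketch below. Several variants of the formula exist, and it is an assumption that the cited literature uses this one; the function name `smape` and the sample numbers are illustrative only.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE in percent: mean of 2|F - A| / (|A| + |F|) times 100."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        # Convention assumed here: a 0-vs-0 pair contributes zero error.
        total += 0.0 if denom == 0 else 2.0 * abs(f - a) / denom
    return 100.0 * total / len(actual)


print(round(smape([100, 200], [110, 180]), 2))
```

Because the denominator uses both the actual and forecast values, the score stays bounded between 0% and 200% even when actual case counts are near zero.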
And, of course, the ambitious goal for this project is for COVIDCast to Predict to Protect against any future pandemics.
Thank you for your interest in COVIDCast. For further inquiries or insights, contact via this GitHub repository or at scelarek@gmail.com.
Sam Celarek