Kaggle Challenge: COVID-19 Global Forecasting

Please visit the Kaggle page for COVID-19 Global Forecasting for more information about the contest.

Some useful Kaggle discussion:

Getting started

Setting up environment

conda remove -y --name kaggle-covid --all
conda create -n kaggle-covid -y python=3.7 pandas numpy scipy statsmodels scikit-learn matplotlib seaborn ipykernel
conda activate kaggle-covid
conda install -y -c pyviz holoviews bokeh
pip install autopep8
ipython kernel install --user --name=kaggle-covid
conda deactivate
conda activate base
jupyter notebook --notebook-dir="./" --NotebookApp.port=8888

Kaggle Competition Rules

The competition rules can be found at the link below. Note that the competition permits the use of both competition data and external data. As described in the rules section on the Kaggle competition page, external data must be published on the official competition forum before the entry deadline. For this challenge, the entry deadline falls on the same day as the submission deadline, March 25th, so we don't have to rush to enter for the first week. However, any external data we find or generate must be posted before that deadline. Also, if you decide to contribute to the competition, make sure you join the team before March 25th for the first-week competition.

Kaggle docs: https://www.kaggle.com/docs

Datasets

Completed datasets

Kaggle COVID-19 data
Population of cities
- ~~Other source: https://data.london.gov.uk/dataset/global-city-population-estimates~~
Number of ICU beds
Number of physicians per capita by country
Regional Demographics, like population by age group
- Median age from kaggle/koryto, among other things.
Airport and Routes (Private dataset, see below*)

Uncollected / Unfinished datasets

Social Media
- Twitter data (we need API access)
- (distribution of) sentiment by geography
- visits to CSSE ArcGIS Portal (https://github.com/CSSEGISandData/COVID-19), ask jhusystems@gmail.com for website visit data
Policies
- ideally, we would know how much people are going out
Treatment Options
- Vaccine trial stages

*All datasets from Kaggle are found on the sharing datasets public discussion board. All "private" external datasets should be entered on that discussion board.

Relevant Readings

Papers

Guides

Very Short and Long Time Series: https://otexts.com/fpp2/long-short-ts.html
Evaluating Time-series Forecasting: https://otexts.com/fpp2/accuracy.html
https://towardsdatascience.com/how-not-to-use-machine-learning-for-time-series-forecasting-avoiding-the-pitfalls-19f9d7adf424:

"Defining the model to predict the difference in values between time steps rather than the value itself, is a much stronger test of the models predictive powers."

ARIMA with Python: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
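
The advice quoted above, predicting the change between time steps rather than the value itself, can be sketched with pandas. This is only an illustration under stated assumptions: the case counts are made-up numbers, the three-day holdout is arbitrary, and the "model" is just the historical mean of the daily change, which serves as a naive baseline rather than a real forecaster.

```python
import pandas as pd

# Hypothetical cumulative case counts for one region (made-up numbers).
cumulative = pd.Series(
    [1, 2, 4, 7, 12, 20, 31, 45, 62, 80],
    index=pd.date_range("2020-03-01", periods=10, freq="D"),
    name="confirmed",
)

# Model target: day-over-day change instead of the raw cumulative level.
daily_change = cumulative.diff().dropna()

# Chronological split (assumed 3-day holdout); a time series is never shuffled.
train, test = daily_change.iloc[:-3], daily_change.iloc[-3:]

# Naive baseline on the differenced series: the historical mean change.
mean_forecast = pd.Series(train.mean(), index=test.index)

# Convert predicted changes back into cumulative levels.
last_level = cumulative.loc[train.index[-1]]
level_forecast = last_level + mean_forecast.cumsum()

# Error measured on the differences, where a trivial model cannot hide
# behind the fact that today's level is close to yesterday's.
mae_diff = (test - mean_forecast).abs().mean()
print(level_forecast)
print("MAE on daily change:", mae_diff)
```

Any candidate model would need to beat `mae_diff` on the differenced series to show real predictive power.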

Our methods

Methods are preliminary; the final approach will depend on how the models perform.

Clustered Time Series

Clustering cities and countries
- type of response
- healthcare system
- transportation (air travel)
- population density
- etc...
Train time-series models by cluster
- train on difference in value
- evaluate against historical mean
- this helps us separate cities/countries with different environments and responses
Combine model predictions
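
The three steps above can be sketched end to end. This is a rough outline under loud assumptions, not the team's final method: the feature columns, the synthetic data, the choice of KMeans with two clusters, and the three-day horizon are all hypothetical, and the per-cluster "model" is only the mean daily change, standing in for whatever time-series model is eventually chosen.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-region features:
# [population density, ICU beds per 100k, number of air routes].
features = np.vstack([
    rng.normal([100, 10, 5], [10, 1, 1], size=(4, 3)),
    rng.normal([5000, 30, 80], [300, 3, 5], size=(4, 3)),
])

# Synthetic cumulative case counts: one row per region, one column per day.
cumulative = np.cumsum(rng.poisson(5, size=(8, 14)), axis=1)

# Step 1: cluster regions on standardized features.
scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Step 2: per cluster, fit on day-over-day differences.
horizon = 3  # assumed holdout length
diffs = np.diff(cumulative, axis=1)
train_diffs = diffs[:, :-horizon]
last_level = cumulative[:, -horizon - 1]  # last level seen in training
steps = np.arange(1, horizon + 1)

# Step 3: combine the per-cluster predictions into one forecast table.
forecast = np.empty((len(features), horizon))
for k in np.unique(labels):
    members = labels == k
    # Placeholder model: mean daily change across the cluster's regions.
    cluster_mean_change = train_diffs[members].mean()
    forecast[members] = last_level[members][:, None] + cluster_mean_change * steps

# Compare against the held-out cumulative counts.
mae = np.abs(forecast - cumulative[:, -horizon:]).mean()
print("labels:", labels)
print("MAE on held-out days:", mae)
```

Replacing `cluster_mean_change` with a fitted model per cluster keeps the rest of the pipeline unchanged, and the mean-change forecast remains available as the baseline to evaluate against.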

KaggleDS/covid19-global