Data Hackathon: Cherry Tree Flowers

The Data

The original data can be found here together with information on how the data was collected.

I also provided a pre-cleaned version of the data here.

The data contains the following columns:

AD: the year
Full-flowering date
- DOY encoded as the day of the year (e.g. day 91 is April 1 in a non-leap year)
- as date where 401 encodes April 1
source code: which scientific paper first reported the data point? Encoded as integers 1 to 8
data type code: which source was used to deduce the date? (e.g. diary, newspaper etc). Encoded as integers 0 to 9
reference name: Name of the historic document
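
The two date encodings above can be decoded with the Python standard library. This is a minimal sketch: the helper names `doy_to_date` and `decode_mmdd` are made up for illustration, and it assumes the dates follow the (proleptic) Gregorian calendar that Python's `datetime` uses, which should be checked against the original paper.

```python
from datetime import date, timedelta


def doy_to_date(year: int, doy: int) -> date:
    """Convert a 1-based day of the year into a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


def decode_mmdd(code: int) -> tuple[int, int]:
    """Split a date code such as 401 into (month, day)."""
    return divmod(code, 100)


print(doy_to_date(2015, 91))  # 2015-04-01
print(decode_mmdd(401))       # (4, 1)
```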

Together with the cherry blossom dates, the original paper also collected and estimated the (average) temperature in March. The temperature data has the following columns:

AD: the year
estimated temperature: an estimated temperature (either smoothed or non-smoothed)
observed temperature
upper/lower limit: (only for the smoothed estimate) limits of the 95% confidence interval in smoothing procedure
urban bias (hikone/kameoka weather station): urban bias was subtracted

For more details on the columns, please refer to the original paper.

Data Questions

Some ideas for interesting questions to explore with this data:

Forecasting and Predicting

The data only contains dates and temperatures up to 2015, and many years have no data at all.

Forecast flower date and temperature

Can we extrapolate the flowering date and temperature for the years after 2015?
If we look up the actual values for the most recent years, how do predictions from this data compare to the actual flowering dates and temperatures?
What does the data forecast for the next years?

Predicting temperature

Using only the flowering dates and the observed temperature, can we predict the temperature?
How does our result compare to the estimated temperature provided in the data?

Imputing and Interpolating

Can we impute the missing data for the years without data?
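
The simplest form of imputation is linear interpolation between the nearest years that do have data. The following is a minimal sketch with made-up values. The function `interpolate_missing` is hypothetical, and gaps before the first or after the last known year are deliberately left unfilled.

```python
def interpolate_missing(series: dict) -> dict:
    """Linearly fill None values that lie between two known values.

    `series` maps year -> value or None; gaps at either end stay None.
    """
    known = [year for year in sorted(series) if series[year] is not None]
    filled = dict(series)
    # Walk over each pair of neighbouring known years and fill the gap.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for year in range(left + 1, right):
            if year in filled:
                filled[year] = series[left] + step * (year - left)
    return filled


# Made-up values, for illustration only.
doy_by_year = {1990: 96, 1991: None, 1992: None, 1993: 102, 1994: None, 1995: 98}
print(interpolate_missing(doy_by_year))
```

For long gaps a straight line is a strong assumption. Splines or a time series model, as covered in the resources below, may be more appropriate.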

Useful resources for forecasting time series:

Book on time series forecasting: Forecasting: Principles and Practice (uses R), e.g. chapters 5, 7, 8
The book Statistical Rethinking, Chapter 4, uses splines to estimate the flowering dates (also R, but Python ports exist)
Splines in Python
Time Series Forecasting Methods in Python - Cheat Sheet
Regression in scikit-learn

Visualization

The data lends itself well to some flowery visualizations:

A fun exercise can be to try to replicate some of these visualizations with your favorite plotting tool.

Plotting Tutorials:

Ggplot
Ggplot graph gallery
Python graph gallery has code for Matplotlib, Seaborn and plotly.
Seaborn
Plotnine (port of ggplot, so the ggplot tutorial should work with some minor tweaking)

corriebar/cherry-tree-flowers
