The original data can be found here together with information on how the data was collected.
I have also provided a pre-cleaned version of the data here.
The data contains the following columns:
- AD: the year
- Full-flowering date, given in two encodings:
  - DOY: encoded as the day of the year (e.g. day 91 is April 1 in a non-leap year)
  - as a date, where 401 encodes April 1
- source code: which scientific paper first reported the data point? Encoded as integers 1 to 8
- data type code: which source was used to deduce the date? (e.g. diary, newspaper etc). Encoded as integers 0 to 9
- reference name: Name of the historic document
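The two date encodings can be converted into each other with Python's standard library. The following is only a minimal sketch: the helper names `decode_mmdd` and `day_of_year` are made up for illustration, and it assumes the date column is an integer of the form month * 100 + day. Note that Python's `date` uses the proleptic Gregorian calendar, which may not match the calendar convention of the original data for very old years.

```python
from datetime import date


def decode_mmdd(code: int) -> tuple:
    """Split an integer such as 401 into (month, day)."""
    month, day = divmod(code, 100)
    return month, day


def day_of_year(year: int, code: int) -> int:
    """Return the day of the year for a month/day code in the given year."""
    month, day = decode_mmdd(code)
    return date(year, month, day).timetuple().tm_yday


print(decode_mmdd(401))        # (4, 1)
print(day_of_year(2015, 401))  # 91, since 2015 is not a leap year
print(day_of_year(2012, 401))  # 92, since 2012 is a leap year
```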
Along with the cherry blossom dates, the original paper also collected and estimated the (average) temperature in March. The temperature data has the following columns:
- AD: the year
- estimated temperature: an estimated temperature (either smoothed or non-smoothed)
- observed temperature
- upper/lower limit: (only for the smoothed estimate) limits of the 95% confidence interval in smoothing procedure
- urban bias (hikone/kameoka weather station): urban bias was subtracted
For more details on the columns, please refer to the original paper.
Some ideas for interesting questions to explore with this data:
The data only contains dates and temperatures up to 2015 and has many years without any data.
Forecast flower date and temperature
- Can we extrapolate the flowering date and temperature for the years after 2015?
- If we look up the missing data for the most recent years, how do predictions from this data compare to the actual flowering dates and temperatures?
- What does the data forecast for the next years?
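One possible way to check a forecast is a holdout comparison: fit on the earlier years, predict the later ones, and measure the error. The sketch below uses a deliberately naive forecast (the mean of the last few observations) as a baseline. The numbers are toy values, not taken from the data set, and `naive_forecast` and `mean_absolute_error` are illustrative names.

```python
def naive_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy flowering days (day of year); NOT real observations.
train = [100, 98, 102, 97, 95]
test = [96, 94]

# The naive baseline predicts the same value for every held-out year.
forecast = naive_forecast(train, window=3)
predictions = [forecast] * len(test)

print(forecast)                                # 98.0
print(mean_absolute_error(test, predictions))  # 3.0
```

Any more elaborate model (trend, splines, ARIMA) can be compared against this baseline on the same held-out years.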
Predicting temperature
- Using only the flowering dates and the observed temperature, can we predict the temperature?
- How does our result compare to the estimated temperature provided in the data?
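As a starting point, the temperature could be modelled as a linear function of the flowering day. The sketch below fits an ordinary least squares line in plain Python. It is a simple stand-in, not the method used in the original paper. The value pairs are toy numbers chosen for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy pairs (flowering day of year, March temperature in °C); NOT real data.
flowering_doy = [95, 100, 105, 110]
march_temp = [8.0, 7.0, 6.0, 5.0]

intercept, slope = fit_line(flowering_doy, march_temp)

# Predict the temperature for a year in which the trees flowered on day 102.
predicted = intercept + slope * 102
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```

With the real data, the fit would use only the years that have both a flowering date and an observed temperature, and the predictions for the remaining years could then be compared to the estimated temperature column.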
Imputing and Interpolating
- Can we impute the missing data for the years without data?
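The simplest imputation is linear interpolation between the nearest years that do have data. The sketch below assumes one entry per consecutive year, with `None` marking a missing year. The values are toy numbers and `interpolate_missing` is an illustrative name. Splines or a proper time series model would be natural next steps.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest known neighbours.

    Leading and trailing gaps stay None, because there is no second
    neighbour to interpolate towards.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# One entry per consecutive year; None marks a year without data (toy values).
doy = [None, 100, None, None, 106, 104, None]
print(interpolate_missing(doy))
# [None, 100, 102.0, 104.0, 106, 104, None]
```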
Useful resources for forecasting timeseries:
- The book Forecasting: Principles and Practice (uses R), e.g. chapters 5, 7, 8
- The book Statistical Rethinking, Chapter 4, uses splines to estimate the flowering dates (also R, but Python ports exist)
- Splines in Python
- Timeseries Forecasting Methods in Python - Cheat Sheet
- Regression in scikit-learn
The data lends itself well to some flowery visualizations:
- A Twitter user (with R code!)
- The Economist
- BBC
- Viz of Washington cherry trees
A fun exercise can be to try to replicate some of these visualizations with your favorite plotting tool.
Plotting Tutorials:
- Ggplot
- Ggplot graph gallery
- Python graph gallery (has code for Matplotlib, Seaborn and Plotly)
- Seaborn
- Plotnine (port of ggplot, so the ggplot tutorial should work with some minor tweaking)