Messing around with the TfL cycling data
Bike points data can be downloaded from https://api.tfl.gov.uk/bikepoint. The data is 'live', with real-time counts of available bikes, empty docks, and so on. See bike-points.py for example code that downloads and reads the data.
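As a sketch of the sort of thing bike-points.py does (the helper names here are mine, and the NbBikes/NbEmptyDocks property keys are my reading of the API response rather than code from the repo):

```python
import json
from urllib.request import urlopen

def bike_counts(point):
    """Pull the live counts out of a single bike point record.

    The counts live in the 'additionalProperties' list, keyed by
    'NbBikes' and 'NbEmptyDocks'.
    """
    props = {p["key"]: p["value"] for p in point["additionalProperties"]}
    return {
        "name": point["commonName"],
        "bikes": int(props["NbBikes"]),
        "empty_docks": int(props["NbEmptyDocks"]),
    }

def fetch_bike_points(url="https://api.tfl.gov.uk/bikepoint"):
    """Download the full list of bike points and flatten each record."""
    with urlopen(url) as resp:
        return [bike_counts(p) for p in json.load(resp)]
```

Each bike point carries its live counts inside the additionalProperties list, so the helper flattens them into a plain dict before any analysis.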
Historical trip data for all TfL cycle journeys is available from https://s3-eu-west-1.amazonaws.com/cycling.data.tfl.gov.uk/. To avoid needing AWS credentials, download-trip-data.py downloads the bucket listing as XML, extracts the full list of filenames, and selects the ones I want with a regular expression.
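The listing trick can be sketched like this (the helper names and the example pattern are illustrative, not the repo's actual code):

```python
import re
from urllib.request import urlopen

LISTING_URL = "https://s3-eu-west-1.amazonaws.com/cycling.data.tfl.gov.uk/"

def download_listing(url=LISTING_URL):
    """Fetch the public S3 bucket listing as raw XML text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

def select_keys(listing_xml, pattern):
    """Pull every <Key> out of the bucket listing XML and keep only
    the filenames matching the given regular expression."""
    keys = re.findall(r"<Key>(.*?)</Key>", listing_xml)
    wanted = re.compile(pattern)
    return [k for k in keys if wanted.search(k)]
```

Note that S3 returns listings in pages of up to 1,000 keys, so a complete run may need to follow the truncation marker across multiple requests.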
prepare-trip-data.py parses the downloaded CSV files. It uses PySpark to read and clean the files before storing them in Parquet format. The schema of the files has changed several times over the years, so they need to be loaded with the PERMISSIVE mode set. The Duration column was renamed Duration_Seconds at some point, and several additional columns were appended on the right.
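One way to absorb the rename is to normalise headers onto the newest schema before (or after) reading; a minimal plain-Python sketch of the idea (in PySpark itself, withColumnRenamed does the same job on a DataFrame):

```python
# Map legacy column names onto the current schema. "Duration" ->
# "Duration_Seconds" is the rename noted above; anything not listed
# passes through unchanged.
RENAMES = {"Duration": "Duration_Seconds"}

def normalise_header(columns):
    """Rewrite a CSV header row so old and new exports share one schema."""
    return [RENAMES.get(c, c) for c in columns]
```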
To make aggregations easier, the start and end timestamps are exploded out into multiple columns: a Unix timestamp column plus derived columns for year, month, day (of month), and day_of_week. This cuts down on the date/time manipulation needed in notebooks.
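The derived columns can be computed like this (plain Python for illustration; pyspark.sql.functions provides year, month, dayofmonth and dayofweek equivalents, though Spark's dayofweek counts from Sunday):

```python
from datetime import datetime, timezone

def explode_timestamp(ts, prefix):
    """Derive the aggregation columns described above from one Unix
    timestamp: year, month, day (of month) and day_of_week."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        f"{prefix}_timestamp": ts,
        f"{prefix}_year": dt.year,
        f"{prefix}_month": dt.month,
        f"{prefix}_day": dt.day,
        f"{prefix}_day_of_week": dt.isoweekday(),  # 1 = Monday .. 7 = Sunday
    }
```

Applying this once to each of the start and end timestamps (with prefixes like "start" and "end") gives flat columns that group-bys can use directly.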
Note that the Parquet data is deleted and recreated each time prepare-trip-data.py is run.
Jupyter notebooks can be hosted inside PyCharm, which is handy:
https://www.jetbrains.com/help/pycharm/using-ipython-notebook-with-product.html
Some useful links on plotting maps with GeoPandas:
- https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
- https://towardsdatascience.com/geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map-98e01944b972
- http://michelleful.github.io/code-blog/2015/07/15/making-maps/
Historical weather data is available from the Met Office:
https://www.metoffice.gov.uk/public/weather/climate-historic/#?tab=climateHistoric
The data consists of:
- Mean daily maximum temperature (tmax)
- Mean daily minimum temperature (tmin)
- Days of air frost (af)
- Total rainfall (rain)
- Total sunshine duration (sun)
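A sketch of parsing one data row of a Met Office historic station file, assuming the usual 'yyyy mm tmax tmin af rain sun' layout with '---' marking missing values and '*'/'#' flags appended to estimated readings (check the header of the actual file you download):

```python
def parse_station_row(line):
    """Parse one data row of a Met Office historic station file into
    a dict keyed by the variable names listed above."""
    def num(tok):
        tok = tok.rstrip("*#")  # strip estimated/provisional flags
        return None if tok == "---" else float(tok)

    year, month, *values = line.split()
    names = ("tmax", "tmin", "af", "rain", "sun")
    row = {"year": int(year), "month": int(month)}
    row.update(zip(names, (num(v) for v in values)))
    return row
```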