# velib


## Usage

### Requirements

```bash
mkdir data
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

### Configuration file

Copy the example config file:

```bash
cp config/config.yaml.dist config/config.yaml
```

and modify the variables as you wish:

```yaml
velib_files_path: <absolute path to station*.gz files>
weather_files_path: <absolute path to paris_weather*.gz file>
logging_level: INFO
```

For the import to work, each path must end with a `/` and the folders must contain the compressed data files.
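As an illustration, the config could be read with PyYAML along these lines (the `load_config` helper is hypothetical, not part of the project):

```python
import glob
import logging

import yaml  # PyYAML


def load_config(path="config/config.yaml"):
    """Read the YAML config and return it as a dict."""
    with open(path) as f:
        config = yaml.safe_load(f)
    logging.basicConfig(level=getattr(logging, config["logging_level"]))
    return config


config = load_config()
# Paths must end with '/', so a simple glob finds the compressed files.
station_files = glob.glob(config["velib_files_path"] + "station*.gz")
weather_files = glob.glob(config["weather_files_path"] + "paris_weather*.gz")
```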

### First Launch

```bash
python run.py
```

It will create a `training.csv` file, which will be used in future runs to speed up loading.
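The caching idea is roughly the following sketch, where `build_training_data` is a hypothetical stand-in for the project's actual loading code:

```python
import os

import pandas as pd


def load_training_data(csv_path="training.csv"):
    """Load the cached CSV if present; otherwise build it once and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = build_training_data()  # hypothetical: parses the *.gz files and merges them
    df.to_csv(csv_path, index=False)
    return df
```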

Modifications: I had to slightly modify the weather file to be able to read it easily.

## Workflow

The program works this way:

- Load the local challenge data (velib history + weather)
- Crawl additional data from Open Data Paris (theatre, museum, and market locations)
- Merge the data using timestamps or coordinates
- Split the data into a training set and a test set (`split_proportion` variable)
- Fit a classifier or a regressor on the training set (here `RandomForestRegressor`; see the sketch after this list)
- Predict on the test set
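A condensed sketch of that pipeline, assuming the velib history and weather have been loaded into pandas DataFrames `velib` and `weather`, that the merged frame is already numeric, and that `target` is a 0/1 column meaning "some stands left" (all names are illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Merge on timestamps: each velib snapshot gets the closest earlier weather record.
data = pd.merge_asof(
    velib.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
)

features = data.drop(columns=["target"])
target = data["target"]

# test_size plays the role of the split_proportion config variable;
# 0.3 reproduces the 30% test set mentioned in the Results section.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3)

model = RandomForestRegressor(oob_score=True)
model.fit(X_train, y_train)
predictions = model.predict(X_test)  # floats between 0 and 1
```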

Then it prints the feature importances and plots the error rate against the confidence level. The predictions are floats between 0 and 1: with a confidence level of 0.5, every prediction above 0.5 means "there are some stands left at the station". We then compare that to the actual value.
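The confidence sweep described above boils down to a few lines (a sketch that reuses `predictions` and `y_test` from the pipeline sketch):

```python
import matplotlib.pyplot as plt
import numpy as np

y_true = np.asarray(y_test)
confidence_levels = np.linspace(0.0, 1.0, 101)

# For each threshold, a prediction above it means "some stands left";
# the error rate is the share of thresholded predictions that disagree
# with the actual values.
error_rates = [
    np.mean((predictions > level).astype(int) != y_true)
    for level in confidence_levels
]

plt.plot(confidence_levels, error_rates)
plt.xlabel("confidence level")
plt.ylabel("error rate")
plt.show()
```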

## Use case

The use case is: I know the station near me has bikes, and I will arrive at my destination in one hour. Will the station I'm targeting have stands left?
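Phrased as data, that question is a one-hour-ahead label: for each snapshot of a station, the target is whether the same station still has stands one hour later. A sketch of how such a label could be built, assuming a `data` frame with a datetime `timestamp` plus `station_number` and `available_stands` columns (all hypothetical names):

```python
import pandas as pd

# Look up, for each row, the same station's availability one hour later.
data = data.sort_values("timestamp")
future = data[["station_number", "timestamp", "available_stands"]].copy()
future["timestamp"] -= pd.Timedelta(hours=1)  # shift t+1h back onto t

labeled = pd.merge_asof(
    data,
    future.rename(columns={"available_stands": "stands_in_1h"}),
    on="timestamp",
    by="station_number",
    direction="nearest",
    tolerance=pd.Timedelta(minutes=10),  # skip rows with no snapshot ~1h ahead
)
labeled = labeled.dropna(subset=["stands_in_1h"])
labeled["target"] = (labeled["stands_in_1h"] > 0).astype(int)
```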

## Results

The test set is a 30% split of the dataset. We could do multiple runs to get more accurate results, but with this little data, a single run gives the same results.

### Prediction rate over confidence level

![Prediction rate over confidence level](assets/prediction_rate.png)

The error rate here is very low, which could be explained by overfitting. A velib data crawler was developed but couldn't be used because of missing weather data (even though the four most important features are temporal; see below).

I used the `RandomForestRegressor` for:

- its well-known accuracy
- its feature importance output
- its out-of-bag (OOB) error estimation
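Both of these come directly from the fitted model (continuing the pipeline sketch above):

```python
# Out-of-bag estimate, available because the model was built with oob_score=True.
print("OOB score:", model.oob_score_)

# Feature importances, printed in descending order like the table below.
for name, importance in sorted(
    zip(X_train.columns, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(name, importance)
```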

### Feature importance

| feature | importance |
|---|---|
| minute | 4.9833051357 |
| hour | 3.7418882181 |
| day | 1.9510148148 |
| wday | 1.0712071957 |
| tempi | 1.0058053216 |
| number | 1.0014975615 |
| tempm | 0.9662565957 |
| theatre_distance | 0.6568507767 |
| museum_distance | 0.6205061519 |
| bike_stands | 0.6125561546 |
| market_distance | 0.5977898978 |
| month | 0.539800456 |
| conds_Mostly Cloudy | 0.1460951477 |
| icon_mostlycloudy | 0.1424671599 |
| icon_clear | 0.1260777932 |
| conds_Clear | 0.1232655831 |
| conds_Scattered Clouds | 0.109913737 |
| icon_partlycloudy | 0.1092215464 |
| conds_Partly Cloudy | 0.0877612316 |
| conds_Light Rain | 0.0524947526 |
| icon_rain | 0.0461725471 |
| rain | 0.040896624 |
| conds_Overcast | 0.0368349133 |
| icon_cloudy | 0.0362478491 |
| conds_Fog | 0.024847045 |
| fog | 0.0233143817 |
| conds_Light Drizzle | 0.0201011979 |
| icon_fog | 0.0196913614 |
| conds_Rain | 0.017992301 |
| conds_Patches of Fog | 0.0171216008 |
| conds_Shallow Fog | 0.0160653025 |
| conds_Light Rain Showers | 0.0135096732 |
| icon_unknown | 0.0126847769 |
| conds_Unknown | 0.0120640566 |
| conds_Rain Showers | 0.0044508173 |
| icon_hazy | 0.0025105848 |
| conds_Mist | 0.0024687149 |
| conds_Heavy Rain Showers | 0.002114527 |
| conds_Drizzle | 0.0015543805 |
| conds_Light Fog | 0.0011291071 |
| icon_tstorms | 0.0008785905 |
| thunder | 0.0008064289 |
| conds_Light Thunderstorms and Rain | 0.0007679871 |
| precipi | 0 |
| bonus | 0 |
| snow | 0 |
| banking | 0 |
| precipm | 0 |

## Future work

- Use weather open data
- Add more relevant features:
  - Minutes from midnight (see the sketch below)
  - Meeting area
- Be able to mix contextual predictions with historical data (active learning)
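The minutes-from-midnight feature would be a one-liner on top of the existing `hour` and `minute` columns:

```python
# Collapse hour and minute into a single time-of-day feature.
data["minutes_from_midnight"] = data["hour"] * 60 + data["minute"]
```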

## Literature

These papers discuss bike sharing in different ways and could help with future improvements.

## External Data

### Used

### Not used