src/data-preprocessing
- create a database
- set up the script on a server
- run the script automatically with a cron job
- Python 3.6
- Libraries:
- requests
- psycopg2
Install the packages:
cd src/data-preprocessing
pip install -r requirements.txt
The SQL script create_bikeDB.sql creates the database schema.
Use it to create the database in which the data queried by the scripts is stored.
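One way to apply the schema from Python is sketched below. It is only a sketch under assumptions: the target database already exists, create_bikeDB.sql contains only statements that can run inside a transaction (for example CREATE TABLE), and config.py exposes a hypothetical DATABASE dict of psycopg2.connect keyword arguments. Running the file with the psql command line client works just as well.

```python
from pathlib import Path


def apply_schema(conn, sql_path="create_bikeDB.sql"):
    """Run the SQL file on an open DB-API connection and commit."""
    sql = Path(sql_path).read_text(encoding="utf-8")
    cur = conn.cursor()
    try:
        # psycopg2 accepts several ';'-separated statements in one call
        # as long as no query parameters are passed.
        cur.execute(sql)
        conn.commit()
    finally:
        cur.close()


def main():
    # Imported here so apply_schema() can be reused without these modules.
    import psycopg2
    import config  # assumption: config.DATABASE is a dict, e.g. host/dbname/user/password

    conn = psycopg2.connect(**config.DATABASE)
    try:
        apply_schema(conn)
    finally:
        conn.close()
```

Calling main() from src/data-preprocessing would create the tables once; the query scripts can then write into them.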
Script query_bike_apis.py is used to query provider API data
It sends API requests to retrieve the current locations of all bikes from nextbike, lidlbike and mobike in Berlin (inner circle) and stores them in a single database.
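The query-and-store flow described above could look roughly like the following sketch. It is not the code of query_bike_apis.py: the payload shape (a list of dicts with id, lat and lng keys), the table name bike_locations and its columns are all assumptions made for illustration, and each provider's real response format will differ.

```python
def fetch_json(url, params=None, timeout=30):
    """GET a provider endpoint and return the decoded JSON body."""
    import requests  # imported lazily so the helpers below work without it

    resp = requests.get(url, params=params, timeout=timeout)
    resp.raise_for_status()
    return resp.json()


def to_rows(provider, bikes, queried_at):
    """Normalise bike records from one provider into uniform tuples.

    `bikes` is assumed to be a list of dicts with 'id', 'lat' and 'lng'.
    """
    return [
        (provider, str(bike["id"]), float(bike["lat"]), float(bike["lng"]), queried_at)
        for bike in bikes
    ]


def store_rows(conn, rows):
    """Insert the rows through an open psycopg2 connection."""
    # Hypothetical table and column names; adapt to create_bikeDB.sql.
    query = (
        "INSERT INTO bike_locations (provider, bike_id, lat, lng, queried_at) "
        "VALUES (%s, %s, %s, %s, %s)"
    )
    with conn.cursor() as cur:
        cur.executemany(query, rows)
    conn.commit()
```

Normalising every provider into the same tuple shape before inserting is what makes a single shared table workable.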
Script query_nextbike_stations.py is used to query the stations of nextbike
Config file: Add a config.py file to src/data-preprocessing containing the API keys for the Deutsche Bahn API (https://developer.deutschebahn.com/store/) and the database credentials (see the example in config-example.py).
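As an illustration only, a config.py could be laid out like the sketch below. Every name in it (DEUTSCHE_BAHN_API_KEY, DATABASE and its keys) is a placeholder invented for this example; the names the scripts actually import are the ones in config-example.py, so copy from there.

```python
# config.py: hypothetical layout, keep this file out of version control.

# API key issued by the Deutsche Bahn developer portal (placeholder value).
DEUTSCHE_BAHN_API_KEY = "<your-api-key>"

# Database credentials, shaped as keyword arguments for psycopg2.connect().
DATABASE = {
    "host": "localhost",
    "port": 5432,
    "dbname": "<database-name>",
    "user": "<database-user>",
    "password": "<database-password>",
}
```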
Set up a cron job that runs the scripts at regular intervals. For example, this setup
- runs the query_bike_apis.py script every 4 minutes
- runs the query_nextbike_stations.py script once a day at 8 AM
- runs a cleaning script on the database (/src/clean_script.py) once a day at 11 PM deleting all unnecessary rows in the database.
CRON JOBS
*/4 * * * * python3 [PATH TO FOLDER]/src/query_bike_apis.py
0 8 * * * python3 [PATH TO FOLDER]/src/query_nextbike_stations.py
0 23 * * * python3 [PATH TO FOLDER]/src/clean_script.py
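To check what a schedule field such as */4 means, the small sketch below expands it into the values it matches. It covers only the three forms used in the entries above (*, */n and a single number) and is not a full cron parser.

```python
def expand_field(field, low, high):
    """Return the values a simple cron field matches within [low, high]."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        # A step starts at the lowest allowed value and counts upwards.
        return list(range(low, high + 1, step))
    return [int(field)]


minutes = expand_field("*/4", 0, 59)
print(minutes)            # minutes 0, 4, 8, ... 56
print(len(minutes))       # 15 runs per hour
print(len(minutes) * 24)  # 360 runs per day
```

Because 60 is divisible by 4, the runs are evenly spaced across the hour boundary; a step that does not divide 60 would leave a shorter gap before minute 0.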
To query the APIs for different cities, the src/data-preprocessing/query_bike_apis.py script has to be adapted accordingly. To query other providers, this documentation is a good source of information.
To access the Lime bike API, add phone_no to config.py and follow the steps in lime_access.py (three manual steps required).
src/analysis
Jupyter notebooks to analyse the data.
- preprocess.ipynb contains the preprocessing steps that convert the raw data into a usable format.
  - raw.csv contains the data from the database
  - preprocessed.csv contains the data with added columns and fixed lat / lng
  - routed.csv contains the data with distance and waypoints
  - cleaned.csv is the cleaned routed dataset (implausible data is removed)
  - pseudonomysed.csv is the anonymized, cleaned data, following this standard
  - pseudonomysed_raw.csv is the anonymized data (NOT cleaned).
- analysis.ipynb includes analysis of provider- and bike-specific data
- pseudonomysed.ipynb includes analysis using the anonymized dataset (without information on providers).
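For readers wondering how identifiers can be replaced before publishing files such as pseudonomysed.csv, one common approach is a keyed hash, sketched below. This is a generic technique shown under assumptions, not necessarily the procedure used in preprocess.ipynb or the standard referenced above; the function name and the 16-character truncation are choices made for this example.

```python
import hashlib
import hmac


def pseudonymise(bike_id, secret_key):
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same id and key always give the same pseudonym, so trips of one
    bike stay linkable, while the original id cannot be read back without
    the key.
    """
    digest = hmac.new(secret_key, str(bike_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)


key = b"keep-this-secret-and-out-of-the-repository"
print(pseudonymise("bike-123", key))
print(pseudonymise("bike-123", key) == pseudonymise("bike-123", key))
```

Using a secret key rather than a plain hash matters because bike ids come from a small, guessable range and unkeyed hashes of them could be reversed by enumeration.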