Programming Data Science – Semester Project
Within the same folder as setup.py, run
pip3 install .
to install the package. Use the -e flag to install in development mode. Afterwards, import the package via
import nextbike
Project Report
Please access the project report via the root directory of this repository or via this link: Report.pdf
Quick Start
Preprocessing API
The preprocessing package exports two classes, Preprocessor and Transformer. Import them as follows:
from nextbike.preprocessing import Preprocessor, Transformer
The Preprocessor can load and clean a raw NextBike data set. Column format validation of the input data is done automatically.
preprocessor = Preprocessor()
preprocessor.load_gdf() # Load the data set as geopandas GeoDataFrame
preprocessor.clean_gdf() # Clean the data set for Mannheim
At any point, the current state of the data can be accessed through the gdf property. A UserWarning is raised if the GeoDataFrame is not initialized.
preprocessor.gdf
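The following minimal sketch illustrates how a property like gdf can warn about missing data. It does not reproduce the package's code: SketchPreprocessor is a hypothetical stand-in, and it assumes the warning is emitted through warnings.warn rather than raised as an exception.

```python
import warnings


class SketchPreprocessor:
    """Hypothetical stand-in, not the real nextbike Preprocessor."""

    def __init__(self):
        self._gdf = None  # a real class would fill this in its load step

    @property
    def gdf(self):
        # Warn instead of failing hard when nothing has been loaded yet.
        if self._gdf is None:
            warnings.warn("GeoDataFrame is not initialized.", UserWarning)
        return self._gdf


# Callers can record the warning and react to it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = SketchPreprocessor().gdf

print(value, [w.category.__name__ for w in caught])
```

If the real property behaves this way, checking the returned value for None before using it is a simple guard.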
The Transformer class transforms the preprocessed data set into the target data format. It requires a Preprocessor instance and checks automatically on instantiation whether the Preprocessor has run successfully.
transformer = Transformer(preprocessor)
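A check on instantiation can be sketched as follows. This is an illustration only: SketchTransformer is a hypothetical stand-in, and both the condition (gdf is not None) and the error type (ValueError) are assumptions, since this README does not state how the real Transformer reports an unprepared Preprocessor.

```python
from types import SimpleNamespace


class SketchTransformer:
    """Hypothetical stand-in, not the real nextbike Transformer."""

    def __init__(self, preprocessor):
        # Fail early if the preprocessing step has not produced any data.
        if getattr(preprocessor, "gdf", None) is None:
            raise ValueError("Run the Preprocessor before creating a Transformer.")
        self.preprocessor = preprocessor


# Placeholder objects that only mimic the gdf attribute.
ready = SimpleNamespace(gdf=[{"bike": 1}])
not_ready = SimpleNamespace(gdf=None)

sketch_transformer = SketchTransformer(ready)
try:
    SketchTransformer(not_ready)
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```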
Transform and save the data set as follows:
transformer.transform()
# filename parameter is optional
transformer.save(filename='mannheim_transformed.csv')
Prediction API
Duration Prediction
Loading Data
Data can be loaded from a valid Transformer instance or from a file path. Using a Transformer instance is recommended because, under the hood, the prediction sub-package also uses a Transformer to load data from a file path.
Data loading with the Transformer:
from nextbike.models import DurationModel
duration_model = DurationModel()
duration_model.load_from_transformer(transformer)
Data loading from a file path to the raw input data:
from nextbike.models import DurationModel
duration_model = DurationModel()
duration_model.load_from_csv('data/input/mannheim.csv')
Training
Training can be conducted on an instantiated Model, in this case a DurationModel. Please note that the methods called are standardized for all implemented models through the abstract base class nextbike.models.Model.
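To illustrate how an abstract base class standardizes the method names, here is a minimal sketch. SketchModel and MeanDurationModel are hypothetical; the real nextbike.models.Model may declare different signatures, and the toy scoring metric (mean absolute error) is an assumption chosen only to keep the example short.

```python
from abc import ABC, abstractmethod


class SketchModel(ABC):
    """Hypothetical base class; the real nextbike.models.Model may differ."""

    @abstractmethod
    def train(self):
        ...

    @abstractmethod
    def predict(self, path=None):
        ...

    @abstractmethod
    def training_score(self):
        ...


class MeanDurationModel(SketchModel):
    """Toy model that always predicts the mean of its training targets."""

    def __init__(self, durations):
        self.durations = list(durations)
        self.mean = None
        self.predictions = None

    def train(self):
        self.mean = sum(self.durations) / len(self.durations)

    def predict(self, path=None):
        # path is ignored here; the toy model predicts on its training data.
        self.predictions = [self.mean] * len(self.durations)
        return self.predictions

    def training_score(self):
        # Mean absolute error on the training data.
        errors = [abs(p - d) for p, d in zip(self.predictions, self.durations)]
        return sum(errors) / len(errors)


model = MeanDurationModel([10.0, 20.0, 30.0])
model.train()
model.predict()
score = model.training_score()
print(round(score, 3))
```

Because every subclass must implement the same three methods, calling code can treat all models the same way regardless of which one it receives.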
Training the model:
duration_model.train()
Printing the training score after prediction:
duration_model.predict() # Conduct predictions on the training data
duration_model.training_score() # Print the training score to the console
Predict unseen Data
Unseen data can be predicted on a Model instance by calling the predict method with a path to the data to be predicted. The method automatically loads the previously trained model or raises an error if no trained model exists.
Prediction can be conducted as follows:
duration_model = DurationModel() # Create a DurationModel instance
duration_model.predict('data/input/mannheim_test.csv') # Predict unseen data with the previously trained model
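The "load the trained model or fail" behavior can be sketched as below. This is not the package's implementation: the helper load_trained_model, the pickle format, the file name, and the FileNotFoundError are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_trained_model(model_path):
    """Return the pickled model at model_path or fail with a clear error."""
    model_path = Path(model_path)
    if not model_path.exists():
        raise FileNotFoundError(
            f"No trained model found at {model_path}. Train a model first."
        )
    with model_path.open("rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "duration_model.pkl"  # hypothetical file name

    # First attempt: nothing has been trained and saved yet.
    try:
        load_trained_model(path)
        missing_error = None
    except FileNotFoundError as error:
        missing_error = error

    # Persist a placeholder "model" and load it again.
    with path.open("wb") as handle:
        pickle.dump({"mean_duration": 20.0}, handle)
    restored = load_trained_model(path)

print(type(missing_error).__name__, restored)
```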
Direction Prediction
Direction prediction works exactly the same way as duration prediction. Use the nextbike.models.DirectionModel class instead of nextbike.models.DurationModel. All methods are the same as for the DurationModel.
For example:
from nextbike.models import DirectionModel
direction_model = DirectionModel()
direction_model.load_from_transformer(transformer)
...
Combine both predictions into one data set
Currently, the direction and duration prediction models save their predictions to separate data sets on disk. To combine them automatically into one data set, use combine_predictions() as follows:
from nextbike.io import combine_predictions
combine_predictions()
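Conceptually, combining the two outputs means joining the prediction tables row by row. The sketch below shows that idea with the standard library only. It makes no claim about how combine_predictions() works internally: the helper combine_rows, the column names, and the sample values are hypothetical.

```python
import csv
import io


def combine_rows(duration_file, direction_file):
    """Merge two prediction tables row by row into one list of dicts."""
    duration_rows = list(csv.DictReader(duration_file))
    direction_rows = list(csv.DictReader(direction_file))
    if len(duration_rows) != len(direction_rows):
        raise ValueError("Both prediction files must have the same number of rows.")
    # Later keys win on a name clash, so the column names should be distinct.
    return [{**a, **b} for a, b in zip(duration_rows, direction_rows)]


# Hypothetical column names and values, for illustration only.
duration_csv = io.StringIO("duration_prediction\n12.5\n30.0\n")
direction_csv = io.StringIO("direction_prediction\ntowards\naway\n")

combined = combine_rows(duration_csv, direction_csv)
print(combined)
```

A row-wise join like this is only valid if both files describe the same trips in the same order.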
Command Line Interface (CLI)
The following CLI commands are available. Each command provides a help text in case you have problems using it.
Transform the Raw Data
nextbike transform [--output <output-path>] <data-path>
Train the Duration and Direction Model
nextbike train <data-path>
Predict new Data
nextbike predict <data-path>
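The command grammar above can be mirrored with a small argparse parser, shown here only to make the argument structure explicit. The real CLI may be built with a different library; build_parser and the attribute names (command, output, data_path) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the three documented commands."""
    parser = argparse.ArgumentParser(prog="nextbike")
    subcommands = parser.add_subparsers(dest="command", required=True)

    transform = subcommands.add_parser("transform", help="Transform the raw data")
    transform.add_argument("--output", metavar="<output-path>", default=None)
    transform.add_argument("data_path", metavar="<data-path>")

    train = subcommands.add_parser(
        "train", help="Train the duration and direction model"
    )
    train.add_argument("data_path", metavar="<data-path>")

    predict = subcommands.add_parser("predict", help="Predict new data")
    predict.add_argument("data_path", metavar="<data-path>")
    return parser


# Parse an example invocation instead of reading sys.argv.
args = build_parser().parse_args(
    ["transform", "--output", "out.csv", "data/input/mannheim.csv"]
)
print(args.command, args.output, args.data_path)
```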