To ensure traceability, reproducibility, and standardization of all ML datasets and models generated and consumed within TRI, we developed the Dataset Governance Policy (DGP), which codifies the schema and maintenance of all TRI's Autonomous Vehicle (AV) datasets.
- Schema: Protobuf-based schemas for raw data, annotations and dataset management.
- DataLoaders: Universal PyTorch DatasetClass to load all DGP-compliant datasets.
- Visualizer: Simple web-based visualizer for viewing annotations.
- CLI: Main CLI for handling DGP datasets.
Getting started is as simple as initializing a dataset class with the relevant dataset JSON, raw data sensor names, annotation types, and split information. Below, we show an example of initializing a PyTorch dataset for multi-modal learning from 2D and 3D bounding boxes.
```python
from dgp.datasets import SynchronizedSceneDataset

# Load synchronized pairs of camera and lidar frames, with 2D and 3D
# bounding box annotations.
dataset = SynchronizedSceneDataset(
    '<dataset_name>_v0.0.json',
    datum_names=('camera_01', 'lidar'),
    requested_annotations=('bounding_box_2d', 'bounding_box_3d'),
    split='train',
)
```
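Under the hood, a PyTorch-compatible dataset only needs to implement the map-style protocol (`__len__` and `__getitem__`). A minimal stdlib-only sketch of that shape is below; the class name and sample field names are illustrative stand-ins, not the actual DGP schema:

```python
# Toy map-style dataset mimicking the constructor signature shown above.
# Field names ("annotations", etc.) are illustrative only, not DGP's schema.
class ToySynchronizedDataset:
    def __init__(self, datum_names, requested_annotations, split):
        self.datum_names = tuple(datum_names)
        self.requested_annotations = tuple(requested_annotations)
        self.split = split
        # Stand-in for samples that would be indexed from the dataset JSON:
        # each sample is a dict of per-sensor datums keyed by sensor name.
        self._samples = [
            {name: {"annotations": dict.fromkeys(self.requested_annotations)}
             for name in self.datum_names}
            for _ in range(3)
        ]

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        # One synchronized sample across all requested sensors.
        return self._samples[index]


dataset = ToySynchronizedDataset(
    datum_names=('camera_01', 'lidar'),
    requested_annotations=('bounding_box_2d', 'bounding_box_3d'),
    split='train',
)
sample = dataset[0]
```

Because the dataset satisfies this protocol, it can be handed directly to a `torch.utils.data.DataLoader` for batching and shuffling.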
A list of starter scripts is provided in the examples directory.
- examples/load_dataset.py: Simple example script to load a multi-modal dataset based on the Getting Started section above.
You can build the base Docker image and run the tests within Docker via:
```sh
make docker-build
make docker-run-tests
```
Run the Streamlit-based interactive visualizer via:
```sh
make docker-start-visualizer
```