Understanding the 3D surroundings, including both background stuff and foreground objects, is important for autonomous driving. In the traditional 3D object detection task, a foreground object is represented by a 3D bounding box. However, the geometric shape of an object is complex and cannot be captured by a simple 3D box, and the perception of the background is missing entirely. The goal of this task is to predict the 3D occupancy of the scene. We provide a large-scale occupancy benchmark based on the nuScenes dataset. The benchmark is a voxelized representation of 3D space, and the occupancy state and semantics of each voxel are jointly estimated. The difficulty of this task lies in the dense prediction of 3D space from surround-view images.
Given images from multiple cameras, the goal is to predict the current occupancy state and semantics of each voxel in the scene. Each voxel is predicted to be either free or occupied. If a voxel is occupied, its semantic class must be predicted as well. We also provide a binary observed/unobserved mask for each frame. An unobserved voxel is one that is invisible in the current camera observation; such voxels are ignored in the evaluation stage.
Leaderboard ranking for this challenge is by the mean intersection-over-union (mIoU) over all classes.
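Let $C$ be the number of classes.
$$
\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c},
$$
where $TP_c$, $FP_c$, and $FN_c$ are the numbers of true positive, false positive, and false negative predictions for class $c$.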
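We also measure the F-score as the harmonic mean of the completeness $P_c$ and the accuracy $P_a$:
$$
\text{F-score} = \left( \frac{P_a^{-1} + P_c^{-1}}{2} \right)^{-1},
$$
where $P_a$ is the percentage of predicted voxels that are within a distance threshold of the ground truth voxels, and $P_c$ is the percentage of ground truth voxels that are within a distance threshold of the predicted voxels.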
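
The mIoU ranking metric can be sketched as follows, with unobserved voxels masked out before counting. This is only an illustration under stated assumptions, not the official evaluation code: `masked_miou` and its parameters are hypothetical names, the set of classes that is averaged is left as a parameter, and classes that appear in neither the prediction nor the ground truth are skipped.

```python
import numpy as np


def masked_miou(pred, gt, mask, classes=range(18)):
    """Mean IoU over `classes`, counted only where `mask` is nonzero.

    pred, gt: integer label arrays of identical shape (e.g. [200, 200, 16]).
    mask:     array of the same shape; nonzero marks voxels to evaluate.
    """
    keep = np.asarray(mask).astype(bool)
    pred = np.asarray(pred)[keep]
    gt = np.asarray(gt)[keep]

    ious = []
    for c in classes:
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        # Assumption: a class absent from both arrays is left out of the mean.
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else float("nan")
```

Passing the camera visibility mask as `mask` restricts the score to voxels that are visible to the cameras.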
Type | Info |
---|---|
mini | 404 |
train | 28,130 |
val | 6,019 |
test | 6,006 |
cameras | 6 |
voxel size | 0.4m |
range | [-40m, -40m, -1m, 40m, 40m, 5.4m] |
volume size | [200, 200, 16] |
#classes | 0 - 17 |
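
The volume size follows from the range and the voxel size: (40 - (-40)) / 0.4 = 200 along x and y, and (5.4 - (-1)) / 0.4 = 16 along z. The sketch below reproduces that arithmetic and maps a 3D point to a voxel index. It assumes that the array axes are ordered (x, y, z), that index 0 corresponds to the lower bound of the range, and that points are expressed in the same frame as the range; `point_to_voxel` and `grid_shape` are illustrative names.

```python
import math

# Values taken from the dataset table.
VOXEL_SIZE = 0.4                  # meters
RANGE_MIN = (-40.0, -40.0, -1.0)  # x_min, y_min, z_min in meters
RANGE_MAX = (40.0, 40.0, 5.4)     # x_max, y_max, z_max in meters


def grid_shape():
    """Number of voxels along each axis, derived from range and voxel size."""
    return tuple(
        round((hi - lo) / VOXEL_SIZE) for lo, hi in zip(RANGE_MIN, RANGE_MAX)
    )


def point_to_voxel(x, y, z):
    """Return the (i, j, k) voxel index of a point, or None if out of range.

    Assumption: index 0 on each axis starts at the lower bound of the range.
    """
    index = []
    for value, lo, size in zip((x, y, z), RANGE_MIN, grid_shape()):
        i = math.floor((value - lo) / VOXEL_SIZE)
        if not 0 <= i < size:
            return None
        index.append(i)
    return tuple(index)
```

For example, `grid_shape()` gives the `[200, 200, 16]` volume listed in the table.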
- The dataset contains 18 classes. Classes 0 to 16 are defined as in the nuScenes-lidarseg dataset. Label 17 marks voxels that are not occupied by anything and is named `free`. Voxel semantics for each sample frame are given as `[semantics]` in `labels.npz`.
- How are the labels annotated? The ground truth occupancy labels are derived from accumulated LiDAR scans with human annotations.
  - If a voxel reflects a LiDAR point, it is assigned the same semantic label as that LiDAR point;
  - If a LiDAR beam passes through a voxel in the air, the voxel is set to `free`;
  - Otherwise, the voxel is set to unknown, or unobserved. This happens because of LiDAR sparsity or because the voxel is occluded, e.g. by a wall. In the dataset, `[mask_lidar]` is a 0-1 binary mask in which 0 marks unobserved voxels. As shown in Fig.1(b), grey voxels are unobserved. Due to the limitations of the visualization tool, we only show unobserved voxels at the same height as the ground.
- Camera visibility. The LiDAR and the cameras are mounted at different positions, so some voxels observed in the LiDAR view are not seen by the cameras. Since we focus on a vision-centric task, we provide a binary voxel mask `[mask_camera]` indicating whether each voxel is observed in the current camera view. As shown in Fig.1(c), white voxels are observed in the accumulated LiDAR view but unobserved in the current camera view.
- Both `[mask_lidar]` and `[mask_camera]` are optional for training. Participants do not need to predict the masks. Only `[mask_camera]` is used for evaluation; unobserved voxels are excluded when computing the F-score and mIoU.
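
As a rough illustration of how the labels and masks fit together, the sketch below reads one `labels.npz` file with NumPy and counts the camera-visible voxels. The array keys `semantics`, `mask_lidar`, and `mask_camera` and the free label 17 come from the description above; the function names are hypothetical, and the array dtypes are not assumed.

```python
import numpy as np

FREE_LABEL = 17  # label of unoccupied voxels, per the class description


def load_occupancy_labels(npz_path):
    """Read semantics and both masks from one labels.npz file."""
    with np.load(npz_path) as data:
        semantics = data["semantics"]
        # Masks are 0-1 arrays; convert to bool so they can index directly.
        mask_lidar = data["mask_lidar"].astype(bool)
        mask_camera = data["mask_camera"].astype(bool)
    return semantics, mask_lidar, mask_camera


def summarize(semantics, mask_camera):
    """Count camera-visible voxels and how many of them are occupied."""
    visible = semantics[mask_camera]
    occupied = visible != FREE_LABEL
    return {
        "visible_voxels": int(visible.size),
        "occupied_visible_voxels": int(occupied.sum()),
    }
```

Indexing with `mask_camera` mirrors the evaluation rule: voxels outside the camera view never enter the counts.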
The files mentioned below can also be downloaded via OpenDataLab. It is recommended to use the provided command line interface for faster downloads.
Subset | Google Drive | Baidu Cloud | Size |
---|---|---|---|
mini | data | data | approx. 440M |
trainval | data | data | approx. 32G |
test | coming soon | coming soon | ~ |
- Mini and trainval data contain three parts: `imgs`, `gts` and `annotations`. The `imgs` data follow the same hierarchy as the image samples in the original nuScenes dataset.
The hierarchy of the folder `Occpancy3D-nuScenes-V1.0/` is described below:
└── Occpancy3D-nuScenes-V1.0
|
├── mini
|
├── trainval
| ├── imgs
| | ├── CAM_BACK
| | | ├── n015-2018-07-18-11-07-57+0800__CAM_BACK__1531883530437525.jpg
| | | └── ...
| | ├── CAM_BACK_LEFT
| | | ├── n015-2018-07-18-11-07-57+0800__CAM_BACK_LEFT__1531883530447423.jpg
| | | └── ...
| | └── ...
| |
| ├── gts
| | ├── [scene_name]
| | | ├── [frame_token]
| | | | └── labels.npz
| | | └── ...
| | └── ...
| |
| └── annotations.json
|
└── test
├── imgs
└── annotations.json
- `imgs/` contains images captured by various cameras.
- `gts/` contains the ground truth of each sample. `[scene_name]` specifies a sequence of frames, and `[frame_token]` specifies a single frame in a sequence.
- `annotations.json` contains meta infos of the dataset.
- `labels.npz` contains `[semantics]`, `[mask_lidar]`, and `[mask_camera]` for each frame.
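
Given the folder hierarchy above, the location of a frame's ground truth can be derived from its scene name and frame token. The following sketch uses `pathlib`; `label_path` and `iter_label_files` are illustrative helpers, and the `root` argument is assumed to point at the extracted dataset folder.

```python
from pathlib import Path


def label_path(root, split, scene_name, frame_token):
    """Build <root>/<split>/gts/<scene_name>/<frame_token>/labels.npz."""
    return Path(root) / split / "gts" / scene_name / frame_token / "labels.npz"


def iter_label_files(root, split="trainval"):
    """Yield (scene_name, frame_token, path) for every labels.npz in a split."""
    gts = Path(root) / split / "gts"
    for path in sorted(gts.glob("*/*/labels.npz")):
        # The two parent folders are the scene name and the frame token.
        scene_name, frame_token = path.parts[-3], path.parts[-2]
        yield scene_name, frame_token, path
```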
annotations {
    "train_split": ["scene-0001", ...],          <list> -- training dataset split by scene_name
    "val_split": ["scene-0003", ...],            <list> -- validation dataset split by scene_name
    "scene_infos": {                             <dict> -- meta infos of the scenes
        [scene_name]: {                          <str> -- name of the scene
            [frame_token]: {                     <str> -- samples in a scene, ordered by time
                "timestamp":                     <str> -- timestamp (or token), unique by sample
                "camera_sensor": {               <dict> -- meta infos of the camera sensor
                    [cam_token]: {               <str> -- token of the camera
                        "img_path":              <str> -- corresponding image file path, *.jpg
                        "intrinsic":             <float> [3, 3] -- intrinsic camera calibration
                        "extrinsic": {           <dict> -- extrinsic parameters of the camera
                            "translation":       <float> [3] -- coordinate system origin in meters
                            "rotation":          <float> [4] -- coordinate system orientation as quaternion
                        }
                        "ego_pose": {            <dict> -- vehicle pose of the camera
                            "translation":       <float> [3] -- coordinate system origin in meters
                            "rotation":          <float> [4] -- coordinate system orientation as quaternion
                        }
                    },
                    ...
                },
                "ego_pose": {                    <dict> -- vehicle pose
                    "translation":               <float> [3] -- coordinate system origin in meters
                    "rotation":                  <float> [4] -- coordinate system orientation as quaternion
                },
                "gt_path":                       <str> -- corresponding 3D voxel gt path, *.npz
                "next":                          <str> -- frame_token of the next keyframe in the scene
                "prev":                          <str> -- frame_token of the previous keyframe in the scene
            }
        }
    }
}
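
A minimal sketch of walking this structure with the standard `json` module is shown below. It assumes the top level of `annotations.json` is the dictionary holding `train_split`, `val_split`, and `scene_infos`; the helper names are hypothetical.

```python
import json


def load_annotations(path):
    """Load annotations.json into a Python dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def iter_frames(annotations, split_key="train_split"):
    """Yield (scene_name, frame_token, frame_info) for each frame in a split."""
    scene_infos = annotations["scene_infos"]
    for scene_name in annotations[split_key]:
        for frame_token, frame_info in scene_infos[scene_name].items():
            yield scene_name, frame_token, frame_info


def camera_image_paths(frame_info):
    """Map each camera token of a frame to its image path."""
    return {
        cam_token: cam["img_path"]
        for cam_token, cam in frame_info["camera_sensor"].items()
    }
```

Each `frame_info` also carries `gt_path`, so the images and the voxel ground truth of a frame can be paired in one pass over a split.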
- nuScenes (issues-721) lacks translation along the z-axis, which makes it hard to recover accurate 6D localization and leads to misalignment of point clouds when accumulating them over whole scenes. As a result, ground stratification occurs in some of the data.
We provide a baseline model based on BEVFormer.
Please refer to getting_started for details.
- Mar 16, 2023 - Challenge Period Open.
- May 27, 2023 - Challenge Period End.
- May 29, 2023 - Finalist Notification.
- Jun 10, 2023 - Technical Report Deadline.
- Jun 12, 2023 - Winner Announcement.
To be released.
Before using the dataset, you should register on the website and agree to the nuScenes terms of use. All code within this repository is under the Apache License 2.0.