RIMT: Re-ID Multi-People Tracking Dataset

Introduction to the dataset

RIMT is a new RGB-D dataset acquired from a mobile robot moving in a domestic environment testbed equipped with a motion capture system. The dataset was built to test and evaluate the 3D position accuracy and people re-identification performance of multi-target tracking methods based on RGB-D data.

The data was collected by teleoperating the MBOT in the ISRoboNet@Home Testbed [1] with up to 3 targets moving in the environment. The number of targets was set to 3 because this is the average number of people per household in countries such as Portugal [2] and the U.S. [3].

ISRoboNet@Home Testbed

The robot is equipped with a tilt-controlled Orbbec Astra RGB-D camera mounted on its head, which captures RGB and depth images at 640 × 480 pixel resolution and 30 Hz. The testbed is an apartment-like environment designed to benchmark service robots and is equipped with a motion capture system composed of 12 OptiTrack® "Prime 13" cameras (1.3 MP, 240 FPS), which provides real-time tracking of rigid bodies in 6 degrees of freedom with sub-millimeter precision and low latency (4.2 ms).

Robot used to acquire the dataset

Although the camera runs at 30 Hz, the dataset was recorded at a lower frequency of approximately 10 Hz, resulting in a total of 3144 RGB images, 3437 depth images and 2154 people instances. The RGB-D images, camera information, a map of the environment in the form of an occupancy grid along with its metadata, the odometry of the robot, the transforms between reference frames and the ground-truth are made available as ROS bag files.
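
For reference, here is a minimal sketch of reading RGB and depth frames from one of the bags with the standard ROS 1 rosbag and cv_bridge Python APIs; the bag filename used below is hypothetical, so adjust it to the file you downloaded.

```python
# Minimal sketch: reading RGB and depth frames from one of the dataset bags.
import rosbag
from cv_bridge import CvBridge

bridge = CvBridge()

with rosbag.Bag("still.bag") as bag:  # hypothetical filename
    for topic, msg, t in bag.read_messages(
        topics=["/head_camera/rgb/image_rect_color",
                "/head_camera/depth/image_rect"]
    ):
        # Convert sensor_msgs/Image to a NumPy/OpenCV array; "passthrough"
        # keeps the native encoding of both the RGB and the depth frames.
        image = bridge.imgmsg_to_cv2(msg, desired_encoding="passthrough")
        print(topic, t.to_sec(), image.shape)
```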

The detections produced by the people detector used in this work have also been included in the dataset. They cannot be considered ground truth, since they are generated by an automatic detector, but they were included because they can be of great utility for future work that focuses solely on the tracking algorithm and requires people detections beforehand; an example of how to extract them is sketched below.
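
The detection message type belongs to the mbot_perception stack, and rosbag reconstructs it from the definition embedded in the bag, so it is worth printing one message to inspect its fields before relying on them (the filename here is hypothetical).

```python
# Minimal sketch: pulling the pre-computed people detections from a bag.
import rosbag

with rosbag.Bag("people_following.bag") as bag:  # hypothetical filename
    for _, msg, t in bag.read_messages(
        topics=["/mbot_perception/generic_detector/detections"]
    ):
        # Print one message to discover the (custom) detection fields.
        print(t.to_sec())
        print(msg)
        break
```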

As ground-truth, the 3D positions of the people in the environment were obtained using the motion capture system, together with the ground-truth position of the robot. Markers were placed on the targets and on the robot, and track IDs were also obtained directly from the motion capture system. The 3D ground-truth of targets that were out of the robot's field of view or completely occluded was manually deleted.

In addition, there were frames where the motion capture system failed because, given the positioning of the cameras, some markers were not visible, so the 3D ground-truth of some targets was not registered. In these cases, no ground-truth was associated with the frames, and they should not be considered when evaluating performance metrics. After this process, ground-truth is associated with approximately 70% of the frames. This filtering is important because keeping frames that lack ground-truth for some targets would lead to errors when evaluating methods on this dataset, such as spurious false positives (a correct track being counted as a false positive because the corresponding target is missing from the ground-truth).
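
As an illustration, the sketch below skips frames without ground-truth when evaluating a tracker; the topic names come from the dataset, while the bag filename and the timestamp-matching tolerance are assumptions made for this example.

```python
# Minimal sketch: evaluate a tracker only on frames that carry ground-truth.
import rosbag

GT_TOPICS = ["/3D_ground_truth_1", "/3D_ground_truth_2", "/3D_ground_truth_3"]
TOLERANCE = 0.05  # seconds; roughly half a frame period at ~10 Hz (assumed)

# Collect the timestamps at which each target has ground-truth.
gt_stamps = {topic: [] for topic in GT_TOPICS}
with rosbag.Bag("moving_base.bag") as bag:  # hypothetical filename
    for topic, _, t in bag.read_messages(topics=GT_TOPICS):
        gt_stamps[topic].append(t.to_sec())

def has_ground_truth(stamp):
    """Return True if any target has ground-truth within TOLERANCE of the
    frame timestamp; frames returning False should not be counted when
    computing tracking metrics."""
    return any(
        abs(stamp - s) <= TOLERANCE
        for stamps in gt_stamps.values()
        for s in stamps
    )
```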

Description of the dataset sequences

The dataset consists of 7 videos with durations ranging from 40 s to 1 min 10 s. Each video has different characteristics (camera and people movement) and represents a different case, so that the dataset is representative of several situations that can occur in an environment with multiple people and obstacles. The 7 sequences (videos) in the dataset are the following:

  • Still: sequence recorded with a static camera. Three targets move around freely in front of the camera without being occluded by obstacles.
  • Moving camera: sequence recorded with the camera rotating while the robot’s base does not move. Three targets move around freely in front of the camera without being occluded by obstacles.
  • Moving base: sequence recorded with the robot moving around the environment. Three targets are present and are frequently occluded by obstacles. One of the targets also sits down and gets up again during the sequence.
  • Chairs: sequence recorded with the robot moving around the environment. Three targets are sitting in chairs around two tables; during the sequence they get up, walk around and switch places several times.
  • People following: sequence recorded with the robot being teleoperated to follow a specific person around the environment. During the sequence, three targets are present and there are several occlusions caused by obstacles and people crossing paths.
  • Changing clothes 1: sequence recorded with the robot moving around the environment. Two targets are present. Both of the targets change their clothing during the sequence while in front of the camera.
  • Changing clothes 2: sequence recorded with the robot moving around the environment. Two targets are present. One of the targets exits the scene and re-enters with different clothes twice.

Moving base sequence example

These sequences cover most of the common cases that can occur in a domestic environment. There are several occlusions caused by furniture such as chairs, tables and a sofa, or caused by other people when targets cross paths with each other. A specific case where the robot follows a person was also recorded, since this is a common task executed by mobile service robots. The last two sequences represent cases where targets change their clothes during the sequence, which is a challenging scenario for people re-identification. This dataset also has the particularity that all of the people present are wearing surgical masks, due to the COVID-19 pandemic.

Content of the dataset

Every bag contains several ROS topics, including:

```
/head_camera/depth/camera_info                                  - Depth camera info
/head_camera/depth/image_rect                                   - Depth image
/head_camera/depth_registered/camera_info                       - Registered depth camera info
/head_camera/rgb/camera_info                                    - RGB camera info
/head_camera/rgb/image_rect_color                               - RGB image
/mbot_perception/generic_detector/detection_image/compressed    - RGB image with object detections as bounding boxes
/mbot_perception/generic_detector/detections                    - Detections generated by the object detector
/mbot_perception/generic_localizer/localized_objects            - 3D poses of detected objects in the environment
/map                                                            - Map of the environment
/map_metadata                                                   - Map metadata
/odom                                                           - Odometry of the robot
/tf                                                             - Tf tree
/tf_static                                                      - Static transforms
/3D_ground_truth_1                                              - 3D position ground-truth of target 1
/3D_ground_truth_2                                              - 3D position ground-truth of target 2
/3D_ground_truth_3                                              - 3D position ground-truth of target 3
```
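
Before processing a bag, it can help to confirm which of these topics it actually contains; below is a minimal sketch using the rosbag Python API (the filename is hypothetical).

```python
# List every topic stored in a bag, with its message type and count.
import rosbag

with rosbag.Bag("chairs.bag") as bag:  # hypothetical filename
    for topic, info in sorted(bag.get_type_and_topic_info().topics.items()):
        print(f"{topic}  {info.msg_type}  ({info.message_count} messages)")
```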

The Raw_bags folder contains bags for all the sequences of the dataset with all the ground-truth positions generated by the motion capture system, including frames where the ground-truth recording of some targets failed and including the ground-truth of targets outside the camera view.

The Manually_annotated_bags folder contains bags for all the sequences of the dataset with ground-truth present only in frames where there were no failures of the motion capture system, and only for targets that are in the camera view. The ground-truth in the other frames was manually removed.

Download Dataset

To download this dataset, please use the following link.

Footnotes

  1. https://welcome.isr.tecnico.ulisboa.pt/isrobonet

  2. https://www.pordata.pt/portugal/dimensao+media+dos+agregados+domesticos+privados-511

  3. https://www.statista.com/statistics/183657/average-size-of-a-family-in-the-us/