/RealMAN

A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization


Description

The Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset provides annotated multi-channel speech and noise recordings for dynamic speech enhancement and localization:

  • A 32-channel array with high-fidelity microphones is used for recording
  • A loudspeaker is used for playing source speech signals
  • A total of 83 hours of speech (48 hours with a static speaker and 35 hours with a moving speaker) is recorded in 32 different scenes, and 144 hours of background noise are recorded in 31 different scenes
  • Both speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments
  • The azimuth angle of the loudspeaker is annotated with an omnidirectional fisheye camera and is used for training source localization networks
  • The direct-path signal is obtained by filtering the played speech signal with an estimated direct-path propagation filter and is used for training speech enhancement networks

The RealMAN dataset is valuable in two aspects:

  • Benchmark speech enhancement and localization algorithms in real scenarios
  • Offer a substantial amount of real-world training data for potentially improving the performance of real-world applications

The details of the RealMAN dataset are described in the following paper: [arXiv]

Download

The entire dataset can be downloaded from the Original data page or the AISHELL page. It comprises the following components:

| File | Size | Description |
|---|---|---|
| train.rar | 521.76 GB | The training set, consisting of 36.6 hours of static speaker speech and 26.6 hours of moving speaker speech (ma_speech), 106.3 hours of noise recordings (ma_noise), 0-channel direct-path speech (dp_speech) and sound source location (train_*_source_location.csv). |
| val_raw.rar | 65.57 GB | The raw validation set, consisting of 4.5 hours of static speaker speech and 3.3 hours of moving speaker speech (ma_speech), 16.0 hours of noise recordings (ma_noise), 0-channel direct-path speech (dp_speech) and sound source location (val_*_source_location.csv). |
| val.rar | 25.57 GB | The validation set, consisting of mixed noisy speech recordings (ma_noisy_speech), 0-channel direct-path speech (dp_speech) and sound source location (val_*_source_location.csv). |
| test_raw.rar | 91.75 GB | The raw test set, consisting of 6.9 hours of static speaker speech and 4.8 hours of moving speaker speech (ma_speech), 22.2 hours of noise recordings (ma_noise), 0-channel direct-path speech (dp_speech) and sound source location (test_*_source_location.csv). |
| test.rar | 38.02 GB | The test set, consisting of mixed noisy speech recordings (ma_noisy_speech), 0-channel direct-path speech (dp_speech) and sound source location (test_*_source_location.csv). |
| dataset_info.rar | 127.9 MB | The dataset information file, including scene photos, scene information (T60, recording duration, etc.) and speaker information. |
| transcriptions.trn | 2.4 MB | The transcription file for the speech in the dataset. |
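
As a quick sanity check, the per-split durations listed above can be summed and compared with the totals quoted in the description (48 hours static, 35 hours moving, 144 hours of noise). The sketch below uses only the figures from the table; the dictionary layout and variable names are illustrative and not part of the dataset tooling.

```python
# Durations in hours, copied from the table above: (static, moving, noise).
# Only the train, val_raw and test_raw archives list durations.
splits = {
    "train": (36.6, 26.6, 106.3),
    "val_raw": (4.5, 3.3, 16.0),
    "test_raw": (6.9, 4.8, 22.2),
}

static_h = sum(static for static, _, _ in splits.values())
moving_h = sum(moving for _, moving, _ in splits.values())
noise_h = sum(noise for _, _, noise in splits.values())

print(f"static speech: {static_h:.1f} h")
print(f"moving speech: {moving_h:.1f} h")
print(f"all speech:    {static_h + moving_h:.1f} h")
print(f"noise:         {noise_h:.1f} h")
```

The sums come to about 48.0, 34.7 and 144.5 hours, which matches the quoted totals up to rounding of the per-split figures.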

The dataset is organized into the following directory structure:

RealMAN
├── transcriptions.trn
├── dataset_info
│   ├── scene_images
│   ├── scene_info.json
│   └── speaker_info.csv
└── train|val|test|val_raw|test_raw
    ├── train_moving_source_location.csv
    ├── train_static_source_location.csv
    ├── dp_speech
    │   ├── BadmintonCourt2
    │   │   ├── moving
    │   │   │   ├── 0010
    │   │   │   │   ├── TRAIN_M_BAD2_0010_0003.flac
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   └── static
    │   └── ...
    ├── ma_speech|ma_noisy_speech
    │   ├── BadmintonCourt2
    │   │   ├── moving
    │   │   │   ├── 0010
    │   │   │   │   ├── TRAIN_M_BAD2_0010_0003_CH0.flac
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   └── static
    │   └── ...
    └── ma_noise
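
Because the dp_speech tree appears to mirror ma_speech (and ma_noisy_speech), the direct-path target for a recorded channel file can be derived from its path alone. The following is a minimal sketch under that assumption; the helper name `direct_path_target` is hypothetical, and it assumes the channel tag (for example `CH0`) is the last underscore-separated field of the file name, as in the listing above.

```python
from pathlib import Path


def direct_path_target(recorded: Path) -> Path:
    """Map a recorded channel file to the matching dp_speech file.

    Assumes dp_speech mirrors the ma_speech / ma_noisy_speech layout and
    that the file stem ends in a channel tag such as "_CH0".
    """
    parts = list(recorded.parts)
    for i, part in enumerate(parts):
        if part in ("ma_speech", "ma_noisy_speech"):
            parts[i] = "dp_speech"
            break
    else:
        raise ValueError(f"no ma_speech or ma_noisy_speech directory in {recorded}")

    # Split off the trailing channel tag, e.g. "..._0003_CH0" -> "..._0003".
    stem, _, channel = recorded.stem.rpartition("_")
    if not stem or not channel.startswith("CH"):
        raise ValueError(f"no channel tag in file name: {recorded.name}")

    parts[-1] = stem + recorded.suffix
    return Path(*parts)


example = Path(
    "RealMAN/train/ma_speech/BadmintonCourt2/moving/0010/"
    "TRAIN_M_BAD2_0010_0003_CH0.flac"
)
print(direct_path_target(example).as_posix())
```

For the example path, this yields the `TRAIN_M_BAD2_0010_0003.flac` entry under `dp_speech/BadmintonCourt2/moving/0010`, matching the tree shown above.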

The naming convention is as follows:

# Recorded Signal
[TRAIN|VAL|TEST]_[M|S]_scene_speakerId_utteranceId_channelId.flac

# Direct-Path Signal
[TRAIN|VAL|TEST]_[M|S]_scene_speakerId_utteranceId.flac

# Source Location
[train|val|test]_[moving|static]_source_location.csv
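
The convention above lends itself to a small parser. The sketch below is one possible reading of it, not an official utility: the names `RealManName` and `parse_name` are hypothetical, `M` and `S` are taken to mean moving and static (inferred from the directory names), the channel field is assumed to look like `CH0` as in the directory listing, and scene tags are assumed to contain no underscore.

```python
import re
from typing import NamedTuple, Optional


class RealManName(NamedTuple):
    split: str              # TRAIN, VAL or TEST
    motion: str             # "moving" or "static" (assumed meaning of M / S)
    scene: str              # scene tag, e.g. BAD2
    speaker_id: str
    utterance_id: str
    channel: Optional[int]  # None for direct-path files


# The optional trailing _CH<n> distinguishes recorded from direct-path files.
_PATTERN = re.compile(
    r"^(?P<split>TRAIN|VAL|TEST)_(?P<motion>[MS])_(?P<scene>[^_]+)"
    r"_(?P<speaker>[^_]+)_(?P<utt>[^_.]+)(?:_CH(?P<ch>\d+))?\.flac$"
)


def parse_name(filename: str) -> RealManName:
    """Split a RealMAN audio file name into its fields."""
    m = _PATTERN.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    ch = m.group("ch")
    return RealManName(
        split=m.group("split"),
        motion="moving" if m.group("motion") == "M" else "static",
        scene=m.group("scene"),
        speaker_id=m.group("speaker"),
        utterance_id=m.group("utt"),
        channel=int(ch) if ch is not None else None,
    )


print(parse_name("TRAIN_M_BAD2_0010_0003_CH0.flac"))
print(parse_name("TRAIN_M_BAD2_0010_0003.flac"))
```

Speaker and utterance IDs are kept as strings so that leading zeros such as `0010` survive a round trip back to a file name.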

Baseline

License

The dataset is licensed under the Creative Commons Attribution 4.0 International (CC-BY-4.0) license.