A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization
The Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset provides annotated multi-channel speech and noise recordings for dynamic speech enhancement and localization:
- A 32-channel array with high-fidelity microphones is used for recording
- A loudspeaker is used for playing source speech signals
- A total of 83 hours of speech (48 hours with a static speaker and 35 hours with a moving speaker) is recorded in 32 different scenes, and 144 hours of background noise are recorded in 31 different scenes
- Both speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments
- The azimuth angle of the loudspeaker is annotated with an omnidirectional fisheye camera, and is used for training source localization networks
- The direct-path signal is obtained by filtering the played speech signal with an estimated direct-path propagation filter, and is used for the training of speech enhancement networks.
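As a rough illustration of the last point, the sketch below applies a direct-path filter to a played signal by plain FIR convolution. It assumes the filter is already available as an impulse response; how that filter is estimated is described in the paper and is not reproduced here. The function name `apply_direct_path_filter` and the toy filter (a 3-sample delay with gain 0.5) are hypothetical and chosen only for illustration.

```python
def apply_direct_path_filter(played, dp_filter):
    """Convolve the played signal with an FIR direct-path filter.

    The output is truncated to the length of the played signal.
    """
    out = []
    for n in range(len(played)):
        acc = 0.0
        # y[n] = sum_k h[k] * x[n - k], skipping samples before the start
        for k in range(min(n + 1, len(dp_filter))):
            acc += dp_filter[k] * played[n - k]
        out.append(acc)
    return out


# Toy filter (an assumption for illustration): 3-sample delay, gain 0.5.
toy_filter = [0.0, 0.0, 0.0, 0.5]
played = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
direct_path = apply_direct_path_filter(played, toy_filter)
```

With real recordings an array library would normally replace the explicit loops; the pure-Python version is used here only to keep the sketch dependency-free.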
The RealMAN dataset is valuable in two aspects:
- Benchmarking speech enhancement and localization algorithms in real scenarios
- Offering a substantial amount of real-world training data that can potentially improve the performance of real-world applications
The details of the RealMAN dataset are described in the following paper: [arXiv]
The entire dataset can be downloaded from the Original data page or the AISHELL page. The dataset comprises the following components:
File | Size | Description |
---|---|---|
train.rar | 521.76 GB | The training set consisting of 36.6 hours of static speaker speech and 26.6 hours of moving speaker speech (`ma_speech`), 106.3 hours of noise recordings (`ma_noise`), 0-channel direct path speech (`dp_speech`) and sound source location (`train_*_source_location.csv`). |
val_raw.rar | 65.57 GB | The raw validation set consisting of 4.5 hours of static speaker speech and 3.3 hours of moving speaker speech (`ma_speech`), 16.0 hours of noise recordings (`ma_noise`), 0-channel direct path speech (`dp_speech`) and sound source location (`val_*_source_location.csv`). |
val.rar | 25.57 GB | The validation set consisting of mixed noisy speech recordings (`ma_noisy_speech`), 0-channel direct path speech (`dp_speech`) and sound source location (`val_*_source_location.csv`). |
test_raw.rar | 91.75 GB | The raw test set consisting of 6.9 hours of static speaker speech and 4.8 hours of moving speaker speech (`ma_speech`), 22.2 hours of noise recordings (`ma_noise`), 0-channel direct path speech (`dp_speech`) and sound source location (`test_*_source_location.csv`). |
test.rar | 38.02 GB | The test set consisting of mixed noisy speech recordings (`ma_noisy_speech`), 0-channel direct path speech (`dp_speech`) and sound source location (`test_*_source_location.csv`). |
dataset_info.rar | 127.9 MB | The dataset information file including scene photos, scene information (T60, recording duration, etc.), and speaker information. |
transcriptions.trn | 2.4 MB | The transcription file of speech for the dataset. |
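As a quick consistency check, the per-split durations listed in the table can be summed and compared with the headline totals (48 hours static, 35 hours moving, 144 hours noise). The snippet below is only an arithmetic sketch; the dictionary layout and variable names are illustrative and not part of the dataset.

```python
# Durations in hours, copied from the table above.
splits = {
    "train":    {"static": 36.6, "moving": 26.6, "noise": 106.3},
    "val_raw":  {"static": 4.5,  "moving": 3.3,  "noise": 16.0},
    "test_raw": {"static": 6.9,  "moving": 4.8,  "noise": 22.2},
}

# Sum each category across the three raw splits, rounded to 0.1 h.
totals = {
    kind: round(sum(split[kind] for split in splits.values()), 1)
    for kind in ("static", "moving", "noise")
}
speech_total = round(totals["static"] + totals["moving"], 1)

print(totals)        # static 48.0, moving 34.7, noise 144.5
print(speech_total)  # 82.7
```

The sums agree with the rounded headline figures (83 hours of speech and 144 hours of noise) to within about half an hour.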
The dataset is organized into the following directory structure:
RealMAN
├── transcriptions.trn
├── dataset_info
│   ├── scene_images
│   ├── scene_info.json
│   └── speaker_info.csv
└── train|val|test|val_raw|test_raw
    ├── train_moving_source_location.csv
    ├── train_static_source_location.csv
    ├── dp_speech
    │   ├── BadmintonCourt2
    │   │   ├── moving
    │   │   │   ├── 0010
    │   │   │   │   ├── TRAIN_M_BAD2_0010_0003.flac
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   └── static
    │   └── ...
    ├── ma_speech|ma_noisy_speech
    │   ├── BadmintonCourt2
    │   │   ├── moving
    │   │   │   ├── 0010
    │   │   │   │   ├── TRAIN_M_BAD2_0010_0003_CH0.flac
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   └── static
    │   └── ...
    └── ma_noise
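Given this layout, the paths of one utterance's direct-path file and its microphone-channel files can be assembled as in the sketch below. The helper `utterance_paths` and its parameters are hypothetical. Two assumptions are made: the scene folder name (`BadmintonCourt2`) and the scene code in the file name (`BAD2`) are passed separately because their mapping is not stated here, and channels are assumed to be numbered `CH0` to `CH31` for the 32-channel array. Whether files under `ma_noisy_speech` follow the same pattern is also an assumption.

```python
from pathlib import Path


def utterance_paths(root, split, scene_dir, scene_code, motion,
                    speaker_id, utterance_id,
                    mic_dir="ma_speech", channels=range(32)):
    """Return (direct_path_file, [channel_files]) for one utterance.

    split:  "train", "val" or "test"
    motion: "moving" or "static" (maps to the M/S tag in file names)
    """
    tag = "M" if motion == "moving" else "S"
    stem = f"{split.upper()}_{tag}_{scene_code}_{speaker_id}_{utterance_id}"
    base = Path(root) / split
    sub = Path(scene_dir) / motion / speaker_id

    dp_file = base / "dp_speech" / sub / f"{stem}.flac"
    # Channel numbering 0..31 is an assumption (only CH0 is shown above).
    mic_files = [base / mic_dir / sub / f"{stem}_CH{c}.flac" for c in channels]
    return dp_file, mic_files


dp_file, mic_files = utterance_paths(
    "RealMAN", "train", "BadmintonCourt2", "BAD2", "moving", "0010", "0003"
)
```

For the `val` and `test` splits, `mic_dir="ma_noisy_speech"` would be passed instead of the default.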
The naming convention is as follows:
# Recorded Signal
[TRAIN|VAL|TEST]_[M|S]_scene_speakerId_utteranceId_channelId.flac
# Direct-Path Signal
[TRAIN|VAL|TEST]_[M|S]_scene_speakerId_utteranceId.flac
# Source Location
[train|val|test]_[moving|static]_source_location.csv
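The recorded-signal and direct-path file names can be split into their fields with a regular expression, as sketched below. The pattern is inferred from the convention and the example names in the directory tree; the character classes (alphanumeric scene code, numeric speaker and utterance IDs, `CH<number>` channel suffix) are assumptions, and the names `RecordingName` and `parse_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the naming convention; character classes are assumptions.
_NAME_RE = re.compile(
    r"^(?P<split>TRAIN|VAL|TEST)_(?P<motion>[MS])_(?P<scene>[A-Za-z0-9]+)"
    r"_(?P<speaker>\d+)_(?P<utterance>\d+)(?:_CH(?P<channel>\d+))?\.flac$"
)


class RecordingName(NamedTuple):
    split: str
    moving: bool             # True for M (moving), False for S (static)
    scene: str
    speaker_id: str
    utterance_id: str
    channel: Optional[int]   # None for direct-path files


def parse_name(filename: str) -> RecordingName:
    """Parse a recorded-signal or direct-path file name into its fields."""
    m = _NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    channel = m.group("channel")
    return RecordingName(
        split=m.group("split"),
        moving=m.group("motion") == "M",
        scene=m.group("scene"),
        speaker_id=m.group("speaker"),
        utterance_id=m.group("utterance"),
        channel=int(channel) if channel is not None else None,
    )


recorded = parse_name("TRAIN_M_BAD2_0010_0003_CH0.flac")
direct = parse_name("TRAIN_M_BAD2_0010_0003.flac")
```

IDs are kept as strings so that leading zeros such as `0010` survive and still match the folder names. Noise recordings are not covered, since their naming is not described above.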
The dataset is licensed under the Creative Commons Attribution 4.0 International (CC-BY-4.0) license.