
Multimodal-Datasets

This repo collects multimodal datasets and processes them in a consistent manner. Please let me know if you have interesting datasets you would like to see processed.

Supported Datasets

  1. MELD

    This dataset is built from the TV series Friends. It has visual, audio, and text modalities.

  2. IEMOCAP

    This dataset consists of dyadic (two-person) conversations, recorded with 10 actors in total. It has visual, audio, and text modalities.

  3. CarLani

    This dataset is a conversation between the robot Leolani and the human Carl.

Dataset Structure

Every dataset has its own directory, laid out as below. If a dataset does not have all three modalities (i.e. visual, audio, and text), then the corresponding directories are simply empty.

DATASET
├── raw-videos
│   ├── train / val / test
├── raw-audios
│   ├── train / val / test
├── raw-texts
│   ├── train / val / test
├── face-videos
│   ├── train / val / test
├── face-features
│   ├── train / val / test
├── visual-features
│   ├── train / val / test
├── audio-features
│   ├── train / val / test
├── text-features
│   ├── train / val / test
├── foo.json
├── bar.json
└── README.txt

Every data sample belongs to one of the train, val, or test splits. Beware that the sample counts are not always the same across modalities. For example, a video might exist, but if its transcription is not available or its audio extraction fails, that sample won't have the text or audio modality. (A quick way to inspect what's actually present is sketched after the list below.)

  1. DATASET is the name of the dataset.
  2. raw-videos contains the raw, unprocessed videos.
  3. raw-audios contains the raw, unprocessed audios.
  4. raw-texts contains the raw, unprocessed texts.
  5. face-videos contains face videos rendered from the detected facial features.
  6. face-features contains the detected faces and their features, extracted with external face repos (e.g. https://github.com/tae898/face). The features are age, bounding box, face detection probability, gender, five landmarks, and a 512-dimensional ArcFace embedding vector.
  7. visual-features contains other visual features (e.g. COCO 80-class object detections) that are not facial features.
  8. audio-features contains audio features (e.g. spectrograms, encoded embeddings, etc.).
  9. text-features contains text features (e.g. word embeddings, BERT-like model features, etc.).
  10. *.json files are important dataset-specific text files (e.g. labels).
  11. README.txt briefly explains the dataset and gives you the metadata.
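
To make the layout concrete, here is a minimal sketch (not part of the repo) that reports how many files each modality and split actually contains. The directory and split names come from the tree above; the summarize helper itself is hypothetical:

    from pathlib import Path

    # Directory names taken from the tree above; everything else is an assumption.
    MODALITY_DIRS = [
        "raw-videos", "raw-audios", "raw-texts",
        "face-videos", "face-features",
        "visual-features", "audio-features", "text-features",
    ]
    SPLITS = ["train", "val", "test"]

    def summarize(dataset_dir):
        """Print the number of files per modality and split.

        An empty or missing directory simply means the dataset lacks
        that modality (or that split).
        """
        root = Path(dataset_dir)
        for modality in MODALITY_DIRS:
            for split in SPLITS:
                split_dir = root / modality / split
                num_files = len(list(split_dir.glob("*"))) if split_dir.is_dir() else 0
                print(f"{modality}/{split}: {num_files} files")

    summarize("MELD")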

How to process each dataset

There are several stages: dataset extraction, feature extraction, and (optionally) EMISSOR annotation.

Use Python >= 3.8.
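
A quick way to check which interpreter you are on before installing anything (a trivial sketch):

    import sys

    # The scripts in this repo expect Python 3.8 or newer.
    assert sys.version_info >= (3, 8), "Use Python >= 3.8"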

Dataset extraction

This is about extracting the original datasets into the above-mentioned structure (i.e. raw-videos, raw-audios, and raw-texts).

  1. Since I don't have a license to redistribute all of the datasets, you should contact the dataset authors and download them yourself.

  2. Install the python requirements by

    pip install -r requirements-extract-dataset.txt

    I highly recommend running it in a virtual environment.

  3. Move each archive into the corresponding directory.

    • Put MELD.Raw.tar.gz in MELD/
    • Put IEMOCAP_full_release.tar.gz in IEMOCAP/
    • Put CarLani.zip in CarLani/
  4. In this current directory, where README.md is located, run

    python extract-dataset.py --dataset DATASET

    Replace DATASET with your desired dataset (e.g. python extract-dataset.py --dataset MELD)
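
If you want to extract all three supported datasets in one go, a small driver like the sketch below works. This is just a convenience wrapper around the command above; it assumes the archives are already in place (step 3):

    import subprocess

    # Call extract-dataset.py once per supported dataset.
    for dataset in ["MELD", "IEMOCAP", "CarLani"]:
        subprocess.run(
            ["python", "extract-dataset.py", "--dataset", dataset],
            check=True,  # stop early if one extraction fails
        )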

In the next sections, we will go through extracting features and annotating with EMISSOR. Go ahead if you want to do it yourself; otherwise you can just download them from the links below.

  1. MELD

    In the current repo root directory, unzip what you downloaded into the directory ./MELD/

  2. IEMOCAP

    In the current repo root directory, unzip what you downloaded into the directory ./IEMOCAP/

  3. CarLani

    In the current repo root directory, unzip what you downloaded into the directory ./CarLani/

Feature Extraction

This is about extracting the features (e.g. face-features).

  • For facial features, you will pull three Docker images.
  • For visual features, you currently have to build the extractors yourself. I'll soon make them into Docker containers.
  • For audio features, you currently have to build the extractors yourself. I'll soon make them into Docker containers.
  • For text features, you currently have to build the extractors yourself. I'll soon make them into Docker containers.
  1. Install the python requirements by

    pip install -r requirements-extract-features.txt

    I highly recommend running it in a virtual environment.

  2. In this current directory, where README.md is located, run

    python extract-features.py --dataset DATASET --face-features --face-videos --visual-features --audio-features --text-features --run-on-gpu --num-jobs NUM_JOBS

    Replace DATASET with your desired dataset. Only add the boolean flags (i.e. --face-features, --face-videos, --visual-features, --audio-features, --text-features) for the features you want to extract. For example, if you only want to extract face features and audio features from the MELD dataset, the command is python extract-features.py --dataset MELD --face-features --audio-features. To run in parallel, add the GPU flag --run-on-gpu and, optionally, more workers with --num-jobs NUM_JOBS. Running on GPU requires an NVIDIA GPU, and you have to build the GPU images for this; read https://github.com/tae898/face for more information.
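
After the script finishes, it can be worth sanity-checking that the requested feature directories were actually populated. Here's a minimal sketch that assumes only the directory layout described above (the check_features helper is hypothetical, not part of the repo):

    from pathlib import Path

    def check_features(dataset, feature_dirs):
        """Warn about feature directories that are empty or missing."""
        for feature_dir in feature_dirs:
            for split in ["train", "val", "test"]:
                split_dir = Path(dataset) / feature_dir / split
                num_files = len(list(split_dir.glob("*"))) if split_dir.is_dir() else 0
                if num_files == 0:
                    print(f"WARNING: {split_dir} is empty or missing")

    # e.g. after running with --face-features --audio-features on MELD:
    check_features("MELD", ["face-features", "audio-features"])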

Annotate the dataset in the EMISSOR format (optional)

This is optional. The processed datasets can also be annotated in the EMISSOR annotation format. With the EMISSOR annotation tool, users can visualize the data or even annotate it themselves.

  1. Install the python requirements by

    pip install -r requirements-annotate-emissor.txt

    I highly recommend running it in a virtual environment.

  2. In this current directory, where README.md is located, run

    python annotate-emissor.py --dataset DATASET --num-jobs NUM_JOBS

    The script can only annotate features that have already been extracted (see Feature Extraction above).
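
Because of that, you can gate the call on a quick check that the features exist. A minimal sketch, assuming the directory layout above; which features you require (and the number of jobs) is up to you:

    import subprocess
    from pathlib import Path

    dataset = "MELD"
    required = ["face-features", "audio-features", "text-features"]  # adjust as needed

    # Only run the annotation if every required feature directory is non-empty.
    if all(any((Path(dataset) / d).rglob("*")) for d in required):
        subprocess.run(
            ["python", "annotate-emissor.py", "--dataset", dataset, "--num-jobs", "4"],
            check=True,
        )
    else:
        print("Extract the missing features first (see Feature Extraction above).")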

You can also download them from the links below.

  1. MELD

    In the current repo root directory, unzip what you downloaded into the directory ./MELD/

  2. IEMOCAP

    In the current repo root directory, unzip what you downloaded into the directory ./IEMOCAP/

  3. CarLani

    In the current repo root directory, unzip what you downloaded into the directory ./CarLani/

Troubleshooting

The best way to find and solve your problem is to look through the GitHub issues tab. If you can't find what you're looking for, feel free to open an issue. We are pretty responsive.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Run make style && make quality in the root repo directory, to ensure code quality
  4. Commit your Changes (git commit -m 'Add some AmazingFeature')
  5. Push to the Branch (git push origin feature/AmazingFeature)
  6. Open a Pull Request

Contact

If you have any questions, or have interesting datasets, then please let me know.