audio-dataset-tracking

Emotion Dataset Tracking

Dataset Remarks
LibriSpeech "A dataset for speaker count estimation." Contains speaker ID, activity period, and sex. Could be useful for extending VoxCeleb.
Morgan Emotional Speech Set Category ratings and emotional dimension ratings of activation and pleasantness are available from the researcher upon request.
Clotho AQA An audio question-answering dataset in which every answer is a single word. Of the six questions, four are designed to be answered with 'yes' or 'no', while the remaining two are designed to be answered with another single word.
ASVP-ESD Dataset sourced from websites, YouTube, and movies. Contains emotions, age range, language, emotional intensity (normal, high), and sex.
CREMA-D 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Contains age, sex, race, and ethnicity.
CaFE The Canadian French Emotional (CaFE) speech dataset contains six different sentences, pronounced by six male and six female actors, in six basic emotions plus one neutral emotion. The six basic emotions are acted in two different intensities: mild ("Faible") and strong ("Fort").
eNTERFACE'05 Dataset with audio and video data. Emotions: happiness, sadness, surprise, anger, disgust and fear.
BIRAFFE Dataset containing electrocardiogram (ECG) signals, galvanic skin response (GSR) signals, and changes in facial expression recorded after a video stimulus. Also contains participants' self-assessments of their emotional states, valence and arousal levels, and "Big Five" personality traits.
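
As a rough sanity check on the CaFE entry above, the number of recordings implied by its description can be worked out directly. This is only a sketch: it assumes a full factorial design (every actor records every sentence in every emotion and intensity, with the neutral emotion recorded once per sentence), and the variable names are illustrative. The result is an estimate from the remarks, not a figure from the dataset's documentation.

```python
# Counts taken from the CaFE remarks above.
sentences = 6
actors = 6 + 6          # six male and six female actors
basic_emotions = 6
intensities = 2         # mild ("Faible") and strong ("Fort")
neutral = 1             # assumption: neutral has no intensity variants

# Assumption: full factorial design with no missing recordings.
conditions_per_sentence = basic_emotions * intensities + neutral
expected_utterances = sentences * actors * conditions_per_sentence

print(expected_utterances)
```

If the downloaded file count differs from this estimate, the assumption about neutral recordings or about missing takes is the first thing to check.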

Audio Event Dataset

Dataset Remarks
Vanuatu Language Dataset Phonetically transcribed audio recordings.
VidTIMIT "Video and corresponding audio recordings of 43 people, reciting short sentences. Useful for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification."
Audio Caption - Hospital and Car "This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper Audio Caption: Listen and Tell published at ICASSP2019. Car scene, detailed in Audio Caption in a Car Setting with a Sentence-Level Loss published at ISCSLP 2021. Original captions in Mandarin Chinese, with English translations provided."
Artificial sound mixes with event insertion "The mixes were created using background and event audio recordings from Tampere University's Detection and Classification of Acoustic Scenes and Events (DCASE) Community." Contains event ID and activity start and end times.

Voice-to-Face

Dataset Remarks
RAVDESS Facial Landmark Tracking "This data set contains accurate estimates of actors' 3D head poses. To produce these, camera parameters at the time of recording were required (distance from camera to actor, and camera field of view). These values were used with OpenCV's camera calibration procedure, described here, to produce estimates of the camera's focal length and optical center at the time of actor recordings. The four values produced by the calibration procedure (fx,fy,cx,cy) were input to OpenFace as command line arguments during facial tracking, described here, to produce accurate estimates of 3D head pose."
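
The four intrinsics named in the RAVDESS remarks (fx, fy, cx, cy) can be related to field of view and image size with an ideal pinhole model. The sketch below is an approximation for illustration only: the dataset description says the values came from OpenCV's camera calibration procedure, not from this closed-form formula. The function name, the resolution, and the field of view are hypothetical placeholders, and the model assumes square pixels and a principal point at the image centre.

```python
import math


def pinhole_intrinsics(width_px, height_px, horizontal_fov_deg):
    """Approximate (fx, fy, cx, cy) for an ideal pinhole camera.

    Assumes square pixels and a principal point at the image centre.
    """
    half_fov = math.radians(horizontal_fov_deg) / 2.0
    fx = (width_px / 2.0) / math.tan(half_fov)  # focal length in pixels
    fy = fx                                     # square pixels (assumption)
    cx = width_px / 2.0                         # optical centre, x
    cy = height_px / 2.0                        # optical centre, y
    return fx, fy, cx, cy


# Placeholder values, not taken from the dataset.
fx, fy, cx, cy = pinhole_intrinsics(1280, 720, 90.0)
print(fx, fy, cx, cy)
```

A real calibration will generally give slightly different values, since lenses rarely have their optical centre exactly in the middle of the frame.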