A curated list of papers and datasets for various audio-visual tasks, inspired by awesome-computer-vision.
- Audio-Visual Localization
- Audio-Visual Separation
- Audio-Visual Representation/Classification
- Audio-Visual Action Recognition
- Audio-Visual Spatial/Depth
- Audio-Visual Navigation/RL
- Audio-Visual Faces/Speech
- Cross-modal Generation (Audio-Video / Video-Audio)
- Multi-modal Architectures
- Uncategorized Papers
- Datasets
Audio-Visual Localization
- Localizing Visual Sounds the Hard Way - Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., & Zisserman, A. (CVPR 2021) [code] [project page]
- Positive Sample Propagation along the Audio-Visual Event Line - Zhou, J., Zheng, L., Zhong, Y., Hao, S., & Wang, M. (CVPR 2021) [code]
- Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing - Wu Y., Yang Y. (CVPR 2021) [code]
- Audio-Visual Localization by Synthetic Acoustic Image Generation - Sanguineti V., Morerio P., Del Bue A., Murino V. (AAAI 2021)
- Binaural Audio-Visual Localization - Wu, X., Wu, Z., Ju L., Wang S. (AAAI 2021) [dataset]
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching - Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., Dou, D. (NeurIPS 2020) [code] [dataset] [demo]
- Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision - Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., & Yang, Z. (ECCV 2020) [project page/dataset]
- Do We Need Sound for Sound Source Localization? - Oya, T., Iwase, S., Natsume, R., Itazuri, T., Yamaguchi, S., & Morishima, S. (arXiv 2020)
- Multiple Sound Sources Localization from Coarse to Fine - Qian, R., Hu, D., Dinkel, H., Wu, M., Xu, N., & Lin, W. (ECCV 2020) [code]
- Learning Differentiable Sparse and Low Rank Networks for Audio-Visual Object Localization - Pu, J., Panagakis, Y., & Pantic, M. (ICASSP 2020)
- What Makes the Sound?: A Dual-Modality Interacting Network for Audio-Visual Event Localization - Ramaswamy, J. (ICASSP 2020)
- Self-supervised learning for audio-visual speaker diarization - Ding, Y., Xu, Y., Zhang, S. X., Cong, Y., & Wang, L. (ICASSP 2020)
- See the Sound, Hear the Pixels - Ramaswamy, J., & Das, S. (WACV 2020)
- Dual Attention Matching for Audio-Visual Event Localization - Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (ICCV 2019)
- Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events - Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (arXiv 2018, CVPRW 2018)
- Learning to Localize Sound Source in Visual Scenes - Senocak, A., Oh, T. H., Kim, J., Yang, M. H., & Kweon, I. S. (CVPR 2018)
- Objects that Sound - Arandjelovic, R., & Zisserman, A. (ECCV 2018)
- Audio-Visual Event Localization in Unconstrained Videos - Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (ECCV 2018) [project page] [code]
- Audio-visual object localization and separation using low-rank and sparsity - Pu, J., Panagakis, Y., Petridis, S., & Pantic, M. (ICASSP 2017)
Audio-Visual Separation
- VisualVoice: Audio-Visual Speech Separation With Cross-Modal Consistency - Gao, R., & Grauman, K. (CVPR 2021) [code] [project page]
- Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation - Tian, Y., Hu, D., & Xu, C. (CVPR 2021) [code]
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation - Lee, J., Chung, S. W., Kim, S., Kang, H. G., & Sohn, K. (CVPR 2021) [project page]
- Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds - Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., & Hershey, J. R. (ICLR 2021) [project page]
- Sep-stereo: Visually guided stereophonic audio generation by associating source separation - Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (ECCV 2020) [project page] [code]
- Visually Guided Sound Source Separation using Cascaded Opponent Filter Network - Zhu, L., & Rahtu, E. (arXiv 2020) [project page]
- Music Gesture for Visual Sound Separation - Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (CVPR 2020) [project page] [code]
- Recursive Visual Sound Separation Using Minus-Plus Net - Xudong Xu, Bo Dai, Dahua Lin (ICCV 2019)
- Co-Separating Sounds of Visual Objects - Gao, R. & Grauman, K. (ICCV 2019) [project page]
- The Sound of Motions - Zhao, H., Gan, C., Ma, W., & Torralba, A. (ICCV 2019)
- Learning to Separate Object Sounds by Watching Unlabeled Video - Gao, R., Feris, R., & Grauman, K. (ECCV 2018 (Oral)) [project page] [code] [dataset]
- The Sound of Pixels - Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (ECCV 2018) [project page] [code] [dataset]
Audio-Visual Representation/Classification
- Spoken moments: Learning joint audio-visual representations from video descriptions - Monfort, M., Jin, S., Liu, A., Harwath, D., Feris, R., Glass, J., & Oliva, A. (CVPR 2021) [project page/dataset]
- Robust Audio-Visual Instance Discrimination - Morgado, P., Misra, I., & Vasconcelos, N. (CVPR 2021)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning - Chen, Y., Xian, Y., Koepke, A., Shan, Y., & Akata, Z. (CVPR 2021) [code]
- Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning - Zhang, J., Xu, X., Shen, F., Lu, H., Liu, X., & Shen, H. T. (AAAI 2021)
- Active Contrastive Learning of Audio-Visual Video Representations - Ma, S., Zeng, Z., McDuff, D., & Song, Y. (ICLR 2021) [code]
- Labelling unlabelled videos from scratch with multi-modal self-supervision - Asano, Y., Patrick, M., Rupprecht, C., & Vedaldi, A. (NeurIPS 2020) [project page]
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning - Cheng, Y., Wang, R., Pan, Z., Feng, R., & Zhang, Y. (ACM MM 2020)
- Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition - Di Hu, X. L., Mou, L., Jin, P., Chen, D., Jing, L., Zhu, X., & Dou, D. (ECCV 2020) [code]
- Leveraging Acoustic Images for Effective Self-Supervised Audio Representation Learning - Sanguineti, V., Morerio, P., Pozzetti, N., Greco, D., Cristani, M., & Murino, V. (ECCV 2020) [code]
- Self-Supervised Learning of Audio-Visual Objects from Video - Afouras, T., Owens, A., Chung, J. S., & Zisserman, A. (ECCV 2020) [project page]
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing - Tian, Y., Li, D., & Xu, C. (ECCV 2020)
- Audio-Visual Instance Discrimination with Cross-Modal Agreement - Morgado, P., Vasconcelos, N., & Misra, I. (CVPR 2021)
- VGGSound: A Large-Scale Audio-Visual Dataset - Chen, H., Xie, W., Vedaldi, A., & Zisserman, A. (ICASSP 2020) [project page/dataset] [code]
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data - Fayek, H. M., & Kumar, A. (IJCAI 2020)
- Multi-modal Self-Supervision from Generalized Data Transformations - Patrick, M., Asano, Y. M., Fong, R., Henriques, J. F., Zweig, G., & Vedaldi, A. (arXiv 2020)
- Curriculum Audiovisual Learning - Hu, D., Wang, Z., Xiong, H., Wang, D., Nie, F., & Dou, D. (arXiv 2020)
- Audio-visual model distillation using acoustic images - Perez, A., Sanguineti, V., Morerio, P., & Murino, V. (WACV 2020) [code] [dataset]
- Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos - Parida, K., Matiyali, N., Guha, T., & Sharma, G. (WACV 2020) [project page] [dataset]
- Self-Supervised Learning by Cross-Modal Audio-Video Clustering - Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. (NeurIPS 2020)
- Look, listen, and learn more: Design choices for deep audio embeddings - Cramer, J., Wu, H. H., Salamon, J., & Bello, J. P. (ICASSP 2019) [code] [L3-embedding]
- Self-supervised audio-visual co-segmentation - Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. (ICASSP 2019)
- Deep Multimodal Clustering for Unsupervised Audiovisual Learning - Hu, D., Nie, F., & Li, X. (CVPR 2019)
- Cooperative learning of audio and video models from self-supervised synchronization - Korbar, B., Tran, D., & Torresani, L. (NeurIPS 2018) [project page] [trained model 1] [trained model 2]
- Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description - Hori, C., Hori, T., Wichern, G., Wang, J., Lee, T. Y., Cherian, A., & Marks, T. K. (CVPRW 2018)
- Audio-Visual Scene Analysis with Self-Supervised Multisensory Features - Owens, A., & Efros, A. A. (ECCV 2018 (Oral)) [project page] [code]
- Look, listen and learn - Arandjelovic, R., & Zisserman, A. (ICCV 2017) [Keras-code]
- Ambient Sound Provides Supervision for Visual Learning - Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (ECCV 2016 (Oral)) [journal version] [project page]
- Soundnet: Learning sound representations from unlabeled video - Aytar, Y., Vondrick, C., & Torralba, A. (NIPS 2016) [project page] [code]
- See, hear, and read: Deep aligned representations - Aytar, Y., Vondrick, C., & Torralba, A. (arXiv 2017) [project page]
- Cross-Modal Embeddings for Video and Audio Retrieval - Surís, D., Duarte, A., Salvador, A., Torres, J., & Giró-i-Nieto, X. (ECCVW 2018)
Audio-Visual Action Recognition
- Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization - Lee, J., Jain, M., Park, H., & Yun, S. (ICLR 2021)
- Speech2Action: Cross-modal Supervision for Action Recognition - Nagrani, A., Sun, C., Ross, D., Sukthankar, R., Schmid, C., & Zisserman, A. (CVPR 2020) [project page] [dataset]
- Listen to Look: Action Recognition by Previewing Audio - Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani (CVPR 2020) [project page]
- EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition - Kazakos, E., Nagrani, A., Zisserman, A., & Damen, D. (ICCV 2019) [project page] [code]
- Uncertainty-aware Audiovisual Activity Recognition using Deep Bayesian Variational Inference - Subedar, M., Krishnan, R., Meyer, P. L., Tickoo, O., & Huang, J. (ICCV 2019)
- Seeing and Hearing Egocentric Actions: How Much Can We Learn? - Cartas, A., Luque, J., Radeva, P., Segura, C., & Dimiccoli, M. (ICCVW 2019)
- How Much Does Audio Matter to Recognize Egocentric Object Interactions? - Cartas, A., Luque, J., Radeva, P., Segura, C., & Dimiccoli, M. (EPIC CVPRW 2019)
Audio-Visual Spatial/Depth
- Visually Informed Binaural Audio Generation without Binaural Audios - Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., & Lin, D. (CVPR 2021) [code]
- Beyond image to depth: Improving depth prediction using echoes - Parida, K. K., Srivastava, S., & Sharma, G. (CVPR 2021) [code] [project page]
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation - Lin, Yan-Bo and Wang, Yu-Chiang Frank, (AAAI 2021)
- Learning Representations from Audio-Visual Spatial Alignment - Morgado, P., Li, Y., & Vasconcelos, N. (NeurIPS 2020) [code]
- VisualEchoes: Spatial Image Representation Learning through Echolocation - Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (ECCV 2020)
- BatVision with GCC-PHAT Features for Better Sound to Vision Predictions - Christensen, J. H., Hornauer, S., & Yu, S. (CVPRW 2020)
- BatVision: Learning to See 3D Spatial Layout with Two Ears - Christensen, J. H., Hornauer, S., & Yu, S. (ICRA 2020) [dataset/code]
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds - Vasudevan, A. B., Dai, D., & Van Gool, L. (arXiv 2020) [project page]
- Audio-Visual SfM towards 4D reconstruction under dynamic scenes - Konno, A., Nishida K., Itoyama K., Nakadai K. (CVPRW 2020)
- Telling Left From Right: Learning Spatial Correspondence of Sight and Sound - Yang, K., Russell, B., & Salamon, J. (CVPR 2020) [project page / dataset]
- 2.5D Visual Sound - Gao, R., & Grauman, K. (CVPR 2019) [project page] [dataset] [code]
- Self-supervised generation of spatial audio for 360° video - Morgado, P., Vasconcelos, N., Langlois, T., & Wang, O. (NeurIPS 2018) [project page] [code/dataset]
- Self-supervised audio spatialization with correspondence classifier - Lu, Y. D., Lee, H. Y., Tseng, H. Y., & Yang, M. H. (ICIP 2019)
Audio-Visual Navigation/RL
- Semantic Audio-Visual Navigation - Chen, C., Al-Halah, Z., & Grauman, K. (CVPR 2021) [project page] [code]
- Learning to set waypoints for audio-visual navigation - Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (ICLR 2021) [project page] [code]
- See, hear, explore: Curiosity via audio-visual association - Dean, V., Tulsiani, S., & Gupta, A. (arXiv 2020) [project page] [code]
- Audio-Visual Embodied Navigation - Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson P., Grauman, K. (ECCV 2020) [project page]
- Look, listen, and act: Towards audio-visual embodied navigation - Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (ICRA 2020) [project page/dataset]
Audio-Visual Faces/Speech
- Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association - Wen, P., Xu, Q., Jiang, Y., Yang, Z., He, Y., & Huang, Q. (CVPR 2021) [code]
- Audio-Driven Emotional Video Portraits - Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C. C., Cao, X., & Xu, F. (CVPR 2021) [project page] [code]
- Pose-controllable talking face generation by implicitly modularized audio-visual representation - Zhou, H., Sun, Y., Wu, W., Loy, C. C., Wang, X., & Liu, Z. (CVPR 2021) [project page] [code]
- One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing - Wang, T. C., Mallya, A., & Liu, M. Y. (CVPR 2021) [project page]
- Unsupervised audiovisual synthesis via exemplar autoencoders - Deng, K., Bansal, A., & Ramanan, D. (ICLR 2021) [project page]
- Mead: A large-scale audio-visual dataset for emotional talking-face generation - Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao Y., Loy, C. C. (ECCV 2020) [project page/dataset]
- Discriminative Multi-modality Speech Recognition - Xu, B., Lu, C., Guo, Y., & Wang, J. (CVPR 2020)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis - Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (CVPR 2020) [project page/dataset] [code]
- DAVD-Net: Deep Audio-Aided Video Decompression of Talking Heads - Zhang, X., Wu, X., Zhai, X., Ben, X., & Tu, C. (CVPR 2020)
- Learning to Have an Ear for Face Super-Resolution - Meishvili, G., Jenni, S., & Favaro, P. (CVPR 2020) [project page] [code]
- ASR is all you need: Cross-modal distillation for lip reading - Afouras, T., Chung, J. S., & Zisserman, A. (ICASSP 2020)
- Visually guided self supervised learning of speech representations - Shukla, A., Vougioukas, K., Ma, P., Petridis, S., & Pantic, M. (ICASSP 2020)
- Disentangled Speech Embeddings using Cross-modal Self-supervision - Nagrani, A., Chung, J. S., Albanie, S., & Zisserman, A. (ICASSP 2020)
- Animating Face using Disentangled Audio Representations - Mittal, G., & Wang, B. (WACV 2020)
- Deep Audio-Visual Speech Recognition - Afouras, T., Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (TPAMI 2019)
- Reconstructing faces from voices - Wen, Y., Singh, R., & Raj, B. (NeurIPS 2019) [project page]
- Learning Individual Styles of Conversational Gesture - Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., & Malik, J. (CVPR 2019) [project page] [dataset]
- Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss - Chen, L., Maddox, R. K., Duan, Z., & Xu, C. (CVPR 2019) [project page]
- Speech2Face: Learning the Face Behind a Voice - Oh, T. H., Dekel, T., Kim, C., Mosseri, I., Freeman, W. T., Rubinstein, M., & Matusik, W. (CVPR 2019) [project page]
- My lips are concealed: Audio-visual speech enhancement through obstructions - Afouras, T., Chung, J. S., & Zisserman, A. (INTERSPEECH 2019) [project page]
- Talking Face Generation by Adversarially Disentangled Audio-Visual Representation - Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang (AAAI 2019) [project page] [code]
- Disjoint mapping network for cross-modal matching of voices and faces - Wen, Y., Ismail, M. A., Liu, W., Raj, B., & Singh, R. (ICLR 2019) [project page]
- X2Face: A network for controlling face generation using images, audio, and pose codes - Wiles, O., Sophia Koepke, A., & Zisserman, A. (ECCV 2018) [project page] [code]
- Learnable PINs: Cross-Modal Embeddings for Person Identity - Nagrani, A., Albanie, S., & Zisserman, A. (ECCV 2018) [project page]
- Seeing voices and hearing faces: Cross-modal biometric matching - Nagrani, A., Albanie, S., & Zisserman, A. (CVPR 2018) [project page] [code] (trained model only)
- Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation - Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T. and Rubinstein, M., (SIGGRAPH 2018) [project page]
- The Conversation: Deep Audio-Visual Speech Enhancement - Afouras, T., Chung, J. S., & Zisserman, A. (INTERSPEECH 2018) [project page]
- VoxCeleb2: Deep Speaker Recognition - Nagrani, A., Chung, J. S., & Zisserman, A. (INTERSPEECH 2018) [dataset]
- You said that? - Chung, J. S., Jamaludin, A., & Zisserman, A. (BMVC 2017) [project page] [code] (trained model, evaluation code)
- VoxCeleb: a large-scale speaker identification dataset - Nagrani, A., Chung, J. S., & Zisserman, A. (INTERSPEECH 2017) [project page] [code] [dataset]
- Out of time: automated lip sync in the wild - J.S. Chung & A. Zisserman (ACCVW 2016)
Cross-modal Generation (Audio-Video / Video-Audio)
- Sound2Sight: Generating Visual Dynamics from Sound and Context - Cherian, A., Chatterjee, M., & Ahuja, N. (ECCV 2020)
- Generating Visually Aligned Sound from Videos - Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (IEEE Transactions on Image Processing 2020)
- Audeo: Audio Generation for a Silent Performance Video - Su, K., Liu, X., & Shlizerman, E. (NeurIPS 2020)
- Foley Music: Learning to Generate Music from Videos - Gan, C., Huang, D., Chen, P., Tenenbaum, J. B., & Torralba, A. (ECCV 2020) [project page]
- Spectrogram Analysis Via Self-Attention for Realizing Cross-Modal Visual-Audio Generation - Tan, H., Wu, G., Zhao, P., & Chen, Y. (ICASSP 2020)
- Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck - Ma, S., McDuff, D., & Song, Y. (ICCV 2019) [code]
- Listen to the Image - Hu, D., Wang, D., Li, X., Nie, F., & Wang, Q. (CVPR 2019)
- Cascade attention guided residue learning GAN for cross-modal translation - Duan, B., Wang, W., Tang, H., Latapie, H., & Yan, Y. (arXiv 2019) [code]
- Visual to Sound: Generating Natural Sound for Videos in the Wild - Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (CVPR 2018) [project page]
- Image generation associated with music data - Qiu, Y., & Kataoka, H. (CVPRW 2018)
- CMCGAN: A uniform framework for cross-modal visual-audio mutual generation - Hao, W., Zhang, Z., & Guan, H. (AAAI 2018)
Multi-modal Architectures
- What Makes Training Multi-Modal Networks Hard? - Wang, W., Tran, D., & Feiszli, M. (arXiv 2019)
- MFAS: Multimodal Fusion Architecture Search - Pérez-Rúa, J. M., Vielzeuf, V., Pateux, S., Baccouche, M., & Jurie, F. (CVPR 2019)
Uncategorized Papers
- GLAVNet: Global-Local Audio-Visual Cues for Fine-Grained Material Recognition - (CVPR 2021)
- There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge - Valverde, F. R., Hurtado, J. V., & Valada, A. (CVPR 2021) [code] [project page/dataset]
- Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks? - Tian, Y., & Xu, C. (CVPR 2021) [code]
- Sight to sound: An end-to-end approach for visual piano transcription - Koepke, A. S., Wiles, O., Moses, Y., & Zisserman, A. (ICASSP 2020) [project page/dataset]
- Solos: A Dataset for Audio-Visual Music Analysis - Montesinos, J. F., Slizovskaia, O., & Haro, G. (arXiv 2020) [project page] [dataset]
- Cross-Task Transfer for Multimodal Aerial Scene Recognition - Hu, D., Li, X., Mou, L., Jin, P., Chen, D., Jing, L., ... & Dou, D. (arXiv 2020) [code] [dataset]
- STAViS: Spatio-Temporal AudioVisual Saliency Network - Tsiami, A., Koutras, P., & Maragos, P. (CVPR 2020) [code]
- AlignNet: A Unifying Approach to Audio-Visual Alignment - Wang, J., Fang, Z., & Zhao, H. (WACV 2020) [project page] [code]
- Self-supervised Moving Vehicle Tracking with Stereo Sound - Gan, C., Zhao, H., Chen, P., Cox, D., & Torralba, A. (ICCV 2019) [project page/dataset]
- Vision-Infused Deep Audio Inpainting - Zhou, H., Liu, Z., Xu, X., Luo, P., & Wang, X. (ICCV 2019) [project page] [code]
- ISNN: Impact Sound Neural Network for Audio-Visual Object Classification - Sterling, A., Wilson, J., Lowe, S., & Lin, M. C. (ECCV 2018) [project page] [dataset1][dataset2] [model]
- Audio to Body Dynamics - Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (CVPR 2018) [project page][code]
- A Multimodal Approach to Mapping Soundscapes - Salem, T., Zhai, M., Workman, S., & Jacobs, N. (CVPRW 2018) [project page]
- Shape and material from sound - Zhang, Z., Li, Q., Huang, Z., Wu, J., Tenenbaum, J., & Freeman, B. (NeurIPS 2017)
Datasets
- AudioSet - Audio-Visual Classification
- MUSIC - Audio-Visual Source Separation
- AudioSetZSL - Audio-Visual Zero-shot Learning
- Visually Engaged and Grounded AudioSet (VEGAS) - Sound generation from video
- SoundNet-Flickr - Image-Audio pair for cross-modal learning
- Audio-Visual Event (AVE) - Audio-Visual Event Localization
- AudioSet Single Source - Subset of AudioSet videos containing only a single sounding object
- Kinetics-Sounds - Subset of Kinetics dataset
- EPIC-Kitchens - Egocentric Audio-Visual Action Recognition
- Audio-Visually Indicated Actions Dataset - Multimodal dataset (RGB, acoustic data as raw audio) acquired using the acoustic-optical camera
- IMSDb dataset - Movie scripts downloaded from The Internet Script Movie Database
- YOUTUBE-ASMR-300K dataset - ASMR videos collected from YouTube that contain stereo audio
- FAIR-Play - 1,871 video clips and their corresponding binaural audio clips recorded in a music room
- VGG-Sound - audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
- XD-Violence - weakly annotated dataset for audio-visual violence detection
- AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) - Geotagged aerial images and sounds, classified into 13 scene classes
- auDIoviSual Crowd cOunting dataset (DISCO) - 1,935 images and audio clips from various typical scenes, with a total of 170,270 instances annotated with head locations
- MUSIC-Synthetic dataset - Category-balanced multi-source videos created by artificially synthesizing solo videos from the MUSIC dataset, to facilitate the learning and evaluation of multiple-sounding-source localization in the cocktail-party scenario
- VoxCeleb - Audio-Visual Speaker Identification, available in two versions (VoxCeleb1 and VoxCeleb2)
- EmoVoxCeleb
- Speech2Gesture - Gesture prediction from speech
- AVSpeech
- LRW Dataset
- LRS2, LRS3, LRS3 Language - Lip Reading Datasets
License
To the extent possible under law, Kranti Kumar Parida has waived all copyright and related or neighboring rights to this work.
Please feel free to send pull requests or email (kranti@cse.iitk.ac.in) to add links, correct errors, or report broken links.