INTERSPEECH2020

Mon-1-1 ASR Neural Network Architectures I

  • Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu. On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition. [INTERSPEECH 2020]

    • ASR RNN-T RNN-A Transformer
    • 65,000 Hours
    • Non-streaming: Transformer > RNN-A >> RNN-T
    • Streaming: Transformer > RNN-T (Custom LSTM, CE init) > RNN-A
  • Zhifu Gao, ShiLiang Zhang, Ming Lei, Ian McLoughlin. SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition. [INTERSPEECH 2020]

    • ASR AISHELL-1
    • Value + DFSMN: a DFSMN-style memory block takes the self-attention values as input, and its output is added to the attention output (sketch after this list)
  • Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf. Contextual RNN-T for Open Domain ASR. [INTERSPEECH 2020]

  • Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei, Tao Ma. ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition. [INTERSPEECH 2020]

  • Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo. Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity. [INTERSPEECH 2020]

  • Timo Lohrenz, Tim Fingscheidt. BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example. [INTERSPEECH 2020]

  • Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel. Relative Positional Encoding for Speech Recognition and Direct Translation. [INTERSPEECH 2020]

  • Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka. Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of any Number of Speakers. [INTERSPEECH 2020]

  • Takashi Fukuda, Samuel Thomas. Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework. [INTERSPEECH 2020]

  • Jinhwan Park, Wonyong Sung. Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition. [INTERSPEECH 2020]
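
A minimal sketch of the SAN-M note above, assuming the DFSMN memory block can be modeled as a depthwise 1-D convolution over time and applied to the layer input as a stand-in for the projected values; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class SANM(nn.Module):
    """Memory-equipped self-attention sketch: multi-head self-attention
    plus a DFSMN-style memory block (depthwise 1-D convolution over time)
    whose output is added to the attention output."""
    def __init__(self, d_model=256, n_heads=4, kernel_size=11):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # depthwise conv = per-channel FIR memory, the FSMN building block
        self.memory = nn.Conv1d(d_model, d_model, kernel_size,
                                padding=kernel_size // 2, groups=d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)
        mem_out = self.memory(x.transpose(1, 2)).transpose(1, 2)
        return attn_out + mem_out
```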

Mon-1-2 Multi-Channel Speech Enhancement

Mon-1-3 Speech Processing in the Brain

Mon-1-4 Speech Signal Representation

  • Wang Dai, Jinsong Zhang, Yingming Gao, Wei Wei, Dengfeng Ke, Binghuai Lin, Yanlu Xie. Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism. [INTERSPEECH 2020]

Mon-1-5 Speech Synthesis: Neural Waveform Generation I

Mon-SS-1-6 Automatic Speech Recognition for Non-Native Children's Speech

Mon-1-7 Speaker Diarization

Mon-1-8 Noise Robust and Distant Speech Recognition

Mon-1-9 Speech in Multimodality (MULTIMODAL)

Mon-1-10 Speech, Language, and Multimodal Resources

Mon-1-11 Language Recognition

Mon-S&T-1-1 Speech Processing and Analysis

Mon-2-1 Speech Emotion Recognition I (SER I)

Mon-2-2 ASR Neural Network Architectures and Training I

Mon-2-3 Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation

Mon-2-4 Phonetics and Phonology

Mon-2-5 Topics in ASR I

Mon-2-6 Large-Scale Evaluation of Short-Duration Speaker Verification

Mon-2-7 Voice Conversion and Adaptation I

Mon-2-8 Acoustic Event Detection

Mon-2-9 Spoken Language Understanding I

Mon-2-10 DNN Architectures for Speaker Recognition

  • Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang. AutoSpeech: Neural Architecture Search for Speaker Recognition. [INTERSPEECH 2020]

  • Ya-Qi Yu, Wu-Jun Li. Densely Connected Time Delay Neural Network for Speaker Verification. [INTERSPEECH 2020]

    • SI-SV VoxCeleb Dense Connection Multi-branch
    • Densely Connected TDNN (D-TDNN)
    • Statistics-and-Selection (attention-based weights; sketch after this list)
      • Q: Learned parameters
      • K: Global embedding (equal weights)
      • V: Hidden features
  • Siqi Zheng, Hongbin Suo, Yun Lei. Phonetically-Aware Coupled Network for Short Duration Text-Independent Speaker Verification. [INTERSPEECH 2020]

    • SI-SV Short Duration NIST SRE VoxCeleb Phonetic Information Triplet Loss
    • Phonetically-Aware Coupled Network (PacNet)
    • 'Triplet loss training scheme is more fitting than softmax loss system for normalizing phonetic contents.' (triplet-loss example after this list)
  • Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoi Rin Kim. Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention. [INTERSPEECH 2020]

    • SI-SV KWS Phonetic Information
    • Global Attention (attention-based weights; sketch after this list)
      • Q: Global embedding (LSTM-based weights)
      • KV: Hidden features
  • Yanfeng Wu, Chenkai Guo, Hongcan Gao, Xiaolei Hou, Jing Xu. Vector-Based Attentive Pooling for Text-Independent Speaker Verification. [INTERSPEECH 2020]

    • SI-SV VoxCeleb SITW
    • 'Most attentive pooling methods are not more effective than statistics pooling.'
  • Pooyan Safari, Miquel India, Javier Hernando. Self-Attention Encoding and Pooling for Speaker Recognition. [INTERSPEECH 2020]

    • SI-SV VoxCeleb Self-Attention Attentive Pooling
  • Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu. ARET: Aggregated Residual Extended Time-delay Neural Networks for Speaker Verification. [INTERSPEECH 2020]

    • SI-SV VoxCeleb Residual Connection Grouped Conv
  • Hanyi Zhang, Longbiao Wang, Yunchun Zhang, Meng Liu, Kong Aik Lee, Jianguo Wei. Adversarial Separation Network for Speaker Recognition. [INTERSPEECH 2020]

    • SI-SV VCTK Adversarial Attack
    • Reconstruct adversarial perturbations
  • Jingyu Li, Tan Lee. Text-Independent Speaker Verification with Dual Attention Network. [INTERSPEECH 2020]

    • SI-SV VoxCeleb Dual Attention
    • Global Attention (attention-based weights)
      • Q: Global embedding (equal weights)
      • K: Hidden features (deeper layers)
      • V: Hidden features
    • Mutual Attention (sketch after this list)
      • Q: Global embedding (attention-based weights, from another utterance)
      • K: Hidden features (deeper layers)
      • V: Hidden features
  • Xiaoyang Qu, Jianzong Wang, Jing Xiao. Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification. [INTERSPEECH 2020]
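
For the D-TDNN statistics-and-selection note above, a generic attentive-statistics-pooling sketch: a learned query vector weights the frame-level hidden features, and the weighted mean and standard deviation form the utterance embedding. This shows the attention-weighted pooling pattern only, not the paper's exact layer; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attentive statistics pooling with a learned query: frame scores
    come from q against projected hidden features; the weighted mean and
    standard deviation are concatenated into the utterance embedding."""
    def __init__(self, feat_dim=512, attn_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, attn_dim)
        self.q = nn.Parameter(torch.randn(attn_dim))    # learned query

    def forward(self, h):                      # h: (batch, time, feat_dim)
        scores = torch.tanh(self.proj(h)) @ self.q      # (batch, time)
        w = torch.softmax(scores, dim=1).unsqueeze(-1)  # attention weights
        mean = (w * h).sum(dim=1)
        var = (w * (h - mean.unsqueeze(1)) ** 2).sum(dim=1)
        return torch.cat([mean, var.clamp_min(1e-8).sqrt()], dim=-1)
```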
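
The PacNet note above contrasts triplet-loss training with softmax training; a minimal triplet-loss example on speaker embeddings (margin and dimensions are arbitrary):

```python
import torch
import torch.nn as nn

# Anchor/positive share a speaker, negative does not; the loss pulls
# same-speaker embeddings together and pushes different ones apart.
triplet = nn.TripletMarginLoss(margin=0.3)
anchor, positive, negative = (torch.randn(8, 256) for _ in range(3))
loss = triplet(anchor, positive, negative)
```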
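
A sketch of the global-query attention from the multi-task KWS/SV note above: the query is a single global utterance embedding (taken here as the final LSTM state, one plausible reading of 'LSTM-based weights'), while keys and values are the frame-level hidden features.

```python
import torch
import torch.nn as nn

class GlobalQueryAttention(nn.Module):
    """Global-query attention sketch: one utterance-level query vector
    attends over all frame-level hidden features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.scale = feat_dim ** -0.5

    def forward(self, h):                     # h: (batch, time, feat_dim)
        _, (q, _) = self.lstm(h)              # final state as global query
        q = q.squeeze(0).unsqueeze(1)         # (batch, 1, feat_dim)
        w = torch.softmax(q @ h.transpose(1, 2) * self.scale, dim=-1)
        return (w @ h).squeeze(1)             # (batch, feat_dim)
```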
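
And the mutual-attention branch of the dual-attention note above, sketched with keys and values collapsed into one tensor for brevity: a global embedding from one utterance queries the frame-level features of the other utterance in the trial pair.

```python
import torch

def mutual_attention(q_embed, h):
    """q_embed: (batch, dim) global embedding from utterance A;
    h: (batch, time, dim) frame-level features of utterance B.
    Returns a B-side embedding conditioned on A."""
    scale = h.size(-1) ** -0.5
    w = torch.softmax(q_embed.unsqueeze(1) @ h.transpose(1, 2) * scale, dim=-1)
    return (w @ h).squeeze(1)                 # (batch, dim)
```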

Mon-2-11 ASR Model Training and Strategies

Mon-3-1 Cross/Multi-Lingual and Code-Switched Speech Recognition

Mon-3-2 Anti-Spoofing and Liveness Detection

Mon-3-3 Noise Reduction and Intelligibility

Mon-3-4 Acoustic Scene Classification

Mon-3-5 Singing Voice Computing and Processing in Music

Mon-3-7 Acoustic Model Adaptation for ASR

Mon-3-8 Singing and Multimodal Synthesis

Mon-3-9 Intelligibility-Enhancing Speech Modification

Mon-3-10 Human Speech Production I

Mon-3-11 Targeted Source Separation

  • Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li. SpEx+: A Complete Time Domain Speaker Extraction Network. [INTERSPEECH 2020]

    • WSJ0-2mix
    • Replaces the frequency-domain speaker encoder with a time-domain one to alleviate the feature mismatch (encoder sketch after this list).
  • Tingle Li, Qingjian Lin, Yuanyuan Bao, Ming Li. Atss-Net: Target Speaker Separation via Attention-based Neural Network. [INTERSPEECH 2020]

  • Leyuan Qu, Cornelius Weber, Stefan Wermter. Multimodal Target Speech Separation with Voice and Face References. [INTERSPEECH 2020]

    • LRS3 Face Embedding
    • Incorporates a face embedding extracted from a single face profile into speech separation (conditioning sketch after this list).
  • Zining Zhang, Bingsheng He, Zhenjie Zhang. X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network. [INTERSPEECH 2020]

    • LibriSpeech Alternating Training Speaker Presence
    • Alternating training: ABC + ABC + ABC
    • Loss (SI-SNR sketch after this list)
      • + SI-SNR for the mixture signal of distorted speakers
      • + Penalty for the remaining signal under the absent-speaker condition
    • Metric
      • NSR (negative SI-SNRi rate) for the distorted speaker
      • NER (negative energy rate) for the absent speaker
  • Chenda Li, Yanmin Qian. Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation. [INTERSPEECH 2020]

    • LRS2 Audio-Visual Speech Recognition
    • Robust phonetic embedding conditioned on visual embedding.
  • Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu. A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments. [INTERSPEECH 2020]

  • Jianshu Zhao, Shengzhou Gao, Takahiro Shinozaki. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding. [INTERSPEECH 2020]

    • WSJ0-2mix WSJ0-3mix LibriSpeech
  • Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki. Listen to What You Want: Neural Network-Based Universal Sound Selector. [INTERSPEECH 2020]

  • Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, Noboru Harada. Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels. [INTERSPEECH 2020]

  • Jiahao Xu, Kun Hu, Chang Xu, Duc Chung Tran, Zhiyong Wang. Speaker-Aware Monaural Speech Separation. [INTERSPEECH 2020]
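
For the SpEx+ note above, a sketch of a learned time-domain front-end; sharing one such encoder between the mixture branch and the reference-utterance branch keeps the speaker embedding and the separation features in the same space. Filter count, window, and stride are illustrative, not the paper's values.

```python
import torch
import torch.nn as nn

class TimeDomainEncoder(nn.Module):
    """Learned 1-D conv front-end over raw waveform; apply the same
    instance to both the mixture and the reference utterance."""
    def __init__(self, n_filters=256, win=20, stride=10):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel_size=win, stride=stride)

    def forward(self, wav):                   # wav: (batch, samples)
        return torch.relu(self.conv(wav.unsqueeze(1)))  # (batch, filters, frames)
```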
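
For the voice-and-face reference note above, a guess at the usual conditioning pattern rather than the paper's exact architecture: tile the fixed face embedding along the time axis and concatenate it with the audio features before separation.

```python
import torch

def condition_on_face(audio_feats, face_embed):
    """audio_feats: (batch, time, d_a); face_embed: (batch, d_f).
    Returns (batch, time, d_a + d_f) for the separation network."""
    tiled = face_embed.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
    return torch.cat([audio_feats, tiled], dim=-1)
```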
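
For the X-TaSNet loss and metric notes above, the standard scale-invariant SNR used across time-domain extraction work; training minimizes its negative. Since SI-SNR is undefined for an all-zero reference, an energy-based measure (the NER above) is the natural choice for the absent-speaker case.

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; est, ref: (batch, samples).
    Use -si_snr(est, ref).mean() as the training loss."""
    est = est - est.mean(dim=-1, keepdim=True)   # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference: the scale-invariant target
    proj = (est * ref).sum(-1, keepdim=True) * ref \
           / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
```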