- Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu. On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition. [INTERSPEECH 2020]
ASR
RNN-T
RNN-A
Transformer
- 65,000 Hours
- Non-streaming: Transformer > RNN-A >> RNN-T
- Streaming: Transformer > RNN-T (Custom LSTM, CE init) > RNN-A
- Zhifu Gao, ShiLiang Zhang, Ming Lei, Ian McLoughlin. SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition. [INTERSPEECH 2020]
ASR
AISHELL-1
- A DFSMN memory block is applied to the self-attention values and fused with the attention output (sketch below).
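A minimal sketch of how I read the idea, not the paper's exact layer: single-head self-attention whose value sequence also feeds a DFSMN-style memory block (modeled here as a depthwise 1-D convolution over time), with the two paths summed before the output projection. The sizes, the single head, and the use of PyTorch are my own assumptions.

```python
import torch
import torch.nn as nn

class SANMBlock(nn.Module):
    """Self-attention with a DFSMN-style memory branch on the values (illustrative only)."""
    def __init__(self, d_model: int = 256, kernel_size: int = 11):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Memory block: depthwise conv over time, acting on the value sequence.
        self.memory = nn.Conv1d(d_model, d_model, kernel_size,
                                padding=kernel_size // 2, groups=d_model, bias=False)
        self.out = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        context = attn @ v                                        # attention path
        mem = self.memory(v.transpose(1, 2)).transpose(1, 2)      # memory path on V
        return self.out(context + mem)                            # fuse the two paths
```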
- Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf. Contextual RNN-T for Open Domain ASR. [INTERSPEECH 2020]
- Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei, Tao Ma. ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition. [INTERSPEECH 2020]
- Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo. Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity. [INTERSPEECH 2020]
- Timo Lohrenz, Tim Fingscheidt. BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example. [INTERSPEECH 2020]
- Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel. Relative Positional Encoding for Speech Recognition and Direct Translation. [INTERSPEECH 2020]
- Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka. Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of any Number of Speakers. [INTERSPEECH 2020]
- Takashi Fukuda, Samuel Thomas. Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework. [INTERSPEECH 2020]
- Jinhwan Park, Wonyong Sung. Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition. [INTERSPEECH 2020]
- Wang Dai, Jinsong Zhang, Yingming Gao, Wei Wei, Dengfeng Ke, Binghuai Lin, Yanlu Xie. Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism. [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
Mon-2-3 Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang. AutoSpeech: Neural Architecture Search for Speaker Recognition. [INTERSPEECH 2020]
- Ya-Qi Yu, Wu-Jun Li. Densely Connected Time Delay Neural Network for Speaker Verification. [INTERSPEECH 2020]
SI-SV
VoxCeleb
Dense Connection
Multi-branch
- Densely Connected TDNN (D-TDNN)
- Statistics-and-Selection (attention-based weights; sketch below)
  - Q: Learned parameters
  - K: Global embedding (equal weights)
  - V: Hidden features
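As a reference point for the Q/K/V roles noted above, here is a generic attention pooling with a learned query over frame-level hidden features. It is not the paper's exact Statistics-and-Selection layer (in particular, I do not model the global-embedding keys); the projection sizes and PyTorch are assumptions.

```python
import torch
import torch.nn as nn

class LearnedQueryPooling(nn.Module):
    """Attention pooling: learned query, keys projected from frames, values = frames."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))    # Q: learned parameters
        self.key_proj = nn.Linear(dim, dim)            # K: here derived from the frames (a simplification)

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (batch, time, dim)
        k = self.key_proj(h)
        scores = (k @ self.query) / h.size(-1) ** 0.5      # (batch, time)
        w = torch.softmax(scores, dim=-1)
        return (w.unsqueeze(-1) * h).sum(dim=1)            # V: hidden features, weighted sum
```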
- Siqi Zheng, Hongbin Suo, Yun Lei. Phonetically-Aware Coupled Network For Short Duration Text-independent Speaker Verification. [INTERSPEECH 2020]
SI-SV
Short Duration
NIST SRE
VoxCeleb
Phonetic Information
Triplet Loss
- Phonetically-Aware Coupled Network (PacNet)
- 'Triplet loss training scheme is more fitting than softmax loss system for normalizing phonetic contents.'
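Since the takeaway above is about triplet-loss versus softmax training, here is a minimal triplet-loss sketch for speaker embeddings; the cosine distance and the margin value are my own choices, not the paper's.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """anchor/positive/negative: (batch, dim) speaker embeddings.
    Pulls same-speaker pairs together, pushes different-speaker pairs apart."""
    d_ap = 1 - F.cosine_similarity(anchor, positive)   # cosine distance to the positive
    d_an = 1 - F.cosine_similarity(anchor, negative)   # cosine distance to the negative
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```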
- Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoi Rin Kim. Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention. [INTERSPEECH 2020]
SI-SV
KWS
Phonetic Information
- Global Attention (attention-based weights; sketch below)
  - Q: Global embedding (LSTM-based weights)
  - KV: Hidden features
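A rough sketch of the global-query idea as noted above: an LSTM summarizes the utterance into a global query, which then attends over the frame-level hidden features (keys and values). The layer sizes and the use of the final LSTM state are assumptions on my part.

```python
import torch
import torch.nn as nn

class GlobalQueryAttention(nn.Module):
    """Pool frame features with a query derived from an LSTM summary of the utterance."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.summarizer = nn.LSTM(dim, dim, batch_first=True)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (batch, time, dim)
        _, (last_hidden, _) = self.summarizer(h)
        q = last_hidden[-1]                                 # Q: global embedding (LSTM-based)
        k = self.k_proj(h)                                  # K (and V) from the hidden features
        w = torch.softmax((k @ q.unsqueeze(-1)).squeeze(-1) / h.size(-1) ** 0.5, dim=-1)
        return (w.unsqueeze(-1) * h).sum(dim=1)             # weighted sum of the values
```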
- Yanfeng Wu, Chenkai Guo, Hongcan Gao, Xiaolei Hou, Jing Xu. Vector-Based Attentive Pooling for Text-Independent Speaker Verification. [INTERSPEECH 2020]
SI-SV
VoxCeleb
SITW
- 'Most attentive pooling methods are not more effective than statistics pooling.'
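For reference, the statistics-pooling baseline that the quoted finding says is hard to beat is just the per-utterance mean and standard deviation over time; a minimal version (PyTorch assumed):

```python
import torch

def statistics_pooling(h: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """h: (batch, time, dim) frame-level features -> (batch, 2*dim) utterance embedding."""
    mean = h.mean(dim=1)
    std = torch.sqrt(h.var(dim=1, unbiased=False) + eps)   # eps avoids sqrt(0) gradients
    return torch.cat([mean, std], dim=-1)
```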
- Pooyan Safari, Miquel India, Javier Hernando. Self-Attention Encoding and Pooling for Speaker Recognition. [INTERSPEECH 2020]
SI-SV
VoxCeleb
Self-Attention
Attentive Pooling
- Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu. ARET: Aggregated Residual Extended Time-delay Neural Networks for Speaker Verification. [INTERSPEECH 2020]
SI-SV
VoxCeleb
Residual Connection
Grouped Conv
- Hanyi Zhang, Longbiao Wang, Yunchun Zhang, Meng Liu, Kong Aik Lee, Jianguo Wei. Adversarial Separation Network for Speaker Recognition. [INTERSPEECH 2020]
SI-SV
VCTK
Adversarial Attack
- Reconstruct adversarial perturbations
- Jingyu Li, Tan Lee. Text-Independent Speaker Verification with Dual Attention Network. [INTERSPEECH 2020]
SI-SV
VoxCeleb
Dual Attention
- Global Attention (attention-based weights)
  - Q: Global embedding (equal weights)
  - K: Hidden features (deeper layers)
  - V: Hidden features
- Mutual Attention (sketch below)
  - Q: Global embedding (attention-based weights, from another utterance)
  - K: Hidden features (deeper layers)
  - V: Hidden features
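A minimal sketch of the mutual-attention idea for a verification pair: each utterance is pooled with a query taken from the other utterance's global embedding, while keys and values come from its own frames. I use a plain mean for the cross-utterance embedding, whereas the note above says the paper builds it with attention-based weights; the projections and sizes are likewise assumptions.

```python
import torch
import torch.nn as nn

class MutualAttentionPooling(nn.Module):
    """Pool each utterance's frames with a query from the paired utterance."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def _pool(self, frames: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(other.mean(dim=1))              # Q: global embedding of the other utterance
        k = self.k_proj(frames)                         # K: this utterance's hidden features
        w = torch.softmax((k @ q.unsqueeze(-1)).squeeze(-1) / frames.size(-1) ** 0.5, dim=-1)
        return (w.unsqueeze(-1) * frames).sum(dim=1)    # V: this utterance's hidden features

    def forward(self, frames_a: torch.Tensor, frames_b: torch.Tensor):
        return self._pool(frames_a, frames_b), self._pool(frames_b, frames_a)
```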
- Xiaoyang Qu, Jianzong Wang, Jing Xiao. Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification. [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- . [INTERSPEECH 2020]
- Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li. SpEx+: A Complete Time Domain Speaker Extraction Network. [INTERSPEECH 2020]
WSJ0-2mix
- Replace the frequency-domain speaker encoder with a time-domain one (weights shared with the extraction network's speech encoder) to alleviate the domain mismatch (sketch below).
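A rough sketch of the shared time-domain encoder idea: the same 1-D convolutional waveform encoder embeds both the mixture and the reference utterance, so the speaker-encoder branch and the extraction branch see features from one domain. The channel count, kernel/stride, and the mean pooling used as a stand-in for the speaker encoder are my own assumptions.

```python
import torch
import torch.nn as nn

class TimeDomainEncoder(nn.Module):
    """Waveform encoder shared by the mixture and the reference utterance."""
    def __init__(self, channels: int = 256, kernel: int = 40, stride: int = 20):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, kernel, stride=stride)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:   # wav: (batch, samples)
        return torch.relu(self.conv(wav.unsqueeze(1)))       # (batch, channels, frames)

encoder = TimeDomainEncoder()
mix_feats = encoder(torch.randn(2, 16000))     # fed to the extraction network
ref_feats = encoder(torch.randn(2, 16000))     # fed to the speaker-encoder branch
spk_embedding = ref_feats.mean(dim=-1)         # crude stand-in for the real speaker encoder
```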
- Tingle Li, Qingjian Lin, Yuanyuan Bao, Ming Li. Atss-Net: Target Speaker Separation via Attention-based Neural Network. [INTERSPEECH 2020]
- Leyuan Qu, Cornelius Weber, Stefan Wermter. Multimodal Target Speech Separation with Voice and Face References. [INTERSPEECH 2020]
LRS3
Face Embedding
- Incorporate face embedding extracted from a single face profile into speech separation.
- Zining Zhang, Bingsheng He, Zhenjie Zhang. X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network. [INTERSPEECH 2020]
LibriSpeech
Alternative Training
Speaker Presence
- Alternative training with target speaker A over mixtures BC + ABC + ABC (the target can be absent from the mixture)
- Loss (see the SI-SNR sketch below)
  - + SI-SNR for the mixture signal of distorted speakers
  - + Penalty on the remaining signal under the absent-speaker condition
- Metric
  - NSR (negative SI-SNRi rate) for the distorted-speaker case
  - NER (negative energy rate) for the absent-speaker case
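The SI-SNR quantities behind the loss terms and the NSR metric noted above, in minimal form (standard zero-mean, scale-invariant definition; PyTorch assumed). NSR would then be the fraction of test utterances where si_snr_improvement comes out negative.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB. est, ref: (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    target = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def si_snr_improvement(est, ref, mixture):
    """SI-SNRi: gain over leaving the mixture unprocessed; negative values are the
    failure cases counted by NSR."""
    return si_snr(est, ref) - si_snr(mixture, ref)
```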
- Chenda Li, Yanmin Qian. Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation. [INTERSPEECH 2020]
LRS2
Audio-Visual
Speech Recognition
- Robust phonetic embedding conditioned on visual embedding.
- Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu. A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments. [INTERSPEECH 2020]
- Jianshu Zhao, Shengzhou Gao, Takahiro Shinozaki. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding. [INTERSPEECH 2020]
WSJ0-2mix
WSJ0-3mix
LibriSpeech
- Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki. Listen to What You Want: Neural Network-Based Universal Sound Selector. [INTERSPEECH 2020]
- Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, Noboru Harada. Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels. [INTERSPEECH 2020]
- Jiahao Xu, Kun Hu, Chang Xu, Duc Chung Tran, Zhiyong Wang. Speaker-Aware Monaural Speech Separation. [INTERSPEECH 2020]