
INTERSPEECH 2023 Papers: A complete collection of influential and exciting research papers from the INTERSPEECH 2023 conference. Explore the latest advances in speech and language processing. Code included. Star the repository to support the advancement of speech technology!

MIT License




Interspeech 2023

Draft PDF version of the INTERSPEECH 2023 Conference Programme, which lists all accepted full papers together with their provisional mode of presentation and the time at which they will be presented.

Other collections of the best AI conferences

NOTE: The conference table is kept up to date.

Conference Year


Contributions to improve the completeness of this list are greatly appreciated. If you come across any overlooked papers, please feel free to create pull requests, open issues or contact me via email. Your participation is crucial to making this repository even better.


NOTE: Final paper links will be added post-conference.

Resources for Spoken Language Processing

# Title Repo Paper
1686 Multimodal Personality Traits Assessment (MuPTA) Corpus: The Impact of Spontaneous and Read Speech GitHub Documentation
1049 MOCKS 1.0: Multilingual Open Custom Keyword Spotting Testset
2150 MD3: The Multi-Dialect Dataset of Dialogues Kaggle arXiv
2279 MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation GitHub arXiv
1828 Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition
2351 HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation arXiv

Speech Synthesis: Prosody and Emotion

# Title Repo Paper
749 Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks arXiv
1292 Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations
1317 EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis arXiv
806 Laughter Synthesis using Pseudo Phonetic Tokens with a Large-Scale In-the-Wild Laughter Corpus GitHub Page
2270 Explicit Intensity Control for Accented Text-to-Speech GitHub Page
834 Comparing Normalizing Flows and Diffusion Models for Prosody and Acoustic Modelling in Text-to-Speech

Statistical Machine Translation

# Title Repo Paper
2484 Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
1063 Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters arXiv
648 StyleS2ST: Zero-Shot Style Transfer for Direct Speech-to-Speech Translation GitHub Page arXiv
1767 Joint Speech Translation and Named Entity Recognition GitHub arXiv
2050 Analysis of Acoustic Information in End-to-End Spoken Language Translation
2004 LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model using Neural Transducers arXiv

Self-Supervised Learning in ASR

# Title Repo Paper
1213 DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models GitHub arXiv
1040 Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations GitHub arXiv
387 Dual Acoustic Linguistic Self-Supervised Representation Learning for Cross-Domain Speech Recognition
2166 O-1: Self-Training with Oracle and 1-best Hypothesis
822 MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets GitHub arXiv
1802 Comparing Self-Supervised Pre-Training and Semi-Supervised Training for Speech Recognition in Languages with Weak Language Models


# Title Repo Paper
1781 Chinese EFL Learners' Perception of English Prosodic Focus
315 Pitch Accent Variation and the Interpretation of Rising and Falling Intonation in American English
1033 Tonal Coarticulation as a Cue for Upcoming Prosodic Boundary
2116 Alignment of Beat Gestures and Prosodic Prominence in German
1454 Creak Prevalence and Prosodic Context in Australian English
1651 Speech Reduction: Position within French Prosodic Structure

Speech Production

# Title Repo Paper
637 Transvelar Nasal Coupling Contributing to Speaker Characteristics in Non-nasal Vowels
286 Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
2283 The ART of Conversation: Measuring Phonetic Convergence and Deliberate Imitation in L2-Speech with a Siamese RNN GitHub arXiv
1933 Did You See that? Exploring the Role of Vision in the Development of Consonant Feature Contrasts in Children with Cochlear Implants

Dysarthric Speech Assessment

# Title Repo Paper
2017 Automatic Assessments of Dysarthric Speech: the Usability of Acoustic-Phonetic Features
1455 Classification of Multi-class Vowels and Fricatives from Patients Having Amyotrophic Lateral Sclerosis with Varied Levels of Dysarthria Severity
1627 Parameter-efficient Dysarthric Speech Recognition using Adapter Fusion and Householder Transformation arXiv
2481 Few-Shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation idiap
1921 Latent Phrase Matching for Dysarthric Speech arXiv
173 Speech Intelligibility Assessment of Dysarthric Speech by using Goodness of Pronunciation with Uncertainty Quantification GitHub arXiv

Speech Coding: Transmission

# Title Repo Paper
1562 CQNV: A Combination of Coarsely Quantized Bitstream and Neural Vocoder for Low Rate Speech Coding
1234 Target Speech Extraction with Conditional Diffusion Model
883 Towards Fully Quantized Neural Networks For Speech Enhancement GitHub
980 Complex Image Generation SwinTransformer Network for Audio Denoising GitHub

Speech Recognition: Signal Processing, Acoustic Modeling, Robustness, Adaptation

# Title Repo Paper
2118 Using Text Injection to Improve Recognition of Personal Identifiers in Speech
837 Investigating Wav2Vec2 Context Representations and the Effects of Fine-Tuning, a Case-Study of a Finnish Model GitHub
872 Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech
177 Iteratively Improving Speech Recognition and Voice Conversion GitHub Page arXiv
2001 LABERT: A Combination of Local Aggregation and Self-Supervised Speech Representation Learning for Detecting Informative Hidden Units in Low-Resource ASR Systems nottingham-repo
746 TranUSR: Phoneme-to-Word Transcoder Based Unified Speech Representation Learning for Cross-Lingual Speech Recognition arXiv
1124 Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR
2417 GhostRNN: Reducing State Redundancy in RNN with Cheap Operations
1442 Task-Agnostic Structured Pruning of Speech Representation Models arXiv
485 Factual Consistency Oriented Speech Recognition arXiv
1036 Multi-Head State Space Model for Speech Recognition arXiv
341 Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search
2359 Probing Self-Supervised Speech Models for Phonetic and Phonemic Information: A Case Study in Aspiration arXiv
739 Selective Biasing with Trie-based Contextual Adapters for Personalised Speech Recognition using Neural Transducers Amazon Science
213 A More Accurate Internal Language Model Score Estimation for the Hybrid Autoregressive Transducer
106 Attention Gate between Capsules in Fully Capsule-Network Speech Recognition
2585 OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking arXiv
1316 ML-SUPERB: Multilingual Speech Universal PERformance Benchmark GitHub Page arXiv
2389 General-purpose Adversarial Training for Enhanced Automatic Speech Recognition Model Generalization
275 Joint Instance Reconstruction and Feature Sub-space Alignment for Cross-Domain Speech Emotion Recognition
2280 Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data arXiv
1272 Random Utterance Concatenation based Data Augmentation for Improving Short-Video Speech Recognition arXiv
1189 Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers arXiv
223 Rethinking Speech Recognition with a Multimodal Perspective via Acoustic and Semantic Cooperative Decoding arXiv
923 Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation GitHub Page
2258 Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts arXiv
1184 DCCRN-KWS: An Audio Bias based Model for Noise Robust Small-Footprint Keyword Spotting arXiv
1609 OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition arXiv
2136 Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition arXiv
788 Rehearsal-Free Online Continual Learning for Automatic Speech Recognition GitHub arXiv
496 ASR Data Augmentation in Low-Resource Settings using Cross-Lingual Multi-Speaker TTS and Cross-Lingual Voice Conversion GitHub Page arXiv
642 Personality-aware Training based Speaker Adaptation for End-to-End Speech Recognition
2257 Target Vocabulary Recognition Based on Multi-task Learning with Decomposed Teacher Sequences
679 Wave to Syntax: Probing Spoken Language Models for Syntax GitHub arXiv
720 Effective Training of Attention-based Contextual Biasing Adapters with Synthetic Audio for Personalised ASR Amazon Science
630 Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation arXiv
1118 SlothSpeech: Denial-of-Service Attack Against Speech Recognition Models GitHub arXiv
503 CLRL-Tuning: A Novel Continual Learning Approach for Automatic Speech Recognition
159 Exploring Sources of Racial Bias in Automatic Speech Recognition through the Lens of Rhythmic Variation
1440 Can Contextual Biasing Remain Effective with Whisper and GPT-2? arXiv
221 Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation GitHub Page arXiv
2207 Improving RNN Transducer Acoustic Models for English Conversational Speech Recognition
1216 MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition
1192 Improving Chinese Mandarin Speech Recognition using Graph Embedding Regularization
1276 Adapting Multi-Lingual ASR Models for Handling Multiple Talkers arXiv
1221 Adapter-Tuning with Effective Token-Dependent Representation Shift for Automatic Speech Recognition
1010 Model-Internal Slot-Triggered Biasing for Domain Expansion in Neural Transducer ASR Models Amazon Science
2508 Delay-Penalized CTC Implemented based on Finite State Transducer GitHub arXiv
101 Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition GitHub arXiv
1064 MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations GitHub arXiv
1422 Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator arXiv
2589 Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR GitHub arXiv
1091 Domain Adaptive Self-Supervised Training of Automatic Speech Recognition
1105 There is more than One Kind of Robustness: Fooling Whisper with Adversarial Examples GitHub arXiv
1176 Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute arXiv
759 Blank-Regularized CTC for Frame Skipping in Neural Transducer GitHub arXiv
2406 The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR arXiv
2354 Improving RNN-Transducers with Acoustic LookAhead arXiv
1847 Everyone has an Accent
2124 Some Voices are too Common: Building Fair Speech Recognition Systems using the Common-Voice Dataset arXiv
1168 Information Magnitude based Dynamic Sub-Sampling for Speech-to-Text
353 Towards Multi-task Learning of Speech and Speaker Recognition GitHub arXiv
2186 Regarding Topology and Variant Frame Rates for Differentiable WFST-based End-to-End ASR
1012 2-bit Conformer Quantization for Automatic Speech Recognition arXiv
167 Time-Domain Speech Enhancement for Robust Automatic Speech Recognition arXiv
257 Multi-Channel Multi-Speaker Transformer for Speech Recognition
733 Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion arXiv
2463 Dialect Speech Recognition Modeling using Corpus of Japanese Dialects and Self-Supervised Learning-based Model XLSR
767 Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network arXiv
970 Competitive and Resource Efficient Factored Hybrid HMM Systems are Simpler Than You Think arXiv
791 MMSpeech: Multi-Modal Multi-Task Encoder-Decoder Pre-training for Speech Recognition arXiv
2499 Biased Self-Supervised Learning for ASR arXiv
1300 A Unified Recognition and Correction Model under Noisy and Accent Speech Conditions
2470 Wav2Vec 2.0 ASR for Cantonese-Speaking Older Adults in a Clinical Setting
770 BAT: Boundary aware Transducer for Memory-Efficient and Low-Latency ASR GitHub arXiv
1342 Bayes Risk Transducer: Transducer with Controllable Alignment Prediction
783 Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition arXiv

Analysis of Speech and Audio Signals

# Title Repo Paper
1173 Robust Prototype Learning for Anomalous Sound Detection
982 A Multimodal Prototypical Approach for Unsupervised Sound Classification GitHub arXiv
563 Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms
1082 Adapting Language-Audio Models as Few-Shot Audio Learners arXiv
914 Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention GitHub arXiv
734 TFECN: Time-Frequency Enhanced ConvNet for Audio Classification
350 Resolution Consistency Training on Time-Frequency Domain for Semi-Supervised Sound Event Detection
1174 Fine-Tuning Audio Spectrogram Transformer with Task-Aware Adapters for Sound Event Detection
1210 Small Footprint Multi-Channel Network for Keyword Spotting with Centroid Based Awareness
1380 Few-Shot Class-Incremental Audio Classification using Adaptively-Refined Prototypes arXiv
1549 Interpretable Latent Space using Space-Filling Curves for Phonetic Analysis in Voice Conversion GitLab Aalto
1861 Topological Data Analysis for Speech Processing GitHub Page arXiv
1329 Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation GitHub arXiv
932 Personalized Acoustic Scene Classification in Ultra-Low Power Embedded Devices using Privacy-Preserving Data Augmentation
176 Background Domain Switch: A Novel Data Augmentation Technique for Robust Sound Event Detection
1021 Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning GitHub Pdf
2416 Anomalous Sound Detection using Self-Attention-based Frequency Pattern Analysis of Machine Sounds
1478 Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
979 Ontology-aware Learning and Evaluation for Audio Tagging GitHub arXiv
575 Differential Privacy enabled Dementia Classification: An Exploration of the Privacy-Accuracy Trade-off in Speech Signal Data
1595 Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech GitHub Page arXiv
1816 Towards Multi-Lingual Audio Question Answering GitHub
477 Wav2ToBI: A New Approach to Automatic ToBI Transcription
1579 MCR-Data2vec 2.0: Improving Self-Supervised Speech Pre-training via Model-Level Consistency Regularization arXiv
591 Anomalous Sound Detection based on Sound Separation arXiv
2089 Random Forest Classification of Breathing Phases from Audio Signals Recorded using Mobile Devices
1581 GRAVO: Learning to Generate Relevant Audio from Visual Features with Noisy Online Videos
358 Emotion-aware Audio-Driven Face Animation via Contrastive Feature Disentanglement
344 Joint-Former: Jointly Regularized and Locally Down-Sampled Conformer for Semi-Supervised Sound Event Detection
245 Towards Attention-based Contrastive Learning for Audio Spoof Detection
2488 Masked Audio Modeling with CLAP and Multi-Objective Learning
1904 Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems GitHub arXiv
481 Self-Supervised Dataset Pruning for Efficient Training in Audio Anti-Spoofing
491 Semantic Segmentation with Bidirectional Language Models Improves Long-Form ASR arXiv
684 Multi-Microphone Automatic Speech Segmentation in Meetings based on Circular Harmonics Features arXiv
542 Advanced RawNet2 with Attention-based Channel Masking for Synthetic Speech Detection
88 Insights Into End-to-End Audio-to-Score Transcription with Real Recordings: A Case Study with Saxophone Works
2193 Whisper-AT: Noise-Robust Automatic Speech Recognizers are also Strong Audio Event Taggers GitHub
1621 Synthetic Voice Spoofing Detection based on Feature Pyramid Conformer
1383 Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection
2011 Application of Knowledge Distillation to Multi-Task Speech Representation Learning arXiv
2297 DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes arXiv
1965 Variational Classifier for Unsupervised Anomalous Sound Detection under Domain Generalization
745 FlexiAST: Flexibility is What AST Needs GitHub arXiv
1344 Blind Estimation of Room Impulse Response from Monaural Reverberant Speech with Segmental Generative Neural Network ResearchGate
613 Dual-Memory Multi-Modal Learning for Continual Spoken Keyword Spotting with Confidence Selection and Diversity Enhancement
1431 An Efficient Speech Separation Network based on Recurrent Fusion Dilated Convolution and Channel Attention arXiv
801 Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation
2015 Binaural Sound Localization in Noisy Environments using Frequency-based Audio Vision Transformer (FAViT)
1723 Contrastive Learning based Deep Latent Masking for Music Source Separation
655 Speaker Extraction with Detection of Presence and Absence of Target Speakers
889 PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
2117 Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning Apple
1309 Image-Driven Audio-Visual Universal Source Separation
2520 Joint Blind Source Separation and Dereverberation for Automatic Speech Recognition using Delayed-Subsource
1766 SDNet: Stream-Attention and Dual-Feature Learning Network for Ad-hoc Array Speech Separation
2451 Deeply Supervised Curriculum Learning for Deep Neural Network-based Sound Source Localization
164 Multi-Channel Separation of Dynamic Speech and Sound Events GitHub
2545 Rethinking the Visual Cues in Audio-Visual Speaker Extraction GitHub arXiv
85 Using Semi-Supervised Learning for Monaural Time-Domain Speech Separation with a Self-Supervised Learning-based SI-SNR Estimator
1158 Investigation of Training Mute-Expressive End-to-End Speech Separation Networks for an Unknown Number of Speakers
2369 SR-SRP: Super-Resolution based SRP-PHAT for Sound Source Localization and Tracking
165 Time-Frequency Domain Filter-and-Sum Network for Multi-Channel Speech Separation
714 FN-SSL: Full-Band and Narrow-Band Fusion for Sound Source Localization GitHub arXiv
696 A Neural State-Space Modeling Approach to Efficient Speech Separation arXiv
1777 Locate and Beamform: Two-Dimensional Locating All-Neural Beamformer for Multi-Channel Speech Separation GitHub arXiv
518 Monaural Speech Separation Method based on Recurrent Attention with Parallel Branches
951 What do Self-Supervised Speech Representations Encode? An Analysis of Languages, Varieties, Speaking Styles and Speakers
1696 A Compressed Synthetic Speech Detection Method with Compression Feature Embedding
572 Outlier-aware Inlier Modeling and Multi-Scale Scoring for Anomalous Sound Detection via Multitask Learning
263 MOSLight: A Lightweight Data-Efficient System for Non-Intrusive Speech Quality Assessment
1626 A Multi-Scale Attentive Transformer for Multi-Instrument Symbolic Music Generation GitHub arXiv
2494 MTANet: Multi-band Time-Frequency Attention Network for Singing Melody Extraction from Polyphonic Music
119 Xiaoicesing 2: A High-Fidelity Singing Voice Synthesizer based on Generative Adversarial Network GitHub Page arXiv
2190 Do Vocal Breath Sounds Encode Gender Cues for Automatic Gender Classification?
202 Automatic Exploration of Optimal Data Processing Operations for Sound Data Augmentation using Improved Differentiable Automatic Data Augmentation
1430 A Snoring Sound Dataset for Body Position Recognition: Collection, Annotation, and Analysis
528 RMVPE: A Robust Model for Vocal Pitch Estimation in Polyphonic Music GitHub arXiv
832 Spatialization Quality Metric for Binaural Speech
428 AsthmaSCELNet: A Lightweight Supervised Contrastive Embedding Learning Framework for Asthma Classification using Lung Sounds
1426 Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification GitHub arXiv
2115 Remote Assessment for ALS using Multimodal Dialog Agents: Data Quality, Feasibility and Task Compliance Pdf
852 AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation GitHub Page
209 Obstructive Sleep Apnea Screening with Breathing Sounds and Respiratory Effort: A Multimodal Deep Learning Approach
2275 Investigation of Music Emotion Recognition based on Segmented Semi-Supervised Learning

Speech Recognition: Architecture, Search, and Linguistic Components

# Title Repo Paper
2344 Diacritic Recognition Performance in Arabic ASR arXiv
990 Personalization for BERT-based Discriminative Speech Recognition Rescoring Amazon Science
2182 On the N-gram Approximation of Pre-trained Language Models arXiv
2147 Record Deduplication for Entity Distribution Modeling in ASR Transcripts arXiv
2205 Learning When to Trust Which Teacher for Weakly Supervised ASR arXiv
1313 Text-Only Domain Adaptation using Unified Speech-Text Representation in Transducer arXiv
1378 Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
2479 Knowledge Distillation Approach for Efficient Internal Language Model Estimation
276 Language Model Personalization for Improved Touchscreen Typing
1223 Blank Collapse: Compressing CTC Emission for the Faster Decoding GitHub arXiv
403 Improving Joint Speech-Text Representations without Alignment
1941 Leveraging Cross-Utterance Context for ASR Decoding arXiv
423 Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation GitHub arXiv
1517 Integration of Frame- and Label-Synchronous Beam Search for Streaming Encoder-Decoder Speech Recognition
1071 A Neural Time Alignment Module for End-to-End Automatic Speech Recognition
599 Accelerating Transducers through Adjacent Token Merging arXiv
617 Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition arXiv
2292 Language-Routing Mixture of Experts for Multi-Lingual and Code-Switching Speech Recognition
1437 Embedding Articulatory Constraints for Low-Resource Speech Recognition based on Large Pre-trained Model
2051 Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning arXiv
768 SpellMapper: A Non-Autoregressive Neural Spellchecker for ASR Customization with Candidate Retrieval based on N-Gram Mappings arXiv
2037 Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
1281 Confidence-based Ensembles of End-to-End Speech Recognition Models arXiv
1050 Unsupervised Code-Switched Text Generation from Parallel Text
258 A Binary Keyword Spotting System With Error-Diffusion Speech Feature Binarization
621 Language-Universal Phonetic Encoder for Low-Resource Speech Recognition arXiv
863 A Lexical-aware Non-Autoregressive Transformer-based ASR Model arXiv
1841 Improving Under-Resourced Code-Switched Speech Recognition: Large Pre-trained Models or Architectural Interventions
1194 A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks GitHub
61 A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization
137 Modeling Dependent Structure for Utterances in ASR Evaluation arXiv
757 ASR for Low Resource and Multilingual Noisy Code-Mixed Speech
390 Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System arXiv
737 Combining Multilingual Resources and Models to Develop State-of-the-Art E2E ASR for Swedish
1171 Two Stage Contextual Word Filtering for Context Bias in Unified Streaming and Non-Streaming Transducer arXiv
1867 Towards Continually Learning New Languages
1616 N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space arXiv
1432 SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge
1162 miniStreamer: Enhancing Small Conformer with Chunked-Context Masking for Streaming ASR Applications on the Edge
1469 CoMFLP: Correlation Measure based Fast Search on ASR Layer Pruning
1337 Exploration on HuBERT with Multiple Resolution arXiv
2045 Quantization-aware and Tensor-compressed Training of Transformers for Natural Language Understanding arXiv
2355 Word-Level Confidence Estimation for CTC Models
2235 Multilingual Contextual Adapters to Improve Custom Word Recognition in Low-Resource Languages arXiv
614 Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition
1303 4D ASR: Joint Modeling of CTC, Attention, Transducer, and Mask-Predict Decoders arXiv
1086 Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition GitHub arXiv
262 Language-Specific Boundary Learning for Improving Mandarin-English Code-Switching Speech Recognition
480 Mixture-of-Expert Conformer for Streaming Multilingual ASR arXiv
1665 Lossless 4-bit Quantization of Architecture Compressed Conformer ASR Systems on the 300-hr Switchboard Corpus
2544 Compressed MoE ASR Model Based on Knowledge Distillation and Quantization

Speech Recognition: Technologies and Systems for New Applications

# Title Repo Paper
2044 Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model GitHub arXiv
2032 Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization GitHub arXiv
235 Progress and Prospects for Spoken Language Technology: Results from Five Sexennial Surveys
268 Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling arXiv
601 CASA-ASR: Context-Aware Speaker-Attributed ASR arXiv
1321 Unsupervised Learning of Discrete Latent Representations with Data-Adaptive Dimensionality from Continuous Speech Streams
1167 AD-TUNING: An Adaptive CHILD-TUNING Approach to Efficient Hyperparameter Optimization of Child Networks for Speech Processing Tasks in the SUPERB Benchmark GitHub
190 Distilling Knowledge from Gaussian Process Teacher to Neural Network Student
135 Segmental SpeechCLIP: Utilizing Pretrained Image-Text Models for Audio-Visual Learning
421 Towards Hate Speech Detection in Low-Resource Languages: Comparing ASR to Acoustic Word Embeddings on Wolof and Swahili arXiv
385 Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification through Meta-Learning GitHub arXiv
664 Online Punctuation Restoration using ELECTRA Model for Streaming ASR Systems
2066 Language Agnostic Data-Driven Inverse Text Normalization arXiv
1079 How to Estimate Model Transferability of Pre-trained Speech Models? arXiv
1655 Transcribing Speech as Spoken and Written Dual Text using an Autoregressive Model
587 Phonetic and Prosody-aware Self-Supervised Learning Approach for Non-Native Fluency Scoring arXiv
380 Disentangling the Contribution of Non-Native Speech in Automated Pronunciation Assessment
337 A Joint Model for Pronunciation Assessment and Mispronunciation Detection and Diagnosis with Multi-task Learning
1635 Assessing Intelligibility in Non-Native Speech: Comparing Measures Obtained at Different Levels
585 End-to-End Word-Level Pronunciation Assessment with MASK Pre-training arXiv
550 A Hierarchical Context-aware Modeling Approach for Multi-Aspect and Multi-Granular Pronunciation Assessment arXiv
2541 Automatic Prediction of Language Learners' Listenability using Speech and Text Features Extracted from Listening Drills
2371 Assessment of Non-Native Speech Intelligibility using Wav2vec2-based Mispronunciation Detection and Multi-Level Goodness of Pronunciation Transformer
1899 Adapting an Unadaptable ASR System arXiv
533 Addressing Cold Start Problem for End-to-End Automatic Speech Scoring arXiv
816 Improving Grapheme-to-Phoneme Conversion by Learning Pronunciations from Speech Recordings Amazon Science
2577 Orthography-based Pronunciation Scoring for Better CAPT Feedback Pdf
1592 Zero-Shot Automatic Pronunciation Assessment arXiv
364 Mispronunciation Detection and Diagnosis Model for Tonal Language, Applied to Vietnamese
793 An Efficient and Noise-Robust Audiovisual Encoder for Audiovisual Speech Recognition
540 A Novel Self-training Approach for Low-Resource Speech Recognition
1428 FunASR: A Fundamental End-to-End Speech Recognition Toolkit GitHub arXiv
487 Streaming Audio-Visual Speech Recognition with Alignment Regularization arXiv
462 SparseVSR: Lightweight and Noise Robust Visual Speech Recognition arXiv
2262 Multimodal Speech Recognition for Language-Guided Embodied Agents GitHub arXiv

Lexical and Language Modeling for ASR

# Title Repo Paper
643 NoRefER: A Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning GitHub arXiv
2128 Scaling Laws for Discriminative Speech Recognition Rescoring Models Amazon Science
2429 Exploring Energy-based Language Models with Different Architectures and Training Methods for Speech Recognition GitHub Page arXiv
1362 Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition arXiv
1251 Memory Network-based End-To-End Neural ES-KMeans for Improved Word Segmentation
1320 Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation arXiv

Language Identification and Diarization

# Title Repo Paper
304 Lightweight and Efficient Spoken Language Identification of Long-form Audio
1109 End-to-End Spoken Language Diarization with Wav2vec Embeddings
1986 Efficient Spoken Language Recognition via Multilabel Classification arXiv
1529 Description and Analysis of ABC Submission to NIST LRE 2022
1790 Exploring the Impact of Pretrained Models and Web-Scraped Data for the 2022 NIST Language Recognition Evaluation
1094 Advances in Language Recognition in Low Resource African Languages: The JHU-MIT Submission for NIST LRE22

Speech Quality Assessment

# Title Repo Paper
1436 DeePMOS: Deep Posterior Mean-Opinion-Score of Speech
1644 The Role of Formant and Excitation Source Features in Perceived Naturalness of Low Resource Tribal Language TTS: An Empirical Study
811 A No-Reference Speech Quality Assessment Method based on Neural Network with Densely Connected Convolutional Architecture
2507 Probing Speech Quality Information in ASR Systems
589 Preference-based Training Framework for Automatic Speech Quality Assessment using Deep Neural Network
389 Crowdsourced Data Validation for ASR Training

Feature Modeling for ASR

# Title Repo Paper
2296 Re-Investigating the Efficient Transfer Learning of Speech Foundation Model using Feature Fusion Methods
1556 Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training
509 InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition arXiv
579 Transductive Feature Space Regularization for Few-Shot Bioacoustic Event Detection
615 Incorporating L2 Phonemes using Articulatory Features for Robust Speech Recognition arXiv
1510 On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning HAL Science

Interfacing Speech Technology and Phonetics

# Title Repo Paper
1846 Phonemic Competition in End-to-End ASR models
443 Automatic Speaker Recognition with Variation Across Vocal Conditions: A Controlled Experiment with Implications for Forensics
1398 Exploring Graph Theory Methods for the Analysis of Pronunciation Variation in Spontaneous Speech
680 Automatic Speaker Recognition Performance with Matched and Mismatched Female Bilingual Speech Data

Speech Synthesis: Multilinguality

# Title Repo Paper
2303 FACTSpeech: Speaking a Foreign Language Pronunciation using Only Your Native Characters
934 Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model arXiv
363 DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech GitHub Page arXiv
1467 Generating Multilingual Gender-Ambiguous Text-to-Speech Voices GitHub Page arXiv
2330 RADMMM: Multilingual Multiaccented Multispeaker Text-to-Speech NVIDIA AI
861 Multilingual Context-based Pronunciation Learning for Text-to-Speech Amazon Science

Speech Emotion Recognition

# Title Repo Paper
2170 Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
1113 The Importance of Calibration: Rethinking Confidence and Performance of Speech Multi-label Emotion Classifiers BIIC
1080 A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model Emulation AI arXiv
454 Privacy Risks in Speech Emotion Recognition: A Systematic Study on Gender Inference Attack
2111 Episodic Memory For Domain-Adaptable, Robust Speech Emotion Recognition
80 Stable Speech Emotion Recognition with Head-k-Pooling Loss
1923 Node-weighted Graph Convolutional Network for Depression Detection in Transcribed Clinical Interviews GitHub arXiv
756 Two-Stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining
240 The Co-use of Laughter and Head Gestures Across Speech Styles
1351 EmotionNAS: Two-Stream Neural Architecture Search for Speech Emotion Recognition arXiv
136 Pre-Finetuning for Few-Shot Emotional Speech Recognition GitHub arXiv
293 Integrating Emotion Recognition with Speech Recognition and Speaker Diarization for Conversations
1075 Utility-Preserving Privacy-Enabled Speech Embeddings for Emotion Detection
890 A Context-Constrained Sentence Modeling for Deception Detection in Real Interrogation
1914 Laughter in Task-based Settings: Whom We Talk to Affects How, When, and How Often We Laugh
653 Exploring Downstream Transfer of Self-Supervised Features for Speech Emotion Recognition
1758 Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models arXiv
819 MetricAug: A Distortion Metric-Lead Augmentation Strategy for Training Noise-Robust Speech Emotion Recognizer GitHub
1311 Investigating Acoustic Cues for Multilingual Abuse Detection
1600 A Novel Frequency Warping Scale for Speech Emotion Recognition
1170 Multi-Scale Temporal Transformer for Speech Emotion Recognition
1169 Distant Speech Emotion Recognition in an Indoor Human-Robot Interaction Scenario
2498 A Study on Prosodic Entrainment in Relation to Therapist Empathy in Counseling Conversation
2375 Improving Joint Speech and Emotion Recognition using Global Style Tokens
1163 Speech Emotion Recognition by Estimating Emotional Label Sequences with Phoneme Class Attribute
274 Unsupervised Transfer Components Learning for Cross-Domain Speech Emotion Recognition
1090 Dual Memory Fusion for Multimodal Speech Emotion Recognition
311 Hybrid Dataset for Speech Emotion Recognition in Russian Language
396 Speech Emotion Recognition using Decomposed Speech via Multi-Task Learning

Spoken Dialog Systems and Conversational Analysis

# Title Repo Paper
46 FC-MTLF: A Fine- and Coarse-grained Multi-task Learning Framework for Cross-Lingual Spoken Language Understanding
93 C²A-SLU: Cross and Contrastive Attention for Improving ASR Robustness in Spoken Language Understanding
2300 Tri-level Joint Natural Language Understanding for Multi-turn Conversational Datasets GitHub arXiv
2234 Semantic Enrichment Towards Efficient Speech Representations
1299 Tensor Decomposition for Minimization of E2E SLU Model Toward On-Device Processing arXiv
699 DiffSLU: Knowledge Distillation based Diffusion Model for Cross-Lingual Spoken Language Understanding
1962 Integrating Pretrained ASR and LM to perform Sequence Generation for Spoken Language Understanding
644 Contrastive Learning based ASR Robust Knowledge Selection for Spoken Dialogue System
1859 Unsupervised Dialogue Topic Segmentation in Hyperdimensional Space
198 An Investigation of the Combination of Rehearsal and Knowledge Distillation in Continual Learning for Spoken Language Understanding GitHub arXiv
1740 Enhancing New Intent Discovery via Robust Neighbor-based Contrastive Learning PDF
211 Personalized Predictive ASR for Latency Reduction in Voice Assistants arXiv
1419 Compositional Generalization in Spoken Language Understanding
2314 Sampling Bias in NLU Models: Impact and Mitigation Amazon Science
1038 5IDER: Unified Query Rewriting for Steering, Intent Carryover, Disfluencies, Entity Carryover and Repair arXiv
1236 Emotion Awareness in Multi-utterance Turn for Improving Emotion Prediction in Multi-Speaker Conversation
1505 WhiSLU: End-to-End Spoken Language Understanding with Whisper
1947 Relationship between Auditory and Semantic Entrainment using Deep Neural Networks (DNN)
1929 Unsupervised Auditory and Semantic Entrainment Models with Deep Neural Networks
952 Parsing Dialog Turns with Prosodic Features in English
320 Estimation of Listening Response Timing by Generative Model and Parameter Control of Response Substantialness using Dynamic-Prompt-Tune
1885 Parameter Selection for Analyzing Conversations with Autism Spectrum Disorder
2341 Efficient Multimodal Neural Networks for Trigger-Less Voice Assistants arXiv
2332 Rapid Lexical Alignment to a Conversational Agent
578 Multimodal Turn-Taking Model using Visual cues for End-of-Utterance Prediction in Spoken Dialogue Systems
1464 Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer
1618 Improving the Response Timing Estimation for Spoken Dialogue Systems by Reducing the Effect of Speech Recognition Delay
555 Focus-Attention-Enhanced Cross-Modal Transformer with Metric Learning for Multimodal Speech Emotion Recognition
1717 A Multiple-Teacher Pruning based Self-Distillation (MT-PSD) Approach to Model Compression for Audio-Visual Wake Word Spotting
789 Abusive Speech Detection in Indic Languages using Acoustic Features
1791 Listening to Silences In Contact Center Conversations using Textual cues
2475 I Learned Error, I Can Fix It!: A Detector-Corrector Structure for ASR Error Calibration
1074 Verbal and Nonverbal Feedback Signals in Response to Increasing Levels of Miscommunication
76 Speech-based Classification of Defensive Communication: A Novel Dataset and Results
1951 Quantifying the Perceptual Value of Lexical and Non-Lexical Channels in Speech GitHub Page
1267 Relationships between Gender, Personality Traits and Features of Multi-Modal Data to Responses to Spoken Dialog Systems Breakdown
1650 Speaker-aware Cross-Modal Fusion Architecture for Conversational Emotion Recognition

Speech Coding and Enhancement

# Title Repo Paper
936 Biophysically-Inspired Single-Channel Speech Enhancement in the Time Domain
1902 On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer
1901 How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning arXiv
1287 CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram
521 A Two-Stage Progressive Neural Network for Acoustic Echo Cancellation ResearchGate
537 An Intra-BRNN and GB-RVQ based End-to-End Neural Audio Codec
1066 Real-Time Personalised Speech Enhancement Transformers with Dynamic Cross-Attended Speaker Representations
280 CFTNet: Complex-Valued Frequency Transformation Network for Speech Enhancement
623 Feature Normalization for Fine-Tuning Self-Supervised Models in Speech Enhancement arXiv
1490 Multi-Mode Neural Speech Coding based on Deep Generative Networks
751 Streaming Dual-Path Transformer for Speech Enhancement
1848 Sequence-to-Sequence Multi-Modal Speech In-Painting
984 Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression arXiv
551 Differentially Private Adapters for Parameter Efficient Acoustic Modeling GitHub arXiv
780 Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation GitHub Page
2568 Consonant-Emphasis Method Incorporating Robust Consonant-Section Detection to Improve Intelligibility of Bone-Conducted Speech
1578 Downstream Task-Agnostic Speech Enhancement with Self-Supervised Representation Loss arXiv
2305 Perceptual Improvement of Deep Neural Network (DNN) Speech Coder using Parametric and Nonparametric Density Models
2437 DeFT-AN RT: Real-Time Multichannel Speech Enhancement using Dense Frequency-Time Attentive Network and Non-overlapping Synthesis Window
1376 PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement
1364 Exploring the Interactions between Target Positive and Negative Information for Acoustic Echo Cancellation
365 Iterative Autoregression: A Novel Trick to Improve your Low-Latency Speech Enhancement Model arXiv
1084 A Multi-Dimensional Deep Structured State Space Approach to Speech Enhancement using Small-Footprint Models GitHub arXiv
705 Domain Adaptation for Speech Enhancement in a Large Domain Gap
456 SCP-GAN: Self-Correcting Discriminator Optimization for Training Consistency Preserving Metric GAN on Speech Enhancement Tasks arXiv
339 A Mask Free Neural Network for Monaural Speech Enhancement GitHub arXiv
1548 A Training and Inference Strategy using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech GitHub arXiv
2418 A Simple RNN Model for Lightweight, Low-Compute and Low-Latency Multichannel Speech Enhancement in the Time Domain
1433 High Fidelity Speech Enhancement with Band-Split RNN arXiv
218 Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information GitHub Page
882 DFSNet: A Steerable Neural Beamformer Invariant to Microphone Array Configuration for Real-Time, Low-Latency Speech Enhancement arXiv
1323 Speaker-Aware Anti-Spoofing arXiv
1116 Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine
799 EffCRN: An Efficient Convolutional Recurrent Network for High-Performance Speech Enhancement arXiv
1795 HAD-ANC: A Hybrid System Comprising an Adaptive Filter and Deep Neural Networks for Active Noise Control
886 MSAF: A Multiple Self-Attention Field Method for Speech Enhancement
2302 Ultra Dual-Path Compression for Joint Echo Cancellation and Noise Suppression
971 ABC-KD: Attention-based-Compression Knowledge Distillation for Deep Learning-based Noise Suppression arXiv
1532 PLCMOS – a Data-Driven Non-Intrusive Metric for the Evaluation of Packet Loss Concealment Algorithms GitHub Page
1910 Multi-Dataset Co-training with Sharpness-aware Optimization for Audio Anti-Spoofing arXiv
1445 Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement GitHub arXiv
901 Complex-valued Neural Networks for Voice Anti-Spoofing
1028 DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation arXiv
1547 Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement arXiv
1642 HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders arXiv
1441 MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra GitHub Page
565 TRIDENTSE: Guiding Speech Enhancement with 32 Global Tokens arXiv
1254 Detection of Cross-Dataset Fake Audio based on Prosodic and Pronunciation Features arXiv
1890 Self-Supervised Learning with Diffusion based Multichannel Speech Enhancement for Speaker Verification under Noisy Conditions arXiv
1341 Two-Stage Voice Anonymization for Enhanced Privacy arXiv
2055 Personalized Dereverberation of Speech
580 Weighted Von Mises Distribution-based Loss Function for Real-Time STFT Phase Reconstruction using DNN
272 Deep Multi-Frame Filtering for Hearing Aids GitHub arXiv
1232 Aligning Speech Enhancement for Improving Downstream Classification Performance
420 DNN-based Parameter Estimation for MVDR Beamforming and Post-Filtering
675 FRA-RIR: Fast Random Approximation of the Image-Source Method GitHub arXiv
686 Rethinking Complex-Valued Deep Neural Networks for Monaural Speech Enhancement arXiv
186 Harmonic Enhancement using Learnable Comb Filter for Light-Weight Full-band Speech Enhancement Model arXiv


# Title Repo Paper
1023 Detection of Emotional Hotspots in Meetings using a Cross-Corpus Approach
1412 Detection of Laughter and Screaming using the Attention and CTC Models
1852 Capturing Formality in Speech Across Domains and Languages PDF
460 Towards Robust Family-Infant Audio Analysis based on Unsupervised Pretraining of Wav2vec 2.0 on Large-Scale Unlabeled Family Audio GitHub Page arXiv
778 Cues to Next-Speaker Projection in Conversational Swedish: Evidence from Reaction Times psyArXiv
1200 Multiple Instance Learning for Inference of Child Attachment from Paralinguistic Aspects of Speech
2070 Speaker Embeddings as Individuality Proxy for Voice Stress Detection arXiv
2213 From Interval to Ordinal: A HMM based Approach for Emotion Label Conversion
661 Turbo your Multi-Modal Classification with Contrastive Learning
497 Towards Paralinguistic-Only Speech Representations for End-to-End Speech Emotion Recognition
1360 SOT: Self-Supervised Learning-Assisted Optimal Transport for Unsupervised Adaptive Speech Emotion Recognition
2464 On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition arXiv
830 Speaking State Decoder with Transition Detection for Next Speaker Prediction
1507 What are Differences? Comparing DNN and Human by their Performance and Characteristics in Speaker Age Estimation
846 Effects of Perceived Gender on the Perceived Social Function of Laughter
1999 Implicit Phonetic Information Modeling for Speech Emotion Recognition idiap
1034 Computation and Memory Efficient Noise Adaptation of Wav2Vec2.0 for Noisy Speech Emotion Recognition with Skip Connection Adapters
300 Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions
1108 Preference Learning Labels by Anchoring on Consecutive Annotations
2561 Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks arXiv
543 Learning Local to Global Feature Aggregation for Speech Emotion Recognition arXiv
842 Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition

Speech Enhancement and Denoising

# Title Repo Paper
1088 Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation arXiv
514 TaylorBeamixer: Learning Taylor-Inspired All-Neural Multi-Channel Speech Enhancement from Beam-Space Dictionary Perspective GitHub Page arXiv
865 MFT-CRN: Multi-Scale Fourier Transform for Monaural Speech Enhancement
1265 Variance-Preserving-based Interpolation Diffusion Models for Speech Enhancement arXiv
318 Multi-Input Multi-Output Complex Spectral Mapping for Speaker Separation
992 Short-Term Extrapolation of Speech Signals using Recursive Neural Networks in the STFT Domain

Speech Synthesis: Evaluation

# Title Repo Paper
1843 Listener Sensitivity to Deviating Obstruents in WaveNet
981 How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics arXiv
2014 MOS vs. AB: Evaluating Text-to-Speech Systems Reliably using Clustered Standard Errors
851 RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting
2013 Can Better Perception Become a Disadvantage? Synthetic Speech Perception in Congenitally Blind Users
1076 Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech arXiv

End-to-End Spoken Dialog Systems

# Title Repo Paper
1799 Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding GitHub Page arXiv
1760 Improving End-to-End SLU performance with Prosodic Attention and Distillation GitHub arXiv
2575 Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding
758 Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding
2018 ConvKT: Conversation-Level Knowledge Transfer for Context Aware End-to-End Spoken Language Understanding
41 GhostT5: Generate More Features with Cheap Operations to Improve Textless Spoken Question Answering

Biosignal-enabled Spoken Communication

# Title Repo Paper
278 Obstructive Sleep Apnea Detection using Pretrained Speech Representations
620 EEG-based Auditory Attention Detection with Spatiotemporal Graph and Graph Convolutional Network
1966 Silent Speech Recognition with Articulator Positions Estimated from Tongue Ultrasound and Lip Video
1377 Auditory Attention Detection in Real-Life Scenarios using Common Spatial Patterns from EEG
1381 Diff-E: Diffusion-based Learning for Decoding Imagined Speech EEG GitHub
40 Towards Ultrasound Tongue Image Prediction from EEG During Speech Production GitHub arXiv
1607 Adaptation of Tongue Ultrasound-based Silent Speech Interfaces using Spatial Transformer Networks arXiv
174 STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks
1881 Spanish Phone Confusion Analysis for EMG-based Silent Speech Interfaces
805 Hybrid Silent Speech Interface through Fusion of Electroencephalography and Electromyography

Neural-based Speech and Acoustic Analysis

# Title Repo Paper
1968 Can Self-Supervised Neural Representations Pre-trained on Human Speech Distinguish Animal Callers? GitHub arXiv
2342 Discovering COVID-19 Coughing and Breathing Patterns from Unlabeled Data using Contrastive Learning with Varying Pre-Training Domains arXiv
330 Background-aware Modeling for Weakly Supervised Sound Event Detection
1065 How to (Virtually) Train Your Speaker Localizer GitHub arXiv
2271 MMER: Multimodal Multi-task Learning for Speech Emotion Recognition GitHub arXiv
909 A Multi-task Learning Framework for Sound Event Detection using High-Level Acoustic Characteristics of Sounds arXiv

DiGo - Dialog for Good: Speech and Language Technology for Social Good

# Title Repo Paper
2194 A Multimodal Investigation of Speech, Text, Cognitive and Facial Video Features for Characterizing Depression with and without Medication PDF
307 Understanding Disrupted Sentences using Underspecified Abstract Meaning Representation GitHub Amazon Science
2109 Developing Speech Processing Pipelines for Police Accountability arXiv
2086 Prosody-Controllable Gender-Ambiguous Speech Synthesis: A Tool for Investigating Implicit Bias in Speech Perception
848 Affective Attributes of French Caregivers' Professional Speech

Spoken Language Processing: Translation, Information Retrieval, Summarization, Resources, and Evaluation

# Title Repo Paper
180 Pragmatic Pertinence: A Learnable Confidence Metric to Assess the Subjective Quality of LM-Generated Text
2078 ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition arXiv
916 BASS: Block-wise Adaptation for Speech Summarization
1258 Speaker Tracking using Graph Attention Networks with Varying Duration Utterances in Multi-Channel Naturalistic Data: Fearless Steps Apollo 11 Audio Corpus
36 Combining Language Corpora in a Japanese Electromagnetic Articulography Database for Acoustic-to-Articulatory Inversion
523 A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition
2174 Large Dataset Generation of Synchronized Music Audio and Lyrics at Scale using Teacher-Student Paradigm
483 Enc-Dec RNN Acoustic Word Embeddings Learned via Pairwise Prediction GitHub
864 Query based Acoustic Summarization for Podcasts
1242 Spot Keywords from Very Noisy and Mixed Speech arXiv
891 Knowledge Distillation on Joint Task End-to-End Speech Translation Amazon Science
343 Investigating Pre-trained Audio Encoders in the Low-Resource Condition GitHub arXiv
1718 Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target arXiv
823 MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information GitHub arXiv
1674 CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition Web Page arXiv
1762 Improving Zero-Shot Cross-Domain Slot Filling via Transformer-based Slot Semantics Fusion
619 Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer
1468 Boosting Punctuation Restoration with Data Generation and Reinforcement Learning
695 J-ToneNet: A Transformer-based Encoding Network for Improving Tone Classification in Continuous Speech via F0 Sequences
1152 Towards Cross-Language Prosody Transfer for Dialog
2506 Strategies for Improving Low Resource Speech to Text Translation Relying on Pre-trained ASR Models arXiv
1980 ITALIC: An Italian Intent Classification Dataset GitHub
1778 Perceptual and Task-Oriented Assessment of a Semantic Metric for ASR Evaluation
1466 How ChatGPT is Robust for Spoken Language Understanding?
1233 GigaST: A 10,000-hour Pseudo Speech Translation Corpus GitHub Page arXiv
1570 Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism
2473 Crowdsource-based Validation of the Audio Cocktail as a Sound Browsing Tool
1675 PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts
1358 Speech-to-Face Conversion using Denoising Diffusion Probabilistic Models
2255 Inter-Connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation arXiv
1068 How Does Pretraining Improve Discourse-aware Translation? arXiv
1135 PATCorrect: Non-Autoregressive Phoneme-Augmented Transformer for ASR Error Correction arXiv
161 Model-assisted Lexical Tone Evaluation of Three-Year-Old Chinese-Speaking Children by also Considering Segment Production
1392 Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding GitHub arXiv
1582 Joint Time and Frequency Transformer for Chinese Opera Classification
116 AdaMS: Deep Metric Learning with Adaptive Margin and Adaptive Scale for Acoustic Word Discrimination arXiv
2252 Investigating Reproducibility at Interspeech Conferences: A Longitudinal and Comparative Perspective arXiv
2250 Combining Heterogeneous Structures for Event Causality Identification
1208 An Efficient Approach for the Automated Segmentation and Transcription of the People's Speech Corpus
1425 Diverse Feature Mapping and Fusion via Multitask Learning for Multilingual Speech Emotion Recognition
903 Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text GitHub arXiv
466 Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin GitHub arXiv
1878 Efficient Adaptation of Spoken Language Understanding based on End-to-End Automatic Speech Recognition
597 PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords GitHub
69 Mix before Align: Towards Zero-Shot Cross-Lingual Sentiment Analysis via Soft-Mix and Multi-View Learning
170 AlignAtt: using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation arXiv
2225 Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff
1979 Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages GitHub arXiv

Speech, Voice, and Hearing Disorders

# Title Repo Paper
2421 Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test arXiv
2198 Multimodal Locally Enhanced Transformer for Continuous Sign Language Recognition
1759 Towards Supporting an Early Diagnosis of Multiple Sclerosis using Vocal Features
1891 Whisper Features for Dysarthric Severity-Level Classification
2191 A New Benchmark of Aphasia Speech Recognition and Detection based on E-Branchformer and Multi-task Learning GitHub GitHub arXiv
222 Dysarthric Speech Recognition, Detection and Classification using Raw Phase and Magnitude Spectra
2026 A Stutter Seldom Comes Alone - Cross-Corpus Stuttering Detection as a Multi-label Problem arXiv
1542 Transfer Learning to Aid Dysarthria Severity Classification for Patients with Amyotrophic Lateral Sclerosis
2203 DuTa-VC: A Duration-aware Typical-to-Atypical Voice Conversion Approach with Diffusion Probabilistic Model arXiv
201 CNVVE: Dataset and Benchmark for Classifying Non-verbal Voice GitHub University of Southampton
1541 Arabic Dysarthric Speech Recognition using Adversarial and Signal-based Augmentation GitHub arXiv
1887 Weakly-Supervised Forced Alignment of Disfluent Speech using Phoneme-level Modeling GitHub arXiv
1998 Glottal Source Analysis of Voice Deficits in Basal Ganglia Dysfunction: Evidence from de novo Parkinson's Disease and Huntington's Disease
2478 An Analysis of Glottal Features of Chronic Kidney Disease Speech and its Application to CKD Detection
983 Weakly Supervised Glottis Segmentation in High-Speed Video Endoscopy using Bounding Box Labels
1669 Investigating the Dynamics of Hand and Lips in French Cued Speech using Attention Mechanisms and CTC-based Decoding arXiv
670 Hearing Loss Affects Emotion Perception in Older Adults: Evidence from a Prosody-Semantics Stroop Task
554 Cochlear-Implant Listeners Listening to Cochlear-Implant Simulated Speech
2168 Validation of a Task-Independent Cepstral Peak Prominence Measure with Voice Activity Detection
1679 Score-balanced Loss for Multi-aspect Pronunciation Assessment GitHub arXiv
2108 Federated Learning for Secure Development of AI Models for Parkinson's Disease Detection using Speech from Different Languages arXiv
652 F0inTFS: A Lightweight Periodicity Enhancement Strategy for Cochlear Implants
1678 Differentiating Acoustic and Physiological Features in Speech for Hypoxia Detection GitHub HAL Science
786 Mandarin Electrolaryngeal Speech Voice Conversion using Cross-Domain Features arXiv
866 Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion arXiv
1744 Which Aspects of Motor Speech Disorder are Captured by Mel Frequency Cepstral Coefficients? Evidence from the Change in STN-DBS Conditions in Parkinson's Disease
1096 Detecting Manifest Huntington's Disease using Vocal Data
1623 Exploring Multi-Task Learning and Data Augmentation in Dementia Detection with Self-Supervised Pre-trained Models

Spoken Term Detection and Voice Search

# Title Repo Paper
478 Matching Latent Encoding for Audio-Text based Keyword Spotting arXiv
1215 Self-Paced Pattern Augmentation for Spoken Term Detection in Zero-Resource
2362 On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation Amazon Science
90 Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics
689 Improving Small Footprint Few-Shot Keyword Spotting with Supervision on Auxiliary Data
2222 Robust Keyword Spotting for Noisy Environments by Leveraging Speech Enhancement and Speech Presence Probability

Models for Streaming ASR

# Title Repo Paper
831 Enhancing the Unified Streaming and Non-Streaming Model with Contrastive Learning arXiv
1497 ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs arXiv
361 Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation arXiv
1129 DCTX-Conformer: Dynamic Context Carry-over for Low Latency Unified Streaming and Non-Streaming Conformer arXiv
1121 Knowledge Distillation from Non-Streaming to Streaming ASR Encoder using Auxiliary Non-Streaming Layer
884 Adaptive Contextual Biasing for Transducer based Streaming Speech Recognition arXiv

Source Separation

# Title Repo Paper
1753 Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model GitHub arXiv
1389 Remixing-based Unsupervised Source Separation from Scratch
577 CAPTDURE: Captioned Sound Dataset of Single Sources arXiv
488 Recursive Sound Source Separation with Deep Learning-based Beamforming for Unknown Number of Sources
2537 Multi-Channel Speech Separation with Cross-Attention and Beamforming
185 Background-Sound Controllable Voice Source Separation

Speech Perception

# Title Repo Paper
1922 A Neural Architecture for Selective Attention to Speech Features
1122 Quantifying Informational Masking due to Masker Intelligibility in Same-Talker Speech-in-Speech Perception
1476 On the Benefits of Self-Supervised Learned Speech Representations for Predicting Human Phonetic Misperceptions
2154 Predicting Perceptual Centers Located at Vowel Onset in German Speech using Long Short-Term Memory Networks
63 Exploring the Mutual Intelligibility Breakdown Caused by Sculpting Speech from a Competing Speech Signal
2103 Perception of Incomplete Voicing Neutralization of Obstruents in Tohoku Japanese

Phonetics and Phonology: Languages and Varieties

# Title Repo Paper
1879 The Emergence of Obstruent-Intrinsic f0 and VOT as Cues to the Fortis/Lenis Contrast in West Central Bavarian
431 〈'〉 in Tsimane': A Preliminary Investigation
2200 Segmental Features of Brazilian (Santa Catarina) Hunsrik
2337 Opening or Closing? An Electroglottographic Analysis of Voiceless Coda Consonants in Australian English
295 Increasing Aspiration of Word-Medial Fortis Plosives in Swiss Standard German
1456 Lexical Stress and Velar Palatalization in Italian: A Spatio-Temporal Interaction

Speaker and Language Identification

# Title Repo Paper
1989 Vietnam-Celeb: A Large-Scale Dataset for Vietnamese Speaker Recognition
2254 What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model GitHub Page arXiv
241 The 2022 NIST Language Recognition Evaluation arXiv
1725 ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention GitHub arXiv
402 Branch-ECAPA-TDNN: A Parallel Branch Architecture to Capture Local and Global Features for Speaker Verification
2052 Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech arXiv
2569 Wavelet Scattering Transform for Improving Generalization in Low-Resourced Spoken Language Identification
1407 A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-trained General Purpose Speech Model GitHub arXiv
2272 HABLA: A Dataset of Latin American Spanish Accents for Voice Anti-Spoofing
1702 Self-Supervised Learning Representation based Accent Recognition with Persistent Accent Memory
800 Extremely Low Bit Quantization for Mobile Speaker Verification Systems Under 1MB Memory
1974 Unsupervised Out-of-Distribution Dialect Detection with Mahalanobis Distance arXiv
105 pyannote.audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark and Recipe GitHub PDF
1524 Model Compression for DNN-based Speaker Verification using Weight Quantization arXiv
1354 Multi-Resolution Approach to Identification of Spoken Languages and to Improve Overall Language Diarization System using Whisper Model
125 Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms arXiv
849 Dynamic Fully-Connected Layer for Large-Scale Speaker Verification
1314 Mutual Information-based Embedding Decoupling for Generalizable Speaker Verification
1206 TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection arXiv
777 ECAPA++: Fine-grained Deep Embedding Learning for TDNN based Speaker Verification
100 Fooling Speaker Identification Systems with Adversarial Background Music
574 Target Active Speaker Detection with Audio-Visual Cues GitHub arXiv
2401 Improving End-to-End Neural Diarization using Conversational Summary Representations arXiv
2039 Phase Perturbation Improves Channel Robustness for Speech Spoofing Countermeasures GitHub arXiv
210 Improving Training Datasets for Resource-constrained Speaker Recognition Neural Networks
1498 Instance-based Temporal Normalization for Speaker Verification
881 On the Robustness of Wav2Vec 2.0 based Speaker Recognition Systems
697 P-Vectors: A Parallel-coupled TDNN/Transformer Network for Speaker Verification GitHub arXiv
844 Reversible Neural Networks for Memory-Efficient Speaker Verification
452 Robust Training for Speaker Verification against Noisy Labels GitHub arXiv
1404 Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization
1217 Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022 arXiv
1648 Describing the Phonetics in the Underlying Speech Attributes for Deep and Interpretable Speaker Recognition GitHub
1214 Range-based Equal Error Rate for Spoof Localization arXiv
1888 Exploring the English Accent-Independent Features for Speech Emotion Recognition using Filter and Wrapper-based Methods for Feature Selection
205 Powerset Multi-Class Cross Entropy Loss for Neural Speaker Diarization
394 A Method of Audio-Visual Person Verification by Mining Connections between Time Series
1249 Group GMM-ResNet for Detection of Synthetic Speech Attacks
605 One-Step Knowledge Distillation and Fine-Tuning in using Large Pre-trained Self-Supervised Learning Models for Speaker Verification GitHub arXiv
409 Defense Against Adversarial Attacks on Audio DeepFake Detection GitHub arXiv
1820 A Conformer-based Classifier for Variable-Length Utterance Processing in Anti-Spoofing
1557 Conformer-based Language Embedding with Self-Knowledge Distillation for Spoken Language Identification
2419 CommonAccent: Exploring Large Acoustic Pre-trained Models for Accent Classification based on Common Voice ResearchGate
266 From Adaptive Score Normalization to Adaptive Data Normalization for Speaker Verification Systems
1513 CAM++: A Fast and Efficient Network for Speaker Verification using Context-aware Masking GitHub arXiv
1928 North Sámi Dialect Identification with Self-Supervised Speech Models arXiv
2289 Encoder-Decoder Multimodal Speaker Change Detection arXiv
1603 Disentangled Representation Learning for Multilingual Speaker Recognition WEB Page arXiv
2310 A Compact End-to-End Model with Local and Global Context for Spoken Language Identification
1005 On the Robustness of Arabic Speech Dialect Identification arXiv
927 Adaptive Neural Network Quantization for Lightweight Speaker Verification
1205 Adversarial Diffusion Probability Model For Cross-Domain Speaker Verification Integrating Contrastive Loss
1554 Chinese Dialect Recognition based on Transfer Learning
270 Spoofing Attacker also Benefits from Self-Supervised Pretrained Model arXiv
854 Label aware Speech Representation Learning for Language Identification arXiv
1761 Exploring the Impact of Back-end Network on Wav2vec 2.0 for Dialect Identification
453 Improving Speaker Verification with Self-pretrained Transformer Models GitHub arXiv
155 Description and Analysis of the KPT system for NIST Language Recognition Evaluation 2022
372 Handling the Alignment for Wake Word Detection: A Comparison Between Alignment-based, Alignment-Free and Hybrid Approaches arXiv

Speech Synthesis and Voice Conversion

# Title Repo Paper
2336 Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction
160 Streaming Parrotron for On-Device Speech-to-Speech Conversion arXiv
2407 Exploiting Emotion Information in Speaker Embeddings for Expressive Text-to-Speech
2518 E2E-S2S-VC: End-to-End Sequence-to-Sequence Voice Conversion
2403 DC CoMix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer GitHub arXiv
419 Voice Conversion with Just Nearest Neighbors GitHub Page
1193 CFVC: Conditional Filtering for Controllable Voice Conversion
1157 DualVC: Dual-mode Voice Conversion using Intra-Model Knowledge Distillation and Hybrid Predictive Coding GitHub Page arXiv
39 Attention-based Interactive Disentangling Network for Instance-Level Emotional Voice Conversion
836 ALO-VC: Any-to-Any Low-Latency One-Shot Voice Conversion GitHub Page arXiv
1978 Evaluating and Reducing the Distance between Synthetic and Real Speech Distributions arXiv
2202 Decoupling Segmental and Prosodic Cues of Non-Native Speech through Vector Quantization
2383 VC-T: Streaming Voice Conversion based on Neural Transducer
191 Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion Preserving Voice Conversion
1788 ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed GitHub arXiv
1356 Reverberation-Controllable Voice Conversion using Reverberation Time Estimator
2558 Cross-Utterance Conditioned Coherent Speech Editing

Speech and Language in Health: from Remote Monitoring to Medical Conversations

# Title Repo Paper
2287 An Automatic Multimodal Approach to Analyze Linguistic and Acoustic Cues on Parkinson's Disease Patients
1332 Personalization for Robust Voice Pathology Detection in Sound Waves GitHub
2249 Integrated and Enhanced Pipeline System to Support Spoken Language Analytics for Screening Neurocognitive Disorders
1990 Capturing Mismatch between Textual and Acoustic Emotion Expressions for Mood Identification in Bipolar Disorder Pdf
296 FTA-Net: A Frequency and Time Attention Network for Speech Depression Detection
1709 Bayesian Networks for the Robust and Unbiased Prediction of Depression and its Symptoms Utilizing Speech and Multimodal Data Pdf
1263 Hyper-Parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition arXiv
1721 Classifying Depression Symptom Severity: Assessment of Speech Representations in Personalized and Generalized Machine Learning Models
1946 Active Learning for Abnormal Lung Sound Data Curation and Detection in Asthma
2079 Automatic Assessment of Alzheimer's across Three Languages using Speech and Language Features
301 On-the-Fly Feature based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition arXiv
1722 Relationship between LTAS-based Spectral Moments and Acoustic Parameters of Hypokinetic Dysarthria in Parkinson's Disease
963 Respiratory Distress Estimation in Human-Robot Interaction Scenario
1771 Prediction of the Gender-based Violence Victim Condition using Speech: What do Machine Learning Models rely on?
1916 Whisper Encoder features for Infant Cry Classification
1997 Classifying Dementia in the Presence of Depression: A Cross-Corpus Study
297 Exploiting Cross-Domain and Cross-Lingual Ultrasound Tongue Imaging Features for Elderly and Dysarthric Speech Recognition arXiv
464 Multi-Class Detection of Pathological Speech with Latent Features: How does It Perform on Unseen Data? arXiv
2002 Responsiveness, Sensitivity and Clinical Utility of Timing-Related Speech Biomarkers for Remote Monitoring of ALS Disease Progression Pdf
322 Use of Speech Impairment Severity for Dysarthric Speech Recognition arXiv
721 MMLung: Moving Closer to Practical Lung Health Estimation using Smartphones Pdf
913 Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization
2101 Non-Uniform Speaker Disentanglement for Depression Detection from Raw Speech Signals arXiv
753 PoCaPNet: A Novel Approach for Surgical Phase Recognition using Speech and X-Ray Images GitHub arXiv
2100 Combining Multiple Multimodal Speech Features into an Interpretable Index Score for Capturing Disease Progression in Amyotrophic Lateral Sclerosis Pdf
1438 The MASCFLICHT Corpus: Face Mask Type and Coverage Area Recognition from Speech
1435 Towards Reference Speech Characterization for Health Applications
2146 Automatic Classification of Hypokinetic and Hyperkinetic Dysarthria based on GMM-Supervectors
947 Towards Robust Paralinguistic Assessment for Real-World Mobile Health (mHealth) Monitoring: an Initial Study of Reverberation Effects on Speech arXiv

Novel Transformer Models for ASR

# Title Repo Paper
2228 Conmer: Streaming Conformer without Self-Attention for Interactive Voice Assistants Amazon Science
1255 Intra-Ensemble: A New Method for Combining Intermediate Outputs in Transformer-based Automatic Speech Recognition
1194 A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks GitHub GitHub arXiv
1611 HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition arXiv
893 Memory-Augmented Conformer for Improved End-To-End Long-form ASR
552 Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems arXiv

Speaker Recognition

# Title Repo Paper
1294 An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification GitHub arXiv
1286 A Study on Visualization of Voiceprint Feature
1083 VoxTube: A Multilingual Speaker Recognition Dataset
1298 Visualizing Data Augmentation in Deep Speaker Recognition arXiv
1565 Ordered and Binary Speaker Embedding arXiv
2031 Self-FiLM: Conditioning GANs with Self-Supervised Representations for Bandwidth Extension based Speaker Recognition arXiv
1202 Curriculum Learning for Self-Supervised Speaker Verification arXiv
1558 Introducing Self-Supervised Phonetic Information for Text-Independent Speaker Verification
1379 A Teacher-Student Approach for Extracting Informative Speaker Embeddings from Speech Mixtures arXiv
1479 Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification arXiv

Cross-lingual and Multilingual ASR

# Title Repo Paper
1630 Fast and Efficient Multilingual Self-Supervised Pre-training for Low-Resource Speech Recognition
1338 UniSplice: Universal Cross-Lingual Data Splicing for Low-Resource ASR
772 Allophant: Cross-Lingual Phoneme Recognition with Articulatory Attributes GitHub GitHub arXiv
97 Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR System arXiv
1061 Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-training for Adaptation to Unseen Languages arXiv
1444 DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model GitHub arXiv

Voice Conversion

# Title Repo Paper
251 Emotional Voice Conversion with Semi-Supervised Generative Modeling
817 Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-Shot Speaker Adaptation
215 S2CD-VC: Self-Heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion
1508 Flow-VAE VC: End-to-End Flow Framework with Contrastive Loss for Zero-Shot Voice Conversion
1602 Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation GitHub Page arXiv
2298 End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions GitHub arXiv

Pathological Speech Analysis

# Title Repo Paper
2093 Multimodal Assessment of Bulbar Amyotrophic Lateral Sclerosis (ALS) using a Novel Remote Speech Assessment App
2181 On the use of High Frequency Information for Voice Pathology Classification
1784 Do Phonatory Features Display Robustness to Characterize Parkinsonian Speech Across Corpora?
2531 Severity Classification of Parkinson's Disease from Speech using Single Frequency Filtering-based Features
1915 Comparison of Acoustic Measures of Dysphonia in Parkinson's Disease and Huntington's Disease: Effect of Sex and Speaking Task
1734 Alzheimer Disease Classification through ASR-based Transcriptions: Exploring the Impact of Punctuation and Pauses GitHub arXiv
1574 A Pipeline to Evaluate the Effects of Noise on Machine Learning Detection of Laryngeal Cancer GitHub
2474 ReCLR: Reference-Enhanced Contrastive Learning of Audio Representation for Depression Detection
234 Automated Multiple Sclerosis Screening based on Encoded Speech Representations
1934 Cross-Lingual Features for Alzheimer's Dementia Detection from Speech
1653 Careful Whisper - Leveraging Advances in Automatic Speech Recognition for Robust and Interpretable Aphasia Subtype Classification
1868 Behavioral Analysis of Pathological Speaker Embeddings of Patients During Oncological Treatment of Oral Cancer

Multimodal Speech Emotion Recognition

# Title Repo Paper
1832 LanSER: Language-Model Supported Speech Emotion Recognition
463 Fine-tuned RoBERTa Model with a CNN-LSTM Network for Conversational Emotion Recognition
1591 Emotion Label Encoding using Word Embeddings for Speech Emotion Recognition
2444 Discrimination of the Different Intents Carried by the Same Text through Integrating Multimodal Information
510 Meta-Domain Adversarial Contrastive Learning for Alleviating Individual Bias in Self-Sentiment Predictions
413 SWRR: Feature Map Classifier Based on Sliding Window Attention and High-Response Feature Reuse for Multimodal Emotion Recognition

Phonetics, Phonology, and Prosody

# Title Repo Paper
1443 Effects of Meter, Genre and Experience on Pausing, Lengthening and Prosodic Phrasing in German Poetry Reading
1142 Comparing First Spectral Moment of Australian English /s/ between Straight and Gay Voices using Three Analysis Window Sizes
2584 Universal Automatic Phonetic Transcription into the International Phonetic Alphabet GitHub
2134 Voice Twins: Discovering Extremely Similar-Sounding, Unrelated Speakers
1042 Filling the Population Statistics Gap: Swiss German Reference Data on F0 and Speech Tempo for Forensic Contexts
1619 Investigating the Syntax-Discourse Interface in the Phonetic Implementation of Discourse Markers
2214 Evaluation of a Forensic Automatic Speaker Recognition System with Emotional Speech Recordings
1052 An Outlier Analysis of Vowel Formants from a Corpus Phonetics Pipeline GitHub Pdf
340 The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features
1880 Beatboxing Kick Drum Kinematics
536 Effects of Hearing Loss and Amplification on Mandarin Consonant Perception
2020 An Acoustic Analysis of Fricative Variation in Three Accents of English
109 Acoustic Cues to Stress Perception in Spanish – a Mismatch Negativity Study
976 Bulgarian Unstressed Vowel Reduction: Received Views vs Corpus Findings
1764 An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations arXiv
498 Identifying Stable Sections for Formant Frequency Extraction of French Nasal Vowels based on Difference Thresholds
1903 Evaluation of Delexicalization Methods for Research on Emotional Speech
1772 Nonbinary American English Speakers Encode Gender in Vowel Acoustics
44 Coarticulation of Sibe Vowels and Dorsal Fricatives in Spontaneous Speech: An Acoustic Study
1013 Using Speech Synthesis to Explain Automatic Speaker Recognition: A New Application of Synthetic Speech
2534 Same F0, Different Tones: A Multidimensional Investigation of Zhangzhou Tones
1985 Discovering Phonetic Feature Event Patterns in Transformer Embeddings
2204 A System for Generating Voice Source Signals that Implements the Transformed LF-Model Parameter Control
2352 Speaker-Independent Speech Inversion for Estimation of Nasalance arXiv
1359 Effects of Tonal Coarticulation and Prosodic Positions on Tonal Contours of Low Rising Tones: In the Case of Xiamen Dialect arXiv
2187 Durational and Non-Durational Correlates of Lexical and Derived Geminates in Arabic
68 Mapping Phonemes to Acoustic Symbols and Codes using Synchrony in Speech Modulation Vectors Estimated by the Travellingwave Filter Bank
1480 Rhythmic Characteristics of L2 German Speech by Advanced Chinese Learners
1538 (Dis)agreement and Preference Structure are Reflected in Matching Along Distinct Acoustic-Prosodic Features
995 Vowel Reduction by Greek-Speaking Children: The Effect of Stress and Word Length
1822 Pitch Distributions in a Very Large Corpus of Spontaneous Finnish Speech
828 Speech Enhancement Patterns in Human-Robot Interaction: A Cross-Linguistic Perspective

Speech Coding: Privacy

# Title Repo Paper
1026 Masking Kernel for Learning Energy-Efficient Representations for Speaker Recognition and Mobile Health arXiv
727 eSTImate: A Real-Time Speech Transmission Index Estimator with Speech Enhancement Auxiliary Task using Self-Attention Feature Pyramid Network
815 Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement arXiv
2138 Privacy-Preserving Representation Learning for Speech Understanding
448 Vocoder Drift in X-Vector–based Speaker Anonymization GitHub arXiv
703 Malafide: A Novel Adversarial Convolutive Noise Attack Against Deepfake and Spoofing Detection Systems arXiv

Analysis of Neural Speech Representations

# Title Repo Paper
1087 Speech Self-Supervised Representation Bench-Marking: Are We Doing it Right? GitHub Page arXiv
383 An Extension of Disentanglement Metrics and its Application to Voice
2131 An Information-Theoretic Analysis of Self-Supervised Discrete Representations of Speech GitHub arXiv
1823 SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge? GitHub arXiv
1418 Comparison of GIF- and SSL-based Features in Pathological Voice Detection
1617 What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions

End-to-end ASR

# Title Repo Paper
1640 End-to-End Joint Target and Non-Target Speakers ASR arXiv
144 Improving Frame-Level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition arXiv
564 Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-Level Timestamp Prediction
101 Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition GitHub arXiv
906 Text-Only Domain Adaptation for End-to-End ASR using Integrated Text-to-Mel-Spectrogram Generator arXiv
142 Multi-Pass Training and Cross-Information Fusion for Low-Resource End-to-End Accented Speech Recognition arXiv

Spoken Language Understanding, Summarization, and Information Retrieval

# Title Repo Paper
461 Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling
277 Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models
1307 Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization arXiv
1136 Audio Retrieval with WavText5K and CLAP Training GitHub arXiv
242 Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding GitHub arXiv
1652 Contrastive Disentangled Learning for Memory-Augmented Transformer

Invariant and Robust Pre-trained Acoustic Models

# Title Repo Paper
438 ProsAudit, a Prosodic Benchmark for Self-Supervised Speech Models arXiv
871 Self-Supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces arXiv
1862 Evaluating Context-Invariance in Unsupervised Speech Representations GitHub arXiv
1390 CoBERT: Self-Supervised Speech Representation Learning through Code Representation Learning GitHub arXiv
847 Self-Supervised Fine-tuning for Improved Content Representations by Speaker-Invariant Clustering GitHub arXiv
359 Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder

Speech Synthesis: Representation Learning

# Title Repo Paper
1571 Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech
2313 Adapter-based Extension of Multi-Speaker Text-To-Speech Model for New Speakers arXiv
2574 SALTTS: Leveraging Self-Supervised Speech Representations for Improved Text-to-Speech Synthesis
2326 UnitSpeech: Speaker-Adaptive Speech Synthesis with Untranscribed Data GitHub Page
677 LightVoc: an Upsampling-Free GAN Vocoder based on Conformer and Inverse Short-time Fourier Transform
1095 ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings GitHub Page arXiv

Speech Perception, Production, and Acquisition

# Title Repo Paper
1330 Human Transcription Quality Improvement Amazon Science
1604 The Effect of Masking Noise on Listeners' Spectral Tilt Preferences
1967 The Effect of Whistled Vowels on Whistled Word Categorization for Naive Listeners
1481 Automatic Deep Neural Network-based Segmental Pronunciation Error Detection of L2 English Speech (L1 Bengali)
1662 The Effect of Stress on Mandarin Tonal Perception in Continuous Speech for Spanish-Speaking Learners
1918 Combining Acoustic and Aerodynamic Data Collection: A Perceptual Evaluation of Acoustic Distortions
953 Estimating Virtual Targets for Lingual Stop Consonants using General Tau Theory
1931 Using Random Forests to Classify Language as a Function of Syllable Timing in Two Groups: Children with Cochlear Implants and with Normal Hearing
2256 An Improved End-to-End Audio-Visual Speech Recognition Model
1954 What Influences the Foreign Accent Strength? Phonological and Grammatical Errors in the Perception of Accentedness
2077 Investigating the Perception Production Link through Perceptual Adaptation and Phonetic Convergence
1385 Emotion Prompting for Speech Emotion Recognition
1196 Speech-in-Speech Recognition is Modulated by Familiarity to Dialect
673 BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-Talker Conditions GitHub arXiv
2046 Are Retroflex-to-Dental Sibilant Substitutions in Polish Children's Speech an Example of a Covert Contrast? A Preliminary Acoustic Study
1123 First Language Effects on Second Language Perception: Evidence from English Low-Vowel Nasal Sequences Perceived by L1 Mandarin Chinese Listeners
2247 Motor Control Similarity between Speakers Saying "a Souk" using Inverse Atlas Tongue Modeling
910 Assessing Phrase Break of ESL Speech with Pre-trained Language Models and Large Language Models arXiv
317 A Relationship between Vocal Fold Vibration and Droplet Production
803 Audio, Visual and Audiovisual Intelligibility of Vowels Produced in Noise
593 Computational Modeling of Auditory Brainstem Responses Derived from Modified Speech
1732 Leveraging Label Information for Multimodal Emotion Recognition
1465 Improving End-to-End Modeling for Mandarin-English Code-Switching using Lightweight Switch-Routing Mixture-of-Experts
1803 Frequency Patterns of Individual Speaker Characteristics at Higher and Lower Spectral Ranges
1818 Adaptation to Predictive Prosodic Cues in Non-Native Standard Dialect
1007 Head Movements in Two- and Four-Person Inter-Active Conversational Tasks in Noisy and Moderately Reverberant Conditions
334 Second Language Identification of Vietnamese Tones by Native Mandarin Learners
203 Nasal Vowel Production and Grammatical Processing in French-Speaking Children with Cochlear Implants and Normal-Hearing Peers
412 Emotion Classification with EEG Responses Evoked by Emotional Prosody of Speech
145 L2-Mandarin Regional Accent Variability During Mandarin Tone-Word Training Facilitates English Listeners' Subsequent Tone Categorizations
1680 HumanDiffusion: Diffusion Model using Perceptual Gradients arXiv
2087 Queer Events, Relationships, and Sports: Does Topic Influence Speakers' Acoustic Expression of Sexual Orientation?
172 Optimal Control of Speech with Context-Dependent Articulatory Targets

Acoustic Model Adaptation for ASR

# Title Repo Paper
583 Factorised Speaker-Environment Adaptive Training of Conformer Speech Recognition Systems arXiv
1349 Text Only Domain Adaptation with Phoneme Guided Data Splicing for End-to-End Speech Recognition GitHub arXiv
327 Towards Cross-Lingual Cross-Age Adaptation for Low-Resource Elderly Speech Emotion Recognition GitHub arXiv
2215 Modular Domain Adaptation for Conformer-based Streaming ASR arXiv
2192 Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters arXiv
1282 SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization GitHub arXiv

Speech Synthesis: Expressivity

# Title Repo Paper
858 Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions GitHub
2242 Dual Audio Encoders based Mandarin Prosodic Boundary Prediction by using Multi-Granularity Prosodic Representations
645 NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS WEB Page arXiv
782 MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy GitHub Page arXiv
2469 Narrator or Character: Voice Modulation in an Expressive Multi-Speaker TTS
843 CASEIN: Cascading Explicit and Implicit Control for Fine-grained Emotion Intensity Regulation arXiv
1405 Semi-Supervised Learning for Continuous Emotional Intensity Controllable Speech Synthesis with Disentangled Representations arXiv
1905 Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
1460 ComedicSpeech: Adaptive Text to Speech For Stand-up Comedy in Low-Resource Scenario GitHub Page arXiv
1552 Neural Speech Synthesis with Enriched Phrase Boundaries
437 Cross-Lingual Prosody Transfer for Expressive Machine Dubbing arXiv
2178 Synthesis after a couple PINTs: Investigating the Role of Pause-Internal Phonetic Particles in Speech Synthesis and Perception GitHub
433 Accentor: An Explicit Lexical Stress Model for TTS Systems Pdf
1032 A Neural TTS System with Parallel Prosody Transfer from Unseen Speakers
715 Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model GitHub Page arXiv
289 Prosody Modeling with 3D Visual Information for Expressive Video Dubbing
1528 LightClone: Speaker-Guided Parallel Subnet Selection for Few-Shot Voice Cloning
1671 EE-TTS: Emphatic Expressive TTS with Linguistic Information GitHub Page arXiv
1673 Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS arXiv
122 ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading GitHub Page arXiv
1779 PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions GitHub Page arXiv
1639 Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models GitHub Page arXiv
2453 A Generative Framework for Conversational Laughter: Its "Language Model" and Laughter Sound Synthesis arXiv
1754 Towards Spontaneous Style Modeling with Semi-Supervised Pre-training for Conversational Text-to-Speech Synthesis GitHub Page
2072 Beyond Style: Synthesizing Speech with Pragmatic Functions WEB Page
965 eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer arXiv

Multi-modal Systems

# Title Repo Paper
1146 BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion GitHub Page arXiv
370 Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech based on Metric Learning arXiv
989 Whistle-to-Text: Automatic Recognition of the Silbo Gomero Whistled Language
663 A Novel Interpretable and Generalizable Re-Synchronization Model for Cued Speech based on a Multi-Cuer Corpus arXiv
668 Visually Grounded Few-Shot Word Acquisition with Fewer Shots arXiv
183 JAMFN: Joint Attention Multi-Scale Fusion Network for Depression Detection

Question Answering from Speech

# Title Repo Paper
1485 Prompt Guided Copy Mechanism for Conversational Question Answering
1240 Composing Spoken Hints for Follow-on Question Suggestion in Voice Assistants
1391 On Monotonic Aggregation for Open-Domain QA GitHub
2240 Question-Context Alignment and Answer-Context Dependencies for Effective Answer Sentence Selection arXiv
1606 Multi-Scale Attention for Audio Question Answering arXiv
539 Enhancing Visual Question Answering via Deconstructing Questions and Explicating Answers

Multi-talker Methods in Speech Processing

# Title Repo Paper
1749 SEF-Net: Speaker Embedding Free Target Speaker Extraction Network
1530 Overlap aware Continuous Speech Separation without Permutation Invariant Training Linfeng
1952 Cascaded Encoders for Fine-Tuning ASR Models on Overlapped Speech arXiv
2069 TokenSplit: using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
1422 Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator arXiv
2098 Time-Domain Transformer-based Audiovisual Speaker Separation
628 Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization arXiv
1502 Unsupervised Adaptation with Quality-aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization
1521 BA-SOT: Boundary-aware Serialized Output Training for Multi-Talker ASR arXiv
1172 Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation
975 Joint Compensation of Multi-Talker Noise and Reverberation for Speech Enhancement with Cochlear Implants using One or More Microphones
494 Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach
42 GPU-accelerated Guided Source Separation for Meeting Transcription GitHub arXiv
1280 Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition GitHub Page arXiv
2076 Directional Speech Recognition for Speaker Disambiguation and Cross-talk Suppression
1815 Mixture Encoder for Joint Speech Separation and Recognition arXiv


# Title Repo Paper
206 Aberystwyth English Pre-Aspiration in Apparent Time
1154 Speech Entrainment in Chinese Story-Style Talk Shows: The Interaction Between Gender and Role
1414 Sociodemographic and Attitudinal Effects on Dialect Speakers' Articulation of the Standard Language: Evidence from German-Speaking Switzerland
1704 Vowel Normalisation in Latent Space for Sociolinguistics

Speaker and Language Diarization

# Title Repo Paper
1228 Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor arXiv
1447 Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism
2367 The DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments GitHub Page arXiv
1982 Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction arXiv
1839 The SpeeD-ZevoTech Submission at DISPLACE 2023
656 End-to-End Neural Speaker Diarization with Absolute Speaker Loss

Anti-Spoofing for Speaker Verification

# Title Repo Paper
1402 Towards Single Integrated Spoofing-aware Speaker Verification Embeddings GitHub arXiv
1352 Pseudo-Siamese Network based Timbre-Reserved Black-Box Adversarial Attack in Speaker Identification arXiv
2335 Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion GitHub arXiv
1166 Robust Audio Anti-Spoofing Countermeasure with Joint Training of Front-end and Back-end Models
1537 Improved DeepFake Detection using Whisper Features GitHub arXiv
371 DoubleDeceiver: Deceiving the Speaker Verification System Protected by Spoofing Countermeasures

Speech Coding: Intelligibility

# Title Repo Paper
2209 On Training a Neural Residual Acoustic Echo Suppressor for Improved ASR
1429 Extending DNN-based Multiplicative Masking to Deep Subband Filtering for Improved Dereverberation GitHub Page arXiv
378 UnSE: Unsupervised Speech Enhancement using Optimal Transport
1130 MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation GitHub Page arXiv
2177 Causal Signal-based DCCRN with Overlapped-Frame Prediction for Online Speech Enhancement
1511 Gesper: A Restoration-Enhancement Framework for General Speech Reconstruction arXiv

New Computational Strategies for ASR Training and Inference

# Title Repo Paper
2183 A Metric-Driven Approach to Conformer Layer Pruning for Efficient ASR Inference
1981 Distillation Strategies for Discriminative Speech Recognition Rescoring arXiv
969 Another Point of View on Visual Speech Recognition
1062 RASR2: The RWTH ASR Toolkit for Generic Sequence-to-Sequence Speech Recognition arXiv
486 Streaming Speech-to-Confusion Network Speech Recognition arXiv
809 Accurate and Structured Pruning for Efficient Automatic Speech Recognition arXiv

MERLIon CCS Challenge: Multilingual Everyday Recordings - Language Identification On Code-Switched Child-Directed Speech

# Title Repo Paper
1446 MERLIon CCS Challenge: A English-Mandarin Code-Switching Child-directed Speech Corpus for Language Identification and Diarization GitHub arXiv
1335 Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech GitHub arXiv
1707 Investigating Model Performance in Language Identification: Beyond Simple Error Statistics arXiv
2533 Improving Wav2vec2-based Spoken Language Identification by Learning Phonological Features
2047 Language Identification Networks for Multilingual Everyday Recordings

Health-Related Speech Analysis

# Title Repo Paper
2038 Classification of Vocal Intensity Category from Speech using the Wav2vec2 and Whisper Embeddings
1668 The Effect of Clinical Intervention on the Speech of Individuals with PTSD: Features and Recognition Performances
470 Analysis and Automatic Prediction of Exertion from Speech: Contrasting Objective and Subjective Measures Collected while Running
894 The Androids Corpus: A New Publicly Available Benchmark for Speech based Depression Detection
658 Comparing Hand-Crafted Features to Spectrograms for Autism Severity Estimation
839 Acoustic Characteristics of Depression in Older Adults' Speech: The Role of Covariates

Automatic Audio Classification and Audio Captioning

# Title Repo Paper
943 Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning arXiv
1564 Adapting a ConvNeXt Model to Audio Classification on AudioSet GitHub arXiv
1610 Few-Shot Class-Incremental Audio Classification using Stochastic Classifier GitHub arXiv
1614 Enhance Temporal Relations in Audio Captioning with Sound Event Detection arXiv

Speech Synthesis

# Title Repo Paper
407 Epoch-based Spectrum Estimation for Speech
1996 OverFlow: Putting Flows on Top of Neural Transducers for Better TTS GitHub Page
1568 AdapterMix: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation GitHub arXiv
506 Prior-Free Guided TTS: An Improved and Efficient Diffusion-based Text-Guided Speech Synthesis
367 UnDiff: Unsupervised Voice Restoration with Unconditional Diffusion Model arXiv
1301 Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
1151 Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge GitHub Page arXiv
879 Towards Robust FastSpeech 2 by Modelling Residual Multimodality GitHub Page arXiv
1137 Real Time Spectrogram Inversion on Mobile Phone GitHub Page arXiv
58 Automatic Tuning of Loss Trade-offs without Hyper-Parameter Search in End-to-End Zero-Shot Speech Synthesis GitHub Page
2056 A Low-Resource Pipeline for Text-to-Speech from Found Data With Application to Scottish Gaelic
2173 Self-Supervised Solution to the Control Problem of Articulatory Synthesis
1128 Hierarchical Timbre-Cadence Speaker Encoder for Zero-Shot Speech Synthesis GitHub Page
754 ZET-Speech: Zero-Shot Adaptive Emotion-Controllable Text-to-Speech Synthesis with Diffusion and Style-based Models GitHub Page arXiv
690 Improving WaveRNN with Heuristic Dynamic Blending for Fast and High-Quality GPU Vocoding
194 Intelligible Lip-to-Speech Synthesis with Speech Units arXiv
1212 Parameter-Efficient Learning for Text-to-Speech Accent Adaptation GitHub Page
820 Controlling Formant Frequencies with Neural Text-to-Speech for the Manipulation of Perceived Speaker Age
2379 FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder with Multiple STFTs GitHub Page arXiv
1726 iSTFTNet2: Faster and more Lightweight iSTFT-based Neural Vocoder using 1D-2D CNN
534 VITS2: Improving Quality and Efficiency of Single Stage Text to Speech with Adversarial Learning and Architecture Design
1175 Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme

Speech Synthesis: Controllability and Adaptation

# Title Repo Paper
1608 HierVST: Hierarchical Adaptive Zero-Shot Voice Style Transfer GitHub Page
391 VISinger2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer GitHub Page
700 EdenTTS: A Simple and Efficient Parallel Text-to-Speech Architecture with Collaborative Duration-Alignment Learning
368 Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations
1020 Speech Inpainting: Context-based Speech Synthesis Guided by Video GitHub Page arXiv
2243 STEN-TTS: Improving Zero-Shot Cross-Lingual Transfer for Multi-Lingual TTS with Style-Enhanced Normalization Diffusion Framework

Search Methods and Decoding Algorithms for ASR

# Title Repo Paper
933 Average Token Delay: A Latency Metric for Simultaneous Translation arXiv
1450 Automatic Speech Recognition Transformer with Global Contextual Information Decoder
1333 Time-Synchronous One-Pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training
2065 Prefix Search Decoding for RNN Transducers
78 WhisperX: Time-Accurate Speech Transcription of Long-Form Audio GitHub arXiv
2449 Implementing Contextual Biasing in GPU Decoder for Online ASR GitHub arXiv

Speech Signal Analysis

# Title Repo Paper
2487 MF-PAM: Accurate Pitch Estimation through Periodicity Analysis and Multi-Level Feature Fusion arXiv
2211 Enhancing Speech Articulation Analysis using A Geometric Transformation of the X-ray Microbeam Dataset arXiv
1729 Matching Acoustic and Perceptual Measures of Phonation Assessment in Disordered Speech - A Case Study
283 Improved Contextualized Speech Representations for Tonal Analysis
1738 A Study on the Importance of Formant Transitions for Stop-Consonant Classification in VCV Sequence idiap
2229 FusedF0: Improving DNN-based F0 Estimation by Fusion of Summary-Correlograms and Raw Waveform Representations of Speech Signals Pdf

Connecting Speech-science and Speech-technology for Children's Speech

# Title Repo Paper
928 Using Commercial ASR Solutions to Assess Reading Skills in Children: A Case Report
907 Uncertainty Estimation for Connectionist Temporal Classification based Automatic Speech Recognition Pdf
2185 Speech Breathing Behavior During Pauses in Children
926 Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech Pdf
1924 Acoustic-to-Articulatory Speech Inversion Features for Mispronunciation Detection of /r/ in Child Speech Sound Disorders arXiv
978 BabySLM: Language-Acquisition-Friendly Benchmark of Self-Supervised Spoken Language Models GitHub arXiv
702 Data Augmentation for Children ASR and Child-adult Speaker Classification using Voice Conversion Methods
2236 Developmental Articulatory and Acoustic Features for Six to Ten Year Old Children
2251 Automatically Predicting Perceived Conversation Quality in a Pediatric Sample Enriched for Autism
1257 An Equitable Framework for Automatically Assessing Children's Oral Narrative Language Abilities
743 An Analysis of Goodness of Pronunciation for Child Speech
1569 Measuring Language Development from Child-centered Recordings
2057 Speaking Clearly, Understanding Better: Predicting the L2 Narrative Comprehension of Chinese Bilingual Kindergarten Children based on Speech Intelligibility using a Machine Learning Approach
312 Classifying Rhoticity of /r/ in Speech Sound Disorder using Age-and-Sex Normalized Formants arXiv
1273 Understanding Spoken Language Development of Children with ASD using Pre-trained Speech Embeddings arXiv
2099 Measuring Phonological Precision in Children with Cleft Lip and Palate
937 A Study on using Duration and Formant Features in Automatic Detection of Speech Sound Disorder in Children
1873 Influence of Utterance and Speaker Characteristics on the Classification of Children with Cleft Lip and Palate
1882 Prospective Validation of Motor-based Intervention with Automated Mispronunciation Detection of Rhotics in Residual Speech Sound Disorders arXiv

Dialog Management

# Title Repo Paper
2238 Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning arXiv
2525 An Autoregressive Conversational Dynamics Model for Dialogue Systems
1983 Style-Transfer based Speech and Audio-Visual Scene Understanding for Robot Action Sequence Acquisition from Videos arXiv
1037 Speech Aware Dialog System Technology Challenge (DSTC11) WEB Page arXiv
1397 Knowledge-Retrieval Task-Oriented Dialog Systems with Semi-Supervision GitHub arXiv
2513 Tracking Must Go On: Dialogue State Tracking with Verified Self-Training

Speech Activity Detection and Modeling

# Title Repo Paper
558 GL-SSD: Global and Local Speech Style Disentanglement by Vector Quantization for Robust Sentence Boundary Detection in Speech Stream
598 Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction arXiv
2466 Dynamic Encoder RNN for Online Voice Activity Detection in Adverse Noise Conditions
996 Point to the Hidden: Exposing Speech Audio Splicing via Signal Pointer Nets arXiv
716 Real-Time Causal Spectro-Temporal Voice Activity Detection based on Convolutional Encoding and Residual Decoding
2413 SVVAD: Personal Voice Activity Detection for Speaker Verification arXiv

Multilingual Models for ASR

# Title Repo Paper
1613 Learning Cross-Lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition arXiv
2122 AfriNames: Most ASR models "butcher" African Names arXiv
2528 Towards Dialect-Inclusive Recognition in a Low-Resource Language: Are Balanced Corpora the Answer?
2588 Svarah: Evaluating English ASR Systems on Indian Accents GitHub arXiv
1044 N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition arXiv
1014 The MALACH Corpus: Results with End-to-End Architectures and Pretraining

Speech Enhancement and Bandwidth Expansion

# Title Repo Paper
232 Unsupervised Speech Enhancement with Deep Dynamical Generative Speech and Noise Models arXiv
857 Noise-Robust Bandwidth Expansion for 8K Speech Recordings
113 mdctGAN: Taming Transformer-based GAN for Speech Super-Resolution with Modified DCT Spectra GitHub arXiv
625 Zoneformer: On-Device Neural Beamformer for In-Car Multi-Zone Speech Separation, Enhancement and Echo Cancellation
634 Low-Complexity Broadband Beampattern Synthesis using Array Response Control
904 A GAN Speech Inpainting Model for Audio Editing Software


# Title Repo Paper
2316 Deep Speech Synthesis from MRI-based Articulatory Representations GitHub arXiv
562 Learning to Compute the Articulatory Representations of Speech with the MIRRORNET GitHub Page
804 Generating High-Resolution 3D Real-Time MRI of the Vocal Tract
1593 Exploring a Classification Approach using Quantised Articulatory Movements for Acoustic to Articulatory Inversion

Neural Processing of Speech and Language: Encoding and Decoding the Diverse Auditory Brain

# Title Repo Paper
633 Coherence Estimation Tracks Auditory Attention in Listeners with Hearing Impairment
2378 Enhancing the EEG Speech Match Mismatch Tasks with Word Boundaries GitHub arXiv
1347 Similar Hierarchical Representation of Speech and Other Complex Sounds in the Brain and Deep Residual Networks: An MEG Study
121 Speech Taskonomy: Which Speech Tasks are the most Predictive of fMRI Brain Activity? HAL Science
282 MEG Encoding using Word Context Semantics in Listening Stories HAL Science
1949 Investigating the Cortical Tracking of Speech and Music with Sung Speech
414 Exploring Auditory Attention Decoding using Speaker Features
1776 Effects of Spectral Degradation on the Cortical Tracking of the Speech Envelope
964 Effects of Spectral and Temporal Modulation Degradation on Intelligibility and Cortical Tracking of Speech Signals

Perception of Paralinguistics

# Title Repo Paper
2061 Transfer Learning for Personality Perception via Speech Emotion Recognition arXiv
1131 A Stimulus-Organism-Response Model of Willingness to Buy from Advertising Speech using Voice Quality
1835 Voice Passing: A Non-Binary Voice Gender Prediction System for evaluating Transgender Voice Transition
1139 Influence of Personal Traits on Impressions of One's Own Voice
887 Pardon my Disfluency: The Impact of Disfluency Effects on the Perception of Speaker Competence and Confidence
711 Cross-Linguistic Emotion Perception in Human and TTS Voices WEB Page

Technologies for Child Speech Processing

# Title Repo Paper
1302 Joint Learning Feature and Model Adaptation for Unsupervised Acoustic Modelling of Child Speech
1681 Automatic Assessment of Oral Reading Accuracy for Reading Diagnostics GitHub
2084 An ASR-enabled Reading Tutor: Investigating Feedback to Optimize Interaction for Learning to Read Pdf
935 Adaptation of Whisper Models to Child Speech Recognition

Speech Synthesis: Multilinguality; Evaluation

# Title Repo Paper
2064 Automatic Evaluation of Turn-Taking Cues in Conversational Speech Synthesis GitHub Page arXiv
441 Expressive Machine Dubbing through Phrase-Level Cross-Lingual Prosody Transfer arXiv
1691 Robust Feature Decoupling in Voice Conversion by using Locality-based Instance Normalization GitHub
612 Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network
2148 The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech GitHub Page arXiv
1727 GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech GitHub Page arXiv
1285 Analysis of Mean Opinion Scores in Subjective Evaluation of Synthetic Speech based on Tail Probabilities GitHub
1584 LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus GitHub Page
1067 UniFLG: Unified Facial Landmark Generator from Text or Speech GitHub Page arXiv
444 XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech GitHub arXiv
2224 ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus ClArTTS arXiv
154 Diffusion-based Accent Modelling in Speech Synthesis
249 Multilingual Text-to-Speech Synthesis for Turkic Languages using Transliteration GitHub arXiv
553 CVTE-Poly: A New Benchmark for Chinese Polyphone Disambiguation
709 Improve Bilingual TTS using Language and Phonology Embedding with Embedding Strength Modulator GitHub Page arXiv
2179 High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units GitHub Page
1097 PronScribe: Highly Accurate Multimodal Phonemic Transcription From Speech and Text
2158 Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages arXiv
416 Why We Should Report the Details in Subjective Evaluation of TTS More Rigorously arXiv
1622 Speaker-Independent Neural Formant Synthesis GitHub Page arXiv
1098 CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center GitHub Page arXiv
430 SASPEECH: A Hebrew Single Speaker Dataset for Text to Speech and Voice Conversion GitHub Page

Show and Tell: Health Applications and Emotion Recognition

# Title Repo Paper
2618 A Personalised Speech Communication Application for Dysarthric Speakers
2624 Video Multimodal Emotion Recognition System for Real World Applications
2626 Promoting Mental Self-Disclosure in a Spoken Dialogue System
2632 "Select Language, Modality or Put on a Mask!" Experiments with Multimodal Emotion Recognition
2635 My Vowels Matter: Formant Automation Tools for Diverse Child Speech
2636 NEMA: An Ecologically Valid Tool for Assessing Hearing Devices, Advanced Algorithms, and Communication in Diverse Listening Environments
2644 When Words Speak Just as Loudly as Actions: Virtual Agent based Remote Health Assessment Integrating What Patients Say with What They Do Pdf
2648 Stuttering Detection Application
2649 Providing Interpretable Insights for Neurological Speech and Cognitive Disorders from Interactive Serious Games
2651 Automated Neural Nursing Assistant (ANNA): An Over-the-Phone System for Cognitive Monitoring
2656 5G-IoT Cloud based Demonstration of Real-Time Audio-Visual Speech Enhancement for Multimodal Hearing-aids
2671 Towards Two-Point Neuron-Inspired Energy-Efficient Multimodal Open Master Hearing Aid

Show and Tell: Speech Tools, Speech Enhancement, Speech Synthesis

# Title Repo Paper
2614 DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement GitHub arXiv
2615 Nkululeko: Machine Learning Experiments on Speaker Characteristics without Programming
2625 Sp1NY: A Quick and Flexible Python Speech Visualization Tool
2629 Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0
2634 So-to-Speak: An Exploratory Platform for Investigating the Interplay between Style and Prosody in TTS
2638 Comparing /b/ and /d/ with a Single Physical Model of the Human Vocal Tract to Visualize Droplets Produced while Speaking
2640 Show & Tell: Voice Activity Projection and Turn-taking
2652 Real-Time Detection of Soft Voice for Speech Enhancement
2655 Data Augmentation for Diverse Voice Conversion in Noisy Environments arXiv
2667 Application for Real-Time Audio-Visual Speech Enhancement

Show and Tell: Language Learning and Educational Resources

# Title Repo Paper
2623 A Unified Framework to Improve Learners' Skills of Perception and Production based on Speech Shadowing and Overlapping
2633 Speak & Improve: L2 English Speaking Practice Tool
2641 Measuring Prosody in Child Speech using SoapBox Fluency API
2650 Teaching Non-native Sound Contrasts using Visual Biofeedback
2654 Large-Scale Automatic Audiobook Creation
2658 QVoice: Arabic Speech Pronunciation Learning Application arXiv
2659 Asking Questions: An Innovative Way to Interact with Oral History Archives
2660 DisfluencyFixer: A Tool to Enhance Language Learning through Speech to Speech Disfluency Correction arXiv
2661 Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages arXiv
2668 MyVoice: Arabic Speech Resource Collaboration Platform
2669 Personal Primer Prototype 1: Invitation to Make Your Own Embooked Speech-based Educational Artifact ResearchGate

Show and Tell: Media and Commercial Applications

# Title Repo Paper
2621 Let's Give a Voice to Conversational Agents in Virtual Reality GitHub
2622 FOOCTTS: Generating Arabic Speech with Acoustic Environment for Football Commentator arXiv
2637 Video Summarization Leveraging Multimodal Information for Presentations
2645 What Questions are My Customers Asking?: Towards Actionable Insights from Customer Questions in Contact Center Calls
2646 COnVoy: A Contact Center Operated Pipeline for Voice of Customer Discovery
2653 NeMo Forced Aligner and its Application to Word Alignment for Subtitle Generation
2662 CauSE: Causal Search Engine for Understanding Contact-Center Conversations
2663 Tailored Real-Time Call Summarization System for Contact Centers
2647 Federated Learning Toolkit with Voice-based User Verification Demo
2657 Learning when to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models GitHub arXiv
2628 Fast Enrollable Streaming Keyword Spotting System: Training and Inference using a Web Browser
2665 Cross-Lingual/Cross-Channel Intent Detection in Contact-Center Conversations
