INTERSPEECH-2023-Papers

INTERSPEECH 2023 Papers: A complete collection of influential and exciting research papers from the INTERSPEECH 2023 conference. Explore the latest advances in speech and language processing. Code included. ⭐ the repository to support the advancement of speech technology!

A draft PDF version of the INTERSPEECH 2023 Conference Programme is available. It lists all accepted full papers together with their provisional mode of presentation and the time at which they will be presented.

Contributors

Contributions to improve the completeness of this list are greatly appreciated. If you come across any overlooked papers, please feel free to create pull requests, open issues or contact me via email. Your participation is crucial to making this repository even better.

Papers

NOTE: Final paper links will be added post-conference.

Resources for Spoken Language Processing

#	Title	Repo	Paper
1686	Multimodal Personality Traits Assessment (MuPTA) Corpus: The Impact of Spontaneous and Read Speech		➖
1049	MOCKS 1.0: Multilingual Open Custom Keyword Spotting Testset	➖	➖
2150	MD3: The Multi-Dialect Dataset of Dialogues
2279	MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
1828	Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition	➖	➖
2351	HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation	➖

Speech Synthesis: Prosody and Emotion

#	Title	Repo	Paper
749	Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks	➖
1292	Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations	➖	➖
1317	EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis	➖
806	Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus
2270	Explicit Intensity Control for Accented Text-to-speech
834	Comparing Normalizing Flows and Diffusion Models for Prosody and Acoustic Modelling in Text-to-speech	➖	➖

Statistical Machine Translation

#	Title	Repo	Paper
2484	Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer	➖	➖
1063	Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters	➖
648	StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation
1767	Joint Speech Translation and Named Entity Recognition
2050	Analysis of Acoustic Information in End-to-End Spoken Language Translation	➖	➖
2004	LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers	➖

Self-Supervised Learning in ASR

#	Title	Repo	Paper
1213	DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
1040	Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations
387	Dual Acoustic Linguistic Self-supervised Representation Learning for Cross-Domain Speech Recognition	➖	➖
2166	O-1: Self-training with Oracle and 1-best Hypothesis	➖	➖
822	MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets
1802	Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages	➖

Prosody

#	Title	Repo	Paper
1781	Chinese EFL Learners' Perception of English Prosodic Focus	➖	➖
315	Pitch Accent Variation and the Interpretation of Rising and Falling Intonation in American English	➖	➖
1033	Tonal Coarticulation as a Cue for Upcoming Prosodic Boundary	➖	➖
2116	Alignment of Beat Gestures and Prosodic Prominence in German	➖	➖
1454	Creak Prevalence and Prosodic Context in Australian English	➖	➖
1651	Speech Reduction: Position within French Prosodic Structure	➖	➖

Speech Production

#	Title	Repo	Paper
637	Transvelar Nasal Coupling Contributing to Speaker Characteristics in Non-nasal Vowels	➖	➖
286	Speech Synthesis from Articulatory Movements Recorded by Real-time MRI	➖	➖
2283	The ART of Conversation: Measuring Phonetic Convergence and Deliberate Imitation in L2-Speech with a Siamese RNN
1933	Did You See that? Exploring the Role of Vision in the Development of Consonant Feature Contrasts in Children with Cochlear Implants	➖	➖

Dysarthric Speech Assessment

#	Title	Repo	Paper
2017	Automatic Assessments of Dysarthric Speech: the Usability of Acoustic-phonetic Features	➖	➖
1455	Classification of Multi-class Vowels and Fricatives from Patients Having Amyotrophic Lateral Sclerosis with Varied Levels of Dysarthria Severity	➖	➖
1627	Parameter-efficient Dysarthric Speech Recognition using Adapter Fusion and Householder Transformation	➖
2481	Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation	➖
1921	Latent Phrase Matching for Dysarthric Speech	➖
173	Speech Intelligibility Assessment of Dysarthric Speech by using Goodness of Pronunciation with Uncertainty Quantification

Speech Coding: Transmission

#	Title	Repo	Paper
1562	CQNV: A Combination of Coarsely Quantized Bitstream and Neural Vocoder for Low Rate Speech Coding	➖	➖
1234	Target Speech Extraction with Conditional Diffusion Model	➖	➖
883	Towards Fully Quantized Neural Networks For Speech Enhancement	➖	➖
980	Complex Image Generation SwinTransformer Network for Audio Denoising		➖

Speech Recognition: Signal Processing, Acoustic Modeling, Robustness, Adaptation

#	Title	Repo	Paper
2118	Using Text Injection to Improve Recognition of Personal Identifiers in Speech	➖	➖
837	Investigating Wav2Vec2 Context Representations and the Effects of Fine-tuning, a Case-study of a Finnish Model		➖
872	Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech	➖	➖
177	Iteratively Improving Speech Recognition and Voice Conversion
2001	LABERT: A Combination of Local Aggregation and Self-Supervised Speech Representation Learning for Detecting Informative Hidden Units in Low-Resource ASR Systems	➖
746	TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition	➖
1124	Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR	➖	➖
2417	GhostRNN: Reducing State Redundancy in RNN with Cheap Operations	➖	➖
1442	Task-Agnostic Structured Pruning of Speech Representation Models	➖
485	Factual Consistency Oriented Speech Recognition	➖
1036	Multi-Head State Space Model for Speech Recognition	➖
341	Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search	➖	➖
2359	Probing Self-supervised Speech Models for Phonetic and Phonemic Information: a Case Study in Aspiration	➖
739	Selective Biasing with Trie-based Contextual Adapters for Personalised Speech Recognition using Neural Transducers	➖
213	A More Accurate Internal Language Model Score Estimation for the Hybrid Autoregressive Transducer	➖	➖
2280	Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data	➖
2585	OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking	➖
1316	ML-SUPERB: Multilingual Speech Universal PERformance Benchmark
2389	General-purpose Adversarial Training for Enhanced Automatic Speech Recognition Model Generalization	➖	➖
275	Joint Instance Reconstruction and Feature Sub-space Alignment for Cross-Domain Speech Emotion Recognition	➖	➖
106	Attention Gate between Capsules in Fully Capsule-network Speech Recognition	➖	➖
1272	Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition	➖
1189	Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers	➖
223	Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding	➖
923	Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation
2258	Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts	➖
1184	DCCRN-KWS: An Audio Bias based Model for Noise Robust Small-footprint Keyword Spotting	➖
1609	OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition	➖
2136	Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition	➖
788	Rehearsal-Free Online Continual Learning for Automatic Speech Recognition
496	ASR Data Augmentation in Low-resource Settings using Cross-lingual Multi-speaker TTS and Cross-lingual Voice Conversion
642	Personality-aware Training based Speaker Adaptation for End-to-End Speech Recognition	➖	➖
2257	Target Vocabulary Recognition Based on Multi-Task Learning with Decomposed Teacher Sequences	➖	➖
679	Wave to Syntax: Probing Spoken Language Models for Syntax
720	Effective Training of Attention-based Contextual Biasing Adapters with Synthetic Audio for Personalised ASR	➖
630	Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation	➖
1118	SlothSpeech: Denial-of-service Attack Against Speech Recognition Models
503	CLRL-Tuning: A Novel Continual Learning Approach for Automatic Speech Recognition	➖	➖
159	Exploring Sources of Racial Bias in Automatic Speech Recognition through the Lens of Rhythmic Variation	➖	➖
1440	Can Contextual Biasing Remain Effective with Whisper and GPT-2?	➖
221	Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation
2207	Improving RNN Transducer Acoustic Models for English Conversational Speech Recognition	➖	➖
1216	MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition	➖	➖
1192	Improving Chinese Mandarin Speech Recognition using Graph Embedding Regularization	➖	➖
1276	Adapting Multi-Lingual ASR Models for Handling Multiple Talkers	➖
1221	Adapter-tuning with Effective Token-dependent Representation Shift for Automatic Speech Recognition	➖	➖
1010	Model-Internal Slot-triggered Biasing for Domain Expansion in Neural Transducer ASR Models	➖
2508	Delay-penalized CTC implemented based on Finite State Transducer
101	Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition
1064	MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations
1422	Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator	➖
1413	Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification
2589	Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR
1091	Domain Adaptive Self-supervised Training of Automatic Speech Recognition	➖	➖
1105	There is more than One Kind of Robustness: Fooling Whisper with Adversarial Examples
1176	Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute	➖
759	Blank-regularized CTC for Frame Skipping in Neural Transducer	➖
2406	The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR	➖
2354	Improving RNN-Transducers with Acoustic LookAhead	➖
1847	Everyone has an Accent	➖	➖
2124	Some Voices are too Common: Building Fair Speech Recognition Systems using the Common-Voice Dataset	➖
1168	Information Magnitude Based Dynamic Sub-sampling for Speech-to-text	➖	➖
353	Towards Multi-task Learning of Speech and Speaker Recognition
2186	Regarding Topology and Variant Frame Rates for Differentiable WFST-based End-to-End ASR	➖	➖
1012	2-bit Conformer Quantization for Automatic Speech Recognition	➖
167	Time-Domain Speech Enhancement for Robust Automatic Speech Recognition	➖
257	Multi-channel Multi-speaker Transformer for Speech Recognition	➖	➖
733	Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion	➖
2463	Dialect Speech Recognition Modeling using Corpus of Japanese Dialects and Self-Supervised Learning-based Model XLSR	➖	➖
767	Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network	➖
970	Competitive and Resource Efficient Factored Hybrid HMM Systems are Simpler Than You Think	➖
791	MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition	➖
2499	Biased Self-supervised Learning for ASR	➖
1300	A Unified Recognition and Correction Model under Noisy and Accent Speech Conditions	➖	➖
2470	Wav2Vec 2.0 ASR for Cantonese-Speaking Older Adults in a Clinical Setting	➖	➖
770	BAT: Boundary Aware Transducer for Memory-efficient and Low-latency ASR	➖
1342	Bayes Risk Transducer: Transducer with Controllable Alignment Prediction	➖	➖
783	Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition	➖

Analysis of Speech and Audio Signals

#	Title	Repo	Paper
1173	Robust Prototype Learning for Anomalous Sound Detection	➖	➖
982	A Multimodal Prototypical Approach for Unsupervised Sound Classification
563	Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms	➖	➖
1082	Adapting Language-Audio Models as Few-Shot Audio Learners	➖
914	Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
734	TFECN: Time-Frequency Enhanced ConvNet for Audio Classification	➖	➖
350	Resolution Consistency Training on Time-Frequency Domain for Semi-Supervised Sound Event Detection	➖	➖
1174	Fine-tuning Audio Spectrogram Transformer with Task-aware Adapters for Sound Event Detection	➖	➖
1210	Small Footprint Multi-channel Network for Keyword Spotting with Centroid Based Awareness	➖	➖
1380	Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes	➖
1549	Interpretable Latent Space Using Space-Filling Curves for Phonetic Analysis in Voice Conversion
1861	Topological Data Analysis for Speech Processing
1329	Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation
932	Personalized Acoustic Scene Classification in Ultra-low Power Embedded Devices using Privacy-preserving Data Augmentation	➖	➖
176	Background Domain Switch: A Novel Data Augmentation Technique for Robust Sound Event Detection	➖	➖
1021	Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning
2416	Anomalous Sound Detection Using Self-Attention-Based Frequency Pattern Analysis of Machine Sounds	➖	➖
1478	Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions	➖	➖
979	Ontology-aware Learning and Evaluation for Audio Tagging
575	Differential Privacy enabled Dementia Classification: An Exploration of the Privacy-Accuracy Trade-off in Speech Signal Data	➖	➖
1595	Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech
1816	Towards Multi-Lingual Audio Question Answering	➖	➖
477	Wav2ToBI: a New Approach to Automatic ToBI Transcription	➖	➖
1579	MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization	➖
591	Anomalous Sound Detection Based on Sound Separation	➖
2089	Random Forest Classification of Breathing Phases from Audio Signals Recorded using Mobile Devices	➖	➖
1581	GRAVO: Learning to Generate Relevant Audio from Visual Features with Noisy Online Videos	➖	➖
358	Emotion-Aware Audio-Driven Face Animation via Contrastive Feature Disentanglement	➖	➖
344	Joint-Former: Jointly Regularized and Locally Down-sampled Conformer for Semi-supervised Sound Event Detection	➖	➖
245	Towards Attention-based Contrastive Learning for Audio Spoof Detection	➖	➖
2488	Masked Audio Modeling with CLAP and Multi-Objective Learning	➖	➖
1904	Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems
481	Self-Supervised Dataset Pruning for Efficient Training in Audio Anti-spoofing	➖	➖
491	Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR	➖
684	Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features	➖
542	Advanced RawNet2 with Attention-based Channel Masking for Synthetic Speech Detection	➖	➖
88	Insights Into End-to-End Audio-to-Score Transcription with Real Recordings: A Case Study with Saxophone Works	➖	➖
2193	Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers
1621	Synthetic Voice Spoofing Detection based on Feature Pyramid Conformer	➖	➖
1383	Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection	➖	➖
2011	Application of Knowledge Distillation to Multi-task Speech Representation Learning	➖
2297	DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes	➖
1965	Variational Classifier for Unsupervised Anomalous Sound Detection under Domain Generalization	➖	➖
745	FlexiAST: Flexibility is What AST Needs	➖	➖
1344	Blind Estimation of Room Impulse Response from Monaural Reverberant Speech with Segmental Generative Neural Network	➖
852	AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
613	Dual-Memory Multi-Modal Learning for Continual Spoken Keyword Spotting with Confidence Selection and Diversity Enhancement	➖	➖
1431	An Efficient Speech Separation Network Based on Recurrent Fusion Dilated Convolution and Channel Attention	➖
801	Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation	➖	➖
2015	Binaural Sound Localization in Noisy Environments Using Frequency-Based Audio Vision Transformer (FAViT)	➖	➖
1723	Contrastive Learning based Deep Latent Masking for Music Source Separation	➖	➖
655	Speaker Extraction with Detection of Presence and Absence of Target Speakers	➖	➖
889	PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network	➖	➖
2117	Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning	➖
1309	Image-Driven Audio-Visual Universal Source Separation	➖	➖
2520	Joint Blind Source Separation and Dereverberation for Automatic Speech Recognition using Delayed-Subsource	➖	➖
1766	SDNet: Stream-attention and Dual-feature Learning Network for Ad-hoc Array Speech Separation	➖	➖
2451	Deeply Supervised Curriculum Learning for Deep Neural Network-based Sound Source Localization	➖	➖
164	Multi-Channel Separation of Dynamic Speech and Sound Events	➖	➖
2545	Rethinking the Visual Cues in Audio-Visual Speaker Extraction
85	Using Semi-supervised Learning for Monaural Time-domain Speech Separation with a Self-supervised Learning-based SI-SNR Estimator	➖	➖
1158	Investigation of Training Mute-Expressive End-to-End Speech Separation Networks for an Unknown Number of Speakers	➖	➖
2369	SR-SRP: Super-Resolution based SRP-PHAT for Sound Source Localization and Tracking	➖	➖
165	Time-frequency Domain Filter-and-sum Network for Multi-channel Speech Separation	➖	➖
714	FN-SSL: Full-Band and Narrow-Band Fusion for Sound Source Localization
696	A Neural State-Space Modeling Approach to Efficient Speech Separation	➖
1777	Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation
518	Monaural Speech Separation Method Based on Recurrent Attention with Parallel Branches	➖	➖

Speech Recognition: Architecture, Search, and Linguistic Components

#	Title	Repo	Paper
2344	Diacritic Recognition Performance in Arabic ASR	➖
990	Personalization for BERT-based Discriminative Speech Recognition Rescoring	➖
2182	On the N-gram Approximation of Pre-trained Language Models	➖
2147	Record Deduplication for Entity Distribution Modeling in ASR Transcripts	➖
2205	Learning When to Trust Which Teacher for Weakly Supervised ASR	➖
1313	Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer	➖
1378	Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation	➖	➖
2479	Knowledge Distillation Approach for Efficient Internal Language Model Estimation	➖	➖
276	Language Model Personalization for Improved Touchscreen Typing	➖	➖
1223	Blank Collapse: Compressing CTC Emission for the Faster Decoding
403	Improving Joint Speech-Text Representations Without Alignment	➖	➖
1941	Leveraging Cross-Utterance Context for ASR Decoding	➖
423	Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation
1517	Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition	➖	➖
1071	A Neural Time Alignment Module for End-to-End Automatic Speech Recognition	➖	➖
599	Accelerating Transducers through Adjacent Token Merging	➖
617	Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition	➖
2292	Language-Routing Mixture of Experts for Multi-lingual and Code-Switching Speech Recognition	➖	➖
1437	Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model	➖	➖
2051	Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning	➖
768	SpellMapper: A Non-autoregressive Neural Spellchecker for ASR Customization with Candidate Retrieval based on N-gram Mappings	➖
2037	Text Injection for Capitalization and Turn-Taking Prediction in Speech Models	➖	➖
1281	Confidence-based Ensembles of End-to-End Speech Recognition Models	➖
1050	Unsupervised Code-switched Text Generation from Parallel Text	➖	➖
258	A Binary Keyword Spotting System With Error-Diffusion Speech Feature Binarization	➖	➖
621	Language-universal Phonetic Encoder for Low-resource Speech Recognition	➖
863	A Lexical-aware Non-autoregressive Transformer-based ASR Model	➖
1841	Improving Under-Resourced Code-Switched Speech Recognition: Large Pre-trained Models or Architectural Interventions	➖	➖
1194	A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
61	A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization	➖	➖
137	Modeling Dependent Structure for Utterances in ASR Evaluation	➖
757	ASR for Low Resource and Multilingual Noisy Code-Mixed Speech	➖	➖
390	Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System	➖
737	Combining Multilingual Resources and Models to Develop State-of-the-Art E2E ASR for Swedish	➖	➖
1171	Two Stage Contextual Word Filtering for Context bias in Unified Streaming and Non-streaming Transducer	➖
1867	Towards Continually Learning New Languages	➖	➖
1616	N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space	➖
1432	SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge	➖	➖
1162	miniStreamer: Enhancing Small Conformer with Chunked-Context Masking for Streaming ASR Applications on the Edge	➖	➖
1469	CoMFLP: Correlation Measure based Fast Search on ASR Layer Pruning	➖	➖
1337	Exploration on HuBERT with Multiple Resolution	➖
2045	Quantization-aware and Tensor-compressed Training of Transformers for Natural Language Understanding	➖
2355	Word-level Confidence Estimation for CTC Models	➖	➖
2235	Multilingual Contextual Adapters to Improve Custom Word Recognition in Low-resource Languages	➖
614	Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition	➖	➖
1303	4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict Decoders	➖
1086	Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition
262	Language-specific Boundary Learning for Improving Mandarin-English Code-switching Speech Recognition	➖	➖
480	Mixture-of-Expert Conformer for Streaming Multilingual ASR	➖
1665	Lossless 4-bit Quantization of Architecture Compressed Conformer ASR Systems on the 300-hr Switchboard Corpus	➖	➖
2544	Compressed MoE ASR Model Based on Knowledge Distillation and Quantization	➖	➖

Speech Recognition: Technologies and Systems for New Applications

#	Title	Repo	Paper
1079	How to Estimate Model Transferability of Pre-Trained Speech Models?	➖
235	Progress and Prospects for Spoken Language Technology: Results from Five Sexennial Surveys	➖	➖
268	Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling	➖
601	CASA-ASR: Context-Aware Speaker-Attributed ASR	➖
1321	Unsupervised Learning of Discrete Latent Representations with Data-Adaptive Dimensionality from Continuous Speech Streams	➖	➖
1167	AD-TUNING: An Adaptive CHILD-TUNING Approach to Efficient Hyperparameter Optimization of Child Networks for Speech Processing Tasks in the SUPERB Benchmark		➖
190	Distilling Knowledge from Gaussian Process Teacher to Neural Network Student	➖	➖
2032	Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
135	Segmental SpeechCLIP: Utilizing Pretrained Image-text Models for Audio-Visual Learning	➖	➖
421	Towards Hate Speech Detection in Low-resource Languages: Comparing ASR to Acoustic Word Embeddings on Wolof and Swahili	➖
385	Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification Through Meta-Learning
664	Online Punctuation Restoration using ELECTRA Model for streaming ASR Systems	➖	➖
2066	Language Agnostic Data-Driven Inverse Text Normalization	➖
2044	Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
1655	Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model	➖	➖
2371	Assessment of Non-Native Speech Intelligibility using Wav2vec2-based Mispronunciation Detection and Multi-level Goodness of Pronunciation Transformer	➖	➖
1592	Zero-Shot Automatic Pronunciation Assessment	➖
337	A Joint Model for Pronunciation Assessment and Mispronunciation Detection and Diagnosis with Multi-task Learning	➖	➖
1635	Assessing Intelligibility in Non-native Speech: Comparing Measures Obtained at Different Levels	➖	➖
585	End-to-End Word-Level Pronunciation Assessment with MASK Pre-training	➖
550	A Hierarchical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment	➖
2541	Automatic Prediction of Language Learners' Listenability Using Speech and Text Features Extracted from Listening Drills	➖	➖
380	Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment	➖	➖
1899	Adapting an Unadaptable ASR System	➖
533	Addressing Cold Start Problem for End-to-end Automatic Speech Scoring	➖
816	Improving Grapheme-to-phoneme Conversion by Learning Pronunciations from Speech Recordings	➖
2577	Orthography-based Pronunciation Scoring for Better CAPT Feedback	➖
587	Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring	➖
364	Mispronunciation Detection and Diagnosis Model for Tonal Language, Applied to Vietnamese	➖	➖

Lexical and Language Modeling for ASR

#	Title	Repo	Paper
643	NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning
2128	Scaling Laws for Discriminative Speech Recognition Rescoring Models	➖
2429	Exploring Energy-based Language Models with Different Architectures and Training Methods for Speech Recognition
1362	Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition	➖
1251	Memory Network-Based End-To-End Neural ES-KMeans for Improved Word Segmentation	➖	➖
1320	Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation	➖

Language Identification and Diarization

#	Title	Repo	Paper
304	Lightweight and Efficient Spoken Language Identification of Long-form Audio	➖	➖
1109	End-to-End Spoken Language Diarization with Wav2vec Embeddings	➖	➖
1986	Efficient Spoken Language Recognition via Multilabel Classification	➖
1529	Description and Analysis of ABC Submission to NIST LRE 2022	➖	➖
1790	Exploring the Impact of Pretrained Models and Web-Scraped Data for the 2022 NIST Language Recognition Evaluation	➖	➖
1094	Advances in Language Recognition in Low Resource African Languages: The JHU-MIT Submission for NIST LRE22	➖	➖

Speech Quality Assessment

#	Title	Repo	Paper
1436	DeePMOS: Deep Posterior Mean-Opinion-Score of Speech	➖	➖
1644	The Role of Formant and Excitation Source Features in Perceived Naturalness of Low Resource Tribal Language TTS: An Empirical Study	➖	➖
811	A No-reference Speech Quality Assessment Method based on Neural Network with Densely Connected Convolutional Architecture	➖	➖
2507	Probing Speech Quality Information in ASR Systems	➖	➖
589	Preference-based Training Framework for Automatic Speech Quality Assessment using Deep Neural Network	➖	➖
389	Crowdsourced Data Validation for ASR Training	➖	➖

Feature Modeling for ASR

#	Title	Repo	Paper
2296	Re-investigating the Efficient Transfer Learning of Speech Foundation Model using Feature Fusion Methods	➖	➖
1556	Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training	➖	➖
509	InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition	➖
579	Transductive Feature Space Regularization for Few-shot Bioacoustic Event Detection	➖	➖
615	Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition	➖
1510	On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning	➖

Interfacing Speech Technology and Phonetics

#	Title	Repo	Paper
1846	Phonemic Competition in End-to-end ASR models	➖	➖
443	Automatic Speaker Recognition with Variation Across Vocal Conditions: a Controlled Experiment with Implications for Forensics	➖	➖
1398	Exploring Graph Theory Methods for the Analysis of Pronunciation Variation in Spontaneous Speech	➖	➖
680	Automatic Speaker Recognition Performance with Matched and Mismatched Female Bilingual Speech Data	➖	➖

Speech Synthesis: Multilinguality

#	Title	Repo	Paper
2303	FACTSpeech: Speaking a Foreign Language Pronunciation Using Only Your Native Characters	➖	➖
934	Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model	➖
363	DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
1467	Generating Multilingual Gender-Ambiguous Text-to-Speech Voices
2330	RADMMM: Multilingual Multiaccented Multispeaker Text to Speech	➖
861	Multilingual Context-based Pronunciation Learning for Text-to-Speech	➖

Speech Emotion Recognition

#	Title	Repo	Paper
2170	Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition	➖	➖
1113	The Importance of Calibration: Rethinking Confidence and Performance of Speech Multi-label Emotion Classifiers	➖
1080	A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model
454	Privacy Risks in Speech Emotion Recognition: A Systematic Study on Gender Inference Attack	➖	➖
2111	Episodic Memory For Domain-Adaptable, Robust Speech Emotion Recognition	➖	➖
80	Stable Speech Emotion Recognition with Head-k-Pooling Loss	➖	➖
1923	Node-weighted Graph Convolutional Network for Depression Detection in Transcribed Clinical Interviews
756	Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining	➖	➖
240	The Co-use of Laughter and Head Gestures Across Speech Styles	➖	➖
1351	EmotionNAS: Two-stream Neural Architecture Search for Speech Emotion Recognition	➖
136	Pre-Finetuning for Few-Shot Emotional Speech Recognition
293	Integrating Emotion Recognition with Speech Recognition and Speaker Diarization for Conversations	➖	➖
1075	Utility-Preserving Privacy-Enabled Speech Embeddings for Emotion Detection	➖	➖
890	A Context-Constrained Sentence Modeling for Deception Detection in Real Interrogation	➖	➖
1914	Laughter in Task-based Settings: Whom We Talk to Affects How, When, and How Often We Laugh	➖	➖
653	Exploring Downstream Transfer of Self-Supervised Features for Speech Emotion Recognition	➖	➖
1758	Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models	➖
819	MetricAug: A Distortion Metric-Lead Augmentation Strategy for Training Noise-Robust Speech Emotion Recognizer		➖
1311	Investigating Acoustic Cues for Multilingual Abuse Detection	➖	➖
1600	A Novel Frequency Warping Scale for Speech Emotion Recognition	➖	➖
1170	Multi-scale Temporal Transformer for Speech Emotion Recognition	➖	➖
1169	Distant Speech Emotion Recognition in an Indoor Human-Robot Interaction Scenario	➖	➖
2498	A Study on Prosodic Entrainment in Relation to Therapist Empathy in Counseling Conversation	➖	➖
2375	Improving Joint Speech and Emotion Recognition using Global Style Tokens	➖	➖
1163	Speech Emotion Recognition by Estimating Emotional Label Sequences with Phoneme Class Attribute	➖	➖
274	Unsupervised Transfer Components Learning for Cross-Domain Speech Emotion Recognition	➖	➖
1090	Dual Memory Fusion for Multimodal Speech Emotion Recognition	➖	➖
311	Hybrid Dataset for Speech Emotion Recognition in Russian Language	➖	➖
396	Speech Emotion Recognition using Decomposed Speech via Multi-task Learning	➖	➖

Spoken Dialog Systems and Conversational Analysis

#	Title	Repo	Paper
1236	Emotion Awareness in Multi-utterance Turn for Improving Emotion Prediction in Multi-Speaker Conversation	➖	➖
2300	Tri-level Joint Natural Language Understanding for Multi-turn Conversational Datasets
2234	Semantic Enrichment Towards Efficient Speech Representations	➖	➖
1299	Tensor Decomposition for Minimization of E2E SLU Model Toward On-device Processing	➖
46	FC-MTLF: A Fine- and Coarse-grained Multi-Task Learning Framework for Cross-Lingual Spoken Language Understanding	➖	➖
699	DiffSLU: Knowledge Distillation Based Diffusion Model for Cross-Lingual Spoken Language Understanding	➖	➖
1962	Integrating Pretrained ASR and LM to perform Sequence Generation for Spoken Language Understanding	➖	➖
644	Contrastive Learning Based ASR Robust Knowledge Selection For Spoken Dialogue System	➖	➖
1859	Unsupervised Dialogue Topic Segmentation in Hyperdimensional Space	➖	➖
198	An Investigation of the Combination of Rehearsal and Knowledge Distillation in Continual Learning for Spoken Language Understanding
1740	Enhancing New Intent Discovery via Robust Neighbor-based Contrastive Learning	➖
211	Personalized Predictive ASR for Latency Reduction in Voice Assistants	➖
1419	Compositional Generalization in Spoken Language Understanding	➖	➖
2314	Sampling Bias in NLU Models: Impact and Mitigation	➖
1038	5IDER: Unified Query Rewriting for Steering, Intent Carryover, Disfluencies, Entity Carryover and Repair	➖
93	C²A-SLU: Cross and Contrastive Attention for Improving ASR Robustness in Spoken Language Understanding	➖	➖
1505	WhiSLU: End-to-End Spoken Language Understanding with Whisper	➖	➖
2475	I Learned Error, I Can Fix It!: A Detector-Corrector Structure for ASR Error Calibration	➖	➖
1951	Quantifying the Perceptual Value of Lexical and Non-lexical Channels in Speech		➖
952	Parsing Dialog Turns with Prosodic Features in English	➖	➖
320	Estimation of Listening Response Timing by Generative Model and Parameter Control of Response Substantialness using Dynamic-Prompt-Tune	➖	➖
1885	Parameter Selection for Analyzing Conversations with Autism Spectrum Disorder	➖	➖
2341	Efficient Multimodal Neural Networks for Trigger-less Voice Assistants	➖
2332	Rapid Lexical Alignment to a Conversational Agent	➖	➖
578	Multimodal Turn-Taking Model using Visual Cues for End-of-Utterance Prediction in Spoken Dialogue Systems	➖	➖
1464	Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer	➖	➖
1618	Improving the Response Timing Estimation for Spoken Dialogue Systems by Reducing the Effect of Speech Recognition Delay	➖	➖
555	Focus-attention-enhanced Cross-modal Transformer with Metric Learning for Multimodal Speech Emotion Recognition	➖	➖
1717	A Multiple-Teacher Pruning Based Self-Distillation (MT-PSD) Approach to Model Compression for Audio-Visual Wake Word Spotting	➖	➖
789	Abusive Speech Detection in Indic Languages using Acoustic Features	➖	➖
1791	Listening To Silences In Contact Center Conversations using Textual Cues	➖	➖
1947	Relationship between Auditory and Semantic Entrainment using Deep Neural Networks (DNN)	➖	➖
1074	Verbal and Nonverbal Feedback Signals in Response to Increasing Levels of Miscommunication	➖	➖
76	Speech-Based Classification of Defensive Communication: A Novel Dataset and Results	➖	➖
1929	Unsupervised Auditory and Semantic Entrainment Models with Deep Neural Networks	➖	➖
1267	Relationships between Gender, Personality Traits and Features of Multi-Modal Data to Responses to Spoken Dialog Systems Breakdown	➖	➖
1650	Speaker-aware Cross-modal Fusion Architecture for Conversational Emotion Recognition	➖	➖

Speech Coding and Enhancement

#	Title	Repo	Paper
936	Biophysically-inspired Single-channel Speech Enhancement in the Time Domain	➖	➖
1902	On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer	➖	➖
1901	How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning	➖
1287	CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram	➖	➖
521	A Two-stage Progressive Neural Network for Acoustic Echo Cancellation	➖
537	An Intra-BRNN and GB-RVQ Based End-to-end Neural Audio Codec	➖	➖
1066	Real-Time Personalised Speech Enhancement Transformers with Dynamic Cross-attended Speaker Representations	➖	➖
280	CFTNet: Complex-valued Frequency Transformation Network for Speech Enhancement	➖	➖
623	Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement	➖
1490	Multi-mode Neural Speech Coding Based on Deep Generative Networks	➖	➖
751	Streaming Dual-Path Transformer for Speech Enhancement	➖	➖
1848	Sequence-to-Sequence Multi-Modal Speech In-Painting	➖	➖
984	Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression	➖
551	Differentially Private Adapters for Parameter Efficient Acoustic Modeling
780	Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation
2568	Consonant-emphasis Method Incorporating Robust Consonant-section Detection to Improve Intelligibility of Bone-conducted Speech	➖	➖
1578	Downstream Task-Agnostic Speech Enhancement with Self-Supervised Representation Loss	➖
2305	Perceptual Improvement of Deep Neural Network (DNN) Speech Coder Using Parametric and Nonparametric Density Models	➖	➖
2437	DeFT-AN RT: Real-time Multichannel Speech Enhancement using Dense Frequency-Time Attentive Network and Non-overlapping Synthesis Window	➖	➖
365	Iterative Autoregression: a Novel Trick to Improve your Low-latency Speech Enhancement Model	➖
1116	Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine	➖	➖
1364	Exploring the Interactions between Target Positive and Negative Information for Acoustic Echo Cancellation	➖	➖
1084	A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement using Small-footprint Models
705	Domain Adaptation for Speech Enhancement in a Large Domain Gap	➖	➖
456	SCP-GAN: Self-Correcting Discriminator Optimization for Training Consistency Preserving Metric GAN on Speech Enhancement Tasks	➖
339	A Mask Free Neural Network for Monaural Speech Enhancement
1548	A Training and Inference Strategy using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech
2418	A Simple RNN Model for Lightweight, Low-compute and Low-latency Multichannel Speech Enhancement in the Time Domain	➖	➖
1433	High Fidelity Speech Enhancement with Band-split RNN	➖
218	Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information
882	DFSNet: A Steerable Neural Beamformer Invariant to Microphone Array Configuration for Real-Time, Low-Latency Speech Enhancement	➖
1323	Speaker-Aware Anti-spoofing	➖
1376	PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement	➖	➖
799	EffCRN: An Efficient Convolutional Recurrent Network for High-Performance Speech Enhancement	➖
1795	HAD-ANC: A Hybrid System Comprising an Adaptive Filter and Deep Neural Networks for Active Noise Control	➖	➖
886	MSAF: A Multiple Self-Attention Field Method for Speech Enhancement	➖	➖
2302	Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression	➖	➖
971	ABC-KD: Attention-Based-Compression Knowledge Distillation for Deep Learning-Based Noise Suppression	➖
1532	PLCMOS – a Data-driven Non-intrusive Metric for the Evaluation of Packet Loss Concealment Algorithms
1910	Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing	➖
1445	Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement
901	Complex-valued Neural Networks for Voice Anti-spoofing	➖	➖
1028	DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation	➖
1547	Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement	➖
1642	HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders	➖
1441	MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra
565	TRIDENTSE: Guiding Speech Enhancement with 32 Global Tokens	➖
1254	Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features	➖
1890	Self-Supervised Learning with Diffusion based Multichannel Speech Enhancement for Speaker Verification under Noisy Conditions	➖
1341	Two-Stage Voice Anonymization for Enhanced Privacy	➖
2055	Personalized Dereverberation of Speech	➖	➖
580	Weighted Von Mises Distribution-based Loss Function for Real-time STFT Phase Reconstruction using DNN	➖	➖
272	Deep Multi-Frame Filtering for Hearing Aids
1232	Aligning Speech Enhancement for Improving Downstream Classification Performance	➖	➖
420	DNN-based Parameter Estimation for MVDR Beamforming and Post-filtering	➖	➖
675	FRA-RIR: Fast Random Approximation of the Image-source Method
686	Rethinking Complex-Valued Deep Neural Networks for Monaural Speech Enhancement	➖
186	Harmonic Enhancement using Learnable Comb Filter for Light-weight Full-band Speech Enhancement Model	➖

Paralinguistics

#	Title	Repo	Paper
1023	Detection of Emotional Hotspots in Meetings using a Cross-Corpus Approach	➖	➖
1412	Detection of Laughter and Screaming using the Attention and CTC Models	➖	➖
1852	Capturing Formality in Speech Across Domains and Languages	➖
460	Towards Robust Family-Infant Audio Analysis Based on Unsupervised Pretraining of Wav2vec 2.0 on Large-Scale Unlabeled Family Audio
778	Cues to Next-speaker Projection in Conversational Swedish: Evidence from Reaction Times	➖
1200	Multiple Instance Learning for Inference of Child Attachment From Paralinguistic Aspects of Speech	➖	➖
2070	Speaker Embeddings as Individuality Proxy for Voice Stress Detection	➖
2213	From Interval to Ordinal: A HMM based Approach for Emotion Label Conversion	➖	➖
661	Turbo your Multi-modal Classification with Contrastive Learning	➖	➖
497	Towards Paralinguistic-Only Speech Representations for End-to-End Speech Emotion Recognition	➖	➖
1360	SOT: Self-supervised Learning-Assisted Optimal Transport for Unsupervised Adaptive Speech Emotion Recognition	➖	➖
2464	On the Efficacy and Noise-robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition	➖
830	Speaking State Decoder with Transition Detection for Next Speaker Prediction	➖	➖
1507	What are Differences? Comparing DNN and Human by their Performance and Characteristics in Speaker Age Estimation	➖	➖
846	Effects of Perceived Gender on the Perceived Social Function of Laughter	➖	➖
1999	Implicit Phonetic Information Modeling for Speech Emotion Recognition	➖
1034	Computation and Memory Efficient Noise Adaptation of Wav2Vec2.0 for Noisy Speech Emotion Recognition with Skip Connection Adapters	➖	➖
300	Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions	➖	➖
1108	Preference Learning Labels by Anchoring on Consecutive Annotations	➖	➖
2561	Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks	➖
543	Learning Local to Global Feature Aggregation for Speech Emotion Recognition	➖
842	Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition	➖	➖

Speech Enhancement and Denoising

#	Title	Repo	Paper
1088	Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation	➖
514	TaylorBeamixer: Learning Taylor-Inspired All-Neural Multi-Channel Speech Enhancement from Beam-Space Dictionary Perspective
865	MFT-CRN: Multi-scale Fourier Transform for Monaural Speech Enhancement	➖	➖
1265	Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement	➖
318	Multi-input Multi-output Complex Spectral Mapping for Speaker Separation	➖	➖
992	Short-term Extrapolation of Speech Signals Using Recursive Neural Networks in the STFT Domain	➖	➖

Speech Synthesis: Evaluation

#	Title	Repo	Paper
1843	Listener Sensitivity to Deviating Obstruents in WaveNet	➖	➖
981	How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics	➖
2014	MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors	➖	➖
851	RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting	➖	➖
2013	Can Better Perception Become a Disadvantage? Synthetic Speech Perception in Congenitally Blind Users	➖	➖
1076	Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech	➖

End-to-End Spoken Dialog Systems

#	Title	Repo	Paper
1799	Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding
1760	Improving End-to-End SLU performance with Prosodic Attention and Distillation	➖
2575	Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding	➖	➖
758	Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding	➖	➖
2018	ConvKT: Conversation-Level Knowledge Transfer for Context Aware End-to-End Spoken Language Understanding	➖	➖
41	GhostT5: Generate More Features with Cheap Operations to Improve Textless Spoken Question Answering	➖	➖

Biosignal-enabled Spoken Communication

#	Title	Repo	Paper
278	Obstructive Sleep Apnea Detection using Pretrained Speech Representations	➖	➖
620	EEG-based Auditory Attention Detection with Spatiotemporal Graph and Graph Convolutional Network	➖	➖
1966	Silent Speech Recognition with Articulator Positions Estimated from Tongue Ultrasound and Lip Video	➖	➖
1377	Auditory Attention Detection in Real-Life Scenarios Using Common Spatial Patterns from EEG	➖	➖
1381	Diff-E: Diffusion-based Learning for Decoding Imagined Speech EEG		➖
40	Towards Ultrasound Tongue Image Prediction from EEG During Speech Production
1607	Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces using Spatial Transformer Networks	➖
174	STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks	➖	➖
1881	Spanish Phone Confusion Analysis for EMG-Based Silent Speech Interfaces	➖	➖
805	Hybrid Silent Speech Interface Through Fusion of Electroencephalography and Electromyography	➖	➖

Neural-based Speech and Acoustic Analysis

#	Title	Repo	Paper
1968	Can Self-Supervised Neural Representations Pre-Trained on Human Speech Distinguish Animal Callers?
2342	Discovering COVID-19 Coughing and Breathing Patterns from Unlabeled Data Using Contrastive Learning with Varying Pre-Training Domains	➖
330	Background-aware Modeling for Weakly Supervised Sound Event Detection	➖	➖
1065	How to (Virtually) Train Your Speaker Localizer
2271	MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
909	A Multi-Task Learning Framework for Sound Event Detection using High-level Acoustic Characteristics of Sounds	➖

DiGo - Dialog for Good: Speech and Language Technology for Social Good

#	Title	Repo	Paper
2194	A Multimodal Investigation of Speech, Text, Cognitive and Facial Video Features for Characterizing Depression with and without Medication	➖
307	Understanding Disrupted Sentences using Underspecified Abstract Meaning Representation
2109	Developing Speech Processing Pipelines for Police Accountability	➖
2086	Prosody-controllable Gender-ambiguous Speech Synthesis: a Tool for Investigating Implicit Bias in Speech Perception	➖	➖
848	Affective Attributes of French Caregivers' Professional Speech	➖	➖

Spoken Language Processing: Translation, Information Retrieval, Summarization, Resources, and Evaluation

#	Title	Repo	Paper
180	Pragmatic Pertinence: A Learnable Confidence Metric to Assess the Subjective Quality of LM-Generated Text	➖	➖
2078	ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition	➖
916	BASS: Block-wise Adaptation for Speech Summarization	➖	➖
1258	Speaker Tracking using Graph Attention Networks with Varying Duration Utterances in Multi-Channel Naturalistic Data: Fearless Steps Apollo 11 Audio Corpus	➖	➖
36	Combining Language Corpora in a Japanese Electromagnetic Articulography Database for Acoustic-to-articulatory Inversion	➖	➖
523	A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition	➖	➖
2174	Large Dataset Generation of Synchronized Music Audio and Lyrics at Scale using Teacher-Student Paradigm	➖	➖
483	Enc-Dec RNN Acoustic Word Embeddings Learned via Pairwise Prediction	➖	➖
864	Query Based Acoustic Summarization for Podcasts	➖	➖
1242	Spot Keywords from Very Noisy and Mixed Speech	➖
891	Knowledge Distillation on Joint Task End-to-End Speech Translation	➖
343	Investigating Pre-trained Audio Encoders in the Low-Resource Condition
1718	Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target	➖
823	MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
1674	CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition
1762	Improving Zero-shot Cross-domain Slot Filling via Transformer-based Slot Semantics Fusion	➖	➖
619	Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer	➖	➖
1468	Boosting Punctuation Restoration with Data Generation and Reinforcement Learning	➖	➖
695	J-ToneNet: A Transformer-based Encoding Network for Improving Tone Classification in Continuous Speech via F0 Sequences	➖	➖
1152	Towards Cross-language Prosody Transfer for Dialog	➖	➖
2506	Strategies for Improving Low Resource Speech to Text Translation Relying on Pre-trained ASR Models	➖
1980	ITALIC: An Italian Intent Classification Dataset
1778	Perceptual and Task-Oriented Assessment of a Semantic Metric for ASR Evaluation	➖	➖
1466	How ChatGPT is Robust for Spoken Language Understanding?	➖	➖
1233	GigaST: A 10,000-hour Pseudo Speech Translation Corpus
1570	Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism	➖	➖
2473	Crowdsource-based Validation of the Audio Cocktail as a Sound Browsing Tool	➖	➖
1675	PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts	➖	➖
1358	Speech-to-Face Conversion using Denoising Diffusion Probabilistic Models	➖	➖
2255	Inter-connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation	➖
1068	How Does Pretraining Improve Discourse-Aware Translation?	➖
1135	PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction	➖
161	Model-assisted Lexical Tone Evaluation of Three-year-old Chinese-speaking Children by also Considering Segment Production	➖	➖
1392	Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding
1582	Joint Time and Frequency Transformer for Chinese Opera Classification	➖	➖
116	AdaMS: Deep Metric Learning with Adaptive Margin and Adaptive Scale for Acoustic Word Discrimination	➖
2252	Investigating Reproducibility at Interspeech Conferences: A Longitudinal and Comparative Perspective	➖
2250	Combining Heterogeneous Structures for Event Causality Identification	➖	➖
1208	An Efficient Approach for the Automated Segmentation and Transcription of the People's Speech Corpus	➖	➖
1425	Diverse Feature Mapping and Fusion via Multitask Learning for Multilingual Speech Emotion Recognition	➖	➖
903	Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text
466	Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin
1878	Efficient Adaptation of Spoken Language Understanding based on End-to-End Automatic Speech Recognition	➖	➖
597	PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords		➖
69	Mix before Align: Towards Zero-shot Cross-lingual Sentiment Analysis via Soft-Mix and Multi-View Learning	➖	➖
170	AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation	➖
2225	Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff	➖	➖
1979	Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages

Speech, Voice, and Hearing Disorders

#	Title	Repo	Paper
2421	Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test	➖
2198	Multimodal Locally Enhanced Transformer for Continuous Sign Language Recognition	➖	➖
1759	Towards Supporting an Early Diagnosis of Multiple Sclerosis using Vocal Features	➖	➖
1891	Whisper Features for Dysarthric Severity-Level Classification	➖	➖
2191	A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning
222	Dysarthric Speech Recognition, Detection and Classification using Raw Phase and Magnitude Spectra	➖	➖
2026	A Stutter Seldom Comes Alone - Cross-Corpus Stuttering Detection as a Multi-label Problem	➖
1542	Transfer Learning to Aid Dysarthria Severity Classification for Patients with Amyotrophic Lateral Sclerosis	➖	➖
2203	DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model	➖
201	CNVVE: Dataset and Benchmark for Classifying Non-verbal Voice
1541	Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation
1887	Weakly-supervised Forced Alignment of Disfluent Speech using Phoneme-level Modeling
1998	Glottal Source Analysis of Voice Deficits in Basal Ganglia Dysfunction: Evidence from de novo Parkinson's Disease and Huntington's Disease	➖	➖
2478	An Analysis of Glottal Features of Chronic Kidney Disease Speech and its Application to CKD Detection	➖	➖
983	Weakly Supervised Glottis Segmentation in High-speed Video Endoscopy using Bounding Box Labels	➖	➖

Spoken Term Detection and Voice Search

#	Title	Repo	Paper
478	Matching Latent Encoding for Audio-Text based Keyword Spotting	➖
1215	Self-Paced Pattern Augmentation for Spoken Term Detection in Zero-Resource	➖	➖
2362	On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation	➖
90	Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics	➖	➖
689	Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data	➖	➖
2222	Robust Keyword Spotting for Noisy Environments by Leveraging Speech Enhancement and Speech Presence Probability	➖	➖

Models for Streaming ASR

#	Title	Repo	Paper
831	Enhancing the Unified Streaming and Non-streaming Model with Contrastive Learning	➖
1497	ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs	➖
361	Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation	➖
1129	DCTX-Conformer: Dynamic Context Carry-over for Low Latency Unified Streaming and Non-streaming Conformer	➖
1121	Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer	➖	➖
884	Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition	➖

Source Separation

#	Title	Repo	Paper
1753	Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
1389	Remixing-based Unsupervised Source Separation from Scratch	➖	➖
577	CAPTDURE: Captioned Sound Dataset of Single Sources	➖
488	Recursive Sound Source Separation with Deep Learning-based Beamforming for Unknown Number of Sources	➖	➖
2537	Multi-Channel Speech Separation with Cross-Attention and Beamforming	➖	➖
185	Background-Sound Controllable Voice Source Separation	➖	➖

Speech Perception

#	Title	Repo	Paper
1922	A Neural Architecture for Selective Attention to Speech Features	➖	➖
1122	Quantifying Informational Masking due to Masker Intelligibility in Same-talker Speech-in-speech Perception	➖	➖
1476	On the Benefits of Self-supervised Learned Speech Representations for Predicting Human Phonetic Misperceptions	➖	➖
2154	Predicting Perceptual Centers Located at Vowel Onset in German Speech using Long Short-Term Memory Networks	➖	➖
63	Exploring the Mutual Intelligibility Breakdown Caused by Sculpting Speech from a Competing Speech Signal	➖	➖
2103	Perception of Incomplete Voicing Neutralization of Obstruents in Tohoku Japanese	➖	➖

Phonetics and Phonology: Languages and Varieties

#	Title	Repo	Paper
1879	The Emergence of Obstruent-intrinsic f0 and VOT as Cues to the Fortis/Lenis Contrast in West Central Bavarian	➖	➖
431	〈'〉 in Tsimane': a Preliminary Investigation	➖	➖
2200	Segmental Features of Brazilian (Santa Catarina) Hunsrik	➖	➖
2337	Opening or Closing? An Electroglottographic Analysis of Voiceless Coda Consonants in Australian English	➖	➖
295	Increasing Aspiration of Word-medial Fortis Plosives in Swiss Standard German	➖	➖
1456	Lexical Stress and Velar Palatalization in Italian: A Spatio-temporal Interaction	➖	➖

Speaker and Language Identification

#	Title	Repo	Paper
1989	Vietnam-Celeb: a Large-scale Dataset for Vietnamese Speaker Recognition	➖	➖
2254	What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model
241	The 2022 NIST Language Recognition Evaluation	➖
1446	MERLIon CCS Challenge: A English-Mandarin Code-switching Child-directed Speech Corpus for Language Identification and Diarization
1725	ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention
402	Branch-ECAPA-TDNN: A Parallel Branch Architecture to Capture Local and Global Features for Speaker Verification	➖	➖
2052	Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech	➖
2569	Wavelet Scattering Transform for Improving Generalization in Low-Resourced Spoken Language Identification	➖	➖
1407	A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General Purpose Speech Model
2272	HABLA: A Dataset of Latin American Spanish Accents for Voice Anti-spoofing	➖	➖
1702	Self-supervised Learning Representation based Accent Recognition with Persistent Accent Memory	➖	➖
800	Extremely Low Bit Quantization for Mobile Speaker Verification Systems Under 1MB Memory	➖	➖
1974	Unsupervised Out-of-Distribution Dialect Detection with Mahalanobis Distance	➖
105	Pyannote.Audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark and Recipe
1524	Model Compression for DNN-based Speaker Verification using Weight Quantization	➖
1354	Multi-resolution Approach to Identification of Spoken Languages and to Improve Overall Language Diarization System using Whisper Model	➖	➖
125	Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms	➖
849	Dynamic Fully-Connected Layer for Large-Scale Speaker Verification	➖	➖
1314	Mutual Information-based Embedding Decoupling for Generalizable Speaker Verification	➖	➖
1206	TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection	➖
777	ECAPA++: Fine-grained Deep Embedding Learning for TDNN Based Speaker Verification	➖	➖
100	Fooling Speaker Identification Systems with Adversarial Background Music	➖	➖
574	Target Active Speaker Detection with Audio-visual Cues
2401	Improving End-to-End Neural Diarization using Conversational Summary Representations	➖
2039	Phase Perturbation Improves Channel Robustness for Speech Spoofing Countermeasures
210	Improving Training Datasets for Resource-constrained Speaker Recognition Neural Networks	➖	➖
1498	Instance-based Temporal Normalization for Speaker Verification	➖	➖
881	On the Robustness of Wav2Vec 2.0 based Speaker Recognition Systems	➖	➖
697	P-vectors: A Parallel-coupled TDNN/Transformer Network for Speaker Verification
844	Reversible Neural Networks for Memory-Efficient Speaker Verification	➖	➖
452	Robust Training for Speaker Verification against Noisy Labels
1404	Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization	➖	➖
1217	Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022	➖
1648	Describing the Phonetics in the Underlying Speech Attributes for Deep and Interpretable Speaker Recognition		➖
1214	Range-Based Equal Error Rate for Spoof Localization	➖
1888	Exploring the English Accent-independent Features for Speech Emotion Recognition using Filter and Wrapper-based Methods for Feature Selection	➖	➖
205	Powerset Multi-class Cross Entropy Loss for Neural Speaker Diarization	➖	➖
394	A Method of Audio-Visual Person Verification by Mining Connections between Time Series	➖	➖
1249	Group GMM-ResNet for Detection of Synthetic Speech Attacks	➖	➖

Speech Synthesis and Voice Conversion

#	Title	Repo	Paper
2336	Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction	➖	➖
160	Streaming Parrotron for On-device Speech-to-Speech Conversion	➖
2407	Exploiting Emotion Information in Speaker Embeddings for Expressive Text-to-Speech	➖	➖
2518	E2E-S2S-VC: End-to-End Sequence-to-Sequence Voice Conversion	➖	➖
2403	DC CoMix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer	➖
419	Voice Conversion With Just Nearest Neighbors
1193	CFVC: Conditional Filtering for Controllable Voice Conversion	➖	➖
1157	DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding
39	Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion	➖	➖
836	ALO-VC: Any-to-any Low-latency One-shot Voice Conversion
1978	Evaluating and Reducing the Distance between Synthetic and Real Speech Distributions	➖
2202	Decoupling Segmental and Prosodic Cues of Non-native Speech through Vector Quantization	➖	➖
2383	VC-T: Streaming Voice Conversion based on Neural Transducer	➖	➖
191	Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion Preserving Voice Conversion	➖	➖
1788	ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
1356	Reverberation-Controllable Voice Conversion Using Reverberation Time Estimator	➖	➖
2558	Cross-utterance Conditioned Coherent Speech Editing	➖	➖

Speech and Language in Health: From Remote Monitoring to Medical Conversations

#	Title	Repo	Paper
963	Respiratory Distress Estimation in Human-robot Interaction Scenario	➖	➖
947	Towards Robust Paralinguistic Assessment for Real-world Mobile Health (mHealth) Monitoring: an Initial Study of Reverberation Effects on Speech	➖
301	On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition	➖
2079	Automatic Assessment of Alzheimer's across Three Languages using Speech and Language Features	➖	➖
1722	Relationship between LTAS-based Spectral Moments and Acoustic Parameters of Hypokinetic Dysarthria in Parkinson's Disease	➖	➖
1946	Active Learning for Abnormal Lung Sound Data Curation and Detection in Asthma	➖	➖
913	Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization	➖	➖
1263	Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition	➖
322	Use of Speech Impairment Severity for Dysarthric Speech Recognition	➖
1709	Bayesian Networks for the Robust and Unbiased Prediction of Depression and its Symptoms Utilizing Speech and Multimodal Data	➖
1332	Personalization for Robust Voice Pathology Detection in Sound Waves	➖	➖
2287	An Automatic Multimodal Approach to Analyze Linguistic and Acoustic Cues on Parkinson's Disease Patients	➖	➖
1997	Classifying Dementia in the Presence of Depression: A Cross-Corpus Study	➖	➖
2101	Non-uniform Speaker Disentanglement for Depression Detection from Raw Speech Signals	➖
296	FTA-net: A Frequency and Time Attention Network for Speech Depression Detection	➖	➖
2249	Integrated and Enhanced Pipeline System to Support Spoken Language Analytics for Screening Neurocognitive Disorders	➖	➖
1990	Capturing Mismatch between Textual and Acoustic Emotion Expressions for Mood Identification in Bipolar Disorder	➖
297	Exploiting Cross-Domain and Cross-Lingual Ultrasound Tongue Imaging Features for Elderly and Dysarthric Speech Recognition	➖
2100	Combining Multiple Multimodal Speech Features into an Interpretable Index Score for Capturing Disease Progression in Amyotrophic Lateral Sclerosis	➖
2002	Responsiveness, Sensitivity and Clinical Utility of Timing-Related Speech Biomarkers for Remote Monitoring of ALS Disease Progression	➖
753	PoCaPNet: A Novel Approach for Surgical Phase Recognition using Speech and X-Ray Images
1721	Classifying Depression Symptom Severity: Assessment of Speech Representations in Personalized and Generalized Machine Learning Models	➖	➖
1435	Towards Reference Speech Characterization for Health Applications	➖	➖
1438	The MASCFLICHT Corpus: Face Mask Type and Coverage Area Recognition from Speech	➖	➖
721	MMLung: Moving Closer to Practical Lung Health Estimation using Smartphones	➖
1916	Whisper Encoder features for Infant Cry Classification	➖	➖
464	Multi-class Detection of Pathological Speech with Latent Features: How does It Perform on Unseen Data?	➖
2146	Automatic Classification of Hypokinetic and Hyperkinetic Dysarthria based on GMM-Supervectors	➖	➖
1771	Prediction of the Gender-based Violence Victim Condition using Speech: What do Machine Learning Models rely on?	➖	➖

Novel Transformer Models for ASR

#	Title	Repo	Paper
2228	Conmer: Streaming Conformer without Self-attention for Interactive Voice Assistants	➖
1255	Intra-ensemble: A New Method for Combining Intermediate Outputs in Transformer-based Automatic Speech Recognition	➖	➖
1194	A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
1611	HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition	➖
893	Memory-augmented Conformer for Improved End-To-End Long-form ASR	➖	➖
552	Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems	➖

Speaker Recognition

#	Title	Repo	Paper
1294	An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification
1286	A Study on Visualization of Voiceprint Feature	➖	➖
1083	VoxTube: a Multilingual Speaker Recognition Dataset	➖	➖
1298	Visualizing Data Augmentation in Deep Speaker Recognition	➖
1565	Ordered and Binary Speaker Embedding	➖
2031	Self-FiLM: Conditioning GANs with Self-supervised Representations for Bandwidth Extension based Speaker Recognition	➖
1202	Curriculum Learning for Self-supervised Speaker Verification	➖
1558	Introducing Self-Supervised Phonetic Information for Text-Independent Speaker Verification	➖	➖
1379	A Teacher-Student Approach for Extracting Informative Speaker Embeddings from Speech Mixtures	➖
1479	Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification	➖

Cross-lingual and Multilingual ASR

#	Title	Repo	Paper
97	Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR System	➖
1338	UniSplice: Universal Cross-Lingual Data Splicing for Low-Resource ASR	➖	➖
772	Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes
1630	Fast and Efficient Multilingual Self-Supervised Pre-training for Low-Resource Speech Recognition	➖	➖
1061	Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages	➖
1444	DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model

Voice Conversion

#	Title	Repo	Paper
251	Emotional Voice Conversion with Semi-Supervised Generative Modeling	➖	➖
817	Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation	➖	➖
215	S2CD-VC: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion	➖	➖
1508	Flow-VAE VC: End-to-End Flow Framework with Contrastive Loss for Zero-shot Voice Conversion	➖	➖
1602	Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation
2298	End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions

Pathological Speech Analysis

#	Title	Repo	Paper
2093	Multimodal Assessment of Bulbar Amyotrophic Lateral Sclerosis (ALS) using a Novel Remote Speech Assessment App	➖	➖
2181	On the use of High Frequency Information for Voice Pathology Classification	➖	➖
1784	Do Phonatory Features Display Robustness to Characterize Parkinsonian Speech Across Corpora?	➖	➖
2531	Severity Classification of Parkinson's Disease from Speech using Single Frequency Filtering-based Features	➖	➖
1915	Comparison of Acoustic Measures of Dysphonia in Parkinson's Disease and Huntington's Disease: Effect of Sex and Speaking Task	➖	➖
1734	Alzheimer Disease Classification through ASR-based Transcriptions: Exploring the Impact of Punctuation and Pauses
1574	A Pipeline to Evaluate the Effects of Noise on Machine Learning Detection of Laryngeal Cancer	➖	➖
2474	ReCLR: Reference-Enhanced Contrastive Learning of Audio Representation for Depression Detection	➖	➖
234	Automated Multiple Sclerosis Screening Based on Encoded Speech Representations	➖	➖
1934	Cross-Lingual Features for Alzheimer's Dementia Detection from Speech	➖	➖
1653	Careful Whisper - Leveraging Advances in Automatic Speech Recognition for Robust and Interpretable Aphasia Subtype Classification	➖	➖
1868	Behavioral Analysis of Pathological Speaker Embeddings of Patients During Oncological Treatment of Oral Cancer	➖	➖

Multimodal Speech Emotion Recognition

#	Title	Repo	Paper
1832	LanSER: Language-Model Supported Speech Emotion Recognition	➖	➖
463	Fine-tuned RoBERTa Model with a CNN-LSTM Network for Conversational Emotion Recognition	➖	➖
1591	Emotion Label Encoding using Word Embeddings for Speech Emotion Recognition	➖	➖
2444	Discrimination of the Different Intents Carried by the Same Text Through Integrating Multimodal Information	➖	➖
510	Meta-domain Adversarial Contrastive Learning for Alleviating Individual Bias in Self-sentiment Predictions	➖	➖
413	SWRR: Feature Map Classifier Based on Sliding Window Attention and High-Response Feature Reuse for Multimodal Emotion Recognition	➖	➖

Phonetics, Phonology, and Prosody

#	Title	Repo	Paper
1443	Effects of Meter, Genre and Experience on Pausing, Lengthening and Prosodic Phrasing in German Poetry Reading	➖	➖
1142	Comparing First Spectral Moment of Australian English /s/ between Straight and Gay Voices using Three Analysis Window Sizes	➖	➖
2584	Universal Automatic Phonetic Transcription into the International Phonetic Alphabet		➖
2134	Voice Twins: Discovering Extremely Similar-sounding, Unrelated Speakers	➖	➖
1042	Filling the Population Statistics Gap: Swiss German Reference Data on F0 and Speech Tempo for Forensic Contexts	➖	➖
1619	Investigating the Syntax-Discourse Interface in the Phonetic Implementation of Discourse Markers	➖	➖
2214	Evaluation of a Forensic Automatic Speaker Recognition System with Emotional Speech Recordings	➖	➖
1052	An Outlier Analysis of Vowel Formants from a Corpus Phonetics Pipeline
340	The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features	➖	➖
1880	Beatboxing Kick Drum Kinematics	➖	➖
536	Effects of Hearing Loss and Amplification on Mandarin Consonant Perception	➖	➖
2020	An Acoustic Analysis of Fricative Variation in Three Accents of English	➖	➖
109	Acoustic Cues to Stress Perception in Spanish – a Mismatch Negativity Study	➖	➖
976	Bulgarian Unstressed Vowel Reduction: Received Views vs Corpus Findings	➖	➖
1764	An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations	➖
498	Identifying Stable Sections for Formant Frequency Extraction of French Nasal Vowels based on Difference Thresholds	➖	➖
1772	Nonbinary American English Speakers Encode Gender in Vowel Acoustics	➖	➖
44	Coarticulation of Sibe Vowels and Dorsal Fricatives in Spontaneous Speech: An Acoustic Study	➖	➖
1013	Using Speech Synthesis to Explain Automatic Speaker Recognition: a New Application of Synthetic Speech	➖	➖
2534	Same F0, Different Tones: A Multidimensional Investigation of Zhangzhou Tones	➖	➖
1985	Discovering Phonetic Feature Event Patterns in Transformer Embeddings	➖	➖
2204	A System for Generating Voice Source Signals that Implements the Transformed LF-model Parameter Control	➖	➖
2352	Speaker-independent Speech Inversion for Estimation of Nasalance	➖
1359	Effects of Tonal Coarticulation and Prosodic Positions on Tonal Contours of Low Rising Tones: In the Case of Xiamen Dialect	➖
2187	Durational and Non-durational Correlates of Lexical and Derived Geminates in Arabic	➖	➖
68	Mapping Phonemes to Acoustic Symbols and Codes using Synchrony in Speech Modulation Vectors Estimated by the Travelling-wave Filter Bank	➖	➖
1480	Rhythmic Characteristics of L2 German Speech by Advanced Chinese Learners	➖	➖
1538	(Dis)agreement and Preference Structure are Reflected in Matching Along Distinct Acoustic-prosodic Features	➖	➖
995	Vowel Reduction by Greek-speaking Children: The Effect of Stress and Word Length	➖	➖
1822	Pitch Distributions in a Very Large Corpus of Spontaneous Finnish Speech	➖	➖
828	Speech Enhancement Patterns in Human-Robot Interaction: A Cross-Linguistic Perspective	➖	➖
1903	Evaluation of Delexicalization Methods for Research on Emotional Speech	➖	➖

Speech Coding: Privacy

#	Title	Repo	Paper
1026	Masking Kernel for Learning Energy-Efficient Representations for Speaker Recognition and Mobile Health	➖
727	eSTImate: A Real-time Speech Transmission Index Estimator with Speech Enhancement Auxiliary Task Using Self-Attention Feature Pyramid Network	➖	➖
815	Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement	➖
2138	Privacy-preserving Representation Learning for Speech Understanding	➖	➖
448	Vocoder Drift in X-vector–based Speaker Anonymization
703	Malafide: a Novel Adversarial Convolutive Noise Attack Against Deepfake and Spoofing Detection Systems	➖

Analysis of Neural Speech Representations

#	Title	Repo	Paper
1087	Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
383	An Extension of Disentanglement Metrics and Its Application to Voice	➖	➖
2131	An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech
1823	SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?
1418	Comparison of GIF- and SSL-based Features in Pathological Voice Detection	➖	➖
1617	What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions	➖	➖

End-to-end ASR

#	Title	Repo	Paper
1640	End-to-End Joint Target and Non-Target Speakers ASR	➖
144	Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition	➖
564	Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction	➖	➖
101	Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition
906	Text-only Domain Adaptation for End-to-End ASR using Integrated Text-to-mel-Spectrogram Generator	➖
142	Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition	➖

Spoken Language Understanding, Summarization, and Information Retrieval

#	Title	Repo	Paper
461	Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling	➖	➖
277	Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models	➖	➖
1307	Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization	➖
1136	Audio Retrieval with WavText5K and CLAP Training
242	Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding
1652	Contrastive Disentangled Learning for Memory-Augmented Transformer	➖	➖

Invariant and Robust Pre-trained Acoustic Models

#	Title	Repo	Paper
438	ProsAudit, a Prosodic Benchmark for Self-Supervised Speech Models	➖
1390	CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning
847	Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering
871	Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces	➖
1862	Evaluating Context-invariance in Unsupervised Speech Representations
359	Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder	➖	➖

Speech Synthesis: Representation Learning

#	Title	Repo	Paper
1571	Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech	➖	➖
2313	Adapter-Based Extension of Multi-Speaker Text-To-Speech Model for New Speakers	➖
2574	SALTTS: Leveraging Self-Supervised Speech Representations for Improved Text-to-Speech Synthesis	➖	➖
2326	UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
677	LightVoc: an Upsampling-Free GAN Vocoder Based on Conformer and Inverse Short-time Fourier Transform	➖	➖
1095	ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings

Speech Perception, Production, and Acquisition

#	Title	Repo	Paper
1330	Human Transcription Quality Improvement	➖
1604	The Effect of Masking Noise on Listeners' Spectral Tilt Preferences	➖	➖
1967	The Effect of Whistled Vowels on Whistled Word Categorization for Naive Listeners	➖	➖
1481	Automatic Deep Neural Network-Based Segmental Pronunciation Error Detection of L2 English Speech (L1 Bengali)	➖	➖
1662	The Effect of Stress on Mandarin Tonal Perception in Continuous Speech for Spanish-speaking Learners	➖	➖
1918	Combining Acoustic and Aerodynamic Data Collection: A Perceptual Evaluation of Acoustic Distortions	➖	➖
953	Estimating Virtual Targets for Lingual Stop Consonants using General Tau Theory	➖	➖
1931	Using Random Forests to Classify Language as a Function of Syllable Timing in Two Groups: Children with Cochlear Implants and with Normal Hearing	➖	➖
2256	An Improved End-to-End Audio-Visual Speech Recognition Model	➖	➖
1954	What Influences the Foreign Accent Strength? Phonological and Grammatical Errors in the Perception of Accentedness	➖	➖
2077	Investigating the Perception Production Link through Perceptual Adaptation and Phonetic Convergence	➖	➖
1385	Emotion Prompting for Speech Emotion Recognition	➖	➖
1196	Speech-in-Speech Recognition is Modulated by Familiarity to Dialect	➖	➖
673	BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions
2046	Are Retroflex-to-dental Sibilant Substitutions in Polish Children's Speech an Example of a Covert Contrast? A Preliminary Acoustic Study	➖	➖
1123	First Language Effects on Second Language Perception: Evidence from English Low-vowel Nasal Sequences Perceived by L1 Mandarin Chinese Listeners	➖	➖
2247	Motor Control Similarity between Speakers Saying "a Souk" using Inverse Atlas Tongue Modeling	➖	➖
910	Assessing Phrase Break of ESL Speech with Pre-trained Language Models and Large Language Models	➖
317	A Relationship Between Vocal Fold Vibration and Droplet Production	➖	➖
803	Audio, Visual and Audiovisual Intelligibility of Vowels Produced in Noise	➖	➖
593	Computational Modeling of Auditory Brainstem Responses Derived from Modified Speech	➖	➖
1732	Leveraging Label Information for Multimodal Emotion Recognition	➖	➖
1465	Improving End-to-End Modeling for Mandarin-English Code-Switching using Lightweight Switch-Routing Mixture-of-Experts	➖	➖
1803	Frequency Patterns of Individual Speaker Characteristics at Higher and Lower Spectral Ranges	➖	➖
1818	Adaptation to Predictive Prosodic Cues in Non-native Standard Dialect	➖	➖
1007	Head Movements in Two- and Four-person Interactive Conversational Tasks in Noisy and Moderately Reverberant Conditions	➖	➖
334	Second Language Identification of Vietnamese Tones by Native Mandarin Learners	➖	➖
203	Nasal Vowel Production and Grammatical Processing in French-speaking Children with Cochlear Implants and Normal-hearing Peers	➖	➖
412	Emotion Classification with EEG Responses Evoked by Emotional Prosody of Speech	➖	➖
145	L2-Mandarin Regional Accent Variability During Mandarin Tone-word Training Facilitates English Listeners' Subsequent Tone Categorizations	➖	➖
1680	HumanDiffusion: Diffusion Model using Perceptual Gradients	➖
2087	Queer Events, Relationships, and Sports: Does Topic Influence Speakers' Acoustic Expression of Sexual Orientation?	➖	➖
172	Optimal Control of Speech with Context-dependent Articulatory Targets	➖	➖

Acoustic Model Adaptation for ASR

#	Title	Repo	Paper
583	Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems	➖
1349	Text Only Domain Adaptation with Phoneme Guided Data Splicing for End-to-End Speech Recognition
327	Towards Cross-Lingual Cross-Age Adaptation for Low-Resource Elderly Speech Emotion Recognition
2215	Modular Domain Adaptation for Conformer-based Streaming ASR	➖
2192	Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters	➖
1282	SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization

Speech Synthesis: Expressivity

#	Title	Repo	Paper
858	Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions	➖	➖
2242	Dual Audio Encoders Based Mandarin Prosodic Boundary Prediction by using Multi-Granularity Prosodic Representations	➖	➖
645	NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS
782	MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy
2469	Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS	➖	➖
843	CASEIN: Cascading Explicit and Implicit Control for Fine-grained Emotion Intensity Regulation	➖
1405	Semi-supervised Learning for Continuous Emotional Intensity Controllable Speech Synthesis with Disentangled Representations	➖
1905	Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis	➖	➖
1460	ComedicSpeech: Adaptive Text To Speech For Stand-up Comedy in Low-Resource Scenario
1552	Neural Speech Synthesis with Enriched Phrase Boundaries	➖	➖
437	Cross-lingual Prosody Transfer for Expressive Machine Dubbing	➖
2178	Synthesis after a couple PINTs: Investigating the Role of Pause-Internal Phonetic Particles in Speech Synthesis and Perception	➖	➖
433	Accentor: An Explicit Lexical Stress Model for TTS Systems	➖
1032	A Neural TTS System with Parallel Prosody Transfer from Unseen Speakers	➖	➖
715	Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
289	Prosody Modeling with 3D Visual Information for Expressive Video Dubbing	➖	➖
1528	LightClone: Speaker-guided Parallel Subnet Selection for Few-shot Voice Cloning	➖	➖
1671	EE-TTS: Emphatic Expressive TTS with Linguistic Information
1673	Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS	➖
122	ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading
1779	PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions
1639	Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models
2453	A Generative Framework for Conversational Laughter: Its "Language Model" and Laughter Sound Synthesis	➖
1754	Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis	➖	➖
2072	Beyond Style: Synthesizing Speech with Pragmatic Functions	➖
965	eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer	➖

Multi-modal Systems

#	Title	Repo	Paper
1146	BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion
370	Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning	➖
989	Whistle-to-Text: Automatic Recognition of the Silbo Gomero Whistled Language	➖	➖
663	A Novel Interpretable and Generalizable Re-synchronization Model for Cued Speech based on a Multi-Cuer Corpus	➖
668	Visually Grounded Few-shot Word Acquisition with Fewer Shots	➖
183	JAMFN: Joint Attention Multi-Scale Fusion Network for Depression Detection	➖	➖

Question Answering from Speech

#	Title	Repo	Paper
1485	Prompt Guided Copy Mechanism for Conversational Question Answering	➖	➖
1240	Composing Spoken Hints for Follow-on Question Suggestion in Voice Assistants	➖	➖
1391	On Monotonic Aggregation for Open-domain QA	➖	➖
2240	Question-Context Alignment and Answer-Context Dependencies for Effective Answer Sentence Selection	➖
1606	Multi-Scale Attention for Audio Question Answering	➖
539	Enhancing Visual Question Answering via Deconstructing Questions and Explicating Answers	➖	➖

Multi-talker Methods in Speech Processing

#	Title	Repo	Paper
1749	SEF-Net: Speaker Embedding Free Target Speaker Extraction Network	➖	➖
1530	Overlap Aware Continuous Speech Separation without Permutation Invariant Training	➖	➖
1952	Cascaded Encoders for Fine-Tuning ASR Models on Overlapped Speech	➖
2069	TokenSplit: using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition	➖	➖
1422	Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator	➖
2098	Time-domain Transformer-based Audiovisual Speaker Separation	➖	➖
628	Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization	➖
1502	Unsupervised Adaptation with Quality-Aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization	➖	➖
1521	BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR	➖
1172	Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation	➖	➖
975	Joint Compensation of Multi-talker Noise and Reverberation for Speech Enhancement with Cochlear Implants using One or More Microphones	➖	➖
494	Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach	➖	➖
42	GPU-accelerated Guided Source Separation for Meeting Transcription
1280	Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition
2076	Directional Speech Recognition for Speaker Disambiguation and Cross-talk Suppression	➖	➖
1815	Mixture Encoder for Joint Speech Separation and Recognition	➖

Sociophonetics

#	Title	Repo	Paper
206	Aberystwyth English Pre-aspiration in Apparent Time	➖	➖
1154	Speech Entrainment in Chinese Story-Style Talk Shows: The Interaction Between Gender and Role	➖	➖
1414	Sociodemographic and Attitudinal Effects on Dialect Speakers' Articulation of the Standard Language: Evidence from German-Speaking Switzerland	➖	➖
1704	Vowel Normalisation in Latent Space for Sociolinguistics	➖	➖

Speaker and Language Diarization

#	Title	Repo	Paper
1228	Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor	➖
1447	Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism	➖	➖
2367	The DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments
1982	Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction	➖
1839	The SpeeD-ZevoTech submission at DISPLACE 2023	➖	➖
656	End-to-End Neural Speaker Diarization with Absolute Speaker Loss	➖	➖

Anti-Spoofing for Speaker Verification

#	Title	Repo	Paper
1402	Towards Single Integrated Spoofing-aware Speaker Verification Embeddings
1352	Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification	➖
2335	Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion	➖
1166	Robust Audio Anti-Spoofing Countermeasure with Joint Training of Front-end and Back-end Models	➖	➖
1537	Improved DeepFake Detection using Whisper Features
371	DoubleDeceiver: Deceiving the Speaker Verification System Protected by Spoofing Countermeasures	➖	➖

Speech Coding: Intelligibility

#	Title	Repo	Paper
2209	On Training a Neural Residual Acoustic Echo Suppressor for Improved ASR	➖	➖
1429	Extending DNN-based Multiplicative Masking to Deep Subband Filtering for Improved Dereverberation
378	UnSE: Unsupervised Speech Enhancement using Optimal Transport	➖	➖
1130	MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation
2177	Causal Signal-based DCCRN with Overlapped-Frame Prediction for Online Speech Enhancement	➖	➖
1511	Gesper: A Restoration-Enhancement Framework for General Speech Reconstruction	➖

New Computational Strategies for ASR Training and Inference

#	Title	Repo	Paper
2183	A Metric-Driven Approach to Conformer Layer Pruning for Efficient ASR Inference	➖	➖
1981	Distillation Strategies for Discriminative Speech Recognition Rescoring	➖
969	Another Point of View on Visual Speech Recognition	➖	➖
1062	RASR2: The RWTH ASR Toolkit for Generic Sequence-to-Sequence Speech Recognition	➖
486	Streaming Speech-to-Confusion Network Speech Recognition	➖
809	Accurate and Structured Pruning for Efficient Automatic Speech Recognition	➖

MERLIon CCS Challenge: Multilingual Everyday Recordings - Language Identification On Code-Switched Child-Directed Speech

#	Title	Repo	Paper
1335	Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech
1707	Investigating Model Performance in Language Identification: beyond Simple Error Statistics	➖
2533	Improving Wav2vec2-based Spoken Language Identification by Learning Phonological Features	➖	➖
2047	Language Identification Networks for Multilingual Everyday Recordings	➖	➖

Health-Related Speech Analysis

#	Title	Repo	Paper
2038	Classification of Vocal Intensity Category from Speech using the Wav2vec2 and Whisper Embeddings	➖	➖
1668	The Effect of Clinical Intervention on the Speech of Individuals with PTSD: Features and Recognition Performances	➖	➖
470	Analysis and Automatic Prediction of Exertion from Speech: Contrasting Objective and Subjective Measures Collected while Running	➖	➖
894	The Androids Corpus: A New Publicly Available Benchmark for Speech Based Depression Detection	➖	➖
658	Comparing Hand-Crafted Features to Spectrograms for Autism Severity Estimation	➖	➖
839	Acoustic Characteristics of Depression in Older Adults' Speech: the Role of Covariates	➖	➖

Automatic Audio Classification and Audio Captioning

#	Title	Repo	Paper
943	Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning	➖
1564	Adapting a ConvNeXt Model to Audio Classification on AudioSet
1610	Few-shot Class-incremental Audio Classification using Stochastic Classifier
1614	Enhance Temporal Relations in Audio Captioning with Sound Event Detection	➖

Speech Synthesis

#	Title	Repo	Paper
407	Epoch-Based Spectrum Estimation for Speech	➖	➖
1996	OverFlow: Putting Flows on Top of Neural Transducers for Better TTS
1568	AdapterMix: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation
506	Prior-free Guided TTS: An Improved and Efficient Diffusion-based Text-Guided Speech Synthesis	➖	➖
367	UnDiff: Unsupervised Voice Restoration with Unconditional Diffusion Model	➖
1301	Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech	➖	➖
1151	Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge
879	Towards Robust FastSpeech 2 by Modelling Residual Multimodality
1137	Real Time Spectrogram Inversion on Mobile Phone
58	Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis	➖
2056	A Low-Resource Pipeline for Text-to-Speech from Found Data With Application to Scottish Gaelic	➖	➖
2173	Self-Supervised Solution to the Control Problem of Articulatory Synthesis	➖	➖
1128	Hierarchical Timbre-Cadence Speaker Encoder for Zero-shot Speech Synthesis		➖
754	ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
690	Improving WaveRNN with Heuristic Dynamic Blending for Fast and High-Quality GPU Vocoding	➖	➖
194	Intelligible Lip-to-speech Synthesis with Speech Units	➖
1212	Parameter-Efficient Learning for Text-to-Speech Accent Adaptation
820	Controlling Formant Frequencies with Neural Text-to-Speech for the Manipulation of Perceived Speaker Age	➖	➖
2379	FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs
1726	iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder using 1D-2D CNN	➖	➖
534	VITS2: Improving Quality and Efficiency of Single Stage Text to Speech with Adversarial Learning and Architecture Design	➖	➖
1175	Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme	➖	➖

Speech Synthesis: Controllability and Adaptation

#	Title	Repo	Paper
1608	HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer		➖
391	VISinger2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer
700	EdenTTS: A Simple and Efficient Parallel Text-to-speech Architecture with Collaborative Duration-alignment Learning	➖	➖
368	Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations	➖	➖
1020	Speech inpainting: Context-based Speech Synthesis Guided by Video
2243	STEN-TTS: Improving Zero-shot Cross-Lingual Transfer for Multi-Lingual TTS with Style-Enhanced Normalization Diffusion Framework	➖	➖

Search Methods and Decoding Algorithms for ASR

#	Title	Repo	Paper
33	Average Token Delay: A Latency Metric for Simultaneous Translation	➖
1450	Automatic Speech Recognition Transformer with Global Contextual Information Decoder	➖	➖
1333	Time-synchronous One-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training	➖	➖
2065	Prefix Search Decoding for RNN Transducers	➖	➖
78	WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
2449	Implementing Contextual Biasing in GPU Decoder for Online ASR

Speech Signal Analysis

#	Title	Repo	Paper
2487	MF-PAM: Accurate Pitch Estimation through Periodicity Analysis and Multi-level Feature Fusion	➖
2211	Enhancing Speech Articulation Analysis Using A Geometric Transformation of the X-ray Microbeam Dataset	➖
1729	Matching Acoustic and Perceptual Measures of Phonation Assessment in Disordered Speech - A Case Study	➖	➖
283	Improved Contextualized Speech Representations for Tonal Analysis	➖	➖
1738	A Study on the Importance of Formant Transitions for Stop-Consonant Classification in VCV Sequence	➖
2229	FusedF0: Improving DNN-based F0 Estimation by Fusion of Summary-Correlograms and Raw Waveform Representations of Speech Signals	➖

Connecting Speech-science and Speech-technology for Children's Speech

#	Title	Repo	Paper
928	Using Commercial ASR Solutions to Assess Reading Skills in Children: A Case Report	➖	➖
907	Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition	➖
2185	Speech Breathing Behavior During Pauses in Children	➖	➖
926	Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech	➖
1924	Acoustic-to-Articulatory Speech Inversion Features for Mispronunciation Detection of /r/ in Child Speech Sound Disorders	➖
978	BabySLM: Language-acquisition-friendly Benchmark of Self-supervised Spoken Language Models
702	Data Augmentation for Children ASR and Child-adult Speaker Classification using Voice Conversion Methods	➖	➖
2236	Developmental Articulatory and Acoustic Features for Six to Ten Year Old Children	➖	➖
2251	Automatically Predicting Perceived Conversation Quality in a Pediatric Sample Enriched for Autism	➖	➖
1257	An Equitable Framework for Automatically Assessing Children's Oral Narrative Language Abilities	➖	➖
743	An Analysis of Goodness of Pronunciation for Child Speech	➖	➖
1569	Measuring Language Development from Child-centered Recordings	➖	➖
2057	Speaking Clearly, Understanding Better: Predicting the L2 Narrative Comprehension of Chinese Bilingual Kindergarten Children Based on Speech Intelligibility using a Machine Learning Approach	➖	➖
312	Classifying Rhoticity of /r/ in Speech Sound Disorder using Age-and-Sex Normalized Formants	➖
1273	Understanding Spoken Language Development of Children with ASD Using Pre-trained Speech Embeddings	➖
2099	Measuring Phonological Precision in Children with Cleft Lip and Palate	➖	➖
937	A Study on using Duration and Formant Features in Automatic Detection of Speech Sound Disorder in Children	➖	➖
1873	Influence of Utterance and Speaker Characteristics on the Classification of Children with Cleft Lip and Palate	➖	➖
1882	Prospective Validation of Motor-Based Intervention with Automated Mispronunciation Detection of Rhotics in Residual Speech Sound Disorders	➖

Dialog Management

#	Title	Repo	Paper
2238	Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning	➖
2525	An Autoregressive Conversational Dynamics Model for Dialogue Systems	➖	➖
1983	Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos	➖
1037	Speech Aware Dialog System Technology Challenge (DSTC11)
1397	Knowledge-Retrieval Task-Oriented Dialog Systems with Semi-Supervision
2513	Tracking Must Go On: Dialogue State Tracking with Verified Self-Training	➖	➖

Speech Activity Detection and Modeling

#	Title	Repo	Paper
558	GL-SSD: Global and Local Speech Style Disentanglement by Vector Quantization for Robust Sentence Boundary Detection in Speech Stream	➖	➖
598	Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction	➖
2466	Dynamic Encoder RNN for Online Voice Activity Detection in Adverse Noise Conditions	➖	➖
996	Point to the Hidden: Exposing Speech Audio Splicing via Signal Pointer Nets	➖
716	Real-Time Causal Spectro-Temporal Voice Activity Detection based on Convolutional Encoding and Residual Decoding	➖	➖
2413	SVVAD: Personal Voice Activity Detection for Speaker Verification	➖

Multilingual Models for ASR

#	Title	Repo	Paper
1613	Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition	➖
2122	AfriNames: Most ASR models "butcher" African Names	➖
2528	Towards Dialect-inclusive Recognition in a Low-resource Language: are Balanced Corpora the Answer?	➖	➖
2588	Svarah: Evaluating English ASR Systems on Indian Accents
1044	N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition	➖
1014	The MALACH Corpus: Results with End-to-End Architectures and Pretraining	➖	➖

Speech Enhancement and Bandwidth Expansion

#	Title	Repo	Paper
232	Unsupervised Speech Enhancement with Deep Dynamical Generative Speech and Noise Models	➖
857	Noise-Robust Bandwidth Expansion for 8K Speech Recordings	➖	➖
113	mdctGAN: Taming Transformer-based GAN for Speech Super-resolution with Modified DCT Spectra
625	Zoneformer: On-device Neural Beamformer for In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation	➖	➖
634	Low-complexity Broadband Beampattern Synthesis using Array Response Control	➖	➖
904	A GAN Speech Inpainting Model for Audio Editing Software	➖	➖

Articulation

#	Title	Repo	Paper
2316	Deep Speech Synthesis from MRI-Based Articulatory Representations
562	Learning to Compute the Articulatory Representations of Speech with the MIRRORNET
804	Generating High-resolution 3D Real-time MRI of the Vocal Tract	➖	➖
1593	Exploring a Classification Approach using Quantised Articulatory Movements for Acoustic to Articulatory Inversion	➖	➖

Neural Processing of Speech and Language: Encoding and Decoding the Diverse Auditory Brain

#	Title	Repo	Paper
633	Coherence Estimation Tracks Auditory Attention in Listeners with Hearing Impairment	➖	➖
2378	Enhancing the EEG Speech Match Mismatch Tasks With Word Boundaries
1347	Similar Hierarchical Representation of Speech and Other Complex Sounds in the Brain and Deep Residual Networks: an MEG Study	➖	➖
121	Speech Taskonomy: Which Speech Tasks are the most Predictive of fMRI Brain Activity?	➖
282	MEG Encoding using Word Context Semantics in Listening Stories	➖
1949	Investigating the Cortical Tracking of Speech and Music with Sung Speech	➖	➖
414	Exploring Auditory Attention Decoding using Speaker Features	➖	➖
1776	Effects of Spectral Degradation on the Cortical Tracking of the Speech Envelope	➖	➖
964	Effects of Spectral and Temporal Modulation Degradation on Intelligibility and Cortical Tracking of Speech Signals	➖	➖

Perception of Paralinguistics

#	Title	Repo	Paper
2061	Transfer Learning for Personality Perception via Speech Emotion Recognition	➖
1131	A Stimulus-Organism-Response Model of Willingness to Buy from Advertising Speech using Voice Quality	➖	➖
1835	Voice Passing: a Non-Binary Voice Gender Prediction System for evaluating Transgender Voice Transition	➖	➖
1139	Influence of Personal Traits on Impressions of One's Own Voice	➖	➖
887	Pardon my Disfluency: The Impact of Disfluency Effects on the Perception of Speaker Competence and Confidence	➖	➖
711	Cross-linguistic Emotion Perception in Human and TTS Voices		➖

Technologies for Child Speech Processing

#	Title	Repo	Paper
1302	Joint Learning Feature and Model Adaptation for Unsupervised Acoustic Modelling of Child Speech	➖	➖
1681	Automatic Assessment of Oral Reading Accuracy for Reading Diagnostics
2084	An ASR-enabled Reading Tutor: Investigating Feedback to Optimize Interaction for Learning to Read	➖
935	Adaptation of Whisper Models to Child Speech Recognition	➖	➖

Speech Synthesis: Multilinguality; Evaluation

#	Title	Repo	Paper
2064	Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis
441	Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer	➖
1691	Robust Feature Decoupling in Voice Conversion by using Locality-Based Instance Normalization		➖
612	Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network	➖	➖
2148	The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech
1727	GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech
1285	Analysis of Mean Opinion Scores in Subjective Evaluation of Synthetic Speech based on Tail Probabilities	➖	➖
1584	LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
1067	UniFLG: Unified Facial Landmark Generator from Text or Speech
444	XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech
2224	ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus
154	Diffusion-based Accent Modelling in Speech Synthesis	➖	➖
249	Multilingual Text-to-Speech Synthesis for Turkic Languages using Transliteration
553	CVTE-Poly: A New Benchmark for Chinese Polyphone Disambiguation	➖	➖
709	Improve Bilingual TTS using Language and Phonology Embedding with Embedding Strength Modulator
2179	High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units
1097	PronScribe: Highly Accurate Multimodal Phonemic Transcription From Speech and Text	➖	➖
2158	Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages	➖
416	Why We Should Report the Details in Subjective Evaluation of TTS More Rigorously	➖
1622	Speaker-Independent Neural Formant Synthesis
1098	CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center
430	SASPEECH: A Hebrew Single Speaker Dataset for Text To Speech and Voice Conversion		➖