NeXt-TDNN for Speaker Verification [ICASSP 2024]

This repository is the official implementation of "NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification" accepted in ICASSP 2024 Paper Link (Arxiv) / Paper Link (IEEE)

News

🔥 December, 2023: We have uploaded the pre-trained models of our NeXt-TDNN in the experiments folder!

🔥 February 2024, the NeXt-TDNN model was updated with cyclic learning rate scheduling. This update improved the EER from 0.79/1.04/1.82% to 0.72/0.94/1.68% in VoxCeleb1-O/E/H. Changes were made to the LR scheduling, gradient clipping value, and batch size. Please check configs/NeXt_TDNN_C256_B3_K65_7_cyclical_lr_step.py for details.

0. Getting Start

Prerequisites

This code requires the following:

lightning == 2.1.2

Installation

CUDA, PyToch installation

# CUDA
conda install -c "nvidia/label/cuda-11.8.0" cuda

# PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

Data preparation

VoxCeleb Dataset
- To download VoxCeleb dataset fot train/test, execute the command described in the Data preparation section of the voxceleb_trainer repository
- Download VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H for test and locate it data directory

1. Model Training

To train ASV model, run main script in train mode. You can select the desired training configuration through config argument.

to train NeXt-TDNN(C=256, B=3)

python main.py --mode train --config configs/NeXt_TDNN_C256_B3_K65_7

2. Model Test

To test on VoxCeleb1, run the script below. As in training, select the desired test configuration.

# VoxCeleb1-O
python main.py --mode test --config configs/NeXt_TDNN_C256_B3_K65_7

# ⚡ VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H
python main.py --mode test_all --config configs/NeXt_TDNN_C256_B3_K65_7

3. Reference

4. Citation

If you find our work useful, please refer to

@INPROCEEDINGS{10447037,
  author={Heo, Hyun-Jun and Shin, Ui-Hyeop and Lee, Ran and Cheon, YoungJu and Park, Hyung-Min},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification}, 
  year={2024},
  volume={},
  number={},
  pages={11186-11190},
  keywords={Convolution;Speech recognition;Transformers;Acoustics;Task analysis;Speech processing;speaker recognition;speaker verification;TDNN;ConvNeXt;multi-scale},
  doi={10.1109/ICASSP48485.2024.10447037}}

dmlguq456/NeXt_TDNN_ASV