Duality-Temporal-Channel-Frequency-Attention-Enhanced-Speaker-Representation-Learning

Unofficial implementation of "Duality Temporal-Channel-Frequency Attention Enhanced Speaker Representation Learning" (ASRU 2021).

L. Zhang, Q. Wang and L. Xie, "Duality Temporal-Channel-Frequency Attention Enhanced Speaker Representation Learning," 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 206-213, doi: 10.1109/ASRU51503.2021.9688243.
arXiv: https://arxiv.org/abs/2110.06565

To use this code, you will need to modify some parts of it.

Training

'train_dir': the waveform files in this directory are loaded for training.
'valid_dir', 'test_dir': same as 'train_dir', for the validation and test sets.
To use other features, add a feature extraction function to feature_extraction.py and then set 'feature_name_list': [feature_func_name1, feature_func_name2, ...]. All features are assumed to have the same sequence length.
When using multiple features, 'feature_kwargs' can be declared as a list. Arguments for the feature extraction functions can be given either as a list of dictionaries, one per function, or as a single dictionary containing all parameters.
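
As a rough illustration of the options above, here is a minimal sketch of two feature functions and a matching config. The functions log_energy and zero_crossing_rate, the FEATURES lookup table, the directory paths, and the keyword names frame_length and hop_length are all hypothetical and are not part of this repository. Whether 'feature_name_list' expects strings or function objects is also an assumption here, so check feature_extraction.py before copying this.

```python
import numpy as np


def _frame(wav, frame_length, hop_length):
    # Split a 1-D waveform into overlapping frames: (num_frames, frame_length).
    num_frames = 1 + (len(wav) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(num_frames)[:, None])
    return wav[idx]


def log_energy(wav, frame_length=400, hop_length=160):
    # Hypothetical feature: log energy of each frame.
    frames = _frame(wav, frame_length, hop_length)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)


def zero_crossing_rate(wav, frame_length=400, hop_length=160):
    # Hypothetical feature: fraction of sign changes within each frame.
    signs = np.signbit(_frame(wav, frame_length, hop_length))
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


# Hypothetical name-to-function lookup, used only for this demonstration.
FEATURES = {'log_energy': log_energy, 'zero_crossing_rate': zero_crossing_rate}

config = {
    'train_dir': 'data/train',  # placeholder paths
    'valid_dir': 'data/valid',
    'test_dir': 'data/test',
    'feature_name_list': ['log_energy', 'zero_crossing_rate'],
    # One dictionary per feature function, in the same order as above.
    'feature_kwargs': [
        {'frame_length': 400, 'hop_length': 160},
        {'frame_length': 400, 'hop_length': 160},
    ],
}

# Check that every feature has the same sequence length for a dummy waveform.
wav = np.random.default_rng(0).standard_normal(16000)
features = [FEATURES[name](wav, **kwargs)
            for name, kwargs in zip(config['feature_name_list'],
                                    config['feature_kwargs'])]
assert len({len(f) for f in features}) == 1
```

Using the same frame_length and hop_length for every feature is what keeps the sequence lengths equal in this sketch.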

To run verification on the validation set or the test set, some additional information is needed.
Trainer().verify(epoch, enroll_query_label_triplet, mode)
epoch: load the model trained for config['epoch']
mode: 'valid' -> config['valid_dir'], 'test' -> config['test_dir']
enroll_query_label_triplet: a list of tuples, each either (enrollment_utterance_index, query_utterance_index, True/False) or (enrollment_utterance_index, query_utterance_index). The third element states whether the enrollment and query utterances correspond. Each index refers to a position in the list of wav files in config[f'{mode}_dir'], sorted by filename. With the three-element form, information such as the equal error rate (EER) is returned; with the two-element form, only the similarity scores are returned.
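
A minimal sketch of how such a list could be built is shown below. The helper build_triplets, the example filenames, and the "speaker_utterance.wav" naming convention are hypothetical. The sketch also assumes that "correspondence" means "same speaker" and that sorting plain filenames matches the order used inside the trainer, which may not hold if full paths are sorted instead.

```python
from itertools import combinations


def build_triplets(filenames, speaker_of):
    # Sort the filenames so that indices match the sorted wav list.
    ordered = sorted(filenames)
    # Pair every utterance with every later one and label the pair True
    # when both utterances come from the same speaker.
    triplets = [
        (i, j, speaker_of(ordered[i]) == speaker_of(ordered[j]))
        for i, j in combinations(range(len(ordered)), 2)
    ]
    return ordered, triplets


# Hypothetical filenames following a "speaker_utterance.wav" convention.
files = ['spk2_a.wav', 'spk1_b.wav', 'spk1_a.wav']
ordered, triplets = build_triplets(files, lambda name: name.split('_')[0])

# Dropping the label gives the two-element form (similarity scores only).
score_only = [(i, j) for i, j, _ in triplets]

# With the repository's Trainer available, the call would then look like:
# Trainer().verify(epoch, triplets, 'valid')
```

Here ordered is ['spk1_a.wav', 'spk1_b.wav', 'spk2_a.wav'], so the pair (0, 1) is labelled True and the pairs (0, 2) and (1, 2) are labelled False.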