With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks. Previous efforts to address this concern have focused primarily on single-channel audio. In this work, we introduce a novel neural network-based replay attack detection model that further leverages the spatial information of multi-channel audio and significantly improves replay attack detection performance.
June 30, 2020:
- New uploads; they might contain bugs.
ReMASC Corpus (92.3GB)
This is the data we used in the paper. Please download it from [IEEE DataPort]; it is free. You will need an IEEE account to download, which is also free.
The complete corpus consists of two disjoint sets:
- Core Set: the suggested training and development set.
- Evaluation Set: the suggested evaluation set.
In this work, we use the official core/evaluation split.
If you use our neural network code, please cite the following paper:
Yuan Gong, Jian Yang, Christian Poellabauer, "Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method," IEEE Signal Processing Letters, 2020.
If you also use the data, please cite the following paper:
Yuan Gong, Jian Yang, Jacob Huber, Mitchell MacKnight, Christian Poellabauer, "ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems", Interspeech 2019.
1. Prepare the dataset
- Clone the GitHub repository. Download the ReMASC corpus from [here] (it is completely free) and place it in the `data/` directory. We tested the code with `torch==1.5.0`, `torchaudio==0.4.0`, and `numpy==1.18.4`; check the other dependencies we use in `requirement.txt`.
- In `src/constants.py`, line 9, change `PROJ_PATH` to your project path.
- In `src/uniform_sample_rate.py`, line 14, change it to your desired sampling rate (in this work, we use 44100), then run `python src/uniform_sample_rate.py`.
- Validate the data preparation by running `python src/data_loader.py`; you should see a plot of a waveform from the dataset (a standalone check is also sketched after this list).
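If you prefer a quick standalone check before (or in addition to) `src/data_loader.py`, the minimal sketch below confirms that `torchaudio` can read a recording placed in `data/`. The file path is a placeholder, not part of the actual corpus layout; point it at any .wav file you downloaded.

```python
# Minimal sanity check: load one recording and plot its waveform.
# The path below is a placeholder; point it to a real .wav file under data/.
import torchaudio
import matplotlib.pyplot as plt

wav_path = "data/example_recording.wav"  # placeholder, adjust to your corpus copy
waveform, sample_rate = torchaudio.load(wav_path)  # waveform shape: (channels, samples)

print(f"channels={waveform.shape[0]}, samples={waveform.shape[1]}, sr={sample_rate}")

# Plot the first channel; multi-channel recordings have several rows in `waveform`.
plt.plot(waveform[0].numpy())
plt.title(f"Waveform (channel 0), sample rate = {sample_rate} Hz")
plt.xlabel("sample index")
plt.ylabel("amplitude")
plt.show()
```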
2. Select the hyper-parameters and run the experiment
In `src/exp_full.py`, lines 81-91, the hyper-parameters are defined:
bsize_list = [64]
lr_list = [1e-5]
rdevice_list = [1, 2, 3, 4]
audio_len_list = [1.0]
filter_num_list = [64]
sr_list = [44100]
mch_setting = [True]
frame_time_list = [0.02]
where `bsize_list` defines the list of batch sizes, `lr_list` the list of learning rates, `rdevice_list` the list of recording devices, `audio_len_list` the list of used audio lengths, `filter_num_list` the list of convolution filter numbers in the first layer, `sr_list` the list of sampling rates (you must first convert the sample rate using `src/uniform_sample_rate.py` before running experiments), `mch_setting` whether real multi-channel input is used (this should be `True` unless you are running an ablation study), and `frame_time_list` the list of frame window sizes in seconds. Note that you can test different settings in one run by putting multiple values in a list (e.g., `bsize_list=[8, 16, 32, 64]`); all hyper-parameter combinations will be tested. Keep in mind that the number of combinations is the product of the list lengths, so the running time grows quickly (see the sketch below).
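As a rough illustration of how the grid is expanded (a sketch of the general idea, not the exact loop in `src/exp_full.py`), every combination of the listed values gets its own run:

```python
# Illustration only: enumerating a full grid of hyper-parameter combinations.
# The list names mirror those in src/exp_full.py; the loop body is hypothetical.
import itertools

bsize_list = [8, 16, 32, 64]
lr_list = [1e-5, 1e-4]
sr_list = [44100]

for bsize, lr, sr in itertools.product(bsize_list, lr_list, sr_list):
    # In the real script, one training run is launched per combination, so the
    # total number of runs is len(bsize_list) * len(lr_list) * len(sr_list).
    print(f"batch size = {bsize}, learning rate = {lr}, sampling rate = {sr}")
```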
Then run:
python src/exp_full.py -d 0 -n exp1 -s 0
where `-d` is the GPU device index; `-n` is the experiment name, which will also be the name of the output folder under `exp/`; and `-s` is the random seed. You should see the loss and EER printed each epoch, and the results will be stored in `exp/exp_name`.
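For reference, the equal error rate (EER) printed each epoch is the operating point where the false acceptance and false rejection rates are equal. The snippet below shows one common way to estimate it from scores and labels; it is a generic recipe, not necessarily the exact routine used in this repository.

```python
# Generic EER computation from detection scores (sketch, not the repo's own code).
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """labels: 1 for genuine, 0 for replay; scores: higher means more likely genuine."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # The EER is where the false positive rate crosses the false negative rate.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage:
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.1])
print(f"EER = {compute_eer(labels, scores):.3f}")
```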
3. Use your own model
The model we propose is in `src/model.py`; you can revise it or replace it with your own model.
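If you plug in your own architecture, it needs to consume the multi-channel audio features produced by the data loader and output a two-class (genuine vs. replay) prediction. The template below is only a hypothetical sketch with assumed tensor shapes; it is not the interface of `src/model.py`, so check that file for the shapes actually used.

```python
# Hypothetical replacement-model template; the input/output shapes are assumptions,
# not the actual interface defined in src/model.py.
import torch
import torch.nn as nn

class MyReplayDetector(nn.Module):
    def __init__(self, num_channels=4, num_filters=64):
        super().__init__()
        # Treat the multi-channel time-frequency input as stacked 2-D feature maps.
        self.conv = nn.Sequential(
            nn.Conv2d(num_channels, num_filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(num_filters, 2)  # genuine vs. replay

    def forward(self, x):
        # x: (batch, channels, freq_bins, time_frames) -- assumed layout
        h = self.conv(x).flatten(1)
        return self.fc(h)

# Quick shape check with a dummy batch.
dummy = torch.randn(8, 4, 64, 50)
print(MyReplayDetector()(dummy).shape)  # torch.Size([8, 2])
```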
If you have a question, please raise an issue in this GitHub repository. You can also contact Yuan Gong (ygong1@nd.edu).