Replicating paper results
iankur opened this issue · 13 comments
Hi,
I am trying to replicate the best results reported in the paper with unfrozen Whisper features, MFCC, and MesoNet. The EER I get for the stage 1 model (encoder frozen) is close to the frozen case reported in the paper. But when I use this checkpoint for further fine-tuning, i.e. with the encoder unfrozen, I get an EER of 0.35, which is much worse than the reported result.
Any pointers on what could be going wrong? I am following the instructions provided in the repo for running these experiments without any changes.
Hi,
could you please provide the exact configs which you used?
I will rerun the trainings and I will come back soon with the results.
Did you modify the codebase in any way, e.g. anything related to the datasets?
What results do you achieve on the InTheWild dataset using the pretrained model we provide?
Please provide us with as much information as possible to reproduce your results (including information about preparing the train and eval datasets).
@piotrkawa I did not modify the codebase; I just followed the commands provided in the README. I use the config from here for stage 1 training and then modify it for stage 2 fine-tuning by changing the lr, freeze_encoder, and checkpoint path parameters. The exact command I used is the one provided here in the README (I changed the epochs and config path for stage 1 and stage 2).
I am able to reproduce the numbers reported in the paper for the InTheWild dataset with the provided checkpoint. Let me know if you need anything else.
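For completeness, my stage 2 config is the stage 1 whisper_mesonet.yaml with roughly the following fields changed (the checkpoint path and the lowered lr below are placeholders for illustration, not exact values):

checkpoint:
  path: "<path to stage 1 ckpt.pth>"   # resume from the frozen-encoder checkpoint
...
    freeze_encoder: False              # unfreeze the Whisper encoder for stage 2
...
    lr: <fine-tuning lr>               # lowered relative to the 0.0001 used in stage 1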
"i changed epochs and config path for stage 1 and stage 2)" - what do you exactly mean by that? 1st training should be performed using 10 epochs, fine-tuning using 5 epochs.
Could you please retry the training, but this time using valid_amount = 25,000 instead of test_amount? That is a mistake in the README - test_amount is in fact not used, as we evaluate on the full ITW dataset, whereas we should validate on 25k ASV21DF samples.
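For example, a corrected training invocation would look roughly like this (assuming a --valid_amount flag that mirrors the existing --test_amount one; the dataset path is a placeholder):

python train_models.py \
    --asv_path <path to ASVspoof2021 DF> \
    --config configs/training/whisper_mesonet.yaml \
    --batch_size 8 \
    --epochs 10 \
    --train_amount 100000 \
    --valid_amount 25000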
Moreover - what ASV21DF labels do you use? Please refer to #3 (comment).
I used 10 epochs for the 1st training and 5 epochs for fine-tuning, as mentioned in the paper. I am not using test_amount; as I mentioned earlier, I only adjusted the parameters according to the experimental procedure described in the paper, so I am using 25,000 for valid_amount. I am also using the key referenced in the DeepFakeASVSpoofDataset class, which I think is the same one you linked above.
I can retrain the model, but it seems there would be no change to my existing configuration. Let me know if you think otherwise or if there are other changes I should try.
Hi @piotrkawa,
I've encountered the same issue where I'm unable to replicate the results from the paper. Below are the details of my setup and the commands I've used:
Training Command
python train_models.py \
--asv_path /home/man-group/chandler/Datasets/ASVspoof2021/DF \
--config configs/training/whisper_mesonet.yaml \
--batch_size 8 \
--epochs 10 \
--train_amount 100000 \
--test_amount 25000
Training Configuration (whisper_mesonet.yaml)
data:
seed: 42
checkpoint:
path: ""
model:
name: "whisper_mesonet"
parameters:
freeze_encoder: True
input_channels: 1
fc1_dim: 1024
frontend_algorithm: []
optimizer:
lr: 0.0001
weight_decay: 0.0001
Evaluation Command
python evaluate_models.py \
--in_the_wild_path /home/man-group/chandler/Datasets/release_in_the_wild \
--config ./configs/model__whisper_mesonet__1695441741.5227604.yaml \
--amount 25000
Evaluation Configuration (model__whisper_mesonet__1695441741.5227604.yaml)
checkpoint:
path: /home/man-group/chandler/Experiments/deepfake-whisper-features/trained_models/model__whisper_mesonet__1695441741.5227604/ckpt.pth
data:
seed: 42
model:
name: whisper_mesonet
optimizer:
lr: 0.0001
weight_decay: 0.0001
parameters:
fc1_dim: 1024
freeze_encoder: true
frontend_algorithm: []
input_channels: 1
Results
From your paper, the EER (frozen) should be 0.3856.
However, I get this output below using the Evaluation Command:
eval/eer: 0.4117
eval/accuracy: 57.8361
eval/precision: 0.7281
eval/recall: 0.5249
eval/f1_score: 0.6100
eval/auc: 0.6129
Could you kindly assist me with the issue I mentioned above, @piotrkawa? I find the method in your paper to be potentially state-of-the-art and am planning to include it in our benchmark. However, I'm encountering difficulties in reproducing your results. Your guidance would be greatly appreciated.
Thank you for the detailed description of the steps you followed to run the code. Let us take a look at the problem.
In the meantime, we point out the checkpoints with which we achieved the results described in the paper: the best (MFCC + Whisper) MesoNet and (Whisper) MesoNet models.
We can provide more checkpoints if needed.
What is the environment you are using?
- OS,
- CUDA version, drivers,
- exact GPU,
- Python version,
- packages like torch, Whisper etc.
In the meantime, we would appreciate your results for the models that do not use Whisper - e.g. (LFCC) SpecRNet.
@piotrkawa This is the machine environment I have:
(audiofake) man-group@mangroup-1:~/chandler$ uname -a
Linux mangroup-1 6.2.0-33-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 10:33:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.13
(audiofake) man-group@mangroup-1:~/chandler$ nvidia-smi
Thu Sep 28 21:42:23 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 0% 46C P8 22W / 420W | 140MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1888 G /usr/lib/xorg/Xorg 71MiB |
| 0 N/A N/A 2022 G /usr/bin/gnome-shell 58MiB |
+---------------------------------------------------------------------------------------+
(audiofake) man-group@mangroup-1:~/chandler/Experiments/deepfake-whisper-features$ pip freeze
asteroid-filterbanks==0.4.0
audioread==3.0.0
beautifulsoup4==4.12.2
bleach==6.0.0
brotlipy==0.7.0
cachetools==5.3.1
certifi @ file:///croot/certifi_1690232220950/work/certifi
cffi @ file:///croot/cffi_1670423208954/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
cryptography @ file:///croot/cryptography_1694444244250/work
decorator==5.1.1
ffmpeg-python==0.2.0
filelock==3.12.4
fsspec==2023.9.1
future==0.18.3
gdown==4.7.1
google-api-core==2.12.0
google-api-python-client==2.101.0
google-auth==2.23.1
google-auth-httplib2==0.1.1
googleapis-common-protos==1.60.0
httplib2==0.22.0
huggingface-hub==0.17.2
idna @ file:///croot/idna_1666125576474/work
joblib==1.3.2
kaggle==1.5.16
librosa==0.9.2
llvmlite==0.40.1
mkl-fft @ file:///croot/mkl_fft_1695058164594/work
mkl-random @ file:///croot/mkl_random_1695059800811/work
mkl-service==2.4.0
more-itertools==10.1.0
numba==0.57.1
numpy==1.24.4
openai-whisper @ git+https://github.com/openai/whisper.git@7858aa9c08d98f75575035ecd6481f462d66ca27
packaging==23.1
pandas==2.0.2
Pillow @ file:///croot/pillow_1695134008276/work
platformdirs==3.10.0
pooch==1.7.0
protobuf==4.24.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pyOpenSSL @ file:///croot/pyopenssl_1690223430423/work
pyparsing==3.1.1
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.8.2
python-slugify==8.0.1
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.8.8
requests @ file:///croot/requests_1690400202158/work
resampy==0.4.2
rsa==4.9
safetensors==0.3.3
scikit-learn==1.3.1
scipy==1.11.2
six==1.16.0
soundfile==0.12.1
soupsieve==2.5
text-unidecode==1.3
threadpoolctl==3.2.0
tokenizers==0.13.3
torch==1.11.0
torchaudio==0.11.0
torchvision==0.12.0
tqdm==4.66.1
transformers==4.33.2
typing_extensions @ file:///croot/typing_extensions_1690297465030/work
tzdata==2023.3
uritemplate==4.1.1
urllib3==2.0.5
webencodings==0.5.1
Replication [w/o Whisper]
Training Command
python train_models.py \
--asv_path /home/man-group/chandler/Datasets/ASVspoof2021/DF \
--config configs/training/specrnet.yaml \
--batch_size 8 \
--epochs 10 \
--train_amount 100000 \
--test_amount 25000
Training Configuration (specrnet.yaml)
data:
seed: 42
checkpoint:
path: ""
model:
name: "specrnet"
parameters:
input_channels: 1
frontend_algorithm: ["lfcc"]
optimizer:
lr: 0.0001
weight_decay: 0.0001
Evaluation Command
python evaluate_models.py \
--in_the_wild_path /home/man-group/chandler/Datasets/release_in_the_wild \
--config ./configs/model__specrnet__1695874202.629074.yaml \
--amount 25000
Evaluation Configuration (model__specrnet__1695874202.629074.yaml)
checkpoint:
path: /home/man-group/chandler/Experiments/deepfake-whisper-features/trained_models/model__specrnet__1695874202.629074/ckpt.pth
data:
seed: 42
model:
name: specrnet
optimizer:
lr: 0.0001
weight_decay: 0.0001
parameters:
frontend_algorithm:
- lfcc
input_channels: 1
Results
From your paper, the (SpecRNet + LFCC) EER should be 0.5184.
However, I get this output below using the Evaluation Command:
- eval/eer: 0.6368
- eval/accuracy: 34.0634
- eval/precision: 0.3027
- eval/recall: 0.0381
- eval/f1_score: 0.0676
- eval/auc: 0.3068
Thank you for your patience. We have indeed noticed reproducibility issues on other machines.
We tried downloading the datasets again, running the code in a fresh conda env, and running outside of the Docker env we originally worked in - on our machine, we obtained the same results in all cases.
While the results are reproducible on the computer we used to prepare this work, they are not on other machines (https://discuss.pytorch.org/t/different-result-on-different-gpu/102502). This is also consistent with @iankur achieving similar results (at least for the 1st stage), whereas @chandlerbing65nm obtained significantly different ones.
That is why we conclude that the problem may lie in the hardware rather than in a dataset mismatch or a bug in the codebase.
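For anyone retrying the trainings elsewhere, the usual PyTorch determinism settings are worth double-checking; note that they make runs repeatable on a given GPU but do not guarantee identical results across different GPU models, and whether they change anything for this particular codebase is not something we have verified. A minimal sketch:

import random

import numpy as np
import torch


def set_deterministic(seed: int = 42) -> None:
    # Seed every RNG that the training pipeline may touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels (may slow training down).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False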
The specs of the machine used to prepare the paper are as follows:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN RTX Off | 00000000:1A:00.0 Off | N/A |
| 41% 33C P8 5W / 280W | 769MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN RTX Off | 00000000:1B:00.0 Off | N/A |
| 40% 43C P8 15W / 280W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN RTX Off | 00000000:1E:00.0 Off | N/A |
|123% 77C P2 259W / 280W | 7507MiB / 24576MiB | 98% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN RTX Off | 00000000:3F:00.0 Off | N/A |
| 41% 30C P8 13W / 280W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA TITAN RTX Off | 00000000:40:00.0 Off | N/A |
| 40% 32C P8 11W / 280W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
requirements:
(mldev) $ pip freeze
asteroid-filterbanks==0.4.0
audioread==3.0.1
brotlipy==0.7.0
certifi @ file:///croot/certifi_1690232220950/work/certifi
cffi @ file:///croot/cffi_1670423208954/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
cryptography @ file:///croot/cryptography_1694444244250/work
decorator==5.1.1
ffmpeg-python==0.2.0
filelock==3.12.4
fsspec==2023.9.2
future==0.18.3
huggingface-hub==0.17.3
idna @ file:///croot/idna_1666125576474/work
joblib==1.3.2
librosa==0.9.2
llvmlite==0.41.0
mkl-fft @ file:///croot/mkl_fft_1695058164594/work
mkl-random @ file:///croot/mkl_random_1695059800811/work
mkl-service==2.4.0
more-itertools==10.1.0
numba==0.58.0
numpy==1.25.2
openai-whisper @ git+https://github.com/openai/whisper.git@7858aa9c08d98f75575035ecd6481f462d66ca27
packaging==23.1
pandas==2.0.2
Pillow @ file:///croot/pillow_1695134008276/work
platformdirs==3.10.0
pooch==1.7.0
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pyOpenSSL @ file:///croot/pyopenssl_1690223430423/work
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.8.8
requests @ file:///croot/requests_1690400202158/work
resampy==0.4.2
safetensors==0.3.3
scikit-learn==1.3.1
scipy==1.11.3
six==1.16.0
soundfile==0.12.1
threadpoolctl==3.2.0
tokenizers==0.13.3
torch==1.11.0
torchaudio==0.11.0
torchvision==0.12.0
tqdm==4.66.1
transformers==4.33.3
typing_extensions @ file:///croot/typing_extensions_1690297465030/work
tzdata==2023.3
urllib3 @ file:///croot/urllib3_1686163155763/work
As mentioned, we retried the (LFCC) SpecRNet experiment on another machine. We used an RTX 3090 (Driver Version: 470.141.03, CUDA Version: 11.4) and got the same results as you reported (EER = 0.6368 etc.). Moreover, during this investigation we noticed that other seeds may yield better results: e.g. on the RTX 3090, training (LFCC) SpecRNet with seed=1234 resulted in an EER of 0.5855 (whereas 0.6368 was the result for seed=42, which we typically use in our research). The discrepancy is not limited to the Whisper-based architectures we propose in this work; it appears, for instance, in LCNN models as well - we got EER = 0.7051 for the (LFCC) LCNN architecture on the RTX 3090 instead of the 0.77 on the TITAN RTX reported in the paper.
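For reference, the seed is set in the data section of the training config, so rerunning with a different seed is just a one-line change, e.g.:

data:
  seed: 1234   # the default configs use 42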
We provide all of the models reported in our paper for clarity and as a confirmation of our results.
We will further investigate this issue; however, at this moment we can suggest the following ways to reproduce our results:
- use the models we provide,
- train using different seeds (you can expect both better and worse results in relation to the reported ones),
- run training using the exact environment and machine as we did,
- try a lower learning rate.
Moreover, please note that our research is based on 125k training samples with no augmentation - to further improve these results, one can use the full datasets and apply data augmentation techniques.
Hi @piotrkawa. Firstly, I want to express my gratitude for your detailed explanation concerning the reproducibility issues I encountered with your paper. Your comprehensive insights and the troubleshooting steps you've provided are invaluable.
I understand the challenges associated with ensuring the reproducibility of results across different hardware configurations. To mitigate this and to allow for a more universal benchmark, would it be possible to rerun the experiments in Table 3, specifically those involving frozen and fine-tuned Whisper features, using multiple different seeds? Calculating the average performance metrics along with their standard deviation would provide a more reliable measure of the model's capabilities.
If this is an additional task that you're unable to undertake at the moment, I would be more than willing to run these experiments on our end and provide you with the updated results. This collaborative effort would contribute to the robustness and credibility of the published work.
Importantly, this would enable us to cite your work in our upcoming benchmarks. We would point to the updated results on your GitHub page as the source, rather than the original paper, given the updated nature of the findings.
We look forward to seeing these updated results on GitHub, which would serve as an essential resource for all researchers in this field.
Thank you once again for your time and significant contributions to this field.
@piotrkawa, not related to reproducibility: is applying the same batch norm layer to two different inputs here and here intended? I just found out that this issue has already been raised against the PyTorch implementation that this repo reuses. Also, correct me if I'm wrong, but there seems to be no activation function in the inception layer, whereas the original Keras implementation of the MesoInception network does have a ReLU nonlinearity.
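To make the concern concrete, here is an illustrative sketch (not the repo's actual module) contrasting a BatchNorm2d shared across two branches - which mixes the running statistics of both - with the usual inception pattern of one BN and a ReLU per branch, as in the Keras original:

import torch
import torch.nn as nn


class SharedBNBranches(nn.Module):
    """One BatchNorm2d reused on two different branch outputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)  # shared between branches

    def forward(self, x):
        a = self.bn(self.conv1(x))  # the same BN sees
        b = self.bn(self.conv3(x))  # two different inputs
        return torch.cat([a, b], dim=1)


class PerBranchBNBranches(nn.Module):
    """Separate BN (and ReLU) per branch, as in the Keras MesoInception."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        a = torch.relu(self.bn1(self.conv1(x)))
        b = torch.relu(self.bn3(self.conv3(x)))
        return torch.cat([a, b], dim=1)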
Can you also please share why this work does not compare with wav2vec- and graph-attention-based systems?
Hi @chandlerbing65nm,
Unfortunately, I currently have to focus on my PhD dissertation and will not be able to rerun these experiments in the near future, but I encourage you to do so and to run the trainings on different machines and seeds.
However, please bear in mind that, as we state in our paper, we only used a subset of the ASVspoof 2021 DF dataset and no augmentation techniques - the main focus is on the benefits of using the Whisper model and on comparisons with other front-ends.
Exposing models to multiple attacks (i.e. using multiple datasets like the ASVspoof sets, WaveFake, FakeAVCeleb, ADD, etc.) and enriching the representation with augmentation techniques (e.g. audiomentations, RawBoost, etc.) is common in the field, as it significantly improves the models. In my opinion, the benchmark would benefit from a unified training procedure - i.e. similar training datasets and training techniques - as these factors will be highly influential.
@iankur we used the implementation of MesoNet provided with the FakeAVCeleb baseline code.
We cited the methods you mentioned; however, we did not include them in the benchmark due to the manuscript space limit. Moreover, a reliable comparison would require training these models in the same setting (training set, number of epochs, etc.) - we did not want to dilute the paper by adding a few more models.