fgnt/pb_sed

Reproduction of evaluation performance

Closed this issue · 12 comments

Hi, I tried your Python scripts but can't achieve this performance:
52.8% (±0.6%) event-based F1-score

For us, the F1-score was about 30%.

What should I do to reproduce it?

Hi, thanks for your interest in our model. Let's try to figure out what is going wrong here.

So you tried to reproduce the hybrid ensemble right away? What is the performance you achieve with a single tag conditioned CNN, and what were the calls you used to train and evaluate it?

Thank you so much for your help!
I want to reproduce an event-based F1-score over 50%.
Because my English is poor, please let me know if the context is unclear.

In my case, some of the command lines don't work.
I guess that because my machine's OS is Windows, command lines containing ', ", or [ don't work.
As a workaround, I modified the Python files as below.
run_inference.py
#reference_files = None #[Izumi]
reference_files = ["datadir/real/metadata/validation/validation.csv",
                   "datadir/real/metadata/eval/eval_dcase2019.csv"]

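(Maybe a call with double quotes would also work on Windows, since cmd.exe strips double quotes but passes single quotes through literally, e.g.
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with "pseudo_strong_suffix=2020-07-05-12-37-18_best_frame_f1_hybrid"
but I have not verified this, so I edited the files instead.)
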
So you tried to reproduce the hybrid ensemble right away?
What is the performance you achieve with a single tag conditioned CNN and what were the calls you used to train and evaluate it?
->
--- FBCRNN Training ---
I performed the command below three times:
python -m pb_sed.experiments.dcase_2020_task_4.train_crnn

--- CNN Training ---
Then, I performed the following commands:
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-05-12-37-18_best_frame_f1_hybrid'
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-05-12-37-18_best_frame_f1_hybrid'
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-05-12-37-18_best_frame_f1_hybrid'

--- Hyper Parameter Tuning ---
Then, I performed the command below with the .py file modification shown:
python -m pb_sed.experiments.dcase_2020_task_4.tune_hyper_params
tune_hyper_params.py
crnn_dirs = ["pb_sed/exp/dcase_2020_crnn/2021-05-28-21-40-32",
             "pb_sed/exp/dcase_2020_crnn/2021-05-29-08-35-28",
             "pb_sed/exp/dcase_2020_crnn/2021-05-29-18-04-59"]
cnn_dirs = ["pb_sed/exp/dcase_2020_cnn/2021-05-30-08-28-52",
            "pb_sed/exp/dcase_2020_cnn/2021-05-30-12-42-30",
            "pb_sed/exp/dcase_2020_cnn/2021-05-30-17-38-43"]

--- Inference ---
Then, I performed the command below with the .py file modification shown:
python -m pb_sed.experiments.dcase_2020_task_4.run_inference with hyper_params_dir=pb_sed/exp/dcase_2020_hyper_params/2021-05-30-23-49-53
run_inference.py
reference_files = ["datadir/real/metadata/validation/validation.csv",
                   "datadir/real/metadata/eval/eval_dcase2019.csv"]

What is the performance you achieve with a single tag conditioned CNN
->
I don't know how I can evaluate with a single tag conditioned CNN.
What I know is:

  1. train some FBCRNN models
  2. train some tag conditioned CNN models
  3. hyper parameter tuning for 1. and 2.
  4. inference with the tuned hyper parameters

OK, let's first check whether the FBCRNN training is going right.

So you were running train_crnn with unlabel_in_domain_pseudo_weak_timestamp=None, right? Could you run hyper parameter tuning and inference with only a single one of those FBCRNNs? This should give something like 82.6% tagging f-score and 40% event-based detection f-score. You can additionally check the best validation tagging f-score during training, which should also be something like 82.6%.

If you run train_crnn with unlabel_in_domain_pseudo_weak_timestamp='2020-07-03-20-48-45' (which is one of the files with weak pseudo labels that have been used for the results in our paper), you should get something like 84.8% tagging f-score and 46% event-based detection f-score.

Could you please verify this?
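
For reference, the two training variants would be called something like this (same with-syntax as your train_cnn calls; you may need to adapt the quoting on Windows as you described):
python -m pb_sed.experiments.dcase_2020_task_4.train_crnn with 'unlabel_in_domain_pseudo_weak_timestamp=None'
python -m pb_sed.experiments.dcase_2020_task_4.train_crnn with 'unlabel_in_domain_pseudo_weak_timestamp=2020-07-03-20-48-45'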

I attached three log files and three .py files. Can you read them?
I trained the FBCRNN without pseudo labels.
There are some modifications in the .py files.
For example, I changed the number of iterations from 40000 to 20000 because of processing time.
But the validation score is about 81%, so I thought that was enough.

Also, since .py files are not permitted in this system, I changed the extension to .txt.

During training, the validation score is about 81%.
But in hyper parameter tuning and inference, the event-based detection f-score is about 28%.

hyper_parameter_tuning_log.txt
inference_log.txt
run_inference.py.txt
train_crnn.py.txt
tune_hyper_params.py.txt
fbcrnn_log.txt

OK, while the model achieves 82% tagging f-score during validation, it only scores 56% tagging f-score in inference, so something seems to go wrong there. It also seems that the inference log you sent was using two crnns + two cnns, is that correct? Could you perform hyper parameter tuning and inference only with the crnn that you sent the training log from?

I tried to perform hyper parameter tuning with only one crnn, but an error appeared.
Is there anything wrong in my tune_hyper_params.py?

----- Error message -----
ERROR - dcase_2020_hyper_params - Failed after 0:00:46!
Traceback (most recent calls WITHOUT Sacred internals):
File "C:\Users\XXXXXXX\Desktop\0_tec\210317_DCASE2020_Task4_3rd\pb_sed-master\pb_sed\pb_sed\experiments\dcase_2020_task_4\tune_hyper_params.py", line 265, in main
assert len(cnn_dirs) > 0
AssertionError

----- tune_hyper_params.py -----
crnn_dirs = ["pb_sed/exp/dcase_2020_crnn/2021-05-28-21-40-32"]
cnn_dirs = []

tune_hyper_params.py.txt

Ah right, you have to set ensembles=['crnn'] in tune_hyper_params to only evaluate the crnn.
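
So for your single-crnn run, the relevant lines in tune_hyper_params.py would look something like this (exact line numbers may differ in your copy):
crnn_dirs = ["pb_sed/exp/dcase_2020_crnn/2021-05-28-21-40-32"]
cnn_dirs = []
ensembles = ['crnn']  # only tune and evaluate the crnn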

I set ensembles = ['cnn'] at line 84, but the same error happened.
Is another modification needed?

tune_hyper_params.py.txt

----- error message -----
INFO - dcase_2020_hyper_params - Running command 'main'
INFO - dcase_2020_hyper_params - Started run with ID "1"
Data set length validation: 1009
Restored labels from C:\Users\4025151\Desktop\0_tec\210317_DCASE2020_Task4_3rd\pb_sed-master\pb_sed\exp\dcase_2020_crnn\2021-05-28-21-40-32\events.json
Audio Tagging:
F-scores: [0.894, 0.776, 0.814, 0.692, 0.797, 0.827, 0.753, 0.869, 0.95, 0.846]
Macro F-score: 0.822
Error-rates: [0.213, 0.438, 0.333, 0.571, 0.393, 0.333, 0.541, 0.252, 0.101, 0.296]
Macro error-rate: 0.347
Sound event detection:
[Izumi]ensembles:['cnn']
ERROR - dcase_2020_hyper_params - Failed after 0:00:45!
Traceback (most recent calls WITHOUT Sacred internals):
File "C:\Users\XXXXX\Desktop\0_tec\210317_DCASE2020_Task4_3rd\pb_sed-master\pb_sed\pb_sed\experiments\dcase_2020_task_4\tune_hyper_params.py", line 267, in main
assert len(cnn_dirs) > 0
AssertionError

You have to set it to ensembles=['crnn'], not ensembles=['cnn'].

Thank you again, it worked.
Here is the last part of the log.
--- log ---
Best event-based crnn macro F-score: 0.385

Is this performance OK?

The full log is here.
tune_hyper_params_log.txt

Looks good. In the paper we got 40.7±1.3% in this setting. However, as you only train for 20k iterations and, if I am not mistaken, are also missing some data, I think 38.5% is fine.

You may now try to train a crnn with unlabel_in_domain_pseudo_weak_timestamp="2020-07-03-20-48-45", which should give you something like 84.8% tagging f-score and 46% detection f-score. There are four more weak pseudo label files which have also been used in the paper, namely, "2020-07-03-20-49-48", "2020-07-03-20-52-19", "2020-07-03-21-00-48", "2020-07-03-21-05-34". You can train a crnn for each of them to build your own crnn ensemble, which will give you something like 48% event-based detection f-score. Note that you will later need the crnn ensemble for tag conditioning the cnns.
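
That is, in addition to the 2020-07-03-20-48-45 run, calls like the following (same syntax as before):
python -m pb_sed.experiments.dcase_2020_task_4.train_crnn with 'unlabel_in_domain_pseudo_weak_timestamp=2020-07-03-20-49-48'
python -m pb_sed.experiments.dcase_2020_task_4.train_crnn with 'unlabel_in_domain_pseudo_weak_timestamp=2020-07-03-20-52-19'
python -m pb_sed.experiments.dcase_2020_task_4.train_crnn with 'unlabel_in_domain_pseudo_weak_timestamp=2020-07-03-21-00-48'
python -m pb_sed.experiments.dcase_2020_task_4.train_crnn with 'unlabel_in_domain_pseudo_weak_timestamp=2020-07-03-21-05-34'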

You can train tag conditioned cnns using
pseudo_strong_suffix="2020-07-04-13-10-05_best_frame_f1_crnn",
pseudo_strong_suffix="2020-07-04-13-10-19_best_frame_f1_crnn",
pseudo_strong_suffix="2020-07-04-13-10-33_best_frame_f1_crnn",
pseudo_strong_suffix="2020-07-04-13-11-09_best_frame_f1_crnn"
and
pseudo_strong_suffix="2020-07-04-13-12-06_best_frame_f1_crnn",
which are the pseudo strong labels used in our paper.
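
That is, analogous to your earlier train_cnn calls, something like:
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-04-13-10-05_best_frame_f1_crnn'
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-04-13-10-19_best_frame_f1_crnn'
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-04-13-10-33_best_frame_f1_crnn'
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-04-13-11-09_best_frame_f1_crnn'
python -m pb_sed.experiments.dcase_2020_task_4.train_cnn with 'pseudo_strong_suffix=2020-07-04-13-12-06_best_frame_f1_crnn'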

Then perform hyper parameter tuning and inference for the 5 crnns + 5 cnns, and you should get the 52.8±0.6% detection f-score which you asked for.
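
With your own experiment directories filled in (the timestamps below are just placeholders), the tuning step would look roughly like your earlier modification of tune_hyper_params.py:
crnn_dirs = ["pb_sed/exp/dcase_2020_crnn/<crnn_timestamp_1>",
             "pb_sed/exp/dcase_2020_crnn/<crnn_timestamp_2>",
             "pb_sed/exp/dcase_2020_crnn/<crnn_timestamp_3>",
             "pb_sed/exp/dcase_2020_crnn/<crnn_timestamp_4>",
             "pb_sed/exp/dcase_2020_crnn/<crnn_timestamp_5>"]
cnn_dirs = ["pb_sed/exp/dcase_2020_cnn/<cnn_timestamp_1>",
            "pb_sed/exp/dcase_2020_cnn/<cnn_timestamp_2>",
            "pb_sed/exp/dcase_2020_cnn/<cnn_timestamp_3>",
            "pb_sed/exp/dcase_2020_cnn/<cnn_timestamp_4>",
            "pb_sed/exp/dcase_2020_cnn/<cnn_timestamp_5>"]
(remember to revert your earlier ensembles=['crnn'] change to its original value), followed by
python -m pb_sed.experiments.dcase_2020_task_4.tune_hyper_params
python -m pb_sed.experiments.dcase_2020_task_4.run_inference with hyper_params_dir=pb_sed/exp/dcase_2020_hyper_params/<tuning_timestamp>
as you did before.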

Let me know if something is not working properly.

Thank you, with the pseudo label training below, the performance improved (44.70%).
I will try the ensemble and increasing the iterations from 20000 to 40000.

crnn:
with unlabel_in_domain_pseudo_weak_timestamp=2020-07-03-20-48-45
cnn:
with pseudo_strong_suffix=2020-07-04-13-12-06_best_frame_f1_crnn