BUTSpeechFIT/CALLHOME_sublists

"Official" train/dev/test?

hbredin opened this issue · 21 comments

I have never reported results on CALLHOME because of the (apparent) lack of an official train/validation/test split (or at least validation/test split).

What experimental protocol does BUT use for reporting results?
Validation on part1, test on part2?
Validation on part2, test on part1?
Both?

cc @fnlandini

Hi @hbredin
Thanks for bringing this up.
It is true that even our setup has evolved over time.
Following the setup we inherited from JSALT 2016, in our original works with VBHMM clustering-based methods (i.e. 1 and 2) we reported results on the whole set, excluding the file iaeu because it had labeling errors.
Later on, following the partition from Kaldi, we used part1 as validation and part2 as test, and the other way around, for cross-validation and tuning of the VBx hyperparameters. Still, we reported results on the whole set, using oracle VAD.

However, because Hitachi folks have started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. This is mainly because the community seems to have adopted it and we wanted to be able to compare against existing results.
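
For concreteness, a minimal sketch of that earlier whole-set setup. The list file names below are assumptions (any files listing the part1/part2 recording IDs would do), not something shipped as-is in this repository:

```python
# Sketch only: "part1.txt" / "part2.txt" are assumed list files with one
# recording ID per line, built from the CALLHOME sublists.
def load_list(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

part1, part2 = load_list("part1.txt"), load_list("part2.txt")

# Whole-set evaluation used in the VBHMM papers: everything except "iaeu",
# which is excluded because of labeling errors.
whole_set = (part1 | part2) - {"iaeu"}

# Two folds for tuning (e.g. VBx hyperparameters): tune on one half, test on
# the other, and vice versa; final numbers were still reported on whole_set.
folds = [(part1, part2), (part2, part1)]
```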

Thanks. That's very helpful.

So all papers by Hitachi use part1 for fine-tuning and part2 for testing?

What about updating the README with your answer? This would definitely help the community (in the same way AMI-diarization-setup does for AMI).

However, because Hitachi folks have started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models

Yes, we used this setup.

Thanks for sharing. FYI, in our previous work we did 5-fold evaluation.

We randomly partition the dataset into five subsets, and each time leave one subset for evaluation, and train UIS-RNN on the other four subsets. Then we combine the evaluation on five subsets and report the averaged DER.

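A minimal sketch of that 5-fold protocol, assuming a hypothetical `callhome_all.txt` listing all recording IDs (the actual training and scoring calls are left out):

```python
import random

# Randomly partition the recordings into five subsets; each subset is held
# out for evaluation once while the other four are used for training.
with open("callhome_all.txt") as f:             # hypothetical list of recording IDs
    recordings = sorted(line.strip() for line in f if line.strip())

random.seed(0)                                  # fix the split for reproducibility
random.shuffle(recordings)
folds = [recordings[i::5] for i in range(5)]    # five roughly equal subsets

for k, eval_fold in enumerate(folds):
    train_files = [r for i, fold in enumerate(folds) if i != k for r in fold]
    # train UIS-RNN (or any model) on train_files, evaluate on eval_fold,
    # then combine the five evaluations and report the averaged DER
    print(f"fold {k}: {len(train_files)} train / {len(eval_fold)} eval recordings")
```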

However, because Hitachi folks have started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. This is mainly because the community seems to have adopted it and we wanted to be able to compare against existing results.

Yes, we used the same setup recently (cc @popcornell) where part1 was used for adaptation.

Thanks everyone for the comments.
@hbredin I've added a pointer to this issue in the README and we can keep it open for future reference

Thanks everyone for your feedback!
Let's make our (future) results comparable :)

There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?


The ones I used are shared here: https://github.com/google/speaker-id/tree/master/publications/LstmDiarization/evaluation/NIST_SRE2000

Disk 8 is CALLHOME, and Disk 6 is Switchboard.

Thanks @wq2012. That is what I started using as well.
Can anyone else confirm that this is the only version circulating in our community?

Hi Herve,

CALLHOME is LDC proprietary data that can only be obtained after purchase, and we believe we might violate copyright if we published the reference files from it.
But given that @wq2012 publicly shared his, yes, they are the same ones we use, with the exception that, as mentioned above, we do not use the file iaeu.

We will consult with LDC about whether we can directly share our RTTM files here. It would be good to have it all together in the repository, but we prefer to be on the safe side and get approval first.

Hmm, are you sure?

Is that the same version as the LDC callhome?

IIRC we simply searched on Google and downloaded them from other publicly available sources, and thought they had already been publicly circulated.

We will consult with LDC about whether we can directly share our RTTM files here. It would be good to have it all together in the repository, but we prefer to be on the safe side and get approval first.

Totally makes sense. Thanks!

@wq2012, there are several CALLHOME LDC datasets. That is why CALLHOME can refer to so many different sets in publications.
This specific CALLHOME data is not that easy to find, unless you know the origin. It is part of the 2000 NIST Speaker Recognition Evaluation, which can be found under LDC Catalog No. LDC2001S97.
The references were released as part of the NIST keys after the evaluation.

We are waiting for a response from LDC; we will post an update once we hear from them.

Thanks! But I don't think the references are included in any of the LDC Catalogs.

For future reference, the RTTMs are also here: http://www.openslr.org/resources/10/sre2000-key.tar.gz
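
If it helps, a small standard-library snippet to fetch and unpack that key (only the URL above is from the thread; the local paths are arbitrary):

```python
import tarfile
import urllib.request

url = "http://www.openslr.org/resources/10/sre2000-key.tar.gz"
urllib.request.urlretrieve(url, "sre2000-key.tar.gz")

with tarfile.open("sre2000-key.tar.gz", "r:gz") as tar:
    tar.extractall("sre2000-key")   # reference RTTMs end up under this directory
```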

However, because Hitachi folks have started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. This is mainly because the community seems to have adopted it and we wanted to be able to compare against existing results.

Hi, Herve
So you mean that for the Hitachi EEND-EDA experiments,
Train set = Callhome part 1
Validation set = Callhome part 2
Test set = Callhome part 2

Is that right?

I guess this is for Hitachi people to answer here.
But I do hope that they are not using the same set for both validation and testing :)

Here is what I do on my side:

  • use 75% of Callhome part 1 as train
  • use the remaining 25% of Callhome part 1 as validation
  • use Callhome part 2 as test

I don't think the actual split of part 1 (into train and dev) is really critical.
As long as part 2 never leaks into the various training steps (either train or validation) and we all report numbers on part 2, comparison should be fair.
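
For what it's worth, a sketch of such a split; the list file names are assumptions, and the exact random split should not matter much:

```python
import random

def load_list(path):
    with open(path) as f:
        return sorted(line.strip() for line in f if line.strip())

part1 = load_list("part1.txt")   # hypothetical list of part1 recording IDs
part2 = load_list("part2.txt")   # hypothetical list of part2 recording IDs

random.seed(42)                  # any fixed seed; the exact split is not critical
random.shuffle(part1)
cut = int(0.75 * len(part1))
train, dev, test = part1[:cut], part1[cut:], part2

# part2 must never leak into training or validation
assert not set(test) & (set(train) | set(dev))
```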

I guess a good scenario is what Hervé described, where he splits Part 1. However, it can have the issue that the same speaker appears both in the 75% used for training AND in the 25% used for validation, which can lead to over-optimistic results on the validation set. But it is certainly correct in that the test set (Part 2) is never used for developing the model.

If I can add, I am afraid that many people are making decisions on Part 2 (which is the test set), and that should not be the case. Very few works report results on Part 2 without fine-tuning, or comparisons on Part 1 (without fine-tuning).
Something I've been doing recently is to make all my comparisons (and decisions) on Part 1 without fine-tuning, and only at the very end perform fine-tuning on Part 1 in order to report results on Part 2.
Also, when doing fine-tuning there is the question of how many epochs to run. I used the same number for all the methods I was comparing. This might still favor one method over another, but at least no direct decision is made on the test set (like running fine-tuning until the performance stops improving on Part 2).
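
To make the idea concrete, a rough structural sketch of such a protocol; `evaluate`, `finetune`, and the epoch budget are hypothetical placeholders, not anyone's actual recipe:

```python
FINETUNE_EPOCHS = 100   # chosen in advance and identical for every method (assumption)

def evaluate(model, files):
    """Stand-in: run diarization on `files` and return DER."""
    raise NotImplementedError

def finetune(model, files, epochs):
    """Stand-in: adapt `model` on `files` for a fixed number of epochs."""
    raise NotImplementedError

def run_protocol(methods, part1, part2):
    # 1) all development decisions: compare methods on Part 1 without fine-tuning
    dev_scores = {name: evaluate(model, part1) for name, model in methods.items()}
    # 2) final numbers only: fine-tune on Part 1 with the fixed budget, report on Part 2
    test_scores = {name: evaluate(finetune(model, part1, FINETUNE_EPOCHS), part2)
                   for name, model in methods.items()}
    return dev_scores, test_scores
```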

I would not mind reading others' opinions :)

@hbredin @fnlandini Thank you for your quick and thorough responses!!