Test split is different from the paper?

Question


The dict keys in 'results/test_2ndstage_RetinaNet-CMC-CODAEL-512sh-b1.pth-CODAEL-val-inference_results_boxes.p.bz2' include fa4959e484beec77543b.svs, which is in the train split reported in Table 1 of the paper, while the test slide 4eee7b944ad5e46c60ce.svs is not among the keys.
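For anyone who wants to repeat this check, here is a minimal sketch. It assumes the results file is a bz2-compressed pickle holding a dict keyed by slide filename, which is what the file name and the observation above suggest but is not confirmed here. The helper names `load_result_keys` and `check_split` are hypothetical, and the toy call uses only the two slide names mentioned in this issue rather than the full split from the paper.

```python
import bz2
import pickle


def load_result_keys(path):
    """Return the dict keys stored in a bz2-compressed pickle.

    Assumption: the file contains a pickled dict keyed by slide filename.
    Only unpickle files from a source you trust.
    """
    with bz2.open(path, "rb") as f:
        results = pickle.load(f)
    return set(results.keys())


def check_split(result_keys, expected_test, expected_train):
    """Compare the slides found in a results file with the published split."""
    result_keys = set(result_keys)
    return {
        # Training slides that show up in a file meant to hold test results.
        "train_slides_in_results": sorted(result_keys & set(expected_train)),
        # Test slides that should be in the file but are not.
        "test_slides_missing": sorted(set(expected_test) - result_keys),
    }


# Toy example: the key set is typed in by hand instead of loaded from disk.
report = check_split(
    result_keys={"fa4959e484beec77543b.svs"},
    expected_test={"4eee7b944ad5e46c60ce.svs"},
    expected_train={"fa4959e484beec77543b.svs"},
)
print(report)
```

With the real file, `load_result_keys(path)` would supply `result_keys`, and the two expected sets would come from Table 1 of the paper.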

schwanabc commented 3 years ago

Thanks

Answer 1 · 2021-08-18T13:06:22.000Z

Your observation seems to be correct. This seems to be a lookup error in our database table. I'm sorry for the confusion this caused.

Answer 2 · 2021-08-18T13:18:23.000Z

I've added an erratum to the README file, and I'm also contacting the journal to see if we can publish an erratum to the paper. Thanks!

Answer 3 · 2021-08-20T05:07:31.000Z

For clarity, was the slide fa4959e484beec77543b.svs used as training data?

Answer 4 · 2021-08-20T06:27:35.000Z

Looks like there is more to this; I'm investigating.

Answer 5 · 2021-08-20T07:31:10.000Z

OK, so it appears you were really on to something. I've just updated the README to reflect what I was able to reconstruct:

The file fa4959e484beec77543b.svs was indeed used during training (apparently as originally intended; I don't really remember, since this was more than a year ago), but it also ended up in the test set. So this slide is in fact a train/test bleed (which is embarrassing, because I checked everything a couple of times and did not notice it).

Just to clarify why we do the train/test split after inference in the first place: since we need to optimize the threshold on the training set, the most straightforward way of batch processing was to run inference on everything and split the results up again afterwards. At least in theory, this is also correct in a machine learning sense, since we don't optimize anything on the test set. Of course, if the split applied afterwards does not match the split used for training, then it's an issue.
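The workflow described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the function names, the data layout (per slide, a list of `(score, is_hit)` detections plus a ground-truth count), and the toy slide names are all hypothetical. The explicit check for missing test slides is the kind of guard that would surface a mismatch between the intended split and the keys actually present.

```python
def split_results(results, test_slides):
    """Partition per-slide inference results into (train, test) by slide name."""
    test_slides = set(test_slides)
    missing = test_slides - set(results)
    if missing:
        # Fail loudly if an intended test slide has no results at all.
        raise KeyError(f"test slides missing from results: {sorted(missing)}")
    train = {k: v for k, v in results.items() if k not in test_slides}
    test = {k: v for k, v in results.items() if k in test_slides}
    return train, test


def f1_score(subset, threshold):
    """F1 over all slides in `subset`, keeping detections with score >= threshold."""
    tp = fp = fn = 0
    for detections, n_ground_truth in subset.values():
        kept = [is_hit for score, is_hit in detections if score >= threshold]
        hits = sum(kept)
        tp += hits
        fp += len(kept) - hits
        fn += n_ground_truth - hits
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(train, candidates):
    """Pick the candidate threshold that maximizes F1 on the training subset only."""
    return max(candidates, key=lambda t: f1_score(train, t))


# Toy data (made up): slide -> (list of (score, is_hit), number of ground-truth objects)
results = {
    "train_a.svs": ([(0.9, True), (0.6, True), (0.4, False)], 2),
    "train_b.svs": ([(0.8, True), (0.3, False)], 1),
    "test_a.svs": ([(0.7, True), (0.5, False)], 1),
}
train, test = split_results(results, ["test_a.svs"])
threshold = best_threshold(train, [0.2, 0.5, 0.85])
print(threshold, f1_score(test, threshold))
```

The threshold is chosen from the training subset alone and then applied unchanged to the test subset, so nothing is tuned on test data as long as the post-hoc split matches the slides the model was actually trained on.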

Initially I feared that the results would be subject to a significant overfitting bias. After digging into it, this at least does not seem to be the case (or only very mildly so). As indicated in the new README.md, the F1 score changed by only 0.02 in the most severe case, and the order of the conditions is also unchanged. So I guess our findings all still hold.

So a heartfelt "thank you" for finding this issue. While I'm, of course, not happy to have made the error in the first place, science is not about covering up your mistakes but about admitting them, learning from them, and sharpening your own senses so that you don't make the same mistake twice.

Marc