Question about datasets

Question

leotireger opened this issue 10 months ago · 2 comments

Hi,

I noticed that PolypDiag, the dataset used for the downstream classification task, is built from two of the pre-training datasets, LDPolypVideo and Hyper-Kvasir, according to the PolypDiag paper (arxiv edition, https://doi.org/10.1007/978-3-031-16437-8_9). Did you separate the pre-training data from the downstream test data?
If so, how do you make sure that the pre-training data does not leak into the downstream evaluation?

No offense intended; I ask only because I've been working on reproducing the results lately.

Thanks a lot.

Answer 1 · 2024-01-19T07:48:25.000Z

Hi, thanks for your interest!
This is a good point! We do not remove videos from our pre-training set that may also appear in PolypDiag.
Because our pre-training is self-supervised (it uses no ground-truth class labels), no downstream label information is leaked to the model.
This practice is common in self-supervised learning, e.g., SimCLR (https://arxiv.org/abs/2002.05709).
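
For anyone reproducing the setup who wants to measure how much the two sets overlap, below is a minimal sketch that matches files by content hash. It is not taken from the repository: `file_digest` and `find_overlap` are hypothetical helper names, and the path lists are assumed to be supplied by the reader. It only catches byte-identical files, so clips that were re-encoded or trimmed when PolypDiag was assembled would not be detected.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_overlap(pretrain_paths, downstream_paths):
    """List downstream files whose bytes match some pre-training file.

    Hypothetical helper: returns (downstream_path, [matching pretrain paths])
    pairs. Only exact byte-level copies are detected.
    """
    # Index the pre-training files by digest.
    pretrain_by_digest = {}
    for p in pretrain_paths:
        pretrain_by_digest.setdefault(file_digest(p), []).append(Path(p))

    # Look up each downstream file in that index.
    overlap = []
    for p in downstream_paths:
        matches = pretrain_by_digest.get(file_digest(p))
        if matches:
            overlap.append((Path(p), matches))
    return overlap
```

Calling `find_overlap(pretrain_files, test_files)` with two lists of video paths returns the test files that also occur in the pre-training set; an empty list means no exact copies were found.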

Answer 2 · 2024-01-19T08:09:32.000Z

Understood. Thanks for the reply and the paper recommendation; it helped me a lot!