med-air/Endo-FM

Question about downstream task test result

ZhangT-tech opened this issue · 25 comments

Hi,

I was running the PolypDiag downstream task experiment. When I use the fine-tuned weights as the pretrained model weights, every run of test_finetune_polypdiag.sh gives me a different test result. May I know why this happens? Presumably, it should give the same test result every time we run the test on the same test set with the same model, right?

Hi, thanks for your interest!
Could you share the results you have got?

Hi, thanks for your follow-up:

It is not stable. On the first run of the experiment, eval_polypdiag_finetune gives me a consistent answer, e.g., 91.6%, and it stays the same when I rerun the test. However, if I load the same fine-tuned weights saved from that last eval, I get a different number, like 80%, 76%, and so on. I don't see why the F1 score would differ with the same model on the same test set. Have you encountered this problem?

I also encountered this problem, and my test results were even lower. Since my training environment does not support distributed training, I commented out the relevant code. Could this be related?

Hi, @10086ddd
I found that the model was not in eval mode during testing, and the saved checkpoint only contained the linear classifier, without the fine-tuned backbone.
Both issues are now fixed; you can pull the latest code and try again!

model.eval()

Endo-FM/eval_finetune.py, lines 168 to 175 at commit f9136eb:

save_dict = {
"epoch": epoch + 1,
"state_dict": linear_classifier.state_dict(),
"backbone_state_dict": model.state_dict(),
"optimizer": optimizer.state_dict(),
"scheduler": scheduler.state_dict(),
"best_f1": best_f1,
}
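
For clarity, here is a minimal sketch of the test-side counterpart to the save_dict above: restoring both the fine-tuned backbone and the linear classifier, then switching to eval mode. The checkpoint path is hypothetical, and model / linear_classifier stand for the modules built in eval_finetune.py:

import torch

# Hypothetical checkpoint path; the key names follow the save_dict above.
ckpt = torch.load("checkpoint.pth", map_location="cpu")

# Restore the fine-tuned backbone as well as the linear classifier.
model.load_state_dict(ckpt["backbone_state_dict"])
linear_classifier.load_state_dict(ckpt["state_dict"])

# Without eval mode, dropout stays active and repeated test runs can
# give different scores, matching the instability reported above.
model.eval()
linear_classifier.eval()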

Hi @Kyfafyd, thanks for your follow-up:
I will test the new code soon and report the results. Thanks again for your work. I also have a small question: if I want to switch to another endoscopic dataset for a downstream classification task, do I just need to fine-tune on it first and then run the test directly?

Hi, @10086ddd
Yes, that's exactly right! You can refer to this issue for more details: #12

Hi, @Kyfafyd
I tested the new code today, five times in total. As before, due to my environment, I commented out the distributed-training code. The five results were 74.8%, 34.1%, 83.8%, 45.4%, and 85.7%, respectively. Does this mean the model is still unstable? Additionally, I have a question about the classification labels in the PolypDiag dataset: is every frame of a video labeled as abnormal treated as diseased?

Hi @Kyfafyd,
Sorry, that was my mistake; I forgot to use the latest updated model. Also, regarding the PolypDiag dataset: if in the future I want to use diseased and disease-free endoscopic videos for fine-tuning and testing, would the diseased videos need to show diseased regions in every frame?

Hi, @10086ddd
Yes, this is a classification task, so no region annotation is needed. If you want to perform lesion detection, you can refer to STFT in this repo.

Hi, @Kyfafyd
Thank you for your work and your answer. I was indeed asking about the classification task: I previously thought I had to ensure that every frame in a video belongs to the same category. Now it seems that is unnecessary?

Hi, @10086ddd
Note that this task is to recognize whether a whole video is diseased or not, so per-frame labels are not necessary.
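
For illustration, video-level labeling means one label per video rather than per frame. A hypothetical annotation format (not necessarily the repo's actual file layout):

# One (video, label) pair per sample; paths and labels here are made up.
annotations = [
    ("videos/case_001.mp4", 1),  # diseased
    ("videos/case_002.mp4", 0),  # disease-free
]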

OK, thanks @Kyfafyd

Hi, @Kyfafyd
Sorry to bother you again, but after using the latest updated model yesterday, the test results are still unstable. I noticed that the source code uses distributed training; because my environment does not support it, I commented out that part of the code, with no other modifications. If I train on only one GPU, do I need to not only comment out the distributed-training code but also modify the model's parameters and configuration files?

Hi, @10086ddd
You could test the model in a distributed environment; even a single GPU can be set up as a distributed scenario, as sketched below.
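
For reference, a minimal single-GPU distributed setup in PyTorch looks like this (a generic sketch, not the repo's exact launcher; the address and port are arbitrary):

import os
import torch
import torch.distributed as dist

# A single process acts as rank 0 of a world of size 1, so the
# distributed code paths run unchanged on one GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)
torch.cuda.set_device(0)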

Hi @Kyfafyd,
Sorry for only replying today. In the meantime I have been testing on a Linux server with one GPU, and the results are now stable. Thank you for your earlier answer. However, the result stays at 66%, so I think something still needs adjusting.

Hi, @10086ddd
Are you using the latest code and the latest weights?

Hi, @Kyfafyd
Yes, I downloaded the project code and the weight file again and retested, but the result was still 66.1%.

Hi @10086ddd
I forgot to add the line that loads the updated backbone during testing. You can try the latest code; it should now give the correct result.

Hi, @Kyfafyd
The new code now gives the expected results. Thank you for your work.

Hi @Kyfafyd,
I have a small question: can the PolypDiag downstream task handle longer videos, for example videos lasting more than 10 minutes?

Hi, @10086ddd
Are you performing a video-level or a frame-level task?

Hi, @Kyfafyd
Isn't the PolypDiag downstream task a video-level task?

PolypDiag is video-level. I think you can try it on videos lasting more than 10 minutes. Increasing the number of sampled frames per input video may help improve performance, by adding DATA.NUM_FRAMES 16 to the fine-tuning script (see the sketch below).
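
As a sketch of what that override does, assuming a yacs-style config (suggested by the DATA.NUM_FRAMES key; the default of 8 here is hypothetical):

from yacs.config import CfgNode

# Hypothetical minimal config; in the repo the defaults come from its config files.
cfg = CfgNode({"DATA": CfgNode({"NUM_FRAMES": 8})})

# Equivalent to appending `DATA.NUM_FRAMES 16` to the fine-tuning command line.
cfg.merge_from_list(["DATA.NUM_FRAMES", 16])
print(cfg.DATA.NUM_FRAMES)  # 16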

OK, thanks for your answer, @Kyfafyd.