How to use demo.py
Zhentao-Liu opened this issue · 8 comments
I downloaded your repo and the pretrained model mae_pretrain_vit_base.pth, and ran demo.py. However, after loading the model, the loaded dict contains neither a 'config' key nor a 'state_dict' key. How can I fix this?
Did you run the following command?
python3 demo.py --checkpoint=./weights/simpleclick_models/cocolvis_vit_huge.pth --gpu 0
It works fine for me. Could you provide the detailed error info?
@qinliuliuqin Can I use negative click points with this model?
Yes, of course.
@qinliuliuqin
Thank you. I just tested the demo; a right click is a negative point.
I have some other questions:
1. How do you create the positive clicks, negative clicks, Prev. Mask, and ground truth for training?
2. In the paper, the input is image + (Clicks + Prev. Mask). What is the Prev. Mask? Is it a binary image of the previous prediction?
3. What are ['NoBRS', 'RGB-BRS', 'DistMap-BRS', 'f-BRS-A', 'f-BRS-B', 'f-BRS-C']?
Hi @ThorPham, thanks for your questions.
- This line shows how to create positive and negative clicks during training (see the sketch after this list). This line shows that we concatenate the image and the previous mask (i.e., a probability map), along with the click masks, as the network input.
- See here. The previous mask is a probability map produced by the model in evaluation mode.
- We only tested the 'NoBRS' mode. You can ignore the other modes; they are inherited from RITM.
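For intuition, here is a minimal sketch of training-time click simulation: positive clicks come from still-missed foreground, negative clicks from falsely predicted background. The function name and the uniform sampling are assumptions for illustration only; the actual sampling in the linked lines is more elaborate.

```python
import numpy as np

def sample_clicks(gt_mask, prev_pred, num_clicks=1):
    """Sample (pos_points, neg_points) from the error regions.

    gt_mask:   (H, W) binary ground-truth mask
    prev_pred: (H, W) binarized previous prediction (all zeros for the first click)
    """
    false_neg = np.logical_and(gt_mask == 1, prev_pred == 0)  # missed object pixels
    false_pos = np.logical_and(gt_mask == 0, prev_pred == 1)  # wrongly labeled background
    pos_points, neg_points = [], []
    for _ in range(num_clicks):
        if false_neg.any():
            ys, xs = np.nonzero(false_neg)
            i = np.random.randint(len(ys))
            pos_points.append((int(ys[i]), int(xs[i])))
        if false_pos.any():
            ys, xs = np.nonzero(false_pos)
            i = np.random.randint(len(ys))
            neg_points.append((int(ys[i]), int(xs[i])))
    return pos_points, neg_points
```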
@qinliuliuqin Thank you for your support.
What is the preprocessing of a click point? Is it a binary mask, or do you create a distance map? And how do you feed it into the model?
In the paper, I see you concatenate the image and the prev mask. Do you also add the click points?
Hi @ThorPham, the clicks (i.e., coordinates) are encoded as a 2-channel binary mask: one channel for positive clicks and one for negative clicks. Each click is represented as a disk on the binary mask. We concatenate the click mask (2 channels), the prev mask (1 channel), and the RGB image (3 channels) to form a 6-channel input. Since we want to reuse the pretrained ViT, whose patch embedding layer only accepts 3-channel input, we add one more patch embedding layer and split the 6-channel input into two groups (each group has 3 channels, as shown in Fig. 1). In this way, we can turn the plain ViT backbone into an iSeg backbone with minimal changes.
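To make this concrete, here is a minimal sketch of the encoding with hypothetical names. How the two 3-channel groups are fused after their patch embeddings (summed here) is an assumption for illustration, not necessarily the exact SimpleClick implementation.

```python
import torch
import torch.nn as nn

def clicks_to_disk_masks(pos_points, neg_points, h, w, radius=5):
    """Encode click coordinates as a 2-channel binary mask (one disk per click)."""
    masks = torch.zeros(2, h, w)
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    for ch, points in enumerate([pos_points, neg_points]):
        for cy, cx in points:
            disk = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
            masks[ch][disk] = 1.0
    return masks

class TwoGroupPatchEmbed(nn.Module):
    """Split the 6-channel input into two 3-channel groups, each with its own
    patch-embedding layer, so the pretrained ViT patch embedding can be reused."""
    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        self.embed_rgb = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)  # pretrained weights
        self.embed_aux = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)  # newly added layer

    def forward(self, image, prev_mask, click_masks):
        # image: (B,3,H,W), prev_mask: (B,1,H,W), click_masks: (B,2,H,W)
        aux = torch.cat([click_masks, prev_mask], dim=1)      # second 3-channel group
        tokens = self.embed_rgb(image) + self.embed_aux(aux)  # fuse the two groups (assumed: sum)
        return tokens.flatten(2).transpose(1, 2)              # (B, num_patches, embed_dim)
```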
@qinliuliuqin Thank you so much.