baaivision/tokenize-anything

question about Model D training

Closed this issue · 5 comments

Awesome work, congratulations!
I have some questions about Model D training.
1. This model is pre-trained with [Mask, Concept]. Does "concept" mean the text embeddings (2560 categories)? If so, how are concepts obtained for the 1B masks?
2. The paper mentions 2.25TB of image embeddings. How is this data used?

Hi, @jetyingjia

  1. Each mask has a pre-computed image embedding, which is used to encode the log target via encode_tgt(...).
  2. The 2.25TB image embedding database contains 1B embeddings for 1B masks, used in 1.

BTW, it would take about 60 days to compute 1B EVA-CLIP-E embeddings with 8 NVIDIA A100 GPUs 😅.
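As a rough sanity check on the figures quoted above, the arithmetic can be sketched as follows. This assumes "1B" means exactly 10^9 masks and "2.25TB" means 2.25 × 10^12 bytes; both are rounded numbers from the thread, so the results are only order-of-magnitude estimates.

```python
# Back-of-the-envelope check of the numbers quoted in this thread.
# Assumptions (not stated in the thread): 1B = 1e9 masks, 1 TB = 1e12 bytes.
NUM_MASKS = 1_000_000_000
DATABASE_BYTES = 2.25e12
DAYS = 60
NUM_GPUS = 8

# Average storage per mask embedding, in bytes.
bytes_per_embedding = DATABASE_BYTES / NUM_MASKS

# Throughput needed to finish 1B embeddings in 60 days.
seconds = DAYS * 24 * 60 * 60
embeddings_per_second = NUM_MASKS / seconds
embeddings_per_second_per_gpu = embeddings_per_second / NUM_GPUS

print(f"{bytes_per_embedding:.0f} bytes per embedding")
print(f"{embeddings_per_second:.1f} embeddings/s in total")
print(f"{embeddings_per_second_per_gpu:.1f} embeddings/s per GPU")
```

Under these assumptions each embedding takes a little over 2 KB, and each GPU has to produce on the order of a few dozen mask embeddings per second.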

Hi, @PhyscalX
1. So this means the target of Model D's classification branch is the concept distribution (the image embedding projected to 2560-dimensional distribution logits), not a region pseudo-label (many papers use pseudo-labels, e.g. OWL).
2. Are there other papers you would recommend on learning the concept distribution?
Thank you!

  1. We have clarified in Sec. 3.1 that we use a KL divergence loss.
  2. This method is used by many CLIP-based distillation papers (e.g. RegionCLIP, a modified Faster R-CNN
    for Open-Vocabulary Classification). However, it is challenging to integrate this method into SAM with 1B masks.
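For readers following the discussion, the idea of distilling a concept distribution with a KL divergence loss can be sketched in plain Python. This is a toy illustration only, not the repository's code: the helper names, the dot-product similarity, the temperature parameter, and the 3-concept, 2-dimensional vectors are all assumptions made for the example (the thread mentions 2560 concepts and EVA-CLIP-E embeddings).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def concept_logits(embedding, concept_embeddings, temperature=1.0):
    """Dot-product similarity of one embedding against every concept embedding.

    The temperature is a hypothetical knob, not taken from the thread.
    """
    return [
        sum(a * b for a, b in zip(embedding, concept)) / temperature
        for concept in concept_embeddings
    ]

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions."""
    return sum(
        t * (math.log(t + eps) - math.log(p + eps))
        for t, p in zip(target, pred)
    )

# Toy data: 3 concept (text) embeddings instead of 2560.
concepts = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
clip_embedding = [0.9, 0.1]        # stands in for the pre-computed mask embedding
predicted_embedding = [0.6, 0.4]   # stands in for the model's predicted embedding

# Teacher: concept distribution from the pre-computed image embedding.
target = softmax(concept_logits(clip_embedding, concepts))
# Student: concept distribution from the model's prediction.
pred = softmax(concept_logits(predicted_embedding, concepts))

loss = kl_divergence(target, pred)
print(f"KL loss: {loss:.6f}")
```

The point of the sketch is that the supervision is a soft distribution over all concepts rather than a single pseudo-label per region, which matches the distinction drawn in the question above.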

@PhyscalX
Good idea. Do you plan to release the full project (including training)? I would like to fine-tune this model on my own datasets.

Please refer to issue #5. Currently, we have no plan to release the full code.
Instead, we have released the visual prompter and losses for pre-training and fine-tuning.