DirtyHarryLYL/HAKE-Action

Is zero-shot learning possible?

whqwill opened this issue · 5 comments

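Is it possible to do zero-shot learning (ZSL), i.e., to infer new actions outside the 600 classes directly from the part states? For example, suppose I have new data that does not belong to these 600 classes, and there is no annotated training set for it.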

This is a great question. Generalization is our main goal in building HAKE: we want to get rid of costly data annotation.

In our previous work, we used the PaSta classifier as a pre-trained backbone to extract PaSta features for downstream tasks, much like an ImageNet pre-trained ResNet. So we only need to finetune a new classifier for each new task, e.g., AVA (80), V-COCO (29).

As for full ZSL, i.e., no data for finetuning the classifier, we are working on this.

Some priors can help. After recognizing the PaStas of a person:
(1) For compositional learning, the dataset may provide pre-designed hints about the PaSta-action relations, as in the settings of MIT-States and UT-Zappos.
(2) Alternatively, we can use language priors to mine PaSta-action relations, such as a correlation matrix between the 93 PaStas and X new actions ([93, X]). The PaSta prediction scores ([93]) can then be multiplied by this matrix directly to generate predictions ([X]) for the unseen actions.
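As a rough illustration of option (2), the [93] score vector times the [93, X] matrix can be sketched as below. This is only a minimal sketch: the function name `predict_unseen_actions`, the toy scores, and the toy matrix are all made up for illustration and are not taken from the HAKE code.

```python
# Sketch: map PaSta scores to unseen-action scores via a correlation matrix.
NUM_PASTA = 93  # number of PaSta classes mentioned in the thread


def predict_unseen_actions(pasta_scores, correlation):
    """Multiply a [93] score vector by a [93, X] matrix to get [X] scores."""
    if len(pasta_scores) != len(correlation):
        raise ValueError("one correlation row is needed per PaSta class")
    num_actions = len(correlation[0])
    return [
        sum(pasta_scores[i] * correlation[i][j] for i in range(len(pasta_scores)))
        for j in range(num_actions)
    ]


# Toy data (made up): only PaSta 0 and PaSta 5 fire for this person.
pasta_scores = [0.0] * NUM_PASTA
pasta_scores[0] = 0.9
pasta_scores[5] = 0.4

# Hypothetical [93, 2] matrix for X = 2 unseen actions.
correlation = [[0.0, 0.0] for _ in range(NUM_PASTA)]
correlation[0] = [1.0, 0.0]  # PaSta 0 relates only to action 0
correlation[5] = [0.5, 1.0]  # PaSta 5 relates to both actions

action_scores = predict_unseen_actions(pasta_scores, correlation)
best_action = max(range(len(action_scores)), key=action_scores.__getitem__)
```

With real data the same product is a single matrix-vector multiplication in any array library; the plain-Python loop is used here only to keep the sketch dependency-free.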

Besides the approaches above, we are trying other new ideas such as causal reasoning. Once they are available, we will release them in our HAKE project as well.

Thanks! It helps a lot. But I have a question: how do you get the correlation matrix when using language priors?

Another idea: can we take a description of the final action and match it against the descriptions of the PaSta states by semantic similarity? But I don't know how to find suitable descriptions.

The correlation can be the cosine similarity between two sentences, e.g., "hand is holding an apple" and "cleaning fruit".
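A minimal sketch of building such a matrix from sentence embeddings follows. It assumes the embeddings already exist (for example from BERT); the 3-d vectors below are made up and merely stand in for real sentence vectors, and `build_correlation` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def build_correlation(pasta_embeddings, action_embeddings):
    """Return a [num_pastas, num_actions] matrix of cosine similarities."""
    return [
        [cosine_similarity(p, a) for a in action_embeddings]
        for p in pasta_embeddings
    ]


# Made-up 3-d vectors standing in for real sentence embeddings.
pasta_embeddings = [
    [1.0, 0.0, 0.0],  # e.g. "hand is holding an apple"
    [0.0, 1.0, 0.0],  # e.g. a second PaSta sentence
]
action_embeddings = [
    [1.0, 1.0, 0.0],  # e.g. "cleaning fruit"
]

correlation = build_correlation(pasta_embeddings, action_embeddings)
```

Cosine similarity can be negative, so depending on how the matrix is used it may make sense to clip or rescale the entries; that choice is left open here.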

For the second idea, we have tried one approach: compose all the PaSta descriptions (BERT features) into a fixed-size vector with the same dimension as the action description (also a BERT vector), then compute their similarity. The smaller the distance, the higher the probability. However, the performance was only comparable.
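One way to picture this matching step is sketched below. The thread does not say how the descriptions were composed or how distances became probabilities, so mean pooling, Euclidean distance, and a softmax over negative distances are all assumptions made for this sketch, and the 2-d vectors are toy values rather than BERT features.

```python
import math


def mean_pool(vectors):
    """Average equal-length vectors into one vector of the same size."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def action_probabilities(pasta_vectors, action_vectors):
    """Softmax over negative Euclidean distances: closer means more probable."""
    pooled = mean_pool(pasta_vectors)
    distances = [math.dist(pooled, a) for a in action_vectors]
    # Subtracting the smallest distance keeps the exponentials stable.
    smallest = min(distances)
    weights = [math.exp(-(d - smallest)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


pasta_vectors = [[1.0, 0.0], [0.0, 1.0]]   # pooled: [0.5, 0.5]
action_vectors = [[0.5, 0.5], [3.0, 3.0]]  # first action matches exactly
probs = action_probabilities(pasta_vectors, action_vectors)
```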

A more suitable approach may be to use captioning-style methods that encode the PaSta descriptions and generate the final action description in a more NLP style. This is beyond the scope of our PaSta paper, but it is a very interesting direction to dig into, since PaSta descriptions are finer-grained depictions of action images and can further advance downstream scene understanding tasks such as VQA, captioning, and visual reasoning.

We are also working on developing more applications of HAKE. If you have any thoughts or work on a deeper use of HAKE data, you are welcome to discuss them with us!

I understand the cosine similarity between two sentences, but "apple" comes from the specific data (it is unknown), so the correlation matrix between the 93 PaStas and X new actions ([93, X]) can't contain the "apple" information (i.e., it carries no object information). How do you deal with that?

A more NLP style is a good idea. I will think more about it. Thanks.

The "apple" sentence was just an example; the detailed operations depend on the scenario, e.g.,

  1. No clues about the objects: use "something/object/...".
  2. The object class range is known, such as the COCO 80 classes: use all possible objects (apple, book, orange) for a verb and aggregate all the results for that verb, e.g., "hold".
  3. The rough object class is known, such as animals, fruits, ...: use the rough class names.
    ...
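
The three scenarios above can be sketched as filling an object slot in a sentence template and then aggregating over the fillers. Everything below is illustrative: the template, the helper names, and the use of a plain mean for "aggregate" are assumptions, and `word_overlap` is only a toy stand-in for a real embedding similarity.

```python
def candidate_sentences(part, verb, objects=None):
    """Fill the object slot of a hypothetical "<part> is <verb> <object>" template."""
    # With no clue about the object, fall back to a generic placeholder.
    fillers = objects if objects else ["something"]
    return [f"{part} is {verb} {obj}" for obj in fillers]


def aggregated_similarity(sentences, action, similarity):
    """Average the sentence-action similarity over all object fillers."""
    scores = [similarity(s, action) for s in sentences]
    return sum(scores) / len(scores)


def word_overlap(a, b):
    """Toy stand-in for an embedding similarity: Jaccard overlap of words."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


# Scenario 1: nothing is known about the object.
unknown = candidate_sentences("hand", "holding")
# Scenario 2: the object class range is known (three classes shown here).
known_range = candidate_sentences("hand", "holding", ["apple", "book", "orange"])
# Scenario 3: only the rough class is known.
rough_class = candidate_sentences("hand", "holding", ["fruit"])

score = aggregated_similarity(known_range, "cleaning fruit", word_overlap)
```

Passing the similarity function as a parameter keeps the sketch independent of any particular language model; a sentence-embedding cosine similarity could be swapped in for `word_overlap` without changing the rest.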