simon-ging/coot-videotext

How can I extract the COOT features for my Chinese caption dataset?

Opened this issue · 4 comments

Thanks in advance!

How can I get a pretrained model like "provided_models/yc2_100m_coot.pth" for my Chinese caption dataset? I extracted the COOT features of YouCook2 by following "Extract your own embeddings" in the README and got the same results as the paper. But when I extract the YouCook2 COOT features without "provided_models/yc2_100m_coot.pth", the captioning results are worse than those of the original MART model.

The retrieval results:
with "provided_models/yc2_100m_coot.pth":
INFO Saved embeddings to experiments/retrieval/paper2020/yc2_100m_coot_valset1/embeddings/embeddings_0.h5
INFO Retriev | R@1 | R@5 | R@10 | R@50 | MeanR | MedR | Sum
INFO vid | 0.810 | 0.958 | 0.978 | 0.996 | 1.0 | 2.2 | 2.764
INFO par | 0.783 | 0.963 | 0.978 | 0.996 | 1.0 | 2.3 | 2.742
INFO cli | 0.159 | 0.395 | 0.513 | 0.782 | 10.0 | 74.4 | 1.335
INFO sen | 0.169 | 0.406 | 0.525 | 0.780 | 9.0 | 73.2 | 1.355
INFO Loss 0.04828 (Contr: 0.03885, CC: 0.00943) Retrieval: vidpar (457) in 0.042s, clisen (3492) in 2.100s, total 6.557s, forward 0.220s

without "provided_models/yc2_100m_coot.pth":
INFO Saved embeddings to experiments/retrieval/paper2020/yc2_100m_coot_valset1/embeddings/embeddings_0.h5
INFO Retriev | R@1 | R@5 | R@10 | R@50 | MeanR | MedR | Sum
INFO vid | 0.042 | 0.149 | 0.225 | 0.584 | 39.0 | 57.1 | 0.775
INFO par | 0.035 | 0.179 | 0.291 | 0.759 | 22.0 | 37.0 | 0.974
INFO cli | 0.000 | 0.001 | 0.003 | 0.017 | 1391.0 | 1488.7 | 0.018
INFO sen | 0.000 | 0.001 | 0.003 | 0.015 | 1425.0 | 1538.9 | 0.017
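For readers comparing the two tables, here is a minimal sketch of how recall@K, mean rank and median rank are commonly computed from a query-by-candidate similarity matrix, plus the recall a random ranking would reach. It uses the usual textbook definition and is not taken from the repository's code; `retrieval_metrics` and `chance_recall` are hypothetical names. The candidate count of 3492 is the clip/sentence count printed in the first log ("clisen (3492)"), assumed here to apply to the second run as well.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    """Compute recall@K, mean rank and median rank.

    sim[i][j] is the similarity of query i to candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["MeanR"] = mean(ranks)
    metrics["MedR"] = median(ranks)
    return metrics


def chance_recall(k, num_candidates):
    """Expected recall@K when candidates are ranked uniformly at random."""
    return min(k, num_candidates) / num_candidates


# Toy example: queries 0 and 2 rank their match first, query 1 ranks it last.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
toy = retrieval_metrics(sim, ks=(1, 2))

# Chance level for 3492 clip/sentence candidates (count taken from the log).
chance = {k: chance_recall(k, 3492) for k in (1, 10, 50)}
print(toy)
print(chance)
```

With 3492 candidates, chance recall is about 0.003 at K=10 and about 0.014 at K=50, which is close to the "cli" and "sen" rows of the run without the provided checkpoint. That is consistent with clip-level and sentence-level embeddings that carry essentially no learned alignment.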

The pretrained models only understand English. For Chinese you will have to train everything from scratch: train retrieval, extract features, then train captioning.


Thank you very much for your reply! So you mean that "provided_models/yc2_100m_coot.pth" is produced by training the retrieval model?

Yes, that is correct