OFA-Sys/ONE-PEACE

[Question] Usage of feature fusion for Multimodal Retrieval?

BIGBALLON opened this issue · 2 comments

Hi, @logicwong. Thanks for your great work!!

I have some questions about multimodal image retrieval:

  • For (image + audio) -> image:
    • Should we first extract the features of the query image and the query audio separately and then fuse them? (If so, how should they be fused?)
    • Or should we feed the query image and audio to the network together and extract a single feature? Is there an API for this, or could you provide some example scripts?
  • For (image + text) -> image and (image + text + audio) -> image: the same questions as above, and how should these cases be handled?

Thanks again for your amazing project. I hope to hear from you.

Hello, thank you for your interest in ONE-PEACE.

As you mentioned, we extract the features of each modality separately and then fuse them by averaging (summing the features and dividing by the number of modalities). You can refer to the following code:

from PIL import Image

def shot(image, audio, text):
    """Retrieve the top-20 candidate images for a multimodal query.

    Assumes `model`, `candidate_image_features` (features of all candidate
    images, pre-extracted with `extract_image_features`), and `index2image`
    (mapping from candidate index to image path) are already in scope.
    """
    # Extract a feature for each modality that is provided.
    features_list = []
    if image is not None:
        src_images = model.process_image([image])
        image_features = model.extract_image_features(src_images)
        features_list += [image_features]
    if audio is not None:
        src_audios, audio_padding_masks = model.process_audio([audio])
        audio_features = model.extract_audio_features(src_audios, audio_padding_masks)
        features_list += [audio_features]
    if text is not None:
        src_tokens = model.process_text([text])
        text_features = model.extract_text_features(src_tokens)
        features_list += [text_features]

    # Fuse by averaging, then rank the candidates by similarity.
    mixed_features = sum(features_list) / len(features_list)
    sims = mixed_features @ candidate_image_features.t()
    _, rank_img = sims.topk(k=20, dim=1)

    # Map the top-ranked indices back to image files.
    predict_image_list = []
    for i in rank_img.squeeze().tolist():
        image_path = index2image[str(i)]
        predict_image_list.append(Image.open(image_path).convert("RGB"))

    return predict_image_list
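If it helps, the fusion-and-ranking step can be seen in isolation with a small NumPy sketch. The feature vectors and candidate set below are toy values invented for illustration (real ONE-PEACE features are high-dimensional and L2-normalized by the model):

```python
import numpy as np

def fuse_and_rank(features_list, candidate_features, k=3):
    """Average the per-modality query features, then rank candidates
    by dot-product similarity (assumes L2-normalized features)."""
    mixed = sum(features_list) / len(features_list)   # shape (1, dim)
    sims = mixed @ candidate_features.T               # shape (1, num_candidates)
    return np.argsort(-sims, axis=1)[:, :k]           # top-k candidate indices

# Toy example: four candidate image features in a 2-D space.
candidates = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.707, 0.707],
                       [-1.0, 0.0]])
image_feat = np.array([[1.0, 0.0]])   # query image feature
audio_feat = np.array([[0.0, 1.0]])   # query audio feature

top = fuse_and_rank([image_feat, audio_feat], candidates)
# The averaged query (0.5, 0.5) points toward candidate 2 = (0.707, 0.707),
# so that candidate is ranked first.
```

The averaging keeps the fused query at the same scale regardless of how many modalities are supplied, which is why the same `shot` function handles the (image + audio), (image + text), and (image + text + audio) cases uniformly.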

Thanks for your response, got it!