[Question] Usage of feature fusion for Multimodal Retrieval?
BIGBALLON opened this issue · 2 comments
BIGBALLON commented
Hi, @logicwong. Thanks for your great work!!
I have a few questions about image retrieval:
- For (image + audio) -> image:
  - Should we first extract the features of the query image and the query audio separately and then fuse them? (If so, how should they be fused?)
  - Or should we feed the query image and audio into the network at the same time and extract a single feature? Is there an API for this, or could you provide some example scripts?
- For (image + text) -> image and (image + text + audio) -> image: the same questions apply. How should these be done?
Thanks again for your amazing project. I look forward to your reply!
logicwong commented
Hello, thank you for your interest in ONE-PEACE.
As you mentioned, we extract the features of each modality separately and then fuse them by averaging (the features are summed and divided by the number of modalities). You can refer to the following code:
```python
from PIL import Image

# `model`, `candidate_image_features`, and `index2image` are assumed
# to be defined beforehand (see the sketch below).

def shot(image, audio, text):
    # Extract the features of each provided modality separately.
    features_list = []
    if image is not None:
        src_images = model.process_image([image])
        image_features = model.extract_image_features(src_images)
        features_list += [image_features]
    if audio is not None:
        src_audios, audio_padding_masks = model.process_audio([audio])
        audio_features = model.extract_audio_features(src_audios, audio_padding_masks)
        features_list += [audio_features]
    if text is not None:
        src_tokens = model.process_text([text])
        text_features = model.extract_text_features(src_tokens)
        features_list += [text_features]

    # Fuse by averaging the modality features.
    mixed_features = sum(features_list) / len(features_list)

    # Rank the candidate images by similarity and keep the top 20.
    sims = mixed_features @ candidate_image_features.t()
    _, rank_img = sims.topk(k=20, dim=1)

    predict_image_list = []
    for i in rank_img.squeeze().tolist():
        image_path = index2image[str(i)]
        predict_image_list.append(Image.open(image_path).convert("RGB"))
    return predict_image_list
```
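For completeness, here is a minimal sketch of how the pieces the snippet assumes (`model`, `candidate_image_features`, `index2image`) could be set up, and how one function covers all three cases asked about. The loading call follows this repo's README; the gallery and query paths are placeholders:

```python
import torch
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained("ONE-PEACE", device=device, dtype="float16")

# Hypothetical candidate gallery: encode every image once and cache the features.
candidate_paths = ["gallery/0.jpg", "gallery/1.jpg"]  # placeholder paths
index2image = {str(i): path for i, path in enumerate(candidate_paths)}
with torch.no_grad():
    src_images = model.process_image(candidate_paths)
    candidate_image_features = model.extract_image_features(src_images)

# (image + audio) -> image
results = shot("query.jpg", "query.wav", None)
# (image + text) -> image
results = shot("query.jpg", None, "a dog running on the beach")
# (image + text + audio) -> image
results = shot("query.jpg", "query.wav", "a dog running on the beach")
```

Assuming the extracted features are normalized, as in the repo's retrieval demos, averaging keeps the fused query on the same scale as a single-modality query, and the product with `candidate_image_features.t()` behaves like a cosine similarity.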
BIGBALLON commented
Thanks for your response, got it!