[Question] Usage of feature fusion for Multimodal Retrieval?
BIGBALLON opened this issue · 2 comments
BIGBALLON commented
Hi, @logicwong. Thanks for your great work!!
I have a few questions about image retrieval:
- For (image + audio) -> image:
  - Should we first extract the features of the query image and the query audio separately and then fuse them? (If so, how should they be fused?)
  - Or should we feed the query image and audio into the network at the same time and extract a single feature? Is there an API for this, or could you provide some example scripts?
- For (image + text) -> image and (image + text + audio) -> image: the same questions apply. How should these be done?
Thanks again for your amazing project. I look forward to your reply!
logicwong commented
Hello, thank you for your interest in ONE-PEACE.
As you mentioned, we extract the features of each modality separately and then fuse them by averaging (the features are summed and divided by the number of modalities). You can refer to the following code:
```python
from PIL import Image

# `model`, `candidate_image_features`, and `index2image` are assumed
# to be defined beforehand (see the sketch below).

def shot(image, audio, text):
    # Extract the features of each provided modality separately.
    features_list = []
    if image is not None:
        src_images = model.process_image([image])
        image_features = model.extract_image_features(src_images)
        features_list += [image_features]
    if audio is not None:
        src_audios, audio_padding_masks = model.process_audio([audio])
        audio_features = model.extract_audio_features(src_audios, audio_padding_masks)
        features_list += [audio_features]
    if text is not None:
        src_tokens = model.process_text([text])
        text_features = model.extract_text_features(src_tokens)
        features_list += [text_features]

    # Fuse by averaging the modality features.
    mixed_features = sum(features_list) / len(features_list)

    # Rank the candidate images by similarity and keep the top 20.
    sims = mixed_features @ candidate_image_features.t()
    _, rank_img = sims.topk(k=20, dim=1)

    predict_image_list = []
    for i in rank_img.squeeze().tolist():
        image_path = index2image[str(i)]
        predict_image_list.append(Image.open(image_path).convert("RGB"))
    return predict_image_list
```
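For completeness, here is a minimal sketch of how the pieces the snippet assumes (`model`, `candidate_image_features`, `index2image`) could be set up, and how one function covers all three cases asked about. The loading call follows this repo's README; the gallery and query paths are placeholders:

```python
import torch
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained("ONE-PEACE", device=device, dtype="float16")

# Hypothetical candidate gallery: encode every image once and cache the features.
candidate_paths = ["gallery/0.jpg", "gallery/1.jpg"]  # placeholder paths
index2image = {str(i): path for i, path in enumerate(candidate_paths)}
with torch.no_grad():
    src_images = model.process_image(candidate_paths)
    candidate_image_features = model.extract_image_features(src_images)

# (image + audio) -> image
results = shot("query.jpg", "query.wav", None)
# (image + text) -> image
results = shot("query.jpg", None, "a dog running on the beach")
# (image + text + audio) -> image
results = shot("query.jpg", "query.wav", "a dog running on the beach")
```

Assuming the extracted features are normalized, as in the repo's retrieval demos, averaging keeps the fused query on the same scale as a single-modality query, and the product with `candidate_image_features.t()` behaves like a cosine similarity.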
BIGBALLON commented
Thanks for your response, got it!