atosystem/SpeechCLIP

Simple Embeddings


Hi,

Could you please provide a simple way to load a model and run a single audio clip through it to produce an embedding?

Thank you very much.

@corranmac Got it!
I have updated the README.
See example.py
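
Roughly, it boils down to something like the following sketch. The checkpoint path is a placeholder and the resampling step is my assumption; defer to example.py for the exact calls:

```python
# Minimal sketch, assuming the model class exported from kwClip.py and a
# placeholder checkpoint path -- see example.py for the exact calls.
import torch
import torchaudio

from avssl.model import KWClip_GeneralTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = KWClip_GeneralTransformer.load_from_checkpoint(
    "path/to/speechclip_checkpoint.ckpt"  # placeholder: use a downloaded checkpoint
).to(device)
model.eval()

# The speech encoder expects 16 kHz mono waveforms.
wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
```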

Thanks for such a quick response!

I was wondering if I'm able to transform these output embeddings to the same shape as CLIP's, for use in speech-image retrieval and also in image generation models trained on CLIP embeddings? I can't seem to find a separate class for encoding, e.g. model.encode() like CLIP has.

Thanks

@corranmac
Yeah, you can use the semantic embedding of speech to calculate similarity with image embeddings for speech-image retrieval. In fact, this is how we do it in our paper.
I have added a function to the model class (kwClip.py) for extracting the semantic embedding of speech input:

```python
def encode_speech(
    self,
    wav,
) -> dict:
    """encode speech
    Args:
        wav (list): input list of waveforms
    Returns:
        dict: {
            "cascaded_audio_feat" : if cascaded branch exists
            "parallel_audio_feat" : if parallel branch exists
            "vq_results" : if cascaded branch exists
            "keywords" : if cascaded branch exists
        }
    """
```

@corranmac If there is no further question, I will close this issue.