qute012/Wav2Keyword

Inference

codeghees opened this issue · 15 comments

Hi! Great work with this. Was able to reproduce your results I think.
@qute012
Two questions - what is the best way to run inference on the trained model? Any sample you have?
Secondly, I was getting an error when fine-tuning a model trained on Google Speech Commands on my Urdu dataset:
cfg = convert_namespace_to_omegaconf(state_dict['args'])
The error was a KeyError: 'args'. What am I doing wrong?
I was passing the .pth model, and I checked that the model was being loaded.
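
One possible cause, mirroring fairseq's own checkpoint-loading logic, is that newer fairseq checkpoints store their config under 'cfg' (already an OmegaConf object) rather than 'args'. A guard like this sketch, with 'model.pth' as a placeholder path, may help:

import torch
from fairseq.dataclass.utils import convert_namespace_to_omegaconf

state_dict = torch.load('model.pth', map_location='cpu')
if 'args' in state_dict and state_dict['args'] is not None:
    # older fairseq checkpoints: an argparse Namespace under 'args'
    cfg = convert_namespace_to_omegaconf(state_dict['args'])
elif 'cfg' in state_dict:
    # newer fairseq checkpoints: an OmegaConf config stored directly
    cfg = state_dict['cfg']
else:
    raise KeyError("checkpoint contains neither 'args' nor 'cfg'")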

Any help would be appreciated.

The test accuracy for 10 samples for each keyword is over 94 percent. Sounds too good to be true.

Hello, @codeghees. Could you please provide a requirements.txt or a conda environment.yml for the environment you used while reproducing the results? I tried to reproduce the results on the Google Speech Commands v2 dataset and ran into the same errors.

Hi~ @codeghees @BeardyMan37

Thanks for your interest in this project. Honestly, I can't afford to maintain this project right now, and I also can't access the server ;(
If I have time, I'd like to extend this project to support inference. But you can reproduce it yourselves by referring to the hyperparameters and model architecture.

Sorry 😐

Can you point me in a direction for inference?

I can build it myself.

@BeardyMan37 I used Google Colab.

@codeghees

  1. Extract the loudest section
    Most important for accuracy, because this model only takes a 1-second raw audio clip. So you have to check that the extracted signal actually contains voice.
def extract_loudest_section(wav, win_len=30):
    # Slide a 1-second (16000-sample at 16 kHz) window over the signal in
    # steps of win_len samples and keep the window with the largest energy.
    wav_len = len(wav)
    temp = abs(wav)

    st, et = 0, 0
    max_dec = 0

    for ws in range(0, wav_len, win_len):
        cur_dec = temp[ws:ws+16000].sum()  # energy of the current window
        if cur_dec >= max_dec:
            max_dec = cur_dec
            st, et = ws, ws + 16000
        if ws + 16000 > wav_len:
            break  # the window has reached the end of the signal

    return wav[st:et]
  2. Post-process (in fairseq)
    You don't need to normalize the raw audio. I think it does nothing here; I just added it for the Wav2Vec 2.0 pipeline. I'm not sure, but it should be fine to remove this function.
import torch
import torch.nn.functional as F

def postprocess(self, feats, curr_sample_rate):
    # Collapse multi-channel audio to mono by averaging the channels.
    if feats.dim() == 2:
        feats = feats.mean(-1)

    if curr_sample_rate != self.sample_rate:
        raise Exception(f"sample rate: {curr_sample_rate}, need {self.sample_rate}")

    assert feats.dim() == 1, feats.dim()

    # Optional per-utterance normalization, as in the Wav2Vec 2.0 pipeline.
    if self.normalize:
        with torch.no_grad():
            feats = F.layer_norm(feats, feats.shape)
    return feats
  3. Make a single batch to feed to the model.

  4. Predict the class from the argmax of the model output (see the sketch after this list).
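
Putting the four steps together, a minimal inference sketch. It is an illustration only: `model` is assumed to be the trained Wav2Keyword model already loaded in eval mode, extract_loudest_section is the function from step 1 used standalone, CLASSES is a hypothetical ordered label list, and the input is 16 kHz mono audio read with soundfile:

import torch
import soundfile as sf

wav, sr = sf.read('sample.wav')            # 'sample.wav' is a placeholder path
assert sr == 16000, 'model expects 16 kHz audio'

loudest = extract_loudest_section(wav)     # step 1: keep the loudest 1-sec window
feats = torch.from_numpy(loudest).float()  # step 2 (postprocess) is optional, per above

batch = feats.unsqueeze(0)                 # step 3: batch of one, shape (1, 16000)
with torch.no_grad():
    logits = model(batch)                  # step 4: assumed to return class logits

pred_idx = logits.argmax(dim=-1).item()
print(CLASSES[pred_idx])                   # CLASSES: ordered label list from training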

Also, how do we know which index represents which class, i.e. 0 for "UP"? Is that the position of the item in the index array?

@codeghees

Yes, right! Like any other simple classification method :D

Oh, I meant: how do we know the mapping? Does that come from the CLASSES array?

Thanks!

Yes. If you can reproduce the training environment, could you open a PR for others?
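
To make the mapping concrete, a minimal sketch; the ordering below is only an example, what matters is that CLASSES matches the label order used at training time:

# Hypothetical ordered label list: output index i corresponds to CLASSES[i].
CLASSES = ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go',
           'unknown', 'silence']

pred_idx = logits.argmax(dim=-1).item()  # index of the highest-scoring class
keyword = CLASSES[pred_idx]              # map the index back to its keyword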

I will go back and check - I just opened Colab and followed the instructions. What is the exact error, @BeardyMan37?

Managed to resolve it. @codeghees

@qute012 attaching both the requirements.txt and the environment.yml file for your reference.

Hello @codeghees. I encountered the same error while trying to fine-tune a Hugging Face wav2vec model with fairseq. Have you found a way to convert a Hugging Face model (.bin) to a fairseq checkpoint (.pt)?

@codeghees can you please guide me or share a link to your Colab notebook? I want to reproduce this result and apply the same approach to Urdu.