huggingface/transformers.js

Support wavlm-base-plus-sv with WebGPU


System Info

Transformers.js Alpha 10, Brave

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Not sure what happened, but:

  • Refreshed the page
  • Started Whisper
  • Saw this:
Screenshot 2024-08-30 at 21 11 45

I had just been fiddling with moving a WASM file into a local folder. But since the local copy doesn't seem to get loaded, that shouldn't be the cause.

Reproduction

I'll share more if I can reproduce it myself.

It seems to be related to the audio verification model I'm using.

Perhaps I'm accidentally running it in parallel.

Screenshot 2024-08-31 at 10 54 57

I had forgotten an await. But even after fixing that, the error still occurs.

I've added a 5-second delay between verifications of the audio snippets, and in the console I can see that it only attempts a verification every 5+ seconds. So it's definitely not running in parallel.
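For anyone wanting to rule out overlapping calls in the same way, here is a minimal sketch of a serial queue with a pause between tasks. The helper name createSerialQueue and its delayMs parameter are hypothetical and not part of Transformers.js. It only assumes that each verification is wrapped in an async function.

```javascript
// Hypothetical helper (not part of Transformers.js): runs async tasks
// strictly one after another, pausing delayMs between them.
function createSerialQueue(delayMs = 0) {
  let tail = Promise.resolve();
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  return function enqueue(task) {
    // The task starts only after everything queued before it has settled.
    const run = tail.then(() => task());
    // Swallow failures on the internal chain so one failed task does not
    // block later ones, then wait before allowing the next task to start.
    tail = run.catch(() => {}).then(() => sleep(delayMs));
    // The caller still sees the task's own result or rejection.
    return run;
  };
}
```

With something like `const enqueue = createSerialQueue(5000)`, every verification call would go through `enqueue(() => verify(snippet))`, where verify stands in for whatever function computes the embedding. A missing await elsewhere can then no longer start two runs at once.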

It also seems to output an embedding, despite the error. Though perhaps it's outputting an older embedding? I'm going to check that next.

Screenshot 2024-09-02 at 10 20 41

I played around with dtypes. Then I realized that the issue is probably with using WebGPU in the first place.

Solution

I switched the verification model over to WASM, and bingo, now it runs fine. It's detecting multiple speakers again.
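In case it helps others, this is roughly how the switch can be expressed: a small lookup that pins the verification model to WASM and leaves everything else on WebGPU. The names DEVICE_OVERRIDES and modelOptions are hypothetical. The sketch also assumes the v3 `from_pretrained` options accept a `device` field with the values 'webgpu' and 'wasm', so treat it as an illustration, not the library's prescribed pattern.

```javascript
// Hypothetical per-model device table; the names are illustrative only.
const DEVICE_OVERRIDES = {
  // The speaker verification model errored on WebGPU in this issue,
  // so it is pinned to WASM here.
  'Xenova/wavlm-base-plus-sv': 'wasm',
};

// Build the options object for a model, falling back to a default device
// when the model has no override.
function modelOptions(modelId, defaultDevice = 'webgpu') {
  const device = DEVICE_OVERRIDES[modelId] ?? defaultDevice;
  return { device };
}

// Assumed usage with Transformers.js v3 (not executed in this sketch):
//   const id = 'Xenova/wavlm-base-plus-sv';
//   const model = await AutoModel.from_pretrained(id, modelOptions(id));
```

Keeping the override in one table makes it easy to drop the entry again once the model works on WebGPU.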

Conclusion

So the conclusion is: the Xenova/wavlm-base-plus-sv model does not yet have WebGPU support.

WASM doesn't seem to be much slower, so I don't think it matters much. But just for completeness I'll rename this issue to 'Support wavlm-base-plus-sv with WebGPU'.

wespeaker-voxceleb-resnet34-LM

I also quickly swapped in onnx-community/wespeaker-voxceleb-resnet34-LM. When using WebGPU (at the default dtype or forced to FP32) it doesn't output errors, but the embeddings it returns don't seem useful. Here are some similarity score outputs (with the similarity threshold lowered to 0.5 instead of 0.95):

Screenshot 2024-09-02 at 10 54 46 (expectation: two speakers)
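For context on what those scores mean, here is a minimal sketch of the kind of comparison involved, assuming the score is a cosine similarity between two embedding vectors (an assumption, since the exact metric isn't stated above). The function names are hypothetical. The 0.95 default and the lowered 0.5 value are the thresholds mentioned above.

```javascript
// Cosine similarity between two equal-length numeric vectors
// (plain arrays or typed arrays such as Float32Array).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Embeddings must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // An all-zero embedding has no direction; report it as "no similarity".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical decision rule: same speaker if the score reaches the threshold.
function isSameSpeaker(a, b, threshold = 0.95) {
  return cosineSimilarity(a, b) >= threshold;
}
```

If two different speakers still score above the threshold with this kind of check, the embeddings themselves are the likely problem, not the comparison.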

This could just be an implementation error on my part. But for now I'll be sticking with wavlm-base-plus-sv.