xenova/transformers.js

How to choose a language's dialect when using `automatic-speech-recognition` pipeline?


Question

Hi, so I was originally using the transformers library (Python version) in my backend, but when refactoring my application for scale, it made more sense to move my implementation of Whisper from the backend to the frontend (for my specific use case). So I was thrilled when I saw that transformers.js supported Whisper via the automatic-speech-recognition pipeline. However, I'm a little confused by the implementation, and the documentation left me with the question in the title.

How to choose a language's dialect when using automatic-speech-recognition pipeline?

In the Python implementation of Whisper, you don't have to specify the language being spoken as long as you're using a model size that supports multilingual transcription. But from your examples for transformers.js, it seems like you do in the JS implementation.

    const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-small');
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/french-audio.mp3';
    const output = await transcriber(url, { language: 'french', task: 'transcribe' });
    // { text: " J'adore, j'aime, je n'aime pas, je déteste." }
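
For comparison, here's roughly what I was hoping to do: leave `language` out entirely and let Whisper detect it, the way the Python pipeline does (I'm not sure whether the JS pipeline actually supports this, so treat it as a sketch of the behaviour I'm after):

    const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-small');

    // No `language` option: the hope is that the multilingual model predicts the
    // language token itself, as it does in the Python library.
    const output = await transcriber(url, { task: 'transcribe' });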

However, there's no list of supported languages beyond what you can find in the Whisper GitHub repo. That's usually not a problem, but how do you deal with a language like Chinese, which has two main dialects, Mandarin and Cantonese? In Python I didn't have to worry about it, but in JS it seems to be a potential issue.

Please help. Any guidance will be appreciated.

I'm currently hardcoding the language for testing, but I keep getting errors when trying Chinese. Here's the snippet.

    // Assumes `pipeline` is imported from '@xenova/transformers', and that
    // mediaRecorderRef, audioRef, and setTranscription come from useRef/useState
    // in the surrounding component.
    useEffect(() => {
        navigator.mediaDevices.getUserMedia({ audio: true })
            .then(stream => {
                const mediaRecorder = new MediaRecorder(stream);
                mediaRecorderRef.current = mediaRecorder;
                let audioChunks = [];

                // Collect recorded chunks as they become available
                mediaRecorder.ondataavailable = event => {
                    audioChunks.push(event.data);
                };

                mediaRecorder.onstop = async () => {
                    // Assemble the recorded chunks into a single Blob for playback
                    const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });
                    const audioUrl = URL.createObjectURL(audioBlob);
                    audioRef.current.src = audioUrl;

                    // Transcribe audio
                    try {
                        const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-medium');
                        const result = await transcriber(audioBlob, { language: 'chinese', task: 'transcribe', chunk_length_s: 30, stride_length_s: 5 });
                        setTranscription(result.text);
                    } catch (error) {
                        console.log('Error transcribing audio:', error);
                    }

                    audioChunks = [];
                };

                audioRef.current = new Audio();
                audioRef.current.srcObject = stream;
                return () => {
                    stream.getTracks().forEach(track => track.stop());
                };
            })
            .catch(error => {
                console.log('Error accessing microphone:', error);
            });
    }, []);

My errors look like this:

ERROR
Unexpected token '<', "<!DOCTYPE "... is not valid JSON
SyntaxError: Unexpected token '<', "<!DOCTYPE "... is not valid JSON
    at JSON.parse (<anonymous>)
    at getModelJSON (http://localhost:3000/main.f7a520796f16f5075263.hot-update.js:35717:15)
    at async Promise.all (index 0)
    at async loadTokenizer (http://localhost:3000/main.f7a520796f16f5075263.hot-update.js:28624:16)
    at async AutoTokenizer.from_pretrained (http://localhost:3000/main.f7a520796f16f5075263.hot-update.js:32481:46)
    at async Promise.all (index 0)
    at async loadItems (http://localhost:3000/main.f7a520796f16f5075263.hot-update.js:26511:3)
    at async pipeline (http://localhost:3000/main.f7a520796f16f5075263.hot-update.js:26456:19)
    at async mediaRecorder.onstop (http://localhost:3000/main.64a844b93bb41b13d6ab.hot-update.js:59:29)

I'm guessing the pipeline can't handle webm either. In Python I had to convert it to .wav. Is that the same in transformers.js?

See here for the full list of languages supported by OpenAI's set of Whisper models (as you can see, Cantonese is not supported). Luckily, there are some community models that are trained on Cantonese, which you can try out (see here). To use them in transformers.js, you would need to convert them to ONNX (e.g., with Optimum or with our conversion script).
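
Once converted, such a model should load just like the stock checkpoints. As a rough sketch (the model id and audio file below are placeholders for whichever Cantonese fine-tune you convert and host):

    import { pipeline } from '@xenova/transformers';

    // Placeholder id: substitute the ONNX-converted Cantonese fine-tune you end up using.
    const transcriber = await pipeline(
        'automatic-speech-recognition',
        'your-username/whisper-small-cantonese',
    );

    // Cantonese fine-tunes typically keep the base model's 'chinese' language tag,
    // and produce Cantonese output because that's what they were trained on.
    const output = await transcriber('cantonese-audio.wav', { // placeholder audio URL
        language: 'chinese',
        task: 'transcribe',
    });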

To solve your second problem (invalid JSON), you can set the following (see here). Copied here:

In most cases, this is intentional and is because (like the Python library) we check your local server first for the model files before downloading them from the Hugging Face Hub. If your server correctly returns a 404 status when the model file is not found, it will fall back to the Hub. If you want to avoid this local model check, you can add the following to the top of your code:

    import { env } from '@xenova/transformers';
    env.allowLocalModels = false;

You can then refresh your cache and try again.
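
As for webm: you shouldn't need to convert to .wav the way you did in Python. One option (just a sketch, assuming the browser can decode webm via the Web Audio API, which Chrome and Firefox can) is to decode the recorded Blob into the 16 kHz Float32Array that Whisper expects and pass that to the pipeline, which accepts raw audio data. Using the audioBlob and transcriber from your snippet:

    // Decode a recorded Blob (e.g. audio/webm from MediaRecorder) into a
    // 16 kHz mono Float32Array.
    async function blobToAudio(blob) {
        const arrayBuffer = await blob.arrayBuffer();
        // Whisper models expect 16 kHz audio; decodeAudioData resamples the
        // decoded audio to the AudioContext's sample rate.
        const audioContext = new AudioContext({ sampleRate: 16000 });
        const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
        return audioBuffer.getChannelData(0); // first (mono) channel
    }

    const audioData = await blobToAudio(audioBlob);
    const result = await transcriber(audioData, { language: 'chinese', task: 'transcribe', chunk_length_s: 30, stride_length_s: 5 });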

@xenova Thanks for the tips. They helped with loading the models and working through that problem.