Click the badge below to open this project in Google Colab:
First, install the required libraries (in Colab, prefix shell commands with `!`):

```python
!pip install transformers torchaudio
```
# Upload an Audio File in Colab
```python
from google.colab import files

# Upload the file
uploaded = files.upload()

# Display the uploaded file(s); file_path keeps the last uploaded name
for filename in uploaded.keys():
    print(f"Uploaded file: {filename}")

file_path = filename
```
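Optionally, you can sanity-check the uploaded file before transcribing. This is a minimal sketch; `file_path` comes from the cell above:

```python
import torchaudio

# Inspect the uploaded file's metadata
info = torchaudio.info(file_path)
print(f"Sample rate: {info.sample_rate} Hz")
print(f"Channels: {info.num_channels}")
print(f"Duration: {info.num_frames / info.sample_rate:.1f} s")
```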
# Transcribe with Whisper

Run the transcription cell below and wait for all chunks to be processed; the final output is printed after the last chunk is done. The `language` parameter tells Whisper what language the audio is in: with `task="transcribe"` the output stays in that language, while with `task="translate"` the output is always English.
Below is a list of languages supported by Whisper Large v2 with their corresponding `language` codes:
Language | Code | Language | Code |
---|---|---|---|
Afrikaans | af | Albanian | sq |
Amharic | am | Arabic | ar |
Armenian | hy | Assamese | as |
Azerbaijani | az | Bashkir | ba |
Basque | eu | Belarusian | be |
Bengali | bn | Bosnian | bs |
Bulgarian | bg | Burmese | my |
Catalan | ca | Chinese | zh |
Croatian | hr | Czech | cs |
Danish | da | Dutch | nl |
English | en | Estonian | et |
Finnish | fi | French | fr |
Galician | gl | Georgian | ka |
German | de | Greek | el |
Gujarati | gu | Hausa | ha |
Hebrew | he | Hindi | hi |
Hungarian | hu | Icelandic | is |
Indonesian | id | Italian | it |
Japanese | ja | Javanese | jw |
Kannada | kn | Kazakh | kk |
Khmer | km | Korean | ko |
Lao | lo | Latvian | lv |
Lithuanian | lt | Macedonian | mk |
Malagasy | mg | Malay | ms |
Malayalam | ml | Maltese | mt |
Maori | mi | Marathi | mr |
Mongolian | mn | Nepali | ne |
Norwegian | no | Pashto | ps |
Persian (Farsi) | fa | Polish | pl |
Portuguese | pt | Punjabi | pa |
Romanian | ro | Russian | ru |
Serbian | sr | Sinhala | si |
Slovak | sk | Slovenian | sl |
Somali | so | Spanish | es |
Sundanese | su | Swahili | sw |
Swedish | sv | Tagalog | tl |
Tamil | ta | Tatar | tt |
Telugu | te | Thai | th |
Turkish | tr | Ukrainian | uk |
Urdu | ur | Uzbek | uz |
Vietnamese | vi | Welsh | cy |
Yiddish | yi | Yoruba | yo |
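If you want to verify which codes your installed version of `transformers` accepts, you can print Whisper's internal language map. This is a minimal sketch; `LANGUAGES` is an internal of `transformers` and its import path may change between versions:

```python
from transformers.models.whisper.tokenization_whisper import LANGUAGES

# LANGUAGES maps Whisper language codes to names, e.g. "ta" -> "tamil"
for code, name in sorted(LANGUAGES.items()):
    print(f"{code}: {name}")
```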
You can use the corresponding code for each language in your Whisper configuration to set the desired transcription language.
# How to Modify the Language in Code
Below is the Python code snippet to change the language in `forced_decoder_ids`:
```python
# Example: Setting the decoder prompt for a specific language and task
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ta", task="translate")
```

- Locate the line in your code:

  ```python
  forced_decoder_ids = processor.get_decoder_prompt_ids(language="ta", task="translate")
  ```

- Replace the `"ta"` (Tamil) in the `language` parameter with the desired language code. For example, to set the language to Spanish (`es`):

  ```python
  forced_decoder_ids = processor.get_decoder_prompt_ids(language="es", task="translate")
  ```

- Run the cell to use the updated language.
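Note that `task` matters as much as `language`: `task="transcribe"` keeps the output in the audio's own language, while `task="translate"` always translates into English. A minimal sketch contrasting the two, assuming the `processor` defined in the cell below:

```python
# Tamil audio -> Tamil text (output stays in the source language)
transcribe_ids = processor.get_decoder_prompt_ids(language="ta", task="transcribe")

# Tamil audio -> English text (Whisper can only translate into English)
translate_ids = processor.get_decoder_prompt_ids(language="ta", task="translate")
```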
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# Load Whisper processor and large-v2 model
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Ensure the model is on the appropriate device
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Function to preprocess the audio file
def preprocess_audio(file_path, sampling_rate=16000):
    # Load the audio file
    audio, sr = torchaudio.load(file_path)
    # Resample to 16 kHz if necessary
    if sr != sampling_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sampling_rate)
        audio = resampler(audio)
    # Average the channels to mono if the audio is stereo
    return audio.mean(dim=0)

# Function to split long audio into chunks
def split_audio(audio, chunk_duration, sampling_rate):
    num_samples_per_chunk = chunk_duration * sampling_rate
    return [audio[i:i + num_samples_per_chunk] for i in range(0, len(audio), num_samples_per_chunk)]

# Preprocess the audio (file_path comes from the upload cell above)
audio_tensor = preprocess_audio(file_path)

# Split the audio into smaller chunks (30 seconds each)
chunk_duration = 30  # Duration of each chunk in seconds
chunks = split_audio(audio_tensor, chunk_duration, sampling_rate=16000)

# Transcribe each chunk and concatenate the results
transcriptions = []
for i, chunk in enumerate(chunks):
    input_features = processor(chunk.numpy(), sampling_rate=16000, return_tensors="pt").input_features.to(device)
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="ta", task="translate")  # Change language here
    generated_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids,
        max_new_tokens=444,  # Allow enough tokens for longer transcriptions
    )
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    transcriptions.append(transcription)
    print(f"Chunk {i + 1}/{len(chunks)} transcribed.")

# Combine all chunks into the final transcription
full_transcription = " ".join(transcriptions)
print("\nFull Transcription:\n", full_transcription)
```