Whisper Word Timestamps

This repository contains a Python script that uses OpenAI's Whisper automatic speech recognition (ASR) system to transcribe audio files with word-level timestamps. You can choose the Whisper model, specify the language of the audio, and set the maximum duration of the audio file. The project is largely based on the Hugging Face Space by Matthijs, modified to output a CSV file of words and their timestamps.

Installation
Usage
Configuration
Contributing
License

Installation

To use this script, you need to have Python installed on your system. The script also depends on several Python libraries, which are listed in the requirements.txt file.

Here are the steps to install the necessary dependencies:

Clone the repository:

git clone https://github.com/sbene97/whisper-word-ts-csv.git

Navigate to the cloned repository:

cd whisper-word-ts-csv

Create a new Python virtual environment:

python -m venv env

Activate the virtual environment:

On Windows:

.\env\Scripts\activate

On Unix or macOS:

source env/bin/activate

Install the dependencies from the requirements.txt file:

pip install -r requirements.txt

Usage

You can run the script from the command line using the following syntax:

python main.py --audio <path/to/audio> --language <language> --length <length> --model <model>

Here's what each argument does:

--audio: Specifies the path to the audio file you want to transcribe.
--language: Specifies the language of the audio. The default is English.
--length: Specifies the maximum duration of the audio to transcribe, in seconds.
--model: Specifies the Whisper model to use. Options include 'tiny', 'base', 'small', 'medium', and 'large'. The default is 'small'.
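
For orientation, the arguments above could be parsed with the standard-library argparse module roughly as follows. This is a minimal sketch, not the actual code in main.py: the build_parser name, making --audio required, and the 600-second default for --length (borrowed from the max_duration setting) are assumptions for illustration.

```python
import argparse

# Model names as listed in this README.
MODEL_CHOICES = ["tiny", "base", "small", "medium", "large"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Transcribe audio with word-level timestamps."
    )
    # Assumption: the audio path is mandatory.
    parser.add_argument("--audio", required=True,
                        help="Path to the audio file to transcribe.")
    parser.add_argument("--language", default="english",
                        help="Language of the audio.")
    # Assumption: --length maps to max_duration and shares its default of 600.
    parser.add_argument("--length", type=int, default=600,
                        help="Maximum audio duration in seconds.")
    parser.add_argument("--model", choices=MODEL_CHOICES, default="small",
                        help="Whisper model to use.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--audio", "sample.wav", "--model", "base"])
print(args.audio, args.language, args.length, args.model)
```

Restricting --model with choices makes argparse reject an unknown model name before any model is loaded.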

Configuration

You can configure the script by modifying the following variables at the top of the script (or by passing arguments, see above):

model: The Whisper model to use. Options include 'tiny', 'base', 'small', 'medium', and 'large'. The default is 'small'.
max_duration: The maximum duration of the audio file in seconds. The default is 600.
rows_out: The number of rows to print in the output DataFrame for quick inspection. The default is 30.
language: The language of the audio file. The default is English.
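
To make the roles of max_duration and rows_out concrete, here is a rough sketch of how such settings might be applied. It is an illustration under stated assumptions, not the script's implementation: the helper names (clip_samples, words_to_rows, write_csv), the CSV column names, and the shape of the word chunks (a dict with "text" and a "timestamp" pair) are all hypothetical, and the sketch uses the standard-library csv module rather than a DataFrame.

```python
import csv
import io

MAX_DURATION = 600  # seconds; mirrors the max_duration default
ROWS_OUT = 30       # rows to preview; mirrors the rows_out default


def clip_samples(samples, sampling_rate, max_duration=MAX_DURATION):
    """Keep only the first max_duration seconds of a sample sequence."""
    return samples[: int(max_duration * sampling_rate)]


def words_to_rows(chunks):
    """Flatten word chunks into (word, start, end) tuples."""
    return [
        (chunk["text"].strip(), chunk["timestamp"][0], chunk["timestamp"][1])
        for chunk in chunks
    ]


def write_csv(rows, handle):
    """Write a header plus one row per word to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["word", "start", "end"])  # assumed column names
    writer.writerows(rows)


# Made-up chunks standing in for real transcription output.
chunks = [
    {"text": " Hello", "timestamp": (0.0, 0.42)},
    {"text": " world", "timestamp": (0.42, 0.9)},
]
rows = words_to_rows(chunks)

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_csv(rows, buffer)

# Quick inspection: print at most ROWS_OUT rows.
for row in rows[:ROWS_OUT]:
    print(row)
```

Clipping before transcription bounds the run time on long recordings, while the preview only limits what is printed, not what is written to the CSV.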

Contributing

Contributions are welcome! Please feel free to submit a pull request.

License

This project is licensed under the terms of the Apache 2.0 license.