The Google Cloud Speech-to-Text API provides speaker diarization alongside its transcriptions. I've put together this easy-to-follow notebook that sends audio files through GCP's `long_running_recognize` operation to perform asynchronous speech recognition (among other things) and returns structured .csv files with the API's output.
To get started, there are a few things you need to do:
- Clone this repo
- Install the Google Cloud Speech-to-Text client library for Python:
  pip install --upgrade google-cloud-speech
- Create a new project on GCP
- Create a billing account. You need to enter payment info to use the services
- Once you have a billing account, you can enable the API you want to use
- Create credentials for that API service.
- You'll get a JSON file containing your credentials
- Export the path to that credentials file in your .bashrc with:
  export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"
There are two ways to use this: follow along in the Jupyter notebook, or simply run the following on your command line:
python google-diarization.py -i <input .wav file>
This will output a .csv file to the data/outputs/ directory.
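For reference, turning the API's word-level diarization output into .csv rows can be sketched with plain standard-library code. The input below is a list of dicts that mimics the `words` list on the final result of a diarization response; the field names and the CSV header layout are illustrative assumptions, not necessarily what `google-diarization.py` emits:

```python
import csv
import io
from itertools import groupby

def words_to_csv(words):
    """Group word-level diarization output into speaker turns and
    render them as CSV text. Each entry in `words` carries a speaker
    tag, a start time in seconds, and the word itself (field names
    here are hypothetical stand-ins for the API's word_info fields)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["speaker", "start_time", "utterance"])
    # Consecutive words with the same speaker tag form one turn.
    for speaker, turn in groupby(words, key=lambda w: w["speaker_tag"]):
        turn = list(turn)
        writer.writerow([
            speaker,
            turn[0]["start_time"],
            " ".join(w["word"] for w in turn),
        ])
    return buf.getvalue()
```

Grouping on consecutive speaker tags keeps the output readable as a dialog rather than one row per word.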
I created this pipeline as a tool to supplement Spectrum with the ability to identify multi-speaker dialog in a lecture. Those inside the Penn State network who want to learn more about the tool can visit https://dstoolbox.tlt.psu.edu/