A port of OpenAI's Whisper Speech Transcription model to CoreML
The goal of this project is to natively port and optimize Whisper for Apple Silicon, including optimization for the Apple Neural Engine, and to match the feature set of the incredible WhisperCPP project.
Please note this repo is currently under development, so there will be bumps in the road.
Community input is welcome!
You can:
Create a Whisper instance: `whisper = try Whisper()`
And run transcription on a QuickTime-compatible asset via `await whisper.transcribe(assetURL:URL, options:WhisperOptions)`.
You can choose options via the `WhisperOptions` struct.
Whisper CoreML will load the asset using AVFoundation and convert the audio to the appropriate format for transcription.
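A minimal usage sketch of the file-based path is below, assuming the `Whisper`, `WhisperOptions`, and `transcribe(assetURL:options:)` API described above; the default `WhisperOptions()` initializer, the file path, and the printed result are illustrative placeholders.

```swift
import Foundation

// Minimal sketch, assuming the API above; run from an async context.
let whisper = try Whisper()

// Assumption: WhisperOptions has a default initializer and is configured
// before transcription (language, task, etc.).
let options = WhisperOptions()

// Any QuickTime-compatible asset works; Whisper CoreML handles the
// AVFoundation loading and audio conversion internally.
let assetURL = URL(fileURLWithPath: "/path/to/interview.m4a")

let transcript = await whisper.transcribe(assetURL: assetURL, options: options)
print(transcript)
```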
Alternatively, for realtime usage, you can start a Whisper session via `startWhisperSession(options:WhisperOptions)` and then send sample buffers to `accrueSamplesFromSampleBuffer(sampleBuffer:CMSampleBuffer)` from, say, an AVCaptureSession, an AVAudioSession, or any other source.
Note that, for now, we accrue 30 seconds of audio, since that is the window length the model expects.
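A rough sketch of the realtime path, feeding microphone buffers from an AVCaptureSession, is shown below. It assumes the `startWhisperSession(options:)` and `accrueSamplesFromSampleBuffer(sampleBuffer:)` methods described above; the capture-session plumbing, queue label, and the absence of async/throws annotations on the Whisper calls are assumptions, not the definitive API.

```swift
import AVFoundation

// Sketch only: feeds captured audio buffers into an existing Whisper instance.
final class LiveTranscriber: NSObject, AVCaptureAudioDataOutputSampleBufferDelegate {
    let whisper: Whisper
    let captureSession = AVCaptureSession()
    let audioOutput = AVCaptureAudioDataOutput()

    init(whisper: Whisper) {
        self.whisper = whisper
        super.init()
    }

    func start(options: WhisperOptions) throws {
        // Tell Whisper CoreML to begin accruing samples for its 30 second window.
        whisper.startWhisperSession(options: options)

        // Wire a microphone input and an audio data output into the capture session.
        guard let mic = AVCaptureDevice.default(for: .audio) else { return }
        let input = try AVCaptureDeviceInput(device: mic)
        if captureSession.canAddInput(input) { captureSession.addInput(input) }

        audioOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "whisper.audio.buffers"))
        if captureSession.canAddOutput(audioOutput) { captureSession.addOutput(audioOutput) }

        captureSession.startRunning()
    }

    // Forward every captured buffer; Whisper CoreML accrues them until it has
    // the 30 seconds of audio the model expects.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        whisper.accrueSamplesFromSampleBuffer(sampleBuffer: sampleBuffer)
    }
}
```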
- Working multilingual transcription
- Optimize the CoreML models for the ANE using Apple's ANE Transformers sample code found at this repository
- Port the log-Mel spectrogram to native vDSP and drop the RosaKit package dependency.
- Decode special tokens for timestamps.
- Decide on API design
- Base model gets roughly 4x realtime using a single core on an M1 MacBook Pro.
For ease of use, you can use this Google Colab to convert models. Note that if you convert Medium or larger models you may run into memory issues on Google Colab.
This repository assumes you're converting multilingual models. If you need English-only ('en') models, you'll need to offset the special token values by -1.
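As an illustration of that offset, here is a hedged sketch; the concrete token ID below is an assumption drawn from Whisper's published tokenizers and should be verified against the model you converted.

```swift
// Sketch only: the multilingual ID is an assumption; check your converted
// model's tokenizer before relying on it.
enum SpecialTokens {
    /// <|startoftranscript|> in the multilingual vocabulary (assumed 50258).
    static let multilingualStartOfTranscript = 50258

    /// English-only ('en') vocabularies drop one token, so every special
    /// token ID shifts down by 1 relative to the multilingual vocabulary.
    static func englishOnly(_ multilingualID: Int) -> Int {
        multilingualID - 1
    }
}

// e.g. SpecialTokens.englishOnly(SpecialTokens.multilingualStartOfTranscript) == 50257
```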