Through this guide, you will learn how to do automatic speech recognition in your language, fix the grammar of the transcribed speech, restore its punctuation, detect biomedical or clinical entities in that text, summarise it, and finally put everything together.
- Intro to Automatic Speech Recognition on 🤗
- Robust Speech Challenge Results on 🤗
- Mozilla Common Voice 9.0
- Thunder-speech, A Hackable speech recognition library
- SpeechBrain - PyTorch powered speech toolkit
- Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
- SPEECH RECOGNITION WITH WAV2VEC2
- How to add timestamps to ASR output
- SageMaker Studio Lab account. See this explainer video to learn more.
- Python=3.9
- PyTorch>=1.10
- Hugging Face Transformers
- Several audio processing libraries (see environment.yml)
There are 3 main notebooks to follow; you can start with 0_speech_recognition.ipynb. Click on Copy to project in the top right corner. This will open the Studio Lab web interface and ask you whether you want to clone the entire repo or just the notebook. Clone the entire repo and click Yes when asked about building the Conda environment automatically. You will now be running on top of a Python environment with all libraries already installed.
Open 0_speech_recognition.ipynb and run all steps. For more information, please refer back to this other repo from machinelearnear. You will download the audio from a YouTube video by providing its VideoID and then generate a transcript that is saved locally to /transcripts.
Open 1_grammar_punctuation_correction.ipynb and load your transcribed speech. What we want to do now is first fix the grammar errors and then, based on that corrected text, restore the punctuation. This ordering is arbitrary; try it the other way around to see what brings better results.
I have tested a number of libraries to do spellchecking and ended up with autocorrect and pyspellchecker. Both of them allow you to add custom vocabularies to the spell checker (see this for example), so this is where you could plug in your own list of relevant words from your domain, e.g. radiology, pathology, etc. (see the sketch after the snippets below). You would run them as follows:
from spellchecker import SpellChecker
spell_py = SpellChecker(language='es', distance=2)  # Spanish dictionary
# pyspellchecker corrects one word at a time, so tokenise, correct each word, and re-join
processed_text = ' '.join(spell_py.correction(word) or word for word in input_text.split())
from autocorrect import Speller
spell_autocorrect = Speller(lang='es', only_replacements=True)  # Spanish speller, restricted to replacements
processed_text = spell_autocorrect(input_text)  # autocorrect works directly on the full string
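As mentioned above, you can extend the spell checker with your own domain vocabulary. Here is a minimal sketch for pyspellchecker; the example words and the file name are placeholders for your own terminology list.

```python
from spellchecker import SpellChecker

spell_py = SpellChecker(language='es', distance=2)
# add domain terms (e.g. radiology, pathology) so they are not "corrected" away;
# these words are placeholders, load your own list instead
spell_py.word_frequency.load_words(['radiografía', 'adenopatía', 'neumotórax'])
# alternatively, load a plain-text file with one term per line:
# spell_py.word_frequency.load_text_file('my_domain_terms.txt')
```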
Once we have our corrected text, we apply a model to restore punctuation. There are a number of options, and you can see many links at the bottom of the notebook, but I short-listed it to 2: deepmultilingualpunctuation and Silero. Both of them can be fine-tuned to a specific language. The first library is the one that performs the best, even though it was not specifically trained on Spanish; I'm using a multilingual model.
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model='oliverguhr/fullstop-punctuation-multilingual-base')
result = model.restore_punctuation(processed_text)  # punctuate the spell-corrected text from the previous step
To detect medical entities, we are going to be using Stanza, "a collection of accurate and efficient tools for the linguistic analysis of many human languages. Starting from raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing". There are medical NLP models available in Hugging Face through the Spanish Government's National NLP Plan, but they are not yet fine-tuned to detect clinical entities such as disease, treatment, etc.
import stanza
# download and initialize a mimic pipeline with an i2b2 NER model
stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})
# annotate clinical text
doc = nlp(input_text)
# print out all entities
for ent in doc.entities:
    print(f'{ent.text}\t{ent.type}')
Summarisation example
from transformers import pipeline

# pick a summarisation checkpoint; the commented lines are other options worth trying
# model_name = "google/pegasus-large"
model_name = "google/pegasus-xsum"
# model_name = "csebuetnlp/mT5_multilingual_XLSum"
# model_name = "sshleifer/distilbart-cnn-12-6"
# model_name = "ELiRF/NASES"

pipe = pipeline(model=model_name)
summary = pipe(input_text, truncation=True)
print(summary[0]['summary_text'])
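To put everything together, the steps above can be chained into a single helper. This is a minimal sketch under the same assumptions as the snippets above (Spanish spell checking, the multilingual punctuation model, Stanza's English i2b2 clinical NER, and the pegasus-xsum summariser); the function and variable names are illustrative, not part of the notebooks.

```python
from spellchecker import SpellChecker
from deepmultilingualpunctuation import PunctuationModel
from transformers import pipeline
import stanza

def process_transcript(raw_text):
    # 1. spell correction (Spanish dictionary, as above)
    spell = SpellChecker(language='es', distance=2)
    corrected = ' '.join(spell.correction(w) or w for w in raw_text.split())

    # 2. punctuation restoration
    punct = PunctuationModel(model='oliverguhr/fullstop-punctuation-multilingual-base')
    punctuated = punct.restore_punctuation(corrected)

    # 3. clinical entity detection
    # assumes stanza.download('en', package='mimic', processors={'ner': 'i2b2'}) has been run
    nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})
    entities = [(ent.text, ent.type) for ent in nlp(punctuated).entities]

    # 4. summarisation
    summariser = pipeline(model="google/pegasus-xsum")
    summary = summariser(punctuated, truncation=True)[0]['summary_text']

    return punctuated, entities, summary
```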
- Pyctcdecode & Speech2text decoding
- XLS-R: Large-Scale Cross-lingual Speech Representation Learning on 128 Languages
- Unlocking global speech with Mozilla Common Voice
- Reconocimiento automático de voz con Python y HuggingFace en segundos (+ Repo)
- “SomosNLP”, red internacional de estudiantes, profesionales e investigadores acelerando el avance del NLP en español
- How to Write a Spelling Corrector
- Build Spell Checking Models For Any Language In Python
- Grammatical Error Correction
- FullStop: Multilingual Deep Models for Punctuation Prediction
- BioMedIA: Abstractive Question Answering for the BioMedical Domain in Spanish
- PlanTL-GOB-ES/bsc-bio-ehr-es-pharmaconer
- Host Hugging Face transformer models using Amazon SageMaker Serverless Inference
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations. 2020. [pdf][bib]
Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. Langlotz. Biomedical and Clinical English Model Packages in the Stanza Python NLP Library, Journal of the American Medical Informatics Association. 2021.
- The content provided in this repository is for demonstration purposes and not meant for production. You should use your own discretion when using the content.
- The ideas and opinions outlined in these examples are my own and do not represent the opinions of AWS.