Through this guide, you will learn how to do automatic speech recognition in your language, fix the grammar of the transcribed speech, restore its punctuation, detect biomedical or clinical entities in that text, summarise it, and finally put everything together.
- Intro to Automatic Speech Recognition on 🤗
- Robust Speech Challenge Results on 🤗
- Mozilla Common Voice 9.0
- Thunder-speech, A Hackable speech recognition library
- SpeechBrain - PyTorch powered speech toolkit
- Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
- SPEECH RECOGNITION WITH WAV2VEC2
- How to add timestamps to ASR output
- SageMaker Studio Lab account. See this explainer video to learn more.
- Python=3.9
- PyTorch>=1.10
- Hugging Face Transformers
- Several audio processing libraries (see environment.yml)
There are 3 main notebooks to follow; you can start with 0_speech_recognition.ipynb. Click on Copy to project in the top right corner. This will open the Studio Lab web interface and ask you whether you want to clone the entire repo or just the notebook. Clone the entire repo and click Yes when asked about building the Conda environment automatically. You will now be running on top of a Python environment with all libraries already installed.
Open 0_speech_recognition.ipynb and run all steps. For more information, please refer back to this other repo from machinelearnear. You will download the audio from a YouTube video by providing its VideoID and then generate a transcript that is saved locally to /transcripts.
Open 1_grammar_punctuation_correction.ipynb and load your transcribed speech. What we want to do now is first fix the grammar errors and then, based on that corrected text, restore the punctuation. This ordering is arbitrary; try it the other way around to see what brings better results.
I have tested a number of libraries to do spellchecking and ended up with autocorrect and pyspellchecker. Both of them allow you to add custom vocabularies to the spell checker (see this for example), so this is where you could plug in your own list of relevant words from your domain, e.g. radiology, pathology, etc. (see the sketch after the snippets below). You would run them as follows:
from spellchecker import SpellChecker
spell_py = SpellChecker(language='es', distance=2)  # Spanish dictionary
# pyspellchecker corrects one word at a time, so tokenise, correct each word, and re-join
processed_text = ' '.join(spell_py.correction(word) or word for word in input_text.split())
from autocorrect import Speller
spell_autocorrect = Speller(lang='es', only_replacements=True)  # Spanish speller, restricted to replacements
processed_text = spell_autocorrect(input_text)  # autocorrect works directly on the full string
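As mentioned above, you can extend the spell checker with your own domain vocabulary. Here is a minimal sketch for pyspellchecker; the example words and the file name are placeholders for your own terminology list.

```python
from spellchecker import SpellChecker

spell_py = SpellChecker(language='es', distance=2)
# add domain terms (e.g. radiology, pathology) so they are not "corrected" away;
# these words are placeholders, load your own list instead
spell_py.word_frequency.load_words(['radiografía', 'adenopatía', 'neumotórax'])
# alternatively, load a plain-text file with one term per line:
# spell_py.word_frequency.load_text_file('my_domain_terms.txt')
```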
Once we have our corrected text, we apply a model to restore punctuation. There are a number of options, and you can see many links at the bottom of the notebook, but I short-listed it to 2: deepmultilingualpunctuation and Silero. Both of them can be fine-tuned to a specific language. The first library is the one that performs the best, even though it was not specifically trained on Spanish; I'm using a multilingual model.
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model='oliverguhr/fullstop-punctuation-multilingual-base')
result = model.restore_punctuation(processed_text)  # punctuate the spell-corrected text from the previous step
To detect medical entities, we are going to be using Stanza, "a collection of accurate and efficient tools for the linguistic analysis of many human languages. Starting from raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing". There are medical NLP models available in Hugging Face through the Spanish Government's National NLP Plan, but they are not yet fine-tuned to detect clinical entities such as disease, treatment, etc.
import stanza
# download and initialize a mimic pipeline with an i2b2 NER model
stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})
# annotate clinical text
doc = nlp(input_text)
# print out all entities
for ent in doc.entities:
    print(f'{ent.text}\t{ent.type}')
Summarisation example
from transformers import pipeline

# pick a summarisation checkpoint; the commented lines are other options worth trying
# model_name = "google/pegasus-large"
model_name = "google/pegasus-xsum"
# model_name = "csebuetnlp/mT5_multilingual_XLSum"
# model_name = "sshleifer/distilbart-cnn-12-6"
# model_name = "ELiRF/NASES"

pipe = pipeline(model=model_name)
summary = pipe(input_text, truncation=True)
print(summary[0]['summary_text'])
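To put everything together, the steps above can be chained into a single helper. This is a minimal sketch under the same assumptions as the snippets above (Spanish spell checking, the multilingual punctuation model, Stanza's English i2b2 clinical NER, and the pegasus-xsum summariser); the function and variable names are illustrative, not part of the notebooks.

```python
from spellchecker import SpellChecker
from deepmultilingualpunctuation import PunctuationModel
from transformers import pipeline
import stanza

def process_transcript(raw_text):
    # 1. spell correction (Spanish dictionary, as above)
    spell = SpellChecker(language='es', distance=2)
    corrected = ' '.join(spell.correction(w) or w for w in raw_text.split())

    # 2. punctuation restoration
    punct = PunctuationModel(model='oliverguhr/fullstop-punctuation-multilingual-base')
    punctuated = punct.restore_punctuation(corrected)

    # 3. clinical entity detection
    # assumes stanza.download('en', package='mimic', processors={'ner': 'i2b2'}) has been run
    nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})
    entities = [(ent.text, ent.type) for ent in nlp(punctuated).entities]

    # 4. summarisation
    summariser = pipeline(model="google/pegasus-xsum")
    summary = summariser(punctuated, truncation=True)[0]['summary_text']

    return punctuated, entities, summary
```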
- Pyctcdecode & Speech2text decoding
- XLS-R: Large-Scale Cross-lingual Speech Representation Learning on 128 Languages
- Unlocking global speech with Mozilla Common Voice
- Reconocimiento automático de voz con Python y HuggingFace en segundos (+ Repo)
- “SomosNLP”, red internacional de estudiantes, profesionales e investigadores acelerando el avance del NLP en español
- How to Write a Spelling Corrector
- Build Spell Checking Models For Any Language In Python
- Grammatical Error Correction
- FullStop: Multilingual Deep Models for Punctuation Prediction
- BioMedIA: Abstractive Question Answering for the BioMedical Domain in Spanish
- PlanTL-GOB-ES/bsc-bio-ehr-es-pharmaconer
- Host Hugging Face transformer models using Amazon SageMaker Serverless Inference
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations. 2020. [pdf][bib]
Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. Langlotz. Biomedical and Clinical English Model Packages in the Stanza Python NLP Library, Journal of the American Medical Informatics Association. 2021.
- The content provided in this repository is for demonstration purposes and not meant for production. You should use your own discretion when using the content.
- The ideas and opinions outlined in these examples are my own and do not represent the opinions of AWS.