This is the BBC Punctuator, developed by BBC R&D's Speech-to-Text team. BBC Punctuator is a machine learning model for restoring punctuation and capitalisation into plaintext, such as raw transcripts produced by STT systems.
The model uses HuggingFace's bert-base-uncased
model weights that have been fine-tuned for Punctuation restoration for one epoch over a new BBC built composite punctuator training dataset, composed of BBC News articles (written between 2020-2), BBC News transcripts, and subtitles from a range of BBC TV programmes [model=composite20v3-1e]. The parameters learned over this training process are stored in bbc-data (bbc/speech/transcription/bbc-punctuator:0.0.1), which when referenced from this code make a complete system for punctuating text.
The system utilises the CPU by default but can be instructed to run on a GPU.
List of text featured available for recovery:
- Capitalisation:
- Capitalisaton
- Acronyms (upper-casing)
- Mixed-casing
- Punctuation:
- Period: .
- Comma: ,
- Question Mark: ?
- Exclamation: !
- Colon: :
- Semi-colon: ;
- Apostrophe: '
- Hypen: -
- Numbers:
- Digitising natural language numbers
- Currencies: £, $, €, ¥
- Percentages: %
- Ordinals
- Decimals
- Dates (years/decades)
-
Pull the model files from bbc-data:
bbc-data pull bbc/rpunct:0.0.1
-
Install the BBC Punctuator system from Git:
git clone git@github.com:bbc/rpunct.git
cd rpunct
pip install -r requirements.txt
-
Move the model files to the reference location within bbc-punctuator:
mv $(bbc-data path bbc/rpunct) model-files
-
Run the BBC Punctuator by pointing the run file to the model files (from inside the cloned repo):
-
To punctuate a plaintext (.txt) file from the command line:
python run.py rpunct -m model-files -i input_file_path.txt -o output_file_path.txt
-
To use bbc-punctuator within a python script (from within the
rpunct
directory):from rpunct.rpunct_recoverer import RPunctRecoverer punctuator = RPunctRecoverer() text = "hello and welcome to the bbc punctuator \ this is an example piece of plaintext \ that exhibits no punctuation or capitalisation" punctuated_text = punctuator.process(text)
-
Metric | Prec | Recall | F1 |
---|---|---|---|
abbreviations | 0.9377 | 0.8605 | 0.8974 |
capitalisation | 0.9213 | 0.9434 | 0.9322 |
commas | 0.6977 | 0.7029 | 0.7003 |
hyphens | 0.7198 | 0.8953 | 0.7980 |
numbers | 0.7499 | 0.7747 | 0.7621 |
sentences | 0.8351 | 0.8762 | 0.8552 |
fullstops | 0.8110 | 0.8710 | 0.8399 |
questions | 0.7537 | 0.4770 | 0.5843 |
- Tom Potter tom.potter@bbc.co.uk
- Daulet Nurmanbetov daulet.nurmanbetov@gmail.com (developer of the original system rpunct upon which bbc-punctuator is based)