This project implements a complete pipeline that extracts audio and transcripts from a database of Obama speeches and uses them to evaluate a punctuation restoration (PR) system. The PR system is taken from the following open-source project: Alam, Tanvirul & Khan, Akib & Alam, Firoj. (2020). Punctuation Restoration using Transformer Models for High-and Low-Resource Languages. 132-142. 10.18653/v1/2020.wnut-1.18. Our main goal was to compare two kinds of input to the PR system: normal text that has simply been stripped of its punctuation, versus the output of an automatic speech recognition (ASR) system.
Please note that the output filenames are variable and hence may not match when using the project. They are kept variable so that the pipeline can be reused for any personal project, and they can be changed in the code as necessary.
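As a rough illustration of the first kind of input mentioned above (normal text stripped of punctuation), here is a minimal sketch. The helper name `strip_punctuation` is hypothetical and is not taken from this project's code. Lowercasing and keeping apostrophes are assumptions made for the example; the actual preprocessing in the pipeline may differ.

```python
import string

# Hypothetical helper for illustration only; not part of this project's code.
# Assumption: apostrophes are kept so contractions such as "let's" survive.
_PUNCT_TO_REMOVE = string.punctuation.replace("'", "")
_TABLE = str.maketrans("", "", _PUNCT_TO_REMOVE)


def strip_punctuation(text: str) -> str:
    """Return text without punctuation, lowercased, with single spaces."""
    without_punct = text.translate(_TABLE)
    # Assumption: the PR input is lowercased; split/join collapses whitespace.
    return " ".join(without_punct.lower().split())


if __name__ == "__main__":
    print(strip_punctuation("Thank you, everybody. Let's begin!"))
    # prints: thank you everybody let's begin
```

Text prepared this way keeps the original words intact, whereas ASR output may also contain recognition errors, which is the difference the comparison is meant to expose.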