poloniki/quint

Splitting into sentences

Closed this issue · 2 comments

This is not an issue with the package itself: as I understand it, the text coming from Google's Speech-to-Text already has periods, question marks, etc. to mark sentence boundaries. I work with auto-generated YouTube transcripts, which have no sentence boundaries at all. Do you happen to know of a quick way to find good sentence boundaries?

Good day! You have probably already found a solution, but in general I would prefer the Whisper model, which transcribes sentences with punctuation and also produces a better-quality transcript. If you want to use Google, then you will have to run a punctuation model on top of its output, and the open-source ones don't seem to do a good job.

@wherewasit I updated the repository so that you can use Whisper transcription and the pysbd sentence splitter, which is much better than the previous version.
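
For anyone landing here, this is a rough sketch of the splitting step on already-punctuated text, not the code in this repository. It assumes `pysbd` is installed (`pip install pysbd`). The helper name `split_sentences` and the naive regex fallback are my own additions for illustration.

```python
import re

try:
    import pysbd  # third-party rule-based sentence boundary detector
except ImportError:  # assumption: fall back to a naive splitter if pysbd is missing
    pysbd = None


def split_sentences(text: str) -> list[str]:
    """Split punctuated text into sentences (hypothetical helper)."""
    if not text.strip():
        return []
    if pysbd is not None:
        # clean=False keeps the original text instead of normalizing it
        segmenter = pysbd.Segmenter(language="en", clean=False)
        parts = segmenter.segment(text)
    else:
        # Naive fallback: split on whitespace that follows ., ! or ?
        # This breaks on abbreviations such as "Dr." that pysbd handles.
        parts = re.split(r"(?<=[.!?])\s+", text)
    # Strip whitespace and drop empty pieces
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    transcript = "Whisper adds punctuation. Does pysbd split it? Yes it does."
    for sentence in split_sentences(transcript):
        print(sentence)
```

With the `openai-whisper` package, the input text would typically come from `whisper.load_model("base").transcribe(path)["text"]`. This only helps when the transcript already contains punctuation. Unpunctuated YouTube captions would still need re-transcription or a punctuation model first.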