Idea for tool: resegment into well-formed sentences
bricksdont opened this issue · 5 comments
Hi, this is a great library!
I added one more tool in my fork that does automatic sentence segmentation: bricksdont#1
It redistributes the text across subtitle segments so that each subtitle is exactly one well-formed (and complete) sentence. It's not perfect, since a machine learning model is involved.
Here is an example:
# Input
10:01:23,880 --> 10:01:27,640
Regelmässig nimmt er an Veranstaltungen
von FRAGILE Suisse teil,
23
10:01:27,720 --> 10:01:31,840
der Patientenorganisation
für Menschen mit Hirnverletzungen.
# Output
10:01:23,880 --> 10:01:31,840
Regelmässig nimmt er an Veranstaltungen
von FRAGILE Suisse teil, der Patientenorganisation
für Menschen mit Hirnverletzungen.
Would you be interested in a PR for this?
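For readers who want a feel for the merge step in the example above, here is a rough sketch. It is not the code from the fork: the fork uses a machine learning model to find sentence boundaries, while this sketch swaps in a naive punctuation check (a sentence ends at `.`, `!` or `?`). The `Cue` class and `merge_into_sentences` function are hypothetical names made up for illustration, and line breaks inside a subtitle are ignored.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Cue:
    """Hypothetical stand-in for one subtitle: start time, end time, text."""
    start: timedelta
    end: timedelta
    text: str


# Assumption: a sentence ends at one of these characters.
# The fork uses an ML model for this decision instead.
SENTENCE_END = (".", "!", "?")


def merge_into_sentences(cues):
    """Merge consecutive cues until the text ends in sentence-final punctuation."""
    merged = []
    pending = None
    for cue in cues:
        if pending is None:
            pending = Cue(cue.start, cue.end, cue.text)
        else:
            # Keep the earliest start, extend to the latest end, join the text.
            pending = Cue(pending.start, cue.end, pending.text + " " + cue.text)
        if pending.text.rstrip().endswith(SENTENCE_END):
            merged.append(pending)
            pending = None
    if pending is not None:
        # Trailing text without final punctuation is kept as-is.
        merged.append(pending)
    return merged


cues = [
    Cue(timedelta(hours=10, minutes=1, seconds=23, milliseconds=880),
        timedelta(hours=10, minutes=1, seconds=27, milliseconds=640),
        "Regelmässig nimmt er an Veranstaltungen von FRAGILE Suisse teil,"),
    Cue(timedelta(hours=10, minutes=1, seconds=27, milliseconds=720),
        timedelta(hours=10, minutes=1, seconds=31, milliseconds=840),
        "der Patientenorganisation für Menschen mit Hirnverletzungen."),
]
result = merge_into_sentences(cues)
```

This only merges cues. It does not split a cue that contains more than one sentence, and it will be fooled by abbreviations and similar cases, which is where a trained model helps.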
This looks wonderful and would make a great addition to the repository as part of srt_tools! My only point of note is that the machine learning part would need to be an optional dep assuming it's heavy, but that's it :)
What would be your preferred way of making this an optional dependency? Just letting the user run into an import error? (yes it's heavy :-))
I guess wrap the ImportError and provide some nice message, but yes. There's also the question of ongoing maintenance -- are you happy to help keep it up to date with new Python versions, for example?
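One possible shape for that wrapper, as a minimal sketch: the helper name `require_optional` and its parameters are hypothetical, not something that exists in srt_tools, and the install hint is whatever the tool author chooses to print.

```python
import importlib


def require_optional(module_name, feature, install_hint):
    """Import module_name, or raise an ImportError explaining how to get it.

    Hypothetical helper for illustration; not part of srt_tools.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a friendlier message, keeping the original as the cause.
        raise ImportError(
            f"{feature} needs the optional dependency '{module_name}'. "
            f"Install it with: {install_hint}"
        ) from exc
```

A tool would call this once at startup, for example `model_lib = require_optional("some_ml_package", "Resegmentation", "pip install some_ml_package")`, where the package name is a placeholder. Users who never run the tool never need the heavy dependency installed.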
I suppose this should probably go in srt_tools/contrib since it's not under the same maintenance conditions as the rest of the repository.
Oh, and can I also take a look at the current code before committing to anything? :-)
Yes, of course. Any feedback is welcome, and of course you are not obliged to merge this code.
Here is a Colab that shows basic usage: https://colab.research.google.com/drive/1OHBylPv-8s__IU9_lwTW5CLwHQvfB9Rt?usp=sharing