Idea for tool: resegment into well-formed sentences

Question

bricksdont opened this issue 2 years ago · 5 comments

Hi, this is a great library!

I added one more tool in my fork that does automatic sentence segmentation: bricksdont#1

It redistributes the subtitle text so that each subtitle contains exactly one well-formed, complete sentence. It's not perfect, since a machine learning model is involved.

Here is an example:

# Input

10:01:23,880 --> 10:01:27,640
Regelmässig nimmt er an Veranstaltungen
von FRAGILE Suisse teil,

23
10:01:27,720 --> 10:01:31,840
der Patientenorganisation
für Menschen mit Hirnverletzungen.

# Output

10:01:23,880 --> 10:01:31,840
Regelmässig nimmt er an Veranstaltungen
von FRAGILE Suisse teil, der Patientenorganisation
für Menschen mit Hirnverletzungen.
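
To illustrate the merging step only, here is a minimal sketch. It is not the code from the fork: it swaps the machine learning model for a naive rule (a sentence ends at `.`, `!` or `?`), it never splits a subtitle that contains several sentences, and it does not re-wrap lines. The `Subtitle` dataclass and `merge_into_sentences` are hypothetical names used for illustration, not part of the library.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Subtitle:
    # Hypothetical stand-in for a parsed SRT cue (not the library's class).
    start: timedelta
    end: timedelta
    content: str


# Assumption: a naive punctuation rule replaces the ML sentence segmenter.
SENTENCE_END = (".", "!", "?")


def merge_into_sentences(subtitles):
    """Merge consecutive cues until the text ends with sentence-final punctuation."""
    merged = []
    pending = None
    for sub in subtitles:
        text = " ".join(sub.content.split())  # collapse line breaks and extra spaces
        if pending is None:
            pending = Subtitle(sub.start, sub.end, text)
        else:
            # Extend the open sentence: keep its start time, take the new end time.
            pending.end = sub.end
            pending.content = pending.content + " " + text
        if pending.content.endswith(SENTENCE_END):
            merged.append(pending)
            pending = None
    if pending is not None:
        merged.append(pending)  # trailing text without final punctuation
    return merged


cues = [
    Subtitle(
        timedelta(hours=10, minutes=1, seconds=23, milliseconds=880),
        timedelta(hours=10, minutes=1, seconds=27, milliseconds=640),
        "Regelmässig nimmt er an Veranstaltungen\nvon FRAGILE Suisse teil,",
    ),
    Subtitle(
        timedelta(hours=10, minutes=1, seconds=27, milliseconds=720),
        timedelta(hours=10, minutes=1, seconds=31, milliseconds=840),
        "der Patientenorganisation\nfür Menschen mit Hirnverletzungen.",
    ),
]
result = merge_into_sentences(cues)
```

With the two cues from the example, `result` holds a single subtitle that starts at 10:01:23,880 and ends at 10:01:31,840.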

Would you be interested in a PR for this?

Answer 1 · 2023-02-28T11:26:18.000Z

This looks wonderful and would make a great addition to the repository as part of srt_tools! My only point of note is that the machine learning part would need to be an optional dep assuming it's heavy, but that's it :)

Answer 2 · 2023-02-28T11:28:38.000Z

What would be your preferred way of making this an optional dependency? Just letting the user run into an import error? (yes it's heavy :-))

Answer 3 · 2023-02-28T11:41:33.000Z

I guess wrap the ImportError and provide some nice message, but yes. There's also the question of ongoing maintenance -- are you happy to help keep it up to date with new Python versions, for example?

I suppose this should probably go in srt_tools/contrib since it's not under the same maintenance conditions as the rest of the repository.
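
Something along these lines is what I mean by wrapping the ImportError; treat it as a rough sketch, not a final design. The helper name `require_optional`, the module name and the install hint are all made up for illustration.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import an optional dependency, or fail with a readable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint instead of a bare "No module named ..." error.
        raise ImportError(
            f"This tool needs the optional dependency '{module_name}'. "
            f"Install it with: {install_hint}"
        ) from exc


# Hypothetical usage: the module name below is a placeholder, not a real package.
try:
    segmenter = require_optional(
        "hypothetical_heavy_segmenter", "pip install hypothetical_heavy_segmenter"
    )
except ImportError as err:
    message = str(err)
```

Chaining with `from exc` keeps the original traceback available for debugging while the user sees the friendlier message first.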

Answer 4 · 2023-02-28T11:41:58.000Z

Oh, and can I also take a look at the current code before committing to anything :-)

Answer 5 · 2023-02-28T12:39:59.000Z

> Oh, and can I also take a look at the current code before committing to anything :-)

Yes, of course. Any feedback is welcome, and of course you are not obliged to merge this code.

Here is a Colab that shows basic usage: https://colab.research.google.com/drive/1OHBylPv-8s__IU9_lwTW5CLwHQvfB9Rt?usp=sharing