ottokart/punctuator

How to implement a new model in another language

jhoelzl opened this issue · 2 comments

Hi,

I really appreciate your work. I want to test the script on unpunctuated English or German text, so I guess I have to train a new model for that language.

Can you briefly describe the tasks to implement a new model in a specific language?

Regards,
Josef

Hi!

Thanks for the appreciation!
To train a new, entirely text-based model, you need to obtain and prepare a text corpus in the target language. The process is much like preparing a language modeling dataset. An example of the end result can be seen here.
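As a rough illustration of that preparation step, here is a minimal Python sketch that lowercases a punctuated sentence and replaces each punctuation mark with a separate placeholder token. The token names (`,COMMA`, `.PERIOD`, and so on), the `PUNCT_TOKENS` mapping, and the `prepare_line` helper are assumptions made for this example, so check them against the example data before training.

```python
import re

# Hypothetical mapping from punctuation marks to placeholder tokens.
# These names are illustrative assumptions; use the vocabulary the
# training scripts actually expect.
PUNCT_TOKENS = {
    ",": ",COMMA",
    ".": ".PERIOD",
    "?": "?QUESTIONMARK",
    "!": "!EXCLAMATIONMARK",
}
PUNCT_VALUES = set(PUNCT_TOKENS.values())

# A token is either a run of word characters/apostrophes or a single symbol.
TOKEN_RE = re.compile(r"[\w']+|[^\w\s]")


def prepare_line(line):
    """Lowercase words and turn punctuation marks into placeholder tokens."""
    out = []
    for tok in TOKEN_RE.findall(line):
        if tok in PUNCT_TOKENS:
            # Skip punctuation at the start of a line or directly after
            # another punctuation token.
            if out and out[-1] not in PUNCT_VALUES:
                out.append(PUNCT_TOKENS[tok])
        elif re.match(r"[\w']", tok):
            out.append(tok.lower())
        # Any other symbol (quotes, brackets, dashes) is dropped.
    return " ".join(out)


if __name__ == "__main__":
    print(prepare_line("Hello, how are you? I am fine."))
```

Running this over each line of a raw corpus gives one space-separated token stream per line. Whether to lowercase (German capitalizes nouns, for instance) and which punctuation marks to keep are choices that depend on the target language.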

If you want to include the pause duration feature as well, things get more complicated unless a suitably annotated dataset is readily available.

By the way, there's now an improved version available.
It also has an English demo that you can try.

Thank you for your support, I will definitely try the improved version. The English demo looks very promising!