Add support for Faroese
sigmundv opened this issue · 2 comments
Hello, I was wondering how I would go about adding support for more languages. I can see that the key is to have training data, but how do I generate the required JSON file? Thank you in advance for making this package!
Hi! Thanks so much for opening this issue, much appreciated.
I didn't perform any of the training for this library myself; I leveraged the pre-trained models that already exist inside NLTK: https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip
If you wanted to add support for Faroese, you would want to figure out how to use the PunktTrainer to generate the model, convert it to JSON, and then we could add support for it inside this library.
The PunktTrainer can be found here: https://github.com/nltk/nltk/blob/e2d368e00ef806121aaa39f6e5f90d9f8243631b/nltk/tokenize/punkt.py#L636
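As a rough starting point, the train-then-convert step might look something like the sketch below. It only uses NLTK's PunktTrainer and the standard json module. The JSON layout (the four key names) and the helper names train_punkt_params and params_to_json are assumptions for illustration, not the format this library actually expects, so the keys would need to be matched against the existing language files before anything is added.

```python
import json


def train_punkt_params(text):
    """Train Punkt on raw text and return the learned PunktParameters."""
    # Imported here so the JSON helper below can be used without NLTK installed.
    from nltk.tokenize.punkt import PunktTrainer

    trainer = PunktTrainer()
    trainer.train(text)          # unsupervised: plain text only, no annotations
    return trainer.get_params()  # finalizes training if that has not happened yet


def params_to_json(params):
    """Serialize the learned parameters to a JSON string.

    ASSUMPTION: the key names mirror the attribute names on NLTK's
    PunktParameters; the layout this library needs may differ.
    """
    data = {
        # Sets are not JSON-serializable, so convert them to sorted lists.
        "abbrev_types": sorted(params.abbrev_types),
        "collocations": sorted(list(pair) for pair in params.collocations),
        "sent_starters": sorted(params.sent_starters),
        # Maps each word type to an integer bit-flag of orthographic contexts.
        "ortho_context": dict(sorted(params.ortho_context.items())),
    }
    return json.dumps(data, ensure_ascii=False, indent=2)
```

Usage would be along the lines of reading a large plain-text Faroese corpus (for example a hypothetical faroese_corpus.txt), passing its contents to train_punkt_params, and writing the result of params_to_json to a file. Punkt generally needs a sizeable amount of text to learn useful abbreviations and sentence starters.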
I hope that helps!
That's perfect, I'll look into the PunktTrainer in NLTK.