neurosnap/sentences

Add support for Faroese

sigmundv opened this issue · 2 comments

Hello, I was wondering how I would go about adding support for more languages. I can see that the key is to have training data, but how do I generate the required JSON file? Thank you in advance for making this package!

Hi! Thanks so much for opening this issue, much appreciated.

I didn't perform any of the training for this library myself. I leveraged the pre-trained models that already exist in NLTK: https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip

If you want to add support for Faroese, you would need to train a model on Faroese text with NLTK's PunktTrainer and convert the result to JSON. Then we could add support for it inside this library.

The PunktTrainer can be found here: https://github.com/nltk/nltk/blob/e2d368e00ef806121aaa39f6e5f90d9f8243631b/nltk/tokenize/punkt.py#L636
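
As a rough starting point, the train-then-export step might look something like the sketch below. It assumes NLTK is installed and that you have a reasonably large plain-text Faroese corpus. The helper names (`train_punkt`, `params_to_json`) and the JSON key names are my own placeholders, not the format this library expects, so the keys would need to be matched against the existing JSON files before anything is loaded here.

```python
import json


def train_punkt(text):
    """Train Punkt parameters on raw text. Requires NLTK to be installed."""
    # Imported lazily so the JSON helper below works without NLTK.
    from nltk.tokenize.punkt import PunktTrainer

    trainer = PunktTrainer()
    trainer.train(text)  # finalizes training by default
    return trainer.get_params()


def params_to_json(params):
    """Serialize the learned Punkt parameters to a JSON string.

    The key names here are assumptions for illustration only; adjust
    them to whatever layout the target library actually reads.
    """
    data = {
        # Abbreviations learned from the corpus (stored without the final period).
        "abbrev_types": sorted(params.abbrev_types),
        # Word pairs that tend to span a period without ending a sentence.
        "collocations": sorted(list(pair) for pair in params.collocations),
        # Words that frequently start a sentence.
        "sent_starters": sorted(params.sent_starters),
        # Per-word orthographic context flags (integers).
        "ortho_context": dict(sorted(params.ortho_context.items())),
    }
    return json.dumps(data, ensure_ascii=False, indent=2)
```

Usage would be along the lines of reading the corpus with `open("faroese.txt", encoding="utf-8").read()` (a hypothetical file name), passing it to `train_punkt`, and writing the string returned by `params_to_json` to a `.json` file.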

I hope that helps!

That's perfect, I'll look into the PunktTrainer in NLTK.