guardrails-ai/guardrails

[feat] Distribute NLTK tokenizers used in the core package


Description
Since we now require nltk and the punkt tokenizer during the validation loop for chunking during streaming, we should either download and distribute the punkt tokenizer with the library or find a way to include it during the install phase. From what I can see, the only way to perform a post-install flow is to switch back to setuptools instead of Poetry, but even that may not work for all distribution methods.
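For context, the runtime behavior in question looks roughly like the sketch below. This is not the actual guardrails code, just the general shape of a lazy download-on-demand guard around punkt:

```python
import nltk


def ensure_punkt():
    try:
        # Raises LookupError if punkt is not already on the NLTK data path.
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        # Network call at runtime -- the behavior this issue wants to eliminate.
        nltk.download("punkt")


ensure_punkt()
sentences = nltk.sent_tokenize("First chunk. Second chunk.")
```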

Why is this needed
Currently we download the tokenizer at runtime if it doesn't exist, which can cause issues in certain environments such as Kubernetes. See #821

Implementation details
The simplest path would be to download the tokenizer during our deploy script and include it in the distributable.

The downside of this approach is that the tokenizer adds roughly 38 MB to the package.
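A rough sketch of the bundling option. The download directory and package layout here are assumptions, not guardrails' actual layout; Poetry would also need to be told to include the data directory in the wheel/sdist:

```python
import nltk

# Build/deploy step: vendor punkt into a data directory that ships inside the
# package (included via Poetry's `include` / package-data configuration).
nltk.download("punkt", download_dir="guardrails/_nltk_data")

# At import time the library would then register the bundled directory so
# nltk.sent_tokenize finds punkt without any network access, e.g.:
#   nltk.data.path.insert(0, os.path.join(os.path.dirname(__file__), "_nltk_data"))
```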

An alternative, as previously mentioned, is to abandon Poetry and switch back to setuptools. This should allow us to implement post-install functionality in setup.py, though we would need to verify this works for all the various ways the library can be installed.
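For reference, the setuptools route would look something like the sketch below (a custom install command). Note that installs from pre-built wheels skip setup.py entirely, which is exactly the "may not work for all distribution methods" caveat above:

```python
from setuptools import setup
from setuptools.command.install import install


class PostInstallCommand(install):
    """Download the punkt tokenizer right after the normal install step."""

    def run(self):
        install.run(self)
        import nltk
        nltk.download("punkt")


setup(
    name="guardrails-ai",
    cmdclass={"install": PostInstallCommand},
)
```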

Another alternative is to find a smaller, installable tokenizer to perform chunking.
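Purely as an illustration of that last option, a dependency-free splitter could be as small as the regex sketch below. It is far less robust than punkt (abbreviations, decimals, etc. will trip it up), so this is a trade-off sketch rather than a drop-in replacement:

```python
import re

# Split on sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def naive_sentence_split(text: str) -> list[str]:
    return [s for s in _SENTENCE_END.split(text) if s]


print(naive_sentence_split("Hello there. How are you? Fine."))
# ['Hello there.', 'How are you?', 'Fine.']
```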

End result
No nltk downloads are performed at runtime.