punkt doesn't work as of nltk 3.8.2
tomteecezint commented
punkt is loaded from a pickle file, which is not secure (CVE-2024-39705), so you have to use punkt_tab now.
This breaks _get_sentence_tokenizer.
In order to use sumy's Tokenizer class I had to override _get_sentence_tokenizer like this:
import zipfile

from nltk.tokenize.punkt import PunktTokenizer

def _get_sentence_tokenizer(self, language):
    """Override: sumy has to load punkt_tab instead of the pickled punkt."""
    # Languages with a dedicated tokenizer bypass NLTK entirely.
    if language in self.SPECIAL_SENTENCE_TOKENIZERS:
        return self.SPECIAL_SENTENCE_TOKENIZERS[language]
    try:
        # PunktTokenizer reads the punkt_tab data rather than a pickle.
        return PunktTokenizer(language)
    except (LookupError, zipfile.BadZipFile) as e:
        raise LookupError(
            "NLTK tokenizers are missing or the language is not supported.\n"
            """Download them with the following command: python -c "import nltk; nltk.download('punkt_tab')"\n"""
            "Original error was:\n" + str(e)
        ) from e
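One way to apply the override above without editing sumy itself might be a small mixin. This is only a sketch: the names `PunktTabMixin` and `PatchedTokenizer` are made up for illustration, and it assumes sumy's `Tokenizer` (from `sumy.nlp.tokenizers`) still looks up its sentence tokenizer through `_get_sentence_tokenizer`. NLTK is imported lazily so the module loads even where NLTK is not installed.

```python
import zipfile


class PunktTabMixin:
    """Hypothetical mixin that swaps the pickled punkt for punkt_tab."""

    def _get_sentence_tokenizer(self, language):
        # Reuse the host class's special-case table if it defines one.
        special = getattr(self, "SPECIAL_SENTENCE_TOKENIZERS", {})
        if language in special:
            return special[language]
        try:
            # Lazy import: only needed for languages handled by NLTK.
            from nltk.tokenize.punkt import PunktTokenizer
            return PunktTokenizer(language)
        except (LookupError, zipfile.BadZipFile) as e:
            raise LookupError(
                "NLTK tokenizers are missing or the language is not supported.\n"
                "Try: python -c \"import nltk; nltk.download('punkt_tab')\"\n"
                "Original error was:\n" + str(e)
            ) from e


# Assumed usage (requires sumy and nltk to be installed):
#   from sumy.nlp.tokenizers import Tokenizer
#   class PatchedTokenizer(PunktTabMixin, Tokenizer):
#       pass
#   tokenizer = PatchedTokenizer("english")
```

Listing the mixin first in the bases means its method wins in the MRO, while everything else still comes from sumy's class.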
Also change nltk.download('punkt') to nltk.download('punkt_tab').
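If the same code has to run against both older and newer NLTK installs, the resource name could be picked from the installed version. A rough sketch, assuming the 3.8.2 cutoff from this issue's title is the right threshold; `punkt_resource_name` is a hypothetical helper, not part of NLTK or sumy.

```python
import re


def punkt_resource_name(nltk_version):
    """Return 'punkt_tab' for NLTK >= 3.8.2 (assumed cutoff), else 'punkt'."""
    # Keep the first three numeric components, e.g. "3.8.1" -> (3, 8, 1).
    parts = tuple(int(p) for p in re.findall(r"\d+", nltk_version)[:3])
    return "punkt_tab" if parts >= (3, 8, 2) else "punkt"


# Assumed usage (requires nltk to be installed):
#   import nltk
#   nltk.download(punkt_resource_name(nltk.__version__))
```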
tomteecezint commented
See this thread - nltk/nltk#3293