nltk/nltk

I tried everything and still I get: [nltk_data] Error loading taggers: Package 'taggers' not found in [nltk_data] index

venturaEffect opened this issue · 24 comments

Hi everyone!

I'm using Langchain to create a custom LLM. As part of the process I have to use nltk, and I've been following all the steps. I installed nltk, but I noticed it hasn't created an nltk_data folder. Installing, upgrading, uninstalling and installing again didn't help. So I added these lines to my code:

import nltk
nltk.data.path = ['C:\\Users\\zaesa\\AppData\\Roaming\\nltk_data']
nltk.download('tokenizers', download_dir='C:\\Users\\zaesa\\AppData\\Roaming\\nltk_data')
nltk.download('punkt', download_dir='C:\\Users\\zaesa\\AppData\\Roaming\\nltk_data')
nltk.download('all')

When I check which packages are installed:

`installed_packages = nltk.downloader.Downloader(
    download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data').packages()

print(installed_packages)`

I get:

dict_values([<Package perluniprops>, <Package mwa_ppdb>, <Package punkt>, <Package rslp>, <Package porter_test>, <Package snowball_data>, <Package maxent_ne_chunker>, <Package moses_sample>, <Package bllip_wsj_no_aux>, <Package word2vec_sample>, <Package wmt15_eval>, <Package spanish_grammars>, <Package sample_grammars>, <Package large_grammars>, <Package book_grammars>, <Package basque_grammars>, <Package maxent_treebank_pos_tagger>, <Package averaged_perceptron_tagger>, <Package averaged_perceptron_tagger_ru>, <Package universal_tagset>, <Package vader_lexicon>, <Package lin_thesaurus>, <Package movie_reviews>, <Package problem_reports>, <Package pros_cons>, <Package masc_tagged>, <Package sentence_polarity>, <Package webtext>, <Package nps_chat>, <Package city_database>, <Package europarl_raw>, <Package biocreative_ppi>, <Package verbnet3>, <Package pe08>, <Package pil>, <Package crubadan>, <Package gutenberg>, <Package propbank>, <Package machado>, <Package state_union>, <Package twitter_samples>, <Package semcor>, <Package wordnet31>, <Package extended_omw>, <Package names>, <Package ptb>, <Package nombank.1.0>, <Package floresta>, <Package comtrans>, <Package knbc>, <Package mac_morpho>, <Package swadesh>, <Package rte>, <Package toolbox>, <Package jeita>, <Package product_reviews_1>, <Package omw>, <Package wordnet2022>, <Package sentiwordnet>, <Package product_reviews_2>, <Package abc>, <Package wordnet2021>, <Package udhr2>, <Package senseval>, <Package words>, <Package framenet_v15>, <Package unicode_samples>, <Package kimmo>, <Package framenet_v17>, <Package chat80>, <Package qc>, <Package inaugural>, <Package wordnet>, <Package stopwords>, <Package verbnet>, <Package shakespeare>, <Package ycoe>, <Package ieer>, <Package cess_cat>, <Package switchboard>, <Package comparative_sentences>, <Package subjectivity>, <Package udhr>, <Package pl196x>, <Package paradigms>, <Package gazetteers>, <Package timit>, <Package treebank>, <Package sinica_treebank>, 
<Package opinion_lexicon>, <Package ppattach>, <Package dependency_treebank>, <Package reuters>, <Package genesis>, <Package cess_esp>, <Package conll2007>, <Package nonbreaking_prefixes>, <Package dolch>, <Package smultron>, <Package alpino>, <Package wordnet_ic>, <Package brown>, <Package bcp47>, <Package panlex_swadesh>, <Package conll2000>, <Package universal_treebanks_v20>, <Package brown_tei>, <Package cmudict>, <Package omw-1.4>, <Package mte_teip5>, <Package indian>, <Package conll2002>, <Package tagsets>])

But when I look manually in the nltk_data folder at the right path, I don't see all these packages, and there is no "tokenizers" and no "taggers" folder, even though I wrote:

`import nltk
nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data']

nltk.download(
    'tokenizers', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')

nltk.download(
    'punkt', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')
nltk.download('all')`

Why do I still get this error? It makes no sense:

[nltk_data] Error loading tokenizers: Package 'tokenizers' not found
[nltk_data]     in index
[nltk_data] Error loading taggers: Package 'taggers' not found in
[nltk_data]     index

I'd appreciate any help. I can't figure out what is wrong with nltk. I have followed every instruction in every way I could and it just doesn't work! Thank you.

Hello!

"tokenizers" and "taggers" are not packages themselves, but "folders" that have packages in them: tokenizers, taggers.

I see that <Package punkt>, the only tokenizer, and all taggers are also included in your list of installed packages.

So, "tokenizers" isn't supposed to work, but "punkt" and "all" should, like this:

>>> import nltk
>>> nltk.download("tokenizers")
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found
[nltk_data]     in index
False
>>> nltk.download("punkt")
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tom\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
>>> nltk.download("all")
[nltk_data] Downloading collection 'all'
[nltk_data]    |
[nltk_data]    | Downloading package abc to
...

I think you have everything installed, and you should be ready to start using NLTK.
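
To make the folder-versus-package distinction concrete, here is a minimal sketch that turns the two folder names from this thread into package identifiers before calling `nltk.download`. The `FOLDER_TO_PACKAGES` table and the `expand_ids` and `download_all` helpers are hypothetical names used for illustration only. The table lists just two packages that appear in the list above; it is not NLTK's own index.

```python
from typing import Dict, Iterable, List

# Hypothetical lookup table: folder name -> package ids stored in that folder.
# Only packages named in this thread are listed; the real index has more.
FOLDER_TO_PACKAGES: Dict[str, List[str]] = {
    "tokenizers": ["punkt"],
    "taggers": ["averaged_perceptron_tagger"],
}


def expand_ids(names: Iterable[str]) -> List[str]:
    """Replace folder names with the package ids they contain; keep other names."""
    result: List[str] = []
    for name in names:
        for package_id in FOLDER_TO_PACKAGES.get(name, [name]):
            if package_id not in result:
                result.append(package_id)
    return result


def download_all(names: Iterable[str]) -> Dict[str, bool]:
    """Download each resolved package id and report success per id."""
    import nltk  # imported lazily so expand_ids works without NLTK installed

    return {pid: bool(nltk.download(pid)) for pid in expand_ids(names)}


if __name__ == "__main__":
    # Prints ['punkt', 'averaged_perceptron_tagger']
    print(expand_ids(["tokenizers", "taggers", "punkt"]))
```

Calling `download_all(["tokenizers", "taggers"])` would then request `punkt` and `averaged_perceptron_tagger` instead of the folder names that the downloader rejects.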

  • Tom Aarsen

Well, appreciate your response. But I still don't see a provided solution. This isn't working. And believe me I'm trying everything.

Would really appreciate any help.

I'm not a fan of asking for help, but I've tried everything: read the docs, used the Langchain chatbot, installed with pip, installed with pip but without dependencies, installed manually...

Nothing seems to work. It is a bit frustrating, to be honest.

Appreciate any help.

What happens if you run the following?

from nltk.tokenize import sent_tokenize
sent_tokenize("Hello. How are you? I'm well.")
  • Tom Aarsen

Nothing happens. I just get hundreds of lines of:

[nltk_data] Error loading tokenizers: Package 'tokenizers' not found
[nltk_data]     in index
[nltk_data] Error loading taggers: Package 'taggers' not found in
[nltk_data]     index

I also manually created the taggers folder and the tokenizers folder and added all the files and the unzipped folders to the respective folders. Still nothing.

Appreciate if you can still help me because I just don't see what is going wrong here.

I can tell that you've tried a lot already - I'm sorry to hear that it's being so difficult. I tried to google your Package 'tokenizers' not found issue and found no other cases, sadly.
My tokenizers folder in nltk_data looks like this, does that match yours somewhat?
[image: tokenizers folder in nltk_data]

Do you perhaps also have a rogue tokenizers file (not folder) somewhere that it could be trying (but failing) to load?

  • Tom Aarsen

Hi Tom,

No, my tokenizers folder has the zipped punkt folder and punkt.xml. Is the punkt folder you have the one you get when you unzip punkt.zip? Should I also unzip the folders in taggers?

I've searched all my folders and I have more folders called tokenizers, but in other paths. One is for Stable Diffusion. But this doesn't explain why it also doesn't find the taggers package. Also, if possible, can I force it to look for the tokenizers and taggers folders in a specific path?

Appreciate.

Ok, I unzipped all the folders and ran the script again. Same errors. This is so frustrating...

> I've searched all my folders and I have more folders called tokenizers, but in other paths. One is for Stable Diffusion. But this doesn't explain why it also doesn't find the taggers package. Also, if possible, can I force it to look for the tokenizers and taggers folders in a specific path?

I think it should only look under a .../nltk_data path, so I think that should be fine.
This is my punkt folder:
[image: punkt folder]
And this is my punkt.zip:
[image: punkt.zip]
And then this is the content of the punkt folder inside of the punkt.zip:
[image: contents of the punkt folder inside punkt.zip]
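
The layout described above can also be checked from Python. This is a minimal sketch using only the standard library. `find_resource` is a hypothetical helper that looks under each search directory for either an unzipped package folder or a `.zip` archive with the same name. It is a simplified imitation for debugging, not NLTK's actual lookup code, and the `%APPDATA%\nltk_data` location is assumed from the paths quoted in this thread.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def find_resource(search_dirs: Iterable[str], folder: str, package: str) -> Optional[Path]:
    """Return the first <dir>/<folder>/<package> directory or .zip found, else None."""
    for base in search_dirs:
        unzipped = Path(base) / folder / package
        zipped = Path(base) / folder / f"{package}.zip"
        if unzipped.is_dir():
            return unzipped
        if zipped.is_file():
            return zipped
    return None


if __name__ == "__main__":
    # Assumption: the per-user Windows data directory used in this thread.
    appdata = os.environ.get("APPDATA", "")
    dirs = [os.path.join(appdata, "nltk_data")] if appdata else []
    for folder, package in [("tokenizers", "punkt"),
                            ("taggers", "averaged_perceptron_tagger")]:
        print(f"{folder}/{package} ->", find_resource(dirs, folder, package))
```

A `None` result for a package means neither the folder nor the archive exists at the expected place under any of the directories passed in.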

Thanks Tom!

It looks pretty similar to mine. Do you see anything wrong?

[images: nltk-screenshot, tokenizers-screenshot, punkt-screenshot, taggers-screenshot]

Appreciate your help!

Nothing too odd. I suppose you could try deleting the .xml files, and perhaps the folders other than tokenizers and taggers in the first screenshot. I think once you have that, then we match perfectly.

Other than that, maybe my punkt.zip has a punkt folder directly inside of it instead of containing the files directly - I'm not sure how this looks for you.

Ok, I have done the changes you have suggested.

Unfortunately, it still isn't working, and I can't see why.

Damn it!

I'm puzzled too. What version of nltk are you using? Perhaps it's an older one?

No, I just installed nltk several times, so it should be the latest one.

I've also looked at where it is installed and I get these paths:
c:\users\zaesa\appdata\local\packages\pythonsoftwarefoundation.python.3.10_qbz5n2kfra8p0\localcache\local-packages\python310\scripts\nltk.exe
c:\users\zaesa\appdata\local\packages\pythonsoftwarefoundation.python.3.10_qbz5n2kfra8p0\localcache\local-packages\python310\site-packages\nltk-3.8.1.dist-info*
c:\users\zaesa\appdata\local\packages\pythonsoftwarefoundation.python.3.10_qbz5n2kfra8p0\localcache\local-packages\python310\site-packages\nltk*

I'm so confused...

Weird, when I try to check the version of nltk I get this message:

PS C:\Users\zaesa\OneDrive\Escritorio\code\UtadGPT> python -m nltk
C:\Users\zaesa\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe: No module named nltk.__main__; 'nltk' is a package and cannot be directly executed

Maybe there is a possible reason?

That's normal, I have that too. Only python -m nltk.downloader should work
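
Since `python -m nltk` cannot report a version, one alternative is to ask the interpreter's package metadata. This is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8+). `installed_version` is a hypothetical helper name, and it reports on whichever Python interpreter runs the script, which may differ from the one `pip` installs into.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string for nltk, or None if this interpreter cannot see it.
    print("nltk:", installed_version("nltk"))
```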

Ok, found a work around. This is what I get:

PS C:\Users\zaesa\OneDrive\Escritorio\code\UtadGPT> pip show nltk
Name: nltk
Version: 3.8.1
Summary: Natural Language Toolkit
Home-page: https://www.nltk.org/
Author: NLTK Team
Author-email: nltk.team@gmail.com
License: Apache License, Version 2.0
Location: c:\users\zaesa\appdata\local\packages\pythonsoftwarefoundation.python.3.10_qbz5n2kfra8p0\localcache\local-packages\python310\site-packages
Requires: click, joblib, regex, tqdm
Required-by: unstructured

So, it seems everything is fine, which makes it even harder to understand why it isn't working. There has to be another reason. Maybe it has to do with how Langchain works with nltk?

It seems to be via unstructured. I don't think they use nltk for all that much: https://github.com/search?q=repo%3AUnstructured-IO%2Funstructured%20nltk&type=code

Oh, so it seems to be a problem with Langchain, right?

Maybe there is some kind of incompatibility...

Unstructured just pushed a release 5 minutes ago: https://github.com/Unstructured-IO/unstructured/releases/tag/0.8.7

Upgrading to that fixed it for me.

Awesome! I hope that fixes it for you too @venturaEffect

So do I have to pip install langchain again @akowalsk ? Appreciate any guidance 🙏

I think pip install -U unstructured should do it.

I don't think so, you should just have to do pip install -U unstructured. I can't say for sure though because I'm using unstructured directly right now. My error was exactly the same as what you saw and it went away once I upgraded to 0.8.7.
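
To tell whether the installed unstructured is older than the 0.8.7 release mentioned above, a quick check along these lines may help. It is a minimal sketch: `parse_version` and `is_new_enough` are hypothetical helpers, and the parser assumes a plain dotted numeric version such as `0.8.7`; suffixes like `rc1` or `.dev0` are not handled and would raise a `ValueError`.

```python
from importlib import metadata
from typing import Tuple

# The unstructured release reported in this thread as making the error go away.
MINIMUM: Tuple[int, ...] = (0, 8, 7)


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as '0.8.7' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_new_enough(version_text: str, minimum: Tuple[int, ...] = MINIMUM) -> bool:
    """Return True when version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        current = metadata.version("unstructured")
    except metadata.PackageNotFoundError:
        print("unstructured is not installed for this interpreter")
    else:
        if is_new_enough(current):
            print(current, "is 0.8.7 or newer")
        else:
            print(current, "is older than 0.8.7; try: pip install -U unstructured")
```

Comparing integer tuples rather than strings matters here, because a string comparison would rank `0.10.0` below `0.8.7`.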

Thank you soooo much @akowalsk !! You saved my day! Wish you a wonderful weekend!

Also thanks a lot to @tomaarsen . Amazing support!! Enjoy the weekend!!