nltk/nltk_data

I am not able to download punkt.zip file for tokenization purpose

sumitsharmatops opened this issue · 4 comments

Hi, I am working on an NLP project that uses NLTK. Previously I downloaded punkt via the API (nltk.download('punkt')). That stopped working, so I tried to download it manually, but that is not working either. In other words, I cannot download it manually or via the API. How can I do this? Please help me out.

Hello!
There is a CDN issue with the "Jio" internet provider that prevents it from accessing the NLTK data, e.g.: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.xml.
Some workarounds are described here: nltk/nltk#3146

The primary workaround that helped people was temporarily using a mobile hotspot.

  • Tom Aarsen

I wasn't either, though in my case I was able to download it easily on the same machine, from the same network, with no configuration change.

I looked in the index file and found the URL, copied it into my browser, then moved the downloaded file to the relevant place for NLTK to find it.

Obviously this is manual, but if it suits your use case, it may work.
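
For anyone who wants to script those two steps, here is a rough sketch. It assumes the index file is XML with `<package>` elements that carry `id` and `url` attributes, and that NLTK searches `~/nltk_data` by default. The function names `find_package_url` and `install_package_zip` are made up for illustration and are not part of NLTK.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
from typing import Optional


def find_package_url(index_xml: str, package_id: str) -> Optional[str]:
    """Return the URL listed for package_id, or None if it is not listed.

    Assumption: the index lists packages as <package id="..." url="..."/>.
    """
    root = ET.fromstring(index_xml)
    for package in root.iter("package"):
        if package.get("id") == package_id:
            return package.get("url")
    return None


def install_package_zip(zip_path, category, data_dir=None) -> Path:
    """Unpack a manually downloaded zip into <data_dir>/<category>/.

    Assumption: data_dir defaults to ~/nltk_data, one of the places
    NLTK looks for its data.
    """
    data_dir = Path(data_dir) if data_dir else Path.home() / "nltk_data"
    target = data_dir / category
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

With the zip downloaded through the browser, `install_package_zip("punkt.zip", "tokenizers")` should put it where NLTK can find it. If you unpack somewhere else, adding that directory to `nltk.data.path` before loading the tokenizer should work too.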

The only solution I found was cloning the whole repository. This is what I have done today for my project. It doesn't solve the problem, but I hope it gives you a workaround.
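
In case it helps, this is roughly how one package could be unpacked from a local clone into a directory NLTK can search. It is only a sketch: it assumes the clone keeps archives under `packages/<category>/<name>.zip` (the layout the raw.githubusercontent.com URL above suggests), and `install_from_clone` plus all the paths are placeholders.

```python
import zipfile
from pathlib import Path


def install_from_clone(clone_dir, category, name, data_dir) -> Path:
    """Unpack packages/<category>/<name>.zip from a clone into data_dir.

    Assumption: the clone stores each package as a zip under
    packages/<category>/.
    """
    archive_path = Path(clone_dir) / "packages" / category / f"{name}.zip"
    if not archive_path.is_file():
        raise FileNotFoundError(f"{archive_path} not found in the clone")
    target = Path(data_dir) / category
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return target
```

For example, `install_from_clone("nltk_data", "tokenizers", "punkt", Path.home() / "nltk_data")` after cloning into `./nltk_data`.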


Or you could go to
https://github.dev/nltk/nltk_data
and download it from there. This seems to be a much better solution.