nltk/nltk_data

Issue with downloading inaugural corpus

pratos opened this issue · 4 comments

Hi,

[x] Searched Stackoverflow for any existing issues
[x] Searched nltk_data open and closed issues

I tried to install the inaugural corpus using python -m nltk.downloader inaugural, but ran into this problem:

[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/prthamesh/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] [Errno 21] Is a directory:
[nltk_data]     '/Users/prthamesh/nltk_data/corpora/1789-Washington.tx
[nltk_data]     t'
Error installing package. Retry? [n/y/e]
y
[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/prthamesh/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] [Errno 21] Is a directory:
[nltk_data]     '/Users/prthamesh/nltk_data/corpora/1789-Washington.tx
[nltk_data]     t'
Error installing package. Retry? [n/y/e]
y
[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/prthamesh/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] [Errno 21] Is a directory:
[nltk_data]     '/Users/prthamesh/nltk_data/corpora/1789-Washington.tx
[nltk_data]     t'
Error installing package. Retry? [n/y/e]

This was tested on a Mac M1 (2021 edition) and on Ubuntu 20.04 (GitHub CI runner). The issue occurs on both.

@pratos Hey! I investigated this a little bit. We updated our inaugural corpus about 8 hours ago.
The update uses a slightly different format than before, but I don't see any issues on Windows.
However, on Google Colab I do get these issues. They were (at least partially) resolved by updating nltk: pip install -U nltk.
Perhaps this would work in your case. Let us know.

@stevenbird @nimbusaeta The recent update to inaugural includes some changes that might also be related:

  • 2021-Biden.txt has Windows line endings, while all other files have Unix line endings.
  • The zip directly contains the .txt files, while previously the .zip contained a folder containing the .txt files.
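For anyone who wants to check the layout difference themselves, here is a rough sketch using only the standard library. It builds two small in-memory archives, one flat and one nested, and reports what sits at the top level of each. The helper names (make_sample_zip, top_level_entries, has_single_root_folder) are made up for illustration, and the folder name "inaugural" is an assumption about the old layout, not something confirmed above.

```python
import io
import zipfile


def make_sample_zip(member_names):
    """Build an in-memory zip with placeholder contents (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in member_names:
            zf.writestr(name, "placeholder text")
    return buffer.getvalue()


def top_level_entries(zip_bytes):
    """Return the distinct first path components of all archive members."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name.split("/", 1)[0] for name in zf.namelist()}


def has_single_root_folder(zip_bytes, expected_root):
    """True when every member lives under one folder named expected_root."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()
    return bool(names) and all(n.startswith(expected_root + "/") for n in names)


# New-style layout: the .txt files sit directly in the zip.
flat = make_sample_zip(["1789-Washington.txt", "2021-Biden.txt"])

# Old-style layout (assumed folder name): the files sit inside one folder.
nested = make_sample_zip(
    ["inaugural/1789-Washington.txt", "inaugural/2021-Biden.txt"]
)

print(top_level_entries(flat))
print(top_level_entries(nested))
print(has_single_root_folder(flat, "inaugural"))
print(has_single_root_folder(nested, "inaugural"))
```

To inspect a real download instead, read the bytes of corpora/inaugural.zip from your nltk_data directory and pass them to the same functions.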

If simply updating nltk doesn't help, then we might want to revert (assuming the old version did work!).

Hey, thanks for the update. I'll check whether bumping the nltk version works locally.

For our application, though, we are being cautious not to break things. We resorted to removing inaugural from our list of corpora, since we don't specifically use it right now (it's just bloat).

I can confirm that bumping nltk to 3.6.5 works on Mac M1

Closing this issue, since it only affects people on older versions. We have nltk==3.2.4 for our legacy app. If anyone else runs into this, just upgrade nltk.
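In case it helps, here is a minimal sketch for checking whether an environment is on an older nltk before attempting the download. It treats 3.6.5 as the reference point only because that is the version confirmed to work in this thread; the exact release that introduced the fix is not stated here. The function names are made up for illustration, and the version parsing is deliberately simple (numeric dotted components only).

```python
from importlib import metadata


def version_tuple(version):
    """Turn '3.6.5' into (3, 6, 5); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_older_than(installed, known_good="3.6.5"):
    """True when installed sorts before known_good (assumed reference version)."""
    return version_tuple(installed) < version_tuple(known_good)


def installed_nltk_version():
    """Return the installed nltk version string, or None if nltk is absent."""
    try:
        return metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_nltk_version()
    if current is None:
        print("nltk is not installed")
    elif is_older_than(current):
        print(f"nltk {current} is older than 3.6.5; try: pip install -U nltk")
    else:
        print(f"nltk {current} is at or above 3.6.5")
```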