rkadlec/ubuntu-ranking-dataset-creator

Cannot download dataset

jaseleephd opened this issue · 14 comments

Downloading the dataset fails. I have read the previous issues (#9 and #11), but the problem doesn't seem to have been resolved. When I run ./generate.sh, I get:

Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 404, in <module>
    prepare_data_maybe_download(args.data_root)
  File "create_ubuntu_dataset.py", line 260, in prepare_data_maybe_download
    filepath, _ = urllib.request.urlretrieve(url, archive_path)
  File "/usr/lib64/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.7/urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "/usr/lib64/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/usr/lib64/python2.7/urllib.py", line 357, in open_http
    'got a bad status line', None)
IOError: ('http protocol error', 0, 'got a bad status line', None)

The IOError comes from urlretrieve on http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz

Doing wget http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz also fails. Can anybody tell me how else to download the dataset? Thanks a lot in advance!
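
For anyone scripting around the flaky host in the meantime, here is a minimal sketch of a download helper that retries on transient errors and gives up immediately on 4xx responses. It assumes Python 3 and the standard library only; `fetch_with_retries`, its retry count, its delay, and the injectable `opener` parameter are hypothetical names for illustration and are not part of create_ubuntu_dataset.py.

```python
import http.client
import shutil
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, dest, attempts=3, delay=5.0,
                       opener=urllib.request.urlopen):
    """Download url to dest, retrying on transient failures.

    Hypothetical helper: `opener` is injectable so the logic can be
    exercised without touching the network.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response, open(dest, "wb") as out:
                shutil.copyfileobj(response, out)
            return dest
        except urllib.error.HTTPError as err:
            # A 4xx status (e.g. 403 Forbidden) will not go away on retry.
            if 400 <= err.code < 500:
                raise
            last_error = err
        except (urllib.error.URLError, http.client.HTTPException, OSError) as err:
            # Connection problems and malformed responses may be transient.
            last_error = err
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(
        "giving up on %s after %d attempts: %r" % (url, attempts, last_error)
    )
```

A failed attempt can still leave a partial file at `dest`, so the result is worth validating before unpacking.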

Facing the same issue. Is it again related to the McGill servers?

@jasonleeinf have you solved the problem? I came across the same error.

Same here: the dataset has the wrong permissions:

./generate.sh
Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Successfully downloaded ./ubuntu_dialogs.tgz
Unpacking dialogs ...
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 404, in <module>
    prepare_data_maybe_download(args.data_root)
  File "create_ubuntu_dataset.py", line 266, in prepare_data_maybe_download
    with tarfile.open(archive_path) as tar:
  File "/home/dani/anaconda3/envs/ubuntudialogue/lib/python2.7/tarfile.py", line 1680, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully
Unpacking dialogs ...
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 404, in <module>
    prepare_data_maybe_download(args.data_root)
  File "create_ubuntu_dataset.py", line 266, in prepare_data_maybe_download
    with tarfile.open(archive_path) as tar:
  File "/home/dani/anaconda3/envs/ubuntudialogue/lib/python2.7/tarfile.py", line 1680, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully
Unpacking dialogs ...
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 404, in <module>
    prepare_data_maybe_download(args.data_root)
  File "create_ubuntu_dataset.py", line 266, in prepare_data_maybe_download
    with tarfile.open(archive_path) as tar:
  File "/home/dani/anaconda3/envs/ubuntudialogue/lib/python2.7/tarfile.py", line 1680, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully
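
A "successful" download followed by `tarfile.ReadError` suggests the saved file may not be an archive at all, for example an HTML error page written to disk. A quick way to check before unpacking, as a sketch assuming Python 3 (`describe_download` is a hypothetical helper, not something in the repository):

```python
import tarfile


def describe_download(path):
    """Return 'tar', 'html', or 'unknown' for a downloaded file.

    Hypothetical helper for diagnosing a bad download.
    """
    # is_tarfile also recognises compressed archives such as .tgz.
    if tarfile.is_tarfile(path):
        return "tar"
    with open(path, "rb") as handle:
        head = handle.read(512).lstrip().lower()
    # An error page served in place of the archive usually starts like this.
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

If this reports `html`, opening ./ubuntu_dialogs.tgz in a text editor should show the server's error message.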

The problem is with permissions, as shown by:

wget http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz
--2017-01-11 11:18:17--  http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz
Resolving cs.mcgill.ca (cs.mcgill.ca)... 132.206.51.10
Connecting to cs.mcgill.ca (cs.mcgill.ca)|132.206.51.10|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-01-11 11:18:17 ERROR 403: Forbidden.

If I browse in Chrome to http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz I get

Forbidden

You don't have permission to access /~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz on this server.
Server unable to read htaccess file, denying access to be safe

Apache/2.4.18 (Ubuntu) Server at cs.mcgill.ca Port 80

The directory itself is probably not readable by "others"; if I browse to http://cs.mcgill.ca/~jpineau/datasets/ I get:

screenshot

See http://stackoverflow.com/questions/27890751/magento-new-host-403-forbidden-server-unable-to-read-htaccess-file or http://stackoverflow.com/questions/31365981/server-unable-to-read-htaccess-file-denying-access-to-be-safe on how to fix; basically, run chmod -R o+rX . inside the datasets/ubuntu-corpus-1.0 directory (directories need the execute bit as well as read so the server can traverse them, and * would skip top-level dotfiles such as .htaccess).
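
For whoever has shell access on the server, the same permission fix can be sketched in Python. This assumes a POSIX filesystem; `make_world_readable` is a hypothetical helper, and it grants directories the execute (search) bit in addition to read, since a web server needs it to descend into a directory and read its .htaccess.

```python
import os
import stat


def make_world_readable(root):
    """Add o+r to files and o+rx to directories under root.

    Hypothetical helper; existing permission bits are preserved.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        dir_mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        os.chmod(dirpath, dir_mode | stat.S_IROTH | stat.S_IXOTH)
        for name in filenames:
            path = os.path.join(dirpath, name)
            file_mode = stat.S_IMODE(os.stat(path).st_mode)
            # Dotfiles such as .htaccess are included by os.walk.
            os.chmod(path, file_mode | stat.S_IROTH)
```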

One more adding to the choir: is there any chance this will be available again?

@ryan-lowe Do you have access to the servers? There is apparently something wrong with permissions.

Sorry for the delayed reply. I do not have permissions, but I just sent an e-mail to Joelle Pineau and to the CS technical people at McGill, who will be able to sort it out. I think it will take at most a few days.

Again, apologies for the inconvenience. I think that if there is a chance that this keeps happening (it's the 2nd time at least), we will try to move it to a more permanent location. @rkadlec, would IBM be amenable to this?

Okay, so it turns out the tech admins have just fixed the issue -- apparently it was a permissions problem. If any problems persist, please let me know!

Great, thanks! I'm running generate.sh right now and it seems ok.

Closing this issue since the hosting has worked fine over the last month.

Getting an empty response on the request 8 out of 10 times, and even when the download starts, it just stops at around 900 KB.
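
When a transfer keeps stalling partway, one workaround is to resume from the bytes already on disk instead of starting over. Below is a minimal sketch, assuming Python 3 and assuming the server honours HTTP `Range` requests (not verified for this host); `resume_request` and `append_response` are hypothetical helpers.

```python
import os
import shutil
import urllib.request


def resume_request(url, partial_path):
    """Build a request for only the bytes missing from partial_path."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if offset:
        request.add_header("Range", "bytes=%d-" % offset)
    return request, offset


def append_response(response, partial_path, offset):
    """Append the body on 206 Partial Content, otherwise start over."""
    # A 200 reply means the server ignored the Range header and is
    # sending the whole file again, so the partial data is overwritten.
    mode = "ab" if offset and response.status == 206 else "wb"
    with open(partial_path, mode) as out:
        shutil.copyfileobj(response, out)
```

Usage would look like `request, offset = resume_request(url, path)` followed by `with urllib.request.urlopen(request) as response: append_response(response, path, offset)`, repeated until the connection stops dropping. `wget -c` does the same thing from the shell.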

Facing the same ERR_EMPTY_RESPONSE issue that @thepsyntist reported. Is there something wrong with the server? @ryan-lowe Thanks a lot!

Thanks for the heads up, I'll ask the McGill tech support people to look into it.

Okay, it should be fixed now! @tomyoung96 @thepsyntist