kubeflow/pytorch-operator

Mnist dataset server is down

Jeffwan opened this issue · 5 comments

The e2e test is failing. The reason is straightforward: the MNIST server returns HTTP 503. I checked, and this is already being tracked in the PyTorch community.

The patch is only available on master, and there is no way to specify the download path. I can either disable that single test case and wait for a stable release, or build a nightly image, which takes extra effort.

Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "/var/mnist.py", line 150, in <module>
    main()
  File "/var/mnist.py", line 123, in main
    transforms.Normalize((0.1307,), (0.3081,))
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
    epoch, batch_idx * len(data), len(train_loader.dataset),
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
    if should_distribute():
  File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

Confirmed this is a server side issue.

https://discuss.pytorch.org/t/mnist-server-down/114433
pytorch/vision#3554
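
For anyone who needs to work around an outage like this without waiting for an upstream patch, one option is to try a list of mirrors in order. The following is only a minimal sketch using the standard library: `fetch_with_fallback`, `_read_url`, and `MIRRORS` are hypothetical names, and the second mirror URL is an assumption that should be replaced with a host you trust.

```python
import urllib.error
import urllib.request

# Hypothetical mirror list. The first entry is the host from the traceback;
# the second is an assumed mirror and should be replaced with one you trust.
MIRRORS = [
    "http://yann.lecun.com/exdb/mnist/",
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
]


def _read_url(url, timeout=30):
    """Download a URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(filename, mirrors=MIRRORS, fetch=_read_url):
    """Return the bytes of `filename` from the first mirror that answers."""
    errors = []
    for base in mirrors:
        url = base + filename
        try:
            return fetch(url)
        except urllib.error.URLError as exc:
            # HTTPError (for example a 503) is a subclass of URLError.
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

The `fetch` parameter exists only so the fallback logic can be exercised without network access; in normal use it can be left at its default.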

@Jeffwan We faced the same problem in Katib.
We are currently using FashionMNIST instead of MNIST: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py#L137.
I believe it is hosted on the PyTorch servers.
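
A rough sketch of what such a swap can look like is shown below. It is not the code from the Katib example: `build_train_dataset` and `SUPPORTED` are hypothetical names, and the normalization constants are the MNIST values from the traceback above, which are probably not ideal for FashionMNIST.

```python
# Hypothetical mapping from a dataset name to a torchvision dataset class name.
SUPPORTED = {"mnist": "MNIST", "fashion-mnist": "FashionMNIST"}


def build_train_dataset(name="fashion-mnist", root="data", download=True):
    """Build the training split of the selected dataset."""
    if name not in SUPPORTED:
        raise ValueError(f"unsupported dataset: {name!r}")

    # Imported lazily so the argument check above works without torchvision.
    from torchvision import datasets, transforms

    dataset_cls = getattr(datasets, SUPPORTED[name])
    transform = transforms.Compose([
        transforms.ToTensor(),
        # MNIST statistics taken from the traceback; an assumption for FashionMNIST.
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
    return dataset_cls(root, train=True, download=download, transform=transform)
```

Both datasets use 28x28 grayscale images with 10 classes, so a model written for MNIST should accept FashionMNIST input without shape changes.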

@andreyvelich this sounds like a good solution.
Another way would be to pre-download the dataset in the image.
The problem is how to make a new image for the example. The current one is from the GCP registry, which is no longer available.
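
If the dataset is baked into the image, the example could check for the files up front and skip the download entirely. Here is a minimal sketch of that check; `missing_dataset_files` is a hypothetical helper, and the directory layout (all four archives directly under one directory) is an assumption that may not match what a given torchvision version expects.

```python
from pathlib import Path

# The four archives that make up the MNIST dataset.
MNIST_FILES = (
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
)


def missing_dataset_files(data_dir, filenames=MNIST_FILES):
    """Return the expected files that are not present under `data_dir`."""
    root = Path(data_dir)
    return [name for name in filenames if not (root / name).is_file()]
```

A training script could call this at startup and only attempt a download when the returned list is non-empty, so the e2e test no longer depends on the dataset server being reachable.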

Sounds good. Let me double-check whether the code is compatible with the FashionMNIST dataset. If it is, and the data server is reliable, we can quickly switch to it.

The code has been changed in #327.
We need a better way to publish images. This can be done after the 1.3 release.

Hi @Jeffwan, Kubeflow 1.3 has already been released. Is there any progress on this?