JovianHQ/opendatasets

Downloading datasets behind network proxies fails due to timeout errors

For users behind network proxies, the following example in the main README.md fails due to timeout errors:

$ python
Python 3.9.18 (main, Sep 11 2023, 13:41:44) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import opendatasets as od
>>> dataset_url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
>>> od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: ****
Your Kaggle Key: 
2024-01-12 06:45:08,854 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1a5408e490>: Failed to establish a new connection: [Errno 110] Connection timed out')': /api/v1/datasets/download/tunguz/us-elections-dataset?datasetVersionNumber=None

However, if the KAGGLE_PROXY environment variable is set properly, the example also works for users behind a network proxy.

Here's the code snippet that makes this work:

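import os
# Mirror the standard HTTPS proxy variable into KAGGLE_PROXY
# (lowercase name is checked first, then uppercase).
if 'https_proxy' in os.environ.keys():
    os.environ['KAGGLE_PROXY'] = os.environ['https_proxy']
elif 'HTTPS_PROXY' in os.environ.keys():
    os.environ['KAGGLE_PROXY'] = os.environ['HTTPS_PROXY']
else:
    # No proxy configured: set KAGGLE_PROXY to an empty string.
    os.environ['KAGGLE_PROXY'] = ''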

import opendatasets as od
dataset_url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
od.download(dataset_url)

And here's a sample run behind a network proxy:

$ python
Python 3.9.18 (main, Sep 11 2023, 13:41:44) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> if 'https_proxy' in os.environ.keys():
...     os.environ['KAGGLE_PROXY'] = os.environ['https_proxy']
... elif 'HTTPS_PROXY' in os.environ.keys():
...     os.environ['KAGGLE_PROXY'] = os.environ['HTTPS_PROXY']
... else:
...     os.environ['KAGGLE_PROXY'] = ''
... 
>>> import opendatasets as od
>>> dataset_url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
>>> od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: ****
Your Kaggle Key: 
Downloading us-elections-dataset.zip to ./us-elections-dataset
  0%|                                                                        | 0.00/133k [00:00<?, ?B/s]
100%|████████████████████████████████████████████████████████████████| 133k/133k [00:00<00:00, 6.49MB/s]
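
For anyone who wants to reuse the workaround, it could be wrapped in a small helper along these lines. This is only a sketch: `configure_kaggle_proxy` is a hypothetical name, not part of opendatasets or the Kaggle client. Unlike the snippet above, it leaves an already-set KAGGLE_PROXY alone and does not write an empty value when no proxy is configured.

```python
import os


def configure_kaggle_proxy(environ=os.environ):
    """Hypothetical helper: copy the HTTPS proxy setting into KAGGLE_PROXY.

    Returns the value that ends up in KAGGLE_PROXY, or None if no proxy
    variable is set. `environ` defaults to os.environ; passing a plain
    dict makes the function easy to test.
    """
    # Respect a KAGGLE_PROXY that the user has already set explicitly.
    if environ.get("KAGGLE_PROXY"):
        return environ["KAGGLE_PROXY"]

    # Same lookup order as the snippet above: lowercase first, then uppercase.
    for name in ("https_proxy", "HTTPS_PROXY"):
        proxy = environ.get(name)
        if proxy:
            environ["KAGGLE_PROXY"] = proxy
            return proxy

    # No proxy configured: leave the environment untouched.
    return None
```

Calling `configure_kaggle_proxy()` before `od.download(dataset_url)` should then have the same effect as the inline snippet for users who have `https_proxy` or `HTTPS_PROXY` set.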

I was planning to submit a PR to fix the issue, but I see that this repo was last updated over 2 years ago.