sdv-dev/SDGym

Can't download datasets if `.aws` config is present

pvk-developer opened this issue · 0 comments

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDGym version: 0.6.1
  • Python version: Any
  • Operating System: MacOS / Unix / Ubuntu

Error Description

When running SDGym in a local environment that happens to have a .aws/ folder with some AWS configuration in it, downloading the SDV datasets fails with the following error:

ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.

Steps to reproduce

To reproduce the issue, create a .aws folder in your home directory (mkdir ~/.aws), then create a file inside it called credentials containing:

[default]
aws_access_key_id = <your id>
aws_secret_access_key = <your access key>

PS: Make sure you have cleared the cache of previously downloaded datasets first; otherwise the download step is skipped and the error does not occur.
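The setup above can be scripted as follows. The key values are placeholders; any syntactically valid but unrecognized key triggers the error. Note that this overwrites an existing ~/.aws/credentials, so back that file up first:

```shell
# Reproduce the broken-credentials setup.
# WARNING: overwrites any existing ~/.aws/credentials -- back it up first.
mkdir -p ~/.aws
cat > ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = AKIAFAKEFAKEFAKEFAKE
aws_secret_access_key = not-a-real-secret
EOF
```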

import sdgym

sdgym.benchmark_single_table(synthesizers=['GaussianCopulaSynthesizer'], sdv_datasets=['student_placements'], timeout=22)
---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
Cell In[4], line 1
----> 1 sdgym.benchmark_single_table(synthesizers=['GaussianCopulaSynthesizer'], sdv_datasets=['student_placements'], timeout=22)

File ~/Projects/sdv-dev/SDGym/sdgym/benchmark.py:507, in benchmark_single_table(synthesizers, custom_synthesizers, sdv_datasets, additional_datasets_folder, limit_dataset_size, compute_quality_score, sdmetrics, timeout, output_filepath, detailed_results_folder, show_progress, multi_processing_config)
    503 _validate_inputs(output_filepath, detailed_results_folder, synthesizers, custom_synthesizers)
    505 _create_detailed_results_directory(detailed_results_folder)
--> 507 job_args_list = _generate_job_args_list(
    508     limit_dataset_size, sdv_datasets, additional_datasets_folder, sdmetrics,
    509     detailed_results_folder, timeout, compute_quality_score, synthesizers, custom_synthesizers)
    511 scores = _run_jobs(multi_processing_config, job_args_list, show_progress)
    512 if output_filepath:

File ~/Projects/sdv-dev/SDGym/sdgym/benchmark.py:90, in _generate_job_args_list(limit_dataset_size, sdv_datasets, additional_datasets_folder, sdmetrics, detailed_results_folder, timeout, compute_quality_score, synthesizers, custom_synthesizers)
     88 datasets = []
     89 if sdv_datasets is not None:
---> 90     datasets = get_dataset_paths(sdv_datasets, None, None, None, None)
     92 if additional_datasets_folder:
     93     additional_datasets = get_dataset_paths(None, None, additional_datasets_folder, None, None)

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:200, in get_dataset_paths(datasets, datasets_path, bucket, aws_key, aws_secret)
    196     else:
    197         datasets = _get_available_datasets(
    198             'single_table', bucket=bucket)['dataset_name'].tolist()
--> 200 return [
    201     _get_dataset_path('single_table', dataset, datasets_path, bucket, aws_key, aws_secret)
    202     for dataset in datasets
    203 ]

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:201, in <listcomp>(.0)
    196     else:
    197         datasets = _get_available_datasets(
    198             'single_table', bucket=bucket)['dataset_name'].tolist()
    200 return [
--> 201     _get_dataset_path('single_table', dataset, datasets_path, bucket, aws_key, aws_secret)
    202     for dataset in datasets
    203 ]

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:60, in _get_dataset_path(modality, dataset, datasets_path, bucket, aws_key, aws_secret)
     57     if local_path.exists():
     58         return local_path
---> 60 download_dataset(
     61     modality, dataset, dataset_path, bucket=bucket, aws_key=aws_key, aws_secret=aws_secret)
     62 return dataset_path

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:36, in download_dataset(modality, dataset_name, datasets_path, bucket, aws_key, aws_secret)
     34 LOGGER.info('Downloading dataset %s from %s', dataset_name, bucket)
     35 s3 = get_s3_client(aws_key, aws_secret)
---> 36 obj = s3.get_object(Bucket=bucket_name, Key=f'{modality.upper()}/{dataset_name}.zip')
     37 bytes_io = io.BytesIO(obj['Body'].read())
     39 LOGGER.info('Extracting dataset into %s', datasets_path)

File ~/.virtualenvs/SDGym/lib/python3.8/site-packages/botocore/client.py:530, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    526     raise TypeError(
    527         f"{py_operation_name}() only accepts keyword arguments."
    528     )
    529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)

File ~/.virtualenvs/SDGym/lib/python3.8/site-packages/botocore/client.py:960, in BaseClient._make_api_call(self, operation_name, api_params)
    958     error_code = parsed_response.get("Error", {}).get("Code")
    959     error_class = self.exceptions.from_code(error_code)
--> 960     raise error_class(parsed_response, operation_name)
    961 else:
    962     return parsed_response

ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.