liormizr/s3path

Glob-related error appeared on 0.3.3

avibrazil opened this issue · 13 comments

0.3.2 was working fine until I upgraded to 0.3.3 and then the following error started to happen. I don't know why and where is the problem but going back to 0.3.2 fixes the problem.

  File "/home/sagemaker-user/Kroft/my_avm/avm/avm.py", line 1063, in load
    available=[f for f in list(resolved_path.glob(filename)) if '.dvc' not in str(f)]
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 706, in glob
    yield from super().glob(pattern)
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 1177, in glob
    for p in selector.select_from(self):
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 523, in select_from
    if not is_dir(parent_path):
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 678, in is_dir
    return self._accessor.is_dir(self)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 169, in is_dir
    resource, _ = self.configuration_map.get_configuration(path)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 90, in get_configuration
    if resources is None and path in self.resources:
TypeError: argument of type 'NoneType' is not iterable

filename variable contains something like random • * • *.pkl*. And I don't think this is relevant information but this code is a static method of a class and it was running twice, in 2 parallel threads, where first thread succeeded and second failed with above error.

@avibrazil thanks for the issue

How can I reproduce it?
can you please give me a line to run or a small script?

Same issue here, also threading-related. Relevant stack trace:

  File "/usr/local/lib/python3.9/pathlib.py", line 1264, in write_bytes
    with self.open(mode='wb') as f: 
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 721, in open
    return self._accessor.open(         
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 205, in open
    resource, config = self.configuration_map.get_configuration(path)
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 90, in get_configuration
    if resources is None and path in self.resources:                                                      
TypeError: argument of type 'NoneType' is not iterable   

The simplified code goes like this:

src = Path('some-local')
dst = S3Path.from_uri('s3://something') / 'foo.csv'
dst.write_bytes(src.read_bytes())

And run this massively parallel. Today I found out that creating boto3 clients is not thread safe and changed my own code like this. s3path is creating its own boto3 resource in class _S3Accessor, maybe that is the culprit.

I submitted a fix. Since I understand the issue better now, here is a workaround that should fix the issue until a new version is released: Basically, use any feature of the library without threads at the beginning of your application. It will make the library initialise without the concurrency issue that is the cause of the exception. I.e.:

S3Path.from_uri('s3://some-bucket-you-have-access-to/').exists()

There are probably other methods that are even shorter, but this one works.

I’m not sure I got it.

You mean, call the lib somehow before any threading and then the lib can be used later in threading contexts?

Is this the temporary fix?

@avibrazil Sorry, I realise I was not clear 😞

Take the example from the pull request:

from multiprocessing.pool import ThreadPool

from s3path import S3Path

def do(i):
    dst = S3Path.from_uri('s3://something') / f'hello_{i}.txt'
    dst.write_bytes(b'hello')

# This line will "fix" the issue because it does ultimately call
# _S3ConfigurationMap.get_configuration()
S3Path.from_uri('s3://some-bucket-you-have-access-to/').exists()

# Now, do your threaded stuff.
ThreadPool(processes=20).map(do, range(100))

The thing is, it needs to call _S3ConfigurationMap.get_configuration(). Pardon me that I do not compile a list of possible functions that do this, because my time is limited 🙂 With the code above you can experiment and find one that also works, if the provided one is not good for you, i.e. because it requires some permission you do not have or so.

EDIT: The important part is calling exists(). How you construct S3Path does not matter.

Hi all,

I merged the fix
I'll update in this issue when I'll create a new release

Thanks!

@liormizr Anything we can help with? This issue is blocking my upgrade to Python 3.10, so I am eager to see the new release 🙂

@sbrandtb will it help if I'll deploy a new version now only with this fix?

@liormizr yes 🙂

Version 0.3.4 deployed to PyPi
With only this issue fix
Sorry for the delay

I'll update tomorrow when I'll deploy to Conda

Version deployed to Conda

Oh, that's great!
I'll test it later.
But having it published, maybe it is time to close this bug.

Update: Unrelated. This issue was related to this workaround

After updating and running my project in our CI, I am seemingly randomly getting

ValueError: the bucket 'xxx does not exist, or is forbidden for access (ClientError('An error occurred (InvalidAccessKeyId) when calling the CreateMultipartUpload operation: The AWS Access Key Id you provided does not exist in our records.'))

I did not observe this while testing the change locally and I am not sure if if is not connected to another change when migrating to Python 3.10 and updating packages, but want to drop this message here so if anyone else encounters it, he can +1. After all, it looks like some other non-deterministic thing.