googleapis/google-cloud-python

Google Cloud Storage failing when using threads

Closed this issue · 12 comments

  1. Ubuntu 16.04
  2. Python 2.7.6
  3. google-api-python-client>=1.6.2 and google-cloud-storage>=1.1.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
ssl.SSLError: [Errno 1] _ssl.c:1429: error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
  1. The client is not thread-safe (I think)
from multiprocessing.pool import ThreadPool
from google.cloud import storage
from functools import partial

def upload(bucket, i):
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url

bucket = storage.Client().get_bucket("deepo-test")
pool = ThreadPool()
fct = partial(upload, bucket)
pool.map(fct, [i for i in range(2)])

@Alexis-Jacob That's correct, the error you are seeing is caused by the lack of thread-safety in httplib2. We recommend (for now) creating an instance of Client that is local to your thread / process.
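
For example, a minimal sketch of keeping one Client per thread with threading.local (get_client is just an illustrative helper here, and the bucket name is the one from the repro above):

import threading

from google.cloud import storage

_local = threading.local()

def get_client():
    # Lazily create one Client per thread; nothing is shared across threads.
    if not hasattr(_local, "client"):
        _local.client = storage.Client()
    return _local.client

def upload(i):
    bucket = get_client().get_bucket("deepo-test")
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url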

@Alexis-Jacob I am going to pre-emptively close this issue because it is "known" and something we are working on. If you'd like a thread-safe transport, I recommend looking into https://github.com/GoogleCloudPlatform/httplib2shim
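
If you go the httplib2shim route, its README suggests the usage is a single patch() call made before constructing any clients; a rough sketch, not a guarantee that it fits every transport:

import httplib2shim
httplib2shim.patch()  # replaces httplib2.Http with a urllib3-backed, thread-safe shim

from google.cloud import storage

# Clients constructed after patching will use the shim wherever they rely on httplib2.
client = storage.Client()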

@jonparrott Is there a "better" recommendation to make?

evanj commented

Wow, can we get this added to the top level README.md, and added to the documentation for each of the clients? This causes a significant change in how I can use this library. I believe I'm running into this, and now I need to restructure my app to avoid passing the client or any created sub-objects around.

Additionally, httplib2shim doesn't seem to work with the recent updates: the google-auth library no longer uses httplib2, and when using it with google.cloud.storage I get the following exception:

    File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 891, in upload_from_file
    client, file_obj, content_type, size, num_retries)
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 818, in _do_upload
    client, stream, content_type, size, num_retries)
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 768, in _do_resumable_upload
    client, stream, content_type, size, num_retries)
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 727, in _initiate_resumable_upload
    total_bytes=size, stream_final=False)
  File "/usr/local/lib/python2.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
    total_bytes=total_bytes, stream_final=stream_final)
  File "/usr/local/lib/python2.7/site-packages/google/resumable_media/_upload.py", line 410, in _prepare_initiate_request
    if stream.tell() != 0:
AttributeError: addinfourl instance has no attribute 'tell'

@evanj it's our hope that soon we'll be able to wholesale migrate to requests so that this actually won't be an issue. @dhermes where are we on #1998?

That error you posted is curious. Storage is already using our new non-httplib2 transport, so @dhermes might be able to shed some light there.

evanj commented

Thanks for the instant response! I get that this is going to be fixed "soon", but this was a bit of a surprise for me to discover that this is a known issue for the current release that only seems to be documented in Github Issues. I would love it if the following page would say "Client is not thread-safe; do not use it between threads": https://googlecloudplatform.github.io/google-cloud-python/stable/storage-client.html

Once the bug is fixed, then the docs could be fixed :)

Also don't worry about the exception, I'm probably doing something weird or have some library version mismatch. I'm just going to fix my code to not re-use clients, since that seems like the more sane, documented solution at the moment.

Indeed, thanks for the patience @evanj.

I would like to add (for googling purposes) that I got the following errors inconsistently:

python 3.6.5
google-api-core==1.2.1
google-auth==1.5.0
google-cloud-core==0.28.1
google-cloud-storage==1.10.0
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /download/storage/v1/b/xxx?alt=media (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2273)'),))

ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2273)
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /download/storage/v1/b/xxx?alt=media (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2273)'),))

ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2273)

Passing a new client to each new process solved this.

@RomHartmann

Passing a new client to each new process solved this.

Are you using multiprocessing, or threading?

@tseaver Both.

I think it's best if I quickly also describe what I was doing.
The google.cloud.storage Python API does not seem to offer a better way to download a large number of small files than creating each blob individually (blob = bucket.blob(name); blob.download_as_string()), which is super slow. In contrast, gsutil -m cp x y is really quick, but all metadata is lost.
I needed that metadata as well.

So as a workaround I list all blobs I want to download with gsutil ls -l and create batches based on file size. Each batch is then sent to a new process (using Python's multiprocessing.Pool), and within that process each blob is downloaded in its own thread (using threading.Thread), combining the result of blob.download_as_string() with blob.metadata into a dict.

When only multithreading, I got no errors sharing a single Client/Bucket (storage.Client().get_bucket(name)) across threads.
When I sent each batch to a new process and then multithreaded each blob within that batch, I got the above errors. Additionally, the whole batch would fail about 50% of the time, with subsequent batches usually connecting successfully (I think the process exited after an error was raised and the script terminated).
The problem was solved by creating and passing a new Bucket object to each process (rough sketch below).
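
For reference, a rough sketch of that structure — bucket/file names are placeholders and the actual batching by size is omitted:

import threading
from multiprocessing import Pool

from google.cloud import storage

def download_one(bucket, name, results):
    blob = bucket.get_blob(name)  # fetches the blob's metadata along with the handle
    results[name] = (blob.download_as_string(), blob.metadata)

def handle_batch(names):
    # Each process builds its own Client/Bucket after the fork; the threads
    # inside this process share that process-local bucket.
    bucket = storage.Client().get_bucket("my-bucket")
    results = {}
    threads = [threading.Thread(target=download_one, args=(bucket, n, results))
               for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    batches = [["a.png", "b.png"], ["c.png", "d.png"]]
    pool = Pool()
    all_results = pool.map(handle_batch, batches)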

@RomHartmann I'm not too surprised that the requests session pool, etc., might not function well across an os.fork() call. multiprocessing can paper over some of the differences between forking and threads, but not all of them.
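
If you want to be certain the client (and its underlying session) is created after the fork rather than inherited from the parent, one pattern is a Pool initializer — a hypothetical sketch, not something the library requires:

from multiprocessing import Pool

from google.cloud import storage

_client = None

def init_worker():
    # Runs once in each child process, after the fork, so the HTTP session
    # is never shared with the parent or with sibling processes.
    global _client
    _client = storage.Client()

def download(name):
    bucket = _client.get_bucket("my-bucket")
    return bucket.blob(name).download_as_string()

if __name__ == "__main__":
    pool = Pool(initializer=init_worker)
    data = pool.map(download, ["a.png", "b.png"])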

You need to create a new client connection for every pool / thread, inside the upload() function itself.
That will work.

from multiprocessing.pool import ThreadPool
from google.cloud import storage

def upload(i):
    # Create the client (and bucket handle) inside the worker, so nothing
    # is shared across threads.
    bucket = storage.Client().get_bucket("deepo-test")
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url

pool = ThreadPool()
pool.map(upload, range(2))