Google Cloud Storage failing when using threads
Closed this issue · 12 comments
- Ubuntu 16.04
- Python 2.7.6
- google-api-python-client>=1.6.2 and google-cloud-storage>=1.1.1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
ssl.SSLError: [Errno 1] _ssl.c:1429: error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
- The client is not thread safe (I think)
from multiprocessing.pool import ThreadPool
from google.cloud import storage
from functools import partial
def upload(bucket, i):
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url

bucket = storage.Client().get_bucket("deepo-test")
pool = ThreadPool()
fct = partial(upload, bucket)
pool.map(fct, [i for i in range(2)])
@Alexis-Jacob That's correct, the error you are seeing is caused by the lack of thread-safety in httplib2. We recommend (for now) creating an instance of Client that is local to your thread / process.
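A minimal sketch of that recommendation, assuming a threading.local cache (the get_bucket helper here is just illustrative, not part of the library):

import threading
from google.cloud import storage

# Hypothetical helper: each thread lazily builds and reuses its own Client,
# so the underlying httplib2 connection is never shared between threads.
_local = threading.local()

def get_bucket(name):
    if not hasattr(_local, "client"):
        _local.client = storage.Client()
    return _local.client.get_bucket(name)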
@Alexis-Jacob I am going to pre-emptively close this issue because it is "known" and something we are working on. If you'd like a thread-safe transport, I recommend looking into https://github.com/GoogleCloudPlatform/httplib2shim
@jonparrott Is there a "better" recommendation to make?
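For reference, httplib2shim is applied by monkey-patching httplib2 before any clients are constructed, roughly like this (a sketch, not an official example):

import httplib2shim
httplib2shim.patch()  # swap httplib2's connections for urllib3's thread-safe pool

from google.cloud import storage

client = storage.Client()  # now backed by the patched transport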
Wow, can we get this added to the top-level README.md, and added to the documentation for each of the clients? This causes a significant change in how I can use this library. I believe I'm running into this, and now I need to restructure my app to avoid passing the client or any created sub-objects around.
Additionally, httplib2shim doesn't seem to work with the recent updates: the google-auth library no longer uses httplib2, and when using it with google.cloud.storage I get the following exception:
File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 891, in upload_from_file
client, file_obj, content_type, size, num_retries)
File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 818, in _do_upload
client, stream, content_type, size, num_retries)
File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 768, in _do_resumable_upload
client, stream, content_type, size, num_retries)
File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 727, in _initiate_resumable_upload
total_bytes=size, stream_final=False)
File "/usr/local/lib/python2.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
total_bytes=total_bytes, stream_final=stream_final)
File "/usr/local/lib/python2.7/site-packages/google/resumable_media/_upload.py", line 410, in _prepare_initiate_request
if stream.tell() != 0:
AttributeError: addinfourl instance has no attribute 'tell'
Thanks for the instant response! I get that this is going to be fixed "soon", but this was a bit of a surprise for me to discover that this is a known issue for the current release that only seems to be documented in Github Issues. I would love it if the following page would say "Client is not thread-safe; do not use it between threads": https://googlecloudplatform.github.io/google-cloud-python/stable/storage-client.html
Once the bug is fixed, then the docs could be fixed :)
Also don't worry about the exception, I'm probably doing something weird or have some library version mismatch. I'm just going to fix my code to not re-use clients, since that seems like the more sane, documented solution at the moment.
I would like to add (for googling purposes) that I got the following errors inconsistently:
python 3.6.5
google-api-core==1.2.1
google-auth==1.5.0
google-cloud-core==0.28.1
google-cloud-storage==1.10.0
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /download/storage/v1/b/xxx?alt=media (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2273)'),))
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2273)
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /download/storage/v1/b/xxx?alt=media (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2273)'),))
ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2273)
Passing a new client to each new process solved this.
> Passing a new client to each new process solved this.

Are you using multiprocessing, or threading?
@tseaver Both.
I think it's best if I quickly also describe what I was doing.
The google-cloud-storage Python API does not seem to have a better way to download a bunch of small files than to individually create a blob (blob = bucket.blob(name); blob.download_as_x()), which is super slow. In contrast, gsutil -m cp x y is really quick, but all metadata is lost. I needed that metadata as well.
So as a workaround I fetch all the blobs I want to download with gsutil ls -l and create batches based on file size. Each batch is then sent to a new process (using Python's multiprocessing.Pool), and each blob in the batch is downloaded in its own thread (using Python's threading.Thread), calling blob.download_as_string() and combining the result with the blob.metadata dict.
When only multithreading, I got no errors passing a single Client/Bucket (storage.Client().get_bucket(name)) to each thread.
When I sent each batch to a new process and then multithreaded each blob in that batch, I got the above errors. Additionally, that whole batch would fail about 50% of the time, with subsequent batches usually succeeding to connect (I think the process exited after an error was raised and the script terminated).
The problem was solved by creating and passing a new Bucket object to each process.
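For anyone hitting the same thing, here is a simplified sketch of what ended up working (bucket and blob names are made up, and I've left out the per-blob threads for brevity):

import multiprocessing
from google.cloud import storage

_bucket = None  # per-process Bucket, created in the initializer

def init_worker(bucket_name):
    # Runs once in each worker process, so nothing HTTP-related crosses the fork.
    global _bucket
    _bucket = storage.Client().get_bucket(bucket_name)

def download_blob(name):
    blob = _bucket.blob(name)
    return name, blob.download_as_string(), blob.metadata

if __name__ == "__main__":
    names = ["a.png", "b.png"]  # in practice these come from the gsutil ls -l batches
    pool = multiprocessing.Pool(initializer=init_worker, initargs=("my-bucket",))
    results = pool.map(download_blob, names)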
@RomHartmann I'm not too surprised that the requests session pool, etc., might not function well across an os.fork() call. multiprocessing can paper over some of the differences between forking and threads, but not all of them.
You need to create a new client connection for every pool / thread, inside the upload function itself. That will work:
from multiprocessing.pool import ThreadPool
from google.cloud import storage
def upload(i):
    # Creating the client inside the worker keeps the connection local to this thread.
    bucket = storage.Client().get_bucket("deepo-test")
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url

pool = ThreadPool()
pool.map(upload, [i for i in range(2)])