tensorflow/io

S3 read throughput slows down after hitting prefix limit

shaowei-su opened this issue · 1 comment

Environment:

tensorflow==2.8.0
tensorflow-io==0.25.0

S3 loading client: tf.data.TFRecordDataset.
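
Roughly, the input pipeline looks like the sketch below (the bucket name, shard layout, and the num_parallel_reads value are placeholders, not our exact setup):

```python
import tensorflow as tf
import tensorflow_io  # noqa: F401  # importing registers the s3:// filesystem scheme

# Hypothetical shard list; in practice all shards sit under one partitioned prefix.
filenames = [
    f"s3://my-bucket/training-data/shard-{i:05d}.tfrecord" for i in range(1024)
]

# TFRecordDataset reads the shards directly from S3; num_parallel_reads
# controls how many files are read concurrently.
dataset = tf.data.TFRecordDataset(
    filenames,
    num_parallel_reads=16,  # placeholder value
)
```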

Issue

By default, S3 limits GET/HEAD operations to 5,500 per second per partitioned prefix; once this limit is reached, read operations start returning 503 (Slow Down) errors. What we noticed is that once the client starts seeing 503 errors, the overall data loading speed drops and stays low for the remainder of the data loading process, even after the 503 errors stop occurring.

Question

Does the S3 client have retry logic for 503 errors? If not, would failed S3 GET/HEAD requests block the loading threads configured via the num_parallel_reads field (see the sketch below)? Thanks
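
For context, my understanding is that passing num_parallel_reads makes TFRecordDataset behave roughly like an interleave over per-file datasets, so a stalled read would tie up one of those parallel readers. A sketch of that equivalence, for illustration only (not the actual implementation):

```python
import tensorflow as tf

# Same hypothetical shard list as in the snippet above.
filenames = [
    f"s3://my-bucket/training-data/shard-{i:05d}.tfrecord" for i in range(1024)
]

NUM_PARALLEL_READS = 16  # placeholder value

# Roughly what TFRecordDataset(filenames, num_parallel_reads=N) does:
# read N files concurrently and interleave their records, so a GET/HEAD
# request that hangs on one file stalls one of these N readers.
dataset = tf.data.Dataset.from_tensor_slices(filenames).interleave(
    lambda path: tf.data.TFRecordDataset(path),
    cycle_length=NUM_PARALLEL_READS,
    num_parallel_calls=NUM_PARALLEL_READS,
)
```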