ExpediaGroup/circus-train

S3S3Copier fails for large data sets


S3S3Copier fails for large data sets, for both partitioned and unpartitioned tables.

com.hotels.bdp.circustrain.api.CircusTrainException: Unable to replicate
	at com.hotels.bdp.circustrain.core.PartitionedTableReplication.replicate(PartitionedTableReplication.java:136)
	at com.hotels.bdp.circustrain.core.Locomotive.run(Locomotive.java:114)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:791)
	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:781)
	at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:307)
	at com.hotels.bdp.circustrain.CircusTrain.main(CircusTrain.java:87)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: com.hotels.bdp.circustrain.api.CircusTrainException: Error in S3S3Copier:
	at com.hotels.bdp.circustrain.s3s3copier.S3S3Copier.copy(S3S3Copier.java:101)
	at com.hotels.bdp.circustrain.core.PartitionedTableReplication.replicate(PartitionedTableReplication.java:121)
	... 12 more
Caused by: com.hotels.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 200; Error Code: InternalError; Request ID: 7EE3EE544247CBB2), S3 Extended Request ID: 7CbLnoxgeTt3+wfLuj6yckKDIxyUo2jEH4LMKDwE77qx5d8jjWYY9WHbIJpzHgcZ71OH4l2PnjY=
	at com.hotels.shaded.com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1866)
	at com.hotels.shaded.com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:146)
	at com.hotels.shaded.com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:134)
	at com.hotels.shaded.com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:133)
	at com.hotels.shaded.com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:44)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
	at java.lang.Thread.run(Thread.java:748)

After discussing with AWS, we learned that this is a known issue that is in their backlog. For now, they have suggested that we:

  • Catch the embedded 200 error code and perform a retry when it is received (a minimal sketch of such a retry follows below).
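
For illustration, here is a hedged sketch of what catching that embedded error and retrying might look like around a plain AWS SDK for Java copy call. It is not Circus Train's actual S3S3Copier code (which drives copies through the SDK's TransferManager, as the stack trace shows); the class name, the MAX_ATTEMPTS value and the back-off are assumptions made for the example.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.CopyObjectRequest;

// Hypothetical helper, not part of Circus Train: retries a copy when S3 returns
// the "HTTP 200 + InternalError" response seen in the stack trace above.
public final class RetryingS3Copy {

  private static final int MAX_ATTEMPTS = 3; // assumed retry budget

  public static void copyWithRetry(AmazonS3 s3, CopyObjectRequest request) throws InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        s3.copyObject(request);
        return;
      } catch (AmazonS3Exception e) {
        // The SDK surfaces the embedded error as an AmazonS3Exception with
        // error code "InternalError" (and, as above, status code 200).
        boolean retryable = "InternalError".equals(e.getErrorCode());
        if (!retryable || attempt >= MAX_ATTEMPTS) {
          throw e;
        }
        Thread.sleep(1000L * attempt); // simple linear back-off between attempts
      }
    }
  }
}
```

In Circus Train itself such a retry would presumably need to hook into the TransferManager copy results rather than a single copyObject call, but the error-code check would be the same.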

Further observations are as follows:

  • The default threshold limits almost always result in S3 internal errors. These defaults are 5 GB (multipart-copy-threshold) and 100 MB (multipart-copy-part-size) respectively. I think the reason is that we have a large number of files set for replication whose sizes range from the order of 100 MB up to a few GB, but stay below the 5 GB threshold that would trigger a multipart copy, so each file is copied in a single chunk.
  • Another important point is that this is only observed when we have a large table with a large number of big files. That increases the likelihood of at least one failure, and Circus Train fails when even one of the many files fails its S3 copy.
  • When multipart-copy-threshold and multipart-copy-part-size were reduced to 40 MB and 10 MB respectively, replications became much more successful. Lowering the threshold triggers multipart copies, so each object is split into smaller parts (see the sketch after this list). This does, however, increase the replication duration owing to the extra overhead of splitting and merging the parts.
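
For reference, those two options correspond to the multipart-copy settings of the AWS SDK's TransferManager. The sketch below expresses the tuned values directly against the SDK builder to show what the lowered thresholds control; the class and method names are mine, and this is not how Circus Train wires its copier options internally.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

// Hypothetical example of the tuned thresholds applied to a TransferManager.
public final class TunedTransferManager {

  public static TransferManager build() {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    return TransferManagerBuilder.standard()
        .withS3Client(s3)
        // Objects larger than 40 MB are copied as multipart copies
        // (the SDK default threshold is 5 GB, matching the behaviour above).
        .withMultipartCopyThreshold(40L * 1024 * 1024)
        // Each part of a multipart copy is at most 10 MB
        // (the SDK default part size is 100 MB).
        .withMultipartCopyPartSize(10L * 1024 * 1024)
        .build();
  }
}
```

In a Circus Train job these values would be supplied through the copier configuration rather than by building a TransferManager directly; the sketch only illustrates the effect of the two thresholds.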

Having a Circus Train fix that retries individual copies would be better than failing the entire job when even one of the smaller S3 copies fails. It would also make replication more efficient, since a transient failure would no longer force the whole replication to be retried.

Thanks for the insights. So to summarise, the belief is that these failures are transient, and that adding a retry would make the copier more resilient to this type of failure? If so, it would seem appropriate to add the retry mechanism, which I imagine would be fairly straightforward.

Yes, that is correct. A retry mechanism is expected to make it more resilient.