googleapis/java-storage

Java-Storage: GCS object larger than 129 GB not downloading to local system

prabhunkl opened this issue · 4 comments

I am trying to download an object (a text file) of ~800 GB from a GCS bucket to a high-disk (10 TB) GKE pod. The Java download process gets stuck after downloading 129 GB. No exception is thrown.

I have a Google Support ticket open: ticket #45435795

Environment details

  1. Java API used: google-cloud-storage Version: 2.22.5
  2. OS type and version: CentOS 7.9.2009
  3. Java version: openjdk 11.0.19 2023-04-18 LTS
    OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-1.el7_9) (build 11.0.19+7-LTS)
    OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-1.el7_9) (build 11.0.19+7-LTS, mixed mode, sharing)
  4. version(s):

Steps to reproduce

  1. Have an object larger than 700 GB in a GCS bucket
  2. Use the Java code below to download the object to the local GKE pod

Code example

Download an object from the given GCS data bucket to the processing folder.

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import java.nio.file.Paths;

  public void downloadObject(String bucketName, String objectName, String destFilePath)
      throws ObjectNotFoundException {

    // Look up the object's metadata, then download its content to the destination path.
    Blob blob = storage.get(BlobId.of(bucketName, objectName));
    blob.downloadTo(Paths.get(destFilePath));

    log.info("Downloaded object " + objectName + " from bucket name " + bucketName + " to " + destFilePath);
  }

Stack trace

No stack trace available.

Apologies for the delay in responding.

For an object that large, you are likely running into a corner case around how operation deadlines are applied in conjunction with network socket timeouts.

For an object of such size, I would recommend the following workaround:

import com.google.cloud.ReadChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;


  public void downloadObject(String bucketName, String objectName, String destFilePath)
      throws ObjectNotFoundException, IOException {
    // no need to send an rpc for the metadata, start the ReadChannel directly from storage
    BlobId id = BlobId.of(bucketName, objectName);
    Path path = Paths.get(destFilePath);
    long read = -1;
    try (
        // open a reader which can have a very long lifetime
        ReadChannel r = storage.reader(id);
        // Create a Channel for the file the bytes should be written to
        WritableByteChannel w = Files.newByteChannel(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)
    ) {
      // disable internal buffering since ByteStreams.copy will buffer to write to the file
      r.setChunkSize(0);
      // move all the bytes; under the covers if the stream is broken it will be reestablished and pick up from where it left off
      read = com.google.common.io.ByteStreams.copy(r, w);
    }
    log.info("Downloaded object " + objectName + " from bucket name " + bucketName + " to " + destFilePath + " consisting of " + read + "Bytes");
  }

The ReadChannel is expected to have a long lifetime, so its primary deadline applies to establishing the underlying stream rather than to consuming all of the bytes.

I used a similar code sample to successfully download a 1 TiB object in ~114 minutes, for an amortized throughput of ~150 MiB/s (from a regional bucket to a GCE VM in the same region).

Hope this helps.

Thanks Ben. I used the code. However, after 129 GB of data transfer I did not see any further progress. The difference this time is that the JVM was still consuming CPU after the copied file reached 129 GB; earlier this was not the case.

Using the code from #2102 (comment) it's failing at roughly the same location in the object?

That's very odd. Do you have any customised values passed when you're constructing your StorageOptions? My test runs were using all defaults.
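
For reference, here is a minimal sketch of an all-defaults client, which is what my test runs used (the variable names are only illustrative); any setters added to StorageOptions.newBuilder() beyond this, such as retry or transport settings, are the customisations that could change download behaviour:

import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

// All-defaults construction: credentials and project are resolved from the environment.
Storage storage = StorageOptions.getDefaultInstance().getService();

// Equivalent builder form; any options configured here are the values worth comparing.
Storage storageFromBuilder = StorageOptions.newBuilder().build().getService();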

One thing you might be able to do to help with further diagnosis would be to enable HTTP request logging[1] -- specifically the com.google.api.client.http.HttpTransport category/logger/appender (depending on your logging framework).
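
For example, with java.util.logging (other frameworks wire this up differently) a sketch along these lines should surface the request lines shown below; see [1] for the full configuration options:

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Route com.google.api.client.http.HttpTransport records to the console.
ConsoleHandler handler = new ConsoleHandler();
handler.setLevel(Level.ALL);

// CONFIG is sufficient to log the request line and headers shown below.
Logger httpLogger = Logger.getLogger("com.google.api.client.http.HttpTransport");
httpLogger.setLevel(Level.CONFIG);
httpLogger.addHandler(handler);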

With request logging enabled I'd expect to see something like:

GET https://storage.googleapis.com/download/storage/v1/b/MY_BUCKET_NAME/o/MY_OBJECT_NAME?alt=media
Accept-Encoding: gzip
Authorization: Bearer ya29.c...
User-Agent: gcloud-java/2.23.0 Google-API-Java-Client/2.2.0 Google-HTTP-Java-Client/1.43.3 (gzip)
x-goog-api-client: gl-java/1.8.0 gdcl/2.2.0 linux/6.1.25 gccl-invocation-id/b606a52d-ecce-4bcd-8c57-37c62890e101
x-goog-gcs-idempotency-token: b606a52d-ecce-4bcd-8c57-37c62890e101

If the request is retried it should then also include a Range header with the offset to restart from:

GET https://storage.googleapis.com/download/storage/v1/b/MY_BUCKET_NAME/o/MY_OBJECT_NAME?alt=media
Accept-Encoding: gzip
Authorization: Bearer ya29.c...
User-Agent: gcloud-java/2.23.0 Google-API-Java-Client/2.2.0 Google-HTTP-Java-Client/1.43.3 (gzip)
x-goog-api-client: gl-java/1.8.0 gdcl/2.2.0 linux/6.1.25 gccl-invocation-id/b606a52d-ecce-4bcd-8c57-37c62890e101
x-goog-gcs-idempotency-token: b606a52d-ecce-4bcd-8c57-37c62890e101
Range: bytes=138512695296-

Would it be at all possible for you to test the download outside a GKE pod, on a standalone GCE VM? I'm not sure whether there is anything the GKE pod could be doing to the output that causes backpressure. If you aren't able to try on a GCE VM, would it be possible for you to share your GKE pod configuration so we can try to replicate it on our side?

[1] https://googleapis.github.io/google-http-java-client/http-transport.html#logging

Closing due to inactivity. The workaround in #2102 (comment) can be used until such time as the implementation of downloadTo is able to change.