googleapis/java-storage

storage: memory leak while writing to GCS bucket

cmateo917 opened this issue · 6 comments

Client:

Storage

Environment:

Google Cloud Run

Java version:

java-17-amazon-corretto

Code Example:

most recent attempt:

BlobInfo blobInfo =
    BlobInfo.newBuilder(BlobId.of(bucketName, fileName))
        .setContentType(ContentType.APPLICATION_JSON.getMimeType())
        .build();
try (Storage storage = StorageOptions.newBuilder().setProjectId(projectId).build().getService();
     WriteChannel writer = storage.writer(blobInfo);
     OutputStream out = Channels.newOutputStream(writer)) {

  writer.setChunkSize(chunkSize);   // set below 256KB minimum
  contents.transferTo(out);

  log.info("Wrote file {} into {} bucket successfully.", fileName, bucketName);
}

previous attempt:

try (InputStream contents =
         new ByteArrayInputStream(PGPCryptoUtil.encrypt(response.getBytes(), pgpKey, null, true, true));
     Storage storage = StorageOptions.newBuilder().setProjectId(projectId).build().getService();
     WriteChannel writer = storage.writer(blobInfo)) {

  writer.write(ByteBuffer.wrap(contents.readAllBytes()));
  log.info("Wrote file {} into {} bucket successfully.", fileName, BUCKET_NAME);
}

The input streams in the latest attempt are closed before exit, as are the encryption utilities used to encrypt them.

While running the above snippets in the dev environment or during a stress test, allocated memory continuously increases until the instance is restarted. The service also failed after running out of Java heap memory.

The service handles roughly 2000 calls per day, with intervals between calls averaging out to almost 1 request per 1.4 seconds.

Expected behavior

Memory allocation stays constant

Actual behavior

The total allocated memory is continuously increasing

Screenshots

[screenshot: memory utilization over time]

This screenshot shows the memory increasing and plateauing around 44% of service memory utilization (1.5GB total memory) before failing with the stack trace below.

Steps to reproduce

  1. Run a stress test against the code (1000 requests at 2-5 second intervals); it should run out of heap memory no later than about 300 requests into the process.

Stack trace

Java heap space
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:64)
	at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:363)
	at com.google.cloud.storage.Buffers.allocate(Buffers.java:119)
	at com.google.cloud.storage.BufferHandle$$Lambda$744/0x00000008015acd40.apply(Unknown Source)
	at com.google.cloud.storage.BufferHandle$LazyBufferHandle.get(BufferHandle.java:93)
	at com.google.cloud.storage.BufferHandle$LazyBufferHandle.get(BufferHandle.java:51)
	at com.google.cloud.storage.DefaultBufferedWritableByteChannel.write(DefaultBufferedWritableByteChannel.java:87)
	at com.google.cloud.storage.StorageByteChannels$SynchronizedBufferedWritableByteChannel.write(StorageByteChannels.java:109)
	at com.google.cloud.storage.BaseStorageWriteChannel.write(BaseStorageWriteChannel.java:102)
	at com.b6tp.sentry.data.SentryDataServiceOperation.writeFileToStorage(SentryDataServiceOperation.java:112)
	at com.b6tp.sentry.data.SentryDataServiceOperation.saveSentryErrorData(SentryDataServiceOperation.java:84)
	at com.b6tp.ControllerExternal.saveSentryErrorData(ControllerExternal.java:24)
	at com.b6tp.$ControllerExternal$Definition$Exec.dispatch(Unknown Source)
	at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invokeUnsafe(AbstractExecutableMethodsDefinition.java:447)
	at io.micronaut.context.DefaultBeanContext$BeanContextUnsafeExecutionHandle.invokeUnsafe(DefaultBeanContext.java:4214)
	at io.micronaut.web.router.AbstractRouteMatch.execute(AbstractRouteMatch.java:263)
	at io.micronaut.http.server.RouteExecutor$$Lambda$601/0x00000008014f7b40.get(Unknown Source)
	at io.micronaut.http.context.ServerRequestContext.with(ServerRequestContext.java:74)
	at io.micronaut.http.server.RouteExecutor.executeRouteAndConvertBody(RouteExecutor.java:480)
	at io.micronaut.http.server.RouteExecutor.lambda$callRoute$6(RouteExecutor.java:457)
	at io.micronaut.http.server.RouteExecutor$$Lambda$597/0x00000008014f5f18.get(Unknown Source)
	at io.micronaut.core.execution.ExecutionFlow.lambda$async$1(ExecutionFlow.java:87)
	at io.micronaut.core.execution.ExecutionFlow$$Lambda$598/0x00000008014f74a8.run(Unknown Source)
	at io.micronaut.core.propagation.PropagatedContext.lambda$wrap$3(PropagatedContext.java:211)
	at io.micronaut.core.propagation.PropagatedContext$$Lambda$599/0x00000008014f76d0.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

Any additional information below

Maybe related to google-cloud-go issue 8216?

There are a few things at play here that may appear to be a memory leak but in fact are not.

  1. I see that in one of your samples you are setting the chunkSize and in the other you are not. Our default chunkSize for a WriteChannel is 16MiB. If you know your objects are smaller, you should set this smaller (with writer.setChunkSize). 16MiB is a good default for objects that are around 32MiB or more, which a lot of objects in GCS are.
    1. The larger buffers can also require a full GC rather than concurrent/background garbage collections.
    2. 256KiB is a hard minimum imposed by GCS when performing a resumable upload[1]; passing a value smaller than that will effectively set the chunkSize to 256KiB.
  2. Converting from a channel to an output stream, or from an input stream to a channel, generally means a buffer must be allocated to do the conversion. Sticking with channels removes this intermediary copy buffer (see the sketch after this list).
  3. The garbage collector is ultimately the thing that decides when to free up memory in the Java process. Even when objects are out of scope and no longer referenced, the garbage collector could let them live for a while longer.
  4. You may need to pass some -Xms and -Xmx flags to your running process so it knows how much memory it has to work with. (I'm not familiar with the Amazon JVM's defaults; you would need to determine that yourself.)
  5. I know of library users performing tens of thousands of uploads from JVMs with 1G heaps without OOM issues.
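
To make points 1 and 2 concrete, here is a minimal, untested sketch (not from your code) that sets the chunkSize explicitly and writes straight to the WriteChannel with a small reusable ByteBuffer instead of wrapping it in an OutputStream; the method name and parameters are hypothetical:

import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

static void writeViaChannel(Storage storage, BlobInfo blobInfo, InputStream contents)
    throws IOException {
  try (ReadableByteChannel src = Channels.newChannel(contents);
       WriteChannel writer = storage.writer(blobInfo)) {
    writer.setChunkSize(256 * 1024); // 256KiB, the resumable-upload minimum
    ByteBuffer buffer = ByteBuffer.allocate(64 * 1024); // small copy buffer we control
    while (src.read(buffer) >= 0 || buffer.position() != 0) {
      buffer.flip();        // switch the buffer to draining mode
      writer.write(buffer); // the WriteChannel buffers internally up to chunkSize
      buffer.compact();     // retain any unwritten bytes for the next pass
    }
  }
}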

Possibly useful alternatives:

  1. If you are in fact uploading a file from disk, you might prefer to use our method com.google.cloud.storage.Storage#createFrom(BlobInfo, Path, BlobWriteOption...)[2]. Because we know we're operating on a local file rather than an arbitrary stream, certain optimizations can be applied.
  2. If you are uploading bytes from a process, you might prefer to use our method com.google.cloud.storage.Storage#create(BlobInfo, byte[], BlobTargetOption...)[3]. Using this method will not copy the bytes; they are passed directly to the HTTP library. (A short sketch of both alternatives follows this list.)
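
A minimal sketch of the two alternatives, assuming the object is either already on disk or already fully in memory; `jsonPath` and `encryptedBytes` are placeholder names, not from your code:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import java.io.IOException;
import java.nio.file.Path;

// Alternative 1: the object already exists as a local file.
static Blob uploadFile(Storage storage, BlobInfo blobInfo, Path jsonPath) throws IOException {
  return storage.createFrom(blobInfo, jsonPath);
}

// Alternative 2: the object is already fully in memory as a byte[].
static Blob uploadBytes(Storage storage, BlobInfo blobInfo, byte[] encryptedBytes) {
  return storage.create(blobInfo, encryptedBytes);
}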


thanks for the reply @BenWhitehead

The previous attempt (with ByteBuffer) was abandoned for channels for the reason you explained in #2. I read in the docs that you can set any value that is not zero, which I did to obtain the 256KB minimum chunk size. However, this only seems to delay the memory issue.

Also, for part #4, we did introduce the following Docker flags in an attempt to allot more RAM to the service on initialization (set minimum RAM to 512MB, set max to 70%), but this also did not resolve the issue:
ENV JAVA_OPTIONS="-Xms512m -XX:MaxRAMPercentage=70.0 -XX:InitialRAMPercentage=70.0"

Finally, we did also attempt to use com.google.cloud.storage.Storage#createFrom(BlobInfo, Path, BlobWriteOption...), again for the reason you highlighted. This too fell short of resolving the issue.

The previous attempt (with ByteBuffer) was abandoned for channels for the reason you explained in #2. I read in the docs that you can set any value that is not zero, which I did to obtain the 256KB minimum chunk size. However, this only seems to delay the memory issue.

256KiB is the minimum chunk size GCS will accept while still allowing a resumable session to be appended to again. Any value provided to writeChannel.setChunkSize that is < 256KiB will be rounded up to 256KiB, the nearest valid boundary.

Also, for part #4, we did introduce the following Docker flags in an attempt to allot more RAM to the service on initialization (set minimum RAM to 512MB, set max to 70%), but this also did not resolve the issue: ENV JAVA_OPTIONS="-Xms512m -XX:MaxRAMPercentage=70.0 -XX:InitialRAMPercentage=70.0"

If you want some more insight into the heap and GC activity in the JVM, adding -Xlog:gc=trace,gc+heap*=debug,gc+cpu*=info,gc+heap+region=info will give a pretty comprehensive idea of what is going on, and should show how full the regions are. These GC logs aren't very large either, roughly 10KB to 20KB per minute.
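
For example, appended to the JAVA_OPTIONS line you shared above (assuming your base image reads that variable), it would look like:

ENV JAVA_OPTIONS="-Xms512m -XX:MaxRAMPercentage=70.0 -XX:InitialRAMPercentage=70.0 -Xlog:gc=trace,gc+heap*=debug,gc+cpu*=info,gc+heap+region=info"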

Finally, we did also attempt to use com.google.cloud.storage.Storage#createFrom(BlobInfo, Path, BlobWriteOption...), again for the reason you highlighted. This too fell short of resolving the issue.

I'm surprised by this; when using this method there should be minimal allocation, and no allocation from the com.google.cloud.storage methods, to move the bytes from the file to the output stream to GCS (the code ultimately ends up using FileChannelImpl.transferTo built into the JDK). Did using this method allow more concurrent requests, or did it still fail around the same point? Would you be able to share the stack trace from the failure when using createFrom?

I'm surprised by this; when using this method there should be minimal allocation, and no allocation from the com.google.cloud.storage methods, to move the bytes from the file to the output stream to GCS (the code ultimately ends up using FileChannelImpl.transferTo built into the JDK). Did using this method allow more concurrent requests, or did it still fail around the same point? Would you be able to share the stack trace from the failure when using createFrom?

I believe the createFrom method did allow more time to elapse before complaining about Java heap space.

But also, on closer inspection, I removed the encryption we had in place, which "resolved" the memory increase. So the memory increase seems to be related to our encryption method, not actual object creation. Will report back if needed. Thanks for your recommendations/time!

Thanks for reporting back! I'm glad you were able to narrow in on the cause. Best of luck figuring out the encryption piece in your app.