GoogleCloudDataproc/hadoop-connectors

Conversion from InputStream -> ByteBuffer on gRPC writes creates many byte[] allocations.

Hi team,

While investigating memory usage in the gRPC write path, I found that significant allocations come from the InputStream -> ByteBuffer conversion code in the GCS connector: https://storage.googleapis.com/anima-frank/large-writes-grpc/grpc_100_write_100MiB_t_4_profile.html

Note: the workload runs Fsbenchmark uploading 10k 100MiB objects across 4 threads on an n2-standard-4 GCE instance using DirectPath.

~72% of allocations come from the InputStream -> ByteBuffer conversion, which creates two 2MiB byte[]s for every write.
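To make that pattern concrete, here is a minimal sketch of the conversion shape I'm describing (illustrative only; the class name and the 2MiB chunk size are assumptions, not the connector's actual code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Illustrative sketch only -- not the connector's actual code.
class AllocatingConverter {
  private static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MiB, assumed

  // Each call drains one chunk of the InputStream into a fresh byte[]
  // and wraps it in a ByteBuffer. If the layer below copies the bytes
  // again into its own buffer, that's two 2MiB allocations per write,
  // which matches the profile above.
  static ByteBuffer nextChunk(InputStream in) throws IOException {
    byte[] chunk = new byte[CHUNK_SIZE];            // fresh allocation per chunk
    int read = in.readNBytes(chunk, 0, CHUNK_SIZE); // blocking fill
    return ByteBuffer.wrap(chunk, 0, read);
  }
}
```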

Separately, java-storage contributes 19% of allocations; I'm digging into this as well. My current suspicion is that java-storage allocates a buffer per upload rather than per message.
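For illustration, this is the shape I suspect (an assumption about java-storage's internals, not verified against its source): one buffer allocated per upload and reused for each outgoing message, so allocations track the number of uploads rather than the number of messages.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hedged sketch of the suspected pattern -- not java-storage's actual code.
class PerUploadBuffer {
  private final ByteBuffer buffer;

  PerUploadBuffer(int capacity) {
    this.buffer = ByteBuffer.allocate(capacity); // one allocation per upload
  }

  // Reused for every message in this upload; no per-message byte[].
  ByteBuffer fillFrom(InputStream in) throws IOException {
    buffer.clear(); // position = 0, limit = capacity
    int read = in.readNBytes(buffer.array(), 0, buffer.capacity());
    buffer.limit(read);
    return buffer;
  }
}
```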

Update: I attempted a code change to work around this issue, but the overall wall time is not acceptable: for a sequential write of 10k 100MiB objects, the existing implementation takes around 2+ hours using gRPC DirectPath, while my prototype version is still running after 9+ hours.

@arunkumarchacko could you investigate alternatives?

cc: @schannahalli, @danielduhh

This issue was fixed once we moved away from the pipe.
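For the record, a hedged sketch of the two write-path shapes involved (assumed for illustration; not the actual connector or java-storage code). The pipe-based shape forces the consumer to copy the InputStream side into a byte[] per message to build a ByteBuffer; handing ByteBuffers to the uploader directly removes that conversion and its copies.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.ByteBuffer;

// Assumed shapes for illustration only.
class WritePathShapes {
  // Before: the producer writes into a pipe; the consumer reads the
  // InputStream side and must copy into a byte[] to build each message.
  static ByteBuffer viaPipe(byte[] data) throws IOException {
    PipedOutputStream sink = new PipedOutputStream();
    try (PipedInputStream source = new PipedInputStream(sink, data.length)) {
      sink.write(data); // producer side (e.g. Hadoop OutputStream.write)
      sink.close();
      byte[] copy = new byte[data.length]; // per-message allocation
      int read = source.readNBytes(copy, 0, copy.length);
      return ByteBuffer.wrap(copy, 0, read);
    }
  }

  // After: hand the bytes to the uploader as a ByteBuffer directly;
  // no pipe and no intermediate byte[] per message.
  static ByteBuffer direct(byte[] data) {
    return ByteBuffer.wrap(data);
  }
}
```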