googleapis/java-storage

Storage.writer(...) breaks the idea of 'generation'

kohsuke opened this issue · 2 comments

Is your feature request related to a problem? Please describe.
I have an app that uploads a chunk of data to GCS, loads it into BigQuery, and then deletes the file from GCS. It does this many, many times over a period of time. Pseudocode below:

while (true) {
  try (var w = storage.writer(blobInfo)) {
    writeTo(w);
  }
  loadToBQ();
  storage.delete(blobInfo.getBlobId());
}

I noticed that some delete calls sporadically fail with 503 Service Unavailable. The error message suggests those errors are transient and should be retried. Looking at the GCS storage library code, I noticed there is a built-in retry mechanism to transparently handle this kind of situation: it treats delete as an idempotent operation if a generation match requirement is given (see HttpRetryAlgorithmManager.getForObjectsDelete()). That makes sense!
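
The rule described above can be sketched in plain Java. This is a stdlib-only illustration with hypothetical names (not the real HttpRetryAlgorithmManager code): a delete is only safe to retry automatically when a generation precondition pins it to one specific object version.

```java
// Hypothetical stand-in for the idempotency rule the retry manager applies.
final class DeleteRetrySketch {

    static boolean isIdempotent(Long generation) {
        // With a generation precondition, a retried delete can never remove a
        // newer version written in between, so the library may retry on 503.
        return generation != null;
    }

    // Simulates a delete that hits `transientFailures` 503s before succeeding.
    // Returns the number of attempts made when the operation is retryable.
    static int attemptDelete(Long generation, int transientFailures) {
        int attempts = 0;
        while (true) {
            attempts++;
            boolean failed = attempts <= transientFailures; // simulated 503
            if (!failed) {
                return attempts;
            }
            if (!isIdempotent(generation)) {
                // Without a generation, the library cannot safely retry,
                // so the 503 surfaces to the caller.
                throw new IllegalStateException("503 not retried: no generation");
            }
        }
    }

    public static void main(String[] args) {
        // With a generation, two transient 503s are absorbed transparently.
        System.out.println(attemptDelete(42L, 2)); // succeeds on attempt 3
        try {
            attemptDelete(null, 1);
        } catch (IllegalStateException e) {
            System.out.println("not retried: " + e.getMessage());
        }
    }
}
```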

Except there's no way to actually capture the generation of my write. I can see that internally the returned writer is a BlobWriteChannel, and its storageObject property represents the BlobInfo of the newly written blob, including its generation. That is how a method like Storage.create() can reliably return the Blob object that represents the state at the point of creation. But there seems to be no way to access the same information through the writer(...) methods. I consider this a library design problem.

Describe the solution you'd like
Storage.writer(...) should return a subtype of WriteChannel that can return the resulting Blob after its close() method is invoked.

Describe alternatives you've considered
Call Storage.get(BlobId) after the write is done to obtain a fresh Blob from GCS separately. This risks a race condition: another writer could replace the object between my write and the get, so the fetched generation may not be the one I just wrote.
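
The race in the "get after write" workaround can be shown with a stdlib-only sketch (hypothetical names; the counter stands in for the server-assigned generation): if a concurrent writer replaces the object between our write and our get, the fetched generation belongs to the other writer's blob, and deleting by that generation would destroy their data.

```java
import java.util.concurrent.atomic.AtomicLong;

// Stdlib sketch of the race in "write, then get the generation".
final class GetAfterWriteRace {
    // Stand-in for the server: each write gets the next generation number.
    static final AtomicLong latestGeneration = new AtomicLong();

    static long write() {
        return latestGeneration.incrementAndGet(); // server assigns a generation
    }

    static long getLatest() {
        return latestGeneration.get(); // analogue of Storage.get(BlobId)
    }

    public static void main(String[] args) {
        long ours = write();        // our upload -> generation 1
        long theirs = write();      // a concurrent upload sneaks in -> generation 2
        long fetched = getLatest(); // our "get after write" now sees 2, not 1
        System.out.println(ours + " " + theirs + " " + fetched);
    }
}
```

This is why capturing the generation from the write itself, rather than fetching it afterwards, is the only race-free option.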

Additional context
#691 appears to be somewhat related, in the sense that it also asks for additional information beyond what WriteChannel exposes.

In the next release (mid-April; we're currently in a code freeze for Cloud Next), we will have a new experimental API that allows access to the resulting object of an upload.

This new API is called BlobWriteSession and can be dropped in with minimal change.

Your example would change to the following:

Storage storage = StorageOptions.http().build().getService();

BlobInfo info = BlobInfo.newBuilder("bucket", "object").build();

while (true) {
  // new experimental API
  BlobWriteSession session = storage.blobWriteSession(info); // create the upload session for the object
  ApiFuture<BlobInfo> resultInfo = session.getResult(); // a Future for the object created when the WritableByteChannel below is closed
  try (WritableByteChannel w = session.open()) { // open the channel for writing
    writeTo(w); // write to the channel the same as before
  }
  // get the object with generation from the future
  BlobInfo gen1 = resultInfo.get(5, TimeUnit.SECONDS);

  // issue the delete operation, now with a generation
  storage.delete(gen1.getBlobId());
}
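
The handoff the sample relies on can be illustrated with a stdlib-only sketch (hypothetical types, not the real BlobWriteSession): the result future may be obtained before any bytes are written, and it is completed when close() runs, so reading it after the try-with-resources block is safe.

```java
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.CompletableFuture;

// Stdlib sketch of a session whose result future completes on close().
final class SessionSketch {
    private final CompletableFuture<Long> result = new CompletableFuture<>();
    private long bytesWritten = 0;

    // Safe to call before open()/write(): the future is just not done yet.
    CompletableFuture<Long> getResult() {
        return result;
    }

    WritableByteChannel open() {
        return new WritableByteChannel() {
            private boolean open = true;

            @Override
            public int write(ByteBuffer src) {
                int n = src.remaining();
                bytesWritten += n;
                src.position(src.limit()); // consume the buffer
                return n;
            }

            @Override
            public boolean isOpen() {
                return open;
            }

            @Override
            public void close() {
                open = false;
                // Finalize on close, analogous to the upload finalizing and
                // the session's result future resolving to the created object.
                result.complete(bytesWritten);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        SessionSketch session = new SessionSketch();
        CompletableFuture<Long> r = session.getResult(); // obtained up front
        try (WritableByteChannel w = session.open()) {
            w.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
        }
        System.out.println(r.join()); // completed by close()
    }
}
```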

We decided not to change WriteChannel or Storage#writer because we also have other features, configured on StorageOptions, that influence how BlobWriteSessions work. With its default settings, the new BlobWriteSession performs the same retried resumable uploads that Storage#writer does.

While this is a new @BetaApi and could theoretically see breaking changes, the default settings are very unlikely to change (~98% confident they won't change from what will be present in the next release). The primary possibility of breaking changes is in some of the other settings that can change the type of upload performed.

Version 2.37.0 was released last week with the necessary plumbing for this code sample to work. libraries-bom should be released this week if you use that for version resolution.