ropensci/targets

Cache list_objects_v2() to speed up the file cue for cloud objects

wlandau opened this issue · 11 comments

Under the default settings for cloud storage, targets checks every target hash with its own AWS API call, which is extremely time-consuming. This is why https://books.ropensci.org/targets/cloud-storage.html recommends tar_cue(file = FALSE) for large pipelines on the cloud. This is fine if you're not manually modifying objects in the bucket, but it is not ideal. It would be better to find a safer way to speed up targets when it checks that cloud objects are up to date.

Previously I posted #1131. Versioning might not be a problem if we assume most of the objects are in their current version most of the time. However, list_objects_v2() operates on whole prefixes, which might slow us down because it operates on more objects than we really need. And then there's pagination to contend with. This functionality is worth revisiting, but the ideas I have so far range from painful to infeasible.

For this to work, I think I will need to switch to using ETags as hashes instead of the targets custom hash in the metadata. I think the reason I didn't do this initially was that I didn't know S3 offers strong read-after-write consistency.

Roadmap for AWS:

  • Implement and test aws_s3_list() in the utils. Remember pagination.
  • Switch to ETags.
  • Modify store_aws_hash() to use a cache. This function should only be called locally in the central controlling R session. I could put guardrails to make sure that stays the case.
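A minimal sketch of the paginated listing from the first roadmap item, not the actual targets implementation. It assumes a client shaped like paws.storage::s3(), whose list_objects_v2() returns Contents, IsTruncated, and NextContinuationToken. The name aws_s3_list_sketch() and the fake client are hypothetical and exist only so the example runs offline.

```r
# Hypothetical sketch: collect the ETag of every object under a prefix,
# following pagination until the listing is no longer truncated.
aws_s3_list_sketch <- function(client, bucket, prefix) {
  etags <- character(0)
  token <- NULL
  repeat {
    args <- list(Bucket = bucket, Prefix = prefix)
    if (!is.null(token)) {
      args$ContinuationToken <- token
    }
    page <- do.call(client$list_objects_v2, args)
    for (object in page$Contents) {
      etags[[object$Key]] <- object$ETag
    }
    if (!isTRUE(page$IsTruncated)) {
      break
    }
    token <- page$NextContinuationToken
  }
  etags
}

# Assumption: an offline stand-in for the real client that serves two pages.
fake_client <- list(
  list_objects_v2 = function(Bucket, Prefix, ContinuationToken = NULL) {
    if (is.null(ContinuationToken)) {
      list(
        Contents = list(list(Key = "objects/x", ETag = "\"aaa\"")),
        IsTruncated = TRUE,
        NextContinuationToken = "page2"
      )
    } else {
      list(
        Contents = list(list(Key = "objects/y", ETag = "\"bbb\"")),
        IsTruncated = FALSE
      )
    }
  }
)

listing <- aws_s3_list_sketch(fake_client, "bucket", "objects/")
```

The result is a named character vector (keys as names, ETags as values), which is one possible shape for the cache that store_aws_hash() would consult.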

Unfortunately list_objects_v2() does not return version IDs, and list_object_versions() returns too much information (never just the most current objects). So it looks like this caching will not be version-aware and will have to fall back on HEAD requests if you git reset your way back to historical metadata.

For GCS, it might be good to just switch to ETags for the next release, then wait for cloudyr/googleCloudStorageR#179.

Hmm.... I don't think we need to switch to ETags for hashes. We could just store the ETag as part of the metadata and use ETags instead of versions to corroborate objects.

I thought this through a bit more, and unfortunately this batched caching feature no longer seems feasible.

As I said before, list_object_versions() is not feasible because it lists all the versions of all the objects, without any kind of guardrail to list e.g. only the most recent versions. Any given object could have thousands of versions, and so listing all the versions of all the objects is way too much.

On the other hand, neither list_objects() nor list_objects_v2() lists version IDs at all, so it is impossible to confirm that the version listed in the metadata actually exists or is current. For example, suppose you revert to a historical copy of the metadata, and you see version ABC and ETag XYZ for target x. The bucket's current version could have ETag XYZ, but version ABC may no longer exist. (It might have been automatically deleted by the object retention policy, for instance.)

These and similar problems are impossible to reconcile unless:

  1. targets sends a HEAD request for each individual object, as it currently does, or
  2. sends a batched API request with a list of key-version pairs to learn whether each one exists.

(2) seems impossible, so I think we have to stick with (1).
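To make the cost of option (1) concrete, here is a rough sketch of one HEAD request per key-version pair. It assumes a client that exposes head_object(Bucket, Key, VersionId) the way paws.storage::s3() does. The helper name, the fake client, and the request counter are hypothetical and only there so the example runs without AWS credentials.

```r
# Hypothetical helper: TRUE if this exact key-version pair exists.
# Treating every error as "missing" is a simplification: real code should
# distinguish a 404 from permission or network failures.
aws_s3_version_exists_sketch <- function(client, bucket, key, version) {
  tryCatch(
    {
      client$head_object(Bucket = bucket, Key = key, VersionId = version)
      TRUE
    },
    error = function(condition) FALSE
  )
}

# Assumption: an offline stand-in for the real client that counts requests
# and only knows about version "ABC".
requests <- 0L
fake_client <- list(
  head_object = function(Bucket, Key, VersionId) {
    requests <<- requests + 1L
    if (!identical(VersionId, "ABC")) {
      stop("404 Not Found")
    }
    list(ETag = "\"XYZ\"", VersionId = VersionId)
  }
)

records <- list(
  list(key = "objects/x", version = "ABC"),
  list(key = "objects/y", version = "DEF")
)
found <- vapply(
  records,
  function(record) {
    aws_s3_version_exists_sketch(
      fake_client, "bucket", record$key, record$version
    )
  },
  FUN.VALUE = logical(1)
)
```

The number of requests grows linearly with the number of targets, which is the slowdown this issue is trying to avoid.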

I tried to send a feature request through the AWS feedback form, but it's glitchy today:

I am writing an R package which needs to check the existence of a specific version of each AWS S3 object in its data store. The version of a given object is the version ID recorded in the local metadata, and the recorded version may or may not be the most current version in the bucket. Currently, the package accomplishes this by sending a HEAD request for each relevant object-version pair.

I would like a more efficient/batched way to do this for each version/object pair. list_object_versions() returns every version of every object of interest, which is way too many versions to download efficiently, and neither list_objects() nor list_objects_v2() returns any version IDs at all. It would be great to have something like delete_objects(), but instead of deleting the objects, it would accept the supplied key-version pairs and return the ETag and custom metadata of each one that exists.

c.f. https://repost.aws/questions/QUe-yNsIr0Td2aq2oA1RAQdQ/hudi-and-s3-object-versions

Note to self: if it ever becomes possible to revisit this issue, I will probably need to switch targets to use AWS/GCS ETags when available instead of custom local file hashes. The switch is as simple as this:

  1. In store_upload_object_aws(), remove the targets-hash custom metadata:

metadata = list("targets-hash" = store$file$hash),

  2. In store_upload_object_aws(), write store$file$hash <- digest_chr64(head$ETag) just above the following line:

store$file$path <- c(path, paste0("version=", head$VersionId))

  3. At the end of store_aws_hash(), return digest_chr64(head$ETag) instead of head$Metadata[["targets-hash"]].
  4. Test that the correct ETags get to the metadata and the correct ETags are being retrieved by store_aws_hash() to assert that up-to-date targets are indeed up to date.
  5. Repeat all the above for GCS.

Taking a step back: this is actually feasible if targets can ignore version IDs. There could be a tar_option_set()-level option to either check or ignore version IDs. Things to consider:

  • Should the option be at the level of tar_option_set() and not tar_target()? At first glance, I think so because caching happens in bulk. Maybe the level of tar_resources_aws() could technically work, but those options are all implicitly target-level, which would be counterintuitive even with good documentation.
  • Should the version check still be enabled by default? I think so, for compatibility. But it will be slow.

Taking another step back: targets should:

  1. Always use the version ID when downloading data, and
  2. Always ignore the version ID when checking the hash.

(1) ensures behavior is clear, consistent, compatible, and version-aware. (2) ensures a target reruns if it is not the current object in the bucket. (2) also makes this issue so much easier to implement. And it lets us avoid adding a new version argument to tar_resources_aws(). The outcomes will be:

  1. Pipelines with cloud targets will run dramatically faster.
  2. The rules for checking/rerunning outdated targets will take into account which objects are the latest versions in the bucket. This makes more conceptual sense.
  3. Users won't need to do anything extra.
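Rule (2) above could be sketched roughly as a lookup against a cached listing, with version IDs left out of the comparison entirely. This is an illustration under assumptions, not the targets implementation: target_is_current_sketch() is a hypothetical name, and the listing is assumed to be a named character vector mapping keys to the ETags of the current objects in the bucket. Downloads under rule (1) would still pass the recorded version ID, which is not shown here.

```r
# Hypothetical check for rule (2): a target is up to date only if the ETag
# recorded in the metadata matches the ETag of the current object.
target_is_current_sketch <- function(listing, key, recorded_etag) {
  key %in% names(listing) &&
    identical(unname(listing[[key]]), recorded_etag)
}

# Assumption: a cached listing built once per pipeline run.
listing <- c("objects/x" = "\"XYZ\"", "objects/y" = "\"new\"")

# Recorded ETag matches the current object.
current_x <- target_is_current_sketch(listing, "objects/x", "\"XYZ\"")
# The object was overwritten in the bucket, so the target should rerun.
current_y <- target_is_current_sketch(listing, "objects/y", "\"old\"")
# The object is missing from the bucket, so the target should rerun.
current_z <- target_is_current_sketch(listing, "objects/z", "\"XYZ\"")
```

Because the check reads only the cached listing, it costs no extra API calls per target once the listing has been fetched.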