googlegenomics/gcp-variant-transforms

Add an option to disable sharding

arostamianfar opened this issue · 0 comments

The new version of Variant Transforms "forces" sharding. While sharding is definitely useful in some situations, it would be good to add an option to disable sharding. Example use cases:

  • The dataset is small enough that sharding just creates extra overhead.
  • To keep backwards compatibility with older versions of Variant Transforms: we have an old table that we keep appending data to (and we have a lot of queries built on top of the existing table), so we would like to keep using the non-sharded version, as query cost is not currently a concern.

I 'hacked' the code in our fork to make this work, and it seems feasible to merge a version of this upstream. Our change is missing the residual partition (which we don't have in our use case), but it may not be too difficult to include that edge case as well.

I'm not sure if any other users have asked for this feature, but please let me know your thoughts; I'm happy to discuss further :)