googlegenomics/gcp-variant-transforms

Sample lookup optimized tables with `--append` flag


The way we create sample lookup optimized tables is inefficient. Consider the following typical Variant Transforms (VT) workflow:

  • Run VT for the first batch of VCF files (with --sample_lookup_optimized_output_table set and without --append).
  • Run VT for the second batch of VCF files (with --sample_lookup_optimized_output_table and --append set).
  • Run VT for the third batch of VCF files (with --sample_lookup_optimized_output_table and --append set).
    ...

Currently, we load data into sample optimized tables by querying the variant optimized tables, flattening the call column, and copying the result into the sample optimized tables. In this implementation (#606), each new run of VT with --append set re-reads all rows of the variant optimized tables, including rows loaded by earlier runs, and overwrites the sample optimized tables with write_disposition='WRITE_TRUNCATE'.

A more efficient implementation would flatten only the newly added rows and add them to the sample optimized tables with write_disposition='WRITE_APPEND'.
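
To make the difference concrete, here is a minimal in-memory sketch of the two strategies. It is a toy model, not the Variant Transforms code and not the BigQuery client API: plain Python lists stand in for tables, and `make_batch`, `flatten_calls`, `load`, and `run` are hypothetical helpers invented for illustration. Only the two write disposition values come from the description above. The sketch also assumes the newly added rows can be identified, which it does by simply passing the current batch.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def make_batch(batch_index: int, num_variants: int = 2) -> List[Row]:
    """Build hypothetical variant rows, each with a repeated `call` column."""
    return [
        {
            "reference_name": "chr1",
            "start_position": batch_index * 100 + i,
            "call": [
                {"sample_id": "s1", "genotype": [0, 1]},
                {"sample_id": "s2", "genotype": [1, 1]},
            ],
        }
        for i in range(num_variants)
    ]


def flatten_calls(variant_rows: List[Row]) -> List[Row]:
    """Emit one row per (variant, call) pair, i.e. flatten the call column."""
    return [
        {
            "reference_name": variant["reference_name"],
            "start_position": variant["start_position"],
            "sample_id": call["sample_id"],
            "genotype": call["genotype"],
        }
        for variant in variant_rows
        for call in variant["call"]
    ]


def load(dest: List[Row], rows: List[Row], write_disposition: str) -> None:
    """Toy stand-in for a load job: truncate-then-write or append."""
    if write_disposition == "WRITE_TRUNCATE":
        dest.clear()
    elif write_disposition != "WRITE_APPEND":
        raise ValueError(f"unsupported write disposition: {write_disposition}")
    dest.extend(rows)


def run(batches: List[List[Row]], incremental: bool) -> Tuple[List[Row], int]:
    """Simulate one VT run per batch; return the sample table and rows read."""
    variant_table: List[Row] = []
    sample_table: List[Row] = []
    variant_rows_read = 0
    for batch in batches:
        variant_table.extend(batch)  # every run after the first uses --append
        if incremental:
            # Proposed: flatten only the rows this run added.
            source, disposition = batch, "WRITE_APPEND"
        else:
            # Current (#606): flatten every row, then overwrite the table.
            source, disposition = variant_table, "WRITE_TRUNCATE"
        variant_rows_read += len(source)
        load(sample_table, flatten_calls(source), disposition)
    return sample_table, variant_rows_read


batches = [make_batch(i) for i in range(3)]
full_table, full_read = run(batches, incremental=False)
incr_table, incr_read = run(batches, incremental=True)
print(full_table == incr_table, full_read, incr_read)
```

In this model both strategies produce the same sample table, but the full rebuild reads every variant row again on each run, so its total read cost grows quadratically with the number of batches, while the incremental version reads each variant row once.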