googlegenomics/gcp-variant-transforms

Enable bq_to_vcf to drop irrelevant variants when `--sample_names` is set

Suppose bq_to_vcf is run with `--sample_names S1 S2` and the following BQ table is given as the source table (each variant is listed with the samples that have a call for it):

v1 [S1, S2, S3, S4]
v2 [S1, S3, S4]
v3 [S2, S3, S4]
v4 [S3, S4]

Currently, the output VCF file will include all 4 variants:

     S1    S2
v1   x     x
v2   x     0/0
v3   0/0   x
v4   0/0   0/0

where x indicates the value we read from the BQ table. Including v4 in the output VCF file makes little sense, since none of the samples of interest has that variant; it should be dropped.
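A minimal sketch of the proposed filter, written in plain Python rather than as part of the actual pipeline: a variant is kept only if at least one requested sample has a call for it. The names `variants` and `keep_variant` are hypothetical and are not part of gcp-variant-transforms.

```python
# Hypothetical in-memory stand-in for the BQ table above:
# variant id -> names of the samples that have a call for it.
variants = {
    "v1": {"S1", "S2", "S3", "S4"},
    "v2": {"S1", "S3", "S4"},
    "v3": {"S2", "S3", "S4"},
    "v4": {"S3", "S4"},
}


def keep_variant(samples_with_calls, sample_names):
    """Return True if any requested sample has a call for the variant."""
    return not samples_with_calls.isdisjoint(sample_names)


requested = ["S1", "S2"]
kept = [v for v, samples in variants.items() if keep_variant(samples, requested)]
print(kept)  # ['v1', 'v2', 'v3']
```

With `requested = ["S1", "S2"]`, v4 is the only variant filtered out, which matches the behavior asked for here.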

@tneymanov to follow up on our conversation:
If a user runs bq_to_vcf on the previous example with `--sample_names S1 S2 S5`, our current output VCF file (without this issue fixed) is:

     S1    S2    S5
v1   x     x     0/0
v2   x     0/0   0/0
v3   0/0   x     0/0
v4   0/0   0/0   0/0

If we fix this issue, the output will not be empty; instead, it will be:

     S1    S2    S5
v1   x     x     0/0
v2   x     0/0   0/0
v3   0/0   x     0/0

which is still the desired output.
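The case with a sample that is absent from the table can be sketched the same way. This is an illustrative plain-Python sketch under stated assumptions, not the tool's implementation: `calls`, `build_rows`, and `MISSING` are hypothetical names, and `x` stands for whatever value is read from BQ, as in the tables above.

```python
MISSING = "0/0"  # Filler shown in the tables above for samples without a call.

# Hypothetical stand-in for the BQ table: variant id -> {sample: genotype}.
calls = {
    "v1": {"S1": "x", "S2": "x", "S3": "x", "S4": "x"},
    "v2": {"S1": "x", "S3": "x", "S4": "x"},
    "v3": {"S2": "x", "S3": "x", "S4": "x"},
    "v4": {"S3": "x", "S4": "x"},
}


def build_rows(calls, sample_names):
    """Yield (variant, genotypes) rows, skipping variants with no requested call."""
    for variant, by_sample in calls.items():
        if not any(name in by_sample for name in sample_names):
            continue  # No sample of interest has this variant: drop it.
        # Requested samples without a call (including unknown ones) get the filler.
        yield variant, [by_sample.get(name, MISSING) for name in sample_names]


rows = list(build_rows(calls, ["S1", "S2", "S5"]))
for variant, genotypes in rows:
    print(variant, *genotypes)
```

In this sketch the unknown sample S5 only adds a filler column; it never causes a variant to be kept or dropped on its own, so v1 to v3 remain and v4 is removed.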