Sage-Bionetworks/Genie

t_depth, t_alt_count, t_ref_count calculation and validation

Closed this issue · 3 comments

  • We already require MAF files to include 2 of the three columns, but we allow null or blank values. This can be changed.
  • I noticed that Genome Nexus doesn't calculate t_depth, t_alt_count, or t_ref_count. If a variant has two of them (e.g. t_depth and t_alt_count), the third (t_ref_count) can be calculated from the other two.
  • Validating each center's VCF files isn't straightforward, because each center that provides VCFs does so a bit differently. I tried using a standard tool to read the VCF files and obtain the depth values, but many sites had VCF formatting issues that would potentially invalidate all of a site's VCF files.

From discussions with cBioPortal team:

cBio team: It actually can’t be calculated in all situations because there are multiple options (e.g. there could be a change to another base as well). For indels it can also be kind of complex to determine t_ref_count.

Me: Ah I see. Unfortunately, I really don't know enough about these three columns / the biology to comment further.
I just thought that as long as you have two of the three columns you can get the other value. However, from what you are saying, these values also change based on the variant type.

  • Are there any scenarios where we cannot recalculate these fields given 2 of the 3 values?
  • Are there also any scenarios where the equation t_alt_count + t_ref_count = t_depth is false?
  • Are there any scenarios where you can actually get all three values with only one or none of the three values specified?

So usually t_alt_count + t_ref_count ≈ t_depth.

In probably 99.9% of the cases it’s fine to assume they are equal. But e.g. for a SNV where ref is G, the base changes could be G->C, G->A, G->T. If you know the count of G->T and you know the total depth, you still can’t say with certainty what t_ref_count is, because you don’t know the counts of G->A and G->C.
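To make the multi-allelic case concrete, here is a small sketch with made-up read counts (the numbers and the `base_counts` name are purely illustrative, not taken from any GENIE data). It shows how subtracting the reported alt count from the depth overestimates t_ref_count when reads support other bases too:

```python
# Hypothetical read counts at one position where the reference base is G.
base_counts = {"G": 65, "T": 30, "A": 4, "C": 1}

t_depth = sum(base_counts.values())   # all reads covering the position
t_alt_count = base_counts["T"]        # reads supporting the reported G->T change
true_ref_count = base_counts["G"]     # reads supporting the reference base

# Naive estimate that assumes only the reference and the reported alt exist.
naive_ref_count = t_depth - t_alt_count

# The naive estimate is too high by the reads supporting G->A and G->C.
overestimate = naive_ref_count - true_ref_count

print(t_depth, t_alt_count, true_ref_count, naive_ref_count, overestimate)
```

With only t_depth and the G->T count in hand, there is no way to recover the reads that went to A or C, which is why the derived value is only an approximation.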

We won't validate this, but will show it in the dashboard. Furthermore, a missing value is calculated when 2 of the 3 values exist.
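For reference, a minimal sketch of what filling in the missing value could look like. This is not the actual Genie implementation: `fill_missing_count` is a hypothetical helper, `None` stands in for a null or blank value, and it assumes t_alt_count + t_ref_count = t_depth, which per the discussion above is only approximately true.

```python
from typing import Optional, Tuple

Counts = Tuple[Optional[int], Optional[int], Optional[int]]


def fill_missing_count(
    t_depth: Optional[int],
    t_alt_count: Optional[int],
    t_ref_count: Optional[int],
) -> Counts:
    """Fill in the one missing value, assuming depth = alt + ref.

    Hypothetical helper: returns the inputs unchanged unless exactly
    one of the three values is missing.
    """
    values = [t_depth, t_alt_count, t_ref_count]
    if sum(v is None for v in values) != 1:
        # Nothing to fill, or too little information to fill anything.
        return (t_depth, t_alt_count, t_ref_count)

    if t_depth is None:
        return (t_alt_count + t_ref_count, t_alt_count, t_ref_count)

    # Exactly one of the two counts is missing; derive it from the depth.
    known = t_ref_count if t_alt_count is None else t_alt_count
    derived = t_depth - known
    if derived < 0:
        # Inconsistent input (count larger than depth): leave it untouched.
        return (t_depth, t_alt_count, t_ref_count)

    if t_alt_count is None:
        return (t_depth, derived, t_ref_count)
    return (t_depth, t_alt_count, derived)


print(fill_missing_count(100, 30, None))
print(fill_missing_count(None, 30, 70))
```

Rows with zero or one known value, or with a count larger than the depth, are returned as-is, since there is nothing reliable to derive in those cases.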