Zilong-Li/vcfppR

vcfcomp returns NAs

kiran-lee opened this issue · 3 comments

Hi Zilong-Li,

Thank you for your software. I found it from your reply here.

I am currently trying to use vcfcomp to test concordance between some imputed genotypes and their original genotypes using the following script:
library(vcfppR)
res <- vcfcomp(test = "x.vcf.gz", truth = "y.vcf.gz", region="z|z", af="af.tsv", stats = "r2")
str(res)
write.table(res, file = "concordance", sep = "\t", col.names = T, row.names = T, quote = FALSE)

I used vcftools --freq to output the allele frequency for af.tsv.

However, this returns a dataframe fulls of NAs:

str(res)
List of 2
$ samples: chr [1:25] "/fastdata/bop21kgl/RawData/LIMS26076p5/Clean_aligned/418-2054_221014_L003__all_mapped_rehead.bam" "/fastdata/bop21$
$ r2 : logi [1, 1:17] NA NA NA NA NA NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:17] "(0,1e-05]" "(1e-05,2e-05]" "(2e-05,5e-05]" "(5e-05,0.0001]" ...
write.table(res, file = "concordance", sep = "\t", col.names = T, row.names = T, quote = FALSE)

proc.time()
user system elapsed
190.199 5.955 196.606

Do you have a suggestion of where this is going wrong?

Hi Kiran,

I think the input region is not valid, which should be in the form of chr:start-end, or just chr.

I'll add validators for the input in next release. Thanks for the feedback.

Also make sure the af.tsv is valid, which should have a header line and five columns. If the af is calculated from the truth file, you can just make vcfcomp doing the job by ignoring the af option ie af=NULL

Thank you so much for our email discussion troubleshooting my NAs.

We realised this was due to missing genotypes in the "truth" vcf file. Removing missing genotypes in the "truth" dataset beforehand produced a sensible r2 value. This was the script used:

library(vcfppR)

truth <- vcftable("x.vcf.gz", setid = T, info = F)
test <- vcftable("y.vcf.gz", setid = T, format = "DS", info = F)

N <- 22 ### number of samples is 22
truth$gt <- truth$gt[,1:N]
truth$samples <- test$samples[1:N]

remove sites with genotypes being NA in truth

w <- which(!is.na(rowSums(truth$gt)))
truth$gt <- truth$gt[w,]
truth[2:9] <- lapply(truth[2:9], function(a) a[w])
saveRDS(truth, file = "truth.rds")

names <- truth$samples[1:22]
samples <- paste0(names, collapse = ",")

res <- vcfcomp(test = "y.vcf.gz" ,
truth = "truth.rds",
stats = "r2",
names = names,
samples = samples,
bins = c(0,1))

str(res)
res