knausb/vcfR

Heterozygous SNP calling

Closed this issue · 15 comments

Hello,

I have recently generated a vcf file containing autosomal SNP data for diploid organisms using the samtools mpileup | bcftools call pipeline. Because these individuals were genotyped in the past, I know what I expect to see in the vcf file.
Although bcftools predicts genotypes, I was wondering whether there is a way in vcfR to exploit the "AD" information to assess site heterozygosity? I would like to set a threshold of 0.2, above which a SNP site can be considered heterozygous. So far I have imported my vcf and extracted "AD" information:

ad<-extract.gt(vcf,element="AD")
ad1 <- masplit(ad, record = 1)
ad2 <- masplit(ad, record = 2)

Now, I would like to know if there is a way to tell the software to call heterozygous all those sites where ad1/ad2 or ad2/ad1 > 0.20?

Thank you very much.

Hi, we've put some similar information on the ploidy section of the documentation. Please take a look and let me know if this helps. If not, perhaps start with a mre. Thanks!

Hi,

Is the following what you're looking for?

# Load libraries
library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0.9999 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
library(pinfsc50)

# Determine file locations
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz",
                        package = "pinfsc50")

# Read data into memory
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 22,031 variants
#> Object size: 22.4 Mb
#> 7.929 percent missing data
#> *****        *****         *****

ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

gt <- extract.gt(vcf)

table(gt[ ad1 > 0.2 & ad1 < 0.8 ])
#> 
#>   0|0   0|1   0|2   0|3   1|0   1|1   1|2   1|3   2|0   2|1   2|2   2|3   2|4 
#>    47 26194   317     8 25838    26   110     1   275    86     1     9     2 
#>   3|0   3|1   3|2   3|4   4|1   4|2 
#>     6     7     7     1     1     1

gt[ ad1 > 0.2 & ad1 < 0.8 & !is.na(ad1) ] <- "0/1" # Note that we do not know phase here.
gt[ ad2 > 0.2 & ad2 < 0.8 & !is.na(ad2) ] <- "0/1" # Note that we do not know phase here.
table(gt[ ad1 > 0.2 & ad1 < 0.8 ])
#> 
#>   0/1 
#> 52937

Created on 2022-03-15 by the reprex package (v2.0.1)

If you're looking to paste the genotypes back to your VCF data you might try this page.

Hi,

many thanks once again for the reply.

Sorry, I have just realised my question was a bit confusing. I am trying to find a way to tweak vcfR to call heterozygous those SNPs for which the ratio allele1/allele2 and allele2/allele1 is >0.2
So instead of looking at ad1 = allele1/(allele1+allele2) and ad2 = allel2/(allele1+allele2), I would rather look at allele1/allele2 and allele2/allele1 and then paste the genotypes back to my VCF.

Why don't you try the suggestion of creating a minimal reproducible example? You can use my above code as a starting point.

Can you modify the lines of code that I've already provided to get the result you want?

In the VCF specification unphased genotypes are coded with the forward slash (0/1) while phased data is treated with the pipe (0|1). We do not know the order of unphased data so we can use either 0/1 or 1/0 to code for heterozygotes. I haven't really worked with phased data before, but I think the short answer is that it will be more complicated. If your genotyping updating efforts result in you replacing a heterozygote with a heterozygote I think you do nothing. But if you want to change a homozygote to a heterozygote then I think that might be something to think about. A phased homozygote (0|0) is actually uninformative about phase. If you reverse the alleles of a homozygote you still have a homozygote. So I don't think you'd have the information necessary to call phase. So I think you'll have to use an unphased heterozygote as your only option. Perhaps you can rephase the data after your update? Does this make sense?

Hi, I'm not sure I follow those last few lines of code. Perhaps you should review them to make sure you understand what they're attempting to accomplish. Without a complete example, including data, it's hard to tell. The code appears to change genotypes that are within your specification, ad1 > 0.2 & ad1 < 0.8 and "ad2 > 0.2 & ad2 < 0.8, but does not manipulate genotypes outside this specification. Is it possible that you have heterozygotes in the data you began with?

Can you initialize the matrix as homozygous

gt[!is.na(gt)] <- "0/0"

and then add your heterozygotes?

Hi @ettorefedele , I'm not going to write your code for you. That's your responsibility. You appear to be trying to create your own genotype caller. I suggest you look into how authors of other genotype callers have addressed these issues.