Integrating "Targeted Analysis of sequencing Reads for GenoTyping" (TARGT) Genotype Calls of Classical HLA Genes with the 1000 Genomes Project
This repo contains code associated with Pierini et al 202X. If you only use the code in this repo please cite TBD and if you use the updated TGP VCF file please cite TBD and Pierini, F., Nutsua, M., Böhme, L. et al. Targeted analysis of polymorphic loci from low-coverage shotgun sequence data allows accurate genotyping of HLA genes in historical human populations. Sci Rep 10, 7339 (2020).
Before integrating the genotype calls from the TARGT pipeline we first need to download the phase 3 realse of the TGP chromosome 6 VCF file and it's corresponding index, which can be done with the following code:
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
and:
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi
For our purposes we are only interested in the exons for the classical HLA class 1 and class 2 genes, using BCFtools we can filter the TGP VCF file to only contain our regions of interest with the following code (you can find the .bed
file in the resources
directory):
bcftools view -R hla_a_c_b_drb1_dqb1_exons.bed -Oz -o tgp_hla_a_c_b_drb1_dqb1_exons_unfiltered.vcf.gz ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Lastly, we can index tgp_hla_a_c_b_drb1_dqb1_exons_unfiltered.vcf.gz
using Tabix with the following code:
tabix -p vcf tgp_hla_a_c_b_drb1_dqb1_exons_unfiltered.vcf.gz
NOTE: All VCF files and their companion index files can be found in the resources
directory.
I have preformatted the TARGT tables (which can be found in the targt_tables
directory) to include the necessary header line information as specified by the VCF documentation here I make a couple of assumptions that should not affect down stream analysis:
-
Since we called genotypes using the TARGT pipeline we do not have variant IDs associated with our new calls, so I annotate the
ID
field as missing using a"."
-
Since the quality of each call was assed in the TARGT pipeline I annotate the filter statuts in the
FILTER
field asPASS
for every site
In converting the TARGT genotype calls to the VCF format you will notice that there are some missing genotype calls. I always denote missing genotype information with a "."
and I do not do any subsequent filtering of missing genotype calls. However the targt_tables_qc.py
will produce a file that has site level annotations of which individual(s) are missing genotype information.
Now that we have the assumptions out of the way let's run:
python3 targt_tables_qc.py
which will output four files that we will now go over.
Is a headerless (but otherwise perfectly formatted) VCF file consiting of ONLY the sites called by the TARGT pipeline.
Is the output from making sure that all samples in the targt_hla_gt_calls_no_header.vcf
are indentical and in the same order as the TGP data. If nothing is wrong with the samples or their order (and there is not) then the file will contain the following message:
All good Dave!
Is the output of checking that something did not go wrong when the TARGT pipeline was determining the reference allele, since both data sets were aligned to the hg19
reference assembly then theoretically the intersect of sites between the TARGT and TGP data should have the same refernce allele. This file has four columns in the following format:
position TGP_ref_allele TARGT_ref_allele agreement
where a value of 1
in the agreemnet column would represent that the reference allele strings are IDENTICAL between data sets and a value of 0
indicates that the refernce allele strings are NOT IDENTICAL between the two data sets. There are two positions (31324493
and 31324496
) where the reference alleles are not encoded the same way, but the actual reference alleles are the same! This discordance is all due to how VCF files encode deletion events—see section 5 in the VCF documentation for a more thorough explanation of how VCF files encode different types of variants. I will illustrate this point by looking at our two positions using the following code:
zcat tgp_hla_a_c_b_drb1_dqb1_exons_unfiltered.vcf.gz | awk '$2 == "31324493" { print }'
zcat tgp_hla_a_c_b_drb1_dqb1_exons_unfiltered.vcf.gz | awk '$2 == "31324496" { print }'
which will output the following entry (for brevity I will just show you the first nine columns of the VCF file):
6 31324493 rs576010607 CA C 100 PASS AC=330;AF=0.0658946;AN=5008;NS=2504;DP=6820;EAS_AF=0.0913;AMR_AF=0.0418;AFR_AF=0.0348;EUR_AF=0.0805;SAS_AF=0.0838;AA=|||unknown(NO_COVERAGE);VT=INDEL;EX_TARGET GT
6 31324496 rs540530530 GT G 100 PASS AC=330;AF=0.0658946;AN=5008;NS=2504;DP=6940;EAS_AF=0.0913;AMR_AF=0.0418;AFR_AF=0.0348;EUR_AF=0.0805;SAS_AF=0.0838;AA=|||unknown(NO_COVERAGE);VT=INDEL;EX_TARGET GT
The reference allele for position 31324493
is CA
and the alternative allele is C_
where _
denotes the deletion of the A
allele. Similarly the reference allele for position 31324496
is GT
and the alternative allele is G_
where _
denotes the deletion of the T
allele. From section 5 in the VCF documentation it is then clear to see that the actual reference allele for these two positions are C
and G
respectively thus all positions are concordant, which is exactly what we would expect!
Is the output of checking missing genotype calls per haplotype for every site produced by the TARGT pipeline. This file has four columns in the following format:
position num_missing_calls haps_with_missing_calls locus
where num_missing_calls
refers to the total number of chromosomes with missing genotype calls, haps_with_missing_calls
refers to the comma seperated list of the specific haplotype(s) that the missing genotype call is on, and locus
refers to the HLA gene that site is falls on. Since each loci varies in the number of missing genotype calls I do not filter the VCF file by missing information and allow the researcher to make that call—haha "call", get it? That pun was 100% unintentional.
To actually generate the new VCF file simply run:
python3 update_hla_vcf.py
which will create two files. The first file unannotated_replaced_hla_exons.vcf
is simply an intemediary VCF file that can be deleted and the second file annotated_replaced_hla_exons.vcf
is proably the reason why you are looking at this repo. Lastly, I will bgzip and index annotated_replaced_hla_exons.vcf
using Tabix:
bgzip annotated_replaced_hla_exons.vcf
tabix -p vcf annotated_replaced_hla_exons.vcf.gz
I added the following INFO
flags to the header annotated_replaced_hla_exons.vcf.gz
(which can be found in the resources
directory along with its indexed file):
##INFO=<ID=TARGT,Number=0,Type=Flag,Description="indicates that the genotype call was produced by the TARGT pipeline"
##INFO=<ID=REPLACED_GT,Number=0,Type=Flag,Description="indicates that the TGP genotype call was replaced by TARGT pipeline genotype call"
##INFO=<ID=NEW_GT,Number=0,Type=Flag,Description="indicates a new genotype call introduced by TARGT pipeline genotype call"
I added these info flags such that we can subset annotated_replaced_hla_exons.vcf.gz
in three unique ways using BCFtools.
Scenario 1: Generate a VCF file of all TARGT calls.
bcftools view -i 'INFO/TARGT=1' -Oz -o all_targt_calls.vcf.gz annotated_replaced_hla_exons.vcf.gz
Scenario 2: Generate a VCF file of only the new calls introudced by the TARGT pipeline.
bcftools view -i 'INFO/NEW_GT=1' -Oz -o only_new_targt_calls.vcf.gz annotated_replaced_hla_exons.vcf.gz
Scenario 3: Generate a VCF file that contains only sites that were replaced with TARGT calls.
bcftools view -i 'INFO/REPLACED_GT=1' -Oz -o only_replaced_targt_calls.vcf.gz annotated_replaced_hla_exons.vcf.gz
NOTE: All VCF files and their companion index files can be found in the resources
directory.
Using the filtering schemes described above for the TARGT calls and the original TGP VCF file containing the exons of the classical HLA class 1 and class 2 genes to we can compare the type of variants before and after the integrating the calls from the TARGT pipeline.
### Dependicies ###
import vcf_functions.py as vf
# Load VCF files.
original_tgp_hla_vcf = './resources/tgp_hla_a_c_b_drb1_dqb1_exons_unfiltered.vcf.gz'
all_targt_hla_vcf = './resources/all_targt_calls.vcf.gz'
new_targt_hla_vcf = './resources/only_new_targt_calls.vcf.gz'
replaced_targt_hla_vcf = './resources/only_replaced_targt_calls.vcf.gz'
# Analyze the variant type data per exon.
vf.hla_exon_report(original_tgp_hla_vcf)
vf.hla_exon_report(all_targt_hla_vcf)
vf.hla_exon_report(new_targt_hla_vcf)
vf.hla_exon_report(replaced_targt_hla_vcf)
# Create site dictionary for the original TGP calls and the calls replaced by the TARGT pipeline.
original_tgp_hla_dicc = vf.vcf_site_info(original_tgp_hla_vcf)
replaced_targt_hla_dicc = vf.vcf_site_info(replaced_targt_hla_vcf)
# Analyze the the variant identities before and after replacement.
vf.compare_replaced_calls(original_tgp_hla_dicc, replaced_targt_hla_dicc)
First we will look at the original calls from the TGP for our classical HLA class 1 and class 2 exons from tgp_hla_a_c_b_drb1_dqb1_exons_unfiltered.vcf.gz
where the data was imputed and consequently the TGP VCF file only contains variable sites (NOTE Total Sites
refers to the number of sites present in the VCF file):
Locus | Total Sites | Invariant Sites | Bi-Allelic Sites | Multi-Allelic Sites |
---|---|---|---|---|
HLA-A (Exon 1) | 45 | 0 | 40 | 5 |
HLA-A (Exon 2) | 37 | 0 | 28 | 9 |
HLA-B (Exon 1) | 36 | 0 | 27 | 9 |
HLA-B (Exon 2) | 57 | 0 | 48 | 9 |
HLA-C (Exon 1) | 31 | 0 | 25 | 6 |
HLA-C (Exon 2) | 25 | 0 | 23 | 2 |
HLA-DRB1 (Exon 1) | 24 | 0 | 24 | 0 |
HLA-DQB1 (Exon 1) | 33 | 0 | 33 | 0 |
Next let's look at how the calls changed once we included the TARGT calls from all_targt_calls.vcf.gz
:
Locus | Total Sites | Invariant Sites | Bi-Allelic Sites | Multi-Allelic Sites |
---|---|---|---|---|
HLA-A (Exon 1) | 270 | 59 | 172 | 39 |
HLA-A (Exon 2) | 276 | 69 | 167 | 40 |
HLA-B (Exon 1) | 276 | 229 | 35 | 12 |
HLA-B (Exon 2) | 270 | 216 | 47 | 7 |
HLA-C (Exon 1) | 276 | 240 | 26 | 10 |
HLA-C (Exon 2) | 270 | 246 | 23 | 1 |
HLA-DRB1 (Exon 1) | 270 | 205 | 44 | 21 |
HLA-DQB1 (Exon 1) | 270 | 210 | 51 | 9 |
Nice, it worked! For the HLA-A locus the TARGT calls introduced quite a few informative polymorphic sites and it introduced a modest amount of new informative polymorphic sites for the HLA-DRB1 and HLA-DQB1 loci! However, for the HLA-B and HLA-C loci the majority of the introduced TARGT calls are invariant. Now let's look at the relative contributions of new TARGT calls vs replaced TARGT calls. First let's look at the breakdown of new calls introduced by the TARGT pipeline from only_new_targt_calls.vcf.gz
:
Locus | Total Sites | Invariant Sites | Bi-Allelic Sites | Multi-Allelic Sites |
---|---|---|---|---|
HLA-A (Exon 1) | 225 | 59 | 152 | 14 |
HLA-A (Exon 2) | 239 | 68 | 153 | 18 |
HLA-B (Exon 1) | 240 | 226 | 13 | 1 |
HLA-B (Exon 2) | 213 | 212 | 1 | 0 |
HLA-C (Exon 1) | 245 | 233 | 12 | 0 |
HLA-C (Exon 2) | 245 | 243 | 2 | 0 |
HLA-DRB1 (Exon 1) | 246 | 205 | 24 | 17 |
HLA-DQB1 (Exon 1) | 237 | 209 | 19 | 9 |
The majority of new calls introduced by the TARGT pipeline appear to be invariant, however, for the HLA-A and HLA-DRB1 loci it looks like the majority of informative polymorphic sites are new calls! Let's compare this to the calls from the TARGT pipeline that replaced the original HLA calls from only_replaced_targt_calls.vcf.gz
:
Locus | Total Sites | Invariant Sites | Bi-Allelic Sites | Multi-Allelic Sites |
---|---|---|---|---|
HLA-A (Exon 1) | 45 | 0 | 20 | 25 |
HLA-A (Exon 2) | 37 | 1 | 14 | 22 |
HLA-B (Exon 1) | 36 | 3 | 22 | 11 |
HLA-B (Exon 2) | 57 | 4 | 46 | 7 |
HLA-C (Exon 1) | 31 | 7 | 14 | 10 |
HLA-C (Exon 2) | 25 | 3 | 21 | 1 |
HLA-DRB1 (Exon 1) | 24 | 0 | 20 | 4 |
HLA-DQB1 (Exon 1) | 33 | 1 | 32 | 0 |
Notably, every single original position was replaced with TARGT calls! Next, lets look at the identity of the calls we replaced, which is the output of vf.compare_replaced_calls(original_tgp_hla_dicc, replaced_targt_hla_dicc)
:
Locus | Identical Variant | Bi-allelic -> Invariant | Bi-allelic -> Bi-allelic | Bi-allelic -> Multi-allelic | Multi-allelic -> Invariant | Multi-allelic -> Bi-allelic | Multi-allelic -> Multi-allelic |
---|---|---|---|---|---|---|---|
HLA-A (Exon 1) | 17 | 0 | 4 | 20 | 0 | 0 | 4 |
HLA-A (Exon 2) | 14 | 1 | 2 | 13 | 0 | 0 | 7 |
HLA-B (Exon 1) | 21 | 3 | 0 | 3 | 1 | 0 | 8 |
HLA-B (Exon 2) | 45 | 4 | 0 | 0 | 2 | 0 | 6 |
HLA-C (Exon 1) | 16 | 7 | 0 | 4 | 0 | 0 | 4 |
HLA-C (Exon 2) | 20 | 3 | 0 | 0 | 1 | 0 | 1 |
HLA-DRB1 (Exon 1) | 20 | 0 | 0 | 4 | 0 | 0 | 0 |
HLA-DQB1 (Exon 1) | 32 | 1 | 0 | 0 | 0 | 0 | 0 |
Lastly, I also compared the SFS between the original and updated TGP datasets but because the data processing is a little more hands on I created a detailed juypter notebook classical_hla_exons_before_and_after.ipynb
for those results!