marbl/HG002

Using the HG002 reference genome for phasing

Closed this issue · 6 comments

Hi,
If I want to generate a phased Nanopore BAM file, should I map the Nanopore reads directly against HG002, so that the reads are assigned to paternal or maternal chromosomes, or should I map against CHM13 and then phase the reads based on heterozygous SNPs as usual?

Are you trying to map a different human to phase its variants or are you talking about mapping HG002 data back to itself? If you're mapping a different human, I think you'd want to use a single haplotype like CHM13, probably with the PAR regions masked.

Thanks for replying. I am trying to map a different human (not HG002) to phase its variants.
I see, so it is probably because the SNPs differ between HG002 and my samples? I think I misunderstood the purpose of the HG002 reference.
Also, could you tell me what "PAR" stands for?

Variant calling tools aren't designed to process phased diploid assemblies like HG002. What you don't want is for half your reads to map to a chromosome on one haplotype and half to the same chromosome on the other haplotype, which confuses the variant frequency calculations. For the mapping you really want a single representative of each chromosome. You could make one from HG002 by taking one of the haplotypes and adding both sex chromosomes to it (e.g. maternal + chrY if your sample is XY, just maternal if your sample is XX, or paternal + chrX if your sample is XY).
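To make the "single representative of each chromosome" idea concrete, here is a minimal Python sketch that filters a FASTA stream down to one haplotype plus one extra sex chromosome. The sequence names (`chr1_MATERNAL`, `chrY_PATERNAL`, and so on) and the function name `select_single_haplotype` are assumptions made for this example, so check the headers in the assembly release you actually downloaded before relying on them.

```python
def select_single_haplotype(fasta_lines,
                            keep_suffix="_MATERNAL",
                            extra_names=("chrY_PATERNAL",)):
    """Yield only the FASTA records that belong to one haplotype.

    A record is kept when its name ends with `keep_suffix` or is listed in
    `extra_names`. Both defaults are hypothetical naming conventions.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            keep = name.endswith(keep_suffix) or name in extra_names
        if keep:
            yield line


# Tiny in-memory demo; a real run would iterate over an open FASTA file.
demo = [
    ">chr1_MATERNAL\n", "ACGT\n",
    ">chr1_PATERNAL\n", "ACGA\n",
    ">chrY_PATERNAL\n", "TTTT\n",
]
single_copy = list(select_single_haplotype(demo))
```

With the default arguments this keeps the maternal records and the paternal chrY, which corresponds to the "maternal + chrY" case for an XY sample. For an XX sample, passing `extra_names=()` would keep the maternal haplotype only.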

PAR stands for pseudoautosomal region: the stretches of highly similar sequence at the start and end of chrX and chrY. Since these regions commonly recombine, variants on the X in one genome may be on the Y in another. This also confuses variant calling because reads will map to both chromosomes, making coverage and allele frequency unclear. For this reason, when mapping male samples the PARs on the Y chromosome are masked so that all X and Y reads from these regions map to the X. The CHM13 GitHub page provides a masked version of CHM13v2.0 for this purpose.
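For anyone wondering what "masked" means in practice, the sketch below replaces the bases inside given intervals with `N`, so an aligner has nothing to place reads on there. This only illustrates the idea: the function name `mask_intervals` is hypothetical and the coordinates in the demo are made up, not real PAR boundaries. For real work, use the pre-masked CHM13v2.0 file mentioned above.

```python
def mask_intervals(sequence, intervals):
    """Return `sequence` with each 0-based, half-open [start, end) interval
    replaced by 'N' characters. The sequence length is preserved so that
    coordinates outside the masked intervals stay valid.
    """
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy chrY with made-up "PAR" intervals at both ends.
toy_chr_y = "ACGTACGTAC"
toy_masked = mask_intervals(toy_chr_y, [(0, 3), (8, 10)])
```

Because the masked sequence keeps its original length, annotations and variant coordinates on the rest of chrY remain usable after masking.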

Oh! I see, so a diploid reference needs a different mapping approach.
If I don't do variant calling, can I map my reads to all of HG002 (paternal + maternal + X + Y chromosomes), then extract the reads on the paternal chromosomes and treat them as phased (paternal)? That seems to be the same as extracting the paternal chromosomes from HG002 and then mapping the reads to them.

Actually, I am doing structural variant (SV) detection. I am thinking of detecting SVs against CHM13, because current SV callers are designed for references with a single representative of each chromosome. Then I would perform another mapping against HG002 and extract the reads on the paternal and maternal chromosomes as above. By combining the SV positions with the phased reads, I could get phased SVs. That's what I'm thinking, but I'm not sure it fully makes sense...

Also, when mapping to HG002, should I care about reference sequences like "haplotype1-0000054" and "unassigned-0002984", or just ignore them?

Thanks very much for the detailed explanation of PAR; I understand it fully now.

You can't rely on HG002 to phase reads from another sample. There's no reason to expect the haplotypes to match between the two, so things that are on one haplotype in HG002 might not be in your sample and vice versa. You can map to the CHM13 reference, phase the reads with WhatsHap or a similar tool, and then look at SVs in subsets of reads.
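As a rough sketch of the "subsets of reads" step: once reads carry a haplotype tag (for example the `HP` tag that `whatshap haplotag` writes), they can be grouped by that tag before looking at SVs. The code below works on plain SAM text lines so it needs no extra libraries. The function name `group_reads_by_haplotype` and the demo records are made up for illustration, and reads without an `HP` tag are collected under an `"untagged"` key, which is a choice made here rather than any tool's convention.

```python
from collections import defaultdict


def group_reads_by_haplotype(sam_lines):
    """Group read names by the value of their HP:i tag.

    Header lines (starting with '@') and malformed records are skipped.
    Reads with no HP tag end up under the key 'untagged'.
    """
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM has 11 mandatory columns
            continue
        haplotype = "untagged"
        for tag in fields[11:]:  # optional TAG:TYPE:VALUE fields
            if tag.startswith("HP:i:"):
                haplotype = tag[len("HP:i:"):]
                break
        groups[haplotype].append(fields[0])
    return dict(groups)


# Made-up SAM records for demonstration only.
demo_sam = [
    "@HD\tVN:1.6\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:1\tPS:i:100\n",
    "r2\t0\tchr1\t120\t60\t4M\t*\t0\t0\tACGA\tIIII\tPS:i:100\tHP:i:2\n",
    "r3\t0\tchr1\t140\t60\t4M\t*\t0\t0\tACGG\tIIII\n",
]
reads_by_hap = group_reads_by_haplotype(demo_sam)
```

Each group of read names can then be used to subset the alignments and inspect SV-supporting reads per haplotype. The haplotype labels are arbitrary within each phase set, so they do not say which one is paternal or maternal.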

I see, sorry for the misunderstanding. Thank you very much for the quick response!