igsr/igsr_analysis

Missing most of ChrX in the integrated snvindel file

gaberudy opened this issue · 1 comments

For the IGSR/1000 Genomes Project data downloaded from EBI at this location:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/

We are concerned about what looks like a truncated file for chrX in this list. It is considerably smaller than any of the other per-chromosome VCF files, and after running it through our own conversions and plots, we found that the last variant in the file is only at chrX:2,781,455.
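For anyone who wants to repeat this check, here is a minimal Python sketch of how the first and last variant positions per chromosome could be read out of a VCF. The function names `position_span` and `position_span_from_file` are hypothetical and not part of any IGSR tooling, and the sketch assumes a plain or gzip/bgzip-compressed VCF with tab-separated CHROM and POS as the first two columns.

```python
import gzip
from typing import Dict, Iterable, Tuple


def position_span(lines: Iterable[str]) -> Dict[str, Tuple[int, int, int]]:
    """Return {chrom: (first_pos, last_pos, n_records)} for VCF text lines.

    Hypothetical helper: header lines (starting with '#') and blank
    lines are skipped; only the CHROM and POS columns are inspected.
    """
    spans: Dict[str, Tuple[int, int, int]] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t", 2)[:2]
        p = int(pos)
        if chrom in spans:
            lo, hi, n = spans[chrom]
            spans[chrom] = (min(lo, p), max(hi, p), n + 1)
        else:
            spans[chrom] = (p, p, 1)
    return spans


def position_span_from_file(path: str) -> Dict[str, Tuple[int, int, int]]:
    """Open a .vcf or .vcf.gz file (path is supplied by the caller) and scan it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return position_span(handle)
```

Calling `position_span_from_file` on a locally downloaded chrX file should report the highest POS seen, which can then be compared with the expected chromosome length.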

Since UCSC picked up these VCF files as input to their track, you can even see where the chrX variants stop in their genome browser:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chrX%3A2715741%2D2855483&hgsid=1039559061_PwN1CAfq0fa6MajfjbgjE7OAT3Xe

If you know whom we could contact for the full, untruncated chrX file from this source, we would appreciate it.

It looks like this chromosome needs to be recalled, and UCSC may need to be notified to update their source data as well.

We provide user support at info@1000genomes.org, as noted on our sites, not via GitHub.

This data set does not include the non-PAR regions of X. This is documented in the publication that describes the data set (https://wellcomeopenresearch.org/articles/4-50) and I believe UCSC will already be aware of this.
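As a rough illustration of what that means for a given coordinate, the sketch below labels a chrX position as PAR1, PAR2, or neither. The coordinates are an assumption: they are the commonly cited GRCh38 chrX PAR boundaries (1-based, inclusive), not values taken from the publication, so please verify them against the assembly documentation before relying on them. The names `PAR_REGIONS_GRCH38_X` and `par_label` are hypothetical.

```python
from typing import Optional

# ASSUMPTION: commonly cited GRCh38 chrX pseudoautosomal region
# boundaries (1-based, inclusive). Verify against the assembly docs.
PAR_REGIONS_GRCH38_X = {
    "PAR1": (10_001, 2_781_479),
    "PAR2": (155_701_383, 156_030_895),
}


def par_label(pos: int) -> Optional[str]:
    """Return 'PAR1' or 'PAR2' if pos lies in an assumed PAR, else None."""
    for name, (start, end) in PAR_REGIONS_GRCH38_X.items():
        if start <= pos <= end:
            return name
    return None
```

Under these assumed boundaries, the position reported above, chrX:2,781,455, falls just inside the end of PAR1, which is consistent with the file stopping where the non-PAR region begins.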

You may be interested in newer calls based on high coverage data. These have been described in a recent preprint (https://www.biorxiv.org/content/biorxiv/early/2021/02/07/2021.02.06.430068.full.pdf). The data is also available on our FTP site with additional information here: https://www.internationalgenome.org/data-portal/data-collection/30x-grch38

We'll be happy to answer any further questions, but please direct them to our email address. I'm afraid we are not able to monitor GitHub. I hope the above is of help.