Building ReferenceDictionary fails for hg19/*
stolpeo opened this issue · 16 comments
The following downloads fail during building the ReferenceDictionary:
Downloading/parsing for data source "hg19/ucsc"
Downloading/parsing for data source "hg19/ensembl"
Downloading/parsing for data source "hg19/refseq"
Downloading/parsing for data source "hg19/refseq_curated"
Downloading/parsing for data source "hg19/refseq_interim"
Downloading/parsing for data source "hg19/refseq_interim_curated"
All throw the same error message:
INFO Building ReferenceDictionary...
Exception in thread "main" java.lang.IllegalArgumentException: Multiple entries with same key: 25=16569 and 25=16571
at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:190)
at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:109)
at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:95)
at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:357)
at de.charite.compbio.jannovar.data.ReferenceDictionaryBuilder.build(ReferenceDictionaryBuilder.java:115)
at de.charite.compbio.jannovar.impl.parse.ReferenceDictParser.parse(ReferenceDictParser.java:127)
at de.charite.compbio.jannovar.datasource.JannovarDataFactory.build(JannovarDataFactory.java:101)
at de.charite.compbio.jannovar.cmd.download.DownloadCommand.run(DownloadCommand.java:43)
at de.charite.compbio.jannovar.Jannovar.main(Jannovar.java:67)
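OK, this error occurs because http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz
lists the mitochondrial genome twice, as chrM and chrMT, with different lengths. Both names end up under the same key (25 in the exception message above), hence the conflict: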
chrM 16571 /gbdb/hg19/hg19.2bit
chrMT 16569 /gbdb/hg19/hg19.2bit
16569 should be the correct length. There is a quick and dirty hack: after the first download has failed, go to the download folder and edit the file chromInfo.txt.gz:
remove the line chrM 16571 /gbdb/hg19/hg19.2bit
then rerun the download.
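For anyone who does not want to edit the gzipped file by hand, the hack above can be sketched in a few lines of Java. This is only an illustration using the standard library, not part of Jannovar: the class name ChromInfoFilter and its methods are made up for this example, and it assumes that the name and length are the first two whitespace-separated columns, as in the two lines quoted above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Jannovar.
class ChromInfoFilter {

    /** Returns all lines except those whose first two columns equal name and length. */
    static List<String> dropEntry(List<String> lines, String name, String length) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            boolean match = cols.length >= 2 && cols[0].equals(name) && cols[1].equals(length);
            if (!match) {
                kept.add(line);
            }
        }
        return kept;
    }

    /** Reads a gzip-compressed text file fully into memory. */
    static List<String> readGzip(Path path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(path)), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        }
    }

    /** Writes the lines back as a gzip-compressed text file, overwriting the target. */
    static void writeGzip(Path path, List<String> lines) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(path)), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ChromInfoFilter <path to chromInfo.txt.gz>");
            return;
        }
        Path file = Paths.get(args[0]);
        // The file is read completely before it is reopened for writing.
        List<String> filtered = dropEntry(readGzip(file), "chrM", "16571");
        writeGzip(file, filtered);
    }
}
```

Point it at the chromInfo.txt.gz in the download folder after the first failed run, then start the download again. Matching on both name and length means the chrMT 16569 line is left alone.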
Can anyone tell me whether this is an error on UCSC's side, or do we have to fix it in Jannovar?
@visze That's most probably the good old CRS vs rCRS problem. The correct one is the longer one. The good news is that GRCh38 == hg38...
@holtgrewe No, I don't think so. We download the following chrMT DNA: faMT=https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=nuccore&report=fasta&id=251831106 Its length is 16569. That's why I thought that the shorter one is the correct one.
That depends on the definition of "correct"
Until today, I assumed that hg19 == CRS and not the revised CRS, which was correct at least until 2019.
They must have included CRS for good measure to inflict pain on everyone downstream...
The error appears only on hg19.
This is still an issue and a royal pain in the rear. Can I suggest we slightly rationalise the way the chromosome information is gathered for a build?
I'd like to suggest that the ini file point to the GenBank assembly report file, e.g.
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt
# Assembly name: GRCh38.p13
# Description: Genome Reference Consortium Human Build 38 patch release 13 (GRCh38.p13)
# Organism name: Homo sapiens (human)
# Taxid: 9606
# BioProject: PRJNA31257
# Submitter: Genome Reference Consortium
# Date: 2019-02-28
# Assembly type: haploid-with-alt-loci
# Release type: patch
# Assembly level: Chromosome
# Genome representation: full
# RefSeq category: Reference Genome
# GenBank assembly accession: GCA_000001405.28
# RefSeq assembly accession: GCF_000001405.39
# RefSeq assembly and GenBank assemblies identical: no
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000001305.2 GCF_000001305.15 Primary Assembly
## GCA_000005045.26 GCF_000005045.25 PATCHES
## GCA_000001315.2 GCF_000001315.2 ALT_REF_LOCI_1
... lots more of these
## GCA_000006015.1 GCF_000006015.1 non-nuclear
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
1 assembled-molecule 1 Chromosome CM000663.2 = NC_000001.11 Primary Assembly 248956422 chr1
2 assembled-molecule 2 Chromosome CM000664.2 = NC_000002.12 Primary Assembly 242193529 chr2
3 assembled-molecule 3 Chromosome CM000665.2 = NC_000003.12 Primary Assembly 198295559 chr3
4 assembled-molecule 4 Chromosome CM000666.2 = NC_000004.12 Primary Assembly 190214555 chr4
5 assembled-molecule 5 Chromosome CM000667.2 = NC_000005.10 Primary Assembly 181538259 chr5
6 assembled-molecule 6 Chromosome CM000668.2 = NC_000006.12 Primary Assembly 170805979 chr6
7 assembled-molecule 7 Chromosome CM000669.2 = NC_000007.14 Primary Assembly 159345973 chr7
8 assembled-molecule 8 Chromosome CM000670.2 = NC_000008.11 Primary Assembly 145138636 chr8
9 assembled-molecule 9 Chromosome CM000671.2 = NC_000009.12 Primary Assembly 138394717 chr9
10 assembled-molecule 10 Chromosome CM000672.2 = NC_000010.11 Primary Assembly 133797422 chr10
11 assembled-molecule 11 Chromosome CM000673.2 = NC_000011.10 Primary Assembly 135086622 chr11
12 assembled-molecule 12 Chromosome CM000674.2 = NC_000012.12 Primary Assembly 133275309 chr12
13 assembled-molecule 13 Chromosome CM000675.2 = NC_000013.11 Primary Assembly 114364328 chr13
14 assembled-molecule 14 Chromosome CM000676.2 = NC_000014.9 Primary Assembly 107043718 chr14
15 assembled-molecule 15 Chromosome CM000677.2 = NC_000015.10 Primary Assembly 101991189 chr15
16 assembled-molecule 16 Chromosome CM000678.2 = NC_000016.10 Primary Assembly 90338345 chr16
17 assembled-molecule 17 Chromosome CM000679.2 = NC_000017.11 Primary Assembly 83257441 chr17
18 assembled-molecule 18 Chromosome CM000680.2 = NC_000018.10 Primary Assembly 80373285 chr18
19 assembled-molecule 19 Chromosome CM000681.2 = NC_000019.10 Primary Assembly 58617616 chr19
20 assembled-molecule 20 Chromosome CM000682.2 = NC_000020.11 Primary Assembly 64444167 chr20
21 assembled-molecule 21 Chromosome CM000683.2 = NC_000021.9 Primary Assembly 46709983 chr21
22 assembled-molecule 22 Chromosome CM000684.2 = NC_000022.11 Primary Assembly 50818468 chr22
X assembled-molecule X Chromosome CM000685.2 = NC_000023.11 Primary Assembly 156040895 chrX
Y assembled-molecule Y Chromosome CM000686.2 = NC_000024.10 Primary Assembly 57227415 chrY
HSCHR1_CTG1_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270706.1 = NT_187361.1 Primary Assembly 175055 chr1_KI270706v1_random
.... lots more of these
MT assembled-molecule MT Mitochondrion J01415.2 = NC_012920.1 non-nuclear 16569 chrM
The assembled-molecule lines can be used to construct the RefDict directly and no manual file massaging needs to happen.
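As a rough sketch of what that could look like, the snippet below keeps only the assembled-molecule rows of an assembly report and pulls out the name, RefSeq accession, length and UCSC-style name. It is not Jannovar code: AssemblyReportSketch and Contig are hypothetical names, and it assumes that the report is tab-separated with the ten columns shown in the header line above (needed because "Primary Assembly" contains a space).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Jannovar implementation.
class AssemblyReportSketch {

    /** One assembled molecule taken from the report. */
    static final class Contig {
        final String name;            // Sequence-Name, e.g. "1" or "MT"
        final String refSeqAccession; // RefSeq-Accn, e.g. "NC_012920.1"
        final int length;             // Sequence-Length
        final String ucscName;        // UCSC-style-name, e.g. "chrM"

        Contig(String name, String refSeqAccession, int length, String ucscName) {
            this.name = name;
            this.refSeqAccession = refSeqAccession;
            this.length = length;
            this.ucscName = ucscName;
        }
    }

    /** Parses report lines, skipping comments and everything that is not an assembled molecule. */
    static List<Contig> parseAssembledMolecules(List<String> lines) {
        List<Contig> contigs = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // header and comment lines
            }
            String[] fields = line.split("\t");
            if (fields.length < 10 || !"assembled-molecule".equals(fields[1])) {
                continue; // scaffolds, patches, alt loci, malformed rows
            }
            contigs.add(new Contig(fields[0], fields[6],
                    Integer.parseInt(fields[8].trim()), fields[9].trim()));
        }
        return contigs;
    }

    public static void main(String[] args) {
        // Two rows copied from the GRCh38.p13 report quoted above, with tabs restored.
        List<String> lines = Arrays.asList(
                "# Sequence-Name\tSequence-Role\t...",
                String.join("\t", "1", "assembled-molecule", "1", "Chromosome",
                        "CM000663.2", "=", "NC_000001.11", "Primary Assembly", "248956422", "chr1"),
                String.join("\t", "MT", "assembled-molecule", "MT", "Mitochondrion",
                        "J01415.2", "=", "NC_012920.1", "non-nuclear", "16569", "chrM"));
        for (Contig c : parseAssembledMolecules(lines)) {
            System.out.println(c.name + "\t" + c.ucscName + "\t" + c.refSeqAccession + "\t" + c.length);
        }
    }
}
```

Each molecule then appears exactly once with a single length, so the duplicate chrM/chrMT situation from chromInfo.txt.gz should not arise, and the UCSC-style name and RefSeq accession are available as aliases from the same row.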
Am happy to do the work to make this happen.
This seems like an excellent idea!
Excellent, then I will endeavour to make it so!
Aye aye, Captain!