charite/jannovar

Building ReferenceDictionary fails for hg19/*

stolpeo opened this issue · 16 comments

The following downloads fail while building the ReferenceDictionary:

Downloading/parsing for data source "hg19/ucsc"
Downloading/parsing for data source "hg19/ensembl"
Downloading/parsing for data source "hg19/refseq"
Downloading/parsing for data source "hg19/refseq_curated"
Downloading/parsing for data source "hg19/refseq_interim"
Downloading/parsing for data source "hg19/refseq_interim_curated"

All throw the same error message:

INFO Building ReferenceDictionary...
Exception in thread "main" java.lang.IllegalArgumentException: Multiple entries with same key: 25=16569 and 25=16571
        at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:190)
        at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:109)
        at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:95)
        at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:357)
        at de.charite.compbio.jannovar.data.ReferenceDictionaryBuilder.build(ReferenceDictionaryBuilder.java:115)
        at de.charite.compbio.jannovar.impl.parse.ReferenceDictParser.parse(ReferenceDictParser.java:127)
        at de.charite.compbio.jannovar.datasource.JannovarDataFactory.build(JannovarDataFactory.java:101)
        at de.charite.compbio.jannovar.cmd.download.DownloadCommand.run(DownloadCommand.java:43)
        at de.charite.compbio.jannovar.Jannovar.main(Jannovar.java:67)
visze commented

OK, this error occurs because http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz lists the mitochondrial chromosome twice, as chrM and chrMT, with different lengths:

chrM    16571   /gbdb/hg19/hg19.2bit
chrMT   16569   /gbdb/hg19/hg19.2bit
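The stack trace reports "25=16569 and 25=16571", which suggests that both names end up under the same contig ID. A minimal sketch of how such a clash could be detected before building the dictionary is shown here. The class ChromInfoConflicts and its name normalisation are hypothetical illustrations, not Jannovar's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ChromInfoConflicts {
    // Assumed normalisation: strip the "chr" prefix and treat M and MT as one contig.
    static String canonicalName(String ucscName) {
        String name = ucscName.startsWith("chr") ? ucscName.substring(3) : ucscName;
        return name.equals("M") ? "MT" : name;
    }

    // Returns one message per contig that appears with two different lengths.
    static List<String> findConflicts(List<String> chromInfoLines) {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        List<String> conflicts = new ArrayList<>();
        for (String line : chromInfoLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            String name = canonicalName(fields[0]);
            int length = Integer.parseInt(fields[1]);
            Integer previous = lengths.putIfAbsent(name, length);
            if (previous != null && previous != length) {
                conflicts.add(name + ": " + previous + " vs " + length);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "chrM\t16571\t/gbdb/hg19/hg19.2bit",
            "chrMT\t16569\t/gbdb/hg19/hg19.2bit");
        System.out.println(findConflicts(lines)); // prints [MT: 16571 vs 16569]
    }
}
```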
visze commented

16569 should be the correct length. There is a quick-and-dirty hack: after the first download attempt fails, go to the download folder and edit the file chromInfo.txt.gz:

Remove the line chrM 16571 /gbdb/hg19/hg19.2bit

Then rerun the download.
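For anyone who prefers not to edit the gzipped file by hand, the workaround could be scripted roughly as follows. This is only a sketch: the class name FixChromInfo is made up, and the path to the downloaded chromInfo.txt.gz has to be passed in, since the thread does not give the download folder's location.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class FixChromInfo {
    // Keep every line except the chrM entry with length 16571.
    static boolean keep(String line) {
        String[] fields = line.trim().split("\\s+");
        return !(fields.length >= 2 && fields[0].equals("chrM") && fields[1].equals("16571"));
    }

    // Rewrite the gzipped file via a temporary file in the same directory.
    static void rewrite(Path gzFile) throws IOException {
        Path tmp = Files.createTempFile(gzFile.toAbsolutePath().getParent(), "chromInfo", ".tmp");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                 new GZIPOutputStream(Files.newOutputStream(tmp)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (keep(line)) {
                    out.write(line);
                    out.write('\n');
                }
            }
        }
        Files.move(tmp, gzFile, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FixChromInfo.java <path to chromInfo.txt.gz>");
            return;
        }
        rewrite(Path.of(args[0]));
    }
}
```

After running it on the downloaded file, rerun the Jannovar download as described above.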

visze commented

Can anyone tell me whether this is an error on UCSC's side, or do we have to fix it in Jannovar?

@visze That's most probably the good old CRS vs rCRS problem. The correct one is the longer one. The good news is that GRCh38 == hg38...

@visze chrMT: so few bases so many problems

visze commented

@visze That's most probably the good old CRS vs rCRS problem. The correct one is the longer one. The good news is that GRCh38 == hg38...

@holtgrewe No, I don't think so. We download the following chrMT DNA: faMT=https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=nuccore&report=fasta&id=251831106 The length is 16569. That's why I thought that the shorter one is the correct one.

That depends on the definition of "correct"

Until today, I assumed that hg19 == CRS and not the revised CRS, which was correct at least until 2019.

They must have included CRS for good measure to inflict pain on everyone downstream...

To sum up, we agree on deleting the line

chrM    16571   /gbdb/hg19/hg19.2bit

as @visze suggested. (?)

@stolpeo yes, for hg19 we do

visze commented

the error appears only on hg19

This is still an issue and a royal pain in the rear. Can I suggest we slightly rationalise the way the chromosome information is gathered for a build?

I'd like to suggest that the ini file points to the GenBank assembly report file, e.g.

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt

# Assembly name:  GRCh38.p13
# Description:    Genome Reference Consortium Human Build 38 patch release 13 (GRCh38.p13)
# Organism name:  Homo sapiens (human)
# Taxid:          9606
# BioProject:     PRJNA31257
# Submitter:      Genome Reference Consortium
# Date:           2019-02-28
# Assembly type:  haploid-with-alt-loci
# Release type:   patch
# Assembly level: Chromosome
# Genome representation: full
# RefSeq category: Reference Genome
# GenBank assembly accession: GCA_000001405.28
# RefSeq assembly accession: GCF_000001405.39
# RefSeq assembly and GenBank assemblies identical: no
#
## Assembly-Units:
## GenBank Unit Accession	RefSeq Unit Accession	Assembly-Unit name
## GCA_000001305.2	GCF_000001305.15	Primary Assembly
## GCA_000005045.26	GCF_000005045.25	PATCHES
## GCA_000001315.2	GCF_000001315.2	ALT_REF_LOCI_1
... lots more of these
## GCA_000006015.1	GCF_000006015.1	non-nuclear
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name	Sequence-Role	Assigned-Molecule	Assigned-Molecule-Location/Type	GenBank-Accn	Relationship	RefSeq-Accn	Assembly-Unit	Sequence-Length	UCSC-style-name
1	assembled-molecule	1	Chromosome	CM000663.2	=	NC_000001.11	Primary Assembly	248956422	chr1
2	assembled-molecule	2	Chromosome	CM000664.2	=	NC_000002.12	Primary Assembly	242193529	chr2
3	assembled-molecule	3	Chromosome	CM000665.2	=	NC_000003.12	Primary Assembly	198295559	chr3
4	assembled-molecule	4	Chromosome	CM000666.2	=	NC_000004.12	Primary Assembly	190214555	chr4
5	assembled-molecule	5	Chromosome	CM000667.2	=	NC_000005.10	Primary Assembly	181538259	chr5
6	assembled-molecule	6	Chromosome	CM000668.2	=	NC_000006.12	Primary Assembly	170805979	chr6
7	assembled-molecule	7	Chromosome	CM000669.2	=	NC_000007.14	Primary Assembly	159345973	chr7
8	assembled-molecule	8	Chromosome	CM000670.2	=	NC_000008.11	Primary Assembly	145138636	chr8
9	assembled-molecule	9	Chromosome	CM000671.2	=	NC_000009.12	Primary Assembly	138394717	chr9
10	assembled-molecule	10	Chromosome	CM000672.2	=	NC_000010.11	Primary Assembly	133797422	chr10
11	assembled-molecule	11	Chromosome	CM000673.2	=	NC_000011.10	Primary Assembly	135086622	chr11
12	assembled-molecule	12	Chromosome	CM000674.2	=	NC_000012.12	Primary Assembly	133275309	chr12
13	assembled-molecule	13	Chromosome	CM000675.2	=	NC_000013.11	Primary Assembly	114364328	chr13
14	assembled-molecule	14	Chromosome	CM000676.2	=	NC_000014.9	Primary Assembly	107043718	chr14
15	assembled-molecule	15	Chromosome	CM000677.2	=	NC_000015.10	Primary Assembly	101991189	chr15
16	assembled-molecule	16	Chromosome	CM000678.2	=	NC_000016.10	Primary Assembly	90338345	chr16
17	assembled-molecule	17	Chromosome	CM000679.2	=	NC_000017.11	Primary Assembly	83257441	chr17
18	assembled-molecule	18	Chromosome	CM000680.2	=	NC_000018.10	Primary Assembly	80373285	chr18
19	assembled-molecule	19	Chromosome	CM000681.2	=	NC_000019.10	Primary Assembly	58617616	chr19
20	assembled-molecule	20	Chromosome	CM000682.2	=	NC_000020.11	Primary Assembly	64444167	chr20
21	assembled-molecule	21	Chromosome	CM000683.2	=	NC_000021.9	Primary Assembly	46709983	chr21
22	assembled-molecule	22	Chromosome	CM000684.2	=	NC_000022.11	Primary Assembly	50818468	chr22
X	assembled-molecule	X	Chromosome	CM000685.2	=	NC_000023.11	Primary Assembly	156040895	chrX
Y	assembled-molecule	Y	Chromosome	CM000686.2	=	NC_000024.10	Primary Assembly	57227415	chrY
HSCHR1_CTG1_UNLOCALIZED	unlocalized-scaffold	1	Chromosome	KI270706.1	=	NT_187361.1	Primary Assembly	175055	chr1_KI270706v1_random
.... lots more of these
MT	assembled-molecule	MT	Mitochondrion	J01415.2	=	NC_012920.1	non-nuclear	16569	chrM

The assembled-molecule lines can be used to construct the RefDict directly and no manual file massaging needs to happen.

Am happy to do the work to make this happen.
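As a rough illustration of the proposal, the assembled-molecule rows of the report could be turned into a name-to-length map along these lines. The class AssemblyReportParser is a hypothetical sketch, not the eventual Jannovar implementation. The column positions are taken from the "# Sequence-Name ..." header line quoted above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AssemblyReportParser {
    // Zero-based column positions from the report's tab-separated header line.
    static final int SEQUENCE_NAME = 0;
    static final int SEQUENCE_ROLE = 1;
    static final int SEQUENCE_LENGTH = 8;

    // Collects contig name -> length for rows whose role is "assembled-molecule".
    static Map<String, Integer> contigLengths(List<String> reportLines) {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        for (String line : reportLines) {
            if (line.startsWith("#")) {
                continue; // header and comment lines
            }
            String[] fields = line.split("\t");
            if (fields.length <= SEQUENCE_LENGTH
                    || !fields[SEQUENCE_ROLE].equals("assembled-molecule")) {
                continue; // scaffolds, patches, alt loci, malformed rows
            }
            Integer previous = lengths.put(
                fields[SEQUENCE_NAME], Integer.parseInt(fields[SEQUENCE_LENGTH]));
            if (previous != null) {
                throw new IllegalArgumentException("Duplicate contig: " + fields[SEQUENCE_NAME]);
            }
        }
        return lengths;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "# Sequence-Name\tSequence-Role\tAssigned-Molecule",
            "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\tNC_000001.11\tPrimary Assembly\t248956422\tchr1",
            "HSCHR1_CTG1_UNLOCALIZED\tunlocalized-scaffold\t1\tChromosome\tKI270706.1\t=\tNT_187361.1\tPrimary Assembly\t175055\tchr1_KI270706v1_random",
            "MT\tassembled-molecule\tMT\tMitochondrion\tJ01415.2\t=\tNC_012920.1\tnon-nuclear\t16569\tchrM");
        System.out.println(contigLengths(sample)); // prints {1=248956422, MT=16569}
    }
}
```

Because each assembled molecule appears exactly once in the report, the duplicate check should never fire for a well-formed file. It is there only to fail early with a clearer message than the one in the original stack trace.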

This seems like an excellent idea!

Excellent, then I will endeavour to make it so!

or even more enthusiastically
[image]

Aye aye, Captain!