samtools/htsjdk

VCFHeader shouldn't allow multiple VCFContigHeaderLines with the same contig index

cmnbroad opened this issue · 0 comments

VCFHeader has no guard against multiple contig header lines with the same contig index. In addition, VCFContigHeaderLine's compareTo implementation ignores everything except contigIndex, which makes it inconsistent with both equals and hashCode.
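For illustration, here is a minimal sketch of what such a guard might look like. `ContigLine` and `GuardedHeader` are hypothetical stand-ins, not htsjdk classes, and throwing `IllegalArgumentException` is only one possible policy (a real fix might prefer to warn or merge instead).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of this is htsjdk API.
class ContigGuardSketch {

    // Minimal stand-in for a contig header line: an ID plus a contig index.
    static final class ContigLine {
        final String id;
        final int contigIndex;

        ContigLine(String id, int contigIndex) {
            this.id = id;
            this.contigIndex = contigIndex;
        }
    }

    // Stand-in header that rejects a second line whose contig index is already in use.
    static final class GuardedHeader {
        private final Map<Integer, ContigLine> byIndex = new HashMap<>();
        private final List<ContigLine> lines = new ArrayList<>();

        void addContigLine(ContigLine line) {
            // putIfAbsent returns the line already registered under this index, if any.
            ContigLine existing = byIndex.putIfAbsent(line.contigIndex, line);
            if (existing != null) {
                throw new IllegalArgumentException(
                        "Contig index " + line.contigIndex + " is already used by " + existing.id);
            }
            lines.add(line);
        }

        List<ContigLine> getContigLines() {
            return new ArrayList<>(lines);
        }
    }

    public static void main(String[] args) {
        GuardedHeader header = new GuardedHeader();
        header.addContigLine(new ContigLine("chr1", 0));
        header.addContigLine(new ContigLine("chr2", 1));
        try {
            header.addContigLine(new ContigLine("chrX", 1)); // same index as chr2
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```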

Set implementations that rely on element ordering, such as TreeSet, treat two VCFContigHeaderLines as equal if they have the same contig index, unlike hash-based sets, which rely on equals and hashCode. This leads to several inconsistencies when a header contains two lines with the same index (which can happen when merging headers or when creating headers manually):

  • VCFHeader.getContigLines returns all lines, including those with a duplicate index, in input order (it does not respect contig index order)
  • VCFHeader.getSequenceDictionary uses getContigLines, so it also includes all lines in input order (again not respecting contig index order)
  • VCFHeader.getMetaDataInSortedOrder() (which vcfWriter uses to serialize a header) respects contig index order because it uses an intermediate TreeSet, but it returns only one line from each set of duplicates
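
The divergence can be reproduced without htsjdk. The sketch below uses a hypothetical `ContigLine` stand-in whose `compareTo` looks only at the contig index while `equals` and `hashCode` also consider the ID, mirroring the behavior described above. Using a `LinkedHashSet` for the input-order view is an assumption made for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; not the htsjdk implementation.
class ContigOrderingSketch {

    static final class ContigLine implements Comparable<ContigLine> {
        final String id;
        final int contigIndex;

        ContigLine(String id, int contigIndex) {
            this.id = id;
            this.contigIndex = contigIndex;
        }

        // equals and hashCode consider both fields...
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ContigLine)) return false;
            ContigLine other = (ContigLine) o;
            return contigIndex == other.contigIndex && id.equals(other.id);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, contigIndex);
        }

        // ...but compareTo looks only at the contig index.
        @Override
        public int compareTo(ContigLine other) {
            return Integer.compare(contigIndex, other.contigIndex);
        }
    }

    // Two of these lines share contig index 1.
    static List<ContigLine> sampleLines() {
        return Arrays.asList(
                new ContigLine("chr2", 1),
                new ContigLine("chr1", 0),
                new ContigLine("chrX", 1));
    }

    static List<String> ids(Set<ContigLine> lines) {
        List<String> result = new ArrayList<>();
        for (ContigLine line : lines) result.add(line.id);
        return result;
    }

    // Hash-based view: keeps every line, in input order.
    static List<String> inputOrderIds() {
        return ids(new LinkedHashSet<>(sampleLines()));
    }

    // TreeSet view: sorted by contig index, and the later duplicate is dropped.
    static List<String> sortedIds() {
        return ids(new TreeSet<>(sampleLines()));
    }

    public static void main(String[] args) {
        System.out.println(inputOrderIds()); // [chr2, chr1, chrX]
        System.out.println(sortedIds());     // [chr1, chr2]
    }
}
```

In this sketch the serialized view silently loses `chrX`, while the input-order view keeps all three lines in an order that does not match their indices.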