zaeleus/noodles

Parsing VCF fails for ID in header

apraga opened this issue · 3 comments

Hi,

Thanks for you package. Some data (Clinvar in my case) has a ID in header
##ID=<Description="ClinVar Variation ID">

Minimal working example

##fileformat=VCFv4.1
##ID=<Description="ClinVar Variation ID">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	1	2	A	G	.	.	.

Parsing fails with

called Result::unwrap()on anErr value: Custom { kind: InvalidData, error: InvalidRecord(InvalidValue(InvalidOtherMap(Other("ID"), ParseError { id: None, kind: MissingId }))) }

Removing the offending header line solve the issue. I'm not sure if that should be managed by noodles but figured it would help to report it nevertheless.

Thanks,

Thanks for reporting, @apraga. This case is undefined in VCF 4.1/4.2. I asked for clarification in samtools/hts-specs#760.

This behavior remains undefined, but for inputs that are VCF < 4.3, noodles will now read other header record values that start with a map prefix (<) as a string if there is no map identifier (i.e., ID=), e.g.,

##fileformat=VCFv4.2
##A=<a>
##B=<ID=b>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
Header {
    file_format: FileFormat { major: 4, minor: 2 },
    other_records: {
        Other("A"): Unstructured(["<a>"]),
        Other("B"): Structured({
            "b": Map { inner: Other { id_tag: Standard(Id) }, other_fields: {} },
        }),
    },
    // ...
}

This is explicitly invalid in VCF >= 4.3 and will continue to fail, e.g.,

##fileformat=VCFv4.3
##A=<a>
##B=<ID=b>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
Error: Custom {
    kind: InvalidData,
    error: InvalidRecord(InvalidValue(InvalidOtherMap(
        Other("A"),
        ParseError {
            id: None,
            kind: InvalidField(InvalidKey(UnexpectedEof)),
        },
    ))),
}

These rules also apply to the serializer when writing a VCF header.

Thanks @zaeleus for this fix, that solves my issue !