samtools/hts-specs

sam: Define format of read group predicted median insert size

zaeleus opened this issue · 2 comments

This is in regard to Sequence Alignment/Map Format Specification (2022-08-22).

§ 1.3 "The header section" defines a read group (RG) field named PI for the predicted median insert size. No format is defined for its value, but the test file hdr.RG9.sam assumes it is numeric.

Existing practice and the description of the field as an insert size make it clear that the field value is numeric.

HTSJDK and Picard have always enforced that the field value be an integer. See samtools/htsjdk@2799b1f and the output from picard ValidateSamFile I=test/sam/passed/hdr.RG9.sam in the latest release:

ERROR::INVALID_PREDICTED_MEDIAN_INSERT_SIZE:Error parsing SAM header. PI is not numeric: 123.456. 

Hence my recommendation would be that we clarify that this is an integer. Proposed text is in PR #721.
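
The difference between the two readings of PI can be illustrated with a minimal sketch. This is not htsjdk's or Picard's code; `check_pi` is a hypothetical helper, and Python's built-in `int` and `float` parsers only stand in for whatever integer and floating-point parsing an implementation actually uses.

```python
def check_pi(value: str) -> dict:
    """Report whether a PI value parses as an integer and/or as a float.

    Hypothetical helper for illustration only; real SAM parsers may
    apply stricter or looser rules than Python's int() and float().
    """
    result = {}
    for name, parser in (("int", int), ("float", float)):
        try:
            parser(value)
            result[name] = True
        except ValueError:
            result[name] = False
    return result


print(check_pi("123"))      # an integer-valued PI
print(check_pi("123.456"))  # the value reported in the Picard error above
```

Under these assumptions, an integer-only reader rejects `123.456` while a floating-point reader accepts both values, which is the mismatch between the test file and HTSJDK described above.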

Good catch, given that I wrote most of the test data and hadn't spotted that htsjdk rejects floating-point values. Really, since when have continuous quantities like means been acceptable only as integers? I don't know why they expected integers only, given that ParseFloat would work just fine on integer values.

I thought, however, that I'd tested picard on most of the test data and looked over the causes of the failures. I guess I somehow missed that one (or it's changed since, but that seems unlikely).