samtools/htsjdk

Unmapped read validation causes validation errors in specifications-compliant records.

d-cameron opened this issue · 3 comments

Description of the issue:

Validation of specifications-compliant unmapped records raises validation failures.

The SAM specifications state:
• Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptionscan be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800."

Your environment:

htsjdk 2.23.0

Steps to reproduce

	@Test
	public void htsjdk_mapq_unmapped_validation_violates_sam_specifications() {
		SAMRecord r = new SAMRecord(null);
		r.setReferenceName("chr");
		r.setAlignmentStart(1);
		r.setMappingQuality(1);
		r.setCigarString("1M");
		r.setReadUnmappedFlag(true);
		r.setSupplementaryAlignmentFlag(true);
		r.setSecondaryAlignment(true);
		r.setProperPairFlag(true);
		List<SAMValidationError> errors = r.isValid();
		assertEquals(0, errors.size());
	}

Expected behaviour

Test case should pass. Maybe a warning is more appropriate?

Actual behaviour

4 unnecessary validation failures are raised:

0 = {SAMValidationError@1269} "ERROR::INVALID_FLAG_PROPER_PAIR:Proper pair flag should not be set for unpaired read."
1 = {SAMValidationError@1270} "ERROR::INVALID_FLAG_NOT_PRIM_ALIGNMENT:Secondary alignment flag should not be set for unmapped read."
2 = {SAMValidationError@1271} "ERROR::INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT:Supplementary alignment flag should not be set for unmapped read."
3 = {SAMValidationError@1272} "ERROR::INVALID_MAPPING_QUALITY:MAPQ should be 0 for unmapped read."

@d-cameron I think this is a place where someone decided to be more strict than is specified in order to catch things that are most likely caused by bugs elsewhere in the pipeline, and to reduce the space state we have to deal with. It looks like you're working around it, but it's not clear to me if this is a situation where there is value to you in maintaining the now meaningless flags. (I.e. Do you want to keep the mapping quality from before you clipped it?). Or is it just annoying that these technically valid reads are causing errors and it's a pain to have to reset the values manually?

You're definitely correct in that these values are allowed by the spec, but I read the intent of the spec to be more of "if unmapped is set than the listed values are meaningless and should be ignored" more than "if unmapped is set then any combination of these values is valid."

Relaxing these errors to a warning seems like a good potential solution.

(I.e. Do you want to keep the mapping quality from before you clipped it?)

In this particular instance, it's bwa reporting a 50S50I alignment, I'm cleaning up the start/end indels thus it gets unmapped. The mapq is still as meaningful as it was before it became unmapped as there were 0 bases actually aligned to the ref and 0 afterwards.

SevenBridges do a similar thing where they graph align reads to a non-reference insertion which then gets mapped to 100I in the SAM output. 0 bases consumed by the reference (so according to the SAM specs the alignment position doesn't have a meaning) but the alignment location and mapq is meaningful.

Currently I am working with SAM/BAM files containing circular consensus sequences generated with PacBio ccs. These sequences are always unmapped, so FLAG is always set to 4, RNAME is always set to * and MAPQ is always set to 255 (as documented by PacBio in their SMRT Tools Reference Guide, page 9).

While trying to read these files using the SAMRecordIterator (from
htsjdk 3.0.4), calling the next() method unfortunately triggers the following exception:
htsjdk.samtools.SAMFormatException: Error parsing text SAM file. MAPQ must be zero if RNAME is not specified; Line 1

The only possibility I've found so far to read these files, is to completely disable all validations using validationStringency(ValidationStringency.SILENT) (or LENIENT) on the reader factory. But completely switching off validation really feels uncomfortable to me, like missing all its benefits.

Is there another recommended way to read such files, which i am missing?