JetBrains-Research/bioinf-commons

BED parsing is sometimes too relaxed

dievsky opened this issue · 8 comments

Today I learned that the line chr1\t0\t100 will happily parse as bed12. The missing fields will just be filled with default values.

val bedFormat = BedFormat.from("bed12")
val content = "chr1\t0\t100"
withBedFile(content) { path ->
    bedFormat.parse(path) { parser ->
        val entries = parser.map { entry -> entry.unpack(bedFormat) }
        assertEquals(1, entries.size)
        assertEquals(0, parser.linesFailedToParse)
        assertEquals(0, entries.first().score)
    }
}

No exceptions thrown, no errors logged, just total peace and harmony. Looks wrong.

This also interferes with our current approach to loading BED track in JBR to some extent, and the underlying problem is more related to BedEntry.unpack method being unusually forgiving. Because of that, I'm not sure where to place the issue -- big (home of unpack), bioinf-commons (home of BedFormat) or epigenome (home of JBR).

The questions here are:

  1. is this really intended and expected behaviour, and if not,
  2. how can we fix it?

My suggestion would be to make unpack throw an uneqivocal exception, then catch and process it upstream.

olegs commented

Your suggestion sounds reasonable.

Thanks! I'd like to also hear from @iromeo , as he wrote a test in big that checks for the behaviour that I described above as invalid (unpackBed3as12).

According to BedEntry.unpack() sources in big project the behavior is intended and that is why we have a test for it.

// If line is shorter than suggested fields number do not throw an error
// it could be ok for default behaviour, when user not sure how much fields
// do we actually have

big library doesn't provide any bed format detection, so users should guess file format somehow. If you throw exception they won't be able to parse bed files if they didn't know actual fields number.

Today I learned that the line chr1\t0\t100 will happily parse as bed12

What was the initial problem? Our new auto-format considers chr1\t0\t100 as bed12? Or you want to have some double check and validate that format detection was right while parsing the file?

P.S.: I think we can change the behavior as we like.

The comment claims that the behaviour is intended, though it doesn't really explain why. If we don't throw an exception, the users will never know that they used a wrong format. There's literally no way to determine this.

The initial problem was the inconsistency. We got all fussy and threw around exceptions if the score column contained 32768 (not true anymore, but 1.0 will still throw), but somehow at the same time we just swallow the fact that 9 out of 12 fields might be completely missing.

And yes, we do want to check that the format detection was right. We're checking it in processBed anyway, but currently we can only look for number format exceptions, since unpack ignores missing fields and doesn't tell anyone.

New version of big features a stricter unpack which doesn't ignore missing fields. Now we can adapt our code to this.
Note: our current code depends on undocumented big implementation details and is incompatible with the newest version of big (NumberFormatExceptions are no longer thrown from BedEntry.unpack).

Closed by PR #7.