BED parsing is sometimes too relaxed
dievsky opened this issue · 8 comments
Today I learned that the line chr1\t0\t100
will happily parse as bed12
. The missing fields will just be filled with default values.
val bedFormat = BedFormat.from("bed12")
val content = "chr1\t0\t100"
withBedFile(content) { path ->
bedFormat.parse(path) { parser ->
val entries = parser.map { entry -> entry.unpack(bedFormat) }
assertEquals(1, entries.size)
assertEquals(0, parser.linesFailedToParse)
assertEquals(0, entries.first().score)
}
}
No exceptions thrown, no errors logged, just total peace and harmony. Looks wrong.
This also interferes with our current approach to loading BED track in JBR to some extent, and the underlying problem is more related to BedEntry.unpack
method being unusually forgiving. Because of that, I'm not sure where to place the issue -- big
(home of unpack
), bioinf-commons
(home of BedFormat
) or epigenome
(home of JBR).
The questions here are:
- is this really intended and expected behaviour, and if not,
- how can we fix it?
My suggestion would be to make unpack
throw an uneqivocal exception, then catch and process it upstream.
Your suggestion sounds reasonable.
Thanks! I'd like to also hear from @iromeo , as he wrote a test in big
that checks for the behaviour that I described above as invalid (unpackBed3as12
).
According to BedEntry.unpack()
sources in big
project the behavior is intended and that is why we have a test for it.
// If line is shorter than suggested fields number do not throw an error
// it could be ok for default behaviour, when user not sure how much fields
// do we actually have
big
library doesn't provide any bed format detection, so users should guess file format somehow. If you throw exception they won't be able to parse bed files if they didn't know actual fields number.
Today I learned that the line
chr1\t0\t100
will happily parse as bed12
What was the initial problem? Our new auto-format considers chr1\t0\t100
as bed12
? Or you want to have some double check and validate that format detection was right while parsing the file?
P.S.: I think we can change the behavior as we like.
The comment claims that the behaviour is intended, though it doesn't really explain why. If we don't throw an exception, the users will never know that they used a wrong format. There's literally no way to determine this.
The initial problem was the inconsistency. We got all fussy and threw around exceptions if the score column contained 32768
(not true anymore, but 1.0
will still throw), but somehow at the same time we just swallow the fact that 9 out of 12 fields might be completely missing.
And yes, we do want to check that the format detection was right. We're checking it in processBed
anyway, but currently we can only look for number format exceptions, since unpack
ignores missing fields and doesn't tell anyone.
New version of big
features a stricter unpack
which doesn't ignore missing fields. Now we can adapt our code to this.
Note: our current code depends on undocumented big
implementation details and is incompatible with the newest version of big
(NumberFormatException
s are no longer thrown from BedEntry.unpack
).