Running past the end of a BED
rulixxx commented
Expected Behavior
Terminate correctly when iterating over a BED file for intersecting intervals.
Current Behavior
An error is raised when the iterator tries to read past the end of the stream.
Possible Solution / Implementation
This worked for me: I added an extra condition, `!eof(iter.reader.state.stream)`, to the inner loop of `Indexes.done`:
function Indexes.done(iter::Indexes.TabixOverlapIterator, state)
    buffer = BioGenerics.IO.stream(iter.reader)
    source = buffer.stream
    if state.chunkid == 0
        if isempty(state.chunks)
            return true
        end
        state.chunkid += 1
        seek(source, state.chunks[state.chunkid].start)
    end
    while state.chunkid ≤ lastindex(state.chunks)
        chunk = state.chunks[state.chunkid]
        # The `virtualoffset(source)` is not synchronized with the current reading position because data are buffered in `buffer` for parsing text.
        # So we need to check not only `virtualoffset` but also `bytesavailable`, which returns the current buffered data size.
        # Added condition: `!eof(...)` stops the loop at the end of the stream.
        while !eof(iter.reader.state.stream) && (bytesavailable(buffer) > 0 || BGZFStreams.virtualoffset(source) < chunk.stop)
            read!(iter.reader, state.record)
            c = Indexes.icmp(state.record, iter.interval)
            if c == 0  # overlapping
                return false
            elseif c > 0
                # no more overlapping records in this chunk
                break
            end
        end
        state.chunkid += 1
        if state.chunkid ≤ lastindex(state.chunks)
            seek(source, state.chunks[state.chunkid].start)
        end
    end
    # no more overlapping records
    return true
end
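
To illustrate why the extra `!eof(...)` check matters, here is a minimal sketch that uses only Base Julia. It is not the Indexes code: `count_records` is a hypothetical helper, a plain `IOBuffer` stands in for the BGZF stream, and `stop` plays the role of a `chunk.stop` that points past the data actually available.

```julia
# Hypothetical helper, not part of the Indexes module: reads fixed-width
# records until `stop`, which may overshoot the data actually present.
function count_records(io::IO, stop::Integer; guard::Bool)
    n = 0
    # With `guard = true` the loop also stops at the end of the stream,
    # mirroring the `!eof(...)` condition added to `Indexes.done`.
    while (!guard || !eof(io)) && position(io) < stop
        read(io, UInt32)   # throws EOFError if the stream is exhausted
        n += 1
    end
    return n
end

data = zeros(UInt8, 12)                               # three 4-byte records
count_records(IOBuffer(data), 100; guard = true)      # stops at end of stream
# count_records(IOBuffer(data), 100; guard = false)   # would throw EOFError
```

Without the guard, the offset comparison alone keeps the loop running after the last record, which is the kind of failure described above.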
Steps to Reproduce (for bugs)
Sorry, I encountered this some time ago, so I no longer have the BED files. It might have been triggered by working with concatenated bgzipped files.
CiaranOMara commented
Thanks for this report.
If the files were bgzipped, there is a known issue that affects the calculation of the virtual offset. It occurs when multiple threads are in use.
I think this issue will be addressed upstream with BioJulia/BGZFStreams.jl#27.
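
For context, a BGZF virtual offset packs two numbers into 64 bits: the file offset of the compressed block in the upper 48 bits, and the position within the decompressed block in the lower 16 bits. The sketch below shows that layout with hypothetical helpers (`make_voffset`, `split_voffset`); BGZFStreams.jl has its own `VirtualOffset` type, which this does not reproduce.

```julia
# Hypothetical helpers for illustration only.
# Upper 48 bits: compressed block offset; lower 16 bits: offset inside the block.
make_voffset(block_offset::Integer, inblock_offset::Integer) =
    (UInt64(block_offset) << 16) | UInt64(inblock_offset)

split_voffset(v::UInt64) = (v >> 16, v & 0xffff)

a = make_voffset(1_000, 50)
b = make_voffset(1_000, 51)   # same block, one byte further
c = make_voffset(2_000, 0)    # start of a later block

a < b < c                     # integer order follows position in the file
split_voffset(a)              # recovers the block offset and in-block offset
```

Because a comparison such as `virtualoffset(source) < chunk.stop` relies on this ordering, a miscalculated offset can leave the loop condition true after the data has run out.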