BioJulia/Indexes.jl

Running past the end of a BED


Expected Behavior

Terminate correctly when iterating over a BED file for intersecting intervals.

Current Behavior

An error is thrown when the iterator tries to read past the end of the stream.

Possible Solution / Implementation

This worked for me:

I added an extra condition (the eof check) to the inner while loop of the function Indexes.done:

function Indexes.done(iter::Indexes.TabixOverlapIterator, state)
    buffer = BioGenerics.IO.stream(iter.reader)
    source = buffer.stream
    if state.chunkid == 0
        if isempty(state.chunks)
            return true
        end
        state.chunkid += 1
        seek(source, state.chunks[state.chunkid].start)
    end
    while state.chunkid ≤ lastindex(state.chunks)
        chunk = state.chunks[state.chunkid]
        # `virtualoffset(source)` is not synchronized with the current reading position because data are buffered in `buffer` for text parsing.
        # So we need to check not only `virtualoffset` but also `bytesavailable`, which returns the size of the currently buffered data.
        # The `eof` check keeps the loop from reading past the end of the stream.
        while !eof(iter.reader.state.stream) && (bytesavailable(buffer) > 0 || BGZFStreams.virtualoffset(source) < chunk.stop)
            read!(iter.reader, state.record)
            c = Indexes.icmp(state.record, iter.interval)
            if c == 0  # overlapping
                return false
            elseif c > 0
                # no more overlapping records in this chunk
                break
            end
        end
        state.chunkid += 1
        if state.chunkid ≤ lastindex(state.chunks)
            seek(source, state.chunks[state.chunkid].start)
        end
    end
    # no more overlapping records
    return true
end
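
To illustrate why the extra eof check matters, here is a minimal sketch using only Base Julia. It is not the Indexes.jl code: an IOBuffer stands in for the BGZF-backed stream, and the helper names read_unguarded and read_guarded are hypothetical.

```julia
# Hypothetical helpers; an IOBuffer stands in for the real BGZF-backed stream.

# Reads `n` bytes without checking for end of stream.
# Throws EOFError if the stream holds fewer than `n` bytes.
function read_unguarded(io::IO, n::Integer)
    out = UInt8[]
    for _ in 1:n
        push!(out, read(io, UInt8))
    end
    return out
end

# Same loop, but it stops as soon as the stream is exhausted.
function read_guarded(io::IO, n::Integer)
    out = UInt8[]
    while length(out) < n && !eof(io)
        push!(out, read(io, UInt8))
    end
    return out
end

data = UInt8[0x01, 0x02, 0x03]

# Asking for 5 bytes from a 3-byte stream fails without the guard.
threw = try
    read_unguarded(IOBuffer(data), 5)
    false
catch err
    err isa EOFError
end

# With the guard, the loop ends cleanly with the bytes that were available.
guarded = read_guarded(IOBuffer(data), 5)
```

The guarded loop mirrors the shape of the fix above: the loop condition is checked before every read, so an exhausted stream ends the loop instead of raising an error.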

Steps to Reproduce (for bugs)

Sorry, I encountered this some time ago, so I no longer have the BED files. It may have been triggered by working with concatenated bgzipped files.

Thanks for this report.

If the files were bgzipped, there is a known issue that affects the calculation of the virtual offset. It occurs when multiple threads are in use.

I think this issue will be addressed upstream with BioJulia/BGZFStreams.jl#27.
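
For context, the BGZF format (as described in the SAM specification) packs a virtual offset into 64 bits: the upper 48 bits hold the byte offset of the compressed block within the file, and the lower 16 bits hold the offset within that block's uncompressed data. A minimal sketch of that layout follows; the helper names are hypothetical and are not the BGZFStreams.jl API.

```julia
# Hypothetical helpers, not the BGZFStreams.jl API.

# Pack a compressed-block file offset and an in-block offset into one UInt64.
pack_voffset(block_offset::Integer, inblock_offset::Integer) =
    (UInt64(block_offset) << 16) | UInt64(inblock_offset)

# Split a packed virtual offset back into (block offset, in-block offset).
unpack_voffset(v::UInt64) = (v >> 16, v & 0xffff)

v = pack_voffset(1024, 17)
parts = unpack_voffset(v)
```

Because the block offset occupies the high bits, comparing two packed values as plain integers orders them by position in the file, which is what a check such as `virtualoffset(source) < chunk.stop` relies on.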