daler/pybedtools

remove_invalid not removing entries with start and stop negative

blaiseli opened this issue · 6 comments

I wanted to shift coordinates in a bed file via pybedtools.

Here is test_shift.py:

from pybedtools import BedTool

def shift_bed_m5(bed):
    """
    Shift bed interval coordinates by minus 5.
    """
    bed.start -= 5
    bed.stop -= 5
    return bed

bed_tuples = [
    ("I", "0", "10"),
    ("I", "20", "25")]
bed_lines = ["\t".join(bed_tuple) for bed_tuple in bed_tuples]
bed_string = "\n".join(bed_lines)

a = BedTool(bed_string, from_string=True)
print("a")
print(a)

a_prime = a.remove_invalid()
print("a_prime")
print(a_prime)

b = a.each(shift_bed_m5)
print("b")
print(b)

c = b.remove_invalid()
print("c")
print(c)

And here is what happens when I run this script:

$ ./test_shift.py
a
I       0       10
I       20      25

a_prime
I       0       10
I       20      25

b
I       -5      5
I       15      20

c

c seems empty, but the second line seems valid, so it should not have been removed.

(I'm using version 0.8.2)

(It would be nice to have a check_invalid method, giving diagnostics about invalid lines.)

OK, got it: the problem is due to the fact that printing b exhausted some underlying generator.

The following works as intended:

c_prime = a.each(shift_bed_m5).remove_invalid()
print("c_prime")
print(c_prime)

I.e., generating the following output:

c_prime
I       15      20

Actually, I was initially trying to reproduce another problem when I fell in the above pitfall.

I'm re-opening to show my actual issue:

#!/usr/bin/env python3

from pybedtools import BedTool

def shift_bed(bed, shift):
    """
    Shift *bed* interval coordinates by *shift*.
    """
    bed.start += shift
    bed.stop += shift
    return bed

bed_tuples = [
    ("I", "0", "10", "A(I:0-10:+)", ".", "+"),
    ("I", "20", "25", "B(I:20-25:+)", ".", "+")]
bed_lines = ["\t".join(bed_tuple) for bed_tuple in bed_tuples]
bed_string = "\n".join(bed_lines)

for s in [-10, -11]:
    print(s)
    bt = BedTool(bed_string, from_string=True)
    print("bt")
    print(bt)

    bt_shifted = bt.each(shift_bed, s).remove_invalid()
    print("bt_shifted")
    print(bt_shifted)

    final = bt.each(shift_bed, s).remove_invalid().sort()
    print("final")
    print(final)

When executing the above:

-10
bt
I       0       10      A(I:0-10:+)     .       +
I       20      25      B(I:20-25:+)    .       +

bt_shifted
I       10      15      B(I:20-25:+)    .       +

final
I       10      15      B(I:20-25:+)    .       +

-11
bt
I       0       10      A(I:0-10:+)     .       +
I       20      25      B(I:20-25:+)    .       +

bt_shifted
I       -11     -1      A(I:0-10:+)     .       +
I       9       14      B(I:20-25:+)    .       +

Traceback (most recent call last):
  File "./test_shift.py", line 29, in <module>
    final = bt.each(shift_bed, s).remove_invalid().sort()
  File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/bedtool.py", line 917, in decorated
    result = method(self, *args, **kwargs)
  File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/bedtool.py", line 396, in wrapped
    stream = call_bedtools(
  File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/helpers.py", line 460, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError: 
Command was:

        bedtools sort -i stdin

Error message was:
Unexpected file format.  Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?

In the second case (shift by -11), the invalid line is not removed.

And if I add the following tuple to my bed_tuples: ("I", "-2", "-1", "C(I:-2--1:+)", ".", "+"), I get an error upon printing the resulting BedTool:

Traceback (most recent call last):
  File "./test_shift.py", line 24, in <module>
    print(bt)
  File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/bedtool.py", line 1224, in __str__
    for i in iter(self):
  File "pybedtools/cbedtools.pyx", line 792, in pybedtools.cbedtools.IntervalIterator.__next__
  File "pybedtools/cbedtools.pyx", line 698, in pybedtools.cbedtools.create_interval_from_list
pybedtools.cbedtools.MalformedBedLineError: Unable to detect format from ['I', '-2', '-1', 'C(I:-2--1:+)', '.', '+']

I tried a simple fix here: https://github.com/blaiseli/pybedtools/tree/issue-343

(Using if 0 <= feature.start <= feature.stop instead of if feature.start <= feature.stop in remove_invalid)

However, it does not seem to have any effect. @daler any clue why ?

I think I have a clue: negative values are overflowed. If I insert printing instructions to show the values of start and stop I get values such as 4294967285.

I suppose this has to do with this line in pybedtools/include/bedFile.h:

typedef uint32_t CHRPOS;

This type is used for start and stop positions, and it may not be an option to change that, due to the interfacing with bedtools.

daler commented

@blaiseli thanks for reporting this, and sorry for the lengthy lag time. The quick fix would be to insert a .saveas() to make sure that .remove_invalid() is working on a file-based BedTool object.

E.g., using your example above, change:

final = bt.each(shift_bed, s).remove_invalid().sort()

to

final = bt.each(shift_bed, s).saveas().remove_invalid().sort()

But edf4b33 should remove the need to do that in the future.