remove_invalid not removing entries with start and stop negative
blaiseli opened this issue · 6 comments
I wanted to shift coordinates in a bed file via pybedtools.
Here is test_shift.py
:
from pybedtools import BedTool
def shift_bed_m5(bed):
"""
Shift bed interval coordinates by minus 5.
"""
bed.start -= 5
bed.stop -= 5
return bed
bed_tuples = [
("I", "0", "10"),
("I", "20", "25")]
bed_lines = ["\t".join(bed_tuple) for bed_tuple in bed_tuples]
bed_string = "\n".join(bed_lines)
a = BedTool(bed_string, from_string=True)
print("a")
print(a)
a_prime = a.remove_invalid()
print("a_prime")
print(a_prime)
b = a.each(shift_bed_m5)
print("b")
print(b)
c = b.remove_invalid()
print("c")
print(c)
And here is what happens when I run this script:
$ ./test_shift.py
a
I 0 10
I 20 25
a_prime
I 0 10
I 20 25
b
I -5 5
I 15 20
c
c
seems empty, but the second line seems valid, so it should not have been removed.
(I'm using version 0.8.2)
(It would be nice to have a check_invalid
method, giving diagnostics about invalid lines.)
OK, got it: the problem is due to the fact that printing b
exhausted some underlying generator.
The following works as intended:
c_prime = a.each(shift_bed_m5).remove_invalid()
print("c_prime")
print(c_prime)
I.e., generating the following output:
c_prime
I 15 20
Actually, I was initially trying to reproduce another problem when I fell in the above pitfall.
I'm re-opening to show my actual issue:
#!/usr/bin/env python3
from pybedtools import BedTool
def shift_bed(bed, shift):
"""
Shift *bed* interval coordinates by *shift*.
"""
bed.start += shift
bed.stop += shift
return bed
bed_tuples = [
("I", "0", "10", "A(I:0-10:+)", ".", "+"),
("I", "20", "25", "B(I:20-25:+)", ".", "+")]
bed_lines = ["\t".join(bed_tuple) for bed_tuple in bed_tuples]
bed_string = "\n".join(bed_lines)
for s in [-10, -11]:
print(s)
bt = BedTool(bed_string, from_string=True)
print("bt")
print(bt)
bt_shifted = bt.each(shift_bed, s).remove_invalid()
print("bt_shifted")
print(bt_shifted)
final = bt.each(shift_bed, s).remove_invalid().sort()
print("final")
print(final)
When executing the above:
-10
bt
I 0 10 A(I:0-10:+) . +
I 20 25 B(I:20-25:+) . +
bt_shifted
I 10 15 B(I:20-25:+) . +
final
I 10 15 B(I:20-25:+) . +
-11
bt
I 0 10 A(I:0-10:+) . +
I 20 25 B(I:20-25:+) . +
bt_shifted
I -11 -1 A(I:0-10:+) . +
I 9 14 B(I:20-25:+) . +
Traceback (most recent call last):
File "./test_shift.py", line 29, in <module>
final = bt.each(shift_bed, s).remove_invalid().sort()
File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/bedtool.py", line 917, in decorated
result = method(self, *args, **kwargs)
File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/bedtool.py", line 396, in wrapped
stream = call_bedtools(
File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/helpers.py", line 460, in call_bedtools
raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError:
Command was:
bedtools sort -i stdin
Error message was:
Unexpected file format. Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?
In the second case (shift by -11), the invalid line is not removed.
And if I add the following tuple to my bed_tuples
: ("I", "-2", "-1", "C(I:-2--1:+)", ".", "+")
, I get an error upon printing the resulting BedTool
:
Traceback (most recent call last):
File "./test_shift.py", line 24, in <module>
print(bt)
File "/home/bli/src/bioinfo_utils/.venv/lib/python3.8/site-packages/pybedtools/bedtool.py", line 1224, in __str__
for i in iter(self):
File "pybedtools/cbedtools.pyx", line 792, in pybedtools.cbedtools.IntervalIterator.__next__
File "pybedtools/cbedtools.pyx", line 698, in pybedtools.cbedtools.create_interval_from_list
pybedtools.cbedtools.MalformedBedLineError: Unable to detect format from ['I', '-2', '-1', 'C(I:-2--1:+)', '.', '+']
I tried a simple fix here: https://github.com/blaiseli/pybedtools/tree/issue-343
(Using if 0 <= feature.start <= feature.stop
instead of if feature.start <= feature.stop
in remove_invalid
)
However, it does not seem to have any effect. @daler any clue why ?
I think I have a clue: negative values are overflowed. If I insert printing instructions to show the values of start and stop I get values such as 4294967285
.
I suppose this has to do with this line in pybedtools/include/bedFile.h
:
typedef uint32_t CHRPOS;
This type is used for start and stop positions, and it may not be an option to change that, due to the interfacing with bedtools.
@blaiseli thanks for reporting this, and sorry for the lengthy lag time. The quick fix would be to insert a .saveas()
to make sure that .remove_invalid()
is working on a file-based BedTool object.
E.g., using your example above, change:
final = bt.each(shift_bed, s).remove_invalid().sort()
to
final = bt.each(shift_bed, s).saveas().remove_invalid().sort()
But edf4b33 should remove the need to do that in the future.