open2c/bioframe

Overlapping intervals in single bed

Closed this issue · 2 comments

Hi,
I've got a single BED file (made by merging a lot of BED files) like this:

chr1	13332	13701	sample1	0	+	13332	13701	255,128,128
chr1	13338	13695	sample2	0	+	13338	13695	255,128,128
chr1	13330	13710	sample3	0	+	13330	13710	128,179,255
chr1	13320	13690	sample4	0	+	13320	13690	128,179,255

My goal is to find regions that overlap by at least 90% of both regions and merge them into a single one, to finally get output like this:
chr1 13320 13710 merged_4_samples 0 + 13320 13710 255,128,128 sample1, sample2, sample3, sample4
Or something like this, that is, finding the overlapping intervals and adding a column with their IDs (afterwards I can merge the rows to get the widest range):

chr1	13332	13701	sample1	0	+	13332	13701	255,128,128   sample1, sample2, sample3, sample4  
chr1	13338	13695	sample2	0	+	13338	13695	255,128,128   sample1, sample2, sample3, sample4
chr1	13330	13710	sample3	0	+	13330	13710	128,179,255   sample1, sample2, sample3, sample4
chr1	13320	13690	sample4	0	+	13320	13690	128,179,255   sample1, sample2, sample3, sample4

I tried the coverage function, but it needs 2 inputs:
import bioframe as bf
df = bf.coverage(df1, df2)
# keep only intervals covered over at least 50% of their length
df = df[(df["coverage"] / (df["end"] - df["start"])) >= 0.50]

Could you please tell me if such a thing is even possible? Do there always have to be 2 inputs?

Thanks in advance for your help.
Best,
Anna

Phlya commented

If I understood you correctly, this could be the starting point for you: https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.cluster

Resolving for now, but feel free to re-open if you have additional questions!
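One caveat regarding the 90% requirement in the question: clustering groups any intervals that overlap, whatever the overlap fraction. The following is a plain-Python sketch (not bioframe API) of a reciprocal 90% check, assuming all intervals are on the same chromosome. The helper names `reciprocal_overlap` and `group_by_reciprocal_overlap` are hypothetical, and the grouping is single-linkage, so two intervals can end up together through a chain of qualifying pairs even if they do not pass the threshold directly.

```python
from itertools import combinations

# Intervals from the question as (name, start, end); all on chr1.
intervals = [
    ("sample1", 13332, 13701),
    ("sample2", 13338, 13695),
    ("sample3", 13330, 13710),
    ("sample4", 13320, 13690),
]


def reciprocal_overlap(a, b):
    """Smaller of the two fractions: shared length / own length."""
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def group_by_reciprocal_overlap(items, threshold=0.9):
    """Link every pair that passes the threshold (union-find), then
    report each group as (widest start, widest end, member names)."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if reciprocal_overlap(items[i], items[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for idx, item in enumerate(items):
        groups.setdefault(find(idx), []).append(item)
    return [
        (min(s for _, s, _ in g), max(e for _, _, e in g), [n for n, _, _ in g])
        for g in groups.values()
    ]


merged_groups = group_by_reciprocal_overlap(intervals, threshold=0.9)
print(merged_groups)
```

On the four example rows every pair passes the 90% threshold, so a single group spanning 13320 to 13710 comes out. The pairwise loop is quadratic, which is fine for a sketch but would need an interval join for a genome-wide file.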