Suggestion for updating the closest function.

Question

Opened this issue 2 years ago · 6 comments

I suggest adding an option for distance-based filtering in the closest function. Currently, closest only allows selecting the number of closest intervals to report with the k parameter. It would be beneficial to include an option to filter intervals based on a maximum distance criterion. This enhancement would provide more flexibility and control in selecting intervals based on their proximity. I recommend considering the addition of this feature to improve the functionality of the closest function.

Answer 1 · 2023-06-07T08:50:22.000Z

Hi, @WANGchuang715, have you considered filtering the dataframe by the distance column after applying closest with return_distance=True?

Answer 2 · 2023-06-07T08:59:47.000Z

> Hi, @WANGchuang715, have you considered filtering the dataframe by the distance column after applying closest with return_distance=True?

That is how I am doing it now. However, I cannot determine an appropriate value for k in advance, so I have to choose a large k, which is inelegant and time-consuming.

Answer 3 · 2023-06-07T18:18:58.000Z

hi Wang, if you are interested in many features in df2 around df1, perhaps

bf.overlap(df1, bf.expand(df2, pad=MAX_DIST))

is what you are looking for, rather than a bf.closest operation?
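
To illustrate why padding followed by an overlap test behaves like a distance cutoff, here is a minimal plain-Python sketch that does not call bioframe. It assumes half-open `[start, end)` coordinates and defines the gap between two intervals as the number of bases separating them (0 when they touch or overlap). The helper names `gap`, `overlaps`, and `padded_overlap` are hypothetical and exist only for this example. Under these assumptions, the padded overlap test agrees with `gap < MAX_DIST`, so a cutoff intended as `<=` would need a pad of `MAX_DIST + 1`.

```python
def gap(a, b):
    """Bases separating half-open intervals a and b; 0 if they touch or overlap."""
    (s1, e1), (s2, e2) = a, b
    return max(s1 - e2, s2 - e1, 0)


def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    (s1, e1), (s2, e2) = a, b
    return s1 < e2 and s2 < e1


def padded_overlap(a, b, pad):
    """Overlap test after extending b by `pad` bases on both sides."""
    s2, e2 = b
    return overlaps(a, (s2 - pad, e2 + pad))


MAX_DIST = 100  # illustrative cutoff, in bases
query = (1000, 2000)
features = [(500, 900), (2050, 2100), (2100, 2200), (2101, 2300), (1500, 1600)]

for feat in features:
    # Print the gap next to the padded-overlap result to compare the two tests.
    print(feat, "gap:", gap(query, feat),
          "padded overlap:", padded_overlap(query, feat, MAX_DIST))
```

A feature exactly `MAX_DIST` bases away, such as `(2100, 2200)` here, is not caught by the padded overlap test, which is worth keeping in mind when comparing against a `distance <= max_distance` filter.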

Answer 4 · 2023-06-08T00:03:53.000Z

> hi Wang, if you are interested in many features in df2 around df1, perhaps
> bf.overlap(df1, bf.expand(df2, pad=MAX_DIST))
> is what you are looking for, rather than a bf.closest operation?

I understand the functionality you mentioned, and in comparison, the "closest" feature aligns better with my requirements. I am using it to find the cis-mRNAs for lncRNAs, so I need to differentiate the upstream and downstream relationships within a certain distance and determine if there is any direct overlap. Currently, the "closest" functionality is able to meet my basic needs, and I also hope that you can consider my suggestion.

Answer 5 · 2023-06-08T00:55:13.000Z

Can you formulate the problem more precisely?

You mention that you are "unable to determine the appropriate value for k", so it sounds to me like what you really want is to make what is known as a "ball query" of some radius around lncRNAs (differentiating by strand, etc.), i.e., you want to catch all cis-mRNAs up to some given maximum distance away from each lncRNA in a particular direction.

Regardless of how this functionality might be exposed, the task I just described would make more sense as an extension of the overlap algorithm, which is a type of ball query algorithm, rather than the closest algorithm, which is a nearest-neighbors algorithm. Am I understanding your goal correctly?
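
The distinction between the two query types can be sketched in plain Python. This is a brute-force illustration, not how bioframe implements either operation, and the names `relation`, `k_nearest`, and `within_radius` are hypothetical. It assumes half-open `[start, end)` intervals on a single chromosome and reports sides as "before" or "after" in coordinate terms; mapping those to upstream and downstream would depend on strand.

```python
def relation(query, feat):
    """Return (side, gap) for half-open intervals.

    side is 'before', 'after', or 'overlap' relative to the query;
    gap is the number of bases separating the two intervals.
    """
    (qs, qe), (fs, fe) = query, feat
    if fe <= qs:
        return "before", qs - fe
    if fs >= qe:
        return "after", fs - qe
    return "overlap", 0


def k_nearest(query, feats, k):
    """Nearest-neighbour query: the k features with the smallest gap, however far away."""
    return sorted(feats, key=lambda f: relation(query, f)[1])[:k]


def within_radius(query, feats, max_dist):
    """Ball query: every feature whose gap is at most max_dist, however many there are."""
    return [f for f in feats if relation(query, f)[1] <= max_dist]


# Made-up coordinates for one lncRNA and four candidate mRNAs.
lnc = (10_000, 12_000)
mrnas = [(9_000, 9_500), (11_500, 13_000), (12_300, 12_800), (40_000, 41_000)]

print(k_nearest(lnc, mrnas, k=2))
for m in within_radius(lnc, mrnas, max_dist=1_000):
    print(m, relation(lnc, m))
```

With a fixed k, the nearest-neighbour query can either miss features inside the radius (k too small) or return features far outside it (k too large), whereas the radius query needs no k at all.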

Answer 6 · 2023-06-08T03:15:01.000Z

> Can you formulate the problem more precisely?
>
> You mention that you are "unable to determine the appropriate value for k", so it sounds to me like what you really want is to make what is known as a "ball query" of some radius around lncRNAs (differentiating by strand, etc.), i.e., you want to catch all cis-mRNAs up to some given maximum distance away from each lncRNA in a particular direction.
>
> Regardless of how this functionality might be exposed, the task I just described would make more sense as an extension of the overlap algorithm, which is a type of ball query algorithm, rather than the closest algorithm, which is a nearest-neighbors algorithm. Am I understanding your goal correctly?

I think they are two different filtering dimensions. Currently, the "closest" functionality filters the nearest k ranges without considering the distance. It selects k ranges that are closest in proximity. What I want is to filter N ranges that are within a certain distance. I believe both of these filtering approaches are necessary in practical applications.
This is how I currently achieve my requirement using the "closest" function, which is actually quite convenient.

import bioframe as bf

# lnc and mRNA are interval dataframes; max_distance is the distance cutoff.
overlap = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_upstream=True, ignore_downstream=True).dropna()
upstream = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_overlaps=True, ignore_downstream=True).dropna()
downstream = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_overlaps=True, ignore_upstream=True).dropna()
upstream = upstream[upstream['distance'] <= max_distance]
downstream = downstream[downstream['distance'] <= max_distance]