Suggestion for updating the closest function.
I suggest adding an option for distance-based filtering to the closest function. Currently, closest only allows selecting the number of closest intervals to report, via the k parameter. It would be useful to also be able to filter intervals by a maximum distance. This would give more flexibility and control when selecting intervals by proximity, and I recommend considering this feature as an improvement to the closest function.
Hi, @WANGchuang715, have you considered filtering dataframe by the distance column after applying closest with return_distance=True?
That is how I am doing it now. However, I cannot know the appropriate value for k in advance, so I have to choose a large k, which is inelegant and time-consuming.
Hi Wang, if you are interested in many features in df2 around df1, perhaps
bf.overlap(df1, bf.expand(df2, pad=MAX_DIST))
is what you are looking for, rather than a bf.closest operation?
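To illustrate why padding and then overlapping acts like a distance cut-off, here is a minimal pure-Python sketch. It does not call bioframe: the helper names (overlaps, gap, pad_interval), the example coordinates, and the half-open (start, end) convention are all assumptions made for this illustration.

```python
# Illustrative only: intervals are half-open (start, end) tuples on one chromosome.

def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def gap(a, b):
    """Number of bases separating a and b (0 if they overlap or touch)."""
    return max(0, b[0] - a[1], a[0] - b[1])

def pad_interval(iv, pad):
    """Extend an interval by `pad` bases on both sides."""
    return (iv[0] - pad, iv[1] + pad)

MAX_DIST = 100
query = (1000, 1200)
features = [(500, 850), (880, 950), (1100, 1300), (1300, 1400), (1500, 1600)]

# Route 1: pad every feature, then keep those that overlap the query.
hits_by_overlap = [f for f in features if overlaps(query, pad_interval(f, MAX_DIST))]

# Route 2: keep features whose gap to the query is strictly below MAX_DIST.
hits_by_gap = [f for f in features if gap(query, f) < MAX_DIST]

print(hits_by_overlap)
print(hits_by_gap)
```

In this sketch both routes select the same features. Note that with half-open intervals the comparison is strict, so a feature exactly MAX_DIST away, such as (1300, 1400) here, is not picked up by the padded overlap.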
I understand the functionality you mentioned, and in comparison, the "closest" feature aligns better with my requirements. I am using it to find the cis-mRNAs for lncRNAs, so I need to differentiate the upstream and downstream relationships within a certain distance and determine if there is any direct overlap. Currently, the "closest" functionality is able to meet my basic needs, and I also hope that you can consider my suggestion.
Can you formulate the problem more precisely?
You mention that you are "unable to determine the appropriate value for k", so it sounds to me like what you really want is what is known as a "ball query" of some radius around lncRNAs (differentiating by strand, etc.), i.e., you want to catch all cis-mRNAs up to some given maximum distance away from each lncRNA in a particular direction.
Regardless of how this functionality might be exposed, the task I just described would make more sense as an extension of the overlap algorithm, which is a type of ball query algorithm, rather than the closest algorithm, which is a nearest-neighbors algorithm. Am I understanding your goal correctly?
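The difference between the two query types can be sketched in a few lines of plain Python. This is only an illustration, not bioframe's implementation: the function names (k_nearest, within_radius), the example coordinates, and the half-open (start, end) convention are assumptions.

```python
# Illustrative only: intervals are half-open (start, end) tuples on one chromosome.

def gap(a, b):
    """Number of bases separating a and b (0 if they overlap or touch)."""
    return max(0, b[0] - a[1], a[0] - b[1])

def k_nearest(query, features, k):
    """Nearest-neighbors query: the k features with the smallest gap."""
    return sorted(features, key=lambda f: gap(query, f))[:k]

def within_radius(query, features, radius):
    """Ball query: every feature whose gap is at most `radius`."""
    return [f for f in features if gap(query, f) <= radius]

lnc = (1000, 1200)
mrnas = [(100, 300), (900, 950), (1150, 1400), (1250, 1300), (5000, 5200)]

print(k_nearest(lnc, mrnas, 2))        # fixed count, whatever the distances are
print(within_radius(lnc, mrnas, 100))  # fixed distance, however many features qualify
```

Here k=2 misses one feature that lies within 100 bases, while k=4 would pull in a feature 700 bases away. The k that reproduces the radius result changes from one query interval to the next, which is why a large k followed by a distance filter is needed with a nearest-neighbors query.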
I think they are two different filtering dimensions. Currently, the "closest" functionality filters the nearest k ranges without considering the distance. It selects k ranges that are closest in proximity. What I want is to filter N ranges that are within a certain distance. I believe both of these filtering approaches are necessary in practical applications.
This is how I currently achieve my requirement using the "closest" function, which is actually quite convenient.
import bioframe as bf

# lnc and mRNA are interval dataframes; max_distance is the cut-off in bp
overlap = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_upstream=True, ignore_downstream=True).dropna()
upstream = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_overlaps=True, ignore_downstream=True).dropna()
downstream = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_overlaps=True, ignore_upstream=True).dropna()
upstream = upstream[upstream['distance'] <= max_distance]
downstream = downstream[downstream['distance'] <= max_distance]
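For comparison, the three-way split with a distance cut-off can be written without choosing k at all. The following is a plain-Python sketch of that logic for a single lncRNA, not a bioframe call. The classify helper and the coordinates are hypothetical, and "upstream" here simply means lower coordinates (strand is ignored), which is an assumption.

```python
# Illustrative only: intervals are half-open (start, end) tuples on one chromosome.

def classify(query, feature, max_distance):
    """Return 'overlap', 'upstream', 'downstream', or None if beyond max_distance."""
    if feature[0] < query[1] and query[0] < feature[1]:
        return "overlap"
    if feature[1] <= query[0]:
        # Feature ends before the query starts (lower coordinates).
        label, dist = "upstream", query[0] - feature[1]
    else:
        # Feature starts after the query ends (higher coordinates).
        label, dist = "downstream", feature[0] - query[1]
    return label if dist <= max_distance else None

lnc = (1000, 1200)
mrnas = [(100, 300), (900, 950), (1150, 1400), (1250, 1300), (5000, 5200)]
max_distance = 100

grouped = {"overlap": [], "upstream": [], "downstream": []}
for m in mrnas:
    label = classify(lnc, m, max_distance)
    if label is not None:
        grouped[label].append(m)

print(grouped)
```

Every mRNA within max_distance is reported in its category, and the two distant ones are dropped, with no k to tune.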