Support grouping over splits in `rgdr`
BSchilperoort opened this issue · 1 comment
BSchilperoort commented
Due to computational limits (applying DBSCAN to every individual train/test split might not be viable), we want to allow users to group splits in RGDR before calculating the DBSCAN clusters.
To do this we need to go through the following steps:
- Calculate the correlation coefficient and p-value for every fold (see #57)
- Determine the p-value mask for every individual split (training data only)
- Reduce this mask over the split dimension with `np.any`
- Apply DBSCAN to the reduced mask
- Recombine the DBSCAN clusters with each split's mask (e.g. for each split's cluster labels: `cluster_labels[~split_mask] = 0.0`)
This way we end up with clusters for each split, with cluster labels that are aligned across splits.
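A minimal sketch of the mask/reduce/cluster/recombine steps above, to make the idea concrete. Everything here is an assumption for illustration and not the `rgdr` implementation: the function name `group_clusters_over_splits`, the `(n_splits, n_lat, n_lon)` layout of the p-values, the `alpha` threshold, and clustering on plain grid indices with scikit-learn's `DBSCAN` (a real setup would likely cluster on lat/lon coordinates and account for the sign of the correlation). The p-values are assumed to be computed beforehand on each split's training data.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def group_clusters_over_splits(p_values, alpha=0.05, eps=1.5, min_samples=1):
    """Hypothetical helper: cluster once on the reduced mask, then re-mask per split.

    p_values: array of shape (n_splits, n_lat, n_lon), from training data only.
    Returns integer labels of the same shape, where 0 means "no cluster".
    """
    # Step 2: p-value mask for every individual split.
    split_masks = p_values < alpha

    # Step 3: reduce the mask over the split dimension.
    reduced_mask = np.any(split_masks, axis=0)

    # Step 4: apply DBSCAN once, to the grid cells in the reduced mask.
    labels_2d = np.zeros(reduced_mask.shape, dtype=int)
    coords = np.argwhere(reduced_mask)  # row-major order, matches boolean indexing
    if coords.size > 0:
        db_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
        # DBSCAN uses -1 for noise and starts clusters at 0; shift so noise becomes 0.
        labels_2d[reduced_mask] = db_labels + 1

    # Step 5: recombine the shared labels with each split's own mask.
    cluster_labels = np.broadcast_to(labels_2d, split_masks.shape).copy()
    cluster_labels[~split_masks] = 0
    return cluster_labels


# Toy example: 3 splits on a 4x6 grid, with two separate significant regions.
p_values = np.ones((3, 4, 6))
p_values[0, 0, 0:2] = 0.01   # split 0 sees part of region A
p_values[1, 0, 1] = 0.01     # split 1 sees another part of region A...
p_values[1, 1, 1] = 0.01
p_values[1, 3, 4:6] = 0.01   # ...and all of region B
p_values[2, 3, 5] = 0.01     # split 2 sees only part of region B

labels = group_clusters_over_splits(p_values)
```

Because DBSCAN runs only once, a region that is significant in several splits gets the same label in each of them, and cells that are not significant in a given split are reset to 0 for that split.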
geek-yang commented
Based on the discussion in issue #71, we will only provide an iterator for the user to walk through all the splits. This gives them the flexibility to perform RGDR (or even a complete ML workflow) on each split. We can further discuss whether we need a function to do "grouping over splits", but at the least we can provide a notebook that shows this as a use case.