buenrostrolab/FigR

Possible bug in runFigRGRN

Closed this issue · 1 comments

Hi Vinay,

Firstly, thank you so much for developing FigR. I've been trying to do this type of analysis using bulk RNA+ATAC data and your implementation has made it so much easier to run (I believe the methodology holds for bulk data, skipping the data smoothing steps of course, but keen to hear if you disagree). Plus having the parallelisation handled properly has made a huge difference!

I have a question and possibly a bug report.

Bug: in

FigR/R/FigR.R

Line 144 in 35299ea

DORCNNpeaks <- unique(dorcTab$Peak[dorcTab$Gene %in% c(g,dorcGenes[DORC.knn[g,]])])

Should the line be: DORCNNpeaks <- unique(dorcTab$Peak[dorcTab$Gene %in% c(g,rownames(dorcMat)[DORC.knn[g,]])])?
If DORCgenes is provided by the user, these don't have to be in the same order as dorcMat, leading to selecting the wrong indices. It would be great if you could double-check this in case I'm misunderstanding the code.

Question: chromVAR samples with replacement when selecting background peaks. I was exploring why some very low correlations are deemed significant and found that in some cases, when chromVAR cannot find enough peaks that match the accessibility/GC content profile, peaks are repeated in the background set. In extreme cases, this can lead to a single peak being used as the background. For these instances, the background is not meaningful and it leads to misleading p-values.
In runGenePeakcorr I chose to add bg <- t(apply(bg, 1, function(x) replace(x, duplicated(x), NA))) after

FigR/R/DORCs.R

Line 264 in 35299ea

bg <- chromVAR::getBackgroundPeaks(ATAC.se,niterations=n_bg)
to remove duplicated peaks from the background.
Any thoughts on this? Another option would be to only test if the number of distinct peaks in the background is above a certain threshold.

Again, thank you for your work on this, and looking forward to your feedback.

Hi there! Thank you for your interest in applying this package to your own work, and for your feedback!

For your first question: You are totally correct that if the dorcGenes param is specified, dorcMat not being subsetted to match will create issues with the indexing into dorcGenes (either index mismatch or even index out of range, depending on the length of the dorcGenes specified and the number of dorc kNNs). The peak indices being queried are all relative to what is in dorcTab, where the goal here is to fetch peaks associated with a given DORC gene and it's DORC gene k-nearest neighbors. In our example, if dorcGenes is left as NULL (default), it is basically set to rownames(dorcMat) or all genes in the matrix, in the same order - so your code would be identical, and work fine. The assumption here was that if following our workflow, dorcMat should only be calculated for dorcGenes, and not ALL genes (recommended). I will update this for now to avoid potential errors, and I will add a separate subsetting step (not in this line of code) to allow for testing only specific genes for the TF-DORC associations to reduce compuitations, which was the sole purpose of adding the dorcGenes parameter

For your second question: Yes correct - chromVAR's background peak determination will indeed pick the same peak many times if there are not many other representative peaks that match it for GC content and overall accessibility across cells / samples. So, my two cents here are that those peaks typically aren't very representative of the dataset (e.g. outlier peaks), and should be weighed less either way. Adding a warning or flag if the background peak representation is off is certainly a good idea, and I can definitely add that in. The issue with simply removing duplicated peaks to deal with this issue, is that now you no longer are using the same true n number of background peak calculations (since the level of re-sampling can vary), which in theory, should be identical across all runs since we are determining a permutation-based p-value. I haven't tested this personally, but it is the subject of current work for future optimizations and we will definitely keep this in mind.