snehamitra/SCARlink

Comparing peak-to-gene linkages across cell types?


Hi authors,

I am wondering what metric could be used to compare the same peak-to-gene linkage across different cell types? According to the tutorial, "the Shapley values converted to z-score are an estimate of predicted gene-tile linkage". On the other hand, p-values and FDRs are "more appropriate for ordering the gene-linked tiles across all genes and celltypes".

I have several cases where the same linkage passes both the FDR < 1e-3 and z-score > 0.05 thresholds in two cell types. In one example, there is a known significant differential peak that is higher in celltype1 than in celltype2. In celltype1, z-score = 4.6 and FDR = 4e-5; in celltype2, z-score = 0.07 and FDR = 2.5e-9. On the visualization plot, the blue dot is much darker in celltype1. However, the FDR is more significant in celltype2.

I am wondering (1) which metric should be used for the comparison here, and (2) if the z-score can be used, whether the difference between z-scores can indicate the difference in linkage strength between celltype1 and celltype2.

Thank you!

The FDR significance is largely driven by how accessible the tile is. We found FDR to be useful when ranking gene-linked tiles across different genes.

The z-scores, by contrast, are useful for comparing gene-linked tiles for the same gene across cell types. Based on the values you shared, I would assume the tile is accessible in both celltype1 and celltype2, but it is predicted to be a strongly linked tile only in celltype1, where the z-score is high. It would make sense to use the difference between z-scores to quantify the difference in linkage strength.
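As an illustration of the z-score comparison described above, here is a minimal pandas sketch. It assumes a long-format table with one row per gene, tile, and cell type. The column names `gene`, `tile`, and `celltype` and the helper `zscore_difference` are hypothetical and not part of SCARlink; only `z-score` and `FDR` appear in this thread. The first tile reuses the values quoted above, and the second tile is made-up toy data.

```python
import pandas as pd

# Toy table. 'gene', 'tile', and 'celltype' are assumed column names;
# the second tile's values are invented for illustration.
df = pd.DataFrame({
    "gene": ["GENE_A"] * 4,
    "tile": ["chr1:1000-1500", "chr1:1000-1500",
             "chr1:2000-2500", "chr1:2000-2500"],
    "celltype": ["celltype1", "celltype2", "celltype1", "celltype2"],
    "z-score": [4.6, 0.07, 1.2, 1.5],
    "FDR": [4e-5, 2.5e-9, 1e-4, 2e-4],
})

def zscore_difference(table, a, b):
    """Hypothetical helper: per (gene, tile), z-score in `a` minus z-score in `b`."""
    # One row per (gene, tile), one column per cell type.
    wide = table.pivot_table(index=["gene", "tile"],
                             columns="celltype", values="z-score")
    wide["delta_z"] = wide[a] - wide[b]
    # Largest positive difference first: tiles more strongly linked in `a`.
    return wide.sort_values("delta_z", ascending=False)

diff = zscore_difference(df, "celltype1", "celltype2")
print(diff)
```

A positive `delta_z` would suggest a stronger predicted link in the first cell type. Whether a given difference is meaningful is a separate question that this sketch does not address.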

Thank you for the response! I have a follow-up question: when I compare gene-linked tiles for the same gene across cell types/conditions, I observe that some regions can have large z-scores but FDR = 1 (please see below; cond1 and cond2 are the two conditions I am comparing). My understanding is that FDR = 1 indicates the site is not necessary for predicting target gene expression (removing the site won't change the predicted expression level), yet it is still a strong peak linked to the gene. The sites with FDR = 1 do have lower ATAC signal, but they are still obvious peak regions.

My questions are (1) how should I interpret the discrepancy between the FDR and the z-score, and (2) if my goal is to compare two conditions and see whether a site/peak has a stronger link to a gene in either condition, is the FDR filtering still necessary?

[Image: Screenshot 2024-08-16 at 12 46 05 PM]

Since the FDR is sensitive to the accessibility of the tile, it can be 1 if the overall accessibility in the tile is too sparse. In such cases, even if the z-score is high, the link could be a false positive.

For instance, in the following example data set, the tiles with FDR=1 are more sparse compared to tiles with FDR < 0.05.

>>> df[(df['FDR'] == 1)]['test_acc_sparsity'].mean()
0.0035214338560441245
>>> df[(df['z-score'] > 1) & (df['FDR'] == 1)]['test_acc_sparsity'].mean()
0.006309513875421772
>>> df[(df['z-score'] > 1) & (df['FDR'] < 0.05)]['test_acc_sparsity'].mean()
0.08522307286150335

You can use the z-scores to compare the strength of the gene-linked tiles, but filtering on FDR lets you restrict the comparison to tiles that are not too sparse.
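One way to combine the two steps is to filter on FDR first and then compare z-scores on what remains. The sketch below is an assumption-laden illustration, not SCARlink's API: the helper `filtered_delta_z`, the column names `gene`, `tile`, and `condition`, the 0.05 cutoff, and the toy values are all hypothetical. Whether a tile must pass the FDR cutoff in both conditions or only one is a judgment call, so it is exposed as a parameter.

```python
import pandas as pd

def filtered_delta_z(table, cond_a, cond_b, fdr_cutoff=0.05, require_both=True):
    """Hypothetical helper: drop rows failing the FDR cutoff, then compare z-scores."""
    kept = table[table["FDR"] < fdr_cutoff]
    wide = kept.pivot_table(index=["gene", "tile"],
                            columns="condition", values="z-score")
    # Make sure both condition columns exist even if one was filtered out entirely.
    wide = wide.reindex(columns=[cond_a, cond_b])
    if require_both:
        # Keep only tiles that passed the cutoff in both conditions.
        wide = wide.dropna()
    wide["delta_z"] = wide[cond_a] - wide[cond_b]
    return wide

# Made-up toy data: tile_2 has a high z-score but FDR = 1 in cond2,
# and tile_3 has FDR = 1 in both conditions.
links = pd.DataFrame({
    "gene": ["GENE_A"] * 6,
    "tile": ["tile_1", "tile_1", "tile_2", "tile_2", "tile_3", "tile_3"],
    "condition": ["cond1", "cond2"] * 3,
    "z-score": [3.0, 1.0, 2.5, 2.0, 1.8, 1.9],
    "FDR": [1e-4, 1e-3, 1e-3, 1.0, 1.0, 1.0],
})

strict = filtered_delta_z(links, "cond1", "cond2")
lenient = filtered_delta_z(links, "cond1", "cond2", require_both=False)
print(strict)
print(lenient)
```

With `require_both=False`, a tile that passes in only one condition is kept with a missing `delta_z`, which flags it as a candidate condition-specific link rather than silently dropping it.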