How can peaks associated with a given gene be extracted?
Closed this issue · 5 comments
Hi,
I want to extract the peaks associated with a given gene based on your RP model. How can I do that?
Many thanks for your help!
This is a feature we are currently working on. I will provide some code for you in the interim later today
I will really appreciate that, thanks so much!!
Okay here's the code:
"get_normalized_params" gets the parameters of the LITE model, so you can get the upstream decay rate of influence of nearby cis-sites:
litemodel_params = litemodel.get_model('LEF1').get_normalized_params()
Use the "distance" key to get the decay distances (in KB, so multiply by 1000 to get bp). Upstream decay is [0], downstream is [1]. Note, upstream and downstream is relative to the gene's strand:
upstream_range = litemodel_params['distance'][0] * 1000
Next we find the idx the gene we wish to analyze to lookup TSS-peak distances:
gene_idx = np.argwhere(atac_data.uns['distance_to_TSS_genes'] == 'LEF1')[0,0]
The distance between every peak and every gene TSS is stored in adata.varm['distance_to_TSS']
, which is a sparse matrix of (n_peaks x n_genes). We index on our gene of interest to get a vector of all the distances between peaks and that gene's TSS:
relevant_peaks = atac_data.varm['distance_to_TSS'].T[gene_idx] #
Peaks which are upstream have distance > 0, downstream are < 0. This line finds all peaks which are within 3*decay rate upstream of the TSS:
upstream_peaks = (relevant_peaks.data < 3*upstream_range) & (relevant_peaks.data > 0)
Then, we use the mask from the step above to index on the atac_data matrix to get the peak locations:
atac_data.var.iloc[ relevant_peaks.indices[upstream_peaks] ]
All together:
litemodel_params = litemodel.get_model('LEF1').get_normalized_params()
upstream_range = litemodel_params['distance'][0] * 1000
gene_idx = np.argwhere(atac_data.uns['distance_to_TSS_genes'] == 'LEF1')[0,0]
relevant_peaks = atac_data.varm['distance_to_TSS'].T[gene_idx]
upstream_peaks = (relevant_peaks.data < 3*upstream_range) & (relevant_peaks.data > 0)
atac_data.var.iloc[
relevant_peaks.indices[upstream_peaks]
]
thanks very much for this.
I get an error when I try to find index for the given gene in the atac_data.uns["distance_to_TSS_genes"]
gene_idx = np.argwhere(atac_data.uns['distance_to_TSS_genes'] == 'LEF1')[0,0]
IndexError: index 0 is out of bounds for axis 0 with size 0
This gene is present in the gene list in both rna and atac TSS:
if "LEF1" in atac_data.uns['distance_to_TSS_genes']:
print("gene is present")
You may be using different names for you genes? Could LEF1 be lowercased in your TSS annotations?