WeiqiangZhou/BIRD

BIRD predictions are very similar from different cell types

Closed this issue · 8 comments

It's a bit weird to me that some of my BIRD output predictions, based on the pre-trained models posted online, return highly similar tracks. Is there a reason for this? The correlation matrix between the samples show differences albeit the correlations being high. PCA indicates a similar trend. These samples are various neuronal cell types spanning from early astrocytes to mature neurons.

> cor(PSEUDOBULK_FPKM_RAW, method = "spearman")
          c1        c2        c3        c4        c5        c6
c1 1.0000000 0.8979631 0.9135389 0.9365468 0.9433184 0.8542504
c2 0.8979631 1.0000000 0.8672490 0.9216230 0.9119155 0.8839981
c3 0.9135389 0.8672490 1.0000000 0.8970959 0.9044194 0.8440176
c4 0.9365468 0.9216230 0.8970959 1.0000000 0.9169760 0.8594397
c5 0.9433184 0.9119155 0.9044194 0.9169760 1.0000000 0.8920262
c6 0.8542504 0.8839981 0.8440176 0.8594397 0.8920262 1.0000000

Output track visualization for chr4:54,906,737-58,906,895 here.

BIRD input and output (wiggles), as well as the above FPKM data.frame (rds) are linked here, if interested.

It seems that your .matched.txt files are mostly zeros. There may be something wrong with the gene id matching.

I checked your data and the gene ids in the data look weird. For "c1_rnaseq-RAW.txt", there are only 174 of them matched with the gene ids in the training dataset.

I have checked some of the gene ids in your files. They don't seem to exist. For example:
https://useast.ensembl.org/Multi/Search/Results?q=ENSG00000030111
https://useast.ensembl.org/Multi/Search/Results?q=ENSG00000022347
https://useast.ensembl.org/Multi/Search/Results?q=ENSG00000052595

I think that's reason why the prediction result is weird. You may want to check if there is something wrong with the gene ids in your input.

After you fix the gene id issue, another suggestion is to build a brain-specific model. There are not many brain or neuron related samples in the training dataset of the 167 samples model (the full list of samples can be found in the RNA-seq data here). If your major goal is to study chromatin accessibility for the neuronal cell types, using a brain-specific model will help. Some matched ATAC-seq and RNA-seq data for brain tissues should work for building such a model.

I'm using Biomart to convert between refseq and ensemble gene IDs. Hmm I'll debug my code:

library(biomaRt)
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
mapping <- getBM(filters="hgnc_symbol", attributes=c("hgnc_symbol", "ensembl_gene_id"), values=row.names(PSEUDOBULK), mart=mart)
mapping$hgnc_symbol <- toupper(mapping$hgnc_symbol)
mapping <- mapping[!duplicated(mapping[,c('hgnc_symbol')]),]
X <- merge(mapping, PSEUDOBULK, by.x = "hgnc_symbol", by.y = 0, all.X = TRUE)

sapply(colnames(PSEUDOBULK), function(x){
  tmp <- cbind.data.frame(gene_id = X[, "ensembl_gene_id"], X[, x, drop = FALSE])
  write.table(tmp, file.path("output", "organoid-hg19", "bird-prediction", "PSEUDOBULK-FPKM-INPUT", sprintf("%s_rnaseq-RAW.txt", x)),
              sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
})

EDIT: updated to correct gene mapping code

Is your input data from mice? The current pre-built models in BIRD are for human. I saw that you change the mouse ensembl id to human by just substituting the first four characters. I don't think that will work.

Ugh dumb move on my part but great catch! These data are actually human so now I corrected the refseq <-> ensemble mapping and get many matched genes:

c1_rnaseq-RAW.txt.match.txt
   14753
c2_rnaseq-RAW.txt.match.txt
   13434
c3_rnaseq-RAW.txt.match.txt
   12389
c4_rnaseq-RAW.txt.match.txt
   12845
c5_rnaseq-RAW.txt.match.txt
   14295
c6_rnaseq-RAW.txt.match.txt
   14713

However, the signal tracks are still very similar. I suppose the issue lies in these cell types, or similar, not existing in the reference atlas. I'm trying to run the supplied distance function but am having issues reading in the expr data (RNA_data_167_cells.rds):

> Y <- readRDS("~/path/to/pseudobulk-fpkm.rds")
> X <- readRDS("~/path/to/BIRD-master/data/RNA_data_167_cells.rds")
sh: pxz: command not found
 Show Traceback
 
 Rerun with Debug
 Error in system("pxz -V", intern = TRUE) : error in running command 

EDIT: including the updated input/output BIRD data with correct gene mapping here.

EDIT2: Reading RDS solved, need base::readRDS and not my wrapper.

Reopening the issue for now.

It seems to be an issue with the xz compression. Try installing pxz in your system.

Since these are different cell types from the same tissue, it is possible that they will be similar in a global scale. I would suggest you look at the local difference and see if the differential loci are related to the different functions of these cell types.

It is also possible that the pre-built model (167 samples model) didn't capture the difference between these cell types due to the lack of brain tissues in the training data.

The output of the distance function returns:

> train_test_dist
[1] 0.6677651 0.6700545 0.6709436 0.6765723 0.6690064 0.6752177

It seems that according to Fig S13 of the 2019 paper that these values show a steep drop-off in predictive accuracy. So I suppose the reference atlas is not resembling these neuronal subtypes.

Yes, I don't think there are many neuronal related samples in the 167 training dataset. You may want to build a brain-specific model instead.