lareaulab/psix

junctions2psi taking extremely long time

Closed this issue · 10 comments

Hello,

I'm trying to get this running on our dataset which is a 10x STARSolo SJ run.
For some reason it's taking a long time (like 3 hours and doesn't finish)

psix_object = psix.Psix()
psix_object.junctions2psi(
        sj_dir='~/Solo.out/SJ/raw/',
        intron_file='~/sc_splicing_tools/psix_annotations.tab',
        save_files_in='~/psix_output/',
        tenX = True,
        solo = True
    )

Any ideas on how to speed this up?

Thanks,
Chang

Thanks for bringing this to my attention. This might be a memory issue due running to a very large dataset. Just loading a large .mtx file seems to take several minutes, using either scipy or scanpy. I built Psix with (smaller) smart-seq2 datasets in mind, so I am still testing some of the 10X functionalities.

In particular, STARsolo does not filter barcodes by SJ, which may result in an oversized matrix (alexdobin/STAR#1138). Maybe you could try making the matrix smaller, by filtering the barcodes by gene expression first?

Thanks, filtering solved that issue, however I now get a different error:

Processing STARsolo output. This might take a few minutes...                                                                                                                                                     Traceback (most recent call last):                                                                                                                                                                                 File "<stdin>", line 1, in <module>                                                                                                                                                                              File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/psix/psix.py", line 103, in junctions2psi
    min_observed = min_observed, tenX = tenX)                                                                                                                                                                      File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/psix/solo2psi.py", line 178, in solo_to_psi
    intron_mtx_CI, intron_mtx_exons = process_solo(solo_dir, intron_file, cell_list)                                                                                                                               File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/psix/solo2psi.py", line 80, in process_solo
    intron_mtx = intron_events.drop_duplicates().merge(matrix.drop_duplicates(), left_on='intron', right_index=True)[matrix.columns]
  File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 9199, in merge
    validate=validate,                                                                                                                                                                                             File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 120, in merge
    validate=validate,
  File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 700, in __init__
    ) = self._get_merge_keys()
  File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 1117, in _get_merge_keys
    left_keys.append(left._get_label_or_level_values(k))
  File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 1779, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'intron

psix_annotation.tab.gz
I've attached my annotations above.

Nvm it was just the header that was missing.
It finally finishes but the output directory is empty...
psix_object.adata.uns['mrna_per_event']
yields a 12243 rows × 0 columns

I've never seen this happen before, but my guess is that none of the annotated exons has enough observations in your dataset. By default, Psix requires every exon to be observed in at least 25% of all the cells in the dataset (i.e., it will not consider an exon with more than 75% missing observations). In my experience, this is a reasonable requirement for a relatively homogeneous smart-seq2 dataset, such as the mouse midbrain development dataset that we analyzed on our manuscript (https://github.com/lareaulab/analysis_psix/blob/main/midbrain_development/midbrain_development.ipynb).

However, this could be too strict for 10X data, given its low capture efficiency (compared to smart-seq2). The problem could be exacerbated if you have a large and diverse dataset in which different cell groups have very different transcriptome profiles. E.g., let's say that your dataset is composed of 20% neurons, and you are interested in an exon that varies between neuron types, in a gene that is only expressed in neurons. Psix will likely ignore this exon, given that the gene is not expressed in the other 80% of the cells. In addition, please bear in mind that 10X data is already limited for analyzing cassette exons. You will only be able to analyze cassette exons that are close enough to the 3' end to have both their inclusion splice junctions covered by your reads.

So, my recommendation is that you lower the observation requirements for Psix, by lowering minPsi, and min_observed parameters of junctions2psi. Maybe you could try minPsi=0.01, and min_observed=0.05. In addition, if you have well defined and discrete groups of cells (e.g., all the different tissues in the Tabula Muris dataset), you could try to run Psix independently in each discrete group of cells, instead of running it in the entire dataset.

I hope this helps.

Ah thanks for the detailed explanation, you're right this dataset is very heterogeneous, multiple different cell types and lineages (~15-20 distinct clusters) with some rare cell types as well. I'll definitely subset the dataset. The dataset is sequenced to 60% saturation so I do pick up a decent amount of internal reads (not 3') per UMI duplicates.

Thanks,
Chang

subset.tar.gz
So I tried to subset on only 500 cells, which was a single cluster in my dataset and I still get an empty matrix. I've attached it above, I haven't filtered the rows (SJ) which I assume psix will do anyway.

I will take a look at your data after Labor Day, and see what's going on.

In the meantime, have you tried lowering the min_observed and minPsi parameters? This might help us figure out if it's an issue with the data, or if it's a bug somewhere.

I set both to 0.01.

Thanks,
Chang

Hello and thanks for your patience. I took a look at your data, and the problem seems to be that it is extremely sparse. The 498 cells have a median of 6 total splice junction reads, across all the 42,230 splice junctions in the matrix. Only 1 in 10,000 entries in the matrix has a value other than zero.

To consider an exon, Psix requires at a minimum one read covering each of its three splicing junctions (two for inclusion, one for exclusion), on any cell (it can be different cells). Due to the sparsity of the data, none of the cassette exons pass this minimum requirement in this dataset. In other words, for all cassette exons in the annotation, at least one of the three splicing junctions has zero reads covering it in all cells.

Psix will require more data to work. Unfortunately, I think it will be very difficult to analyze alternative splicing changes in this dataset. Although it might be possible that it would work in a different subset of cells with better splice junction read coverage.

Please let me know if you have any more questions or concerns. Otherwise, I will close this issue.

I see, I did notice my matrix changing significantly when I turned on two pass mapping, I'll probably turn it off since it aligns more reads to novel junctions as a result might be filtered when intersecting the known junctions. I Thanks for getting to the bottom of this.

Best,
Chang