epi2me-labs/wf-single-cell

Cellranger style output count matrices - sparse, raw, full gene space

Closed this issue · 6 comments

Is your feature related to a problem?

This request addresses a number of separate issues:

  • the count matrices are very sparse, and a lot of drive space is used up storing zeroes in the TSVs
  • the full cell barcode space matrix is often desirable downstream. This could be because cell calling was already performed on the data in some fashion beforehand (and some of those cells are now missing from the filtered output), or for certain analyses such as soup removal
  • having a consistent, unfiltered gene space allows for easy cross-sample integration, as well as occasional sanity checking as to what reference was used when processing the data (e.g. 36601 lines in genes.tsv tends to imply 10X's GRCh38-2020-A)
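To put a rough number on the first point, here is a minimal sketch comparing the in-memory size of a dense count matrix with its sparse (CSR) form. The matrix dimensions and the Poisson rate are made-up assumptions for illustration, not measurements from wf-single-cell output, and in-memory bytes are only a proxy for TSV size on disk.

```python
import numpy as np
from scipy import sparse


def storage_footprint(n_genes=2000, n_cells=500, rate=0.05, seed=0):
    """Return (dense_bytes, sparse_bytes, fraction_nonzero) for simulated counts.

    All parameters are hypothetical; real single cell matrices are larger
    and their sparsity depends on the protocol and sequencing depth.
    """
    rng = np.random.default_rng(seed)
    # Low-rate Poisson counts: the large majority of entries are zero.
    dense = rng.poisson(rate, size=(n_genes, n_cells)).astype(np.int32)
    csr = sparse.csr_matrix(dense)
    dense_bytes = dense.nbytes
    # CSR keeps only the non-zero values plus two index arrays.
    sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    fraction_nonzero = csr.nnz / (n_genes * n_cells)
    return dense_bytes, sparse_bytes, fraction_nonzero


if __name__ == "__main__":
    dense_bytes, sparse_bytes, frac = storage_footprint()
    print(f"non-zero fraction: {frac:.3f}")
    print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The same argument applies to text output: a dense TSV writes every zero, whereas a coordinate-format MTX file only writes one line per non-zero entry.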

Describe the solution you'd like

Offer Cellranger style count matrix output, with both a raw matrix (full whitelist cell barcode space) and a filtered matrix (your current counts matrix) present, each over the full gene space. Formatted in either the Cellranger v2 or v3 MTX folder style, these will work out of the box with downstream single cell software in Python and R.
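For reference, a minimal sketch of what writing such a folder could look like in Python, following the Cellranger v3 convention of gzipped `matrix.mtx.gz`, `barcodes.tsv.gz` and `features.tsv.gz` files with genes as rows and barcodes as columns. The function name `write_mtx_folder` and its arguments are hypothetical, and this is not the workflow's own implementation.

```python
import gzip
import io
from pathlib import Path

from scipy import sparse
from scipy.io import mmwrite


def write_mtx_folder(out_dir, counts, gene_ids, gene_names, barcodes):
    """Write a genes x barcodes count matrix as a Cellranger v3 style folder.

    Hypothetical helper for illustration only.
    """
    counts = sparse.coo_matrix(counts)
    if counts.shape != (len(gene_ids), len(barcodes)):
        raise ValueError("counts must have shape (n_genes, n_barcodes)")
    if len(gene_names) != len(gene_ids):
        raise ValueError("gene_ids and gene_names must have the same length")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Serialise to Matrix Market text in memory, then gzip it to disk.
    buffer = io.BytesIO()
    mmwrite(buffer, counts)
    with gzip.open(out / "matrix.mtx.gz", "wb") as fh:
        fh.write(buffer.getvalue())

    # One barcode per line, in column order.
    with gzip.open(out / "barcodes.tsv.gz", "wt") as fh:
        for barcode in barcodes:
            fh.write(f"{barcode}\n")

    # v3 features file: gene ID, gene name, feature type, in row order.
    with gzip.open(out / "features.tsv.gz", "wt") as fh:
        for gene_id, gene_name in zip(gene_ids, gene_names):
            fh.write(f"{gene_id}\t{gene_name}\tGene Expression\n")
```

The v2 layout differs mainly in using uncompressed files and a two-column `genes.tsv` in place of `features.tsv.gz`. Calling the helper once with the full whitelist barcodes and once with the called cells would give the raw and filtered folders.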

Describe alternatives you've considered

No reliable alternative comes to mind if a raw matrix is of interest.

Additional context

No response

Hi @ktpolanski Thanks for your suggestion. We'll consider adding these suggestions to a future release.

Seeing how there will probably be a release not long from now to patch the various issues, a decent hotfix for part of the situation would be to accept a custom cell barcode list to use. This will allow retaining more consistency with possible alternate sequencing/mapping/analyses.

I still think that the suggestion brought forward here is correct, but this seems like a relatively low effort compromise for the time being.

I agree with @ktpolanski. So many downstream tools expect sparse Matrix Market format files (like those output by Cellranger/STARsolo) that it seems like the ideal output. Similarly, some tools (e.g., CellBender) require the full count matrix, so that should be included in the output as well (as a sparse matrix).

I can put a pull request together if that's helpful.

MTX format files have been implemented as part of the next release.

Awesome, thanks! Does this also include the raw count matrix?

There's an MTX folder for both the raw counts and the processed/filtered/normalized/log-transformed counts.
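For anyone wanting to sanity check the two folders against each other, here is a minimal loading sketch. It assumes the Cellranger v3 file names (`matrix.mtx.gz`, `barcodes.tsv.gz`, `features.tsv.gz`) and a genes x barcodes orientation; the actual file and folder names in the release are not stated in this thread, so adjust as needed. `load_mtx_folder` is a hypothetical helper name.

```python
import gzip
import io
from pathlib import Path

from scipy.io import mmread


def load_mtx_folder(folder):
    """Load an MTX folder as (csr_matrix, gene_ids, barcodes).

    Assumes Cellranger v3 style file names; hypothetical helper.
    """
    folder = Path(folder)
    # Decompress fully into memory before parsing the Matrix Market text.
    with gzip.open(folder / "matrix.mtx.gz", "rb") as fh:
        matrix = mmread(io.BytesIO(fh.read())).tocsr()
    with gzip.open(folder / "barcodes.tsv.gz", "rt") as fh:
        barcodes = [line.strip() for line in fh if line.strip()]
    with gzip.open(folder / "features.tsv.gz", "rt") as fh:
        # The first column of the features file holds the gene ID.
        gene_ids = [line.split("\t")[0].strip() for line in fh if line.strip()]
    if matrix.shape != (len(gene_ids), len(barcodes)):
        raise ValueError("matrix shape does not match features/barcodes")
    return matrix, gene_ids, barcodes
```

With both folders loaded, `set(filtered_barcodes) <= set(raw_barcodes)` and `raw_gene_ids == filtered_gene_ids` are quick checks that the filtered matrix is a barcode subset of the raw one over the same gene space.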

(It amused me whilst testing the code that reading and writing compressed binary data of the dense matrix was much faster than ASCII COO parsers).