/polya_liftover

A Snakemake Workflow for using PolyA_DB and UCSC LiftOver with CellRanger

Primary language: Python. License: MIT.

polya_liftover - sc/snRNAseq Snakemake Workflow

MIT License | Status: Active | CI/CD | Codestyle: Black | Codestyle: snakefmt

A Snakemake workflow for using PolyA_DB and UCSC Liftover with Cellranger.

Some genes are not accurately annotated in the reference genome. Here, we use information provided by PolyA_DB v3.2 to update the coordinates, then the UCSC Liftover tool to convert them to a more recent genome build. Next, we use Cellranger to create the reference and the count matrix. Finally, by taking advantage of Snakemake's integrated Conda and Singularity support, we can run the whole thing in an isolated environment.

Notes on Installation

Our pipeline is available on GitHub (see below!), on the Snakemake Workflow Catalogue, and on WorkflowHub.

A full walkthrough on how to install and use this pipeline can be found here.

To take advantage of Singularity, you'll need to install it separately. If you are running on a Linux system, then Singularity can be installed from conda like so:

conda install -n snakemake -c conda-forge singularity

It's a bit more challenging for other operating systems. Your best bet is to follow their instructions here. But don't worry! Singularity is not required! Snakemake will still run each step in its own Conda environment; it just won't put each Conda environment in a container.
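As a rough illustration of that fallback, the sketch below builds a Snakemake command line that always enables Conda and adds Singularity only when its executable is found on the PATH. The helper name `build_snakemake_cmd` and the default core count are assumptions made for this example and are not part of the pipeline; `--cores`, `--use-conda` and `--use-singularity` are standard Snakemake options.

```python
import shutil


def build_snakemake_cmd(cores=4, have_singularity=None):
    """Return a Snakemake argv list (hypothetical helper, not part of the workflow)."""
    if have_singularity is None:
        # Detect Singularity by looking for its executable on the PATH.
        have_singularity = shutil.which("singularity") is not None

    # Conda is always used, so every step gets its own environment.
    cmd = ["snakemake", "--cores", str(cores), "--use-conda"]
    if have_singularity:
        # Additionally place each Conda environment inside a container.
        cmd.append("--use-singularity")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_snakemake_cmd()))
```

Passing `have_singularity=False` explicitly gives the Conda-only command, which is handy on systems where Singularity is awkward to install.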

Get the Source Code

Navigate to our release page on GitHub and download the most recent version.

Alternatively, for the bleeding edge, please clone the repo like so:

git clone https://github.com/IMS-Bio2Core-Facility/polya_liftover

⚠️ Heads Up! The bleeding edge may not be stable, as it contains all active development.

Notes on Data

This pipeline expects de-multiplexed fastq.gz files, normally produced by some derivative of bcl2fastq after sequencing. They can (technically) be placed anywhere, but we recommend creating a data directory in your project for them.
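To check what is sitting in such a data directory before a run, a small script along these lines may help. It is only a sketch: it assumes the files follow the usual bcl2fastq naming pattern (`<sample>_S<n>_L<lane>_<read>_001.fastq.gz`), and the function name `find_fastqs` and the default `data` path are hypothetical rather than something the pipeline provides.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed bcl2fastq-style name: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def find_fastqs(data_dir="data"):
    """Group fastq.gz files found under data_dir by sample name."""
    samples = defaultdict(list)
    for path in sorted(Path(data_dir).rglob("*.fastq.gz")):
        match = FASTQ_RE.match(path.name)
        if match:
            # Files that do not follow the naming pattern are skipped.
            samples[match["sample"]].append(path)
    return dict(samples)


if __name__ == "__main__":
    for sample, files in find_fastqs().items():
        print(sample, [f.name for f in files])
```

Samples that are missing from the output are usually a sign that the files were renamed after de-multiplexing.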

Notes on the tools

The analysis pipeline was run using Snakemake v6.11.1. The full version and software lists can be found under the relevant yaml files in workflow/envs. All reasonable efforts have been made to ensure that the repository adheres to the best practices outlined here.

Notes on the analysis

For a full discussion on the analysis methods, please see the technical documentation.

Briefly, gene coordinates are updated with PolyA_DB v3.2, converted to a more recent genome build with Liftover, and then passed to Cellranger to build the reference and produce the count matrix.
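The order of the external tools can be pictured with the sketch below, which only assembles command lines and runs nothing. It is an illustration under assumptions, not the workflow's actual rules: every file name, the `custom_ref` genome name and the helper `pipeline_commands` are placeholders, and any conversion between coordinate formats is left out. The argument layouts follow the general usage of UCSC `liftOver` (`liftOver oldFile map.chain newFile unMapped`), `cellranger mkref` and `cellranger count`.

```python
def pipeline_commands(sample, fastq_dir="data"):
    """Return illustrative argv lists for the external tools, in run order."""
    return [
        # 1. Convert the PolyA_DB-updated coordinates to the newer build.
        [
            "liftOver",
            "updated_genes.bed",         # placeholder input
            "old_to_new.over.chain.gz",  # placeholder chain file
            "lifted_genes.bed",          # placeholder output
            "unmapped.bed",              # records that could not be lifted
        ],
        # 2. Build a Cellranger reference from the new annotation.
        [
            "cellranger", "mkref",
            "--genome=custom_ref",
            "--fasta=genome.fa",
            "--genes=lifted_genes.gtf",
        ],
        # 3. Count reads for one sample against that reference.
        [
            "cellranger", "count",
            f"--id={sample}",
            "--transcriptome=custom_ref",
            f"--fastqs={fastq_dir}",
            f"--sample={sample}",
        ],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("sampleA"):
        print(" ".join(cmd))
```

In the real pipeline these steps are Snakemake rules, so the technical documentation remains the reference for the exact inputs and outputs.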

On Reproducibility

Reproducible results are the cornerstone of the scientific process. By running the pipeline with Snakemake in a Singularity/Docker image using Conda environments, we can pin all software versions, maximising reproducibility.

We also strive to make this pipeline as FAIR/O compliant as possible. In addition to the usual availability on GitHub, it is available at both the Snakemake Workflow Catalogue and WorkflowHub.

Future work

  • Improve species and build handling. See #2
  • Directly download and grep the PolyA_DB data. This will allow users to specify genes only. See #3