Ellison Lab Drosophila Testes TE project

Requirements

Software for data processing

For processing of raw data, the pipeline requires

Cellranger v3.1.0
snakemake v6.1.0
mamba v0.9.1
peppy v0.31.0
biopython 1.78

Each of these may be installed in a 3-5 minutes with a standard internet connection.

Most other dependencies for processing of raw data are automatically handled by snakemake/mamba during a run when called with the --use-conda flag (see below for running instructions).

For this reason, conda channels must be configured according to instructions available at the bioconda channel page.

Software for figures

To generate figures, R packages must be installed manually. Altogether these packages may be installed in approximately 10 minutes via install.packages() for CRAN hosted packages or BiocManager::install() for Bioconductor packages.

base R v4.0.3
tidyverse v1.3.0
ragg v1.1.2
ggtext v0.1.1
arrow v3.0.0
jsonlite v1.7.2
readxl v1.3.1
rtracklayer v1.50.0
GenomicRanges v1.42.0
patchwork v1.1.1
png v0.1-7
grid v4.0.3
magick v2.7.1
extrafont v0.17
ggforce v0.3.3
ggrepel v0.9.1
GenomicFeatures v1.42.3
ggpubr v0.4.0
ggpointdensity v0.1.0
ComplexHeatmap v2.6.2
broom v0.7.5
VariantAnnotation v1.36.0

System Requirements

The raw data processing pipeline has been tested on CentOS Linux 7. Alignment and quantification steps may be resource intensive. Grid search steps are numerous. Therefore it is recommended to run on an HPC system to make total running time reasonable. We ran on a slurm cluster. Please see the snakemake documentation on system profiles to set up snakemake to run on other cluster submission systems.

Installation

The primary workflow can be cloned from github via

git clone https://github.com/Ellison-Lab/TestisTEs2021

Some large files in the resources/ directory can be pulled with the command below. You may need to install git lfs. See https://git-lfs.github.com/ or https://anaconda.org/conda-forge/git-lfs for details.

git lfs pull

The primary workflow orchestrates subworkflows and figure scripts. The project relies on multiple git submodules. These can be pulled into the main workflow via

git submodule update --init
git submodule foreach git pull origin main

Finally, absolute local paths in the config.yaml file and subsamples_table.csv's should be updated as appropriate.

Alternatively, these submodules can be separately cloned and run separately.

Altogether, these repositories can be downloaded in 2-3 minutes.

Demo/Instructions

This workflow is not standalone tool and is therefore not suited to demo'ing with a small dataset. However the full workflow can be run within 1-3 days on a HPC cluster with modest resource requests.

A dry run can be initiated via the code below. Exclude -n for actual run to produce arrow format files under results/finalized. These hold processed data in a space efficient file format and are easily importable to R as dataframes via the arrow package)

Please note that specifying the python hashseed to 0 specifically is important for scanpy reproducibility, per this issue report.

PYTHONHASHSEED=0 snakemake --use-conda --profile {your system} --cores {n} --config MAIN_CONFIG=config/config.yaml -kp -n

Figures can be run via the below command, where {figure name} follows the pattern "figure1" or "supp1." This will generate PDFs and .rds files under results/panels.

PYTHONHASHSEED=0 snakemake {figure name} --use-conda --profile {your system} --cores {n} --config MAIN_CONFIG=config/config.yaml -kp -n