# Lexogen pipeline

Developed by: Álvaro Herrero Reiriz (GitHub: [a-hr](https://github.com/a-hr))
This pipeline is designed to process the sequencing results of targeted RNA experiments. It includes the following steps:
- Quality control of the samples using FastQC
- Demultiplexing of the samples by barcode using cutadapt
- UMI extraction using umi_tools
- Read alignment using STAR (optionally, the index can be created)
- UMI deduplication using umi_tools
- Optional indexing of the BAM files to be viewed in IGV (enabled with the `get_bams: true` parameter)
- Target gene quantification using featureCounts. A SAF file (a tab-separated table) containing the target genes and their coordinates is required.
The pipeline is written in Nextflow, a workflow manager that allows the pipeline to run on a wide variety of systems. It is configured to run either on a SLURM-managed HPC cluster or on a local machine, though it can also run on a cloud instance or under other workload managers by editing the configuration file according to the Nextflow documentation.
There are two main ways to run the pipeline:
- Cloning the repository and running the pipeline manually through the command line using Nextflow or nf-core (straightforward, but requires some bioinformatics knowledge).
- Installing EPI2ME and running the pipeline through the EPI2ME interface (user-friendly, but requires installing EPI2ME).
In order to run, at least 32 GB of RAM are required (the human reference index alone is about 27 GB). The pipeline runs on Windows, macOS or Linux. If using Windows, we recommend installing the Windows Subsystem for Linux (WSL).
If using EPI2ME, install it on your system and follow the instructions in the EPI2ME documentation to run the pipeline. To add it to your saved workflows, simply copy this repository's URL and paste it into the "Add workflow" section of the EPI2ME interface.
If using Nextflow/nf-core, clone the repository and install the dependencies; the easiest way to do so is with conda. The pipeline can run on any system that supports Docker or Singularity. If using Windows, we recommend using the Windows Subsystem for Linux (WSL).
The pipeline is especially tailored to run on an HPC cluster, though it can seamlessly run on a local machine and, with some configuration, on a cloud instance.
## Installation

The pipeline can be installed in any directory. Simply `cd` to the desired directory and clone the repository:

```bash
git clone https://github.com/a-hr/lexogen_pipeline.git
```
If using EPI2ME instead:

- Open EPI2ME and go to the "View Workflows" tab.
- On the top right corner, click on "Import workflow".
- Paste the repository's URL (https://github.com/a-hr/lexogen_pipeline) on the pop-up window and click on "Download".
- The workflow will be added to your saved workflows.
## Dependencies

If you are using EPI2ME, the program will guide you through the installation of the dependencies.
WARNING: as of July 2023, there is a bug in Docker Desktop that might cause pipelines to fail. For this reason, as stated by the development team, it is recommended to install Docker manually through the command line (just a couple of commands are needed to have it up and running).
If you are using Nextflow/nf-core, you will need to install them manually. The pipeline requires Docker or Singularity (mainly for HPC clusters) to run.
In order to install Nextflow, you can use the following command:

```bash
conda create -n Nextflow -c bioconda nextflow
```
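To check that the environment works, you can activate it and print the Nextflow version (a quick sanity check, not part of the pipeline itself):

```bash
# Activate the environment created above and verify the installation
conda activate Nextflow
nextflow -version
```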
Process-specific dependencies are automatically downloaded as Docker or Singularity containers. Docker images are pulled automatically at runtime. Singularity images, on the other hand, are cached locally (default location: `{pipeline_dir}/containers`) and can be fetched in advance using the provided Makefile:

```bash
make pull
```
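If you ever need to fetch a single image by hand (for example, on a login node), Singularity can pull it directly from a registry. The image name below is purely illustrative, not necessarily one the pipeline uses:

```bash
# Hypothetical example: pull one container image into the local cache directory
singularity pull containers/fastqc.sif docker://biocontainers/fastqc:v0.11.9_cv8
```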
## Inputs

- FASTQ files: Illumina paired-end FASTQ files (either gzipped or not).
- `fiveprime.csv`: maps file names (without the paired-end suffix "_R" and without extensions) to sample groups; if a single file pair is present, add it as well. This will rename files to classify samples in the output table (e.g. mHTT_WT-1a).
  fiveprime.csv:

  ```csv
  File;Name
  group_1;mHTT
  group_2;control
  ```
  Note that File is the name of the FASTQ file without the suffix (e.g. `lib1_R1.fastq.gz` -> `lib1`) and Name is the arbitrary name of the group or experiment.
- `threeprime.csv`: maps barcodes to sample names for the demultiplexing step.
  threeprime.csv:

  ```csv
  Barcode;Sequence;Sample
  BC1;ATCG;WT-1a
  BC2;TGCA;WT-1b
  BC3;ACGT;KO-1a
  BC4;TGAC;KO-1b
  ```
  Note that Barcode is the arbitrary name of the barcode, Sequence is the sequence that will be searched for in the R2 reads (3' barcodes have higher quality on the reverse strand), and Sample is the name of the specific sample. A sketch of this demultiplexing step is shown below.
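  Demultiplexing by named barcodes with cutadapt generally looks like the following; file names here are hypothetical, and this illustrates the technique rather than the pipeline's exact command:

  ```bash
  # Split paired-end reads by 5' barcode found on R2 (barcodes.fasta built
  # from threeprime.csv); cutadapt expands {name} to each barcode's name
  cutadapt -e 0.15 --no-indels \
      -g file:barcodes.fasta \
      -o 'demux/{name}_R2.fastq.gz' \
      -p 'demux/{name}_R1.fastq.gz' \
      lib1_R2.fastq.gz lib1_R1.fastq.gz
  ```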
- SAF file: a tab-separated table containing the target genes and their coordinates. The pipeline will use featureCounts to quantify the transcripts mapping to these genes.
  target_genes.saf:

  ```text
  GeneID  Chr   Start  End   Strand
  gene1   chr1  1000   2000  +
  gene2   chr1  3000   4000  +
  gene3   chr1  5000   6000  +
  gene4   chr1  7000   8000  +
  ```
  Note that GeneID is the arbitrary name of the gene, Chr is the chromosome, Start and End are the coordinates of the gene, and Strand is the orientation of the strand. The sketch below shows how such a table is typically consumed.
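  Quantification against a SAF table with featureCounts is roughly of this form (a sketch with hypothetical file names, not necessarily the pipeline's exact invocation):

  ```bash
  # Count reads overlapping the target genes defined in the SAF table
  featureCounts -F SAF -a target_genes.saf -o counts.txt sample_dedup.bam
  ```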
## Running the pipeline

Running the pipeline is really straightforward:
- First, add the input files to the input directory (default: `./inputs`):
  - The CSV directory (`./inputs/csvs`) will contain the barcodes (`threeprime.csv`) and the libraries (`fiveprime.csv`). Additionally, the `annotations.saf` file will also be placed here.
  - The FASTQ directory (`./inputs/fastqs`) will contain the FASTQ files to be processed.
  - Locate an existing STAR reference index (version 2.7.10b) or create a new one using the pipeline (requires reference genome `.fa` and annotation `.gtf` files); see the sketch below for the equivalent manual command.
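    Building a STAR index manually is roughly equivalent to the following (a sketch; the pipeline drives this step for you, and the exact parameters may differ):

    ```bash
    # Build a STAR genome index from a reference FASTA and a GTF annotation
    STAR --runMode genomeGenerate \
         --genomeDir ./index \
         --genomeFastaFiles ref_gen/genome.fa \
         --sjdbGTFfile ref_gen/annotation.gtf \
         --runThreadN 8
    ```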
- Then, fill in the `input_params.yaml` file with the desired parameters. The available parameters are described in the Available arguments section; an example is shown below.
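  A minimal sketch of such a file, assuming the default directory layout (the values are illustrative; adjust them to your setup):

  ```bash
  # Write an example input_params.yaml (parameter names are from the
  # Available arguments section; paths and values are placeholders)
  cat > input_params.yaml <<'EOF'
  csv_dir: ./inputs/csvs
  fastq_dir: ./inputs/fastqs
  index_dir: ./index
  suffix: _R
  extension: .fastq.gz
  output_dir: ./output
  get_bams: true
  umi_length: 6
  create_index: false
  EOF
  ```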
- Next, load the conda environment. If you have Nextflow installed system-wide, you can skip this step.

  ```bash
  conda activate Nextflow
  ```
- Now, decide what profile to use:
  - If running on a local machine, use the `local_docker` profile. You can optionally use the `local_singularity` profile if you have Singularity installed.
  - If running on an HPC cluster, use the `cluster` profile. It uses the SLURM executor by default; if your cluster happens to use a different scheduler, you can change the executor in the `./confs/cluster.config` file. Note that it requires Singularity to be available in the main thread's environment.
- Finally, run the pipeline:

  ```bash
  nextflow run main.nf -resume -profile cluster -params-file input_params.yaml
  ```
  The `-resume` flag will resume the pipeline if it was interrupted.
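  If running on a local machine instead, the same command applies with the corresponding profile:

  ```bash
  nextflow run main.nf -resume -profile local_docker -params-file input_params.yaml
  ```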
  If you are working on a cluster, it is good practice to launch the main script as a job itself (make sure that compute nodes can launch jobs on your cluster; contact your system admins if you are not sure). A sample job launching script is provided in `launch_cluster.sh`. You can use it as follows:

  ```bash
  sbatch launch_cluster.sh
  ```
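  For orientation, such a launch script might look roughly like the sketch below; the actual `launch_cluster.sh` in the repository may differ, and the resource values are placeholders:

  ```bash
  #!/bin/bash
  #SBATCH --job-name=lexogen_pipeline
  #SBATCH --cpus-per-task=2
  #SBATCH --mem=8G
  #SBATCH --time=24:00:00

  # The main Nextflow process is lightweight: it submits the heavy pipeline
  # steps as separate SLURM jobs through the cluster profile
  nextflow run main.nf -resume -profile cluster -params-file input_params.yaml
  ```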
## Running on EPI2ME

The pipeline can be run on EPI2ME without any bioinformatics knowledge:
- Open the EPI2ME interface and select "Add a new workflow". Add this custom workflow by pasting the repo link (https://github.com/a-hr/lexogen_pipeline.git).
- Select the workflow and click on "Run workflow".
- Fill in the parameters on the interface. The available parameters are described in the Available arguments section.
- Additionally, on the Nextflow Configuration tab, enter a specific profile:
  - `local_docker` for running on a local machine with Docker.
  - `local_singularity` for running on a local machine with Singularity.
  - `cluster` for running on a SLURM cluster with Singularity.
  - Additional profiles can be added by creating a copy of any of the files in `./confs` and modifying it. Do not forget to add the new profile name and link its `.conf` file in the `nextflow.config` file.
- Run the workflow.
## Outputs

The pipeline will generate the following:
- a directory containing the expression tables both before and after deduplication.
- a directory containing `seqkit stat` runs after every step of the pipeline.
- a `multiqc` report of the pipeline processes.
- a performance report of the pipeline processes.
- optionally, a directory containing the final deduplicated BAM files, so that the alignments can be checked in IGV.
## Available arguments

Arguments can be passed to the pipeline through the `input_params.yaml` file.
Input options:

| Parameter | Description | Type | Default | Required |
|---|---|---|---|---|
| `csv_dir` | `.csv` and `.saf` directory path: the directory where `fiveprime.csv`, `threeprime.csv` and `annotations.saf` are located. | string | `lexogen_pipeline/inputs/csvs` | True |
| `fastq_dir` | `.fastq(.gz)` files' directory path: the directory containing the FASTQ files (either gzipped or not) to be used. | string | `lexogen_pipeline/inputs/fastqs` | True |
| `index_dir` | Reference genome directory path: the directory containing the STAR reference genome if already created. If not, the index created by the pipeline can be saved here. | string | `lexogen_pipeline/index` | True |
| `suffix` | Paired-end pattern: common pattern to detect both paired-end files, without the 1 and 2. For example: `file_R01`; `file_R02` --> `suffix = _R0`. | string | `_R` | |
| `extension` | FASTQ file extension: either `.fastq` or `.fq`, plus gzip compression `.gz`. | string | `.fastq.gz` | |
Output options:

| Parameter | Description | Type | Default | Required |
|---|---|---|---|---|
| `output_dir` | Output directory path: the directory that will contain the output tables, BAMs, QC reports and logs. | string | `lexogen_pipeline/output` | |
| `get_bams` | Whether to output aligned BAMs or not: useful to visualize individual samples with IGV to explore the reads. | boolean | True | |
| `save_index` | Whether to save the generated index to the `index_dir` path. | boolean | | |
UMI options:

| Parameter | Description | Type | Default | Required |
|---|---|---|---|---|
| `umi_length` | Length of the UMI oligos. | integer | 6 | True |
Index creation options:

| Parameter | Description | Type | Default | Required |
|---|---|---|---|---|
| `create_index` | Whether to create the STAR index or not: not necessary if an index is already available (link it with the `index_dir` parameter). | boolean | | |
| `ref_gen` | Directory containing index creation files: a directory containing the reference FASTA and GTF annotations to create the STAR index. | string | `lexogen_pipeline/ref_gen` | |
| `map_size` | STAR parameter used to create the index. | integer | 100 | |
| `sharedMemory` | Enables the use of shared memory to optimize RAM usage of STAR by sharing the reference index across parallel processes. HPC clusters usually do not allow its use: set False when running on a cluster, True otherwise. | boolean | | |