A manual for using RGStraP: a bioinformatics pipeline which calculates PCs showing genetic stratification from RNA-seq data.

RGStraP

RGStraP (RNA-seq-based Genetic Stratification PCs) is a bioinformatics pipeline that calculates principal components (PCs) capturing genetic stratification from RNA-seq data. It relies mainly on GATK4 for variant calling and FlashPCA2 for principal component analysis (PCA). The pipeline is built with Snakemake.

[Figure: RNAvc_Figure1]

Contact

Muhamad Fachrul, mfachrul@student.unimelb.edu.au

Citation

Fachrul, M., Karkey, A., Shakya, M., Judd, L. M., Harshegyi, T., Sim, K. S., Tonks, S., Dongol, S., Shrestha, R., Salim, A., Baker, S., Pollard, A. J., Khor, C. C., Dolecek, C., Basnyat, B., Dunstan, S. J., Holt, K. E., & Inouye, M. (2023). Direct inference and control of genetic population structure from RNA sequencing data. Communications Biology, 6(1), Article 1. https://doi.org/10.1038/s42003-023-05171-9

Requirements

Most of the dependencies are installed when you create the Conda environment described below: FastQC v0.11.8, Trim galore v0.6.0, BBMap (for Clumpify.sh), STAR v2.7.10a, Picard v2.24.0, Samtools v1.8, GATK4 v4.0.6.0, and PLINK 1.9 v1.90b6.16.

Please install FlashPCA v2.0 from source.

How to use

Installing Conda and snakemake

  • Install a Conda-based Python3 distribution such as Miniconda or Mambaforge. The example below uses Mambaforge.
curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh -o Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
  • Create and/or navigate to the directory in which you want the analysis of your project to take place, then clone this repository.
git clone https://github.com/fachrulm/RGStraP
  • Change into the RGStraP directory, and create a Conda environment to run the pipeline.
cd RGStraP

# Activate Conda environment
conda activate base

# Create RGStraP environment
mamba env create --name RGStraP --file environment.yaml
  • Activate the RGStraP environment. This environment needs to be active every time you want to use the pipeline.
conda activate RGStraP

# To deactivate the environment
conda deactivate
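Once the environment is active, a quick check that the bundled tools are reachable can save a failed run later. The following is a minimal sketch, not part of RGStraP itself: the executable names in `ASSUMED_TOOLS` are assumptions based on how these packages are commonly installed, and may differ from what environment.yaml actually provides. FlashPCA is left out because it is installed separately and its path is set in the config file.

```python
import shutil

# Assumed executable names for the dependencies listed under Requirements.
# These are illustrative; adjust them to match your environment.
ASSUMED_TOOLS = [
    "fastqc",
    "trim_galore",
    "clumpify.sh",
    "STAR",
    "picard",
    "samtools",
    "gatk",
    "plink",
]


def missing_tools(tools):
    """Return the subset of tool names that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed tools were found on PATH.")
```

Run it with the RGStraP environment active; any name it reports is either missing or installed under a different executable name.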

Running the pipeline

  • Edit the config/config.yaml file so that its paths point to the required files on your system. Variables to modify include:
    • Path to a file listing ONLY the first mate (read 1) of each paired-end fastq sample to be analyzed.
    • Path to metadata file (required for adding read-group information with GATK).
      • Must be a tab-delimited file with 6 columns and no header. The first column contains BAM file locations in the format 2_mapped/[FILENAME]_Aligned.sortedByCoord.out.bam; the next five columns are read-group ID, platform, sample name, library, and platform unit, in that order.
      • More info here, example here.
    • Path to directory of reference genome index generated by STAR.
    • Path to reference genome fasta file.
    • Path to indel files (for GATK's BaseRecalibrator).
    • Path to flashpca.
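The metadata file format described above (six tab-separated columns, no header, BAM paths of the form 2_mapped/[FILENAME]_Aligned.sortedByCoord.out.bam) is easy to get subtly wrong. The sketch below checks those stated rules before a run. The function name `validate_metadata` and the error wording are illustrative and not part of RGStraP; the check also assumes that none of the five read-group columns may be empty.

```python
import re
import sys

# BAM path format taken from the description of the metadata file.
BAM_PATTERN = re.compile(r"^2_mapped/.+_Aligned\.sortedByCoord\.out\.bam$")

# Column order as described: BAM path, then the five read-group fields.
FIELDS = ["BAM path", "read-group ID", "platform",
          "sample name", "library", "platform unit"]


def validate_metadata(lines):
    """Return a list of problems found in the given metadata lines."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # blank lines are ignored
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            problems.append(
                f"line {number}: expected {len(FIELDS)} tab-separated "
                f"columns, found {len(columns)}"
            )
            continue
        if not BAM_PATTERN.match(columns[0]):
            problems.append(f"line {number}: unexpected BAM path '{columns[0]}'")
        # Assumption: every read-group field must be non-empty.
        for name, value in zip(FIELDS[1:], columns[1:]):
            if not value.strip():
                problems.append(f"line {number}: empty {name}")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        issues = validate_metadata(handle)
    print("\n".join(issues) if issues else "Metadata file looks consistent.")
```

Saved under a name of your choice, the script takes the metadata file path as its only argument and prints one line per problem.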
  • Please adjust the 'dupedist' value according to your sequencing platform in the scripts/clumpify_OpDup.sh file (recommendations included within the script).
  • Test the pipeline by performing a dry-run.
snakemake -n
  • Running the pipeline on a cluster with a workload manager / job scheduler such as Slurm is highly recommended. An example Snakemake profile for Slurm is included.
    • Please modify the partition name in slurm/config.yaml file accordingly.
    • You can also modify the maximum number of jobs to be run at once in the slurm/config.yaml file.
# To run pipeline on slurm
snakemake --profile slurm
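The sample list referenced in config/config.yaml must contain only the first mate of each paired-end sample. One way to build such a list is sketched below. The `_1.fastq.gz` suffix is an assumed naming convention, and writing full paths rather than bare file names is also an assumption; check both against your data and the example files in the repository.

```python
from pathlib import Path


def first_mate_files(fastq_dir, r1_suffix="_1.fastq.gz"):
    """Return sorted paths of files in fastq_dir whose names end with r1_suffix.

    r1_suffix is an assumed convention for read-1 files; change it to
    match your data (for example "_R1.fastq.gz").
    """
    directory = Path(fastq_dir)
    return sorted(
        str(path)
        for path in directory.iterdir()
        if path.is_file() and path.name.endswith(r1_suffix)
    )


def write_sample_list(paths, out_file):
    """Write one path per line to out_file."""
    Path(out_file).write_text("".join(f"{path}\n" for path in paths))
```

For example, `write_sample_list(first_mate_files("fastq"), "samples.txt")` would write the read-1 files found in a hypothetical `fastq` directory to `samples.txt`, which can then be referenced from the config file.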

Running the pipeline from VCF file (lite version)

RGStraP can also calculate RNA-seq-based genetic stratification PCs (RG-PCs) from existing VCF files via the lite version.

  • Make sure to modify the config/lite_config.yaml file accordingly.
# To run the lite pipeline (here with 2 cores)
snakemake -s lite_Snakefile --cores 2

License

Apache 2.0 License