/micromake

Primary LanguagePython

micromake

Background

micromake is a pipeline built with Snakemake to provide a flexible and comprehensive analysis of microbiological sequencing data. The pipeline streamlines the entire bioinformatics workflow, from initial quality control and read cleaning, to genome alignment and variant identification. By automating these critical data processing steps, the pipeline empowers researchers to quickly move on to the more exciting phases of analysis and insights generation. The pipeline leverages a variety of well-established bioinformatics tools to ensure thorough, high-quality results, delivering a robust and efficient platform to support microbiological research.

System requirements

  • Linux distribution
  • Your favorite code editor (Visual Studio code, sublime text, Pycharm, etc)
  • snakemake version 8.10.0 or later. One of the errors experienced during development 'PosixPath' object has no attribute 'startswith' due to an incompatible version of snakemake/python.

File names and naming

  • The desired file naming is _R1, _R2 for this pipeline. If your file naming does not conform to this, the pipeline will rename the files to match this requirement.

Reference genome

  • For a better run, choose to work with a reference genome that requires to be downloaded. Make sure to include the link to the reference genome in the links.txt file
  • If you have a reference genome, store it in the ref genome folder. Ensure that you only have one ref genome for each run, otherwise, the downstream analyses will fail.

Test data

  • There is test data, which are file links to test this pipeline in the links.txt file. You can remove these links to work with your own file links

How to run the pipeline:

  • With this version, clone this repository in your local environment

    https://github.com/GeOdette/micromake.git

  • Next, install the necessary bioinformatics tools required for the pipeline. Use the setup.sh file at the base of the folder.

    ./setup.sh

  • Alternatively, you can look at the requirements.txt file and install the tools manually

  • Confirm that snakemake has been installed in your system by running snakemake --version. This should give you the latest version of snakemake.

  • Before running this pipeline, this pipeline assumes that you have a list of links that you want to process.

  • Ensure these links are stored in a file called links.txt

  • This pipeline is developed using default settings. To enhance your run, edit the config.yaml file in the config directory to your desired needs.

Running the pipeline

  • To run the pipeline, use the following code:

  • Ensure you activate your conda environment and have snakemake version 8 and above

  • Change into the project directory. Specifically, micromake

    snakemake --profile config/

  • NOTE: The number of cores has been set in the config file. You can adjust that depending on your compute resources

Tools executed by the pipeline

This pipeline executes the following tools

  • FASTQC for quality checks/screening

  • FASTP for quality control with an option to run trimmomatic

  • BWA for alignment/genome mapping

  • SAMTOOLS for sorting and indexing

  • BCFTOOLs for variant calling with an option for freebayes

  • MULTIQC for generating quality reports

Expected outputs

The pipeline will generate a results folder with the following files:

  • results/data folder with the fastq files

  • results/fastqc_output folder containing the fastqc results. This folder also contains all_summary.txt file that contains summary statistics from all fastqc runs

  • results/ref folder containing the reference genome

  • results/bam folder containing bam files and a .txt file with the summary statistics of the alignment

  • results/trimmed containing trimmed files

  • results/trimmed/fastqc_out containing fastqc output of the trimmed files.

  • results/variants/bcf containing files of filtered and unfiltered/raw vcfs called from bcftools. The folder also has a .txt file containing summary statistics of the variant call

  • results/multiqc containing multiqc report

Running into errors:

  • If you run into an error due to the bioinformatics tools used, consider restarting the pipeline again.
  • The pipeline will pick from processes you have not run.
  • Errors can occur not due to the pipeline but sequence files used In these instances, be sure to correct the files and start the run.
  • For a smooth run, use the command snakemake --profile config/ --rerun-incomplete

The config file

  • You may edit the config file to include as many parameters as you want.