genome_mapping: A Python repository from imrambo

Author: Ian Rambo Contact: ian.rambo@utexas.edu v.1.0.0.2

Reproducible read mapping builds with SCons.

This wrapper maps FASTQ reads against an assembly (e.g. genome) in FASTA format using BWA-MEM. Additional functionality identifies single, R1-R2 paired, or interleaved FASTQ files and maps them to an assembly accordingly.

This script is intended to ensure you:

can reproduce large-scale mapping runs
do not repeat completed steps that might take a long time
do not produce huge intermediate files (e.g. unfiltered SAM)
can automate mapping depending on FASTQ type

Currently optimized for Linux 64-bit systems.

############### Getting Started ############### Cloning from the GitHub repository:

git clone https://github.com/imrambo/genome_mapping.git

Environment

Prerequisites: conda, python=3.8.1, samtools=1.10, biopython, python-magic=0.4.15 bwa=0.7.17, scons=3.1.2

I recommend using a conda environment (either with Miniconda3 or Anaconda) to run this script. You can create the conda environment using the "environment_linux64.yml" file:

conda env create -f environment_linux64.yml

and then activate the environment:

conda activate scons_map

or, depending on your system:

source activate scons_map

Running the script

To see the available options for SCons, as well as the local options for the mapping, run this in the same directory as the SConstruct file:

scons -h

Local Options: --fastq_dir=FASTQ_DIR directory containing fastq files --assembly=ASSEMBLY path to assembly to map reads to --outdir=OUTDIR path to output directory --sampleids=SIDS identifier for sample fastq files to be globbed, e.g. AB for AB*.fastq.gz. Multiple identifiers can be specified in a single string when separated by commas, e.g. AB,MG,Megs --align_thread=ALIGN_THREAD number of threads for BWA-MEM aligner --samsort_thread=SAMSORT_THREAD number of threads for samtools sort. Default = 1 --samsort_mem=SAMSORT_MEM memory per thread for samtools sort. Specify an integer with K, M, or G suffix, e.g. 10G. Default = 768M. --nheader=NHEADER number of headers from fastq file for determining if interleaved. If 0, use all headers. Default = 0 --tmpdir=TMPDIR output directory for samtools sort temporary files. Default = /tmp --rm_local_build=RMBUILD only keep the build targets in the --outdir. Will remove build targets in the temporary build within SConstruct directory. Specify 0 (keep) or 1 (remove). Default is 0. --noIntraDepthVariance=NOINTDEPTH toggle jgi_summarize_bam_contig_depths --noIntraDepthVariance (yes = 1, no = 0). Default = 1 --read_percent_id=READ_PERCENT_ID The minimum end-to-end percent identity of qualifying reads for depth file. Default = 97 --markdup=MARKDUP choose to fix mates and mark duplicates for paired-end reads (yes = 1, no = 0). Default = 0 --logfile=LOGFILE logger file name. Default = logging.log

Inputs: --assembly= --sampleids=<sample id patterns of gzipped FASTQ files. Currently, files must be gzip compressed. The pattern is searched using a BASH-like glob. For FOO.fastq.gz, just pass FOO for this option. For multiple files, e.g. FOO.fastq.gz and BAR.fastq.gz, pass: FOO,BAR>

The script must be run from the same directory as the SConstruct.

cd /path/to/genome_mapping

Be sure to use all of these included options: scons --fastq_dir= --assembly= --outdir= --sampleids= --align_thread= --samsort_thread= --samsort_mem=<memory per thread for samtools sort. Specify number and G,M,K> --nheader=4. Use 0 for all headers> --tmpdir=

Note that for SCons, the options must not have spaces between the values and names.

For sanity reasons and ensuring the script runs correctly before doing a production run, use the SCons --dry-run option.

By default, SCons will create a build in the same directory as the SConstruct. Include the --rm_local_build=1 option to delete this build. This option is coded outside of the SCons framework, so doing a --dry-run with --rm_local_build=1 will raise an error since the directory is not built. This can be ignored.

Interleaved FASTQ will take priority over single-end reads or R1-R2 reads that contain the same Illumina identifier. To avoid this, you can use a more specific identifier string for --sampleids to only glob certain files, or move your interleaved and R1-R2 files into different directories.