
iMAP v1.0 (Pre-release): Integrated Microbiome Analysis Pipeline


iMAP: Integrated Microbiome Analysis Pipeline

Please be informed that financial support to develop this repo ended in October 2018. Volunteer work to make it more user-friendly is progressing slowly. Thank you for your patience.


Running Analysis within Docker Container (Default)

  • See GH-Page for step-by-step instructions.
    • Requires Docker images and the Docker container CLI.
    • Uses fewer resources, but memory-intensive computing may sometimes fail.
    • All analyses are run interactively on the container’s command line.
    • The iMAP folder is the default working directory and is readable from the container.
    • The output is stored in the working directory, which means it can be accessed outside the container.
  • Important: Graphical applications don't work well in Docker containers.
  • Some R-packages that install perfectly in RStudio may not install correctly in docker images.



Running Analysis On Specific Platforms (Best option)

  • This README is a work in progress. Please visit this page regularly for updates.
  • No Docker images are required.
  • May require manual installation of some tools.
  • Uses RStudio to install required R-packages.


Step 1: Set-up the configuration file

By default, most of the executable files are saved or soft-linked to the $HOME/bin directory. If the bin folder does not exist, please create it:

cd ~/
ls -al
mkdir bin

Required configuration files:

  • .bash_profile: A hidden file executed by login shells at startup, before any command is run. It is more common on Mac OS X.

  • .bashrc: A hidden file executed by interactive non-login shells at startup, before any command is run. It is more common on Unix-Linux.

On a Mac we will set the PATH in the .bashrc file, then source it from the .bash_profile file.

cd ~/
ls -al

# If the two config files do not exist please create them.

touch ~/.bashrc

# See if the $HOME/bin is in the $PATH.

echo $PATH

# If not, add the following line to the *.bashrc* file.

export PATH="$PATH:$HOME/bin"

touch ~/.bash_profile

# Add the following line to the *.bash_profile* file. 

if [ -f ~/.bashrc ]; then
   source ~/.bashrc
fi
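
The edits above can also be made idempotently, so that re-running the set-up does not duplicate lines. The following is a minimal sketch, not part of iMAP; `ensure_line` is a hypothetical helper name introduced here for illustration.

```shell
# Append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper, not something shipped with iMAP.
ensure_line() {
  local file="$1" line="$2"
  touch "$file"
  # If the file does not end with a newline, add one before appending.
  [ -s "$file" ] && [ -n "$(tail -c1 "$file")" ] && echo >> "$file"
  # -x matches whole lines, -F treats the line as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example calls (commented out so that nothing is changed by accident):
# ensure_line "$HOME/.bashrc" 'export PATH="$PATH:$HOME/bin"'
# ensure_line "$HOME/.bash_profile" '[ -f ~/.bashrc ] && source ~/.bashrc'
```

The single-line test in the second example call is equivalent to the three-line if block shown above. Open a new terminal, or source the file, for the change to take effect.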


Step 2: Download the pre-built binary suitable for your platform



iMAP for MAC OS X (Updated July 8, 2020)


curl -LOk https://github.com/tmbuza/iMAP/releases/download/v1.0/iMAP-Mac-OSX.v1.0.zip
unzip iMAP-Mac-OSX.v1.0.zip
mv iMAP-Mac-OSX.v1.0 iMAP
rm -f iMAP-Mac-OSX.v1.0.zip
cd iMAP


# OR

wget --no-check-certificate https://github.com/tmbuza/iMAP/releases/download/v1.0/iMAP-Mac-OSX.v1.0.zip
unzip iMAP-Mac-OSX.v1.0.zip
mv iMAP-Mac-OSX.v1.0 iMAP
rm -f iMAP-Mac-OSX.v1.0.zip
cd iMAP


iMAP for Unix-Linux environments (in progress)


curl -LOk https://github.com/tmbuza/iMAP/releases/download/v1.0/iMAP-UnixLinux.v1.0.zip
unzip iMAP-UnixLinux.v1.0.zip
mv iMAP-UnixLinux.v1.0 iMAP
rm -f iMAP-UnixLinux.v1.0.zip
cd iMAP


# OR

wget --no-check-certificate https://github.com/tmbuza/iMAP/releases/download/v1.0/iMAP-UnixLinux.v1.0.zip
unzip iMAP-UnixLinux.v1.0.zip
mv iMAP-UnixLinux.v1.0 iMAP
rm -f iMAP-UnixLinux.v1.0.zip
cd iMAP


iMAP for Windows 10 with linux WSL-bash (in progress)


curl -LOk https://github.com/tmbuza/iMAP/releases/download/v1.0/iMAP-Windows-10-WSL.v1.0.zip
unzip iMAP-Windows-10-WSL.v1.0.zip
mv iMAP-Windows-10-WSL.v1.0 iMAP
rm -f iMAP-Windows-10-WSL.v1.0.zip
cd iMAP


# OR

wget --no-check-certificate https://github.com/tmbuza/iMAP/releases/download/v1.0/iMAP-Windows-10-WSL.v1.0.zip
unzip iMAP-Windows-10-WSL.v1.0.zip
mv iMAP-Windows-10-WSL.v1.0 iMAP
rm -f iMAP-Windows-10-WSL.v1.0.zip
cd iMAP
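
If you are unsure which archive applies to your machine, the choice can be derived from `uname`. This is a rough sketch under stated assumptions: `imap_archive_for` is a hypothetical helper, and detecting WSL by looking for "microsoft" in the kernel release string is a common heuristic, not something iMAP itself documents.

```shell
# Map a kernel name (uname -s) and kernel release (uname -r) to the
# iMAP v1.0 archive names used in this step.
# imap_archive_for is a hypothetical helper, not part of iMAP.
imap_archive_for() {
  local kernel="$1" release="$2"
  case "$kernel" in
    Darwin) echo "iMAP-Mac-OSX.v1.0" ;;
    Linux)
      # Assumption: WSL kernels mention "microsoft" in the release string.
      case "$release" in
        *[Mm]icrosoft*) echo "iMAP-Windows-10-WSL.v1.0" ;;
        *) echo "iMAP-UnixLinux.v1.0" ;;
      esac ;;
    *) echo "unsupported platform: $kernel" >&2; return 1 ;;
  esac
}

# Print the download URL that matches the current machine.
archive="$(imap_archive_for "$(uname -s)" "$(uname -r)")" &&
  echo "https://github.com/tmbuza/iMAP/releases/download/v1.0/${archive}.zip"
```

The printed URL can then be passed to curl or wget exactly as in the commands above.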


Step 3: Install iMAP dependencies

The following script installs the executable tools integrated in the pipeline, including seqkit, fastqc, bbmap, multiqc and mothur. The script does not include R or RStudio which must be installed manually by the user.

Users who prefer to use the Portable Batch System (PBS) or similar methods may seek advice from their system administrators.

bash ./code/00_1_InstallSoftwareDriver.bash

Confirm the installation

Make sure that the system can find all executable tools. Use the which or type -p command to see the location of each one.

which seqkit # must show the location of seqkit
which fastqc # must show the location of fastqc
which bbduk.sh # must show the location of bbduk.sh
which multiqc # must show the location of multiqc
which mothur # must show the location of mothur
which vsearch # must show the location of vsearch
which uchime # must show the location of uchime
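
The same check can be run in one pass. The sketch below is illustrative only; `check_tools` is a hypothetical helper, and the tool names are the ones listed above.

```shell
# Report which of the given commands can be found on the PATH.
# check_tools is a hypothetical helper, not part of iMAP.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'found    %s -> %s\n' "$tool" "$(command -v "$tool")"
    else
      printf 'MISSING  %s\n' "$tool"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when nothing is missing.
  [ "$missing" -eq 0 ]
}

check_tools seqkit fastqc bbduk.sh multiqc mothur vsearch uchime ||
  echo "Some tools are missing; install them manually and check again."
```

`command -v` is used here instead of `which` because it is a shell built-in and behaves the same across platforms.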

If the auto-install failed, please try to do it manually. Each of the tools below is hyperlinked to lead you to its download site. Please install the latest stable version.


Additional installation for future analyses

  • Anaconda: We recommend installing Anaconda for the local user, as no administrator permissions are required.


Install R & RStudio (Required)



Step 4: Add data to designated folders

This table provides useful information to help you place data in the correct folders. Use the new versions if available.


Using demo data

The following command copies the required data files located in iMAP/resources/ and places them in their respective locations.

bash ./code/00_2_GetDemoDataDriver.bash

Step 5: Check missing folders or files

Run the checkFiles command every time you want to check for missing files. Add all missing files and check again until everything looks OK.

bash ./code/00_3_CheckFilesDriver.bash 

What to replace

  • Rawdata: data/raw/
  • Metadata: data/metadata/
  • Mapping files: data/metadata/

Re-run the checkFiles command every time you change the original data files. It is important to maintain the format presented by the demo data.

bash ./code/00_3_CheckFilesDriver.bash
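
Before running the full check, a quick look at the data folders can catch an empty or missing directory. This is a minimal sketch, assuming it is run from the iMAP folder; `check_data_dirs` is a hypothetical helper and does not reproduce what 00_3_CheckFilesDriver.bash verifies.

```shell
# Succeed only if every given directory exists and holds at least one entry.
# check_data_dirs is a hypothetical helper; it does not replace
# code/00_3_CheckFilesDriver.bash.
check_data_dirs() {
  local status=0 dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing folder: $dir"; status=1
    elif [ -z "$(ls -A "$dir")" ]; then
      echo "empty folder:   $dir"; status=1
    else
      echo "ok:             $dir"
    fi
  done
  return "$status"
}

check_data_dirs data/raw data/metadata ||
  echo "Add the missing files and check again."
```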

Changing default settings

Users who want to change the default settings may do so using any text editor. Use this table to locate files with default parameters that may be altered.




METADATA EXPLORATORY ANALYSIS


Step 6: Metadata profiling

This step helps you to:

  • Discover if data is suitable for analysis.
  • Identify and correct issues.
  • Uncover if additional formatting is needed.
  • Decide whether to change anything before proceeding with the analysis.

Progress report 1: Metadata profiling

Skip for now!
This chunk will hold an R script that generates Progress report 1: Metadata profiling
bash code/01_metadataProfiling_driver.bash



READ QUALITY CONTROL

Step 7: Read Preprocessing

  • Computing simple statistics of the raw reads
  • Inspecting base quality scores of original reads (qc0)
  • Filtering and trimming poor reads. Phred Score = 25 or more (qctrim25: default)
  • Removing phiX contamination (qced)
  • Summarizing Base Call Phred scores graphically
bash ./code/01_1_ReadPreprocessDriver.bash
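
To put the default trimming threshold in context, a Phred score Q corresponds to a base-call error probability of 10^(-Q/10). The small sketch below computes that value; `phred_to_error` is a hypothetical helper for illustration and is not part of the pipeline.

```shell
# Convert a Phred quality score Q to an error probability, P = 10^(-Q/10).
# phred_to_error is a hypothetical helper, not part of iMAP.
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.5f\n", 10 ^ (-q / 10) }'
}

phred_to_error 25   # prints 0.00316
```

In other words, a score of 25 means roughly 3 expected errors per 1,000 base calls.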

Progress report 2: Read Preprocessing

Skip for now!
This chunk will hold an R script that generates Progress report 2: Read Preprocessing


BIOINFORMATICS ANALYSIS

A: Interactively on CLI

  • Users sequentially run individual scripts or the bundled scripts on the CLI (command-line interface).
  • Interactive mode allows investigators to review the results and make well-informed decisions, progressively.

Step 8: Microbial Profiling

Sequence processing

  • Assembling the forward and reverse reads, screening by length, and creating representative sequences.
  • Aligning the representative sequences with reference alignments. The default is the SILVA seed.
  • Denoising to remove poor alignments.
  • Removing chimeric sequences.
bash ./code/01_2_SeqProcessingDriver.bash

Sequence classification

  • Taxonomic classification of the sequences
  • Post-classification quality control.
bash ./code/01_3_ClassifySeqDriver.bash

Progress report 3: Microbial Profiling

Skip for now!
This chunk will hold an R script that generates Progress report 3: Sequence Processing


Step 9: Preliminary Analysis


Phylotype method (Recommended)

bash ./code/01_4_PhylotypeBasedTaxaDriver.bash

Cluster-based method (Memory-intensive)

bash ./code/01_5_ClusterBasedTaxaDriver.bash

Phylogeny method (Memory-intensive)

bash ./code/01_6_PhylogenyBasedTaxaDriver.bash

Progress report 4: Preliminary Analysis

Skip for now!
This chunk will hold an R script that generates Progress report 4: Preliminary Analysis 



B: Remotely on HPC

  • Complete steps 1-4 above.
  • The Portable Batch System (PBS) is a commonly used workload management solution for HPC systems and Linux clusters. To be certain which one your site uses, check with your system administrator.
  • Create a job-scheduling script, i.e. a PBS script (or similar), for:
    • submitting a job to the HPC queue
    • allocating the available computing resources, and
    • requesting additional resources.
  • Submit the job using a qsub command. This command scans the lines of the PBS job scheduling script for directives or instructions.

Sample PBS script

Replace the parameters in the script to match your systems.

#!/bin/bash -f
#PBS -N [JobID]
#PBS -A [group allocation name]
#PBS -l nodes=1:ppn=10
#PBS -l walltime=3:00:00
#PBS -l pmem=10gb
#PBS -j oe
#PBS -o [Output file]
#PBS -M [Email address]
#PBS -m bea

cd $PBS_O_WORKDIR

bash ./code/01_1_ReadPreprocessDriver.bash
bash ./code/01_2_SeqProcessingDriver.bash
bash ./code/01_3_ClassifySeqDriver.bash
bash ./code/01_4_PhylotypeBasedTaxaDriver.bash
bash ./code/01_5_ClusterBasedTaxaDriver.bash
bash ./code/01_6_PhylogenyBasedTaxaDriver.bash

exit 0

Description of the PBS code

The above PBS script specifies:

  • The environment to use (#!/bin/bash -f)
  • The name of the job (#PBS -N JobID)
  • The group allocation name (#PBS -A group allocation name)
  • Ten processors to run on a single node (#PBS -l nodes=1:ppn=10)
  • Three walltime hours (#PBS -l walltime=3:00:00)
  • Ten gigabytes of memory (#PBS -l pmem=10gb)
  • Joins the error and output in a single file (#PBS -j oe)
  • Writes the output to the specified text file, for example iMAPtutorial.txt (#PBS -o [Output file])
  • Instructs the PBS manager to send the notification emails to the specified email address (#PBS -M [Email address])
  • Instructs the PBS manager to send a message when the job (b)egins, (e)xits or (a)borts (#PBS -m bea)
  • The working directory (cd $PBS_O_WORKDIR)
  • The code or individual scripts to be executed
  • Finally, the PBS manager will instruct the system to exit once the execution is done (exit 0).
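
Note that a plain Bash script keeps going after a failed command, so the sample script would still start the later drivers if an earlier one failed. One possible way to stop at the first failure is sketched below; `run_stages` is a hypothetical helper introduced here, not part of iMAP.

```shell
# Run each script in order with bash and stop at the first failure.
# run_stages is a hypothetical helper, not part of iMAP.
run_stages() {
  local script
  for script in "$@"; do
    echo ">>> running $script"
    if ! bash "$script"; then
      echo "!!! $script failed; later stages were not started" >&2
      return 1
    fi
  done
  echo "all stages finished"
}

# Example call (commented out; run it from the iMAP folder):
# run_stages ./code/01_1_ReadPreprocessDriver.bash \
#            ./code/01_2_SeqProcessingDriver.bash \
#            ./code/01_3_ClassifySeqDriver.bash
```

Inside a PBS script, the function definition and the call would take the place of the individual bash lines, and `exit 0` could become `exit $?` so that the scheduler records the failure.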



IN-DEPTH ANALYSIS, VISUALIZATION & REPORTING (In progress)

The output from preprocessing and bioinformatics analysis is analyzed and visualized via the RStudio IDE (Integrated Development Environment). The entire analysis is summarized in a single HTML report or in a pre-specified format using Rmarkdown.



Related Links

URLs Description Status
Manuscript In BMC Bioinformatics Software
README Guidelines iMAP README
Practical guide Systematic Microbiome data analysis eBook, coming in 2021
Useful link Consulting Services In Progress


Citation

Teresia M. Buza, Triza Tonui, Francesca Stomeo, Christian Tiambo, Robab Katani, Megan Schilling, Beatus Lyimo, Paul Gwakisa, Isabella M. Cattadori, Joram Buza and Vivek Kapur. iMAP: an integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics (2019) 20:374. Free Full Text.




Useful links

  1. RStudio Community Q&A: https://community.rstudio.com/