Maliampi is a phylogenetic placement-based pipeline for handling 16S rRNA amplicons (the most common type of microbiome study data).
Maliampi's overall philosophy is to embrace and work with the inherent ambiguity involved in any PCR- and sequencing-based approach, and to deal rationally with the limits of the available references mapping between sequence space and taxonomy.
- Create sequence variants
- Make a reference package
- Place on the reference package
- Classify
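These steps correspond to the maliampi subcommands walked through in the tutorial below. As a rough sketch of how the phases chain together (each subcommand's options are shown in full later; the ... here just stands in for them):

# 1. Create sequence variants from raw fastq reads (DADA2)
maliampi sv_dada2 --manifest manifest.csv ...
# 2. Make a reference package from those variants plus a curated repository
maliampi refpkg --sequence-variants output/sv/dada2.sv.fasta ...
# 3. Place the variants on the reference package
maliampi placement --refpkg-tgz <your refpkg .tgz> ...
# 4. Classify the sequence variants (and specimens)
maliampi classify --jplace output/placement/dedup.jplace ...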
-
Containerized dependencies
Maliampi uses Docker containers for all of the software called by the pipeline. This means no fussing with getting arcane academic software matched up properly with its dependencies, and no issues reproducing the pipeline environment on your system.
-
Modular design
While maliampi can start with fastq files straight from the sequencer, you can also opt to use your own sequence variant / OTU generation strategy and just use maliampi to make a reference package.
Right now, maliampi relies upon a custom fork of sciluigi, which means it cannot simply be installed with pip. (A pull request is pending.)
- Python 3
- Docker (if you want to run locally)
For a variety of reasons, maliampi works best on a Unix-like operating system (Linux, macOS, WSL). The directions here assume a Unix-like system.
-
Install python3 with pip.
Ubuntu:
sudo apt-get install python3-pip
Mac: I suggest installing it via Homebrew.
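For example, on macOS with Homebrew already installed (the formula name below is the current one and may differ on older Homebrew versions), this installs Python 3 along with pip3:

brew install python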
-
(Optional) Install Docker Engine Community Edition.
This is needed if you want to run things locally for testing, and is strongly suggested.
Windows + Docker is not a happy marriage right now, particularly when attempting to bind local directories into containers. Thus, maliampi does not work well on Windows, even with Docker and Python installed.
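To confirm Docker is installed and able to run containers before launching the pipeline, a quick sanity check using Docker's standard test image (you may need sudo depending on how Docker was set up):

docker run hello-world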
-
Create a python virtual environment (strongly recommended).
i)
python3 -m venv maliampi-env
ii)
source maliampi-env/bin/activate
iii)
pip install pip --upgrade
-
Obtain and install the custom fork of sciluigi
i. Download the zip archive of the custom fork (you should end up with a file named containertask.zip).
ii. Unzip:
unzip containertask.zip
iii. Install using pip:
pip3 install sciluigi-containertask/
iv. Copy over and edit the containerinfo.ini file.
mkdir ~/.sciluigi && cp sciluigi-containertask/example-config/containerinfo.ini ~/.sciluigi/
v. Cleanup
rm -r sciluigi-containertask && rm containertask.zip
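As a quick, optional check that the fork installed into your environment, confirm the package imports (this just assumes it is importable under the name sciluigi):

python3 -c "import sciluigi; print(sciluigi.__file__)"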
-
Obtain and install maliampi
i. Download the zip archive of the maliampi repository (you should end up with a file named master.zip).
ii. Unzip:
unzip master.zip
iii. Install using pip:
pip3 install maliampi-master/
iv. Move the practice data
mv maliampi-master/maliampi-practice-data/ .
v. Cleanup
rm -r maliampi-master && rm master.zip
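As a quick check that maliampi itself installed and its console script is on your PATH (assuming the CLI exposes the usual --help flag), you can run:

maliampi --help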
The tutorial presumes you have maliampi installed as above (and, if you used a virtual environment, that it is activated).
The tutorial uses the data in the maliampi-practice-data folder from the repository; the command lines below assume you are within that directory. First, create the working and output directories:
mkdir -p working/ && mkdir -p output/sv/
Maliampi uses a manifest file to keep track of things. At a minimum this needs two columns: specimen (a unique identifier for each specimen) and read__1 (the file path of the fastq file containing the forward read). Almost always there will be another column, read__2, for the reverse read. Another important column to include is batch, corresponding to a given extraction-and-sequencing run. Take a look at manifest.csv to see a simple example using samples from the ATCC mock community, along the lines of the sketch below.
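As a sketch, a minimal manifest might look like the following; the specimen names and file paths here are hypothetical, so check the included manifest.csv for the actual values used in this tutorial:

specimen,read__1,read__2,batch
specimen_1,reads/specimen_1_R1.fastq.gz,reads/specimen_1_R2.fastq.gz,batch_1
specimen_2,reads/specimen_2_R1.fastq.gz,reads/specimen_2_R2.fastq.gz,batch_1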
maliampi --workers 2 \
sv_dada2 \
--manifest manifest.csv \
--working-dir working \
--destination-dir output/sv/
A lot of log lines will scroll by; keep an eye out for 'xxx tasks pending' to get a sense of how much is left. Maliampi will run DADA2 on the specimens in parallel, up to the number of workers specified above.
If all succeeded, you should see something like
===== Luigi Execution Summary =====
Scheduled 70 tasks of which:
* 9 complete ones were encountered:
- 9 LoadSpecimenReads(...)
* 61 ran successfully:
- 9 BCCSpecimenReads(...)
- 1 DADA2_Combine_Seqtabs(...)
- 9 DADA2_DADA(...)
- 9 DADA2_Dereplicate(...)
- 9 DADA2_FilterAndTrim(...)
...
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
You can look at the output files created in
output/sv/
These include a FASTA file of the sequence variants, a count matrix, and the map and weights files we will use later.
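For reference, these are the file names referenced by the commands below (your directory may also contain the sequence-variant count table and other files, so treat this as a partial listing):

ls output/sv/
# dada2.sv.fasta  dada2.sv.map.csv  dada2.sv.weights.csv

These sequence variants are the input for the next step, building a reference package: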
maliampi --workers 1 \
refpkg \
--working-dir working/ \
--refpkg-destdir output/ \
--refpkg-name tutorial \
--sequence-variants output/sv/dada2.sv.fasta \
--entrez-email your@email.com \
--repo-seq-info test.repo.seq_info.csv \
--repo-annotated-fasta test.repo.fasta
If this all works, you should see something like
===== Luigi Execution Summary =====
Scheduled 11 tasks of which:
* 4 complete ones were encountered:
- 1 AlignmentStoToFasta(...)
- 1 CMAlignSeqs(...)
- 1 CombineRepoMatches(...)
- 1 ObtainCM(...)
* 7 ran successfully:
- 1 BuildTaxtasticDB(...)
- 1 CleanupTreeInfo(...)
- 1 CombineRefpkg(...)
- 1 ConfirmSeqInfoTaxonomy(...)
- 1 RAxMLTree(...)
...
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
The reference package will be packaged up at output/refpkg/tutorial.date_time.refpkg.tgz, where date_time is a timestamp from when your reference package was built.
Please note, you will need to replace date_time below with the timestamp from your reference package.
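If you would rather not type the timestamp by hand, one convenience (a sketch, assuming only one tutorial reference package has been built so far) is to capture the path in a shell variable and pass "$REFPKG" to --refpkg-tgz below:

REFPKG=$(ls output/refpkg/tutorial.*.refpkg.tgz)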
maliampi --workers 2 \
placement \
--working-dir working \
--destination-dir output \
--sequence-variants output/sv/dada2.sv.fasta \
--seq-map-csv output/sv/dada2.sv.map.csv \
--sv-weights-csv output/sv/dada2.sv.weights.csv \
--refpkg-tgz output/refpkg/tutorial.date_time.refpkg.tgz
If all works, you should see something like
===== Luigi Execution Summary =====
Scheduled 16 tasks of which:
* 4 complete ones were encountered:
- 1 LoadFastaSeqs(...)
- 2 LoadFile(...)
- 1 LoadRefpkgTGZ(...)
* 12 ran successfully:
- 1 CMAlignSeqs(...)
- 1 CombineAlignmentsSTO(...)
- 1 ExtractRefpkgAlignment(...)
- 1 Jplace_ADCL(...)
- 1 Jplace_Alpha_Diversity(...)
...
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
You can check out the placement outputs in output/placement/. These include a jplace (JSON) file with the placement details, PCA projections, and alpha diversity and pairwise distance data:
output/placement/
├── adcl.gz
├── alpha_diversity.csv
├── dedup.jplace
├── edpl.gz
├── kr_distance.csv
├── pca
│ ├── epca.proj
│ ├── epca.trans
│ ├── epca.xml
│ ├── lpca.proj
│ ├── lpca.trans
│ └── lpca.xml
└── redup.jplace.gz
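The .jplace files are plain JSON (following the jplace standard), so you can take a quick look without any special tooling. For example, this one-liner, which assumes the standard top-level placements and fields keys, reports how many placements were made:

python3 -c "import json; jp = json.load(open('output/placement/dedup.jplace')); print(len(jp['placements']), 'placements'); print('fields:', jp['fields'])"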
Please note, you will need to replace date_time below with the timestamp from your reference package.
maliampi --workers 1 \
classify \
--working-dir working \
--destination-dir output \
--sequence-variants output/sv/dada2.sv.fasta \
--seq-map-csv output/sv/dada2.sv.map.csv \
--sv-weights-csv output/sv/dada2.sv.weights.csv \
--jplace output/placement/dedup.jplace \
--refpkg-tgz output/refpkg/tutorial.date_time.refpkg.tgz
If all worked, you should see something like
===== Luigi Execution Summary =====
Scheduled 16 tasks of which:
* 5 complete ones were encountered:
- 1 CombineAlignmentsSTO(...)
- 3 LoadFile(...)
- 1 LoadRefpkgTGZ(...)
* 11 ran successfully:
- 6 GenerateTables(...)
- 1 PlacementDB_AddSI(...)
- 1 PlacementDB_Classify_SV(...)
- 1 PlacementDB_MCC(...)
- 1 PlacementDB_Prep(...)
...
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
Look in the output/classification/ directory to see what we've created. placement.db is a SQLite3 database, and the tables/ subdirectory holds CSV tables (by specimen, by taxon, and wide-format tallies) at each taxonomic rank:
├── placement.db
└── tables
├── by_specimen.class.csv
├── by_specimen.family.csv
├── by_specimen.genus.csv
├── by_specimen.order.csv
├── by_specimen.phylum.csv
├── by_specimen.species.csv
├── by_taxon.class.csv
├── by_taxon.family.csv
├── by_taxon.genus.csv
├── by_taxon.order.csv
├── by_taxon.phylum.csv
├── by_taxon.species.csv
├── tallies_wide.class.csv
├── tallies_wide.family.csv
├── tallies_wide.genus.csv
├── tallies_wide.order.csv
├── tallies_wide.phylum.csv
└── tallies_wide.species.csv
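Since placement.db is a SQLite3 database, you can poke at it directly with the sqlite3 command-line tool (if you have it installed); for example, to list the tables it contains:

sqlite3 output/classification/placement.db ".tables"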