lock down databases for sunbeam pipeline use

sunbeam_databases

This repository is where we download, build, and collect the databases required by sunbeam.

TODO:

Install environment

conda install -c bioconda snakemake

Or simply activate the sunbeam environment:

source activate sunbeam

Make a local copy of the repository

git clone https://github.com/zhaoc1/sunbeam_databases
cd sunbeam_databases

Download BLAST databases

The downloaded nt database is used to run BLASTn on the assembled contigs.

mkdir nt_20180816
cd nt_20180816
update_blastdb.pl --passive --decompress nt

Build Kraken 1 database

The bash script includes the following steps:

  • mask low-complexity regions from the refseq sequences

  • reformat masked sequences for krakenDB format

  • add custom sequences to krakenDB

Note: the Kraken 2 paper is still in preparation, so to stay consistent with KrakenHLL we decided to proceed with Kraken 1.

bash build_krakendb.sh
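
The reformatting and custom-sequence steps listed above rely on Kraken 1's header convention: a sequence that does not come from NCBI can carry its taxonomy ID explicitly as `kraken:taxid|<taxid>` inside the sequence ID. The sketch below illustrates that convention only; it is not taken from build_krakendb.sh, and the function name `tag_fasta_headers` and the example taxid are placeholders chosen for illustration.

```python
def tag_fasta_headers(lines, taxid):
    """Yield FASTA lines with '|kraken:taxid|<taxid>' appended to each sequence ID.

    Hypothetical helper: shows the header format Kraken 1 accepts for custom
    sequences, not the actual logic of build_krakendb.sh.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The sequence ID is the text between '>' and the first whitespace.
            fields = line[1:].split(maxsplit=1)
            seq_id = fields[0] if fields else ""
            description = fields[1] if len(fields) > 1 else ""
            tagged = f">{seq_id}|kraken:taxid|{taxid}"
            yield f"{tagged} {description}" if description else tagged
        else:
            # Sequence lines pass through unchanged.
            yield line


# Made-up input; 32630 is used here only as a placeholder taxid.
example = [">contig_1 custom plasmid", "ACGT", ">contig_2", "GGCC"]
tagged = list(tag_fasta_headers(example, 32630))
print("\n".join(tagged))
```

A file tagged this way could then be handed to `kraken-build --add-to-library`, assuming the taxid exists in the taxonomy downloaded for the database.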

Gene clusters of interest

We do the functional prediction from shotgun metagenomics data based on sequence homology, on both the reads and contigs level.

bile salt hydrolase

The bsh genes were selected from PMID: 18757757. The protein sequences were downloaded from NCBI and saved in dbs/bsh_20180214.txt and dbs/bsh_20180214.fasta.

bai operon

I selected the complete bai operon genes from 3 species: Clostridium hiranonis, Clostridium scindens, and Clostridium hylemonae. The protein sequences were downloaded from the PubSEED database and saved in dbs/bai.operon_20180801.fasta and dbs/bai.operon_20180801.txt. See PMID: 16299351 and the preprint for more information.

butyrate producing genes

Sequences were downloaded from JGI IMG, based on the list provided in PMID: 24757212, and saved in dbs/butyrate_20180612.faa and dbs/butyrate_20180612.tsv.

fungal genomes

We collected 10 fungal genomes of interest (dbs/fungi_20180502.txt); the names and lengths of their chromosomes/contigs are in dbs/genome_contig_20180502.txt.
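
A table of contig names and lengths like the one described above can be derived directly from a genome FASTA. The following is a minimal sketch of one way to do that; the function name `contig_lengths` and the tab-separated output are assumptions, since the column layout of dbs/genome_contig_20180502.txt is not documented here.

```python
def contig_lengths(lines):
    """Return (name, length) tuples for each FASTA record, in file order.

    Hypothetical helper for illustration; not part of this repository.
    """
    lengths = []
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if name is not None:
                lengths.append((name, length))
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            # Sequences may wrap over several lines, so accumulate.
            length += len(line)
    if name is not None:
        lengths.append((name, length))
    return lengths


# Made-up records standing in for a real genome file.
example = [">chr1 example chromosome", "ACGT", "AC", "", ">chr2", "GGG"]
table = contig_lengths(example)
for contig, size in table:
    print(f"{contig}\t{size}")
```

For a gzipped genome, the same function can be fed `gzip.open(path, "rt")` instead of a list of strings.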

Build krakenHLL databases

viral-neighbors

KrakenHLL supports building databases from subsets of the NCBI nucleotide collection (nr/nt), best known as the standard database for BLASTn. On the command line, you can choose to extract all bacterial, viral, archaeal, protozoan, fungal, and helminth sequences. The list of protozoan taxa is based on Kaiju's.

DBNAME=viral-neighbors
krakenhll-download --db $DBNAME taxonomy
krakenhll-download viral-neighbors --db $DBNAME --dust --threads 16

Build taxonomizr databases

Parse NCBI taxonomy and accessions to assign taxonomy.

devtools::install_github("sherrillmix/taxonomizr")

library(taxonomizr)

getNamesAndNodes()
getAccession2taxid()
getAccession2taxid(types='prot')

# on microb120
read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa_20180813.sql')

Metaphlan2 database

Respublica doesn't have network access to Bitbucket, so I pre-downloaded the metaphlan_databases on microb191 (metaphlan2.py --install) and copied them with scp to /mnt/isilon/microbiome/analysis/biodata/metaphlan_databases.

DIR=$CONDA_PREFIX/opt
cp -r /mnt/isilon/microbiome/analysis/biodata/metaphlan_databases $DIR

Download refseq genomes

This workflow can be used with any group listed under the genomes/refseq directory, but the recommended groups are:

  • bacteria
  • fungi
  • archaea
  • protozoa

Example

To download the nucleotide sequences of all RefSeq fungal genomes (update your config file with group: fungi):

# First, download `assembly_summary.txt` from the NCBI FTP site and generate the list of all genomes
# You can also `grep` the species of interest from the generated `genome_urls.txt`
snakemake download_group_nucl

# Second, download the sequences
snakemake download_group_nucl

The output genomes are written to {group}/{accession}.fna.gz (nucleotide) or {group}/{accession}.faa.gz (protein).
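
To make the two-step download above more concrete, here is a hedged sketch of how per-genome URLs can be derived from an NCBI `assembly_summary.txt` and filtered by species. It does not reproduce this repository's Snakefile: the function name `genome_urls` is hypothetical, the sample rows are invented and trimmed to three columns, and the only conventions relied on are NCBI's `organism_name` and `ftp_path` columns and the `_genomic.fna.gz` / `_protein.faa.gz` file suffixes.

```python
def genome_urls(summary_text, organism=None, suffix="_genomic.fna.gz"):
    """Build download URLs from the text of an assembly_summary.txt file.

    Hypothetical helper. `organism` is an optional case-insensitive
    substring filter on the organism_name column.
    """
    header, rows = None, []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The last comment line before the data holds the column names.
            header = line.lstrip("#").strip().split("\t")
        elif line.strip():
            rows.append(line.split("\t"))
    if header is None:
        raise ValueError("no header line found")
    name_col = header.index("organism_name")
    path_col = header.index("ftp_path")

    urls = []
    for row in rows:
        ftp_path = row[path_col].rstrip("/")
        if ftp_path == "na":
            continue  # NCBI uses 'na' when no directory is available
        if organism and organism.lower() not in row[name_col].lower():
            continue
        # Files inside an assembly directory are prefixed with its name.
        prefix = ftp_path.rsplit("/", 1)[-1]
        urls.append(f"{ftp_path}/{prefix}{suffix}")
    return urls


# Invented rows; accessions and organism names are placeholders.
BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/000"
SAMPLE = "\n".join([
    "#   See README_assembly_summary.txt for a description of the columns",
    "#assembly_accession\torganism_name\tftp_path",
    f"GCF_000000001.1\tCandida exampleensis\t{BASE}/001/GCF_000000001.1_ExampleAsm1",
    "GCF_000000002.1\tAspergillus placeholderi\tna",
    f"GCF_000000003.1\tAspergillus placeholderi\t{BASE}/003/GCF_000000003.1_ExampleAsm3",
])

candida_urls = genome_urls(SAMPLE, organism="candida")
all_protein_urls = genome_urls(SAMPLE, suffix="_protein.faa.gz")
print("\n".join(candida_urls))
```

Looking columns up by name rather than by position keeps the sketch working on the trimmed sample as well as on the full file, which has many more columns.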