Extract all viral sequences from NGS data. These codes build a blast database from the NGS data and blast search the database for the target sequences (eg. all_viral.fa). It would draw the coverage figures and extract fasta sequences for each virus whose mapped reads numbers is above a defined parameter "threshold" in "extract_seq.py" code file, the default of which is 100, and which could be modified as demand.
-
Some preparation.
a. Install softwares including sqlite3, blast-2.5+, python2.7, R, and seqtk b. Download all viral sequences from NCBI: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/ c. Download taxonomy data from NCBI: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
-
Asign the sequences with taxonomy information according to accession.
a. Extract accessions from viral sequence file (in fasta format) by using "ac_extract.py". b. Build a SQL database of accession and taxid from "nucl_gb.accession2taxid" file. c. Search the SQL database, "names.dmp", "nodes.dmp", and "merged.dmp" by using "ac2sciname.py", to the species and family scientific name of each accession.
-
Put all the other codes in the same directory with the NGS data in ".fastq.gz" format. Open a terminal and run the "blast_all.sh" bash codes as below: (# note: make sure all ".sh" file are excutable. if not, use "chmod u+x *.sh " to enable these code files excutably.)
$ ./blastn_all.sh
1. Convert illumina fastq.gz file into fasta format using "seqtk".
2. Make a directory named as the NGS filename, and change working directory to this newly directory.
3. Make a directory named "blastdb", and build a blast database in this directory by using the NGS fasta file.
4. Make a directory named "blastout", blastn search all viral sequences from the database, and output the result file into "blastout"
5. Do statistic of the mapped reads, including information of virus accession, virus species name, family name, and mapped reads counts.
6. Draw coverage maps for each virus whose mapped reads number is above a defined paramter "threshold" in "coverage.py", and output the results into a newly made directory "cov_figs".
7. Extract sequences for each virus whose mapped reads number is above a defined paramter "threshold" in "extract_seq.py", and output the reulsts into a newly made directory "seqs".
Creat a list file containing all ".fastq.gz" file names in a file format including 3 columns gapped with a tab character "\t" in each line:
column1 column2 column3
read1.fastq.gz read2.fastq.gz merged_file #(with no extension)
Then run as below:
$ ./blastn_all_2.sh file_list.txt
or
$ ./tblastn_all_m2.sh file_list.txt
It will call "merge_r1_r2.py" to merge read1 and read2 firstly, and then do blastn/tblastn, annotation, statistics, and sequence extraction...