This document describes how to run command-line BLAST with custom search-able databases on Colby's nscc.
"BLAST+" is a suite of linux programs that run the different variations of the basic local alignment search tool (BLAST) algorithm. Which program (blastn
, blastp
, blastx
or tblastn
) you need to run depends to whether your search query is a nucleotide or amino acid (protein) sequence and whether you're searching a database of nucleotide or protein sequences.
Once you know the variety of BLAST you need, the basic form of the command is:
blastn -query /path/input.filename.fa -db /path/BLASTdb > output.filename
Notice that the default is for BLAST to send its output to "stdout", that is, the screen. So typically you want to redirect that output to a named file.
Here's a more realistic example.
tblastn -query /research/drangeli/BLAST_DBs/query_sequences/Dmel.act5C.fa \
-db /research/drangeli/BLAST_DBs/Nlug.CDS.BLASTdb/Nlug.OGSv1-0.CDS.BLASTdb \
> Dmel.act5c.vs.Nlug.OGSv1-0.CDS.tblastn.txt
Notice a few things about this command.
- The input files are all identified by their complete path from root. I do this because the query files and the databases are typically in different folders, and this keep things organized.
- The output file has no path. In other words the i,mplied path is
./
. So, I typically run the BLAST command from the folder where I want the output file to be. - The output file name has built into it several important pieces of information:
- the name of query sequence,
Dmel.act5C
, in this example - the name of the database,
Nlug.OGSv1-0.CDS
- the type of BLAST,
tblastn
- the
.txt
extension should make it clear that this is human-readable output
- the name of query sequence,
In order to keep things organized on nscc, it's helpful to use consistent and clear file names. For species, they are typically identified by a four letter abbreviation consisting of the first initial of the genus, followed by the first three letters of the specific epithet. So, Oncopeltus fasciatus becomes Ofas
.
There are a few exceptions to this rule. The pea aphid, Acyrthosiphon pisum, would become Apis
, which would certainly get confused with the honeybee genus, Apis. So the aphid community has adopted Acypi
as their digital handle.
Below is a list of paths to each of the custom BLAST search-able databases available to our lab group on nscc
. Custom BLAST databases each reside in a folder, but consist of several different files (with the extensions nhr
, nin
, nog
, nsd
, nsi
, and nsq
). To run BLAST from the linux command line, you will need to refer to the path of the database and the base name of those files. List the contents of the database folder if you're not sure what this looks like.
This is the reference transcriptome that was used to annotate the O. fasciatus genome project. The CDS (coding sequence) FASTA file was kindly provided by Kristen Panfilio. When she shared the file, Kristen told us the data originate from "pooled Illumina RNAseq reads from mixed juvenile, adult male, and adult female libraries". But we have no more specific information about the ages of those individuals. We might be able to follow-up with Kristen about that.
/research/drangeli/BLAST_DBs/Ofas_transdecoder_cds/Ofas.TR.BLASTdb
This database is built from the 'grand union' of transcripts assembled from samples of the Plantation Key ecotype from whole bodies, dorsal thorax and gonads of first day adults. It is the most complete J. haematoloma transcriptome we have and is estimated (by BUSCO) to include 89% of conserved genes.
/research/drangeli/BLAST_DBs/Jhae_GU_BLAST_DB/Jhae.GU.TRassembly.BLASTdb
Additionally, we have transcriptomes and BLAST databases for samples from male and female gonads (testes and ovaries). I'm not sure if the testes also included the accessory glands.
/research/drangeli/BLAST_DBs/Jhae_ovary_BLAST_database/Jhae.ovary.TRassembly.BLASTdb
/research/drangeli/BLAST_DBs/Jhae_testes_BLAST_database/Jhae.testes.TRassembly.BLASTdb
J. sanguinolenta is a seperate species of soapberry bug whose range overlaps that of J. haematoloma in South Florida. They are not know to hybridize in the wild and attempts to do so in the lab have not produced offspring. J. sanguinolenta is not known to produce short-wing morphs in the wild, but they have appeared in the lab, suggesting a hidden reaction norm.
At the time we made this transcriptome, we thought the identification of this bug might have been Jadera hinnulea, which is why these files are labeled Jhin
rather than Jsan
. (I'd rename that, but I'm afraid the contents of these files might refer to their own file names, which would cause problems. Eventually, we should just re-build the BLAST database, so eliminate that name issue.)
/research/drangeli/BLAST_DBs/Jhae_GU_BLAST_DB/Jhae.GU.TRassembly.BLASTdb
The eastern box elder bug, Boisea trivittata, is the only rhopalid native to northeastern North America. Our lab created a transcriptome for this species with individuals collected in Waterville, ME.
/research/drangeli/BLAST_DBs/Btri_BLAST_database/Btri.TRassembly.BLASTdb
The USDA i5k project has a number of other genome projects in the works for Hemiptera at various stages of completion. There are currently five species with transcriptomes available, and I've made BLAST databases from these.
To understand how these species might be useful, it's helpful to have some idea of the phylogenetic relationships of these species as well as their unique features. Unfortunately, the taxonomy of true bugs is a mess. Mid-century entomologists recognised two orders: Heteroptera, the true bugs, and Homoptera, the cicadas and their allies. However, molecular phylogenetics has found that Heteroptera is nested within what used to be 'Homoptera'. So the term 'Homoptera' is now defunct, and people recognise a single order Hemiptera, with 3 suborders: Heteroptera (the same name with the same species in it), Auchenorrhyncha (cicadas, leafhoppers, treehoppers, planthoppers, and spittlebugs) and Sternorrhyncha (aphids, whiteflies, and scale insects).
Here's a figure Kristen and I produced for a review we wrote on hemipteran genomics. Below these species are listed in decreasing relatedness to Oncopeltus.
The invasive brown marmorated stink bug is a member of the Pentatomomorpha, the infraorder that includes Oncopeltus (family Lygaeidae and Jadera (family Rhopalidae).
/research/drangeli/BLAST_DBs/Hhal.CDS.BLASTdb/Hhal.BCMv0-5-3.CDS.BLASTdb
The bedbug represents the infraorder Cimicomorpha.
/research/drangeli/BLAST_DBs/Clec.CDS.BLASTdb/Clec.OGSv1-2.CDS.BLASTdb
The water strider represents the infraorder Gerromorpha, the earliest diverging lineage of the Heteroptera.
/research/drangeli/BLAST_DBs/Gbue.CDS.BLASTdb/Gbue.OGSv1-0.CDS.BLASTdb
The brown planthopper is currently the only member of the Auchenorrhyncha with an available transcriptome.
/research/drangeli/BLAST_DBs/Nlug.CDS.BLASTdb/Nlug.OGSv1-0.CDS.BLASTdb
The pea pahid genome was one of the first big insect genome projects. This pest represents the Sternorrhyncha, the earliest-branching major lineage of Hemiptera.
/research/drangeli/BLAST_DBs/Acypi.CDS.BLASTdb/Acypi.OGSv2-1b.CDS.BLASTdb
Pachypsylla venusta is a gall-forming psyllid. This species represents a relatively early-branching lineage within the Sternorrhyncha.
/research/drangeli/BLAST_DBs/Pven.CDS.BLASTdb/Pven.BCMv0-5-3.CDS.BLASTdb
Thrips such as F. occidentalis are members of the Phthiraptera, the sister-order to the Hemiptera.
/research/drangeli/BLAST_DBs/Focc.CDS.BLASTdb/Focc.BCMv0-5-3.CDS.BLASTdb
The default output for local BLAST is a human-readable tabular format, give a long list of matches, including some that are very poor. However, you can adjust BLAST's output with some additional parameters in the command line. The BLAST+ manual has detailed information on all the suite's functionality. But here's a summary of some useful switches. These switches can be used in combination with one another.
blastn -query /path/input.filename.fa -db /path/BLASTdb \
-evalue 1e-10 \
> output.filename
blastn -query /path/input.filename.fa -db /path/BLASTdb \
-max_target_seqs 10 \
> output.filename
This will limit the output to the top 10 matches.
It's possible for a query search and a subject sequence to align more of less well along different stretches. This is most relevant when the database includes very long sequences, that might have tandem duplications. This output can be reduced to show only the best such alignments. The example below will report only the best alignment between any query and subject sequence pair.
blastn -query /path/input.filename.fa -db /path/BLASTdb \
-max_hsps 1 \
> output.filename
nscc is a multi-processor computing cluster. By default, most programs will use only one processor, but you can direct many of them to "multi-thread" and use multiple processors simultaneously.
Do this only with caution, since it will impact other users on nscc. As a rule, you should only use multi-threading on node 26. The biology node has 32 processors, and as a courtesy to other users, it's good form to use at most half of those at any time. The example below uses 16 processors.
blastn -query /path/input.filename.fa -db /path/BLASTdb \
-num_threads 16 \
> output.filename
BLAST has the ability to render its output in format other than the default table and alignment. Several options are possible, and they're described in Table C1 in the Appendix to the BLAST+ manual. If you look at this table, scroll down or search for the outfmt
entry.
For many big data bioinformatics approaches, it can be useful to get your BLAST output in a TSV format that can be imported into R or other applications. One line per result is produced with -outfmt 6
. But you'll want to specify exactly which metadata the BLAST returns. They are specified in double quotes, as in the example below.
blastn -query /path/input.filename.fa -db /path/BLASTdb \
-evalue 1e-10 -max_target_seqs 10 -max_hsps 1 -num_threads 16 \
-outfmt "6 qseqid sseqid pident qlen length mismatch evalue bitscore" \
> output.filename
-Dave Angelini, 2018