blast-galley
Precooked BLAST-related recipes, scripts and utilities
Introduction
In the blast-galley, I collect a mishmash of scripts and utilities for easy digestion of the NCBI Blast+ suite.
These tools were developed for my own use, but I've tried to make them
self-contained (all have --help) so they may be of use to others.
Prerequisites
- Several tools make use of GNU awk (gawk), which is available in every Linux distribution. Recent Debian/Ubuntu versions install mawk rather than gawk by default, so you may need to run apt install gawk.
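A script that depends on gawk can check for it up front. The following is a minimal sketch of such a check; the helper name require_cmd is hypothetical and not taken from the blast-galley scripts, and the install hint assumes the package name matches the command name (true for gawk).

```shell
# Hypothetical helper: succeed if the named command is on PATH,
# otherwise print a hint on stderr and return non-zero.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 && return 0
    echo "missing dependency: $1 (on Debian/Ubuntu try: apt install $1)" >&2
    return 1
}

# Typical use near the top of a script:
#   require_cmd gawk || exit 1
```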
zblast
zblast is a very thin wrapper around the blast command. I use it because
I keep forgetting the options that do what I want, while blastn -help is
an oxymoron. For that same reason I maintain a Blast+ command-line reference.
$ zblast "ATGAGCAT" # default blast query against `nt` for given sequence
$ zblast queries.fasta # same but reading query sequence(s) from file queries.fasta
$ echo "ATGAGCAT" | zblast # same but reading the query from stdin
$ zblast -b "-perc_identity 99 -evalue 0.01" ... # pass options to blast
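The three input forms above (literal sequence, file, stdin) could be normalised by a wrapper along these lines. This is a sketch under assumptions, not zblast's actual code: the function name query_stream and the ">query" header are made up for illustration.

```shell
# Hypothetical helper: write the query to stdout, whatever form it came in.
#   no argument or "-"  -> pass stdin through unchanged
#   existing file       -> pass the file through unchanged
#   anything else       -> treat the argument as a bare sequence
query_stream() {
    if [ $# -eq 0 ] || [ "$1" = "-" ]; then
        cat
    elif [ -f "$1" ]; then
        cat "$1"
    else
        printf '>query\n%s\n' "$1"
    fi
}
```

A wrapper could then pipe the result into BLAST, for example `query_stream "$@" | blastn -db nt -perc_identity 99 -evalue 0.01`, relying on blastn reading its query from stdin when no -query file is given.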
blastdb-get
blastdb-get retrieves sequences or metadata from a BLAST database, using
sequence identifiers such as accession numbers to identify the entries.
$ blastdb-get 'X74108.1'
>gi|395160|emb|X74108.1| V.cholerae gene for heat-stable enterotoxin, partial
TTATTATTTTCTTCAATCGCATTTAGCCAAACAGTAGAAAACAATACAAAAACAGTGCAGCAACCACAACAAATTGAAAG
CAAGGTAAATATTAAAAAACTAAGTGAAAATGAAGAATGCCCATTTATAAAACAAGTCGATGAAAATGGAAATCTCATTG
By default blastdb-get returns sequences in FASTA format, but it can also
output tabular metadata and/or sequence data.
$ blastdb-get --header --table "aTls" EU545988.1 JF260983.1
Accession TaxID Length Sequence data
EU545988.1 Zika virus 10272 ATGAAAAACCCCAAAGAAGAAATCCGGAGGATCC...
JF260983.1 Dengue virus 10176 ATGAATAACCAACGGAAAAAGGCGAGAAACACGC...
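The column letters in the --table argument resemble the format specifiers of BLAST+'s own blastdbcmd (%a accession, %T taxid, %l length, %s sequence data). Assuming a one-to-one mapping, which is a guess and not confirmed from the blastdb-get source, a table spec could be turned into a blastdbcmd -outfmt string like this (table_to_outfmt is a hypothetical name):

```shell
# Hypothetical helper: turn a column spec such as "aTls" into a
# tab-separated blastdbcmd format string: %a<TAB>%T<TAB>%l<TAB>%s
table_to_outfmt() {
    awk -v spec="$1" 'BEGIN {
        out = ""
        for (i = 1; i <= length(spec); i++)
            out = out (i > 1 ? "\t" : "") "%" substr(spec, i, 1)
        print out
    }'
}

# Possible use (requires BLAST+ and a local nt database):
#   blastdbcmd -db nt -entry EU545988.1,JF260983.1 -outfmt "$(table_to_outfmt aTls)"
```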
blastdb-find
Whereas blastdb-get retrieves sequences by identifier only, blastdb-find
can also grep through sequence titles or select by taxonomy ID. By default
it returns a list, but it can also produce the sequences in FASTA format.
$ blastdb-find -t 64320 -t 12637 'polyprotein .*complete cds'
gb|EU545988.1| EU545988.1 64320 10272 Zika virus polyprotein gene, complete cds
gb|DQ859059.1| DQ859059.1 64320 10254 Zika virus strain MR 766 polyprotein gene, complete cds
gb|JF260983.1| JF260983.1 12637 10176 Dengue virus strain EEB-17 polyprotein gene, complete cds
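Selecting by taxonomy ID and title can be pictured as a filter over a metadata table. The sketch below assumes a tab-separated file with the five columns shown in the output above (sequence ID, accession, taxid, length, title); that layout and the name find_in_cache are assumptions for illustration, not the actual cache format of blastdb-find.

```shell
# Hypothetical filter over a tab-separated metadata file.
#   $1 = metadata file
#   $2 = comma-separated taxids (empty string = any taxid)
#   $3 = extended regular expression matched against the title
find_in_cache() {
    awk -F '\t' -v taxids="$2" -v re="$3" '
        BEGIN { n = split(taxids, t, ","); for (i = 1; i <= n; i++) want[t[i]] = 1 }
        (n == 0 || $3 in want) && $5 ~ re    # no action: print matching lines
    ' "$1"
}

# Example: find_in_cache cache.tsv 64320,12637 'polyprotein .*complete cds'
```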
Though blastdb-find can do a superset of what blastdb-get can do, it needs
to maintain a cache of metadata per BLAST database. For 'key-based' queries,
blastdb-get is generally faster, simpler, and more configurable.
gene-cutter
gene-cutter excises from one or more sequences the segment(s) which match
a given template, such as a known gene sequence. It can operate on FASTA
files or against sequences in a BLAST database.
The sequences being searched through should ideally consist of as few contigs
as possible, as gene-cutter won't detect matches that straddle contigs.
When matches break across contigs, mapping reads is the alternative. I've
implemented that in mappet. In practice though, if gene-cutter gives a result,
then it is both quick and accurate.
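To illustrate what excising a matched segment involves, here is a sketch that cuts a segment out of a sequence given BLAST-style subject coordinates, such as the sstart and send columns of `blastn -outfmt "6 sseqid sstart send"`. It is not gene-cutter's code; the name cut_segment is hypothetical, and the sketch assumes 1-based inclusive coordinates with sstart greater than send for minus-strand hits, as BLAST reports them.

```shell
# Hypothetical helper: print the segment of a sequence between two
# 1-based inclusive coordinates; reverse-complement it when start > end.
#   $1 = sequence, $2 = sstart, $3 = send
cut_segment() {
    awk -v seq="$1" -v s="$2" -v e="$3" 'BEGIN {
        if (s <= e) { print substr(seq, s, e - s + 1); exit }
        # minus strand: take e..s, then reverse and complement it
        seg = toupper(substr(seq, e, s - e + 1))
        comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"
        out = ""
        for (i = length(seg); i >= 1; i--) {
            b = substr(seg, i, 1)
            out = out ((b in comp) ? comp[b] : "N")   # ambiguity codes become N
        }
        print out
    }'
}
```

For example, `cut_segment ACGTTGCA 2 5` yields the plus-strand segment CGTT, while `cut_segment ACGTTGCA 5 2` yields its reverse complement AACG.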
gene-cutter could be extended to work around fragmented matches, for instance
by lowering the query coverage threshold so as to find subjects whose start or
end is overlapped by the query, then stitching these together. Alternatively,
we could use exonerate with the affine:overlap model. The point of
blast-galley however was to use only BLAST, with the added advantage that
gene-cutter can be used against any BLAST database.
The gene-cutter script is self-contained; use -h, --help for documentation.
blast-in-silico-pcr
blast-in-silico-pcr is a bash script which tests pairs of PCR primers against
a local BLAST database and returns the fragments selected by the primers.
The online in-silico PCR services at EHU and NCBI do the same thing and may do a better job. This script was intended as a quick shot at doing isPCR using only BLAST commands.
The script is self-contained; the usual -h, --help gives documentation.
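To make "the fragment selected by the primers" concrete, here is a toy sketch using exact string matching on a single template. It is deliberately not how blast-in-silico-pcr works (the script uses BLAST searches against a database); the function names revcomp and amplicon are hypothetical, and both primers are assumed to be given 5' to 3'.

```shell
# Hypothetical helper: reverse complement of a DNA string (non-ACGT becomes N).
revcomp() {
    awk -v seq="$1" 'BEGIN {
        c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A"
        out = ""
        for (i = length(seq); i >= 1; i--) {
            b = toupper(substr(seq, i, 1))
            out = out ((b in c) ? c[b] : "N")
        }
        print out
    }'
}

# Hypothetical helper: print the fragment running from the first exact match
# of the forward primer through the first downstream match of the reverse
# primer's reverse complement. Returns non-zero if either primer is not found.
#   $1 = template, $2 = forward primer, $3 = reverse primer
amplicon() {
    awk -v tpl="$1" -v fwd="$2" -v rc="$(revcomp "$3")" 'BEGIN {
        tpl = toupper(tpl); fwd = toupper(fwd)
        start = index(tpl, fwd)
        if (start == 0) exit 1
        rest = substr(tpl, start + length(fwd))   # template after the forward primer
        off = index(rest, rc)
        if (off == 0) exit 1
        print substr(tpl, start, length(fwd) + off - 1 + length(rc))
    }'
}
```

With template TTACGTAAAACCCGGGTT, forward primer ACGT and reverse primer CGGG, `amplicon` prints ACGTAAAACCCG.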
taxo
taxo is a command line utility to search or browse a local copy of the
NCBI taxonomy database. taxo has moved to https://github.com/zwets/taxo.
Miscellaneous
Why the name "blast-galley"?
Because it has a nice piratey ring. Pirates must be revered for the well-established fact that their presence attenuates global warming.
License
blast-galley - pre-cooked BLAST for easier digestion Copyright (C) 2016 Marco van Zwetselaar
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.