- Last modified: Wed Jan 17, 2024 11:35
- Sign: nylander
Try to correct the direction of DNA sequences in a fasta file (by reverse-complementing them) relative to a reference sequence.
More specifically, try to "re-orient" (reverse+complement) sequences based on their blast hits (using bl2seq, blastn). All sequences in the input are blasted against the same reference sequence. A good strategy is to use a reference sequence that is as long and "representative" as possible.
The reference sequence can be given in a separate file (see Options below), otherwise the first sequence in input will be used as reference.
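As a rough illustration of the idea (not the script's actual implementation, which goes through the Perl module), the orientation decision can be sketched from blastn tabular output. The sketch assumes a two-column, tab-separated table of query ID and subject strand, such as `blastn -outfmt "6 qseqid sstrand"` would be expected to produce. The function name `classify_strand` and the labels `keep` and `revcomp` are made up for this example.

```shell
# Hypothetical helper: read "query<TAB>strand" lines on stdin and decide,
# from the first hit reported for each query, whether the query should be
# kept as is or reverse-complemented.
# Assumed input: the strand column holds "plus" or "minus", e.g. from
#   blastn -query infile.fasta -subject ref.fas -outfmt "6 qseqid sstrand"
# Queries without any hit produce no lines and are therefore not listed.
classify_strand() {
    awk -F '\t' '!seen[$1]++ {
        print $1 "\t" ($2 == "minus" ? "revcomp" : "keep")
    }'
}

# Toy input: seq2 has two hits, only the first one (minus) is used.
printf 'seq1\tplus\nseq2\tminus\nseq2\tplus\n' | classify_strand
# prints:
# seq1	keep
# seq2	revcomp
```

Using only the first hit per query is a simplification chosen for the sketch; the script itself may weigh hits differently.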
If any sequence is reverse-complemented, the string @revcomp@ is appended to the fasta header.
If some sequences are too dissimilar to the reference sequence, they are written to an (optional) file (see Options).
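To make the effect on a single record concrete, here is a minimal sketch of reverse-complementing a sequence and tagging its header. It is an illustration only: the function name `revcomp` is hypothetical, only plain A/C/G/T (upper or lower case) is handled, and it assumes the tag is appended directly to the end of the header line with no separator.

```shell
# Hypothetical helper: reverse-complement one DNA string.
# Only A, C, G, T are complemented; other characters (e.g. N or IUPAC
# ambiguity codes) are reversed but left unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

header='>seq2'
seq='ACCGT'

# Emit the record as it might look after re-orientation.
printf '%s@revcomp@\n%s\n' "$header" "$(revcomp "$seq")"
# prints:
# >seq2@revcomp@
# ACGGT
```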
$ reorient.pl infile.fasta
$ reorient.pl --help
-r, --ref=<ref.fas>
Use (first) sequence in file ref.fas as reference. If ref.fas is not given, the first sequence in input will be used as reference.
-o, --out=<out.fas>
Write to file out.fas instead of STDOUT.
-f, --fail=<fail.fas>
Sequences with 'No hit' with ref sequence will be stored in fail.fas.
-v, --version
Show version number.
-V, --verbose
Be verbose about input/output.
--noverbose
Turn off extra printing.
The script requires perl, perldoc, and the perl module StandAloneBlastPlus.pm. In addition, ncbi-blast+ ("stand-alone blast") needs to be installed.
On a Debian-based Linux system, they can be installed using:
$ sudo apt install ncbi-blast+ bioperl bioperl-run
In addition, a definition file for building a singularity/apptainer container is provided. The reorient.pl script with all dependencies can then be installed as a stand-alone singularity image file (approx. 450M). For example:
$ cd singularity
$ sudo singularity build reorient.sif reorient.def
$ ./reorient.sif -h
Reverse-complemented sequences in the output can easily be identified using, e.g., grep:
$ grep '@revcomp@' out.fas
The added string can be removed using GNU sed:
$ sed -i 's/@revcomp@$//' out.fas
Or in one go:
$ ./src/reorient.pl data/infile.fasta | \
sed 's/@revcomp@$//' > out.fas
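Before stripping the tag, it can be useful to see how many sequences were actually flipped. The following is a small sketch of such a check; the function name `orientation_summary` is hypothetical, and it assumes the tag sits at the very end of the header line, as the sed expressions above imply.

```shell
# Hypothetical helper: report how many sequences a fasta file holds and
# how many of them carry the @revcomp@ tag at the end of the header.
orientation_summary() {
    # grep -c exits non-zero when the count is 0, hence "|| true".
    total=$(grep -c '^>' "$1" || true)
    flipped=$(grep -c '^>.*@revcomp@$' "$1" || true)
    printf '%s sequences, %s reverse-complemented\n' "$total" "$flipped"
}

# Toy data standing in for real script output.
tmp=$(mktemp)
printf '>seq1\nACGT\n>seq2@revcomp@\nACGGT\n' > "$tmp"
orientation_summary "$tmp"
# prints: 2 sequences, 1 reverse-complemented
rm -f "$tmp"
```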
Whether a sequence can be identified as being "plus" or "minus" depends on
blastn finding a significant hit against the reference. This may sometimes not
be possible. One may wish, however, to pay extra attention to those sequences
as they will probably be difficult to align (if this is the aim). To run the
script while saving sequences that are too dissimilar to a separate file, use
the -f option. For example:
$ ./src/reorient.pl -f fail.fas data/infile.fasta
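A quick way to review what ended up in the fail file is to list its headers. This is a sketch under the assumption that the fail file is ordinary fasta; the function name `list_failed` is hypothetical, and the check with `-s` covers both a missing and an empty file, since it is not stated here whether the file is created when nothing fails.

```shell
# Hypothetical helper: print the header (without ">") of every sequence
# in the given fail file, or a note on stderr if there is nothing to list.
list_failed() {
    if [ -s "$1" ]; then
        grep '^>' "$1" | sed 's/^>//'
    else
        echo "no failed sequences in $1" >&2
    fi
}

# Toy data standing in for a real fail.fas.
tmp=$(mktemp)
printf '>odd1 some description\nACGT\n>odd2\nGGCC\n' > "$tmp"
list_failed "$tmp"
# prints:
# odd1 some description
# odd2
rm -f "$tmp"
```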
The functionality provided in this script is already implemented in, e.g., the orient command in USEARCH and the --adjustdirection option in MAFFT. Both of these use a different criterion for deciding the sequence direction.