These Python scripts are relatively well-documented and battle tested, and may be useful to a somewhat wider audience than their author.
Note: Some scripts require the biogl module to function properly. If you use pip
, you can simply do
python3 -m pip install biogl
to add the package to your Python install directly. Otherwise, you can pull the local submodule from this repo along with the scripts:
git clone --recursive https://github.com/glarue/bioinformatics_scripts.git
- cdseq | Extract transcript/coding sequences from a genome using an annotation file
- climail | Send an email over SSL from the command line
- cmdlog | Log a shell commmand with optional note
- faseq | Retrieve (sub)sequences from a FASTA file
- fasterq_dump | Run NCBI's
fastq-dump
on one or more accession numbers faster than normal - notify | Automatically send an email at the end of a (usually long-running) command
- palign | A parallel implementation of NCBI's BLAST+ suite (faster than their threaded implementation for certain db/query size combinations) or a convenience wrapper for DIAMOND
- quickstats | Calculate basic statistics on single-column input data
- randl | Retrieve a random set of lines from a file (FASTA-format aware)
- reciprologs | Find the best reciprocal BLAST/DIAMOND hits between 2+ sets of sequences
- revcomp | Reverse-complement a nucleotide sequence
- tophits | Get the top n hits for each query sequence in a BLAST tabular output file
- transeq | Translate a nucleotide sequence to amino acid sequence
usage: cdseq [-h] [-t] [-e] [-i] [-c] [-v] [-n] [--introns]
[--truncate_introns int_length]
genome annotation
Extract transcript/coding sequences from a genome using an annotation file
positional arguments:
genome genome file in FASTA format
annotation annotation file in GFF[3]/GTF format
optional arguments:
-h, --help show this help message and exit
-t, --translate translate the output sequence (default: False)
-e, --exon use exons instead of CDS entries to define coding
sequence (default: False)
-i, --isoforms allow multiple isoforms per gene, instead of only
longest (default: False)
-c, --coord_based_isoforms
detect isoforms by overlapping coordindates in
addition to shared parent (useful for annotations
without gene entries) (default: False)
-v, --verbose_headers
include coordinate info in output headers (default:
False)
-n, --non_coding include non-coding (intronic, UTR, etc.) sequence;
uses only the coordinates of the transcript itself
(incompatible with --introns) (default: False)
--introns include intron sequences in lowercase (incompatible
with -n) (default: False)
--truncate_introns int_length
truncate intron sequences to length {int_length} (or
{int_length} -1 if odd) (default: None)
usage: climail [-h] -sadd SERVER_ADDRESS -fadd FROM_ADDRESS -pw PASSWORD -tadd
TO_ADDRESS [-p PORT] [-s [SUBJECT]] [-b [BODY]] [--ID ID]
[--no_host_tag] [--suffix_tag]
Send an email over SSL from the command line
optional arguments:
-h, --help show this help message and exit
-p PORT, --port PORT Port to use on email server (default: 587)
-s [SUBJECT], --subject [SUBJECT]
A subject line for the email, wrapped in single quotes
('') (default: None)
-b [BODY], --body [BODY]
The body of the email, wrapped in single quotes ('')
(default: None)
--ID ID Arbitrary string to include in host tag (if present)
(default: None)
--no_host_tag disable host tag (time/server string) in subject
(default: False)
--suffix_tag place host tag at end of subject rather than beginning
(default: False)
required named arguments:
-sadd SERVER_ADDRESS, --server_address SERVER_ADDRESS
Server address for outgoing email using the SSL
protocol (default: None)
-fadd FROM_ADDRESS, --from_address FROM_ADDRESS
Full email address from which to send email (default:
None)
-pw PASSWORD, --password PASSWORD
Account password for the 'from' address. Wrap in
single quotes ('') (default: None)
-tadd TO_ADDRESS, --to_address TO_ADDRESS
Destination email address (default: None)
usage: cmdlog [-h] shell commands [shell commands ...]
Log a shell commmand with optional note
positional arguments:
shell commands one or more commands to be sent to the shell (may require
single or double quotes)
optional arguments:
-h, --help show this help message and exit
usage: faseq [-h] [-f INFO_FILE] [--flank FLANK] [-s SEPARATOR] [-u] [-e]
[--full_header] [-t integer]
FASTA_file [header strand start stop [label]
[header strand start stop [label] ...]]
Retrieves sequences from a FASTA file using the following format: header
strand start stop [label]. If {label} is provided, will output in FASTA format
using {label} as header. Otherwise, outputs one sequence per line. If only
{header} is provided, will return the entire header/sequence pair.
positional arguments:
FASTA_file FASTA-formatted text file
header strand start stop [label]
header strand start stop [label]
optional arguments:
-h, --help show this help message and exit
-f INFO_FILE, --info_file INFO_FILE
file with information for sequences on separate lines.
--flank FLANK size of any desired flanking region around specified
sequence(s)
-s SEPARATOR, --separator SEPARATOR
Character to use to separate flanking sequence (if
any) from main sequence, which defaults to {tab}.
-u, --unformatted leave original file formatting intact (e.g. don't join
multi-line entries into a single line)
-e, --exclude_header exclude the header line from the output (does not
apply to explicitly labeled queries)
--full_header match on full header string (including whitespace)
-t integer, --truncate integer
truncate sequence to the nearest even integer <=
{integer} (equally from both ends)
usage: fasterq_dump [-h] [-f FILE] [-k] [-t] [-w]
[accessions [accessions ...]]
A program to run fastq-dump sequentially on a list of accession numbers. Will
automatically detect if reads are single- or paired-end and will run fastq-
dump accordingly, adding proper ID suffixes as needed. Accession numbers may
be provided directly, or in a file using the -f option. If provided directly,
accessions may be comma or space separated, or hyphen-separated to specify a
range, e.g. SRR123455, SRR123456, SRR123457 or SRR123455-SRR123457
positional arguments:
accessions Space-separated list of SRA accession numbers (e.g.
SRR123456 SRR123457)
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE File with SRA accession numbers on separate lines
-k, --keep_sra_files Keep SRA files after dumping to FASTQ format
-t, --trinity_compatible_ids
rename read IDs in paired-end data to ensure
downstream compatibility with Trinity
-w, --overwrite overwrite any matching local SRA files
usage: notify [-h] [-e EMAIL] [--add_email] [--view_config] [--ID ID]
[external commands [external commands ...]]
Automatically sends an email to the specified address upon completion of the
specified command. Useful primarily for very long-running processes. In many
cases, the command being run (including the name of the other program) will
need to be placed in quotes. If no email address is provided with the -e flag,
a prompt will be displayed based upon the configuration file.
positional arguments:
external commands External commands to run, including external program
call (default: None)
optional arguments:
-h, --help show this help message and exit
-e EMAIL, --email EMAIL
the email address to notify (default: None)
--add_email add or change an email address in the config file
(default: False)
--view_config view the contents of the configuration file (default:
False)
--ID ID additional string to include in email subject
(default: None)
usage: palign [-h] [-p PARALLEL_PROCESSES] [-s] [-f OUTPUT_FORMAT]
[-o OUTPUT_NAME] [-t THREADS] [-e E_VALUE] [-naf] [--clobber_db]
query subject
{diamondp,diamondx,blastn,blastp,blastx,tblastn,tblastx}
Align one file against another. Any arguments not listed here will be passed
to the chosen aligner unmodified.
positional arguments:
query query file to be aligned
subject subject file to be aligned against
{diamondp,diamondx,blastn,blastp,blastx,tblastn,tblastx}
type of alignment to run
optional arguments:
-h, --help show this help message and exit
-p PARALLEL_PROCESSES, --parallel_processes PARALLEL_PROCESSES
run the alignment step using multiple parallel
processes (default: 1)
-s, --single disable parallel processing (default: False)
-f OUTPUT_FORMAT, --output_format OUTPUT_FORMAT
integer output format for alignment results (default:
6)
-o OUTPUT_NAME, --output_name OUTPUT_NAME
filename for results (otherwise, automatic based on
input) (default: None)
-t THREADS, --threads THREADS
number of threads per process. Be careful when
combining this with multiple processes! (default: 1)
-e E_VALUE, --e_value E_VALUE
e-value threshold to use for search (default: 1e-10)
-naf, --no_auto_format
disable helper operations that auto-format
subject/query as needed and build database if not
already present (default: False)
--clobber_db create new database even if one already exists
(default: False)
usage: randl [-h] [-s SEED] file n
Retrieves a random sample of lines from a file.
positional arguments:
file input file. If FASTA, will retrieve random
header/sequence pairs.
n size of sample to retrieve.
optional arguments:
-h, --help show this help message and exit
-s SEED, --seed SEED seed value for the pseudo-random algorithm
usage: reciprologs [-h] [-p PARALLEL_PROCESSES] [-q PERCENTAGE] [--chain]
[--overwrite] [--one_to_one] [--logging]
file_1 file_2 ... [file_1 file_2 ... ...]
{diamondp,diamondx,blastn,blastp,blastx,tblastn,tblastx}
Find reciprocal best hits between two or more files.
positional arguments:
file_1 file_2 ... files to use to build reciprolog sets (space
separated)
{diamondp,diamondx,blastn,blastp,blastx,tblastn,tblastx}
type of alignment program to run
optional arguments:
-h, --help show this help message and exit
-p PARALLEL_PROCESSES, --parallel_processes PARALLEL_PROCESSES
run the alignment step using multiple parallel
processes (default: 1)
-q PERCENTAGE, --query_percentage_threshold PERCENTAGE
require a specified fraction of the query length to
match in order for a hit to qualify (lowest allowable
percentage (default: None)
--chain cluster reciprologs without requiring all-by-all
pairwise relationships, e.g. A-B, A-C, A-D --> A-B-C-D
(default: False)
--overwrite overwrite existing output files (instead of using to
bypass alignment) (default: False)
--one_to_one remove any many-to-one reciprolog relationships in
each pairwise set, such that each member of each
pairwise comparison is only present exactly one time
in output (default: False)
--logging output a log of best-hit choice criteria (default:
False)
usage: revcomp [-h] [-f FILE] [--no_mask] [--no_lower] [seq]
reverse-complement a nucleotide sequence
positional arguments:
seq nucleotide sequence to be reverse-complemented
(default: None)
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE file to reverse-complement (FASTA-compatible)
(default: None)
--no_mask don't mask non-ACTG characters with N (default: False)
--no_lower consider lowercase input invalid (will still reverse)
(default: False)
usage: tophits [-h] [-n NUMBER_OF_HITS] [-e E_VALUE_CUTOFF] [-d] [-s] [-m]
[-r]
blast_file
Reports the top n BLAST hits for each query in a (tabular) blast output file
positional arguments:
blast_file
optional arguments:
-h, --help show this help message and exit
-n NUMBER_OF_HITS, --number_of_hits NUMBER_OF_HITS
number of hits to report for each unique query
(default: 1)
-e E_VALUE_CUTOFF, --e_value_cutoff E_VALUE_CUTOFF
exclude hits with e-value greater than this value
(default: None)
-d, --allow_duplicate_target_hits
allow multiple hits to the same subject to be included
(default: False)
-s, --sort_by_name sort output by name rather than bitscore (default:
False)
-m, --memory_efficient
avoid reading entire dataset into memory at once, to
increase memory efficiency (default: False)
-r, --redundant_ids allow query and subject to share the same identifier
(default: False)
usage: transeq [-h] [-v {short,long}] [-p {1,2}] [-r] [-s STOP_CHARACTER]
[--no_mask]
[sequence_input]
Translate nucleotide sequence into amino acid sequence
positional arguments:
sequence_input sequence string or file containing sequence(s)
(default: None)
optional arguments:
-h, --help show this help message and exit
-v {short,long}, --verbosity {short,long}
amino acid abbreviation length (e.g. Glu/Glutamine)
(default: single)
-p {1,2}, --phase {1,2}
change the phase of translation (default: 0)
-r, --reverse_complement
reverse-complement the sequence before translation
(default: False)
-s STOP_CHARACTER, --stop_character STOP_CHARACTER
the string to use for stop codons (default: *)
--no_mask do not mask unrecognized codons with X; report
lowercased (default: False)