Collection of scripts used to visualize protein sequences.

Update 2024.04.08: BLAST and Clustal Omega API calls have been removed and replaced by calls to local installations:
NCBI BLAST+: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Clustal Omega: http://www.clustal.org/omega/

NOTE: The above only need to be unpacked and added to your PATH (Path on Windows).

NOTE: The descriptions in this README are not entirely accurate as of 2024.04.08. Scripts should function similarly, but now require the command-line installations of BLAST and Clustal Omega.

The out-of-date README continues below.

NOTE: These scripts make use of EMBL-EBI and NCBI resources. References for tools and databases used here include:

UniProt:
The UniProt Consortium.
“UniProt: The Universal Protein Knowledgebase in 2023.”
Nucleic Acids Research 51, no. D1 (January 6, 2023): D523–31. https://doi.org/10.1093/nar/gkac1052.

NCBI:
Sayers, Eric W, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, et al.
“Database Resources of the National Center for Biotechnology Information.”
Nucleic Acids Research 50, no. D1 (December 1, 2021): D20–26. https://doi.org/10.1093/nar/gkab1112.

Clustal Omega:
Sievers, Fabian, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, et al.
“Fast, Scalable Generation of High‐quality Protein Multiple Sequence Alignments Using Clustal Omega.”
Molecular Systems Biology 7, no. 1 (January 2011): 539. https://doi.org/10.1038/msb.2011.75.

Sievers, Fabian, and Desmond G. Higgins.
“Clustal Omega for Making Accurate Alignments of Many Protein Sequences.”
Protein Science: A Publication of the Protein Society 27, no. 1 (January 2018): 135–45. https://doi.org/10.1002/pro.3290.

BLAST+:
Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden.
“BLAST+: Architecture and Applications.”
BMC Bioinformatics 10, no. 1 (December 2009): 1–9. https://doi.org/10.1186/1471-2105-10-421.

https://blast.ncbi.nlm.nih.gov/doc/blast-help/references.html#references

Prerequisites

General:

Internet connection (when running search_proteins.py and retrieve_annotations.py)
Python 3.7+
~ 1 GB of storage for Swiss-Prot BLAST database (if running search_proteins.py)

Command-line tools:

Clustal Omega (http://www.clustal.org/omega/)
NCBI BLAST+ (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)

Python Libraries:

pandas
requests

Installing

Before running, ensure that required command-line tools are on your PATH.

Clustal Omega is required for alignment.py
NCBI BLAST+ is required for search_proteins.py
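
A quick way to confirm that both tools are reachable before running anything is to look them up on PATH. The sketch below assumes the executables are named clustalo (Clustal Omega) and blastp (NCBI BLAST+), which are the usual names in the standard distributions; the helper name find_missing_tools is hypothetical and not part of this repository.

```python
import shutil

# Assumed executable names: "clustalo" (Clustal Omega) and "blastp" (NCBI BLAST+).
REQUIRED_TOOLS = {
    "alignment.py": "clustalo",
    "search_proteins.py": "blastp",
}


def find_missing_tools(required=REQUIRED_TOOLS):
    """Return {script: executable} for executables that are not found on PATH."""
    return {
        script: exe
        for script, exe in required.items()
        if shutil.which(exe) is None
    }


if __name__ == "__main__":
    for script, exe in find_missing_tools().items():
        print(f"{exe} not found on PATH (needed by {script})")
```

If nothing is printed, both executables were found.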

Download protein_alignment_tool and add it to PATH.

search_proteins.py

Takes one or more protein sequences (FASTA format) as input and BLASTs them against UniProt databases (uniprotkb_refprotswissprot).

NOTE: --stype dna is currently not supported and may cause an infinite loop. Please do not use --stype dna until this is updated.

INPUT: FASTA-formatted file with at least one sequence

OUTPUT: A set of directories - one for each sequence in the original input file - that contain the following:
the BLAST results for that sequence (the query) against UniProt databases in both table ([QUERY].tsv) and readable form ([QUERY].out)
individual FASTA files with UniProt sequences for each BLAST hit
one FASTA file containing all protein sequences, including the query sequence (all.fasta)
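
For orientation, the two BLAST reports described above could be produced by a local blastp call along the following lines. This is only a sketch of one possible invocation, not necessarily how search_proteins.py calls BLAST: the function name, the database name "swissprot", and the default of 5 results are assumptions. The blastp options used (-query, -db, -outfmt, -max_target_seqs, -num_alignments, -out) are standard BLAST+ options.

```python
from pathlib import Path


def build_blastp_commands(query_fasta, out_dir, db="swissprot", num_res=5):
    """Build blastp argument lists for a tabular (.tsv) and a readable (.out) report.

    The db name and num_res default are placeholders, not values from this tool.
    """
    query = Path(query_fasta)
    out = Path(out_dir)
    base = ["blastp", "-query", str(query), "-db", db]
    # -outfmt 6 is BLAST+'s tab-separated format.
    tabular = base + [
        "-outfmt", "6",
        "-max_target_seqs", str(num_res),
        "-out", str(out / f"{query.stem}.tsv"),
    ]
    # -outfmt 0 is the default pairwise (human-readable) format.
    readable = base + [
        "-outfmt", "0",
        "-num_alignments", str(num_res),
        "-out", str(out / f"{query.stem}.out"),
    ]
    return tabular, readable
```

Each returned list can be passed to subprocess.run(cmd, check=True) once blastp is on PATH and a local database has been set up.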

If used in a "multi" run, downstream commands will be run on each resulting collection of outputs.

For example: search_proteins.py will yield multiple all.fasta files (one for each query), which can be sent to both retrieve_annotations.py and alignment.py. This is why "blast annotate align" is a valid input for the --order optional argument.

Example usage from main_tool.py:
python main_tool.py [-i INFILE] [-o OUT_DIRECTORY] blast [-h] [-s STYPE] [-e EMAIL] [-nr NUM_RES]

Example usage:
python search_proteins.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-s STYPE] [-e EMAIL] [-nr NUM_RES]

optional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory. Must end with "/".
-s STYPE, --stype STYPE
Sequence type ("protein" or "dna"). Use "dna" if aligning RNA sequences too. If run using multi, use only protein sequences and "protein" in the --stype optional argument.
-e EMAIL, --email EMAIL
Personal email. Used to submit BLAST and Clustal Omega jobs.
-nr NUM_RES, --num_res NUM_RES
Number of results.
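
Because the output directory must end with "/", a wrapper script may want to normalize the path before passing it on. A minimal sketch follows; the helper name is hypothetical and not part of this repository.

```python
def normalize_out_directory(path):
    """Return path with a trailing "/" appended if it is missing."""
    return path if path.endswith("/") else path + "/"
```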

retrieve_annotations.py

Takes one or more UniProt protein sequences (FASTA format) as input and retrieves annotations for those sequences.

NOTE: Please only use UniProt sequences here until this feature is updated. The script may run indefinitely if it cannot find a given entry.

INPUT: A FASTA-formatted, CLUSTAL_NUM-formatted, or CLUSTAL-formatted file with at least one protein sequence. ALL protein sequences must be from UniProt (specifically, they must have a UniProt accession at the beginning of their names). At least three sequences are necessary if used in a multi run that includes alignment.py (align).

OUTPUT: A collection of files that includes the following:
individual annotation files (.ann), one for each unique sequence in the input file
one combined annotation file that includes all annotations for this collection of sequences (all.ann)

NOTE: all.ann can be used as input for clustal_to_svg.py. See ANNOTATION FORMAT for help formatting annotations by hand.
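
Since a missing UniProt entry can stall the run, it may help to check sequence names before calling retrieve_annotations.py. The sketch below assumes names follow the usual UniProt FASTA header layout, db|ACCESSION|ENTRY_NAME with db being sp or tr (as in sp|P00784|PAPA1_CARPA). The regular expression and function name are illustrative and are not the check the script itself performs.

```python
import re

# Assumed layout: optional ">", then "sp" or "tr", accession, and entry name.
UNIPROT_NAME = re.compile(r"^>?(sp|tr)\|([A-Z0-9-]+)\|(\w+)")


def parse_uniprot_name(name):
    """Return (accession, entry_name) for a UniProt-style name, else None."""
    match = UNIPROT_NAME.match(name)
    if match is None:
        return None
    return match.group(2), match.group(3)


if __name__ == "__main__":
    print(parse_uniprot_name(">sp|P00784|PAPA1_CARPA Papain"))
    print(parse_uniprot_name(">my_unlabeled_sequence"))
```

Names for which the function returns None are candidates to remove or rename before annotation.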

Example usage from main_tool.py:
python main_tool.py [-i INFILE] [-o OUT_DIRECTORY] annotate [-h]

Example usage:
python retrieve_annotations.py [-h] [-i INFILE] [-o OUT_DIRECTORY]

alignment.py

Takes at least three protein sequences as input and aligns them using Clustal Omega.

INPUT: A FASTA-formatted file with at least three sequences

OUTPUT: An alignment (.clustal_num) and associated percent identity matrix (.pim) of the given FASTA file
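
As a rough illustration, the three-sequence requirement and a local Clustal Omega call could be sketched as below. The function names and the default title are hypothetical, and the particular flag combination is an assumption about one way to get a numbered Clustal alignment plus a percent identity matrix; it is not taken from alignment.py. The flags themselves (-i, -o, --outfmt, --resno, --full, --percent-id, --distmat-out) are standard clustalo options.

```python
from pathlib import Path


def count_fasta_records(text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def build_clustalo_command(infile, out_dir, title="alignment"):
    """Build a clustalo argument list writing TITLE.clustal_num and TITLE.pim."""
    out = Path(out_dir)
    return [
        "clustalo",
        "-i", str(infile),
        "-o", str(out / f"{title}.clustal_num"),
        "--outfmt=clu",   # Clustal output format
        "--resno",        # print residue numbers in Clustal output
        "--full",         # compute the full distance matrix
        "--percent-id",   # report distances as percent identities
        f"--distmat-out={out / (title + '.pim')}",
    ]


if __name__ == "__main__":
    fasta_text = ">a\nMKT\n>b\nMKS\n>c\nMRT\n"
    if count_fasta_records(fasta_text) < 3:
        raise SystemExit("At least three sequences are required.")
    print(" ".join(build_clustalo_command("all.fasta", "out/", "alignment1")))
```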

Example usage from main_tool.py:
python main_tool.py [-i INFILE] [-o OUT_DIRECTORY] align [-h] [-s STYPE] [-e EMAIL] [-t TITLE]

Example usage:
python alignment.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-s STYPE] [-e EMAIL] [-t TITLE]

optional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory. Must end with "/".
-s STYPE, --stype STYPE
Sequence type ("protein" or "dna"). Use "dna" if aligning RNA sequences too. If run using multi, use only protein sequences and "protein" in the --stype optional argument.
-e EMAIL, --email EMAIL
Personal email. Used to submit BLAST and Clustal Omega jobs.
-t TITLE, --title TITLE
Title for the output alignment (.clustal_num) and percent identity matrix (.pim). Example: alignment1 -> alignment1.clustal_num, alignment1.pim

clustal_to_svg.py

Reformats a .clustal_num or .clustal alignment into an editable Inkscape SVG. Currently annotates conserved residues (automatic, not optional) and active site residues (requires an input annotation file, optional).

INPUT: A CLUSTAL or CLUSTAL_NUM file

OUTPUT: A sequential set of SVGs (.svg), numbered 0, 1, 2, etc., with formatted alignments and associated conserved residues and/or annotations.
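
To make "conserved residues" concrete, the sketch below finds alignment columns in which every sequence carries the same non-gap residue, which corresponds to the "*" conservation code in Clustal output. This definition is an assumption; the exact rule used by clustal_to_svg.py is not described here, and the function name is hypothetical.

```python
def conserved_columns(aligned_seqs):
    """Return 0-based column indices where all sequences share one non-gap residue."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    return [
        i
        for i, column in enumerate(zip(*aligned_seqs))
        if column[0] != "-" and len(set(column)) == 1
    ]


if __name__ == "__main__":
    print(conserved_columns(["MKC-A", "MRC-A", "MKCTA"]))  # [0, 2, 4]
```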

Example usage from main_tool.py:
python main_tool.py [-i INFILE] [-o OUT_DIRECTORY] svg [-h] [-c CODES] [-n NUMS] [-u UNIPROT_FORMAT] [-a ANNOTATIONS]

Example usage:
python clustal_to_svg.py [-h] [-i INFILE] [-o OUT_DIRECTORY] [-c CODES] [-n NUMS] [-u UNIPROT_FORMAT] [-a ANNOTATIONS]

optional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory. Must end with "/".
-c CODES, --codes CODES
Default FALSE. If TRUE, will add Clustal Omega conservation codes to the bottom of each aligned block.
-n NUMS, --nums NUMS
Default FALSE. If TRUE, will add total residue numbers to the right side of every line.
-u UNIPROT_FORMAT, --uniprot_format UNIPROT_FORMAT
Default FALSE. If TRUE, will truncate accessions according to UniProt formatting. Example: sp|P00784|PAPA1_CARPA -> PAPA1_CARPA
-a ANNOTATIONS, --annotations ANNOTATIONS
Full path to annotation file. Currently only supports active site annotations. Others will be ignored. If run using multi, annotations can either be provided separately, or acquired from UniProt by including "annotate" in the --order optional argument.
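
The truncation performed by --uniprot_format can be pictured with a small helper like the one below. It is a sketch of the documented behavior (sp|P00784|PAPA1_CARPA -> PAPA1_CARPA), not the tool's own code; the function name is hypothetical, and leaving non-matching names unchanged is an assumption.

```python
def truncate_accession(name):
    """Return ENTRY_NAME from 'db|ACCESSION|ENTRY_NAME'; other names are unchanged."""
    parts = name.split("|")
    if len(parts) == 3 and parts[2]:
        return parts[2]
    return name
```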

main.py

Runs one or more of the above in the order given.

INPUT: Depends on which tool(s) are being executed. Must be an acceptable input for the first tool being executed. Inputs for runs that start with retrieve_annotations.py (annotate) may additionally be limited by what can be passed to downstream tools. Example: multi --order annotate svg -> must use a .clustal or .clustal_num file as input (a .fasta file cannot be used by clustal_to_svg.py).

OUTPUT: Depends on which tool(s) are being executed. Each tool will have its own output if it is included in a run. All outputs will be split into separate directories in a multi run that includes search_proteins.py (blast).

Example usage:
python main.py [-i INFILE] [-o OUT_DIRECTORY] {blast,annotate,align,svg,multi} [-h] [-ord ORDER [ORDER...]]
[-s STYPE] [-e EMAIL]
[-nr NUM_RES] [-t TITLE]
[-c CODES] [-n NUMS]
[-u UNIPROT_FORMAT]
[-a ANNOTATIONS]

Centralized Tool Manager

positional arguments:
{blast,annotate,align,svg,multi}
Tool to execute

optional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Full path of input file.
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Full path of output directory. Must end with "/".
-ord ORDER [ORDER ...], --order ORDER [ORDER ...]
Order of tools to run if "multi" is used as a positional argument. There are currently limited ways to run multi (inputs and outputs will vary depending on start and end):
blast annotate align svg
blast annotate align
blast annotate
blast align annotate svg
blast align annotate
blast align svg
blast align
blast
annotate align svg
annotate align
annotate svg
annotate
align annotate svg
align annotate
align svg
align
svg
-s STYPE, --stype STYPE
Sequence type ("protein" or "dna"). Use "dna" if aligning RNA sequences too. If run using multi, use only protein sequences and "protein" in the --stype optional argument.
-e EMAIL, --email EMAIL
Personal email. Used to submit BLAST and Clustal Omega jobs.
-nr NUM_RES, --num_res NUM_RES
Number of results.
-t TITLE, --title TITLE
Title for the output alignment (.clustal_num) and percent identity matrix (.pim). Example: alignment1 -> alignment1.clustal_num, alignment1.pim
-c CODES, --codes CODES
Default FALSE. If TRUE, will add Clustal Omega conservation codes to the bottom of each aligned block.
-n NUMS, --nums NUMS
Default FALSE. If TRUE, will add total residue numbers to the right side of every line.
-u UNIPROT_FORMAT, --uniprot_format UNIPROT_FORMAT
Default FALSE. If TRUE, will truncate accessions according to UniProt formatting. Example: sp|P00784|PAPA1_CARPA -> PAPA1_CARPA
-a ANNOTATIONS, --annotations ANNOTATIONS
Full path to annotation file. Currently only supports active site annotations. Others will be ignored. If run using multi, annotations can either be provided separately, or acquired from UniProt by including "annotate" in the --order optional argument.
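
Since only the --order combinations listed above are accepted, a wrapper can check a requested order before launching a multi run. The sketch below simply encodes that list; the names VALID_ORDERS and is_valid_order are hypothetical and not part of this repository.

```python
# The accepted --order combinations, copied from the list in this README.
VALID_ORDERS = {
    tuple(order.split())
    for order in (
        "blast annotate align svg",
        "blast annotate align",
        "blast annotate",
        "blast align annotate svg",
        "blast align annotate",
        "blast align svg",
        "blast align",
        "blast",
        "annotate align svg",
        "annotate align",
        "annotate svg",
        "annotate",
        "align annotate svg",
        "align annotate",
        "align svg",
        "align",
        "svg",
    )
}


def is_valid_order(order):
    """Return True if the sequence of tool names is an accepted multi order."""
    return tuple(order) in VALID_ORDERS
```

For example, is_valid_order(["blast", "annotate", "align"]) is accepted, while is_valid_order(["svg", "align"]) is not.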

ANNOTATION FORMAT

As of 2024.03.19, only annotations of type "Active site" will be used. This will be updated in the future.

Annotation files that include the following columns and VALUES (tab-delimited) can be used as inputs for clustal_to_svg.py:

	prot	whole_prot	type	location.start.value	location.end.value
ARBITRARY_INDEX	UNIPROT_FORMAT_ACC	FULL_ACCESSION	ANNOTATION_TYPE	START	END

A real example might look like:

	prot	whole_prot	type	location.start.value	location.end.value
0	PAPA1_CARPA	sp|P00784|PAPA1_CARPA	Active site	158	158

A truncated but real example of a valid annotation file can be found in annotation_example.ann.
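
To check a hand-written annotation file against this format, it can be read as tab-delimited text and filtered to "Active site" rows. The sketch below uses only the standard library; the function name is hypothetical, and the second data row is an invented placeholder that only shows a non-"Active site" row being skipped.

```python
import csv
import io

# First data row is the example from this README; the second row is a
# hypothetical placeholder, not real UniProt data.
EXAMPLE = (
    "\tprot\twhole_prot\ttype\tlocation.start.value\tlocation.end.value\n"
    "0\tPAPA1_CARPA\tsp|P00784|PAPA1_CARPA\tActive site\t158\t158\n"
    "1\tEXAMPLE_PROT\tsp|X00000|EXAMPLE_PROT\tBinding site\t10\t10\n"
)


def active_sites(handle):
    """Return (prot, start, end) tuples for rows of type 'Active site'."""
    reader = csv.DictReader(handle, delimiter="\t")
    sites = []
    for row in reader:
        if row["type"] == "Active site":
            start = int(row["location.start.value"])
            end = int(row["location.end.value"])
            sites.append((row["prot"], start, end))
    return sites


if __name__ == "__main__":
    print(active_sites(io.StringIO(EXAMPLE)))
```

A missing column raises a KeyError, which is a quick signal that the header row does not match the expected format.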

grtakaha/protein_alignment_tool