MP3vec : A Reusable Machine-Constructed Feature Representation for Protein Sequences

Multi-Purpose Protein Prediction (MP3) vectors are generated by training a deep neural network on a source problem (secondary structure prediction) and extracting the outputs of its intermediate layers. These vectors are transferable representations that can be used as baseline features across a wide range of target tasks involving sequence to sequence learning. This repository contains software tools to generate MP3 vectors for a given set of proteins.

If you use our code in your projects, please cite our paper:

@INPROCEEDINGS{9313301,  
author={S. R. {Gupte} and D. S. {Jain} and A. {Srinivasan} and R. {Aduri}},  
booktitle={2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},   
title={MP3vec: A Reusable Machine-Constructed Feature Representation for Protein Sequences},   
year={2020},  volume={},  number={},  pages={421-425},  
doi={10.1109/BIBM49941.2020.9313301}}

S. R. Gupte, D. S. Jain, A. Srinivasan and R. Aduri, "MP3vec: A Reusable Machine-Constructed Feature Representation for Protein Sequences," 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Korea (South), 2020, pp. 421-425, doi: 10.1109/BIBM49941.2020.9313301.

Index

Package overview
Installation of NCBI BLAST+ and Uniref90
- Installing NCBI BLAST+
- Setting up the Uniref90 database
Installing the MP3vec package
Usage instructions
- Command line script for generating MP3 Vectors
- Command line script for generating PSSM files using PSI-BLAST
Examples
- Example 1: Using the command line scripts to generate MP3 vectors
- Example 2: A Python script for generating MP3 vectors

Package overview

The MP3vec Python module has a core MP3Model class that wraps the trained model together with utility functions for reading and processing input PSSM files. The model returns an array of shape (L, 640), where L is the length of the protein sequence. For users who wish to generate MP3vecs from PSSM files without using the Python module, there is a command line utility script, mp3vec, which generates vectors for a directory of PSSM files. Since MP3vec generation requires PSSM files, another command line utility, mp3pssm, is provided to generate them from a single FASTA file of sequences using PSI-BLAST.

This package has been tested on Ubuntu 20.04 but should work on other Linux based systems as well. Our code is tied to TensorFlow v2.2.0 and runs on Python v3.5-3.8. To generate PSSMs using PSI-BLAST, you will need to install the NCBI BLAST+ software suite and download the Uniref90 dataset.
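As a quick sanity check before installing, the version requirement above can be tested from Python. This is only a minimal sketch: the helper name python_version_supported is hypothetical and not part of the MP3vec package, and the range is simply the Python v3.5-3.8 range quoted above.

```python
import sys

# Supported range taken from the text above: Python 3.5 through 3.8.
MIN_SUPPORTED = (3, 5)
MAX_SUPPORTED = (3, 8)


def python_version_supported(version_info=None):
    """Return True if the (major, minor) version lies in the tested range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = (version_info[0], version_info[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if not python_version_supported():
    print("Warning: this interpreter is outside the Python 3.5-3.8 range "
          "the package was tested with.")
```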

Installation of NCBI BLAST+ and Uniref90

Installing NCBI BLAST+

Download the Linux binaries for NCBI BLAST+ from http://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST by copying the link in your browser's address bar. The Linux binaries will have a name like ncbi-blast-2.7.1+-x64-linux.tar.gz although the version number 2.7.1 may change over time.

Once the download is complete, copy the file to a folder of your choice, say ~/Desktop, and extract it with the command tar xvf ncbi-blast-2.7.1+-x64-linux.tar.gz. All the BLAST+ executables will be stored in the folder ~/Desktop/ncbi-blast-2.7.1+/bin/. However, you will not be able to run these programs from the command line until you add this folder to the PATH environment variable. To do so, use the following command: export PATH="$HOME/Desktop/ncbi-blast-2.7.1+/bin:$PATH". Note that $HOME is used instead of ~ because the shell does not expand ~ inside double quotes. If you run echo $PATH, you should see that the PATH variable has been updated. This change only lasts for the current shell session. To make it permanent, use your editor of choice to add the same line, export PATH="$HOME/Desktop/ncbi-blast-2.7.1+/bin:$PATH", to your ~/.bashrc file, then reload the file with the command source ~/.bashrc. Otherwise, the MP3vec utility script that generates PSSMs using PSI-BLAST will not be able to run psiblast in new sessions.

sanket@GPU:~$ cp Downloads/ncbi-blast-2.7.1+-x64-linux.tar.gz Desktop/
sanket@GPU:~$ cd Desktop/
sanket@GPU:~/Desktop$ tar xvf ncbi-blast-2.7.1+-x64-linux.tar.gz 
ncbi-blast-2.7.1+/
ncbi-blast-2.7.1+/doc/
ncbi-blast-2.7.1+/doc/README.txt
ncbi-blast-2.7.1+/LICENSE
ncbi-blast-2.7.1+/ChangeLog
ncbi-blast-2.7.1+/README
ncbi-blast-2.7.1+/ncbi_package_info
ncbi-blast-2.7.1+/bin/
ncbi-blast-2.7.1+/bin/makembindex
ncbi-blast-2.7.1+/bin/makeblastdb
ncbi-blast-2.7.1+/bin/tblastx
ncbi-blast-2.7.1+/bin/blastdbcheck
ncbi-blast-2.7.1+/bin/dustmasker
ncbi-blast-2.7.1+/bin/rpsblast
ncbi-blast-2.7.1+/bin/segmasker
ncbi-blast-2.7.1+/bin/windowmasker
ncbi-blast-2.7.1+/bin/convert2blastmask
ncbi-blast-2.7.1+/bin/update_blastdb.pl
ncbi-blast-2.7.1+/bin/psiblast
ncbi-blast-2.7.1+/bin/blastn
ncbi-blast-2.7.1+/bin/blast_formatter
ncbi-blast-2.7.1+/bin/blastp
ncbi-blast-2.7.1+/bin/makeprofiledb
ncbi-blast-2.7.1+/bin/legacy_blast.pl
ncbi-blast-2.7.1+/bin/blastdb_aliastool
ncbi-blast-2.7.1+/bin/deltablast
ncbi-blast-2.7.1+/bin/blastx
ncbi-blast-2.7.1+/bin/rpstblastn
ncbi-blast-2.7.1+/bin/tblastn
ncbi-blast-2.7.1+/bin/blastdbcmd
sanket@GPU:~/Desktop$ export PATH="/home/sanket/Desktop/ncbi-blast-2.7.1+/bin:$PATH"
sanket@GPU:~/Desktop$ echo $PATH
/home/sanket/Desktop/ncbi-blast-2.7.1+/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
sanket@GPU:~/Desktop$ which psiblast
/home/sanket/Desktop/ncbi-blast-2.7.1+/bin/psiblast
sanket@GPU:~/Desktop$ psiblast -version
psiblast: 2.7.1+
 Package: blast 2.7.1, build Oct 18 2017 19:57:24
sanket@GPU:~/Desktop$
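
The same check that which psiblast performs in the session above can also be done from Python, which is handy inside scripts that will later call PSI-BLAST. The following is a minimal sketch using only the standard library; the helper name find_executable is hypothetical and not part of the MP3vec package.

```python
import shutil


def find_executable(name="psiblast"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable("psiblast")
if path is None:
    print("psiblast was not found on PATH; check the export line in ~/.bashrc")
else:
    print("psiblast found at", path)
```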

Setting up the Uniref90 database

You can find the download link for Uniref90 at https://www.uniprot.org/downloads. Make sure you download the FASTA file from http://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz. Create a new folder, say ~/Desktop/uniref90, at a location of your choice and copy the uniref90.fasta.gz file to this folder. Extract the file in the folder using the command gunzip ~/Desktop/uniref90/uniref90.fasta.gz so that it appears at ~/Desktop/uniref90/uniref90.fasta.

Now you can build the BLAST database using the uniref90.fasta file that you have extracted. The makeblastdb command is used to create the BLAST database. Open a Terminal and navigate to the folder containing the uniref90.fasta file, ~/Desktop/uniref90/, and issue the command makeblastdb -in uniref90.fasta -out uniref90.db -dbtype prot. This creates a new database uniref90.db using the input uniref90.fasta file. The database type is specified by -dbtype prot which creates a protein database.

sanket@GPU:~$ mkdir ~/Desktop/uniref90
sanket@GPU:~$ cp ~/Downloads/uniref90.fasta.gz ~/Desktop/uniref90
sanket@GPU:~$ cd Desktop/uniref90/
sanket@GPU:~/Desktop/uniref90$ gunzip uniref90.fasta.gz
sanket@GPU:~/Desktop/uniref90$ makeblastdb -in uniref90.fasta -out uniref90.db -dbtype prot


Building a new DB, current time: 08/17/2018 00:23:46
New DB name:   /home/sanket/Desktop/uniref90/uniref90.db
New DB title:  uniref90.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 52629880 sequences in 1536.75 seconds.
sanket@GPU:~/Desktop/uniref90$

The database has been successfully built and you can now run queries against it using psiblast. For ease of use, set a new environment variable, BLASTDB, which points to the location of this database. You can do this with export BLASTDB="/home/sanket/Desktop/uniref90/uniref90.db". Make sure that you do not accidentally enter the path to the original uniref90.fasta file; you no longer need it. Consider adding this variable to your ~/.bashrc file to make the change permanent, similar to how you updated the PATH variable.

To check that the BLASTDB environment variable has been set properly, open a Terminal and type echo $BLASTDB. You should see the location of the database, i.e. /home/sanket/Desktop/uniref90/uniref90.db.

sanket@GPU:~$ export BLASTDB="/home/sanket/Desktop/uniref90/uniref90.db"
sanket@GPU:~$ echo $BLASTDB
/home/sanket/Desktop/uniref90/uniref90.db
sanket@GPU:~$
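
Note that uniref90.db is a prefix rather than a single file: makeblastdb writes several files whose names start with the value given to -out (for protein databases these typically end in extensions such as .phr, .pin and .psq, possibly split into numbered volumes). The sketch below checks that BLASTDB is set and that at least one such file exists. The helper names blast_db_files and check_blastdb are hypothetical, and the check is deliberately loose: it only looks for files sharing the prefix and does not validate the database itself.

```python
import glob
import os


def blast_db_files(db_prefix):
    """List files whose names start with the database prefix plus a dot."""
    return sorted(glob.glob(glob.escape(db_prefix) + ".*"))


def check_blastdb(env=None):
    """Return (prefix, files) for the database named by BLASTDB.

    Raises RuntimeError if BLASTDB is unset or no matching files exist.
    """
    env = os.environ if env is None else env
    prefix = env.get("BLASTDB")
    if not prefix:
        raise RuntimeError("BLASTDB is not set")
    files = blast_db_files(prefix)
    if not files:
        raise RuntimeError("no database files found for prefix " + prefix)
    return prefix, files
```

Calling check_blastdb() with no arguments reads the variable from the current environment, so it reflects whatever your shell exported.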

Installing the MP3vec package

It is recommended that you use a virtual environment for the installation of this package. This creates a separate environment making it easier to manage packages and their dependencies. Once you create and activate your virtual environment, you can install the MP3vec package with pip install git+https://github.com/sanketx/MP3vec.git

(mp3env) sanket@GPU:~$ pip install git+https://github.com/sanketx/MP3vec.git 
Collecting git+https://github.com/sanketx/MP3vec.git
  Cloning https://github.com/sanketx/MP3vec.git to /tmp/pip-req-build-b6wh1uwg
...(text omitted)
Successfully built mp3vec
Installing collected packages: mp3vec
Successfully installed mp3vec-1.0.0
(mp3env) sanket@GPU:~$

You can test the installation by opening a python shell and importing the module.

>>> import mp3vec
>>> mp3vec.__version__
'1.0.0'

The package has now been successfully installed along with the mp3vec and mp3pssm utility scripts.

Usage instructions

Command line script for generating MP3 Vectors

A command line utility, mp3vec, is provided for users who wish to generate MP3vecs without writing code. It is installed automatically with the mp3vec Python package. You specify a directory containing PSSM files, and the script vectorizes them and writes the results to a destination directory. Two output formats are currently supported: the binary numpy format with a .npy extension and the CSV format with a .csv extension. You must specify the format when running the script. By default the script uses the pre-trained model provided in the package; the optional model parameter lets you supply a custom model instead. You can view the flags with mp3vec -h.

(mp3env) sanket@GPU:~/mp3_project$ mp3vec -h
usage: mp3vec [-h] -i IN_DIRECTORY -o OUT_DIRECTORY [-m MODEL_FILE] -t
              {NPY,CSV}

MP3vec command line utility. This program can be used to generate Multi-
Purpose Protein Prediction vectors as described in the paper. It accepts PSSM
files as input and creates a protein feature vector that can be saved as
either a Numpy array or as a CSV file. To use this program, please provide the
path to a directory containing PSSM files. Ensure that these files have a
'.pssm' extension or they will be ignored. You also have to specify an output
directory where the generated vectors will be stored. Finally, you need to
specify the output file format, either numpy array or CSV file.

optional arguments:
  -h, --help            show this help message and exit
  -i IN_DIRECTORY, --in_directory IN_DIRECTORY
                        Path to Input Directory containing PSSM files
  -o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
                        Path to Output Directory to write MP3vec files
  -m MODEL_FILE, --model_file MODEL_FILE
                        Path to MP3 model file. Optional parameter for custom
                        models
  -t {NPY,CSV}, --otype {NPY,CSV}
                        Output file format. numpy (NPY) or comma separated
                        values (CSV)
(mp3env) sanket@GPU:~/mp3_project$
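
If you want to drive the utility from a larger pipeline, the flags listed in the help text above can be assembled programmatically. The following is a minimal sketch, assuming mp3vec is on PATH (that is, the virtual environment is active). The helper names build_mp3vec_command and run_mp3vec are hypothetical; only the flags -i, -o, -t and -m come from the help output.

```python
import subprocess


def build_mp3vec_command(in_directory, out_directory, otype, model_file=None):
    """Assemble the argument list for the mp3vec command line utility."""
    if otype not in ("NPY", "CSV"):
        raise ValueError("otype must be 'NPY' or 'CSV'")
    command = ["mp3vec", "-i", in_directory, "-o", out_directory, "-t", otype]
    if model_file is not None:
        # -m is optional; without it the packaged pre-trained model is used.
        command += ["-m", model_file]
    return command


def run_mp3vec(in_directory, out_directory, otype, model_file=None):
    """Run mp3vec and raise CalledProcessError if it exits with an error."""
    command = build_mp3vec_command(in_directory, out_directory, otype, model_file)
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when directory names contain spaces.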

Command line script for generating PSSM files using PSI-BLAST

The mp3pssm utility is provided to ease the process of PSSM generation. You provide a single input FASTA file of protein sequences, and mp3pssm automatically generates a PSSM for each sequence. You need to specify the input file, the destination directory for the generated PSSM files, and the BLAST database location. The location should already be stored in the BLASTDB environment variable; if it is not, see the section "Installation of NCBI BLAST+ and Uniref90" for instructions. You can optionally specify the number of threads to use in the BLAST search. You can view the flags with mp3pssm -h.

(mp3env) sanket@GPU:~/mp3_project$ mp3pssm -h
usage: mp3pssm [-h] -i IN_FILE -o OUT_DIRECTORY -d BLAST_DB [-n NUM_THREADS]

Wrapper utility for generating PSSM files using PSI-BLAST. This program
generates PSSM profiles for FASTA sequences. You need to provide the path to
an input FASTA file (multiple sequences are allowed) along with the
destination directory where the PSSM files will be stored. The files will be
named using the sequence ID in the FASTA file and will be saved with a '.pssm'
extension. You also need to specify the path to the Uniref90 BLAST database
and the number of threads you wish to use for the PSSM computation. By
default, only 1 thread will be used.

optional arguments:
  -h, --help            show this help message and exit
  -i IN_FILE, --in_file IN_FILE
                        Path to Input FASTA file containing protein sequences
  -o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
                        Path to Output Directory to write PSSM files
  -d BLAST_DB, --blast_db BLAST_DB
                        Path to Uniref90 BLAST database
  -n NUM_THREADS, --num_threads NUM_THREADS
                        NNumber of threads (CPUs) to use in the BLAST search
(mp3env) sanket@GPU:~/mp3_project$

Examples

Example 1: Using the command line scripts to generate MP3 vectors

The following example takes you through the steps of using the command line scripts to generate MP3 vectors for a FASTA file with two protein sequences. The steps are nearly identical for both Linux and Windows platforms, and any differences will be highlighted.

First, copy the contents of the text box below and save them as a FASTA file, say test.fa, in the mp3_project folder. For Windows users, this folder will be C:\Users\Sanket\mp3_project; for Linux users, it will be the ~/mp3_project directory.

>PROT1
MVKLTAELIEQAAQYTNAVRDRELDLRGYKIPVIENLGATLDQFDAIDFSDNEIRKLDGFPLLRRLKTLLVNNNRICRIG
EGLDQALPDLTELILTNNSLVELGDLDPLASLKSLTYLCILRNPVTNKKHYRLYVIYKVPQVRVLDFQKVKLKERQEAEK
MFKGKRGAQLAKDIAR
>PROT2
MDIRPNHTIYINNMNDKIKKEELKRSLYALFSQFGHVVDIVALKTMKMRGQAFVIFKELGSSTNALRQLQGFPFYGKPMR
IQYAKTDSDIISKMRG

You should now have a file test.fa in the mp3_project folder. Now make two new folders in the mp3_project folder, one called pssm_dir for storing the generated PSSM files, and another called vec_dir to store the generated MP3 vectors. Do this using the command mkdir pssm_dir vec_dir. Now you are ready to run the mp3pssm utility to generate the PSSM profiles for the protein sequences in test.fa.

Linux users should run the command mp3pssm -i test.fa -o pssm_dir/ -d $BLASTDB -n 8, while Windows users should run the command mp3pssm -i test.fa -o pssm_dir/ -d %BLASTDB% -n 8. The only difference is in the way the BLASTDB environment variable is provided. Make sure you have activated the virtual environment prior to running the script, otherwise you will get a command not found / unrecognized command error message. You can tell if the virtual environment is active by looking for the name of the environment enclosed in parentheses before the prompt, like (mp3env) sanket@GPU:~$ or (mp3env) C:\Users\Sanket>.

(mp3env) sanket@GPU:~/mp3_project$ mkdir pssm_dir vec_dir
(mp3env) sanket@GPU:~/mp3_project$ ls
mp3env  MP3vec  pssm_dir  test.fa  vec_dir
(mp3env) sanket@GPU:~/mp3_project$ mp3pssm -i test.fa -o pssm_dir/ -d $BLASTDB -n 8
Generated PSSM for protein PROT1
Generated PSSM for protein PROT2
Generated PSSMs for 2 proteins
(mp3env) sanket@GPU:~/mp3_project$

You can check the contents of pssm_dir to see if the PSSM files have been generated. You should find PROT1.pssm and PROT2.pssm in the folder. Keep in mind that the file name of the generated file will be the same as the text provided in the FASTA file after the initial > symbol. The .pssm extension is automatically added to this name. In rare cases, running the PSI-BLAST query will not result in any hits, and no output PSSM file will be created. Unfortunately, MP3 vectors cannot be generated without the PSSM file.
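
Because a sequence with no PSI-BLAST hits silently produces no file, it can be useful to compare the FASTA file against the output directory after a run. The sketch below does this with the standard library only. The helper names fasta_ids and missing_pssms are hypothetical, and the code assumes, as described above, that each file is named after the full header text following the > symbol; headers containing spaces or extra annotations may be handled differently by mp3pssm.

```python
import os


def fasta_ids(fasta_path):
    """Return the header text after '>' for every record in a FASTA file."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                ids.append(line[1:].strip())
    return ids


def missing_pssms(fasta_path, pssm_dir):
    """Return IDs from the FASTA file that have no '<ID>.pssm' in pssm_dir."""
    return [
        seq_id
        for seq_id in fasta_ids(fasta_path)
        if not os.path.isfile(os.path.join(pssm_dir, seq_id + ".pssm"))
    ]
```

For the example in this section, missing_pssms("test.fa", "pssm_dir") would return an empty list once both PROT1.pssm and PROT2.pssm exist.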

Once the PSSM files have been generated and stored in pssm_dir, you can run the mp3vec command to generate MP3 vectors from these files. From within the mp3_project directory, run the command mp3vec -i pssm_dir/ -o vec_dir/ -t NPY to save the vectors as numpy files, or run mp3vec -i pssm_dir/ -o vec_dir/ -t CSV to save the vectors as CSV files. The output file names will be the same as the input file names; only the extension will differ, either .npy or .csv depending on your choice of output file type. Please note that the mp3vec script only searches the input folder for files ending with .pssm; any other files will be ignored.

(mp3env) sanket@GPU:~/mp3_project$ mp3vec -i pssm_dir/ -o vec_dir/ -t NPY
Vectorized PROT1, file 1 / 2
Vectorized PROT2, file 2 / 2
(mp3env) sanket@GPU:~/mp3_project$
(mp3env) sanket@GPU:~/mp3_project$ mp3vec -i pssm_dir/ -o vec_dir/ -t CSV
Vectorized PROT1, file 1 / 2
Vectorized PROT2, file 2 / 2
(mp3env) sanket@GPU:~/mp3_project$ ls vec_dir/
PROT1.csv  PROT1.npy  PROT2.csv  PROT2.npy
(mp3env) sanket@GPU:~/mp3_project$

Check the contents of vec_dir; you should see the generated vector files. You can now use these vectors for your machine learning experiments. If you're using Python, you can load the numpy matrices directly. R and Matlab users can load the vectors from the CSV files. Once you are done, you can deactivate the virtual environment by typing deactivate in the Terminal or Command Prompt.
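
Loading the saved vectors in Python might look like the sketch below. The helper name load_mp3_vector is hypothetical. The .npy branch uses numpy's standard loader; the .csv branch assumes the file holds plain comma-separated numbers with no header row, which you should confirm against your own output files.

```python
import numpy as np


def load_mp3_vector(path):
    """Load an MP3 vector saved by mp3vec as either .npy or .csv."""
    if path.endswith(".npy"):
        return np.load(path)
    if path.endswith(".csv"):
        # Assumption: plain comma-separated numbers with no header row.
        return np.loadtxt(path, delimiter=",", ndmin=2)
    raise ValueError("unsupported extension: " + path)
```

For a protein of length L, the loaded array is expected to have shape (L, 640), so checking the shape of load_mp3_vector("vec_dir/PROT1.npy") is a quick way to confirm the file was read correctly.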

Example 2: A Python script for generating MP3 vectors

If you would like to generate MP3 vectors using a python script, you can import the mp3vec package into your code. The mp3vec module has a core MP3Model class. The pretrained model provided with this package is used by default but you can specify a custom model by specifying the model file in the class constructor. The vectorize() function can be called on a numpy array containing the PSSM and the one-hot encoded protein sequence. The output of this function is the MP3 vector for that protein.

A utility function, encode_file(), is provided to read a PSSM file and convert it into a numpy array that can be fed as input to the model. Note that this function automatically reads the protein sequence from the file and converts it to a one-hot encoded form. The function returns this sequence along with the protein matrix (the one-hot encoded sequence combined with the PSSM). The model's vectorize() function can then be used to convert this protein matrix into the MP3 vector.

For more documentation, please see the mp3vec.py file.

>>> from mp3vec import *
>>> model = MP3Model()
>>> seq, protein_matrix = model.encode_file("PROT9.pssm")
>>> seq
'LECHNQQSSQTPTTTGCSGGETNCYKKRWRDHRGYRTERGCGCPSVKNGIEINCCTTDRCNN'
>>> protein_matrix.shape
(1, 62, 42)
>>> vec = model.vectorize(protein_matrix)
>>> vec.shape
(62, 640)
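
To process a whole directory rather than a single file, the two calls from the session above can be wrapped in a loop. The sketch below is one possible way to do this. The function name vectorize_directory is hypothetical, and it relies only on the encode_file() and vectorize() methods shown above, so any object providing them can be passed as the model.

```python
from pathlib import Path


def vectorize_directory(model, pssm_dir):
    """Return {protein name: MP3 vector} for every .pssm file in pssm_dir.

    `model` is any object exposing encode_file() and vectorize() as in the
    session above, for example an MP3Model instance.
    """
    vectors = {}
    for pssm_path in sorted(Path(pssm_dir).glob("*.pssm")):
        # encode_file returns the sequence and the protein matrix.
        _seq, protein_matrix = model.encode_file(str(pssm_path))
        vectors[pssm_path.stem] = model.vectorize(protein_matrix)
    return vectors
```

With the package installed, a call such as vectorize_directory(MP3Model(), "pssm_dir") would return a dictionary keyed by file name without the .pssm extension, for example "PROT1" and "PROT2" for the files generated in Example 1.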