Disorderly.py
Compare protein sequences by their lengths and compositions
MIT License.
Requires Python 3+
To see the commands:
$ python3 disorderly.py -h
How to use it?
1. Prepare your query
Save your query sequences to a file in FASTA format.
2. Prepare your database
Your database is the set of sequences you want to compare against. It also starts out as a FASTA file, but it must be converted to a .disorderdb file before it can be searched. Generate the .disorderdb file from your database with the following command:
$ python3 disorderly.py -v -fb path/to/your_database.fasta
-v Verbose flag
-fb Database FASTA file
This will generate your_database.fasta.disorderdb in the same folder as your_database.fasta
3. Search
Each of your queries is compared only to sequences of the same length in the database. Once a same-length sequence is found, the Euclidean distance between the compositions of your query and the database sequence is computed. The output contains all the same-length sequences sorted by the Euclidean distance (low to high).
This search is distributed over all the available CPUs!
$ python3 disorderly.py -v -i path/to/query.fasta -db path/to/your_database.fasta.disorderdb
-i Your query sequences in FASTA
-db The converted .disorderdb database
This will generate a .csv file named after your query, with an additional suffix (e.g. for query.fasta, the result will be query_search-20180816190934-ABCD.csv). The -v verbose flag will tell you where your result is, which will be in the same directory as your query.
Alternatively, you can run everything all at once:
$ python3 disorderly.py -v -i query.fasta -fb your_database.fasta
The step-by-step instructions above are meant to help you understand what is really going on.
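The comparison described in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the code in disorderly.py: it assumes "composition" means the fraction of each of the 20 standard amino acids, it runs on a single CPU, and the names composition, euclidean and search are made up for this example.

```python
from collections import Counter, defaultdict
from math import sqrt

# Assumption: composition is measured over the 20 standard residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each standard amino acid in a non-empty sequence."""
    counts = Counter(seq.upper())
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(queries, database):
    """Compare each query only to database sequences of the same length.

    queries and database map sequence IDs to sequence strings.
    Returns (query_id, hit_id, distance) rows, sorted low to high per query.
    """
    # Index the database by sequence length so each lookup is cheap.
    by_length = defaultdict(list)
    for db_id, db_seq in database.items():
        by_length[len(db_seq)].append((db_id, composition(db_seq)))

    rows = []
    for q_id, q_seq in queries.items():
        q_comp = composition(q_seq)
        hits = [
            (q_id, db_id, euclidean(q_comp, db_comp))
            for db_id, db_comp in by_length.get(len(q_seq), [])
        ]
        rows.extend(sorted(hits, key=lambda row: row[2]))
    return rows


# Hypothetical toy data: d3 is skipped because its length differs from q1.
queries = {"q1": "ACDA"}
database = {"d1": "AACD", "d2": "ACDE", "d3": "ACDEF"}
for row in search(queries, database):
    print(row)
```

Here d1 has the same residues as q1 in a different order, so its distance is zero. Composition ignores residue order, which is why a distance of 0.000 in the results does not necessarily mean the two sequences are identical.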
Reading the result
Open the .csv file with a text editor or Excel
The format is (sequence IDs are the FASTA headers):
Queries | Hits | Distances |
---|---|---|
query-seq-1 | database-seq-9 | 0.000 |
query-seq-1 | database-seq-5 | 0.135 |
query-seq-1 | database-seq-14 | 0.246 |
query-seq-2 | database-seq-3 | 0.000 |
query-seq-2 | database-seq-75 | 0.321 |
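If you would rather filter the results in Python than in Excel, something like the sketch below may help. It assumes the .csv is comma-separated with a header row named Queries, Hits and Distances, as in the table above; check your own file first, since the exact header text is an assumption here. The best_hits function and the max_distance cutoff are made up for this example.

```python
import csv
import io
from collections import defaultdict


def best_hits(csv_text, max_distance=0.2):
    """Group hits by query, keeping only those within max_distance.

    Assumes columns named Queries, Hits and Distances (hypothetical header).
    """
    hits = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        distance = float(row["Distances"])
        if distance <= max_distance:
            hits[row["Queries"]].append((row["Hits"], distance))
    return dict(hits)


# Example text mirroring the table above.
example = """Queries,Hits,Distances
query-seq-1,database-seq-9,0.000
query-seq-1,database-seq-5,0.135
query-seq-1,database-seq-14,0.246
query-seq-2,database-seq-3,0.000
query-seq-2,database-seq-75,0.321
"""

for query, found in best_hits(example).items():
    print(query, found)
```

To use it on a real result, read the file into a string first, for example with open("path/to/result.csv").read().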
How to get it? (Install)
No wheel currently :( , so just:
1. Download the .zip
2. Unpack it wherever you want
3. Find disorderly.py under src/ and run as described above
For Stanford folks
For those who run on MEMEX (or any of our servers that use SLURM):
Feel free to use the bash_run.sh file to submit jobs so it can be run on multiple CPUs
$ sbatch bash_run.sh -v -i query.fasta -fb your_database.fasta
NOTE: bash_run.sh must be in the same folder as disorderly.py
ALSO: It is currently configured to use the DPB partition and 24 cores (1 node on MEMEX). Edit the file with any editor to change this, e.g.:
#SBATCH -p dge # To use the DGE partition
#SBATCH -c 12 # for 12 cores