Disorderly.py

Compare protein sequences by their lengths and compositions

Requires Python 3+

To see the available options:

$ python3 disorderly.py -h

How to use it?

1. Prepare your query

Save your query sequences in a FASTA file.

2. Prepare your database

Your database is the set of sequences you want to compare against. It also starts as a FASTA file, but it must be converted to a .disorderdb file before it can be searched. Generate the .disorderdb file with the following command:

$ python3 disorderly.py -v -fb path/to/your_database.fasta

-v Verbose flag

-fb Database FASTA file

This will generate your_database.fasta.disorderdb in the same folder as your_database.fasta
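
The on-disk layout of a .disorderdb file is not documented here. Purely as an illustration of why a preprocessing step can help, the sketch below reads FASTA text and groups sequences by length, so same-length candidates can be looked up directly. read_fasta and index_by_length are hypothetical names, not functions from disorderly.py.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def index_by_length(records):
    """Map sequence length -> list of (header, sequence)."""
    index = defaultdict(list)
    for header, seq in records:
        index[len(seq)].append((header, seq))
    return dict(index)


# Toy database; the IDs and sequences are made up for illustration.
example = """>database-seq-1
MKTA
>database-seq-2
MKT
AYI
>database-seq-3
GGSA
""".splitlines()

index = index_by_length(read_fasta(example))
print(sorted(index))  # lengths present in the toy database
```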

3. Search

Each of your queries is compared only to sequences of the same length in the database. Once a same-length sequence is found, the Euclidean distance between the compositions of your query and the database sequence is computed. The output contains all the same-length sequences sorted by the Euclidean distance (low to high).
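
A minimal sketch of the comparison described above, assuming "composition" means the fraction of each of the 20 standard amino acids in a sequence (the exact definition used by disorderly.py is not stated here). composition, distance, and search are hypothetical helper names, and math.dist needs Python 3.8 or newer.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues


def composition(seq):
    """Fraction of each amino acid in seq, in AMINO_ACIDS order.

    Assumes seq is non-empty.
    """
    counts = Counter(seq.upper())
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def distance(seq_a, seq_b):
    """Euclidean distance between two composition vectors."""
    return math.dist(composition(seq_a), composition(seq_b))


def search(query, database):
    """Return (hit_id, distance) for same-length entries, nearest first."""
    hits = [
        (hit_id, distance(query, seq))
        for hit_id, seq in database.items()
        if len(seq) == len(query)
    ]
    return sorted(hits, key=lambda hit: hit[1])


# Toy database; the IDs and sequences are made up for illustration.
database = {
    "database-seq-1": "AKGS",    # same residues as the query, different order
    "database-seq-2": "AAAA",
    "database-seq-3": "GSAKGS",  # different length, so it is skipped
}
hits = search("GSAK", database)
for hit_id, dist in hits:
    print(f"{hit_id}\t{dist:.3f}")
```

Because only length and composition are compared, a database sequence with the same residues in a different order has a distance of 0.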

This search is distributed over all the available CPUs!

$ python3 disorderly.py -v -i path/to/query.fasta -db path/to/your_database.fasta.disorderdb

-i Your query sequences in FASTA

-db The converted .disorderdb database

This generates a .csv file in the same directory as your query, named after the query with an added suffix (e.g. for query.fasta, the result will be query_search-20180816190934-ABCD.csv). With the -v verbose flag, the location of the result file is printed.

Alternatively, you can run everything all at once:

$ python3 disorderly.py -v -i query.fasta -fb your_database.fasta

The step-by-step instructions above are meant to help you understand what is really going on.

Reading the result

Open the .csv file with a text editor or Excel

The format is (sequence IDs are the FASTA headers):

Queries	Hits	Distances
query-seq-1	database-seq-9	0.000
query-seq-1	database-seq-5	0.135
query-seq-1	database-seq-14	0.246
query-seq-2	database-seq-3	0.000
query-seq-2	database-seq-75	0.321
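
For scripted post-processing, the result can also be read with Python's csv module. The sketch below keeps the closest hit for each query. It assumes a comma delimiter and the three column names shown above, so check the actual file before relying on it. The inline sample reuses the rows from the table.

```python
import csv
import io

# Stand-in for the real result file; assumes comma-delimited output.
sample = io.StringIO(
    "Queries,Hits,Distances\n"
    "query-seq-1,database-seq-9,0.000\n"
    "query-seq-1,database-seq-5,0.135\n"
    "query-seq-1,database-seq-14,0.246\n"
    "query-seq-2,database-seq-3,0.000\n"
    "query-seq-2,database-seq-75,0.321\n"
)

best = {}
for row in csv.DictReader(sample):
    dist = float(row["Distances"])
    # Keep the smallest distance seen for each query.
    if row["Queries"] not in best or dist < best[row["Queries"]][1]:
        best[row["Queries"]] = (row["Hits"], dist)

for query, (hit, dist) in best.items():
    print(query, hit, dist)
```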

On a SLURM cluster, the same command can be submitted as a job with bash_run.sh:

$ sbatch bash_run.sh -v -i query.fasta -fb your_database.fasta

NOTE: bash_run.sh must be in the same folder as disorderly.py

ALSO: bash_run.sh is currently configured to use the DPB partition and 24 cores (1 node on MEMEX). Edit the file with any editor to change this, e.g.:

#SBATCH -p dge    # To use the DGE partition
#SBATCH -c 12     # for 12 cores

MIT License.

How to get it? (Install)

No wheel currently :( , so just:

1. Download the .zip

2. Unpack it wherever you want

3. Find disorderly.py under src/ and run as described above

For Stanford folks

Those who run on MEMEX (or any of our servers that use SLURM):

Feel free to use the bash_run.sh file to submit jobs so they can run on multiple CPUs.