peptblast
A set of scripts providing convenient interface for finding peptide matches in proteins and/or peptide matches between non-orthologous proteins.
peptblast.py
Script performs blast and generates two files: PEP_DB_1.txt contains continuous matches and PEP_DB_2.txt contains matches with a gap.
Parameters
-h, --help show help message and exit.
--db DB Database in fasta format (required).
--pep PEP Peptides or proteins file in fasta format (required).
--cont CONT Minimal length of continuous fragment, default = 5.
--leftmin LEFTMIN Minimal length of left fragment, default = 3.
--rightmin RIGHTMIN Minimal length of right fragment, default = 3.
--summin SUMMIN Minimal total matched length for fragmented hit, default = 8.
--gapmax GAPMAX Minimal gap size, maximal gap size, default = 3.
--threads THREADS Blast threads, default = 4.
--eval EVAL Required e-value for blast, otherwise estimated as `approx.db_size/(20^cont)`.
--wordsize WORDSIZE Blast word size, default = 2.
-s Use blast-short tool instead of blast.
--printold Use old filtering/printing method.
-S Suppress continuous hots in _2.txt file.
--sort Sort output, metric = length for contunous hits or `sum(matched)-length(gap)` for fragmented hits.
--keeptmp Keep temporary files.
--usetmp USETMP Use existing temporary directory for creating output. Do not perform blast.
Example command line
./peptblast.py --pep Targets.fasta --db UniProtKB.fasta --keeptmp --threads 24 --sort --cont 6
Output
Each output hit stored as a 6-line text record with empty lines between them. Example:
6 length hit at [43:49]
WP_002955589.1 [4 : 220] MULTISPECIES: ABC transporter ATP-binding protein [Helicobacter]
vfitgasgsGKSTILssffghlgvksghlnvfgvsmqkaskarinhlrrnigivfqdyklikewniernvmlpmvingykkevcksqvekllvhiklshk
+ + g sg+GKSTIL f + sg + + g + + ++a lr +ig+v qd l + i n+ v g +ve +
lalvgpsgaGKSTILrllfrfydissgciridgqdisqvtqa---slrshigvvpqdtvlfnd-tiadnirygrvtag------ndeveaaaqaagihda
tr|A0A024R436|A0A024R436_HUMAN [183 : 398] ATP-binding cassette, sub-family B (MDR/TAP), member 6, isoform CRA_f OS=Homo sapiens GN=ABCB6 PE=3 SV=1
Lines description:
- Contains 3 tab separated fields.
1.1. Length of continuous hit ora/b/c
where a is left fragment length, c is right fragment length and b is gap length in fragmented hit.
1.2. Position of the hit in the blasts HSP.
1.3. Number of occurrences of db hit in the target (added by another script). - Also contains 3 tab separated fields.
2.1. ID in the pep file.
2.2. Position of the hit in the pep file record.
2.3. Description of the hit in the pep file (optional). - Hit sequence, hit is in UPPERCASE.
- Blasts midline, hit is in UPPERCASE.
- Hit sequence in db, hit is in UPPERCASE.
- As field 2 contains 3 tab separated fields.
6.1. ID in the db file
6.2. Position of the hit in the db file record.
6.3. Description of the hit in the pep file (optional).
merger_filer.py
Script performs merging _1.txt and _2.txt files produced by peptblast.py
and basic filtration.
Parameters
-h, --help show help message and exit.
--1 ONE continuous (_1.txt) hits file (required).
--2 TWO gaped (_2.txt) hits file (required).
--filter FILTER Only use less or equal cont occurrences (Can be applied only to `counter_filter.py` processed files).
--best Select only best match with query/hit combination based on metric (length for continuous hits a+c-b for gaped hits), default = False.
--out OUT Output file, default = merged.txt.
Example command line
./merger_filter.py --1 PEP_DB_1.txt --2 PEP_DB_2.txt --best --out merged.txt
gi_annotator.py
Adds GI annotations (fields 3) in lines 2 or 6 if not present. Modifies file in-place.
Parameters
infile Peptblast output file (required).
-h, --help show help message and exit.
--nobk Do not save backup file.
Example command line
./gi_annotator.py merged.txt
counter_filter.py
Script counts queries and hits occurrences and performs filtration. Modifies file in-place.
Parameters
infile Peptblast output file (required).
-h, --help show help message and exit.
--count {query,hit} What id to count: `query` or `hit`.
--nosort Don't sort and modify input file by occurencies.
--table TABLE Create output table of occurrences.
--nobk Do not save backup file.
--taxfilter TAXFILTER Allow only hits with these taxons separated by commas (if organism unidentified the record is leaved).
--organisms ORGANISMS Create output table for organism frequencies.
--le LE Print only hits occurred less or equal times
Example command line
./counter_filter.py merged.txt --count query --table queries.txt --taxfilter bacteria --organisms organinsms.txt --le 10
descriptionfilter.py
Filter entries by description of query and hit. Delete records with fraction of common words in description >'minfcommon'.
Parameters
infile Peptblast output file.
-h, --help show this help message and exit
--wordlen WORDLEN Minimal length of a word to consider
--minfcommon MINFCOMMON Minimal proportion of common words to all words in descriptions to filter out a record
--out OUT Output file.
Example command line
./descriptionfilter.py --wordlen 3 --minfcommon 0.5
freqsplitter.py
Splits input file according to query frequencies: head
least frequent entries into infile_head, tail
most frequent entries into infile_tail and rest into infile_mid.
Parameters
infile Peptblast output file.
head Size of 'head'
tail Size of 'tail'
Example command line
./freqsplitter.py merged.txt 50 50
Typical workflow
peptblast.py -> merger_filter.py -> gi_annotator.py -> descriptionfilter.py -> counter_filter.py -> freqsplitter.py