PredictProtein: A Perl repository from Rostlab

PP Help 01: Introduction

WHAT IS IT? HOW TO USE IT?

What is PredictProtein (PP)?

PP is an automatic service for protein database searches and the prediction of aspects of protein structure and function. Given an amino acid sequence or an alignment as input, PP returns:

1. a multiple sequence alignment (i.e. database search),
2. ProSite sequence motifs (more info),
3. low-complexity regions (SEG) (more info),
4. ProDom domain assignments (more info),
5. nuclear localisation signals (more info),
6. and predictions of
   1. secondary structure (more info),
   2. solvent accessibility (more info),
   3. globular regions (more info),
   4. transmembrane helices (more info),
   5. coiled-coil regions (more info),
   6. structural switch regions (more info),
   7. B-value (more info),
   8. disorder (more info),
   9. inter-residue contacts (more info),
   10. protein-protein and protein-DNA binding sites (more info),
   11. sub-cellular localization (more info),
   12. domain assignment,
   13. beta barrels,
   14. cysteine predictions and disulphide bridges.

The following features are available upon request:

1. Fold recognition by prediction-based threading (more info):
PDB is searched for possible remote homologues (sequence identity 0-25%) to your sequence,
2. Evaluation of prediction accuracy (more info):
For a given predicted and observed secondary structure (for one or several proteins), per-residue and per-segment scores are compiled.

For all services, you can submit your query over the Web.

How does PredictProtein work?

Generating an alignment. The following steps are performed.

1. The sequence database (compiled of SWISSPROT+TrEMBL+PDB) is scanned for similar sequences (by BLASTP).
2. A multiple sequence alignment is generated by iterated BLAST searches (PSI-BLAST).
3. ProSite motifs are retrieved from the ProSite database.
4. Low-complexity regions (e.g. composition bias) are marked by the program SEG.
5. Your protein is compared to a domain database (ProDom).

Prediction of protein structure in 1D. The multiple alignment is used as input for profile-based neural network predictions (PROF methods). The following levels of prediction accuracy have been evaluated in cross-validation experiments:

1. Secondary structure prediction (PHDsec or PROFsec):
expected three-state (helix, strand, rest) overall accuracy >72% (PHD) >76% (PROF) for water-soluble globular proteins. For an automatic, continuous comparison of prediction accuracy to other programs see EVA.
You may find details about accuracy in graphs, on tables, and in the literature: Rost 1997 (paper) and 1996 (paper); Rost & Sander 1993 (abstract) and 1994 (abstract).
2. Solvent accessibility prediction (PHDacc or PROFacc):
Expected correlation between observed and predicted relative accessibility > 0.5.
You may find details about accuracy in graphs, on tables, and in the literature: Rost 1997 (paper) and 1996 (paper), Rost & Sander 1994 (abstract).
3. Transmembrane helix prediction (PHDhtm):
Expected overall two-state accuracy (transmembrane, non-transmembrane) > 95%. For the refined prediction of transmembrane helices and topology, the expected likelihood of predicting all helices correctly is about 89%, and the expected accuracy of topology prediction is > 86%.
You may find details about accuracy on tables, and in the literature: Rost, Casadio & Fariselli 1996 (abstract), and Rost, Casadio, Fariselli & Sander 1995 (abstract).
4. Other predictions
reference to literature
Fold recognition by prediction-based threading. Predictions of secondary structure and accessibility are aligned against PDB to detect remote homologues (prediction-based threading). As for other threading methods, results should be taken with caution.

* The first hit of the prediction-based threading is correct in about 30% of cases on average.
* Hits with z-scores above 3.0 are more reliable (accuracy > 60%).
* For exceptional cases the resulting alignments suffice for building correct homology-based models.

You may find details about accuracy in the literature: Rost, Schneider & Sander, 1996 (paper), Rost 1995 (abstract) and 1994 (abstract).

Evaluation of prediction accuracy. If you opt for 'evaluate prediction accuracy', we evaluate the accuracy of the secondary structure prediction provided. The following per-residue and per-segment scores are returned: overall three-state accuracy, single state accuracy, correlation coefficients, information entropy, fractional segment overlap, and finally the accuracy of predicting secondary structure content and structural class (Rost et al., JMB, 1994, 235, 13-26, example for output).
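
The per-residue part of such an evaluation can be sketched in a few lines. This is a minimal illustration, not the scoring code used by the server: the function name `three_state_accuracy`, the one-letter state codes (H = helix, E = strand, L = rest) and the example strings are assumptions made for this sketch. Per-segment scores such as fractional segment overlap are not covered.

```python
def three_state_accuracy(observed: str, predicted: str) -> dict:
    """Per-residue scores for two equal-length strings over H, E, L.

    Hypothetical helper for illustration; state codes are an assumption.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty input")

    # Overall three-state accuracy: percentage of residues predicted correctly.
    correct = sum(o == p for o, p in zip(observed, predicted))
    scores = {"Q3": 100.0 * correct / len(observed)}

    # Single-state accuracy: of the residues observed in state s,
    # the percentage that were also predicted in state s.
    for s in "HEL":
        n_obs = observed.count(s)
        n_ok = sum(o == p == s for o, p in zip(observed, predicted))
        scores["Q" + s] = 100.0 * n_ok / n_obs if n_obs else None
    return scores


obs = "LLHHHHHHLLEEEELL"   # made-up observed states
pred = "LLHHHHLLLLEEEEEL"  # made-up predicted states
scores = three_state_accuracy(obs, pred)
print(scores)
```

A state that never occurs in the observed string gets `None` rather than a percentage, so that an undefined score is not mistaken for 0%.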

What is META-PP?

META-PP provides a single-page interface to various World Wide Web services for sequence analysis (list of servers available at the moment). 'Single-page interface' means that you fill in your sequence only once, and can select any number of a list of services. For each selected service, you will receive the results by email. Currently, the following features of sequence analysis are covered by META-PP:

1. signal peptides
2. cleavage sites
3. O-glycosylation sites
4. cleavage sites of picornaviral proteases
5. chloroplast transit peptides and cleavage sites
6. secondary structure prediction
7. membrane helix prediction
8. threading, or remote homology modelling (searching for proteins of known 3D structure that appear structurally similar to your protein)
9. database searches
10. homology modelling (prediction of protein 3D structure by homology to a sequence similar protein of known structure) NOTE: this will only work if there is a protein of known structure that has sufficient sequence similarity to your protein!

How to use PP and META-PP?
Use of the PredictProtein server is free for academics. Commercial users may want to apply for a license.
The use of META-PredictProtein is currently restricted to academic users.
Using the web:
1. Home page: http://www.predictprotein.org
2. Help page (this): http://www.predictprotein.org/doc/help_hello.html
3. Submit request to PP: http://www.predictprotein.org/submit.php
4. Submit request to META-PP: http://www.predictprotein.org/meta.php
5. Questions, feedback: http://www.predictprotein.org/feedback.php

What can we do for you?
* You have a protein sequence and want to find out anything we can say about structure and function?
In general, we can provide multiple sequence alignments and predictions of secondary structure, residue solvent accessibility and the location of transmembrane helices (examples for: request; and output).

* You have a helical transmembrane protein sequence and want a refined prediction of the helix locations and topology?
We provide multiple sequence alignments and refined predictions for the location of transmembrane helices and for the topology, i.e. the orientation of the N-term with respect to the membrane (examples for: request; and output).

* You have a protein sequence and search for remote homologues (i.e., homologues with <25% sequence identity)?
We find secondary structure and accessibility motifs similar between a known structure and your protein by prediction-based threading (examples for: request; and output).

* You have a multiple sequence alignment and want to obtain a prediction of 1D structure based on that alignment?
We use your alignment as input to the methods predicting secondary structure, solvent accessibility and transmembrane helices (examples for: request; and output).

* You have a list of sequences not in current databases and want it to be used for 1D predictions?
We align your sequences and use the resulting alignment as input to the structure and function predictions (examples for: request; and output).

* You have a prediction of secondary structure and accessibility and want to search for similar motifs in known structures?
We base the threading procedure on your prediction (examples for: request; and output).

* You have a prediction and an observation of secondary structure and you want to compile the prediction accuracy?
We compile per-residue and per-segment scores for the evaluation of prediction accuracy (examples for: request; and output).

QUOTE and COPYRIGHT

## Who are we?

### Current Team @ Rostlab:

- Burkhard Rost:
  - Founded the PredictProtein server
  - Contributed the PHD and PROF methods
  - Continues to support the server by all means possible
- Tim Karl:
  - Maintains the PredictProtein software and databases
  - Keeps everything running
- Michael Bernhofer:
  - Contributed the TMSEG method
- Christian Dallago:
  - Developed and maintains the Bioembeddings server
- Michael Heinzinger:
  - Contributed the ProtBERTsec and goPredSim methods
- Maria Littmann:
  - Contributed the goPredSim method
- Tobias Olenyi:
  - Contributed the goPredSim method
- Lothar Richter:
  - Scientific advisor
- Konstantin Schütze:
  - Developed and maintains the Bioembeddings server
- Guy Yachdav:
  - Designed and implemented the PredictProtein pipeline and online service

### Schneider Lab @ LCSB

- Reinhard Schneider:
  - Hosts and supports the PredictProtein server at the LCSB
  - Original author of the HSSP method
- Piotr Gawron:
  - Hosts and supports the PredictProtein server at the LCSB
- Wei Gu:
  - Hosts and supports the PredictProtein server at the LCSB
- Yohan Jarosz:
  - Hosts and supports the PredictProtein server at the LCSB
- Venkata Satagopam:
  - Hosts and supports the PredictProtein server at the LCSB
- Noua Toukourou:
  - Hosts and supports the PredictProtein server at the LCSB
- Christophe Trefois:
  - Hosts and supports the PredictProtein server at the LCSB
- Maharshi Vyas:
  - Hosts and supports the PredictProtein server at the LCSB

### MMseqs2 Support

- Martin Steinegger:
  - Hosts and supports the MMseqs2 server
- Milot Mirdita:
  - Hosts and supports the MMseqs2 server

### Additional Contributors

- Yana Bromberg:
  - Contributed the original SNAP method
- Nir Ben-Tal:
  - Contributed the ConSurf method
- Sean O'Donoghue:
  - Developer of Aquaria
- Andrea Schafferhans:
  - Contributed the PSSH method for sequence to structure mapping
  - Developer of Aquaria
- Laszlo Kajan:
  - Contributed the Freecontact method
  - Software development and packaging
- Tatyana Goldberg:
  - Contributed the LocTree3 method
- Jiajun Qiu:
  - Contributed the ProNA2020 method
- Haim Ashkenazy:
  - Contributed the ConSurf method
- Henry Bigelow:
  - Contributed the PROFtmb method
- Tobias Hamp:
  - Contributed the Metastudent method
- Maximilian Hecht:
  - Contributed the SNAP2 method
- David Hoksza:
  - Developer of MolArt
- Peter Hönigschmid:
  - Contributed the SomeNA method
- Marco Punta:
  - Contributed the Meta-Disorder method
- Avner Schlessinger:
  - Contributed the PROFbval, Meta-Disorder, and NorsNet methods

### Original Contributors

- Chris Sander:
  - Guided the development of the first round of PredictProtein method(s) with one of his first graduate students
  - Worked with Reinhard Schneider on the HSSP method
  - Developed the Evolutionary Couplings method for 3D folding with Debora Marks that built on the key original idea "let's build on evolutionary information" in PredictProtein
- Antoine de Daruvar:
  - Helped get the first PredictProtein server online
- Roy Omond:
  - Helped with the communication between VMS and Unix systems for the first server
- Gerrit Vriend:
  - Helped get the PredictProtein server online

### Former Contributors

- Juan Miguel Cejuela:
  - Added the literature search feature
- Rachel First:
  - Designed the artwork for the localization prediction
  - Designed the site tutorial
- Paolo Frasconi:
  - Contributed the DISULFIND method
- Edda Kloppmann:
  - Scientific advisor
- Jinfeng Liu:
  - Contributed code for the PredictProtein pipeline
  - Contributed the NORS, CHOP & CHOPnet methods
- Sven Mika:
  - Contributed the UniqueProt method
- Rajesh Nair:
  - Contributed the LocTree method
- Yanay Ofran:
  - Contributed the PPSites and PROFdisis methods
- Dariusz Przybylski:
  - Contributed the AGAPE method
- Jonas Reeb:
  - Scientific advisor
- Manfred Roos:
  - Maintained the PredictProtein Knowledgebase
- Thomas Splettstoesser:
  - Designed the PredictProtein logo
- Kazimierz Wrzeszczynski:
  - Contributed code and ideas

Please cite the latest publication on PredictProtein.
Links to literature related to PP.

- PredictProtein: Michael Bernhofer, Christian Dallago, Tim Karl, Venkata Satagopam, B. Rost, et al. (2021) PredictProtein - Predicting Protein Structure and Function for 29 Years. Nucleic Acids Research.
* Contact: info@predictprotein.org
* URL: http://www.predictprotein.org
- PredictProtein: B Rost, G Yachdav and J Liu (2004) The PredictProtein Server. Nucleic Acids Research 32(Web Server issue):W321-W326.
* Author: B Rost
- PROSITE: A Bairoch, P Bucher & K Hofmann (1997) Nucleic Acids Research, 25:217-221
* Author: A Bairoch, P Bucher & K Hofmann
* Contact: bairoch@cmu.unige.ch
* URL: http://www.expasy.ch/prosite
* Version: 99.07
* Description: PROSITE is a database of functional motifs. ScanProsite, finds all functional motifs in your sequence that are annotated in the ProSite db.
- SEG: J C Wootton & S Federhen (1996) Methods in Enzymology, 266:554-571
* Author: J C Wootton & S Federhen
* Contact: rost@columbia.edu
* Contact (authors): wootton@ncbi.nlm.nih.gov
* Version: 1994
* Description: SEG divides sequences into regions of low-, and high-complexity. Low-complexity regions typically correspond to 'simple sequences' or 'compositionally-biased' regions.
- ProDom: ELL Sonnhammer & D Kahn (1994) Protein Science, 3:482-492
* Author: ELL Sonnhammer; J Gouzy, F Corpet, F Servant, D Kahn
* Contact: dkahn@zyx.toulouse.inra.fr
* URL: http://protein.toulouse.inra.fr/prodom.html
* Version: 2000.1
* Description: ProDom is a database of putative protein domains. The database is searched with BLAST for domains corresponding to your protein.
- PHD: B Rost (1996) Methods in Enzymology, 266:525-539
* Author: B Rost
* Contact: rost@columbia.edu
* URL: http://www.predictprotein.org
* Version: 1.0.16
* Description: PHD is a suite of programs predicting 1D structure (secondary structure, solvent accessibility) from multiple sequence alignments.
- PHDsec: B Rost & C Sander (1993) J. of Molecular Biology, 232:584-599
* Author: B Rost
* Contact: rost@columbia.edu
* URL: http://www.predictprotein.org
* Version: 1.0.16
* Description: PHDsec predicts secondary structure from multiple sequence alignments.
- PHDacc: B Rost & C Sander (1994) Proteins, 20:216-226
* Author: B Rost
* Contact: rost@columbia.edu
* Version: 1.0.16
* Description: PHDacc predicts per residue solvent accessibility from multiple sequence alignments.
- PHDhtm: B Rost, P Fariselli & R Casadio (1996) Protein Science, 5:1704-1718
* Author: B Rost
* Contact: rost@columbia.edu
* URL: http://www.predictprotein.org
* Version: 1.0.16
* Description: PHDhtm predicts the location and topology of transmembrane helices from multiple sequence alignments.
- PROF: B Rost (2004) Meth. Mol. Biol., submitted.
* Author: B Rost
* Contact: rost@columbia.edu
* Version: 1.0.16
* Description: PROF is a suite of programs predicting 1D structure (secondary structure, solvent accessibility) from multiple sequence alignments.
- PROFsec: B Rost (2004) Meth. Mol. Biol., submitted.
* Author: B Rost
* Contact: rost@columbia.edu
* URL: http://www.predictprotein.org
* Version: 1.0.16
* Description: PROFsec predicts secondary structure from multiple sequence alignments.
- PROFacc: B Rost (2004) Meth. Mol. Biol., submitted.
* Author: B Rost
* Contact: rost@columbia.edu
* URL: http://www.predictprotein.org
* Version: 1.0.16
* Description: PROFacc predicts per residue solvent accessibility from multiple sequence alignments.
- GLOBE: B Rost (1998) unpublished
* Author: B Rost
* Contact: rost@columbia.edu
* URL: http://www.predictprotein.org
* Version: 1.0.0
* Description: GLOBE predicts the globularity of a protein
- DISULFIND: A Ceroni, P Frasconi, A Passerini & A Vullo (2004) Bioinformatics, 20:653-659
* Author: A Ceroni, P Frasconi, A Passerini & A Vullo
* Contact: cystein@dsi.unifi.it
* URL: http://cassandra.dsi.unifi.it/cysteines/index.html
* Version: 1.0-rg2
* Description: DISULFIND is a disulphide bridge predictor based on a two-step process.
- ASP (a conformational switch prediction program): Young et al. (1999) Protein Science, 8:1752-1764
* Author: Young M, Kirshenbaum K, Dill KA and Highsmith S.
* Version: 1.0
* Description: ASP finds regions that are most likely to behave as switches in proteins known to exhibit this behavior
17. NORS
18. CHOP
19. ISIS
20. DISIS
21. NORSnet
22. PROFbval
23. MD
24. PROFcon
25. PROFtmb
26. SNAP
27. LOCtree

## COPYRIGHT

PredictProtein is released under the Academic Free License ("AFL") v. 3.0. A copy of the license can be found in the `LICENSE` document.

## Contact

- by email via: info@predictprotein.org
- or via the web: https://predictprotein.org/contact

## Feedback
Address questions, suggestions, bug reports, or comments

- by email via: help@predictprotein.org

DEFAULT INPUT AND OUTPUT

Please:

* make sure that you fill in your correct email address,
* paste a protein sequence or an alignment in simple alignment format (SAF) into the respective box.

thanks!

Default output
The output format is self-documenting (examples given for the major prion protein precursor prio_human (all, phd only) and the known 3D structure of HIV protease 1hhp). The output contains:

1. A list of likely homologues found in the protein database (SWISSPROT+TrEMBL+PDB), and the multiple sequence alignment of these sequences (by default in 'MSF' format)
2. If found: a list of the putative ProSite motifs.
3. If found: a list of ProDom domain assignments.
4. If found: a prediction of coiled-coil regions.
5. Information about the expected levels of accuracy of structure predictions. (We suggest that newcomers read this carefully.)
6. Prediction of aspects of protein structure. These are grouped in the following way:
1. Prediction of secondary structure for all residues, with an expected average three-state accuracy of > 72%;
2. Prediction of secondary structure for reliably scored residues only, with an expected three-state accuracy for these residues of > 82%;
3. Prediction of solvent accessibility for all residues, with an expected average correlation to the experimentally observed values of 0.54;
4. Prediction of solvent accessibility for reliably scored residues only, with an expected correlation between experimental observation and prediction of 0.69;
5. Prediction of transmembrane helices and their topology (if any detected), with an expected prediction accuracy of about 95% in two states.

Note: for the prediction of transmembrane helices a conservative threshold is chosen. Thus, your protein may not be reported to contain an HTM although it may have one. If you opt explicitly for the refined prediction of transmembrane helices and topology ("predict htm"), four predictions are given (example for output):

1. neural network prediction (expected accuracy for HTM's about 78%);
2. result of empirical filter (expected accuracy for HTM's about 97%);
3. refined prediction (expected accuracy for HTM's about 99%);
4. prediction of topology (expected accuracy about 86%).

Example for input and output (important for email submission)
Submitting a single sequence

* INPUT is: your protein sequence,
* OUTPUT is: alignment + prediction

OUTPUT (detailed example)
If your sequence has at least one non-trivial homologue in the database of protein sequences, you receive a multiple sequence alignment and the annotated prediction in the following form:
Block with multiple sequence alignment.
Block with explanations about the prediction method.
Block with prediction (example for secondary structure prediction follows).

.........1.........2.........3.........4.........5.........6
AA KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
PHD EEEEEE EEEEEE EEEEEE EEEE EEE
Rel 854777641334566643102441577762566642443213663122112234155
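
To work with such a block, the three rows can be read column by column. The sketch below picks out the residues whose reliability index reaches a threshold. It assumes that the `AA`, `PHD` and `Rel` rows have equal length, that `Rel` holds one digit (0-9) per residue, and that a blank in the prediction row means "rest"; the function name, the threshold of 5 and the short input strings are made up for illustration.

```python
def reliable_positions(aa: str, pred: str, rel: str, min_rel: int = 5):
    """Return (position, residue, predicted state, reliability) tuples
    for residues with reliability >= min_rel. Positions are 1-based.

    Hypothetical helper; min_rel = 5 is an assumed threshold.
    """
    if not (len(aa) == len(pred) == len(rel)):
        raise ValueError("AA, prediction and Rel rows must have equal length")
    return [
        (i + 1, a, p, int(r))
        for i, (a, p, r) in enumerate(zip(aa, pred, rel))
        if int(r) >= min_rel
    ]


aa = "KELVLALYDY"     # first ten residues of the example sequence
pred = " EEEEEE   "   # made-up, column-aligned prediction row
rel = "8547776413"    # made-up reliability digits
hits = reliable_positions(aa, pred, rel)
for pos, residue, state, reliability in hits:
    print(pos, residue, repr(state), reliability)
```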

ADVANCED INPUT OPTIONS (FORMATS)

The following input options have been implemented:

4. Use your SAF alignment ('# SAF')
You can provide your alignment input to the predictions. To do so, you have to generate a file in SAF (simple alignment format). Then you append this file to the same header as described before, with "# SAF" instead of "# sequence name".
Example for email submission format.
5. Use your MSF alignment ('# MSF')
You can provide your alignment input to the PHD predictions. To do so, you have to generate a file in MSF (multiple sequence format), e.g., generated by the program PILEUP (GCG). Then you append this file to the same header as described before, with "# MSF" instead of "# sequence name".
Example for email submission format.
8. Use your prediction of 1D structure for threading ('# COLUMN format')
You can provide your prediction of secondary structure and relative accessibility. (Both predictions are required. If you don't have, e.g., a prediction of accessibility, use PHDacc to generate it.) Your prediction will be used for threading. You use the header as described before, with "# COLUMN format" instead of "# sequence name". Then you append your prediction in the "COLUMN format" after this line.
Example for email submission format.

ADDITIONAL OPTIONS

4. Filtering PHD input alignment (Default)
If the divergence found in your family is not 'well' spread, prediction accuracy may drop. In particular, too many highly similar sequences may be problematic in the absence of further diverged family members. This problem came up only in the post-genome era, i.e. since the number of sequences started exploding. To correct for this problem we run a crude filter on the alignment by default. To switch the filter off, tick 'no filter' in the submission form.
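
The idea of thinning out near-identical sequences can be sketched as a greedy redundancy filter. This is not the filter that PredictProtein runs: the function names, the gap characters and the 80% identity cut-off are assumptions chosen for illustration only.

```python
def pairwise_identity(a: str, b: str, gap: str = "-.") -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in gap and y not in gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_redundant(aligned, max_identity: float = 0.8):
    """Greedy filter: keep a sequence only if it is at most max_identity
    identical to every sequence kept so far. The first entry (the query)
    is always kept. The 0.8 cut-off is an assumed value.
    """
    kept = []
    for name, seq in aligned:
        if all(pairwise_identity(seq, k) <= max_identity for _, k in kept):
            kept.append((name, seq))
    return kept


alignment = [
    ("query", "MKTAYIAKQR"),
    ("dup",   "MKTAYIAKQK"),  # 90% identical to the query, dropped
    ("far",   "MRSGF-AHEK"),  # diverged family member, kept
]
filtered = filter_redundant(alignment)
print([name for name, _ in filtered])
```

Because the filter is greedy, the result depends on the input order; listing the query first guarantees that it survives.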

Prediction-based threading ('prediction-based threading')
The results of the search for remote homologues are given in three blocks. (1) An MSF formatted (default) alignment between your sequence and possible remote homologues. (2) A summary of some statistics such as alignment scores for a cross-validation experiment and for your request. (3) An alignment presenting the entire motifs of 1D structure for the alignment.

3. Evaluation of secondary structure prediction accuracy (for program developers, only!)
The output consists of two parts: (1) per-residue and per-segment scores for each protein in your input file, and (2) per-residue and per-segment scores for all proteins in your input file (the latter includes scores for the accuracy in predicting secondary structure content and secondary structural class; example for output).

EXAMPLES for input formats (required for email submissions)

Definition of family

* Family = structural family, i.e. all proteins in family have similar structures
We search with your input sequence against the specified sequence database (SWISS-PROT by default). All proteins P whose sequence similarity to your query protein Q is high enough to predict that P and Q have a similar three-dimensional structure are returned in the MaxHom alignment. The iterated PSI-BLAST frequently finds more diverged proteins P2 that also have a structure similar to that of Q.
* Functional homology
In general, much higher levels of sequence similarity are required to infer particular aspects of function. Thus, the members of a protein family may or may not share particular functional motifs.

PREDICTION METHODS

Multiple sequence alignment (MaxHom ; more info)
The multiple sequence alignment is built up in essentially three steps (MaxHom, Sander & Schneider, Proteins, 1991, 9, 56-68).

1. The protein database (currently SWISS-PROT) is searched by a fast alignment program (currently BLASTP).
2. In sweep 1, sequences are aligned consecutively to the search sequence by a standard dynamic programming method. After each sequence has been added a profile is compiled, and used to align the next sequence.
3. In sweep 2, after all sequences with significant homology have been picked from the BLASTP output, the profile is recompiled, and the dynamic programming algorithm starts once again to align consecutively the sequences, this time using the conservation profile as derived after completion of sweep 1.
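
The "profile is compiled" step can be pictured as counting residues per alignment column. The sketch below computes plain per-column residue frequencies from already aligned sequences. It is only an illustration of the idea: MaxHom's actual profile and scoring scheme are described in the cited paper and are not reproduced here, and the function name and gap character are assumptions.

```python
from collections import Counter


def column_profile(aligned_seqs, gap: str = "-"):
    """Per-column residue frequencies for equal-length aligned sequences.

    Gaps are ignored; an all-gap column yields an empty dict.
    Hypothetical helper, not MaxHom's profile definition.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(c for c in column if c != gap)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


profile = column_profile(["MKV-", "MRVA", "MKIA"])
for i, freqs in enumerate(profile, start=1):
    print(i, freqs)
```

In an iterative scheme like the one described above, such a profile would be recomputed each time a sequence is added and then used in place of the single query sequence for the next alignment.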

Iterated profile-based search (PSI-BLAST ; more info)
PSI-BLAST is a fast, yet sensitive database search program.
We are running the iterated PSI-BLAST on a subset of the BIG database with SWISS-PROT + TrEMBL + PDB sequences. The number of iterations, the cut-off thresholds and the particular details of which sequences are used from BIG have been optimised in our group.

Functional sequence motifs (ProSite; example for output; more info)
The following description is from the original ProSite site:
ProSite is a method of determining what is the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites, patterns and profiles that help to reliably identify to which known family of protein (if any) a new sequence belongs.

Low-complexity regions (SEG; example for output; more info)
The following description is from the original SEG documentation (JC Wootton & S Federhen, 1996, Meth Enzymology, 266, 554-571):
SEG divides sequences into contrasting segments of low-complexity and high-complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions".
Locally-optimized low-complexity segments are produced at defined levels of stringency, based on formal definitions of local compositional complexity. The segment lengths and the number of segments per sequence are determined automatically by the algorithm.
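
To give a feel for what "local compositional complexity" means, the following sketch flags windows whose Shannon entropy falls below a threshold. It is a simplified stand-in and not the SEG algorithm itself, which uses its own complexity measure and a trigger-and-extend procedure. The function names, the window of 12 residues and the cut-off of 2.2 bits are assumptions for this illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the residue composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_mask(seq: str, window: int = 12, max_entropy: float = 2.2) -> str:
    """Lower-case every residue covered by at least one window whose
    entropy is <= max_entropy. Window and cut-off are assumed values.
    """
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= max_entropy:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))


# Made-up sequence with a poly-Q stretch in the middle.
seq = "MKTAYIAKQRQISFVKSHFSRQ" + "QQQQQQQQQQQQ" + "DVLGHRWETCAN"
print(low_complexity_mask(seq))
```

Residues flanking the repeat may also be lower-cased, because windows that overlap the repeat only partly can still have low entropy.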

Domain assignment (ProDom; example for output; more info)
The following description is from the original ProDom site (which supplies a rather useful graphical interface to the ProDom database):
The ProDom protein domain database consists of an automatic compilation of homologous domains detected in the SWISS-PROT database by the DOMAINER algorithm (ELL Sonnhammer & D Kahn, Prot. Sci., 1994, 3, 482-492). It has been devised to assist with the analysis of the domain arrangement of proteins.
ProDom 'domains' are inferred on the basis of conserved subsequences as found in various proteins. Such a conservation corresponds frequently, though not always, to genuine structural domains: therefore domain boundaries should be treated with caution. For some domain families experts have been asked to correct domain boundaries on the basis of both sequence and structural information. This expertise will complement the automated process and improve the quality of ProDom domain families.

Prediction of nuclear localisation signal (PredictNLS; example for output; more info)
PredictNLS finds experimentally known nuclear localisation signals present in your protein. The program produces an output if and only if a known NLS was found.
Note that the original version of the program at http://cubic.bioc.columbia.edu/predictNLS also allows you to obtain statistics for putative NLS motifs.

Secondary structure (PHDsec; more info)
Secondary structure is predicted by a system of neural networks rating at an expected average accuracy > 72% for the three states helix, strand and loop (Rost & Sander, PNAS, 1993, 90, 7558-7562; Rost & Sander, JMB, 1993, 232, 584-599; and Rost & Sander, Proteins, 1994, 19, 55-72; evaluation of accuracy). Evaluated on the same data set, PHDsec is rated at ten percentage points higher three-state accuracy than methods using only single sequence information, and at more than six percentage points higher than, e.g., a method using alignment information based on statistics (Levin, Pascarella, Argos & Garnier, Prot. Engng., 6, 849-54, 1993).
PHDsec predictions have three main features:
1. improved accuracy through evolutionary information from multiple sequence alignments
2. improved beta-strand prediction through a balanced training procedure
3. more accurate prediction of secondary structure segments by using a multi-level system

Solvent accessibility (PHDacc; more info)
Solvent accessibility is predicted by a neural network method rating at a correlation coefficient (correlation between experimentally observed and predicted relative solvent accessibility) of 0.54 cross-validated on a set of 238 globular proteins (Rost & Sander, Proteins, 1994, 20, 216-226; evaluation of accuracy). The output of the neural network codes for 10 states of relative accessibility. Expressed in units of the difference between prediction by homology modelling (best method) and prediction at random (worst method), PHDacc is some 26 percentage points superior to a comparable neural network using three output states (buried, intermediate, exposed) and using no information from multiple alignments.
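
The correlation coefficient quoted above is the usual Pearson correlation between observed and predicted relative accessibility, taken over residues. A minimal sketch follows; the function name and the example values (relative accessibility in percent) are made up for illustration.

```python
import math


def pearson(observed, predicted) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_o * var_p)


observed_acc = [5, 40, 80, 10, 60]    # made-up observed values (%)
predicted_acc = [10, 35, 70, 20, 50]  # made-up predicted values (%)
r = pearson(observed_acc, predicted_acc)
print(round(r, 3))
```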

Globularity of proteins (GLOBE; more info)
An additional result from the prediction of solvent accessibility is that of protein globularity. That method is not yet published. For more information, you may have a look at the preliminary preprint.

Transmembrane helices (PHDhtm; example for output; more info)
Transmembrane helices in integral membrane proteins are predicted by a system of neural networks. The shortcoming of the network system is that it often predicts helices that are too long. These are cut by an empirical filter. The final prediction (Rost et al., Protein Science, 1995, 4, 521-533; evaluation of accuracy) has an expected per-residue accuracy of about 95%. The number of false positives, i.e., transmembrane helices predicted in globular proteins, is about 2% (Rost et al. 1996).
The neural network prediction of transmembrane helices (PHDhtm) is refined by a dynamic programming-like algorithm. This method resulted in correct predictions of all transmembrane helices for 89% of the 131 proteins used in a cross-validation test; more than 98% of the transmembrane helices were correctly predicted. The output of this method is used to predict topology, i.e., the orientation of the N-term with respect to the membrane. The expected accuracy of the topology prediction is > 86%. Prediction accuracy is higher than average for eukaryotic proteins and lower than average for prokaryotes. PHDtopology is more accurate than all other methods tested on identical data sets (Rost, Casadio & Fariselli, 1996a and 1996b; evaluation of accuracy).

Secondary structure (PROFsec; more info)
Secondary structure is predicted by a system of neural networks rating at an expected average accuracy > 78% for the three states helix, strand and loop (Rost, 2000, unpublished). Evaluated on the same data set, PROFsec is rated at 6-8 percentage points higher three-state accuracy than PHDsec.

Solvent accessibility (PROFacc; more info)
Solvent accessibility is predicted by a system of neural networks rating at an expected average accuracy > 78% for the two states exposed and buried (Rost, 2000, unpublished). Evaluated on the same data set, PROFacc is rated at about five percentage points higher two-state accuracy than PHDacc.

Coiled-coil regions (COILS; example for output; more info)
The following description is from the original COILS site:
COILS is a program that compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.

Cysteine bridges (CYSPRED; example for output; more info)
CYSPRED predicts whether the cysteine residues in your protein form disulfide bridges.
The following description is from the original CYSPRED publication:
A neural network-based predictor is trained to distinguish the bonding states of cysteine in proteins starting from the residue chain. Training is performed using 2452 cysteine-containing segments extracted from 641 non homologous proteins of well resolved 3D structure. After a cross-validation procedure efficiency of the prediction scores as high as 72% when the predictor is trained using protein single sequences. The addition of evolutionary information in the form of multiple sequence alignment and a jury of neural networks increase the prediction efficiency up to 81%. Assessment of the goodness of the prediction with a reliability index indicates that more than 60% of the predictions have an accuracy level greater than 90%. A comparison with a statistical method previously described and tested on the same data base shows that the neural network-based predictor is performing with the highest efficiency.

Structural switches (ASP; example for output; more info)
ASP identifies amino acid subsequences that are the most likely to switch between different types of secondary structure. The program was developed by MM Young, K Kirshenbaum, KA Dill and S Highsmith. ASP was designed to identify the location of conformational switches in proteins with known switches. It is NOT designed to predict whether a given sequence does or does not contain a switch. For best results, ASP should be used on sequences of length >150 amino acids with >10 sequence homologues in the SWISS-PROT data bank. ASP has been validated against a set of globular proteins and may not be generally applicable. Please see Young et al., Protein Science 8(9):1752-64. 1999. and Kirshenbaum et al., Protein Science 8(9):1806-1815. 1999. for details and for how best to interpret this output. We consider ASP to be experimental at this time, and would appreciate any feedback from our users.

Inter-residue contacts (PROFcon; examples for: request and output; more info)
PROFcon predicts contacts between residue pairs within a single chain. Our definition of contact is based on the distance between Cβ atoms (Cα for glycine): two residues whose Cβ atoms are closer than 8 Å are considered to be in contact; all other pairs are considered not in contact. The last column of the output is the predicted contact score (the contact probability is high if the score is close to 1).
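
The contact definition above can be sketched in a few lines of Python. This is only an illustration of the 8 Å criterion applied to known coordinates, not the PROFcon predictor itself; the function name `contact_map` and the toy coordinates are assumptions made for this example, and sequence-adjacent pairs are not treated specially here.

```python
import math

CONTACT_CUTOFF = 8.0  # Angstrom, the threshold quoted in the text


def contact_map(coords, cutoff=CONTACT_CUTOFF):
    """Return the set of residue index pairs (i, j), i < j, in contact.

    coords holds one (x, y, z) tuple per residue: the C-beta atom,
    or the C-alpha atom for glycine (hypothetical input format).
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            # Two residues are in contact if their atoms are closer than the cutoff.
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


# Toy coordinates (made up): three residues close together, one far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(sorted(contact_map(coords)))
```

In this toy chain the fourth residue is more than 8 Å from all others, so it appears in no contact pair.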

CHOP
PROFTMB
ISIS
DISIS
MD
PROFBVAL
SNAP
LOCTREE
NORS
NORSNET

HINTS FOR USERS
Note
The following notes result from the experience I have gathered by offering and running the PredictProtein service, and during various structure prediction workshops. The comments are tailored in particular to the PROF methods; however, most comments also hold for other secondary structure prediction methods.

What can you expect from secondary structure prediction?
How accurate are the predictions? The expected levels of accuracy (PROFsec = 72±11%, three-state per-residue accuracy; PROFacc = 75±7%, two-state per-residue accuracy; PHDhtm = 94±6%, two-state per-residue accuracy) are valid for typical globular, water-soluble proteins (PROFsec, PROFacc) or helical transmembrane proteins (PHDhtm) when the multiple alignment contains many and diverse sequences. High values for the reliability indices indicate more accurate predictions. (Note: for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values.) PROFsec predictions tend to be relatively accurate for porins; however, for helical membrane proteins other programs ought to be used.
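
As a rough illustration of what "per-residue accuracy" measures, the following sketch compares a predicted state string with an observed one and reports the percentage of matching positions. The one-letter codes (H = helix, E = strand, L = loop), the function name and the two example strings are assumptions for this example; they are not PredictProtein output.

```python
def per_residue_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up example: 16 residues, 3 of them predicted in the wrong state.
observed = "LLHHHHHHLLEEEELL"
predicted = "LLHHHHLLLLEEEEEL"
print(f"{per_residue_accuracy(predicted, observed):.2f}%")
```

Here 13 of 16 residues agree, i.e. 81.25%. The same function works for two-state strings (e.g. exposed/buried), since it only counts matching characters.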

Confusion between strand and helix? PROFPHD (as well as other methods) focuses on predicting hydrogen bonds. Consequently, occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa (expected accuracy of PROFsec).

Strong signal from secondary structure caps? The ends of helices and strands contain a strong signal. However, on average PROFPHD predicts the core of helices and strands more accurately than the caps (B. Rost and C. Sander, 1D secondary structure prediction through evolutionary profiles, in: H. Bohr and S. Brunak (eds.), Protein Structure by Distance Analysis, Amsterdam: IOS Press, 257-276 (1994)). This seems to also hold for other methods (Garnier, priv. comm.).

Are internal helices predicted poorly? Steven Benner has indicated that internal buried helices are particularly difficult to predict. On average, this is not the case for PROFPHD predictions (expected accuracy of PROFsec for buried helices).

Accessibility useful to provide upper limits for contacts? The predicted solvent accessibility (PROFacc) can be translated into a prediction of the number of water molecules around a given residue. Consequently, PROFacc can be used to derive upper and lower limits for the number of inter-residue contacts of a certain residue (such an estimate could improve predictions of inter-residue contacts).

How to predict porins? PHDhtm predicts only transmembrane helices, and PROFsec has been trained on globular, water-soluble proteins. How to predict 1D structure for porins then? As porins are partly accessible to solvent, prediction accuracy of PROFsec was relatively high (70%) for the known structures. Thus, PROFsec appears to be applicable.

How to use the prediction of transmembrane helices? One possible application of PHDhtm is to scan, e.g., entire chromosomes for possible transmembrane proteins. Classifying a protein as transmembrane is not sufficient to establish its function, but it may shed some light on the puzzle of genome analysis. When using PHDhtm for this purpose, the user should keep in mind that on average about 5% of the globular proteins are falsely predicted to have transmembrane helices.

What about protein design and synthesised peptides? The PROFPHD networks are trained on naturally evolved proteins. However, the predictions have proven to be useful in some cases to investigate the influence of single mutations (e.g. for Chameleon or for Janus; Rost, unpublished). For short polypeptides, the following should be taken into account: the network input consists of 17 adjacent residues; thus, shorter sequences may be dominated by the ends (which are treated as solvent).

In a nutshell: how to avoid pitfalls?
70% correct implies 30% incorrect. The most accurate methods for predicting secondary structure reach sustained levels of about 70% accuracy. When interpreting predictions for a particular protein it is often instructive to mark the 30% of the residues you suspect to be falsely predicted.

Spread of prediction accuracy. An expected accuracy of 70% does NOT imply that for your protein U exactly 70% of all residues are correctly predicted. Instead, values published for prediction accuracy are averaged over hundreds of unique proteins. An expected accuracy of 70±10% (one standard deviation) implies that, on average, for two thirds of all proteins between 60 and 80% of the residues will be predicted correctly (expected accuracy of PHDsec). Thus, prediction accuracy can be higher than 80% or lower than 60% for your protein. Few methods supply well-tested indices for the reliability of predictions. Such indices can help to reduce or increase your trust in a particular prediction.
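
The "two thirds" figure follows if per-protein accuracy is roughly normally distributed, which is an assumption made here for illustration (real distributions may be skewed). A minimal check with Python's standard library:

```python
from statistics import NormalDist

# Assumption: per-protein accuracy ~ Normal(mean 70%, standard deviation 10%).
per_protein = NormalDist(mu=70.0, sigma=10.0)

# Fraction of proteins expected to fall within one standard deviation.
within = per_protein.cdf(80.0) - per_protein.cdf(60.0)
# Fraction of proteins expected to be predicted at below 60% accuracy.
below_60 = per_protein.cdf(60.0)

print(f"between 60% and 80%: {within:.3f}")
print(f"below 60%:           {below_60:.3f}")
```

Under this assumption about 68% of proteins fall between 60% and 80%, and roughly one protein in six is predicted at below 60% accuracy.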

Special classes of proteins. Prediction methods are usually derived from knowledge contained in subsets of proteins from databases. Consequently, they should not be applied to classes of proteins which have not been included in the subsets. For example, methods for predicting helices in globular proteins are likely to fail when applied to predict transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions, unusually many cysteine bonds, or for domain interfaces.

Better alignments yield better predictions. Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment to expect an improvement; and how sensitive are prediction methods with respect to errors in the alignment? The more divergent sequences contained in the alignment, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the methods, e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.

Better + worse = even better? Today, several automatic services provide secondary structure predictions. Some users fall into the what-is-common-is-correct trap, i.e., they average over all prediction methods and consider identical regions as more reliable. Exceptionally, such a majority vote may be beneficial. Frequently, however, the result will be the worst-of-all prediction. Often, it is preferable to use reliability indices provided by some methods. Such indices answer the question: how reliably is the tryptophan at position 307 predicted in a surface loop? (Note: the correlation between such indices and prediction accuracy has been sufficiently tested for only a few methods.)

1D structure may or may not be sufficient to infer 3D structure. Say you obtain as prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume you find a protein of known structure with the same motif (H-E-E-H-E-E). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but there are various ways to realise the given motif by completely different structures. For example, the secondary structure motif 'H-E-E-H-E-E' is contained in at least 16 structurally unrelated proteins.
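
To compare segment motifs such as 'H-E-E-H-E-E', a per-residue prediction first has to be collapsed into its order of helix and strand segments. The sketch below shows one way to do that; the per-residue string, the state letters (H, E, L) and the `min_length` parameter are assumptions for illustration only.

```python
from itertools import groupby


def segment_motif(states, min_length=1, keep="HE"):
    """Collapse a per-residue state string into a segment motif.

    Runs of identical states become one segment; only helix (H) and
    strand (E) segments of at least min_length residues are reported.
    """
    segments = [(state, len(list(run))) for state, run in groupby(states)]
    return "-".join(s for s, n in segments if s in keep and n >= min_length)


# Made-up per-residue prediction.
prediction = "LLHHHHHHLLEEEELLEEEELLHHHHHLLEEEELEEEEL"
print(segment_motif(prediction))
```

Two proteins with identical motif strings share the order of segments only; as noted above, that alone does not establish a common fold.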

Nuts and bolts: what to keep in mind?
Information content in multiple sequence alignment
If the multiple sequence alignment contains only a few proteins very similar to the one you sent (pairwise sequence identity > 90%), the expected accuracy for 1D structure predictions (secondary structure, accessibility, transmembrane helices) drops significantly. Note: this implies a reduction of the expected accuracy for threading. The scores for expected accuracy (PROFsec, PROFacc, PHDhtm) are valid for typical alignments as found in the HSSP database. The information content of the alignment is difficult to measure. Two important parameters are:

* (1) Number of aligned sequences: the more sequences in the alignment, the better. The exact number of sequences needed for a 'good prediction' cannot be given, as it depends on the variation and on characteristics of the particular protein family. As a rule of thumb: one is clearly NOT sufficient, more than five sequences can be enough.
* (2) Variation of aligned sequences: the aligned sequences should have a considerable variation with respect to the guide sequence (your protein). Ideally, the alignment should contain sequences at levels of 80%, 60%, 50%, 40%, and about 30% pairwise sequence identity (with respect to the predicted protein). In general, more diverged sequences (30-40%) contribute more to the information content than do very similar ones (> 80%). Note: the levels of sequence identity are summarised in the alignment header of the output returned (example).

NOTE HOWEVER:

* (A) Alignment errors for distant homologues: More distantly related sequences contribute more to the alignment diversity, which is the basis for improved prediction accuracy. However, the more distant relatives are difficult to align (in fact, below levels of some 40% sequence identity some alignment errors are guaranteed). Furthermore, even the correct detection of more distant relatives becomes very difficult below levels of about 35% sequence identity.
* (B) Bias by identical sequences: Growing databases result in an explosion of highly redundant information. This has recently (1996-7) led to the situation where the previous rule 'the more sequences, the better' is no longer applicable. Instead, you should leave out some (or all) family members in the high homology (>70%) region, in particular when not many rather diverged sequences are present. Furthermore, the current version of PROFPHD does not handle redundant information, i.e., when you have two proteins A and B of, say, 40% sequence identity to your query, and A and B are highly similar (>90% sequence identity to one another), you should leave out one of the two from the alignment you use for the prediction!
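
The advice in (B) can be sketched as a simple greedy filter: walk through the homologues and drop any sequence that is more than 90% identical to one already kept. This is a minimal illustration, not part of PredictProtein; the function name, the input format (a dictionary of pairwise identities between homologues) and the example values are all assumptions.

```python
def drop_redundant(names, pairwise_identity, max_identity=90.0):
    """Keep a homologue only if it is not too similar to any kept homologue.

    pairwise_identity maps frozenset({a, b}) to the percentage sequence
    identity between homologues a and b (hypothetical input format).
    Pairs missing from the dictionary are treated as unrelated (0%).
    """
    kept = []
    for name in names:
        if all(
            pairwise_identity.get(frozenset((name, other)), 0.0) <= max_identity
            for other in kept
        ):
            kept.append(name)
    return kept


# Made-up example: A and B are near-identical to each other, C is diverged.
names = ["A", "B", "C"]
pairwise = {
    frozenset(("A", "B")): 95.0,
    frozenset(("A", "C")): 35.0,
    frozenset(("B", "C")): 36.0,
}
print(drop_redundant(names, pairwise))
```

The result depends on the order of `names`: the first member of a redundant pair is the one that survives, so you may want to list the sequences you prefer to keep first.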

Cut-off for including homologues in alignment
In the multiple sequence alignment returned to you, only homologues down to levels of 30% pairwise sequence identity over 80 or more residues are included. This cut-off is five percentage points above the threshold for structural homology (Sander & Schneider, 1990), in an attempt to stay clear of the twilight zone of sequence similarity and to provide high-quality multiple alignments in an automated fashion.
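
Expressed as code, the cut-off quoted above is a two-condition filter. The sketch below is a simplification for illustration: the `Hit` record, its field names and the example hits are assumptions, and only the case stated in the text (30% identity over 80 or more residues) is modelled.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    identity: float       # pairwise sequence identity to the query, in percent
    aligned_length: int   # number of aligned residues


def passes_cutoff(hit, min_identity=30.0, min_length=80):
    """True if the hit satisfies the identity and length cut-off from the text."""
    return hit.identity >= min_identity and hit.aligned_length >= min_length


# Made-up hits: one passes, one is too diverged, one is too short.
hits = [Hit("h1", 45.0, 120), Hit("h2", 28.0, 200), Hit("h3", 60.0, 40)]
included = [hit.name for hit in hits if passes_cutoff(hit)]
print(included)
```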

Quality of multiple sequence alignment
On average, more residues are falsely aligned for lower levels of pairwise sequence identity. Down to levels of about 30%, the automatic MaxHom alignments are usually quite accurate. However, for many families there are regions for which the 'correct' alignment is, in principle, not well defined. One way to spot such regions is the stability of the alignment with respect to including or excluding some of the aligned sequences. By providing different lists of sequences ("input option 'PIR list'") you can monitor the stability of the alignment. Often such regions may form surface loops. Predictions may be less accurate in such regions.

Minimal length of sequences
The PROFPHD programs treat the N- and C-terminal ends of proteins as solvent. The size of the input window for predicting 1D structure is up to 17 residues. Thus, the first and the last 17 residues of your sequence will 'see solvent'. Especially for short fragments that you cut out from large proteins, this may result in false predictions.

Insertions in multiple sequence alignment

* Insertions in guide sequence: Do NOT use insertions for the guide sequence when you supply your alignment to be used as input for the predictions ("input option 'MSF format'"). In the current implementation, PROFPHD will treat such insertions as if the corresponding positions were occupied by solvent. This may lead to particular prediction errors (example)!
* Split alignment into domains: If your alignment (of say 20 sequences) contains long (> 10 residues) regions for which only very few sequences do not have insertions (in positions R1-R2), split the alignment into fragments that are not full of insertions for all sequences. For the problematic region (R1-R2) it may be better to include only those sequences without insertions. The existence of such regions may indicate that the protein contains various domains (one for residues < R1, another for residues > R2). When you submit your alignment in fragments, mind the minimal length of sequences (see above).

'Untypical' proteins

* Globular, water-soluble proteins. The PROFPHD neural networks have been trained on proteins with typical features as contained in the database of known protein structures (PDB). Thus, accuracy may be lower if the methods are applied to other proteins. For instance, PROFsec (secondary structure) correctly predicts only about 50% of the residues in transmembrane helices of integral membrane proteins. However, the network system trained on transmembrane proteins (PHDhtm) predicts residues in transmembrane helices on average at a level of well above 90% accuracy. In general, the PROFPHD methods learn to extract characteristic features of currently known protein structures. Problematic cases are proteins with many cysteine bridges that stabilise the particular protein structure, or proteins for which the structure is stabilised by functional constraints (co-factors).
* Transmembrane proteins.
  * PHDhtm for globular proteins. The rate of false positives, i.e., of globular, water-soluble proteins for which PHDhtm predicts transmembrane helices, is in the order of 5%. Such false positive predictions occur more often for structures with very hydrophobic beta-strands. Consequently, a prediction of transmembrane helices for a globular protein may indicate the existence of very hydrophobic beta-strands.
  * PROFPHD for porin-like beta structures. For the beta-strand transmembrane protein, porin, the accuracy of PROFsec was below the expected average (60%), but it was higher than the average for helical transmembrane proteins (50%). The explanation may be that the barrels formed by porins share features of globular, water-soluble proteins and thus can be predicted relatively well.
  * MaxHom alignments for transmembrane proteins. The alignment procedure MaxHom is optimised on globular water-soluble proteins. For transmembrane proteins, the alignments of the more hydrophobic transmembrane segments may require changes in the alignment details. Furthermore, in particular in transmembrane regions, often more distantly related sequences could be aligned by hand based on, e.g., hydrophobicity analyses. Unfortunately, we do not yet provide such refinements of the alignment automatically.
* Multi-domain proteins. The accuracy for predicting solvent accessibility (PROFacc) for single-domain proteins is higher than for multi-domain proteins. Predictions are more likely to be wrong at interfaces between domains. This shortcoming may be used to predict inter-domain interfaces in regions where PROFacc predicts buried residues that would otherwise not be compatible with your guess about the fold of the protein.
* Novel folds are NOT 'untypical proteins'. The expected prediction accuracy for the PROFPHD programs has been re-evaluated several times over recent years. So far, the results have always confirmed our estimates (Rost & Sander, Proteins, 1995, 23, 295-300).

Prediction of transmembrane helices (HTM's) and topology

* False positives: globular proteins predicted with HTM's. By default we search for possible transmembrane helices in your sequence. The rate of false positive detection (i.e. proteins falsely predicted to contain transmembrane helices) is about 1.6%. Thus, a reported transmembrane segment may just indicate a rather hydrophobic patch in a globular protein. If you explicitly request a prediction of transmembrane helices (HTM's), we assume that you know the protein to contain HTM's and consequently apply a lower threshold to eliminate false positives.
* Refined prediction of transmembrane helices and topology. By default we use the neural network system PHDhtm and an empirical filter to predict the locations of transmembrane helices. A refined (more accurate) version of that program, as well as the prediction of transmembrane topology (orientation of the N-terminal non-transmembrane region with respect to the cell), is available upon request ("predict htm topology"). All predicted HTM's are sorted according to the reliability of the prediction. This reliability index may help experts to spot falsely predicted HTM's. Note: try NOT to provide sequences that start or end with HTM regions as this may result in wrong topology predictions!

Reliability indices for PROFPHD predictions
The reliability indices of the PROFPHD methods correlate well with prediction accuracy. In other words, residues predicted with high reliability (0 = low, 9 = high) are more likely to be predicted correctly. However, when basing the prediction on single sequences (rather than multiple alignments) the scale has to be shifted. For instance, values of RI > 4 usually imply an expected accuracy of > 80% for PROFsec. When using a single sequence as input the same level of accuracy is reached only for residues predicted at RI > 7.
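
The shifted scale can be applied mechanically when selecting the residues to trust. The sketch below uses the two thresholds quoted above (RI > 4 with an alignment, RI > 7 with a single sequence); the function name, the input format (one integer per residue) and the example values are assumptions for illustration.

```python
def reliable_positions(reliability, single_sequence=False):
    """Indices of residues predicted above the reliability threshold.

    reliability: one index per residue (0 = low, 9 = high).
    The threshold is RI > 4 for alignment-based predictions and
    RI > 7 when the prediction was based on a single sequence.
    """
    threshold = 7 if single_sequence else 4
    return [i for i, ri in enumerate(reliability) if ri > threshold]


# Made-up reliability indices for seven residues.
ri = [9, 8, 5, 3, 7, 4, 6]
print(reliable_positions(ri))                        # alignment-based input
print(reliable_positions(ri, single_sequence=True))  # single-sequence input
```

With the same indices, far fewer residues qualify when the prediction was made from a single sequence.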

Combination of results with that of other methods
A combination of two prediction methods is likely to improve the accuracy only if the following points are met:

* (1) the predictions are based on methods using independent information, e.g., prediction-based threading and potential-based threading,
* (2) the accuracy of the two methods is comparable, e.g., NOT for combining Chou-Fasman (about 50% accuracy) and PROFsec (> 72% accuracy).

Say you want to focus on the most likely secondary structure segments. You may hope that the best predicted segments are those for which method X and PROFsec agree. This may or may not be correct. It may be more reasonable to identify such regions based on the reliability index provided by PROFPHD. The PROFPHD methods have been tailored to provide a reasonable estimate for the reliability of the prediction, whereas a combination of two arbitrary prediction methods yields improvements, at best, only at random.

Homologue of known structure
Ab-initio prediction (by e.g. PROFPHD) is, in general, less accurate than is homology modelling. Thus, if we find a protein of known structure that has > 25% pairwise sequence identity to your sequence, you ought to make use of the known structure by homology modelling.

Prediction-based threading

* Most alignments returned are wrong! We decided to return quite a number of alignment hits from the threading search. Most of those will be wrong. On the one hand, this is caused by the rather low accuracy of threading methods. On the other hand, successful threading requires experience in analysing protein sequences on the side of the user. Although most hits reported by the threading program (TOPITS) will be wrong, you may arrive at some correct conclusions from the alignments. Just bear in mind: threading is likely to be wrong more often than correct.
* Combining prediction-based and potential-based threading. The first problem of all threading programs is a high proportion of false positives, i.e., proteins falsely predicted to have a fold similar to the search sequence. One successful strategy to reduce the number of false positives is a combination of the results from prediction-based threading (such as AGAPE) with those from potential-based threading.
