VIPUR

Code for running VIPUR can be found here. The VIPUR training set files (feature set and original structures) and third party executables (built for 64-bit linux) can be found at
    https://drive.google.com/a/nyu.edu/folderview?id=0B6exAih8BuuUVXBHRUlFRVZKd1U&usp=drive_web

What is VIPUR?

The VIPUR pipeline analyzes protein variants by considering conservation scores and structural scores to identify variants that are likely to disrupt protein function.
The structural features, computed using Rosetta, allow you to interpret what is causing the variant to be identified as deleterious.

Conservation scores are derived from a PSSM of similar sequences found using PSIBLAST against the NCBI nr database.

Structural analysis is done using Rosetta to:
    - consider variant structures by rapid structure optimization, allowing fast evaluation of approximate variant ddG values (Rosetta ddg_monomer)
    - refine variant structures and consider the distribution of energies and structural scores (rms, gdtmm) across several low energy conformations (physically near the input conformation, Rosetta relax)

Additional features are provided by an internal "aminochange" classification (crude similarity of amino acid properties) and the variant position surface area, evaluated using PROBE.

These analyses are combined with a learned Logistic Regression model to classify input variants as "neutral" or "deleterious" and provide a structurally-informed hypothesis as to why variants are likely to disrupt the protein.

##################
(rewrite all this, copied from VIPUR.py for now)

VIPUR INSTALLATION

    - download the VIPUR module
    - verify that you have the necessary software
      running VIPUR requires:
          PSIBLAST
          PROBE
          Rosetta
          PyMOL (or PyRosetta)
    - add paths to settings.py

DOWNLOAD PSIBLAST

PSIBLAST is a sequence search
tool based on iterative alignment profile scans.
PSIBLAST is part of the free NCBI BLAST+ distribution that can be found at:
    ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/
There are many tutorials on downloading, installing, and running PSIBLAST; we suggest the NCBI usage book:
    http://www.ncbi.nlm.nih.gov/books/NBK52640/
VIPUR has been benchmarked using BLAST 2.2.25+ with the NCBI nr database (downloaded Feb 2012).

    Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10(421) (2009).

DOWNLOAD PROBE

PROBE is a tool for analyzing contacts within protein structures (PDB format).
Here, we use PROBE as an accurate method for calculating the ACCessible surface area of Protein amino acids (ACCP), the fraction of potential surface area in contact with other residues.
PROBE is freely available for download:
    http://kinemage.biochem.duke.edu/software/probe.php
VIPUR has been benchmarked using probe.2.12.071128.

    Word, et al. Visualizing and Quantifying Molecular Goodness-of-Fit: Small-probe Contact Dots with Explicit Hydrogens. J. Mol. Biol. 285, 1711-1733 (1999).

DOWNLOAD ROSETTA

Rosetta is a software suite capable of modeling and designing proteins and other biomacromolecules.
Here, we use the well established Rosetta protocols "relax" and "ddg_monomer" for refining variant protein structures and evaluating protein energetics.
Rosetta is free for academic use, with the license and download tutorial at:
    https://www.rosettacommons.org/software/license-and-download
Compiling Rosetta may introduce additional dependencies depending on your system.
Here, we will use the default "release" compilation settings (e.g. no "mode=MPI" etc.).
Please consult the Rosetta user documentation and forums if you encounter complications during setup:
    https://www.rosettacommons.org/docs/latest/Build-Documentation.html
Note: both relax and ddg_monomer perform stochastic searches and may output slightly different energy values when run with different random seeds.
VIPUR has been benchmarked using the Rosetta 3.4 release version (54167).

    Leaver-Fay, A. et al. ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules. Methods in Enzymology 487, 548-574 (2011).

DOWNLOAD PYMOL

PyMOL is a tool for molecular visualization and analysis.
PyMOL has several versions with advanced features; however, we only require simple PyMOL functionality.
Please consult the PyMOL website to determine which license works best for you:
    http://pymol.org/educational/
PyMOL can be downloaded at:
    http://pymol.org/dsc/
Note: PyMOL is used to create variant structures. PyRosetta can be used for this task instead, so you only need a copy of one of these programs to run VIPUR.

    The PyMOL Molecular Graphics System, Version 1.5.0.4 Schrodinger, LLC.

DOWNLOAD PYROSETTA

PyRosetta is an interactive Python interface to Rosetta.
PyRosetta is free for academic use but on a separate license from Rosetta:
    http://c4c.uwc4c.com/express_license_technologies/pyrosetta
The PyRosetta software and a download tutorial can be found at:
    http://www.pyrosetta.org/dow

    Chaudhury, S. et al. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26(5), 689-691 (2010).
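
Before adding paths to settings.py, it can help to confirm that the software listed above is actually installed. The following is a minimal sketch and not part of VIPUR. It assumes the executables are reachable on your PATH under the hypothetical names shown (Rosetta binaries usually carry a platform suffix, and your install may use different names), and that PyRosetta, if used, is importable under one of the guessed module names.

```python
import importlib.util
import shutil

# Hypothetical executable names; adjust them to match your installation.
REQUIRED = {
    "PSIBLAST": ["psiblast"],
    "PROBE": ["probe"],
    "Rosetta relax": ["relax.linuxgccrelease", "relax"],
    "Rosetta ddg_monomer": ["ddg_monomer.linuxgccrelease", "ddg_monomer"],
}


def find_missing(required, which=shutil.which):
    """Return the tools for which no candidate executable was found."""
    missing = []
    for tool, candidates in required.items():
        if not any(which(name) for name in candidates):
            missing.append(tool)
    return missing


def has_variant_builder(which=shutil.which):
    """True if PyMOL is on PATH or a PyRosetta module looks importable."""
    if which("pymol"):
        return True
    # Module names are assumptions; older PyRosetta releases used "rosetta".
    return any(importlib.util.find_spec(name) is not None
               for name in ("pyrosetta", "rosetta"))


if __name__ == "__main__":
    for tool in find_missing(REQUIRED):
        print("not found on PATH:", tool)
    if not has_variant_builder():
        print("neither PyMOL nor PyRosetta was found")
```

A tool reported as missing here may still work with VIPUR if its full path is given in settings.py, so treat the output as a hint rather than a verdict.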
###########

USING VIPUR

VIPUR is written as a simple Python module.
Once you have VIPUR and the required software set up, you can use VIPUR.py as a Python script or as a library (from VIPUR import run_VIPUR).
The other scripts that make up VIPUR outline the specific feature generation steps, including input file parsing and output file analysis.

Currently, VIPUR supports a command-line interface to the options necessary to run a single protein variant, specifying the input structure file (-p) <pdb_filename> and variants (-v) <variant_filename>
ex.
    python VIPUR.py -p example_input/2C35.pdb -v example_input/2C35.txt

VIPUR can also run on a directory of (PDB, .txt) file pairs by inputting a path to the directory containing the files (-p), which defaults to the current directory
ex.
    python VIPUR.py -p example_input/

Please consult the help (-h) for more details on run options, including:
    -d    filename of the output predictions file
    -o    path for output to be written (several intermediate files)
    -c    chain of the input PDB that contains the native sequence (if not A)
    -s    filename of the intermediate (native) sequence file (FASTA format)
    -w    option to write out the numbering map (PDB to 1-indexed)
    -q    option to run in "sequence only" mode, no structural analysis

An example of VIPUR input files is provided in "example_input" and their expected output is provided in "example_output_reference".
You can automatically run VIPUR on this demo with the --demo option
ex.
    python VIPUR.py --demo

VARIANT INPUT FORMAT

VIPUR currently takes variant input files as plain text files with one variant per line, expecting a reference amino acid for the native protein
ex.
    E14R
    R84P
    A101W
The native reference amino acid is necessary to ensure the input numbering matches the input protein structure (so you don't have to worry about constantly renumbering your indices or PDB files).
Currently, any input variants that do not have correct positions or native amino acids will be skipped.

INTERPRETING RESULTS

VIPUR outputs a ".predictions" file containing a summary of the predictions and analysis of input variants. The default output file is tab delimited ('\t') with one line per variant indicating:

    variant                      the variant (ex. P335R)
    predicted label              "deleterious" or "neutral"*
    prediction confidence        P(deleterious|analysis), deleterious score*
    structure-only label         prediction of the structure-only classifier
    structure-only confidence    structure-only confidence score
    sequence-only label          prediction of the sequence-only classifier
    sequence-only confidence     sequence-only confidence score
    exposure                     "surface" or "interior" for the position**
    "essential" score            can indicate conserved positions***
    ddG prediction               prediction of ddG in kcal/mol (approx)
    interpretation               simple interpretation of the effect****
    explanation                  top contributions to the interpretation****

*VIPUR was trained on a dataset of natural variants, pseudomutations, and protein variants from mutagenesis experiments. Variants are curated as "deleterious" if they have literature or UniProt annotations indicating loss of an essential protein function. Pseudomutations are from HumDiv and assumed to be "neutral" (though not all neutral examples come from pseudomutations). Please see the <main text> for a full description of the VIPUR training set.

Note: this binary label refers to PROTEIN function, which may or may not indicate disease association (a more complicated phenotype).
    - variants that are predicted "deleterious" are expected to lack an essential protein function (in many cases, the protein variant is misfolded or too unstable to maintain its native fold, as indicated by the structure-based features); however, they may instead display a severely reduced function or maintain other functions
    - variants that are predicted "neutral" are expected to function effectively as the native protein; however, they may have slightly reduced function (nearly-neutral variants are not predicted as a separate class because no well-curated data exists)

Note: our "deleterious" predictions are NOT synonymous with the impact on stability. Deleterious variants may minimally change or even stabilize a protein (ex. can make an enzyme too rigid), and neutral variants can reasonably destabilize a protein as long as it still folds. Please consult the ddG prediction provided by the ddg_monomer protocol to interpret changes in stability.

Note: our "deleterious" predictions are NOT synonymous with "disease association". While many disease associated/causal variants are loss-of-function changes, the disruption of protein function is itself insufficient to indicate the cause of a disease (this comes from knowledge of the protein's function or prior variant association). The goal of VIPUR is to provide a clear prediction of variants that disrupt protein function to help INTERPRET variants that already have some correlated label (e.g. disease or other phenotype). Not all deleterious predictions of VIPUR necessarily influence disease; however, confident deleterious predictions for variants known to be disease associated suggest a strong effect worth investigating.

Since VIPUR contains a binary classifier, the output confidence metric (the learned conditional probability) indicates confidence of the binary prediction.
This confidence score is effectively a "deleterious" score, with higher values indicating increased probability of being deleterious. Lower values similarly indicate an increased probability of being neutral (since the labels are binary). You can filter your results to contain the most confident deleterious predictions (ex. identify the 5 protein variants with the highest deleterious confidence etc.).

**"exposure" indicates the local environment of the variant amino acid position. Here, we restrict "exposure" to "surface" and "interior", indicating if the position is (approximately) on the protein surface or inside the protein. The VIPUR training set does not suggest there is a significant difference in performance for positions on the surface or interior of the protein, although the available data is imbalanced (many surface variants are annotated "neutral", many interior variants are annotated "deleterious").

***Disagreement between the overall VIPUR classifier (sequence+structure) and the structure-only classifier indicates a strong sequence signal without apparent destabilization of the monomer structure. In some cases, this is due to the inadequacy of the monomeric structure to capture the protein energetics, suggesting this amino acid is important for interactions (can be with other proteins, metals, ligands, nucleic acids etc.). When analyzing VIPUR output, consider surface variants with high (>= 0.8) deleterious scores and low structure-only scores (difference >= 0.2) as "potential interaction sites". Please see the <main text> for more information on VIPUR analysis and scores.

****VIPUR automatically identifies which features contribute to deleterious predictions. The structure-based features are directly interpretable since they represent specific physical interactions (hydrogen bonding, solvent favorability, disulfide bond strain, etc.) and expected distributions based on other proteins (backbone conformational strain, side chain interactions, etc.).
The top structure features are included as an "explanation" (3 by default, can be set to include more). We also include a reduced "interpretation" of the variant deleterious effect as:

    structural conservation
        The variant destabilizes the protein structure, likely due to improper size or surface area. This includes unfit deviations in backbone conformation (e.g. G and P positions) and packing configurations (e.g. V, L, I, M differences). In some cases, binding sites have more restricted conformations, detectable by destabilizing variations even in the absence of the binding partner (e.g. structurally conserved and the variant cannot attain the necessary conformation).

    disrupted chemical interactions
        This variant eliminates an important and/or stabilizing chemical interaction of the protein. This includes hydrogen bonding partners, conserved hydroxyl groups, salt bridges, and disulfide bonds.

    potential interaction site
        As noted above***, high deleterious scores with low confidence structure-only scores can indicate conservation where there is no structural evidence and can indicate potential interaction sites on the protein surface.

    other
        The remaining features are difficult to directly interpret. They indicate sequence conservation and structural conservation, but cannot clearly suggest why.

PREPARING STRUCTURES FOR ROSETTA

Rosetta requires structural models that are cleanly readable. For many applications, this includes removal of waters, ligands, nucleic acids, and metals. Rosetta has methods for handling all of these inputs; however, our benchmark uses protein structures stripped of all these additional coordinates (note: while Rosetta can handle these cases, many proteins lack models of docked conformations, ligands, etc. and we wanted to ensure the same information was available for all samples).
There are several methods available for cleaning structures for Rosetta, including simply removing all HETATM and nucleic acid ATOM coordinates (note: Rosetta can handle input structures with missing densities or regions, ex. a removed non-canonical amino acid). The script available at:
    https://github.com/Olvikon/miscellaneous_scripts/blob/master/process_pdb.py
outputs a directory containing the monomeric structure cleaned for use with Rosetta (either ".clean.pdb" or ".protein.pdb" if nucleic acid lines are removed). More suggestions on how to clean structures for Rosetta are available at:
    https://www.rosettacommons.org/manuals/archive/rosetta3.5_user_guide/dd/da1/preparing_structures.html
Note: additional refinement steps of initial models are unnecessary since Rosetta relax is run as part of feature generation, and the benchmark performance was obtained using structures that have not been pre-relaxed.
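
The simplest cleaning approach mentioned above (dropping HETATM records and nucleic acid ATOM records) can be sketched in a few lines of Python. This is an illustrative sketch only; it is not the process_pdb.py script linked above and does not reproduce its output naming. The function name, the residue name set, and the optional chain filter are assumptions made for this example, and the column positions follow the standard fixed-width PDB ATOM record layout.

```python
# Residue names treated as nucleic acid; an assumption for this sketch.
NUCLEIC_RESIDUES = {"A", "C", "G", "U", "T", "DA", "DC", "DG", "DT", "DU"}


def clean_pdb_lines(lines, chain=None):
    """Keep only protein ATOM records, optionally from a single chain.

    HETATM records (waters, ligands, metals) and every other record type
    are dropped, as are ATOM records belonging to nucleic acid residues.
    """
    kept = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        residue_name = line[17:20].strip()  # PDB columns 18-20
        if residue_name in NUCLEIC_RESIDUES:
            continue
        if chain is not None and line[21:22] != chain:  # PDB column 22
            continue
        kept.append(line)
    return kept


def clean_pdb_file(in_path, out_path, chain=None):
    """Read a PDB file and write the cleaned ATOM records to out_path."""
    with open(in_path) as handle:
        kept = clean_pdb_lines(handle.readlines(), chain=chain)
    with open(out_path, "w") as handle:
        handle.writelines(kept)
```

For example, clean_pdb_file("2C35.pdb", "2C35.clean.pdb", chain="A") would write only the chain A protein atoms; the output filename here is just a placeholder. Modified residues stored as HETATM (ex. selenomethionine) are also removed by this sketch, which leaves a gap in the chain.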