RSA-normalization-values: A PostScript repository from mtien

RSA normalization Project

----------------------------------------------------------------------------------------------------------------

Table of Contents:

I. Important Programs
A. Mine PDB information
1. parse_alignment.py
2. get_PDB.py
B. Theoretical Model Construction
1. Geometry.py
2. PeptideBuilder.py
C. Theoretical data generation
1. iterateThroughModels.py
2. DSSPData.py
D. Data Analysis
1. max_bins_with_population_restriction.py
2. max_bins_with_population_restriction_theoretical.py
3. SeperateOverAndUnderRSA1.py
4. EmpVCalc_get_Diff_with_pop_restriction.py
5. max_bin_all_data.py
6. getRoseRSA.py
7. make_ALLOWED_GeoFiles.py
8. get_ALLOWED_bins.py
9. get_CORE_bins.py
10. get_GENEROUS_bins.py
E. R scripts
1. getMaximumValues.r
2. getMeanSA.r
3. CorrelatonTableNewScales.r
4. get_population_cut_offs.r
5. FigureScripts
a. barGRSA.r
b. makeRSAdistribution.r
c. makePlotsWithPopRestriction.r
d. makeRamaPlot.r
e. makeEmpCalVpop.r
f. makeNormCorPlot.r
g. makeALLOWEDBinnedRamaPlot.r
h. makeCOREBinnedRamaPlot.r
i. makeGENEROUSBinnedRamaPlot.r
j. getAngles.r
F. Misc.
1. editHydroScales.py
2. runAll.py

II. Data Files
A. Xxx_geo
B. AnglesIteratedThroughAgainXXX
C. XXX_SA_Over/Under, Xxx_Rose_RSA
D. XXX_max_bins_all
E. EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX
F. NormalizationValuesByPercentDataCoverage
G. NormalizationValuesByPercentDataCoverageAndGenerous.txt
H. Hydrophobicity_Scales_Updated.txt
I. Wolfden.txt, rose.txt, Kite_Doolittle.txt, Fauchere.txt, Wimley.txt, Moon.txt, Radzicka.txt, MacCallum.txt
J. cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz
K. Allowed, Core, and Generous Bins

----------------------------------------------------------------------------------------------------------------

I. Important Programs
This section describes all Python, R, and other scripts used to obtain or analyze data.

A. Mine PDB Information
To mine the protein structures I parsed the "cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz"
file in "parse_alignment.py".

1. parse_alignment.py
This is a Python script that takes in a list of PDB and chain IDs (we used the
"cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz" file from the Dunbrack lab).
This program creates the "Xxx_geo" files (i.e. Ala_geo, Asn_geo, etc.).
This program also only outputs information from non-chain-terminating residues, which
are identified by the peptide bond length between two residues and by non-ambiguous neighbors.
This program is mainly a frame to process the output from the real PDB parser, "get_PDB".
Any file that is either corrupted or does not exist in the PDB database is reported to
"Error_report_bond_length.txt".
The two numbers that are printed to the terminal are for testing.
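
The chain-termination check described above can be sketched roughly as follows. This is
only an illustration, not the code in "parse_alignment.py": the function names, the
coordinate tuples, and the 1.4 angstrom cutoff are assumptions made for the example.

```python
import math

# Hypothetical cutoff in angstroms; the value used by parse_alignment.py is not stated here.
MAX_PEPTIDE_BOND = 1.4

def is_bonded(c_coord, n_coord, cutoff=MAX_PEPTIDE_BOND):
    """True if the C of one residue is within bonding distance of the N of the next."""
    return math.dist(c_coord, n_coord) <= cutoff

def is_internal(prev_c, this_n, this_c, next_n):
    """A residue counts as non-chain-terminating if it is bonded to both neighbors."""
    return is_bonded(prev_c, this_n) and is_bonded(this_c, next_n)
```

A residue whose previous or next neighbor is missing or too far away would be skipped
under this sketch.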

2. get_PDB.py
This is a Python program that takes in a PDB file name and chain ID, downloads
the PDB file to the "structures" folder, and then extracts all the information from the
PDB file. This program needs the PDB parser from Biopython, as well as DSSP and the proper
parser for it. The program outputs lists of information. This program contains many
functions in order to properly mine the data and needs
DSSPData.py to work.

B. Theoretical Model Construction
"Geometry" and "PeptideBuilder" are both used to build the theoretical models. Their
functionality is described in the entries below.

1. Geometry.py
This is a library of Amino Acid Geometry objects. By reading in the one-letter Amino
Acid abbreviations, it can create the correct geometric parameters to construct a
protein residue. There are 20 classes (one for each amino acid) and one function
to create the geometry object. Some of these classes have an inputRotamers function
that takes in a list of integers in order to change the Amino Acid's rotamers. The
generateRandomRotamers method is used in "iterateThroughModels".

2. PeptideBuilder.py
This takes in geometry objects to construct amino acid chains. There are 20 methods
to construct the amino acids, along with the calculate-coordinates method used by "makeStructure".
The program also contains a makeStructure method that takes in a string of amino acids,
a list of Phi angles (floats), and a list of Psi angles (floats). It contains two add-residue
methods, an initialize_residue method, and a makeExtended Structure method. It also has an
output-structure method. This program creates PDB files with a name of your choice.

C. Theoretical data generation
These scripts were used to iterate through the models' phi and psi conformations.
"iterateThroughModels" is a script that uses Geometry and Peptide Builder to build the
phi and psi conformation, and if possible, all rotamer conformations.

1. iterateThroughModels.py
The program needs "Geometry" and "PeptideBuilder" to create the phi and psi
conformations. It has a DSSP method and methods to iterate through the phi, psi,
and chi angles of a residue. This program creates "AnglesIteratedThroughAgainXXX".
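
The phi/psi scan can be pictured with the sketch below. It is not the code in
"iterateThroughModels.py": the function names, the -180 to 179 degree range, and the
compute_sa callback (a stand-in for building the model and running DSSP on it) are
assumptions made for illustration.

```python
import itertools

def angle_grid(step=1, start=-180, stop=180):
    """Return an iterator over every (phi, psi) pair on a regular grid, in degrees.

    The range, and the exclusion of the end point, are assumptions.
    """
    angles = range(start, stop, step)
    return itertools.product(angles, angles)

def scan(compute_sa, step=1):
    """Call compute_sa(phi, psi) at each grid point and collect (SA, phi, psi) rows."""
    return [(compute_sa(phi, psi), phi, psi) for phi, psi in angle_grid(step)]
```

With a 1-degree step this visits 360 x 360 = 129600 conformations per residue, before
any rotamers are considered.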

2. DSSPData.py
Parser object to read the output of the DSSP program.

D. Data Analysis
After obtaining the information in the "Xxx_geo" files and rotating through the theoretical
models, we wrote scripts to parse and analyze the data.

1. max_bins_with_population_restriction.py
OBSOLETE, replaced by I.D.5
This program takes in the "Xxx_geo" data file and bins the data into 5-degree by
5-degree Phi and Psi bins. For each bin it records the maximum SA found and the number
of data points in that bin. The input of the program requires an all-caps three-letter
abbreviation of which amino acid you want to look at. This outputs the
"XXX_max_emperical_bins" file.
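
The binning step can be sketched as below. This is a minimal illustration under stated
assumptions, not the script itself: the function names, the (phi, psi, sa) record layout,
and labeling each bin by its lower edge are all choices made for the example.

```python
import math

BIN_SIZE = 5  # degrees, as described above

def bin_start(angle, size=BIN_SIZE):
    """Lower edge of the bin that contains the angle (labeling bins this way is an assumption)."""
    return int(math.floor(angle / size) * size)

def max_sa_bins(records):
    """records: iterable of (phi, psi, sa).

    Returns {(phi_bin, psi_bin): [max_sa, population]}.
    """
    bins = {}
    for phi, psi, sa in records:
        key = (bin_start(phi), bin_start(psi))
        entry = bins.setdefault(key, [sa, 0])
        entry[0] = max(entry[0], sa)  # keep the largest SA seen in this bin
        entry[1] += 1                 # count the data points in this bin
    return bins
```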

2. max_bins_with_population_restriction_theoretical.py
OBSOLETE, replaced by I.D.5
This program takes in the "XXX_max_emperical_bins_pop_restriction" file and the
"AnglesIteratedThroughAgainXXX" file to make the "XXX_max_theoretical_bin_Again" file.
The command line argument is the all-caps three letter abbreviation of which amino acid
you want to look at. This file makes sure that the theoretical data is binned and
treated in parallel with the empirical data. This program was not used in the final
write up of the paper, but is a good script nonetheless.

3. SeperateOverAndUnderRSA1.py
This program takes in the "Xxx_geo" data file and creates two files, "XXX_SA_Over/Under".
This program is fairly basic, but it's a good basis to parse the "Xxx_geo" files. The
command line argument is the all-caps three letter abbreviation of which amino acid you
want to look at.
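
As a rough sketch of the split (the function name and the (phi, psi, rsa) record layout
are assumptions, and the real script reads the columns from the "Xxx_geo" file):

```python
def split_by_rsa(records, threshold=1.0):
    """records: iterable of (phi, psi, rsa).

    Returns (over, under): the (phi, psi) pairs with RSA > threshold and RSA <= threshold.
    """
    over, under = [], []
    for phi, psi, rsa in records:
        (over if rsa > threshold else under).append((phi, psi))
    return over, under
```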

4. EmpVCalc_get_Diff_with_pop_restriction.py
This program takes in the "XXX_geo" file and the "AnglesIteratedThroughAgain" file to
create the "EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX" file.
The command line argument is the all-caps three letter abbreviation of which amino acid
you want to look at. This program just compares the bins from both "XXX_max_bins" files.
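
The bin comparison might look something like the sketch below. The function name, the
dictionary layout, the sign convention (theoretical minus empirical), and the min_pop
parameter are all assumptions for illustration; the actual population restriction used
by the script is not restated here.

```python
def bin_differences(empirical, theoretical, min_pop=1):
    """empirical: {bin: (max_sa, population)}; theoretical: {bin: max_sa}.

    Returns a list of (sa_difference, population) for bins present in both
    inputs that hold at least min_pop observations.
    """
    rows = []
    for key, (obs_max, pop) in sorted(empirical.items()):
        if pop >= min_pop and key in theoretical:
            rows.append((theoretical[key] - obs_max, pop))
    return rows
```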

5. max_bin_all_data.py
This program takes in the "XXX_geo" files and the "AnglesIteratedThroughAgain" files to
create the "XXX_max_all_bins" file. The command line argument is the all-caps three letter
abbreviation of which amino acid you want to look at. This program compresses and bins
SA information from both files into one.
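
One way to picture the merge is the sketch below. It is not the script itself: the
function names and dictionary layouts are assumptions, as is filling bins that are
missing from one input with 0. The column order follows the description of the
"XXX_max_bins_all" data files in section II.

```python
def merge_bins(empirical, theoretical):
    """Combine binned maxima into rows of (phi, psi, max_obs_sa, max_theo_sa, obs_bin_pop).

    empirical: {(phi_bin, psi_bin): (max_sa, population)}
    theoretical: {(phi_bin, psi_bin): max_sa}
    Bins missing from one input are filled with 0 (an assumption).
    """
    rows = []
    for key in sorted(set(empirical) | set(theoretical)):
        obs_max, pop = empirical.get(key, (0.0, 0))
        rows.append((key[0], key[1], obs_max, theoretical.get(key, 0.0), pop))
    return rows

def format_rows(rows):
    """Render rows as tab-delimited text, one bin per line."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)
```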

6. getRoseRSA.py
This program's main use is to make the "Xxx_Rose_RSA" data files, which contain the
normalized RSA values using the normalization constants from the Rose paper. The
input for this program is the "Xxx_geo" files.
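
The normalization itself is a single division, sketched below. The function name is an
assumption, and the constant in the dictionary is a placeholder for illustration only;
it is NOT a value from the Rose paper.

```python
# Placeholder value for illustration only; NOT a constant from the Rose paper.
EXAMPLE_MAX_SA = {"ALA": 100.0}

def relative_sa(sa, residue, max_sa=EXAMPLE_MAX_SA):
    """RSA = observed SA divided by the residue's normalization constant."""
    return sa / max_sa[residue]
```

An RSA above 1 means the observed SA exceeds the normalization constant for that residue.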

7. make_ALLOWED_GeoFiles.py
This program parses the "XXX_geo" files using the information obtained from the data
generated by "max_bin_all_data.py". The program takes in the "XXX_max_bin_all",
"XXX_geo", and "NormalizationValuesByPercentDataCoverageAndGenerous.txt" files to make new
data files based on being in an ALLOWED region of the Ramachandran plot. This program
outputs the "XXX_ALLOWED_geo" files and the "AnglesIteratedThroughAgain_ALLOWED_XXX"
data files.

8. get_ALLOWED_bins.py
This program bins the data from the "make_ALLOWED_GeoFiles.py" script.

9. get_CORE_bins.py
This program bins the data from "XXX_geo" files and "AnglesIteratedThroughAgainXXX" files
based on the regions defined as CORE in the "NormalizationValuesByPercentDataCoverageAndGenerous.txt"
output.

10. get_GENEROUS_bins.py
This program bins the data from "XXX_geo" files and "AnglesIteratedThroughAgainXXX" files
based on the regions defined as GENEROUS in the
"NormalizationValuesByPercentDataCoverageAndGenerous.txt" output.

E. R scripts
The R-scripts were mainly used to make figures and to do simple things that would have been a
bit more complicated to do in python.

1. getMaximumValues.r
OBSOLETE
Obsolete but useful script to look at the data from another perspective
This gets the maximum SA values from both "XXX_max_bins" data files. It outputs the
"NormalizationValues.txt" in a csv format.

2. getMeanSA.r
This gets the mean RSA, median RSA, square root mean RSA, box-cox transformed mean RSA,
fraction of 100% buried residues, and fraction of 95% residues (for theoretical
normalization values only). For all the average estimates, the script uses both
empirical and theoretical normalization values from ALLOWED regions. The output file is a
csv file called "Hydrophobicity_Scales_updated.txt". It also has optional
"MeanHydrophobicityScales.txt" and "BuriedHydrophobicityScales.txt" output. The script
requires the "NormalizationValuesByPercentDataCoverageAndGenerous.txt" file or the
"NormalizationValuesByPercentDataCoverage.txt".

3. CorrelatonTable.r
This makes the correlation table of all scales in "Hydrophobicity_Scales_updated.txt"
with the Wolfenden, Kyte, Radzicka, MacCallum, Moon, Wimley, Fauchere, and Rose scales.
This requires all of these files to run. It performs the Pearson correlation test.
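
For reference, the Pearson correlation coefficient between two scales can be computed as
in the sketch below. The R script is the actual implementation; this Python version is
only an illustration, the function name is an assumption, and it returns just the
coefficient r, not a significance test.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```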

4. get_population_cut_offs.r
This script obtains the normalization values (MAX RSA) for both the empirical and theoretical data.
This script is unique in that it needs to be run twice, with some lines commented out on the
first run. The first run obtains the ALLOWED and CORE bin angle cut-offs. After the ALLOWED and
CORE bins are obtained, the program is run again with those lines uncommented. The GENEROUS bins
rely on the areas of the ALLOWED bins and are calculated from the "XXX_ALLOWED_geo" and
"AnglesIteratedThroughAgain_ALLOWED_XXX" files.

5. Figure Scripts
These scripts were used to make the figures in R. Most have alternate pdf/png code in
their scripts that is commented out.

a. barGRSA.r
Using the "Xxx_geo" files and the "Xxx_Rose_RSA" files, the script makes the
"BarGraphRSA.pdf" figure. There is optional png code at the bottom of this
file; it is currently commented out.

b. makeRSAdistribution.r
This script makes the RSA distribution of Alanine. However, with a bit of tweaking
it can make the distribution for any amino acid. This makes the "Alanine_RSA_distribution.pdf"
figure. It uses the hist function in R.

c. makePlotsWithPopRestriction.r
SEMI-OBSOLETE
This file makes the best figure. It makes the "XXX_Rama_HSV.svg" figure (the
figure that compares the Theoretical and Empirical results). This figure needs
to be edited in Inkscape. It requires both "XXX_all_bins" files. The file requires
that you specify "code", which is the three-letter abbreviation for the
amino acid you want to make the figure for. This uses the image function to make
the figures.

d. makeRamaPlot.r
This file makes the Ramachandran plot of all RSA with the Miller normalization
values, where the data points are put into two categories, RSA>1 and RSA<=1. This
script requires the "XXX_SA_Over" and "XXX_SA_Under" files. The output file is
"XXX_RamaPlotRSA.pdf". To specify which XXX you want to make, assign the variable
'code' to the three-letter abbreviation.

e. makeEmpCalVpop.r
This script makes the plot that maps the population of data points in the
empirical bins to the difference between the Theoretical and Empirical maximums
for each bin. The script needs the
"EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX" file and makes
"XXX_DifferenceVPopulation", where you have to specify or create the variable
'code' in order to get the plot for your amino acid of interest.

f. makeNormCorPlot.r
This script makes the correlation (3x3) plot. This script needs the "Wolfden.txt",
"Rose.txt", and "Kite_Doolittle.txt" files. It also needs the
"Hydrophobicity_Scales_updated.txt" file. It outputs
"NormalizedCorrelations.pdf". This script computes the Pearson correlation.

g. makeALLOWEDBinnedRamaPlot.r
Same idea as I.E.5.c but with the "XXX_max_bins_ALLOWED" files.

h. makeCOREBinnedRamaPlot.r
Same idea as I.E.5.c but with the "XXX_max_bins_CORE" files.

i. makeGENEROUSBinnedRamaPlot.r
Same idea as I.E.5.c but with the "XXX_max_bins_GENEROUS" files.

j. getAngles.R

FOR DARIA to fill out

II. Data Files
Descriptions of what some of the data files look like.

A. Xxx_geo files (RSA-normalization-values\GeoFiles\)
These files hold all bond angles, bond lengths, dihedral angles, SA, RSA (Miller), neighboring
residues, and secondary structure from the PDB files I mined from the
"cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz" file. This is tab delimited. I read this into R
with the read.delim command.

Needs: "get_PDB.py", "cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz", DSSP program, DSSPData.py
generated from program: parse_alignment

B. AnglesIteratedThroughAgainXXX (RSA-normalization-values\AnglesIteratedThroughAgain)
These files are the 1-degree discrete rotations of all phi and psi conformations. I did not record
the rotamer conformations that gave the highest SA. This file has three columns: "SA\tPhi\tPsi". It is
tab delimited.

Needs: "Geometry.py", "PeptideBuilder.py", DSSP
generated from program: iterateThroughModels.py
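
A file in this three-column layout could be read in Python along the lines of the sketch
below. The function name is an assumption, as is the presence of a header line (toggle
has_header if the files have none); the sample data is made up for the example.

```python
import csv
import io

def read_sa_table(handle, has_header=True):
    """Read tab-delimited SA, Phi, Psi rows into a list of (sa, phi, psi) floats."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the column names
    return [tuple(float(value) for value in row[:3]) for row in reader if row]

# Made-up sample standing in for an "AnglesIteratedThroughAgainXXX" file.
sample = io.StringIO("SA\tPhi\tPsi\n101.5\t-180\t-180\n99.0\t-180\t-179\n")
rows = read_sa_table(sample)
```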

C. XXX_SA_Over/Under, Xxx_Rose_RSA (RSA-normalization-values\SA_Over_Under\Over, ...\Under)
These files are pretty self-explanatory. The "XXX_SA_Over/Under" files have the phi and psi values
of all the residues with RSA >1 and RSA <=1, respectively. The file is tab delimited. Xxx_Rose_RSA
has the normalized SA values using the Rose normalization values. It has two tab-delimited
columns: the AA and its neighbors, followed by the RSA (with Rose constant).

Needs: "Xxx_geo"
generated from program: "SeperateOverAndUnderRSA1.py" or "getRoseRSA.py"

D. XXX_max_bins_all
These files are the binned empirical and theoretical results. This file has five columns:
"Phi \t Psi \t max_obs_SA \t max_theo_SA \t obs_bin_pop". This is also tab delimited.

Needs: "Xxx_geo"
generated from program: "max_bin_all_data.py"

E. EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX (RSA-normalization-values\EmpiricalVTheoretical)
These files are used to make the population difference figures. The file has two columns: the
SA_difference for a bin and the population of the bin. This is also tab delimited.

Needs: "XXX_max_bins_all"
generated from program: "EmpVCalc_get_Diff_with_pop_restriction.py"

F. NormalizationValuesByPercentDataCoverage
This tab-delimited file is the output of the R script get_population_cut_offs.r. It is used
by most of the R scripts for normalization.

Needs: "Xxx_geo" and "AnglesIteratedThroughAgainXXX"
generated from program: "get_population_cut_offs.r"

G. NormalizationValuesByPercentDataCoverageAndGenerous.txt
This tab-delimited file is the output of the R script get_population_cut_offs.r. It is used
by most of the R scripts for normalization.

Needs: "Xxx_ALLOWED_geo" and "AnglesIteratedThroughAgain_ALLOWED_XXX"
generated from program: "get_population_cut_offs.r"

H. Hydrophobicity_Scales_Updated.txt
This is a tab-delimited file that contains the Empirical and Theoretical normalized RSA mean,
median, square root mean, box-cox mean, and the percent buried residues.

Needs: "NormalizationValuesByPercentDataCoverageAndGenerous.txt", "Xxx_geo"
generated from program: "getMeanSA.r"

I. Wolfden.txt, rose.txt, Kite_Doolittle.txt, Fauchere.txt, Wimley.txt, Moon.txt, Radzicka.txt, MacCallum.txt
These are tab-delimited files that contain the hydrophobicity values from each of the respective
papers.

J. cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz
The resolution and percent identity cutoffs are given in the filename. E.g., for cullpdb_pc20_res1.8_R0.25_d130517_chains3211, the percentage identity cutoff is 20%, the resolution cutoff is 1.8 angstroms, and the R-factor cutoff is 0.25. The list was generated on May 22, 2013. The number of chains in the list is 3211.
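
The naming convention above can be unpacked with a small sketch like the following. The
function name and dictionary keys are assumptions, and the "d" field is returned as the
raw string from the filename because its exact meaning is not spelled out here.

```python
import re

PATTERN = re.compile(
    r"cullpdb_pc(?P<identity>[\d.]+)_res(?P<resolution>[\d.]+)"
    r"_R(?P<r_factor>[\d.]+)_d(?P<date>\d{6})_chains(?P<chains>\d+)"
)

def parse_cull_name(filename):
    """Extract the cutoffs encoded in a cullpdb list filename."""
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError("not a cullpdb filename: " + filename)
    return {
        "identity_percent": float(match["identity"]),
        "resolution": float(match["resolution"]),
        "r_factor": float(match["r_factor"]),
        "date": match["date"],  # raw "d" field, kept as a string
        "chains": int(match["chains"]),
    }
```

For the file used in this project, this gives a 30% identity cutoff, a 1.8 angstrom
resolution cutoff, an R-factor cutoff of 0.25, and 4961 chains.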

K. Allowed, Core, and Generous Bins
Files used to make figures and to estimate reasonable Ramachandran angle cut-offs.
