RSA normalization Project

----------------------------------------------------------------------------------------------------------------

Table of Contents:

I. Important Programs
	A. Mine PDB information
		1. parse_alignment.py		
		2. get_PDB.py
	B. Theoretical Model Construction
		1. Geometry.py
		2. PeptideBuilder.py
	C. Theoretical data generation
		1. iterateThroughModels.py
		2. DSSPData.py
	D. Data Analysis
		1. max_bins_with_population_restriction.py
		2. max_bins_with_population_restriction_theoretical.py
		3. SeperateOverAndUnderRSA1.py
		4. EmpVCalc_get_Diff_with_pop_restriction.py
		5. max_bin_all_data.py
		6. getRoseRSA.py
		7. make_ALLOWED_GeoFiles.py
		8. get_ALLOWED_bins.py
		9. get_CORE_bins.py
		10. get_GENEROUS_bins.py
	E. R scripts
		1. getMaximumValues.r
		2. getMeanSA.r
		3. CorrelatonTableNewScales.r
		4. get_population_cut_offs.r
		5. Figure Scripts
			a. barGRSA.r
			b. makeRSAdistribution.r
			c. makePlotsWithPopRestriction.r
			d. makeRamaPlot.r
			e. makeEmpCalVpop.r
			f. makeNormCorPlot.r
			g. makeALLOWEDBinnedRamaPlot.r
			h. makeCOREBinnedRamaPlot.r
			i. makeGENEROUSBinnedRamaPlot.r
			j. getAngles.r
	F. Misc.
		1. editHydroScales.py
		2. runAll.py

		
II. Data Files
	A. Xxx_geo
	B. AnglesIteratedThroughAgainXXX
	C. XXX_SA_Over/Under, Xxx_Rose_RSA
	D. XXX_max_bins_all
	E. EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX
	F. NormalizationValuesByPercentDataCoverage
	G. NormalizationValuesByPercentDataCoverageAndGenerous.txt
	H. Hydrophobicity_Scales_Updated.txt
	I. Wolfden.txt, rose.txt, Kite_Doolittle.txt, Fauchere.txt, Wimley.txt, Moon.txt, Radzicka.txt, MacCallum.txt 
	J. cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz
	K. Allowed, Core, and Generous Bins
	

----------------------------------------------------------------------------------------------------------------

I. Important Programs
	This section is dedicated to all python, R, and other scripts used to obtain or analyze data.
	
	A. Mine PDB Information
		To mine the protein structures I parsed the "cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz" 
		file in "parse_alignment.py".
		
		1. parse_alignment.py
			This is a python script that takes in a list of PDB and Chain IDs (we used the
			"cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz" file from the Dunbrack lab).
			This program creates the "Xxx_geo" files (i.e. Ala_geo, Asn_geo, etc.).
			This program also only outputs information from non-chain-terminating residues, which
			are identified by the peptide bond length between two residues and by non-ambiguous
			neighbors.
			This program is mainly a frame to process the output from the real PDB parser, "get_PDB.py".
			Any file that is either corrupted or does not exist in the PDB database is reported in
			"Error_report_bond_length.txt".
			The two numbers that are printed to the terminal are for testing.
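
			The bond-length test for chain-terminating residues can be sketched as follows. This
			is only an illustration, not the code in "parse_alignment.py": the function names, the
			input layout, and the 1.5 angstrom cutoff are all assumptions made for the example.

```python
import math

# Hypothetical cutoff in angstroms; the value used by the real script is not stated here.
MAX_PEPTIDE_BOND = 1.5


def is_bonded(c_coord, n_coord, cutoff=MAX_PEPTIDE_BOND):
    """True if the C atom of one residue and the N atom of the next are within cutoff."""
    return math.dist(c_coord, n_coord) <= cutoff


def internal_residues(backbone, cutoff=MAX_PEPTIDE_BOND):
    """Return indices of residues bonded to both a previous and a next residue.

    backbone is a list of (n_coord, c_coord) tuples, one per residue, in chain order.
    The first and last residues can never qualify, so they are skipped.
    """
    keep = []
    for i in range(1, len(backbone) - 1):
        prev_c = backbone[i - 1][1]
        this_n, this_c = backbone[i]
        next_n = backbone[i + 1][0]
        if is_bonded(prev_c, this_n, cutoff) and is_bonded(this_c, next_n, cutoff):
            keep.append(i)
    return keep
```

			A residue next to a gap in the structure fails one of the two distance checks and is
			dropped, which is the behaviour described above.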
	
		2. get_PDB.py
			This is a python program that takes in a PDB file name and Chain ID name, downloads
			the PDB file to the "structures" folder, then extracts all the information from the
			PDB file. This program needs the PDB parser from Biopython, as well as DSSP and the
			proper parser for it. The program outputs lists of information. This program contains
			many functions in order to properly mine the data and needs
			DSSPData.py to work.
			
	B. Theoretical Model Construction
		"Geometry" and "PeptideBuilder" are both used to build the theoretical models. Their
		functionality is described below.
		
		1. Geometry.py
			This is a library of Amino Acid Geometry objects. By reading in the one-letter Amino
			Acid abbreviations, it can create the correct geometric parameters to construct a
			protein residue. There are 20 classes (one for each amino acid) and one function
			to create the geometry object. Some of these classes have an inputRotamers function
			that takes in a list of integers in order to change the Amino Acid's rotamers. The
			generateRandomRotamers method is used in "iterateThroughModels".

		2. PeptideBuilder.py
			This takes in geometry objects to construct amino acid chains. There are 20 methods
			to construct the amino acids, and a calculate coordinates method that is used in
			"makeStructure". The program also contains a makeStructure method that takes in a
			string of Amino Acids, a list of Phi angles (floats), and a list of Psi angles (floats).
			It contains two add residue methods, an initialize_residue method, and a
			makeExtendedStructure method. It also has an output structure method. This program
			creates PDB files with a name of your choice.

	C. Theoretical data generation
		These scripts were used to iterate through the models' phi and psi conformations.
		"iterateThroughModels" is a script that uses Geometry and PeptideBuilder to build the
		phi and psi conformation, and if possible, all rotamer conformations.
	
		1. iterateThroughModels.py
			The program needs "Geometry" and "PeptideBuilder" to create the phi and psi
			conformations. It has a nice DSSP method and methods to iterate through the phi, psi,
			and chi angles of a residue. This program creates the "AnglesIteratedThroughAgainXXX" files.
			
		2. DSSPData.py
			Parser object to read the output of the DSSP program.

	D. Data Analysis
		After obtaining the information in the "Xxx_geo" files and rotating through the theoretical
		models, we wrote scripts to parse and analyze the data.
		
		1. max_bins_with_population_restriction.py
			OBSOLETE, replaced by I.D.5
			This program takes in the "Xxx_geo" data file and bins the data into 5-degree by
			5-degree Phi and Psi bins, recording the max SA found for each bin and the number
			of data points in that bin. The input of the program requires an all-caps three-letter
			abbreviation of the amino acid you want to look at. This outputs the
			"XXX_max_emperical_bins" file.
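
			A minimal sketch of the 5-degree binning idea is shown here. It is not the code from
			this script: the function names are made up, each bin is labelled by its lower edge
			(an assumption), and the input is taken to be already-parsed (phi, psi, SA) tuples.

```python
import math
from collections import defaultdict

BIN_SIZE = 5  # degrees, matching the 5-degree by 5-degree bins described above


def bin_floor(angle, size=BIN_SIZE):
    """Lower edge of the bin that contains angle, e.g. -62.3 falls in the -65 bin."""
    return int(math.floor(angle / size) * size)


def max_sa_bins(records, size=BIN_SIZE):
    """Bin (phi, psi, sa) records and keep the max SA and the population of each bin.

    Returns a dict mapping (phi_bin, psi_bin) to (max_sa, count).
    """
    bins = defaultdict(lambda: (0.0, 0))
    for phi, psi, sa in records:
        key = (bin_floor(phi, size), bin_floor(psi, size))
        max_sa, count = bins[key]
        bins[key] = (max(max_sa, sa), count + 1)
    return dict(bins)
```

			Note that with this labelling an angle of exactly 180 lands in its own bin; whether
			the real script wraps it back to -180 is not something this sketch tries to reproduce.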

		2. max_bins_with_population_restriction_theoretical.py
			OBSOLETE, replaced by I.D.5
			This program takes in the "XXX_max_emperical_bins_pop_restriction" file and the 
			"AnglesIteratedThroughAgainXXX" file to make the "XXX_max_theoretical_bin_Again" file.
			The command line argument is the all-caps three letter abbreviation of which amino acid 
			you want to look at. This file makes sure that the theoretical data is binned and
			treated in parallel with the empirical data. This program was not used in the final
			write up of the paper, but is a good script nonetheless.

		3. SeperateOverAndUnderRSA1.py
			This program takes in the "Xxx_geo" data file and creates two files, "XXX_SA_Over/Under".
			This program is fairly basic, but it's a good basis to parse the "Xxx_geo" files. The
			command line argument is the all-caps three letter abbreviation of which amino acid you
			want to look at.
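
			The split itself is simple and might look like the sketch below. The function name
			and the (phi, psi, rsa) input tuples are assumptions; the real script reads these
			values from the "Xxx_geo" file, whose column layout is not reproduced here.

```python
def split_over_under(records):
    """Split (phi, psi, rsa) records into RSA > 1 and RSA <= 1 groups.

    Returns (over, under), each a list of (phi, psi) pairs.
    """
    over, under = [], []
    for phi, psi, rsa in records:
        target = over if rsa > 1 else under
        target.append((phi, psi))
    return over, under
```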

		4. EmpVCalc_get_Diff_with_pop_restriction.py
			This program takes in the "XXX_geo" file and the "AnglesIteratedThroughAgain" file to
			create the "EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX" file.
			The command line argument is the all-caps three letter abbreviation of which amino acid 
			you want to look at. This program just compares the bins from both "XXX_max_bins" files.
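
			The bin comparison can be sketched roughly as below. This is an illustration under
			assumptions: the dictionaries, the min_pop parameter, and the sign convention
			(theoretical minus empirical) are all hypothetical and may differ from the script.

```python
def diff_vs_population(empirical, theoretical, min_pop=1):
    """Compare per-bin maxima and report (sa_difference, population) rows.

    empirical maps (phi_bin, psi_bin) to (max_sa, population).
    theoretical maps (phi_bin, psi_bin) to max_sa.
    Bins below min_pop, or missing from the theoretical data, are skipped.
    """
    rows = []
    for key, (emp_max, pop) in sorted(empirical.items()):
        if pop < min_pop or key not in theoretical:
            continue
        rows.append((theoretical[key] - emp_max, pop))
    return rows
```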

		5. max_bin_all_data.py
			This program takes in the "XXX_geo" files and the "AnglesIteratedThroughAgain" files to
			create the "XXX_max_all_bins" file. The command line argument is the all-caps three letter 
			abbreviation of which amino acid you want to look at. This program compresses and bins 
			SA information from both files into one.

		6. getRoseRSA.py
			This program's main use is to make the "Xxx_Rose_RSA" data files, which contain the
			normalized RSA values using the normalization constants from the Rose paper. The
			input for this program is the "Xxx_geo" files.
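
			Normalizing SA to RSA is just a division by the per-residue constant. The sketch
			below shows the idea; the function names are made up, and the constants must be
			supplied by the caller (no values from the Rose paper are reproduced here).

```python
def relative_sa(sa, max_sa):
    """RSA = SA divided by the normalization constant for that residue type."""
    if max_sa <= 0:
        raise ValueError("normalization constant must be positive")
    return sa / max_sa


def normalize(records, constants):
    """Normalize (residue, sa) records with a {residue: max_sa} lookup table.

    constants is supplied by the caller; its values are not defined in this sketch.
    """
    return [(res, relative_sa(sa, constants[res])) for res, sa in records]
```

			For example, with a placeholder constant of 100.0 for a residue, an SA of 50.0 gives
			an RSA of 0.5. Values above 1 are possible when the constant underestimates the
			true maximum.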
			
		7. make_ALLOWED_GeoFiles.py
			This program filters the "XXX_geo" files using the information obtained from the data
			generated by "max_bin_all_data.py". The program takes in "XXX_max_bin_all",
			"XXX_geo", and "NormalizationValuesByPercentDataCoverageAndGenerous.txt" to make new
			data files containing only data in an ALLOWED region of the Ramachandran plot. This
			program outputs the "XXX_ALLOWED_geo" files and the
			"AnglesIteratedThroughAgain_ALLOWED_XXX" data files.
			
		8. get_ALLOWED_bins.py
			This program bins the data from the "make_ALLOWED_GeoFiles.py" script.
			
		9. get_CORE_bins.py
			This program bins the data from "XXX_geo" files and "AnglesIteratedThroughAgainXXX" files
			based on the regions defined as CORE in the "NormalizationValuesByPercentDataCoverageAndGenerous.txt"
			output.
		
		10. get_GENEROUS_bins.py
			This program bins the data from "XXX_geo" files and "AnglesIteratedThroughAgainXXX" files
			based on the regions defined as GENEROUS in the 
			"NormalizationValuesByPercentDataCoverageAndGenerous.txt" output.
			
	E. R scripts
		The R-scripts were mainly used to make figures and to do simple things that would have been a 
		bit more complicated to do in python.

		1. getMaximumValues.r
			OBSOLETE, but a useful script for looking at the data from another perspective.
			This gets the maximum SA values from both "XXX_max_bins" data files. It outputs the
			"NormalizationValues.txt" in a csv format.

		2. getMeanSA.r
			This gets the mean RSA, median RSA, square root mean RSA, box-cox transformed mean RSA,
			fraction of 100% buried residues, and fraction of 95% buried residues (for theoretical
			normalization values only). For all the average estimates, the script uses both 
			empirical and theoretical normalization values from ALLOWED regions. The output file is a 
			csv file called "Hydrophobicity_Scales_updated.txt". It also has optional 
			"MeanHydrophobicityScales.txt" and "BuriedHydrophobicityScales.txt" output. The script 
			requires the "NormalizationValuesByPercentDataCoverageAndGenerous.txt" file or the 
			"NormalizationValuesByPercentDataCoverage.txt".

		3. CorrelatonTable.r
			This makes the correlation table of all scales in "Hydrophobicity_Scales_updated.txt"
			with the Wolfenden, Kyte, Radzicka, MacCallum, Moon, Wimley, Fauchere, and Rose scales. 
			This requires all of these files to run. It performs the Pearson correlation test.
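
			For readers working in python rather than R, the Pearson coefficient behind this
			table can be sketched as below. This is not a port of the R script: the function
			names are made up, and only the coefficient is computed (no p-value).

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_table(scales):
    """Pairwise correlations for a {scale_name: values} dict.

    Every scale must list its values in the same amino acid order.
    """
    return {(a, b): pearson_r(scales[a], scales[b])
            for a, b in combinations(sorted(scales), 2)}
```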
			
		4. get_population_cut_offs.r
			This script obtains the normalization values (MAX RSA) of both the empirical and theoretical data.
			This script is unusual in that it needs to be run twice, with different lines commented out each
			time. The first run obtains the ALLOWED and CORE bin angle cut-offs. After the ALLOWED and CORE
			bins are obtained, the program is run again with the commented-out lines uncommented. The GENEROUS
			bins rely on the areas of the ALLOWED bins and are calculated from the "XXX_ALLOWED_geo" and
			"AnglesIteratedThroughAgain_ALLOWED_XXX" files.

		5. Figure Scripts
			These scripts were used to make the figures in R. Most have alternate pdf/png code in
			their scripts that is commented out.

			a. barGRSA.r
				Using the "Xxx_geo" files and the "Xxx_Rose_RSA" files, the script makes the
				"BarGraphRSA.pdf" figure. There is an optional png script at the bottom of this
				file; it is currently commented out.
 
			b. makeRSAdistribution.r
				This script makes the RSA distribution of Alanine. However, with a bit of tweaking
				it can make the distribution for any Amino Acid. This makes the
				"Alanine_RSA_distribution.pdf" figure. It uses the hist function in R.

			c. makePlotsWithPopRestriction.r
				SEMI-OBSOLETE
				This file makes the best figure. It makes the "XXX_Rama_HSV.svg" figure (the
				figure that compares the Theoretical and Empirical results). This figure needs
				to be edited in Inkscape. It requires both "XXX_all_bins" files. The file requires
				that you specify "code", which is the three-letter abbreviation for the
				amino acid you want to make the figure for. This uses the image function to make
				the figures.

			d. makeRamaPlot.r
				This file makes the Ramachandran plot of all RSA with the Miller normalization
				values, where the data points are put into two categories, RSA>1 and RSA<=1. This
				script requires the "XXX_SA_Over" and "XXX_SA_Under" files. The output file is
				"XXX_RamaPlotRSA.pdf". To specify which XXX you want to make, assign the variable
				'code' to the three-letter abbreviation.
		
			e. makeEmpCalVpop.r
				This script makes the plot where I map the population of data points in the
				empirical bins to the difference between the Theoretical and Empirical maximums
				for each bin. The script needs the
				"EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX" file and makes
				"XXX_DifferenceVPopulation". You have to specify or create the variable
				'code' in order to get the plot for your amino acid of interest.

			f. makeNormCorPlot.r
				This script makes the correlation (3x3) plot. This script needs the "Wolfden.txt",
				"Rose.txt", and "Kite_Doolittle.txt" files. It also needs the
				"Hydrophobicity_Scales_updated.txt" file. It outputs
				"NormalizedCorrelations.pdf". This script computes the Pearson correlation.
				
			g. makeALLOWEDBinnedRamaPlot.r
				Same idea as I.E.5.c but with the "XXX_max_bins_ALLOWED" files.
				
			h. makeCOREBinnedRamaPlot.r
				Same idea as I.E.5.c but with the "XXX_max_bins_CORE" files.
				
			i. makeGENEROUSBinnedRamaPlot.r
				Same idea as I.E.5.c but with the "XXX_max_bins_GENEROUS" files.
				
			j. getAngles.R
			
				FOR DARIA to fill out
				
				
II. Data Files
	Descriptions of what some of the data files look like.

	A. Xxx_geo files (RSA-normalization-values\GeoFiles\)
		These files hold all bond angles, bond lengths, dihedral angles, SA, RSA (Miller), neighboring
		residues, and secondary structure from the PDB files I mined from the
		"cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz" file. This is tab delimited. I read this into R
		with the read.delim command.
		
		Needs: "get_PDB.py", "cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz", DSSP program, DSSPData.py
		generated from program: parse_alignment

	B. AnglesIteratedThroughAgainXXX (RSA-normalization-values\AnglesIteratedThroughAgain)
		These files are the 1 degree discrete rotations of all phi and psi conformations. I did not record
		the rotamer conformations that gave the high SA. This file has three columns: "SA\tPhi\tPsi". It is
		tab delimited.

		Needs: "Geometry.py", "PeptideBuilder.py", DSSP
		generated from program: iterateThroughModels.py
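
		A short sketch of reading this three-column layout in python is given below. The function
		name is made up, and whether the files carry a header row is an assumption; the reader
		simply skips any row that is not three numbers.

```python
import csv
import io


def read_angle_rows(handle):
    """Yield (sa, phi, psi) floats from a tab-delimited stream.

    Rows that do not hold exactly three numeric fields (such as a header) are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue
        try:
            sa, phi, psi = (float(value) for value in row)
        except ValueError:
            continue  # header or malformed line
        yield sa, phi, psi


# Usage with an in-memory stand-in for a real file:
example = io.StringIO("SA\tPhi\tPsi\n101.5\t-60\t-45\n")
rows = list(read_angle_rows(example))
```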

	C. XXX_SA_Over/Under, Xxx_Rose_RSA (RSA-normalization-values\SA_Over_Under\Over, ...\Under)
		These files are pretty self-explanatory. The "XXX_SA_Over/Under" files have the phi and psi values
		of all the residues with RSA >1 and RSA <=1, respectively. The file is tab delimited. Xxx_Rose_RSA
		has the normalized SA values using the Rose normalization values. It has two tab-delimited
		columns: the AA and its neighbors, followed by the RSA (with Rose constant).

		Needs: "Xxx_geo"
		generated from program: "SeperateOverAndUnderRSA1.py" or "getRoseRSA.py"

	D. XXX_max_bins_all
		These files are the binned empirical and theoretical results. This file has five columns:
		"Phi \t Psi \t max_obs_SA \t max_theo_SA \t obs_bin_pop". This is also tab delimited.

		Needs: "Xxx_geo"
		generated from program: "max_bin_all_data.py"
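
		The five-column layout maps naturally onto a small record type. The sketch below is only
		an illustration: the class and function names are made up, the population is assumed to
		be stored as an integer, and the min_pop filter is a hypothetical parameter.

```python
from typing import NamedTuple


class BinRow(NamedTuple):
    phi: float
    psi: float
    max_obs_sa: float
    max_theo_sa: float
    obs_bin_pop: int


def parse_bin_line(line):
    """Parse one tab-delimited data line into a BinRow."""
    phi, psi, max_obs, max_theo, pop = line.rstrip("\n").split("\t")
    return BinRow(float(phi), float(psi), float(max_obs), float(max_theo), int(pop))


def overall_maxima(rows, min_pop=1):
    """Largest observed and theoretical SA over bins with at least min_pop observations."""
    kept = [row for row in rows if row.obs_bin_pop >= min_pop]
    if not kept:
        raise ValueError("no bins pass the population filter")
    return max(r.max_obs_sa for r in kept), max(r.max_theo_sa for r in kept)
```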

	E. EmpericalVCalculated_diff_pop_nonZeroed_with_pop_restriction_XXX (RSA-normalization-values\EmpiricalVTheoretical)
		These files are used to make the population difference figures. The file has two columns: the
		SA_difference for a bin and the population of the bin. This is also tab delimited.

		Needs: "XXX_max_bins_all"
		generated from program: "EmpVCalc_get_Diff_with_pop_restriction.py"

	F. NormalizationValuesByPercentDataCoverage
		This is a tab-delimited file that is the output of the R script get_population_cut_offs.r. It
		is used by most of the R scripts for normalization.

		Needs: "Xxx_geo" and "AnglesIteratedThroughAgainXXX"
		generated from program: "get_population_cut_offs.r"
	
	G. NormalizationValuesByPercentDataCoverageAndGenerous.txt
		This is a tab-delimited file that is the output of the R script get_population_cut_offs.r. It
		is used by most of the R scripts for normalization.

		Needs: "Xxx_ALLOWED_geo" and "AnglesIteratedThroughAgain_ALLOWED_XXX"
		generated from program: "get_population_cut_offs.r"

	H. Hydrophobicity_Scales_Updated.txt
		This is a tab delimited file that contains the Empirical and Theoretical normalized RSA mean, 
		median, square root mean, box-cox mean, and the percent buried residues.
		
		Needs: "NormalizationValuesByPercentDataCoverageAndGenerous.txt", "Xxx_geo"
		generated from program: "getMeanSA.r"
	
	I. Wolfden.txt, rose.txt, Kite_Doolittle.txt, Fauchere.txt, Wimley.txt, Moon.txt, Radzicka.txt, MacCallum.txt 
		These are tab-delimited files that contain the hydrophobicity values from each of the
		respective papers.

	J. cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz
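		This is the culled PDB chain list from the Dunbrack lab. The resolution and percent identity
		cutoffs are given in the filename. E.g., for cullpdb_pc20_res1.8_R0.25_d130517_chains3211, the
		percentage identity cutoff is 20%, the resolution cutoff is 1.8 angstroms, and the R-factor
		cutoff is 0.25; that list was generated on May 22, 2013 and contains 3211 chains. By the same
		convention, the file used here has a 30% identity cutoff, a 1.8 angstrom resolution cutoff,
		an R-factor cutoff of 0.25, and 4961 chains.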
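
		The naming convention can be decoded mechanically. The sketch below is an illustration
		only: the pattern and function name are made up for this README, and the "d" field is
		kept as a raw string because its exact meaning is not spelled out here.

```python
import re

# Pattern inferred from the example filename above; not an official specification.
CULLPDB_PATTERN = re.compile(
    r"cullpdb_pc(?P<identity>\d+(?:\.\d+)?)"
    r"_res(?P<resolution>\d+(?:\.\d+)?)"
    r"_R(?P<rfactor>\d+(?:\.\d+)?)"
    r"_d(?P<date>\d{6})"
    r"_chains(?P<chains>\d+)"
)


def parse_cullpdb_name(name):
    """Extract the cutoffs encoded in a cullpdb list filename."""
    match = CULLPDB_PATTERN.search(name)
    if match is None:
        raise ValueError("not a recognised cullpdb filename: " + name)
    return {
        "percent_identity": float(match["identity"]),
        "resolution": float(match["resolution"]),
        "r_factor": float(match["rfactor"]),
        "date_stamp": match["date"],
        "chains": int(match["chains"]),
    }


info = parse_cullpdb_name("cullpdb_pc30_res1.8_R0.25_d130607_chains4961.gz")
```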

	K. Allowed, Core, and Generous Bins
		Files used to make figures and to estimate reasonable Ramachandran angle cut-offs.