This repo consists of two modules:
- A PDB (Protein Data Bank) dataset parser that returns cleaned ATOM information.
- A Ramachandran Analysis tool.
See PDB_Parser.py.
It streams from RCSB PDB (Protein Data Bank) the list of non-redundant protein structures clustered at the 30% sequence identity level. The resulting text file "clusters-by-entity-30.txt" contains over 300,000 lines, each of which corresponds to a cluster of single-chain sequences and structures. Each entry is a four-character alphanumeric PDB ID followed by "_" and a polymer entity identifier (not a chain identifier).
Next, it loops over the 100 largest clusters (the first 100 lines of the list) and selects one random structure from each cluster/line.
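The selection step can be sketched as follows. This is a minimal illustration, not the code in PDB_Parser.py: the function name pick_one_per_cluster, its parameters, and the sample lines are all hypothetical, and it assumes entries on a line are separated by whitespace.

```python
import random

def pick_one_per_cluster(lines, n_clusters=100, seed=None):
    """Pick one random entry from each of the first n_clusters non-empty lines."""
    rng = random.Random(seed)  # a seed makes the random choice reproducible
    picks = []
    for line in lines:
        if len(picks) >= n_clusters:
            break
        members = line.split()  # assumption: whitespace-separated entries
        if not members:
            continue  # skip blank lines
        entry = rng.choice(members)
        # Split "PDBID_entity" at the last underscore
        pdb_id, entity_id = entry.rsplit("_", 1)
        picks.append((pdb_id, entity_id))
    return picks

# Made-up lines in the "PDBID_entity" layout described above
sample = ["1ABC_1 2XYZ_2 3DEF_1", "4GHI_3", ""]
picks = pick_one_per_cluster(sample, n_clusters=2, seed=0)
```

Passing a seed is optional; without it, each run may pick a different structure per cluster.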
Finally, it extracts ATOM information from the PDB and FASTA data and returns a cleaned dataframe with the columns: atom_name, residue_name, x, y, z.
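As an illustration of the extraction step, the sketch below reads ATOM records from legacy PDB-format text using the fixed column layout of that format. It is an assumption that the parser works this way; the function name parse_atom_records and the sample coordinates are made up for the example, and the FASTA side is not covered.

```python
def parse_atom_records(pdb_text):
    """Return (atom_name, residue_name, x, y, z) tuples for ATOM records only."""
    rows = []
    for line in pdb_text.splitlines():
        if line[:6] != "ATOM  ":
            continue  # skips HETATM, TER, REMARK and other record types
        rows.append((
            line[12:16].strip(),   # atom name, columns 13-16
            line[17:20].strip(),   # residue name, columns 18-20
            float(line[30:38]),    # x, columns 31-38
            float(line[38:46]),    # y, columns 39-46
            float(line[46:54]),    # z, columns 47-54
        ))
    return rows

# Made-up records in fixed-column PDB layout
sample_pdb = "\n".join([
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "HETATM    3  O   HOH A 101      10.000  10.000  10.000",
])
rows = parse_atom_records(sample_pdb)
```

The resulting list of tuples can be handed to pandas.DataFrame with columns=["atom_name", "residue_name", "x", "y", "z"] to obtain the dataframe layout described above.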
See Ramachandran_Analysis.py and find the experiment results in Ramachandran_Report.pdf.
It produces Ramachandran plots (scatter plots) for: (a) all residues except glycines and prolines, (b) all glycines, and (c) all prolines.
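A Ramachandran plot places each residue at its backbone dihedral angles: phi is the dihedral over C(i-1), N(i), CA(i), C(i), and psi is the dihedral over N(i), CA(i), C(i), N(i+1). The sketch below shows one common way to compute a dihedral from four points and to assign a residue to one of the three plots. The names dihedral and plot_group are hypothetical and are not taken from Ramachandran_Analysis.py.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def plot_group(residue_name):
    """Map a three-letter residue name to one of the three plots."""
    if residue_name == "GLY":
        return "glycine"
    if residue_name == "PRO":
        return "proline"
    return "general"

# Four made-up points that form a right-angle dihedral
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Using atan2 rather than acos keeps the sign of the angle, which is needed to cover the full range from -180 to 180 degrees on both axes of the plot.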
To get the Ramachandran plots, run: $ python __main__.py
Note that Ramachandran_Analysis must be initialized with a pandas dataframe with the columns: atom_name, residue_name, x, y, z.
- The current version skips structures in PDBx/mmCIF format, so the "first 100" structures actually yield fewer (94 in the report).
- The current version simply accumulates all ATOM records instead of grouping them by chain, which may cause miscalculations (reduced to 128 occurrences in over 500,000 amino acids).
- Multi-processing or multi-threading could be used to improve the speed.
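For the last point, one possible approach is a thread pool from the standard library, since downloading structure files is mostly network-bound. This is only a sketch: process_all and fake_fetch are hypothetical names, and fake_fetch stands in for whatever download-and-parse step the parser actually performs.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(entries, process_one, max_workers=8):
    """Apply process_one to every entry in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, entries))

# Hypothetical stand-in for a real download-and-parse function
def fake_fetch(pdb_id):
    return pdb_id.lower()

results = process_all(["1ABC", "2XYZ", "4GHI"], fake_fetch)
```

If the per-structure work turns out to be CPU-bound (for example, heavy parsing), ProcessPoolExecutor from the same module offers the same map interface with separate processes.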