Adding a function for computing the radius of gyration
rasbt opened this issue · 2 comments
The rgyr staticmethod should be implemented similarly to the rmsd staticmethod in PandasPdb.
Hey, I've had a go at this and wrote the function below. Let me know if it looks good and whether it needs any additional functionality; I'm not sure of all the use cases for the radius of gyration (I only have amateur knowledge of bioinformatics). For one, the function only works properly for proteins: it only has masses for C, O, N, and S, and it falls back to a default mass of 0 if the atom is not in atomic_masses. More masses could be added of course, but I'm not sure if anyone would need the rgyr for smaller molecules. I would also suggest that we call the function gyradius, as it is a synonym for radius of gyration and would be more descriptive.
@staticmethod
def rgyr(df):
    """Compute the Radius of Gyration of a molecule

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with HETATM, ATOM, and/or ANISOU entries.

    Returns
    -------
    rg : float
        Radius of Gyration of df in Angstrom

    """
    # could be made a class variable if it is needed elsewhere
    atomic_masses = {"C": 12.0107, "O": 15.9994, "N": 14.0067, "S": 32.065}
    coords = df[["x_coord", "y_coord", "z_coord"]].to_numpy()
    masses = np.array([atomic_masses.get(atom, 0) for atom in df["element_symbol"]])
    total_mass = masses.sum()
    center_of_mass = (masses[:, None] * coords).sum(axis=0) / total_mass
    distances = np.linalg.norm(coords - center_of_mass, axis=1)
    rg = np.sqrt((distances**2 * masses).sum() / total_mass)
    return round(rg, 4)
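Since atoms missing from atomic_masses silently get a mass of 0, it may be worth checking which element symbols in a DataFrame are not covered before trusting the result. Here is a minimal sketch, assuming the same mass table as in rgyr above; the helper name unweighted_elements is hypothetical and not part of BioPandas:

```python
import pandas as pd

# Same table as in rgyr above; hydrogen and metals are not covered.
atomic_masses = {"C": 12.0107, "O": 15.9994, "N": 14.0067, "S": 32.065}


def unweighted_elements(df, masses=atomic_masses):
    """Return the sorted element symbols in df that would be weighted with 0.

    Hypothetical helper, for illustration only.
    """
    return sorted(set(df["element_symbol"]) - set(masses))


# Toy DataFrame with two symbols that are not in the table
df = pd.DataFrame({"element_symbol": ["C", "N", "H", "ZN", "O", "H"]})
print(unweighted_elements(df))  # ['H', 'ZN']
```

Any symbol reported here contributes neither to the center of mass nor to the final radius.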
I also wrote some tests for it. I don't know if they're appropriate (bear in mind that I'm a beginner in general), so please suggest more or different tests if needed. In the rgyr function I used numpy to improve performance with vectorized calculations, but in test_accuracy I used a more basic and straightforward implementation, so that the test still catches it if any logic is changed in rgyr. I'm not sure that test is even necessary, though; maybe test_pdb_df is enough. I should also note that I looked at pymol's rgyrate function to see their implementation, so mine is similar but with clearer variable names.
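The two implementations look different but should compute the same quantity. Writing $M = \sum_i m_i$ for the total mass and $\mathbf{R} = \frac{1}{M}\sum_i m_i \mathbf{r}_i$ for the center of mass (symbols introduced here for illustration), rgyr evaluates the first line below and test_accuracy the last:

```latex
\begin{aligned}
R_g^2 &= \frac{1}{M}\sum_i m_i \,\lVert \mathbf{r}_i - \mathbf{R} \rVert^2 \\
      &= \frac{1}{M}\sum_i m_i \lVert \mathbf{r}_i \rVert^2
         - 2\,\mathbf{R}\cdot\frac{1}{M}\sum_i m_i \mathbf{r}_i
         + \lVert \mathbf{R} \rVert^2 \\
      &= \frac{1}{M}\sum_i m_i \lVert \mathbf{r}_i \rVert^2 - \lVert \mathbf{R} \rVert^2
\end{aligned}
```

In the test code, weighted_deviation / total_mass corresponds to the first term of the last line and mean_weighted_deviation to $\lVert \mathbf{R} \rVert^2$.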
# BioPandas
# License: BSD 3 clause
# Project Website: http://rasbt.github.io/biopandas/
# Code Repository: https://github.com/rasbt/biopandas
import os

import pandas as pd

from biopandas.pdb import PandasPdb

TESTDATA_1t48 = os.path.join(os.path.dirname(__file__), "data", "1t48_995.pdb")
p1t48 = PandasPdb()
p1t48.read_pdb(TESTDATA_1t48)


def test_accuracy():
    # Create test DataFrame with 4 atoms
    test_df = pd.DataFrame({"element_symbol": ["C", "O", "N", "S"],
                            "x_coord": [1.0, 2.0, 3.0, 4.0],
                            "y_coord": [5.0, 6.0, 7.0, 8.0],
                            "z_coord": [9.0, 10.0, 11.0, 12.0]})
    coords = test_df[["x_coord", "y_coord", "z_coord"]].to_numpy()
    masses = [12.0107, 15.9994, 14.0067, 32.065]
    total_mass = sum(masses)
    # mass-weighted coordinates and mass-weighted squared distances from the origin
    weighted_coords = [(m*x, m*y, m*z) for (x, y, z), m in zip(coords, masses)]
    weighted_deviation = sum(m * (x**2 + y**2 + z**2) for (x, y, z), m in zip(coords, masses))
    # center of mass, one component per axis, and its squared norm
    mean_weighted_coords = [sum(axis) / total_mass for axis in zip(*weighted_coords)]
    mean_weighted_deviation = sum(coord**2 for coord in mean_weighted_coords)
    # rounding needs to be changed to pass test if final rounding in rgyr will be changed
    expected_rg = round((weighted_deviation / total_mass - mean_weighted_deviation)**0.5, 4)
    rg = PandasPdb.rgyr(test_df)
    assert rg == expected_rg, f"Expected {expected_rg}, got {rg} instead"


def test_pdb_df():
    rg = PandasPdb.rgyr(p1t48.df['ATOM'])
    expected_rg = 18.1508
    assert rg == expected_rg, f"Expected {expected_rg}, got {rg} instead"

I could also write a small tutorial on its usage.
Awesome! Happy to review if you make a PR.
Small nit: you can replace masses = np.array([atomic_masses.get(atom, 0) for atom in df["element_symbol"]]) with pandas .map().
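For reference, the .map() variant could look roughly like the sketch below. It assumes the same atomic_masses dict as in the function above and uses a toy DataFrame; note that .map() yields NaN for symbols missing from the dict, so a fillna(0) is needed to keep the current default of 0:

```python
import numpy as np
import pandas as pd

atomic_masses = {"C": 12.0107, "O": 15.9994, "N": 14.0067, "S": 32.065}

# Toy data; "H" is deliberately not in the mass table
df = pd.DataFrame({"element_symbol": ["C", "O", "H", "S"]})

# List-comprehension version from the proposed rgyr
masses_loop = np.array([atomic_masses.get(atom, 0) for atom in df["element_symbol"]])

# .map() version: unknown symbols become NaN, fillna(0) restores the default mass of 0
masses_map = df["element_symbol"].map(atomic_masses).fillna(0).to_numpy()

assert np.array_equal(masses_loop, masses_map)
```

Without the fillna(0), a single unknown element would turn the total mass, and therefore the result, into NaN.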