BioPandas/biopandas

Adding a function for computing the radius of gyration

rasbt opened this issue · 2 comments

rasbt commented

The rgyr staticmethod should be implemented similar to the rmsd staticmethod in PandasPDB

Hey, I've had a go at this and wrote this function, let me know if it looks good and if it would need any additional functionality, I'm not sure of all the use cases for the radius of gyration (I just have amateur knowledge of bioinformatics). For one, the function only works properly for proteins, as it doesn't have masses for other atoms, and it adds a default mass of 0 if the atom is not in atomic_masses. More masses could be added of course, but I'm not sure if anyone would need the rgyr for smaller molecules. I would also suggest that we call the function gyradius as it is a synonym for radius of gyration and would be more descriptive.

@staticmethod
def rgyr(df):
    """Compute the Radius of Gyration of a molecule

    Parameters
        ----------
        df : pandas.DataFrame
            DataFrame with HETATM, ATOM, and/or ANISOU entries.

        Returns
        ---------
        rg : float
            Radius of Gyration of df in Angstrom

        """
    # could be made as a class variable if it will be needed elsewhere
    atomic_masses = {"C": 12.0107, "O": 15.9994, "N": 14.0067, "S": 32.065}

    coords = df[["x_coord", "y_coord", "z_coord"]].to_numpy()
    masses = np.array([atomic_masses.get(atom, 0) for atom in df["element_symbol"]])
    total_mass = masses.sum()
    center_of_mass = (masses[:, None] * coords).sum(axis=0) / total_mass
    distances = np.linalg.norm(coords - center_of_mass, axis=1)
    rg = np.sqrt((distances**2 * masses).sum() / total_mass)
    return round(rg, 4)

I also wrote some tests for it, I don't know if they're appropriate (bear in mind that I'm a beginner in general), so please suggest more or different tests if needed. In the rgyr function I used numpy to improve performance with vectorized calculations, but in test_accuracy I used a more basic and straightforward implementation, in case any logic is changed in rgyr, but I'm not sure if the test is even necessary, maybe test_pdb_df is enough. I should also note that I looked at pymol's rgyrate function to see their implementation, so it is similar but with clearer variable names

# BioPandas
# License: BSD 3 clause
# Project Website: http://rasbt.github.io/biopandas/
# Code Repository: https://github.com/rasbt/biopandas

from pandas_pdb import PandasPdb
import os
import pandas as pd


TESTDATA_1t48 = os.path.join(os.path.dirname(__file__), "data", "1t48_995.pdb")

p1t48 = PandasPdb()
p1t48.read_pdb(TESTDATA_1t48)


def test_accuracy():
    # Create test DataFrame with 3 atoms
    test_df = pd.DataFrame({"element_symbol": ["C", "O", "N", "S"],
                            "x_coord": [1.0, 2.0, 3.0, 4.0],
                            "y_coord": [5.0, 6.0, 7.0, 8.0],
                            "z_coord": [9.0, 10.0, 11.0, 12.0]})

    coords = test_df[["x_coord", "y_coord", "z_coord"]].to_numpy()
    masses = [12.0107, 15.9994, 14.0067, 32.065]
    total_mass = sum(masses)

    weighted_coords = [(m*x, m*y, m*z) for (x, y, z), m in zip(coords, masses)]
    weighted_deviation = sum(m * (x**2 + y**2 + z**2) for (x, y, z), m in zip(coords, masses))
    mean_weighted_coords = [sum(coords) / total_mass for coords in zip(*weighted_coords)]
    mean_weighted_deviation = sum(coord**2 for coord in mean_weighted_coords)

    # rounding needs to be changed to pass test if final rounding in rgyr will be changed
    expected_rg = round((weighted_deviation / total_mass - mean_weighted_deviation)**0.5, 4)
    rg = PandasPdb.rgyr(test_df)
    assert rg == expected_rg, f"Expected {expected_rg}, got {rg} instead"


def test_pdb_df():
    rg = PandasPdb.rgyr(p1t48.df['ATOM'])
    expected_rg = 18.1508
    assert rg == expected_rg, f"Expected {expected_rg}, got {rg} instead"

And I could also write a small tutorial for the usage

a-r-j commented

Awesome! Happy to review if you make a PR.

Small nit: you can replace masses = np.array([atomic_masses.get(atom, 0) for atom in df["element_symbol"]]) with pandas .map().