BioPandas/biopandas

Column names differ from VMD. Were they chosen intentionally?

StevenCHowell opened this issue · 4 comments

I just started using biopandas and it seems nice. Toward promoting discussion related to its design, I have a general question regarding the column names, particularly for df['ATOM']. Note that I work with simulations of molecular structures so I care most about the df['ATOM'] field.

Coming from several years of using VMD, I noticed many differences in how the PDB fields are accessed. I have not compared to Chimera or PyMol, but below is a comparison between biopandas, VMD, and the documentation for the PDB file format. I would prefer the VMD options, partly because they are familiar, but also because they are more succinct. They are probably not as clear as a verbose name, but they are a closer match to the PDB field.

@rasbt, did you have a strong motivation for the selections you made? If anyone else has other ideas or preferences, I would be interested to hear them.

BIOPANDAS        VMD        | COLUMNS        DATA TYPE     FIELD        DEFINITION
----------------------------|---------------------------------------------------------------------------------------
record_name      atom       |  1 -  6        Record name   "ATOM  "
atom_number      index      |  7 - 11        Integer       serial       Atom serial number.
atom_name        name       | 13 - 16        Atom          name         Atom name.
alt_loc          altloc     | 17             Character     altLoc       Alternate location indicator.
residue_name     resname    | 18 - 20        Residue name  resName      Residue name.
chain_id         chain      | 22             Character     chainID      Chain identifier.
residue_number   resid      | 23 - 26        Integer       resSeq       Residue sequence number.
insertion        insertion  | 27             AChar         iCode        Code for insertion of residues.
x_coord          x          | 31 - 38        Real(8.3)     x            Orthogonal coordinates for X in Angstroms.
y_coord          y          | 39 - 46        Real(8.3)     y            Orthogonal coordinates for Y in Angstroms.
z_coord          z          | 47 - 54        Real(8.3)     z            Orthogonal coordinates for Z in Angstroms.
occupancy        occupancy  | 55 - 60        Real(6.2)     occupancy    Occupancy.
b_factor         beta       | 61 - 66        Real(6.2)     tempFactor   Temperature factor.
segment_id       segname    |
element_symbol   element    | 77 - 78        LString(2)    element      Element symbol, right-justified.
charge           charge     | 79 - 80        LString(2)    charge       Charge on the atom.
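
To make the COLUMNS half of the table concrete, here is a minimal standard-library sketch that slices one ATOM line by those column ranges and keys the result by the PDB FIELD names. The function name `parse_atom_line`, the `ATOM_FIELDS` list, and the `record_name` key are assumptions made for illustration (the table gives no FIELD name for columns 1 - 6); this is not how biopandas itself reads files.

```python
# Column ranges copied from the PDB half of the table: 1-based and inclusive.
# "record_name" is an assumed key; the table lists no FIELD name for columns 1-6.
ATOM_FIELDS = [
    ("record_name", 1, 6, str),
    ("serial", 7, 11, int),
    ("name", 13, 16, str),
    ("altLoc", 17, 17, str),
    ("resName", 18, 20, str),
    ("chainID", 22, 22, str),
    ("resSeq", 23, 26, int),
    ("iCode", 27, 27, str),
    ("x", 31, 38, float),
    ("y", 39, 46, float),
    ("z", 47, 54, float),
    ("occupancy", 55, 60, float),
    ("tempFactor", 61, 66, float),
    ("element", 77, 78, str),
    ("charge", 79, 80, str),
]


def parse_atom_line(line):
    """Slice one ATOM/HETATM line into a dict keyed by PDB field name."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    record = {}
    for field, first, last, convert in ATOM_FIELDS:
        raw = line[first - 1:last].strip()
        # Blank fields (e.g. no altLoc) become None instead of failing in int()/float().
        record[field] = convert(raw) if raw else None
    return record


# Sample line assembled piece by piece so the column positions are easy to check.
SAMPLE_LINE = "".join([
    "ATOM  ",    # 1-6   record name
    "    1",     # 7-11  serial
    " ",         # 12    unused
    " N  ",      # 13-16 name
    " ",         # 17    altLoc
    "MET",       # 18-20 resName
    " ",         # 21    unused
    "A",         # 22    chainID
    "   1",      # 23-26 resSeq
    " ",         # 27    iCode
    "   ",       # 28-30 unused
    "  27.340",  # 31-38 x
    "  24.430",  # 39-46 y
    "   2.614",  # 47-54 z
    "  1.00",    # 55-60 occupancy
    "  9.67",    # 61-66 tempFactor
    " " * 10,    # 67-76 not covered by the table
    " N",        # 77-78 element
    "  ",        # 79-80 charge
])

atom = parse_atom_line(SAMPLE_LINE)
print(atom["serial"], atom["name"], atom["resName"], atom["x"])
```

Whatever the columns end up being called, the slicing stays the same; only the keys in `ATOM_FIELDS` would change.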

rasbt commented

Thanks for the feedback and for bringing it up.

The chosen naming convention was mostly based on what I found intuitive wrt the rcsb PDB "definition" column -- I didn't think about VMD at all when I chose those column names.

I agree with you that this may be confusing or inconvenient for a VMD user, though. Since this is a dev tool, I think making a decision/change "now" (i.e., sooner rather than later) seems reasonable. Here are some thoughts:

record_name vs atom

I wouldn't change that to 'atom', since the record_name column could also store HETATM and ANISOU entries, for instance.

atom_number vs index

I'd stick with atom_number here, because it is clearer what this refers to. E.g., in the context of pandas, index could also mean the DataFrame index.

atom_name vs name

Again, I find the atom prefix clearer, since 'name' is more ambiguous (atom vs residue vs molecule name).

alt_loc vs altLoc

Not sure, for some reason I don't like camelcase (except for Python classes). "Don't like" is actually a bit too strong: I prefer lowercase whenever possible, because I don't have to remember which letter to capitalize (okay, but there's an underscore one might say ...).

residue_name vs resname

No strong preference here; maybe residue_name is a bit (too) verbose!? What's confusing though is that VMD uses resname instead of e.g., resName like it does for altLoc

x_coord, y_coord, z_coord vs x, y, z

That's one thing that could be shortened. _coord is maybe a bit too verbose. On the other hand, it's clearer what this refers to (thinking of math variables).

b_factor vs beta

I'd go with b_factor or temp_factor here; beta is too ambiguous imho (again, it may sound like a generic variable name).

element_symbol vs element

That could definitely be shortened to just element; I don't see any reason to make it more verbose.

Although I actually like the current column names, I agree that it's maybe a bit annoying if they are different from the PDB specs. I think changing them to the PDB spec column names sounds reasonable. Another thing to think about is other data formats (SDF, MOL2, ...). I.e., what's more useful: having column names that are consistent across different data formats, or using column names that are specific to each data format? E.g., in MOL2, there's "atom_id & atom_name" vs "serial & name" in PDB.

I am slightly leaning towards using the official "field names" from each format's spec, though ... And I guess this change should be done sooner rather than later :P
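
If the names do change, a lookup between the three naming schemes could ease the transition. The sketch below takes its rows from the table in the opening post; `NAME_TABLE`, `rename_map`, and `translate` are hypothetical helpers, not part of biopandas. The record_name and segment_id rows are left out because the table gives no PDB FIELD name for them.

```python
# Rows: (biopandas name at the time of this issue, VMD keyword, PDB spec FIELD),
# copied from the comparison table in the opening post.
NAME_TABLE = [
    ("atom_number", "index", "serial"),
    ("atom_name", "name", "name"),
    ("alt_loc", "altloc", "altLoc"),
    ("residue_name", "resname", "resName"),
    ("chain_id", "chain", "chainID"),
    ("residue_number", "resid", "resSeq"),
    ("insertion", "insertion", "iCode"),
    ("x_coord", "x", "x"),
    ("y_coord", "y", "y"),
    ("z_coord", "z", "z"),
    ("occupancy", "occupancy", "occupancy"),
    ("b_factor", "beta", "tempFactor"),
    ("element_symbol", "element", "element"),
    ("charge", "charge", "charge"),
]

# Position of each naming scheme inside a NAME_TABLE row.
SCHEMES = {"biopandas": 0, "vmd": 1, "pdb": 2}


def rename_map(source, target):
    """Return {source-scheme name: target-scheme name} for every table row."""
    s, t = SCHEMES[source], SCHEMES[target]
    return {row[s]: row[t] for row in NAME_TABLE}


def translate(record, source, target):
    """Rename the keys of one atom record; unknown keys pass through unchanged."""
    mapping = rename_map(source, target)
    return {mapping.get(key, key): value for key, value in record.items()}


old_style = {"atom_number": 1, "residue_name": "MET", "b_factor": 9.67}
print(translate(old_style, "biopandas", "pdb"))
```

The dict returned by `rename_map` has the shape that pandas' `DataFrame.rename(columns=...)` accepts, so the same table could in principle drive a one-line column rename on an existing DataFrame.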

StevenCHowell commented

alt_loc vs altLoc
Not sure, for some reason I don't like camelcase (except for Python classes). "Don't like" is actually a bit too strong: I prefer lowercase whenever possible, because I don't have to remember which letter to capitalize (okay, but there's an underscore one might say ...).

Sorry, that was a typo. The PDB standard uses camel case, but VMD uses one word, all lowercase.

What's confusing though is that VMD uses resname instead of e.g., resName like it does for altLoc

Fixing my typo resolves this discrepancy.

I must admit, I am not too familiar with other formats, SDF, MOL2, etc. Are these each handled through their own object, similar to PandasPdb, e.g., PandasSdf? If that's the case, I think having a separate standard for each, aligned with the SDF, MOL2, etc, standards makes sense.

Another idea/approach that may make sense would be to have some base Molecule class that would have methods to populate properties by reading in different files, PDB, SDF, MOL2, etc (though what if the structure model is actually multiple molecules? maybe you would want a System class that inherits from Molecule?). In this situation, it would be best to either have a single standard way to query/return molecule properties, e.g., residue name, coordinates, etc, or to accommodate the standard for each file type by having multiple queries for the same thing.

rasbt commented

I must admit, I am not too familiar with other formats, SDF, MOL2, etc. Are these each handled through their own object, similar to PandasPdb, e.g., PandasSdf?

SDF is probably the most popular format for small molecules I'd say. I work with small molecules a lot (actually much more often than with proteins and PDB files) but I almost exclusively work with MOL2 files. The reason is that SDF doesn't store partial charge information, which is (almost) crucial for analyzing electrostatic features and creating pharmacophores imho. But back to the question: there's currently a PandasMol2, analogous to PandasPdb, as well (http://rasbt.github.io/biopandas/tutorials/Working_with_MOL2_Structures_in_DataFrames/). I am going to make a PandasSdf at some point, but it isn't on my priority list, yet.

If that's the case, I think having a separate standard for each, aligned with the SDF, MOL2, etc, standards makes sense.

Yeah, so now that finals week is over, I am probably going to change the column format to the original specs this weekend or so -- sooner rather than later. I am currently using PandasMol2 in one of my research projects for some analysis and am planning to submit the manuscript in 2-3 weeks together with a software package, so I'd like to make this change now, before I have to worry about dependencies, versioning, and those things.

Another idea/approach that may make sense would be to have some base Molecule class that would have methods to populate properties by reading in different files, PDB, SDF, MOL2, etc (though what if the structure model is actually multiple molecules? maybe you would want a System class that inherits from Molecule?). In this situation, it would be best to either have a single standard way to query/return molecule properties, e.g., residue name, coordinates, etc, or to accommodate the standard for each file type by having multiple queries for the same thing.

I was thinking about that when I made the PandasMol2 class. However, I think that's generally not feasible because the formats are so different. I.e., the methods would have to be implemented for each class (PandasMol2, PandasPdb) individually anyway, so inheritance would be kind of overkill. The idea right now is to at least make the method names consistent. Also, with future features in mind, I'd prefer not to impose any common-denominator requirements at this point. Working with proteins vs small molecules (e.g., for ligand-based virtual screening) requires very different workflows etc.
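
The "consistent method names, no shared base class" idea can be sketched roughly as follows. `PdbAtoms`, `Mol2Atoms`, `read_text`, and `coordinates` are hypothetical names chosen for illustration and are not the PandasPdb/PandasMol2 API; each class keeps the column names of its own format and parses it in its own way, yet calling code can treat both alike.

```python
class PdbAtoms:
    """Hypothetical PDB container; rows are keyed by PDB spec field names."""

    def __init__(self):
        self.rows = []

    def read_text(self, text):
        for line in text.splitlines():
            # PDB is a fixed-width format: fields are found by column position.
            if line.startswith(("ATOM", "HETATM")):
                self.rows.append({
                    "serial": int(line[6:11]),
                    "name": line[12:16].strip(),
                    "x": float(line[30:38]),
                    "y": float(line[38:46]),
                    "z": float(line[46:54]),
                })
        return self

    def coordinates(self):
        return [(r["x"], r["y"], r["z"]) for r in self.rows]


class Mol2Atoms:
    """Hypothetical MOL2 container; rows are keyed by MOL2 field names."""

    def __init__(self):
        self.rows = []

    def read_text(self, text):
        in_atom_section = False
        for line in text.splitlines():
            # MOL2 is section-based and whitespace-delimited.
            if line.startswith("@<TRIPOS>"):
                in_atom_section = line.strip() == "@<TRIPOS>ATOM"
                continue
            parts = line.split()
            if in_atom_section and len(parts) >= 6:
                self.rows.append({
                    "atom_id": int(parts[0]),
                    "atom_name": parts[1],
                    "x": float(parts[2]),
                    "y": float(parts[3]),
                    "z": float(parts[4]),
                    "atom_type": parts[5],
                })
        return self

    def coordinates(self):
        return [(r["x"], r["y"], r["z"]) for r in self.rows]


def centroid(structure):
    """Works with either class because only the shared method name is used."""
    coords = structure.coordinates()
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))
```

Nothing is inherited here: `centroid` relies only on both classes offering a `coordinates()` method, which is the kind of consistency described above.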

One future feature could be a converter function though, which converts one DataFrame-based class to another (a_pandasMol2_instance = convert(in=a_PandasPdb_instance, out=PandasMol2)).
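
A minimal sketch of what such a converter might look like, under stated assumptions: `PdbRecords` and `Mol2Records` are hypothetical stand-ins holding rows as plain dicts (not the real PandasPdb/PandasMol2 classes), and the column mapping covers only fields with an obvious counterpart. Because `in` is a reserved word in Python, the first parameter is called `source` here.

```python
class PdbRecords:
    """Hypothetical stand-in for a PDB-backed container (rows as dicts)."""

    def __init__(self, rows):
        self.rows = rows


class Mol2Records:
    """Hypothetical stand-in for a MOL2-backed container (rows as dicts)."""

    def __init__(self, rows):
        self.rows = rows


# (source type, target type) -> {source column: target column}.
# Only columns with a direct counterpart are listed; MOL2's atom_type, for
# example, cannot be copied from a PDB record and is therefore absent.
COLUMN_MAPS = {
    (PdbRecords, Mol2Records): {
        "serial": "atom_id",
        "name": "atom_name",
        "x": "x",
        "y": "y",
        "z": "z",
        "resSeq": "subst_id",
        "resName": "subst_name",
    },
}


def convert(source, out):
    """Build an instance of `out` from `source` by renaming mapped columns."""
    try:
        mapping = COLUMN_MAPS[(type(source), out)]
    except KeyError:
        raise TypeError(
            f"no conversion from {type(source).__name__} to {out.__name__}"
        ) from None
    # Columns without a mapping (e.g. tempFactor) are dropped.
    rows = [
        {new: row.get(old) for old, new in mapping.items()}
        for row in source.rows
    ]
    return out(rows)


pdb = PdbRecords([
    {"serial": 1, "name": "N", "x": 27.34, "y": 24.43, "z": 2.614,
     "resSeq": 1, "resName": "MET", "tempFactor": 9.67},
])
mol2 = convert(pdb, out=Mol2Records)
print(mol2.rows[0])
```

The sketch also shows why a converter is inherently lossy: fields without an equivalent on the other side either disappear or have to be computed separately.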