Column names differ from VMD. Were they chosen intentionally?
StevenCHowell opened this issue · 4 comments
I just started using biopandas and it seems nice. To promote discussion of its design, I have a general question about the column names, particularly for df['ATOM']. Note that I work with simulations of molecular structures, so I care most about the df['ATOM'] field.
Having used VMD for several years, I noticed many differences in how the PDB fields are accessed. I have not compared to Chimera or PyMol, but below is a comparison between biopandas, VMD, and the documentation for the PDB file format. I would prefer the VMD options, partly because they are familiar, but also because they are more succinct. They are probably not as clear as a verbose name, but they are a closer match to the PDB field.
@rasbt, did you have a strong motivation for the selections you made? If anyone else has other ideas or preferences, I would be interested to hear them.
BIOPANDAS       VMD        | COLUMNS  DATA TYPE     FIELD       DEFINITION
---------------------------|------------------------------------------------------------------------------
record_name     atom       | 1 - 6    Record name   "ATOM  "
atom_number     index      | 7 - 11   Integer       serial      Atom serial number.
atom_name       name       | 13 - 16  Atom          name        Atom name.
alt_loc         altloc     | 17       Character     altLoc      Alternate location indicator.
residue_name    resname    | 18 - 20  Residue name  resName     Residue name.
chain_id        chain      | 22       Character     chainID     Chain identifier.
residue_number  resid      | 23 - 26  Integer       resSeq      Residue sequence number.
insertion       insertion  | 27       AChar         iCode       Code for insertion of residues.
x_coord         x          | 31 - 38  Real(8.3)     x           Orthogonal coordinates for X in Angstroms.
y_coord         y          | 39 - 46  Real(8.3)     y           Orthogonal coordinates for Y in Angstroms.
z_coord         z          | 47 - 54  Real(8.3)     z           Orthogonal coordinates for Z in Angstroms.
occupancy       occupancy  | 55 - 60  Real(6.2)     occupancy   Occupancy.
b_factor        beta       | 61 - 66  Real(6.2)     tempFactor  Temperature factor.
segment_id      segname    |
element_symbol  element    | 77 - 78  LString(2)    element     Element symbol, right-justified.
charge          charge     | 79 - 80  LString(2)    charge      Charge on the atom.
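For anyone who wants the shorter VMD-style names in the meantime, one possible workaround is to rename the columns after loading. The sketch below uses plain pandas on a small hand-made DataFrame instead of a real PDB file. The mapping is copied from the table above; the name BIOPANDAS_TO_VMD and the atom values are made up for illustration, and the biopandas column names may differ in other versions.

```python
import pandas as pd

# Mapping copied from the comparison table (biopandas name -> VMD name).
# record_name is left unmapped in this sketch; columns whose names already
# match (insertion, occupancy, charge) need no entry.
BIOPANDAS_TO_VMD = {
    "atom_number": "index",
    "atom_name": "name",
    "alt_loc": "altloc",
    "residue_name": "resname",
    "chain_id": "chain",
    "residue_number": "resid",
    "x_coord": "x",
    "y_coord": "y",
    "z_coord": "z",
    "b_factor": "beta",
    "segment_id": "segname",
    "element_symbol": "element",
}

# Stand-in for df['ATOM']; the values are invented.
atoms = pd.DataFrame({
    "record_name": ["ATOM", "ATOM"],
    "atom_number": [1, 2],
    "atom_name": ["N", "CA"],
    "residue_name": ["MET", "MET"],
    "x_coord": [1.0, 2.0],
})

# rename() silently skips mapping keys that are not present as columns.
vmd_style = atoms.rename(columns=BIOPANDAS_TO_VMD)
print(list(vmd_style.columns))
```

Because rename() returns a new DataFrame, the original column names stay available on `atoms`.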
Thanks for the feedback and for bringing it up.
The chosen naming convention was mostly based on what I found intuitive with respect to the RCSB PDB "definition" column -- I didn't think about VMD at all when I chose those column names.
I agree with you that this may be confusing/inconvenient for a VMD user though ... Since this is a dev tool, I think making a decision/change "now" seems reasonable (aka rather sooner than later). Here are some thoughts:
record_name vs atom
I wouldn't change that to 'atom', since the record_name column could also store HETATM and ANISOU entries, for instance.
atom_number vs index
I'd stick with atom_number here, because it is clearer what this refers to. E.g., in the context of pandas, index could also mean the DataFrame index.
atom_name vs name
Again, I find the atom prefix clearer, since 'name' is more ambiguous (atom vs residue vs molecule name).
alt_loc vs altLoc
Not sure, for some reason I don't like camelcase (except for Python classes). "Don't like" is actually a bit too strong: I prefer lowercase whenever possible, because I don't have to remember which letter to capitalize (okay, but there's an underscore one might say ...).
residue_name vs resname
No strong preference here; maybe residue_name is a bit (too) verbose!? What's confusing though is that VMD uses resname instead of e.g., resName like it does for altLoc.
x_coord, y_coord, z_coord vs x, y, z
That's one thing that could be shortened. _coord is a bit too verbose maybe. On the other hand, it's clearer what this refers to (thinking of math variables).
b_factor vs beta
I'd go with b_factor or temp_factor here; beta is too ambiguous imho (again, may sound like a generic variable name).
element_symbol vs element
That could definitely be shortened to just element; I don't see any reason to make it more verbose?!
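The ambiguity around a column called index can be made concrete with a small pandas example. This is only a sketch with invented atom values; the column name "index" follows the VMD column of the table, not anything biopandas currently provides.

```python
import pandas as pd

# Invented values; the "index" column plays the role of the atom serial number.
atoms = pd.DataFrame({
    "index": [1, 2, 3, 4],
    "name": ["N", "CA", "C", "O"],
})

# Keep only the last two atoms. Filtering preserves the original row labels.
subset = atoms[atoms["index"] > 2]

print(list(subset.index))     # row labels of the DataFrame
print(list(subset["index"]))  # values of the column named "index"
```

After filtering, `subset.index` and `subset["index"]` hold different numbers, which is easy to mix up when reading code quickly.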
Although I actually like the current column names, I agree that it's maybe a bit annoying if they are different from the PDB specs. I think changing them to the PDB spec column names sounds reasonable. Another thing to think about is other data formats (SDF, MOL2, ...). I.e., what's more useful: having column names consistent across different data formats, or using column names that are specific to each data format? E.g., in MOL2, there's "atom_id & atom_name" vs "serial & name" in PDB.
I am slightly leaning towards using the official "field names" from each format's spec though ... And I guess this change should be done rather sooner than later :P
alt_loc vs altLoc
Not sure, for some reason I don't like camelcase (except for Python classes). "Don't like" is actually a bit too strong: I prefer lowercase whenever possible, because I don't have to remember which letter to capitalize (okay, but there's an underscore one might say ...).
Sorry, that was a typo. The PDB standard uses camel case, but VMD uses a single all-lowercase word.
What's confusing though is that VMD uses resname instead of e.g., resName like it does for altLoc
Fixing my typo resolves this discrepancy.
I must admit, I am not too familiar with other formats, SDF, MOL2, etc. Are these each handled through their own object, similar to PandasPdb, e.g., PandasSdf? If that's the case, I think having a separate standard for each, aligned with the SDF, MOL2, etc, standards makes sense.
Another idea/approach that may make sense would be to have some base Molecule class that would have methods to populate properties by reading in different files, PDB, SDF, MOL2, etc (though what if the structure model is actually multiple molecules? maybe you would want a System class that inherits from Molecule?). In this situation, it would be best to either have a single standard way to query/return molecule properties, e.g., residue name, coordinates, etc, or to accommodate the standard for each file type by having multiple queries for the same thing.
I must admit, I am not too familiar with other formats, SDF, MOL2, etc. Are these each handled through their own object, similar to PandasPdb, e.g., PandasSdf?
SDF is probably the most popular format for small molecules I'd say. I work with small molecules a lot (actually much more often than with proteins and PDB files) but I almost exclusively work with MOL2 files. The reason is that SDF doesn't store partial charge information, which is (almost) crucial for analyzing electrostatic features and creating pharmacophores imho. But back to the question: there's currently a PandasMol2, analogous to PandasPdb, as well (http://rasbt.github.io/biopandas/tutorials/Working_with_MOL2_Structures_in_DataFrames/). I am going to make a PandasSdf at some point, but it isn't on my priority list, yet.
If that's the case, I think having a separate standard for each, aligned with the SDF, MOL2, etc, standards makes sense.
Yeah, so now that finals week is over, I am probably going to change the column format to the original specs this weekend or so -- rather sooner than later. I am currently using PandasMol2 in one of my research projects for some analysis and am planning to submit the manuscript in 2-3 weeks coupled with a software package, so I'd rather make this change now, before I have to worry about dependencies and versioning and those things.
Another idea/approach that may make sense would be to have some base Molecule class that would have methods to populate properties by reading in different files, PDB, SDF, MOL2, etc (though what if the structure model is actually multiple molecules? maybe you would want a System class that inherits from Molecule?). In this situation, it would be best to either have a single standard way to query/return molecule properties, e.g., residue name, coordinates, etc, or to accommodate the standard for each file type by having multiple queries for the same thing.
I was thinking about that when I made the PandasMol2 class. However, I think that's generally not feasible because the formats are so different. I.e., the methods would have to be implemented for each class (PandasMol2, PandasPdb) individually anyway, so inheritance would be kind of overkill. The idea right now is to make the method names at least consistent. Also, with future features in mind, I'd prefer not to impose any common denominator requirements at this point. Working with proteins vs small molecules (e.g., for ligand-based virtual screening) requires very different workflows etc.
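The "consistent method names without inheritance" idea can be sketched with typing.Protocol. Everything below is hypothetical: ToyPdb, ToyMol2, HasDistance, and nearest are illustration-only names, not biopandas classes, and the column names are assumed from this thread. Each toy class implements distance() on its own columns, yet a helper written against the shared method name works with both.

```python
from typing import Protocol, Tuple

import pandas as pd

Point = Tuple[float, float, float]


class HasDistance(Protocol):
    """Hypothetical shared method name; no common base class is required."""

    def distance(self, xyz: Point) -> pd.Series: ...


class ToyPdb:
    """Toy stand-in using PDB-style coordinate columns (assumed names)."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df

    def distance(self, xyz: Point) -> pd.Series:
        x, y, z = xyz
        d = self.df
        return ((d["x_coord"] - x) ** 2 + (d["y_coord"] - y) ** 2
                + (d["z_coord"] - z) ** 2) ** 0.5


class ToyMol2:
    """Toy stand-in using MOL2-style coordinate columns (assumed names)."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df

    def distance(self, xyz: Point) -> pd.Series:
        x, y, z = xyz
        d = self.df
        return ((d["x"] - x) ** 2 + (d["y"] - y) ** 2 + (d["z"] - z) ** 2) ** 0.5


def nearest(structure: HasDistance, xyz: Point):
    """Return the row label of the atom closest to xyz."""
    return structure.distance(xyz).idxmin()


pdb = ToyPdb(pd.DataFrame({"x_coord": [0.0, 3.0], "y_coord": [0.0, 4.0],
                           "z_coord": [0.0, 0.0]}))
mol2 = ToyMol2(pd.DataFrame({"x": [0.0, 3.0], "y": [0.0, 4.0], "z": [0.0, 0.0]}))

print(nearest(pdb, (0.0, 0.0, 0.0)))
print(nearest(mol2, (3.0, 4.0, 0.0)))
```

The trade-off is the one described above: the two implementations are duplicated, but neither format has to fit a common denominator.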
One future feature could be a converter function though, which converts one DataFrame-based class to another (a_pandasMol2_instance = convert(in=a_PandasPdb_instance, out=PandasMol2)).
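A rough sketch of what such a converter might look like at the DataFrame level follows. It is not an existing biopandas function: convert_atoms and PDB_TO_MOL2 are hypothetical names, the column correspondence is an assumption based on the names mentioned in this thread, and the keyword is called source because in is a reserved word in Python.

```python
import pandas as pd

# Assumed column correspondence (PDB-style name -> MOL2-style name).
PDB_TO_MOL2 = {
    "atom_number": "atom_id",
    "atom_name": "atom_name",
    "x_coord": "x",
    "y_coord": "y",
    "z_coord": "z",
}


def convert_atoms(source: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Return a new DataFrame holding only the mapped columns, renamed."""
    missing = [col for col in mapping if col not in source.columns]
    if missing:
        raise KeyError(f"source is missing columns: {missing}")
    return source[list(mapping)].rename(columns=mapping)


# Invented PDB-style atom table for demonstration.
pdb_atoms = pd.DataFrame({
    "record_name": ["ATOM", "ATOM"],
    "atom_number": [1, 2],
    "atom_name": ["N", "CA"],
    "x_coord": [1.0, 2.0],
    "y_coord": [0.5, 1.5],
    "z_coord": [0.0, 0.0],
})

mol2_atoms = convert_atoms(source=pdb_atoms, mapping=PDB_TO_MOL2)
print(list(mol2_atoms.columns))
```

A real converter would need more than renaming, since some fields (for example MOL2 partial charges) are not covered by this toy mapping.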