Handling multi-PDB files
rasbt opened this issue · 0 comments
I am cross-posting a discussion from the mailing list with regard to multi-PDB files containing MODEL & ENDMDL tags, which are currently not handled by BioPandas.
However, this should definitely be handled one way or another. I don't have a clear best idea yet on how to handle it and would welcome any thoughts and feedback (I am cross-posting this to the GitHub issue tracker, which may be a better place to continue the discussion about potential ways to implement it).
I think one of the problems with the DataFrame format is that having all models in one DataFrame would probably lead to a lot of weird, or at least unexpected, results, so it would probably be best to separate the structures one way or another ...
- One option would be to provide a utility function (analogous to the split_multimol2 function, http://rasbt.github.io/biopandas/tutorials/Working_with_MOL2_Structures_in_DataFrames/#parsing-multi-mol2-files) that generates multiple PandasPdb objects from such a file. I.e., it would simply be a list
pdbs = [pdb_1, pdb_2, ..., pdb_n]
which would preserve the current functionality of the library without any backwards-incompatible changes. This would also make it easier to use the multiprocessing library to analyze multiple PandasPdb objects in parallel.
- Right now, each PandasPdb object has a dictionary containing multiple DataFrames
dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])
For multi-PDB files, the dictionary could be expanded to
dict_keys(['ATOM_1', 'HETATM_1', 'ANISOU_1', 'OTHERS_1', 'ATOM_2', 'HETATM_2', 'ANISOU_2', 'OTHERS_2', ...])
I strongly favor the first option (the utility function); however, I would love to hear feedback on this and am open to other suggestions!
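To make the first option a bit more concrete, here is a minimal sketch of what the splitting step could look like. The name split_multi_pdb is hypothetical and not part of BioPandas; the sketch only groups the raw lines per model and, as a simplifying assumption, ignores records outside of MODEL ... ENDMDL blocks (e.g., header records).

```python
def split_multi_pdb(pdb_text):
    """Hypothetical helper (not a BioPandas function).

    Splits the text of a multi-model PDB file into a list with one
    entry per model; each entry is the list of lines found between
    a MODEL record and the following ENDMDL record.
    """
    models = []
    current = None  # None while we are outside a MODEL ... ENDMDL block
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # PDB record names occupy columns 1-6
        if record == "MODEL":
            current = []  # start collecting a new model
        elif record == "ENDMDL":
            if current is not None:
                models.append(current)
            current = None
        elif current is not None:
            current.append(line)
        # Assumption: lines outside of any model are dropped here.
    return models
```

Each entry of the returned list could then be turned into its own PandasPdb object, which would give the pdbs list described above; how that construction step should look is left open here.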
In any case, an error (or at least a warning) should be raised if MODEL & ENDMDL tags are found in a PDB file when the current read_pdb method is used, so that this doesn't lead to unexpected behavior.
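A rough sketch of such a check, using only the standard library, might look like the following. The function name check_single_model and the warning text are assumptions for illustration; where exactly this would hook into read_pdb is not decided here.

```python
import warnings


def check_single_model(pdb_text):
    """Hypothetical check (not a BioPandas function).

    Returns True if no MODEL/ENDMDL records are present; otherwise
    emits a UserWarning and returns False.
    """
    for line in pdb_text.splitlines():
        # PDB record names occupy columns 1-6
        if line[:6].strip() in ("MODEL", "ENDMDL"):
            warnings.warn(
                "MODEL/ENDMDL records found: this looks like a multi-model "
                "PDB file, which is currently not handled.",
                UserWarning,
            )
            return False
    return True
```

Raising an exception instead of warning would be the stricter variant; a warning keeps existing scripts running while still flagging the problem.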