mittinatten/freesasa

Add CIF output option

mittinatten opened this issue · 19 comments

Suggested by @danny305. Related to #48.

It's best if we can do this without porting the library to C++ (but if we have to, it might be worth it).

How do you recommend I go about this?

So, as mentioned, I think we should first make an attempt at doing this without converting the library to C++. Line 739 in main.cc is what triggers the output. We should be able to to manage it by surrounding that with an if clause that checks if the output format is CIF and in that case call a different function, that could be located in cif.cc.

With regards to format, I think we should try to make it as unsurprising as possible (and maybe investigate if there is already something out there that we can adopt, without inventing our own?). You're the expert here :)

I think we should aim at having the same level of detail as in the xml/json formats, parameterized by the depth option.

Could you elaborate on what you mean by parameter I rd by the depth option? Or show where in the code I can see this?

The depth option determines the level of detail in the output (set by the --depth CLI option). I.e. should we only output results for the whole structure or break it down to chains, residues or atoms. For the json format I find out which is deepest node type in the result tree to be included, on lines 245-247 here: https://github.com/mittinatten/freesasa/blob/master/src/json.c. Then on lines 174-179 I check if we have reached the deepest/lowest level and prune the tree accordingly in the switch statement below that.

However, CIF is already such a verbose format, that one might argue that it makes more sense simply to include all levels always and skip the depth parameter. What do you think?

There's one more important thing we need to consider, that I completely forgot about. We need a method to link the output to the original file. I assume the idea is that the new file is the old one with the extra data appended. Then we need to preserve that link somehow. I can think of a few hacks on top of my head, that would probably work fine, but ideally we should do something pretty 😄

The current solution for producing PDB output stores copies of the pdb-file lines as strings together with each atom, and then they are modified when the output is produced. It's an efficient solution in terms of lines of code and bytes moved around, but mixes up responsibilities a bit (and parts of the file are discarded), and I wouldn't do it that way now if I rewrote the thing from scratch. If we come up with something good for CIF files we can perhaps reuse that solution for PDB files as well in a future iteration.

These are the hacks I could think of:

1a. Store an anonymous reference to the source together with the structure

Add an optional void pointer to freesasa_structure that keeps a reference to the gemmi document or structure, that can be cast back later when generating the output. Pretty messy, but probably doesn't require much extra work. We could use a setter, freesasa_structure_attach_reference_to_source(void *src) (or ...attach_source_ptr()), and the corresponding getter, to make it clear what it's for. When we write the output we send the structure(s) together with the results and match them up.

1b. Store tagged reference with the structure

Add the same pointer and add a tag property, that is enum-valued, i.e.

typedef enum {
  STRUCTURE_SOURCE_UNKNOWN,
  STRUCTURE_SOURCE_PDB,
  STRUCTURE_SOURCE_CIF
} structure_source_type;

struct freesasa_structure {
    struct atoms atoms;
    struct residues residues;
    struct chains chains;
    char *classifier_name;
    coord_t *xyz;
    int model; /* model number */
    void *source;
    structure_source_type type;
};

And then add a function freesasa_structure_attach_cif_reference(void * src) (...attach_cif_ptr(...)) that adds the reference and sets the correct tag. And then freesasa_structure_get_cif_reference()would return NULL if the tag is wrong. This is kind of overkill right now, but possiblye protects us agains future bugs, and it is more explicit - so maybe possibly more intuitive.

2. Store reference in CLI next to structure

Ideally though, freesasa_structure should be agnostic of the source, except maybe a reference to file name for logging purposes. So this leads us to the next hack: Let freesasa_structure_from_cif and freesasa_cif_structure_array return a wrapper object that contains both the soruce document and the freesasa_structure, and then keep track of that in main.cc, and use that to pass the wrapper back when we generate the output. This requires rewriting main.cc a bit (not necessarily that much), but keeps the library and API untainted, so I think it's probably better.

3. Hybrid between 1 and 2.

Use the wrapper from 2, and then use the source ptr trick, but for the results object instead. I think this is cleaner than 1, but not sure compared to 2.

4. ??

Do you have any other ideas?

However, CIF is already such a verbose format, that one might argue that it makes more sense simply to include all levels always and skip the depth parameter. What do you think?

I totally agree here with this point. You can just add a freesasa block that includes all the depth options.

And then for the atom level depths, just append the freesasa value as a column so when a reader is parsing the atom they can read in all the data at once. (This is what we would like to do for our ML)

I totally agree here with this point. You can just add a freesasa block that includes all the depth options.

And then for the atom level depths, just append the freesasa value as a column so when a reader is parsing the atom they can read in all the data at once. (This is what we would like to do for our ML)

Great, lets go for that. One thing we will have to consider is what to do with atoms that were not part of the calculation: hydrogens, hetatms, skipped for other reasons, etc. I suggest we use '.' or '?' for that (whichever is more standard).

Yeah I like the '.' but ill check with the standard. I won't be able to actively focus on this probably for about a week. I have some other stuff I need to focus on for a bit. But I will submit a PR when I got something worth sharing and post to this issues if I need to bug you about stuff along the way.

Cool. I will try to figure out the remaining quirks with CIF input in the mean time.

Can you give me a TLDR explanation of your the tree structure you create to store all the freesasa values?

I am trying to figure out the best way to parse this and simultaneously append it to a cif out file and I know nothing about trees.

Ok, so the tree has nodes of the following types

typedef enum {
    FREESASA_NODE_ATOM,      /**< Atom node. */
    FREESASA_NODE_RESIDUE,   /**< Residue node. */
    FREESASA_NODE_CHAIN,     /**< Chain node. */
    FREESASA_NODE_STRUCTURE, /**< Structure node. */
    FREESASA_NODE_RESULT,    /**< Result node, wraps results for one or more related structures. */
    FREESASA_NODE_ROOT,      /**< Root node, wraps one or more unrelated results. */
    FREESASA_NODE_NONE       /**< for specifying not a valid node. */
} freesasa_nodetype;

Where the atoms are the leaves, but there's data at the other levels too. When a calculation is performed, the results are stored in this type of tree. If the input file was split into several structures, the tree will have several structures under the same result node. At each level of the tree, from atom to structure, there is a freesasa_nodearea struct, which can be accessed through the function freesasa_node_area(), which contains the SASA values. So to get the atomic SASA values you have to traverse the tree to reach all the atoms and then read the node area of each. The node areas of higher level nodes are simply the sum of its childrens' node areas.

To traverse tree there's the functions freesasa_node_children(), freesasa_node_next() (for siblings) and freesasa_node_parent(). freesasa_node_type() can be used to check which level you're at. For some node types there are specialized functions such as e.g. freesasa_node_structure_chain_labels() which tells you which chains are in the structure.

The full docs are here: http://freesasa.github.io/doxygen/group__node.html

The JSON output module is probably the simplest illustration of how this is used, I imagine the cif output will be simpler in one sense because we only need to handle the leaves and maybe the top level for a summary. But one complication is that we need to handle atoms that were excluded. https://github.com/mittinatten/freesasa/blob/dev/src/json.c (I just added one fix last week to avoid NaN and Infinities, so its best to look at the dev branch, not master or our feature-branch)

Quick question:

How do I access the following atom pointer (defined in structure.c) from the freesasa tree?

image

I am looping down to the leaf nodes atoms and am struggling.

I want to access the x, y, z coordinates or the chain, res_num, res_name, atom_name combinations to pair cif atoms with freesasa tree atoms.

So the short answer is that it's not directly available from the atom node. But freesasa_node_name() can be used to get chain, res_name and atom_name at the corresponding levels (http://freesasa.github.io/doxygen/group__node.html#ga10627933902bc5428eed987e59663b0a). To get the res_number you have freesasa_node_residue_number() (http://freesasa.github.io/doxygen/group__node.html#ga9cc0f1b3077d9a981912bad2fde276a7). The coordinates are not available.

I guess it could be convenient to have more information directly on the atom node, but the atom struct was intended to be an implementation detail and not part of the API - I think it's good if we can avoid exposing it, since that would introduce unnecessary coupling between modules. One thing we could consider is storing a pointer to the freesasa_cif_atom as an optional reference on the atom and pass it on to the atom node. This would reduce the coupling since it's just an input format, a DTO of sorts, and is already part of the API, just as PDB lines are. But it would require touching quite a few parts of the code, and some juggling of pointers and allocations. So it's probably a bigger undertaking than just keeping track of chain and residue as you traverse the tree.

Another option would be to rewrite freesasa_node_residue_number() to fetch its parent residue if called on an atom. And add corresponding functions for accessing chain labels and residue names for an atom node.

Thanks for that! I almost got a working version that outputs a CIF file. Ill submit a PR sometime this week once the PoC is complete.

One thing I want to bring to your attention: the library is unable to generate freesasa hetatm radii when the input is a CIF format. I did not test this thoroughly but just happen to come across it with one of my testing files. For 2isk it is unable to calculate SASA values bc it does not have atom radii values for the flavin cofactor. However, it has no problem when I pass in the pdb file. It would be great if you can check this out in the meantime. This should probably be its own issue?

Good catch! I intended to test a few variants of CLI options, such as --hetatms in the test cases, but forgot about it. I'll see what I can do.

I am not able to reproduce the problem with 2isk.

I think when I merged your upstream branch it fixed the problem.