GPCR-ModSim/qfepweb

Random failure when uploading a SDF.

xbello opened this issue · 20 comments

When a SDF is uploaded, the system sometimes runs smoothly, and sometimes hangs/crashes.

My main suspect is that the SDF file "pointer" is sometimes GC'd when is passed to RDKit, and in the next round you get an empty reference (use a more real file than an io stream?)

Another possibility is that the file isn't getting rewinded properly sometimes, so RDKit receives a valid file pointer, but at the EOF (assert that file is at 0 before passing?)

Could this have to do with the fact that in some cases RDKit, when it fails to read a molecule just returns None instead of throwing an exception? This is behaviour that they've set deliberately (I'm sure there's a good reason).

It that is true, we have to shield against Nones and... retry? fail (not user-friendly)?

Silencing errors is a mistake, IMO. Even more if they happen randomly.

I'll run some tests, could you point me to an example molecule that is failing (sometimes)?

I'm having some trouble reproducing your bug, so perhaps RDKit isn't the problem here?
image
Could you point me to where you use RDKit to read in the molecules?

Usually, here:

def __init__(self, molecule, core=None, palette="OKABE"):

You'll need to create some object with the field "in_sdf", and put a file there. I'm using "io" from the python stdlib.

The file is also read here:

def two_ligands(sdf_io):
to perform the first validation.

So the current processing involves:

To reiterate: we are not reading from a disk file, but from a memory file.

It's not immediately clear to me where this could go wrong, as you say it might be more of a memory issue. AFAIK RDKit shouldn't have any trouble with loading these molecules.

You could potentially build in two more checks in the validator:

  • if len(suppl) == 1:
    if not len(suppl) > 1: .. (or something similar), this will catch cases where the supplier doesn't contain any molecules (for whatever reason)
  • for the molecules contained in the supplier, check that each molecule has more than 1 atom, by e.g. for mol in supplier: if not mol.GetNumAtoms() > 1 ..

These are just the RDKit checks I can think of, if it is indeed an issue with file passing I unfortunately don't have many ideas.

FWIW, doing simple checks like that with RDKit shouldn't add any considerable computing cost to the workflow:
image

It's not immediately clear to me where this could go wrong, as you say it might be more of a memory issue. AFAIK RDKit shouldn't have any trouble with loading these molecules.

You could potentially build in two more checks in the validator:

Sure we can. And "one only" should be gone, as we plan to support multiple file upload.

just looking at this now, it might be a good idea to add docstrings to mapgen.py as at the moment it is completely unclear what each function does, or why they exist.

my guess is that @shevsea will have the best intuition for this - I'd suggest a simple line to describe the main function idea should be enough, just as in https://github.com/GPCR-ModSim/qfepweb/blob/main/networkgen/mapgen.py#L64.

posting discussion (+ my thoughts) with @xbello here for posterity.

There seem to be two root causes for this issue, being:

  1. in cases where only two ligands are in a group (i.e. either in a charge group, or only two ligands were uploaded to begin with) the cyclisation algorithm fails because an outer_node is unable to connect to another.
  2. in some cases a cycle is created leaving one remaining outer_node that is unable to connect to any other outer_nodes.

Both cases cause the while loop in make_map() to hang. I think ideally we'd refactor this code to just use a for loop (the need for a while here isn't really obvious to me): it should be okay to just loop over outer_nodes, connect them and then call it a day (potentially double-check if everything is in a cycle).

for 1): it would save us a lot of work (and some user compute time) to just catch cases where there are only two ligands and disregard the whole network generation algorithm (since it's obvious what the graph will look like).

for 2): I don't quite understand why in this stage outer_nodes are forced to connect to outer_nodes only. Presumably any outer_node should be cyclised to whichever other node it has the highest similarity (or some other metric) to? I'd be curious to find out the rationale for this @shevsea. Regardless, if this is intended then there should be functionality to revert to connecting to an inner_node in cases where no outer_nodes are available.

To clarify case 1), two outer_nodes cannot never get out of outer_nodes, because the condiction to be an outer_node is to have 1 edge, that by definition is the number of edges two nodes can have. So the two outer_nodes get connected, but are still outer nodes.

Hi both,

Sorry for my long absence, I'm on vacation until Wednesday.

Regarding 1) I find @JenkeScheen suggestion of neglecting the algorithm and just connecting both ligands ideal.

Concerning 2) initially the algorithm was written so in the cyclisation process all the nodes are considered, however I believe this lead me to crowded/high density graphs (since you'll need to add more edges to compensate the removal of the restraint), I also remember this first implementation had poorer performance MUE wise.

okay, should be easy enough to implement. I'll try to find the time to PR something

1) is pretty easy, 2) I'm not sure yet. @xbello, is there any way for me to get your observed behaviour of the while loop 'sometimes' hanging? The multi-sdf https://github.com/GPCR-ModSim/qfepweb/blob/main/networkgen/test_files/CDK2_ligands.sdf seems to work consistently. If you see no way of getting it consistently, does it always hang at the last outer_node (i.e. len(self.outer_nodes(H)) == 1), or do you think there can be multiple during the hang?

The only consistent hang I got was with two ligands.

The 2) is what you said. At least we have a suspect (alone outer_node) we could solve, and keep going until the it happens again. We could exit the while on that condition.

Another option is to keep track on new edges created on each while loop: if no new edges, exit immediately because we are entering an infinite loop. This also works with the 2-nodes event. Makes sense?

yes, that sounds reasonable. Will implement your last point. I think I'll keep in the 'stop if only two ligands' for now because it improves readability.

#17 Closes this. Will reopen if we hit it again.