cmelab/planckton-flow

Difficulty finding jobs by "input" and "density"

Closed this issue · 6 comments

Is your feature request related to a problem? Please describe.
It is hard to find jobs by density and input because of how they are stored in the statepoint file.
When searching by molecule type, you have to include the parentheses and comma, e.g. (P3HT,). Searching by density does not work as intended because the units are appended to the numeric value.

Describe the solution you'd like
We want to be able to put in a number for density and find the files that match. We also want to just put in the name of the compounds and find all matching files.
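
As a rough sketch of the density lookup described above: the snippet assumes the state point stores density as a string of the form "<value> <units>" (for example "1.0 g/cm**3"). That format, the helper names, and the job ids are assumptions made for illustration, not part of planckton-flow.

```python
import math


def split_density(text):
    """Split a string like "1.0 g/cm**3" into (value, units)."""
    value, _, units = text.strip().partition(" ")
    return float(value), units.strip()


def density_matches(text, target, rel_tol=1e-9):
    """Return True if the numeric part of `text` is close to `target`."""
    value, _ = split_density(text)
    return math.isclose(value, target, rel_tol=rel_tol)


# Hypothetical state points keyed by made-up job ids.
statepoints = {
    "job_a": {"density": "1.0 g/cm**3"},
    "job_b": {"density": "0.5 g/cm**3"},
}
matches = [
    job_id
    for job_id, sp in statepoints.items()
    if density_matches(sp["density"], 1.0)
]
print(matches)
```

Comparing the parsed number rather than the raw string means a search for 1.0 would not depend on how the units happen to be written.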

Describe alternatives you've considered
Maybe the molecule name and density should be appended to the job document when the workspace is created.

related to cmelab/planckton#55

@rejrice1 I agree that the unyt quantities (the densities) are not being handled well with signac right now. I could use a little more information about how the "input" arg isn't working.

To expand on adding a molecule_name flag to the job document, I think this would also be useful when initializing simulations from SMILES strings.

For example, say someone is using planckton-flow to study 3 different ITIC molecules and initializing them from SMILES strings. It would be awkward to filter jobs by these unwieldy SMILES strings after the fact. Maybe the init.py file could have a field for molecule name that gets added to the job document.

For some reason, filtering on the "input" key does not work with signac find from the terminal, and I can't figure out why. I think it might be because the actual state point isn't something like "molecule" but "(molecule,)", so maybe there is an issue with parsing the parentheses from the command line?

Ah ok--I don't use signac CLI very much 🤔 Here is a function that you could use in a notebook to retrieve all the jobs that contain a molecule name (e.g. P3HT)

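from collections.abc import Iterable

import signac


def get_molecule(moleculename):
    """Return all jobs whose "input" state point contains moleculename."""
    project = signac.get_project()
    joblist = []
    for job in project:
        for i in job.sp.input:
            # Mixtures are nested sequences; single inputs are plain strings.
            if isinstance(i, Iterable) and not isinstance(i, str):
                match = any(moleculename in x for x in i)
            else:
                match = moleculename in i
            if match:
                joblist.append(job)
                break  # avoid adding the same job more than once
    return joblist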

EDIT: updated function to work with mixtures

example:

get_molecule("P3HT") 

would return a list of the jobs containing P3HT. (This would work with a file named P3HT.mol2 or using the P3HT-gaff key in the init.py file.)
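
The matching rule inside get_molecule can be tried out without a signac project. The sketch below applies the same substring test to plain lists; the helper name input_contains and the sample inputs are made up for illustration.

```python
from collections.abc import Iterable


def input_contains(sp_input, name):
    """Return True if any entry of a (possibly nested) input list contains name."""
    for entry in sp_input:
        if isinstance(entry, Iterable) and not isinstance(entry, str):
            # Nested entry: check each item inside it.
            if any(name in item for item in entry):
                return True
        elif name in entry:
            return True
    return False


print(input_contains(["P3HT-gaff"], "P3HT"))
print(input_contains([["P3HT", "PCBM"]], "PCBM"))
print(input_contains(["PCBM.mol2"], "P3HT"))
```

Because this is a substring test, a short name would also match any longer input string that happens to contain it.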

Adding a molecule name to the job document is a good idea, but I'm not sure how to implement it offhand. Having another parameter that a user has to mess with seems like it could introduce errors... and how do I keep the name tied to the correct input structure? It is easy enough to implement but difficult to implement well. We could tie the name to the input in the form of a tuple, but I already feel like the way mixtures are specified as nested lists causes a lot of confusion. Something like this example, where I have a P3HT/PCBM mixture along with their "names", highlights the messy brackets:

"input": [[("P3HT", "p3htname"),("PCBM", "pcbmname")]]

I worry this makes a sort-of-confusing input system even more confusing.
I'd love to hear feedback on what you all think or how you imagine this could be done well. Is there an obvious solution that I'm missing?
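
For reference, here is a minimal sketch of how the tuple form above might be unpacked so that each name stays paired with its structure. The helper name split_named_input is hypothetical, and nothing like it exists in planckton-flow as far as this thread shows.

```python
def split_named_input(named_input):
    """Split [(structure, name), ...] into two parallel lists."""
    structures = [structure for structure, _ in named_input]
    names = [name for _, name in named_input]
    return structures, names


# One mixture entry, written the same way as the example above.
mixture = [("P3HT", "p3htname"), ("PCBM", "pcbmname")]
structures, names = split_named_input(mixture)
print(structures)
print(names)
```

The unpacking works the same way if the pairs come back as two-element lists rather than tuples, so it would not depend on how the pairs are serialized.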

Oh! alternatively if you like using signac from the command line, here is a way to use signac find:

signac find --sp input | grep -B 1 PCBM

(I'm searching for PCBM in the above example. The -B 1 flag to grep asks it to also display the line before the match, which in this case contains the job ID.)

If this is useful, we could add it to planckton-flow as a bash script.