Matgenix/jobflow-remote

How to add metadata to `flows` docs?

janosh opened this issue ยท 9 comments

There currently appears to be no way of adding metadata like material IDs, formulas, structure provenance and the like to documents added into the flows collection.

i think it would make sense to e.g. look for a metadata attribute on a jobflow.Flow and if found, add that to the flow_doc prior to DB insertion in JobController.add_flow:

if hasattr(flow, 'metadata'):
     flow_doc['metadata'] = flow.metadata

flow_doc = get_initial_flow_doc_dict(flow, job_dicts)

happy to submit a PR for this if there's interest

Hi @janosh, indeed there is no way of setting the metadata in the flow document, thanks for reporting that.
I am not 100% convinced by the solution though. I can see a few potential minor issues:

  1. the fact that jobflow's Flow object does not have a metadata could make this a bit hackish
  2. There may be a confusion between this metadata attribute and the update_metadata method in Flow. One calling this method on a Flow may be tricked in thinking that also the Flow metadata is set.
  3. In principle a user could run a code like this:
    flow1 = Maker1().make()
    flow1.metadata = {"a": 1}
    flow2 = Maker2().make(flow1.output)
    flow = Flow([flow1, flow2])
    submit_flow(flow)
    and no Flow metadata will be added to the DB. Since only the top layer Flow is added in the DB it would be ambiguous what to do with those metadata.
  4. If jobflow introduces a metadata attribute to Flow in the future this may break something.

Adding the metadata attribute to Flownow could help, although will leave point 3 open and point 2 could still be tricky, but at least the behaviour of update_metadata could be documented with respect to the Flow's metadata.

An alternative solution could be to to pass a metadata (or flow_metadata, to be more explicit) argument to submit_flow. This would solve the points above, but would probably feel a bit more clunky.

What do you think?

Thanks @janosh for opening this issue. Indeed, as @gpetretto mentioned, it is currently not possible. Concerning point 1., I propose to include @utf in the discussion. Maybe there is a need (or at least a wish) to have flow metadata. Not sure how (and how easy it would be) this could be added to jobflow itself in the first place and "passed down" to jobflow-remote. The tricky point is that in jobflow, the Flow exists at definition time but not anymore at execution (nor in the database). I think this was done in order to avoid duplication of outputs of jobs in outputs of flows (if they existed). If there is a strong push towards that, maybe we could have a call altogether to discuss options ?

Thanks @janosh for opening this issue. Indeed, as @gpetretto mentioned, it is currently not possible. Concerning point 1., I propose to include @utf in the discussion. Maybe there is a need (or at least a wish) to have flow metadata. Not sure how (and how easy it would be) this could be added to jobflow itself in the first place and "passed down" to jobflow-remote. The tricky point is that in jobflow, the Flow exists at definition time but not anymore at execution (nor in the database). I think this was done in order to avoid duplication of outputs of jobs in outputs of flows (if they existed). If there is a strong push towards that, maybe we could have a call altogether to discuss options ?

I am not sure that it would be particularly tricky to handle this in jobflow and jobflow-remote. Adding the metadata attribute to Flow should not pose particular problems, except that it should be clarified the behaviour of update_metadata. The fact that the Flow stops existing after the Flow is stored in the jobflow-remote DB is not really a problem, since that would be enough to add the metadata to the DB in the way suggested by @janosh.
I assumed that @janosh's requests was only to ease the query of the Flows in jobflow-remote's DB, not to add those metadata to the outputs. Is this correct?

thanks for the quick replies!

@gpetretto 1 - 4 are excellent points and should be handled intuitively and without pitfalls. i should have formulated my issue more like an RFC (which is what this is now anyway ๐Ÿ˜„).

i tried the update_metadata first and was mostly expecting that to be reflected in the submitted flow documents in the database. my hacky solution was step 2 after that didn't work

I assumed that @janosh's requests was only to ease the query of the Flows in jobflow-remote's DB, not to add those metadata to the outputs. Is this correct?

that's correct. though in principle, i think both are useful. but adding metadata to the output seems like a pure jobflow feature and not something jf-remote needs to worry about

maybe we could have a call altogether to discuss options

@davidwaroquiers i could imagine @utf would prefer to discuss on GitHub but happy to do call and to flex to your schedules if i'm mistaken!

just to get everyone's temperature, is the preference here to go with the clunky but explicit option of adding a flow_metadata kwarg to submit_flow? or work with @utf on adding a metadata attribute to the Flow class which then gets picked up by jf-remote?

Hi @janosh, We are going to discuss with @utf about a few open points on jobflow-remote tomorrow. We will also consider this and get back to you soon about this.
As far as I am concerned I would prefer having the metadata attribute to Flow directly in jobflow.

Hi @janosh, we agreed that it would be better to add the metadata attribute to Flow in jobflow. The update_metadata should preferably be updated as well, in order to allow the modification of the Flow's own metadata, of the Jobs in Flow, or both. Changes to the API of update_metadata should be the same both in Flow and Job. Changes to jobflow-remote should be minimal once that is available. Would you be available to implement these changes?

@gpetretto thanks for the update, i like those decisions. i'll get on it

@gpetretto @utf @davidwaroquiers i have a first pass in materialsproject/jobflow#679. any feedback welcome!