How to add metadata to `flows` docs?
janosh opened this issue · 9 comments
There currently appears to be no way of adding metadata like material IDs, formulas, structure provenance and the like to documents added to the `flows` collection.
i think it would make sense to e.g. look for a `metadata` attribute on a `jobflow.Flow` and, if found, add that to the `flow_doc` prior to DB insertion in `JobController.add_flow`:
    if hasattr(flow, 'metadata'):
        flow_doc['metadata'] = flow.metadata
happy to submit a PR for this if there's interest
Hi @janosh, indeed there is no way of setting the metadata in the flow document, thanks for reporting that.
I am not 100% convinced by the solution though. I can see a few potential minor issues:
1. The fact that jobflow's `Flow` object does not have a `metadata` attribute could make this a bit hackish.
2. There may be confusion between this `metadata` attribute and the `update_metadata` method in `Flow`. Someone calling this method on a Flow may be tricked into thinking that the Flow metadata is also set.
3. In principle a user could run code like this:

        flow1 = Maker1().make()
        flow1.metadata = {"a": 1}
        flow2 = Maker2().make(flow1.output)
        flow = Flow([flow1, flow2])
        submit_flow(flow)

   and no Flow metadata will be added to the DB. Since only the top layer Flow is added in the DB it would be ambiguous what to do with that metadata.
4. If jobflow introduces a `metadata` attribute to `Flow` in the future this may break something.
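To make the nested-Flow case in the list above concrete, here is a minimal self-contained sketch. `FakeFlow` and `build_flow_doc` are hypothetical stand-ins, not jobflow or jobflow-remote code. They only mimic the proposed `hasattr` check, under the assumption that just the top-level Flow is turned into a document.

```python
class FakeFlow:
    """Hypothetical stand-in for jobflow.Flow; it only stores its children."""

    def __init__(self, jobs=None):
        self.jobs = list(jobs or [])


def build_flow_doc(flow):
    """Mimic the proposed check: copy `metadata` from the top-level flow only."""
    flow_doc = {"n_children": len(flow.jobs)}
    if hasattr(flow, "metadata"):
        flow_doc["metadata"] = flow.metadata
    return flow_doc


inner = FakeFlow()
inner.metadata = {"a": 1}  # set on the nested flow only
outer = FakeFlow([inner, FakeFlow()])

# The outer flow has no `metadata` attribute, so the nested value is not copied.
doc = build_flow_doc(outer)
```

In this sketch the metadata set on `inner` is silently dropped, which is the ambiguity described in the list.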
Adding the `metadata` attribute to `Flow` now could help, although it will leave point 3 open and point 2 could still be tricky, but at least the behaviour of `update_metadata` could be documented with respect to the `Flow`'s `metadata`.
An alternative solution could be to pass a `metadata` (or `flow_metadata`, to be more explicit) argument to `submit_flow`. This would solve the points above, but would probably feel a bit more clunky.
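A rough sketch of what that explicit-argument option might look like is below. Everything here is an assumption for illustration: `SketchFlow` and `submit_flow_sketch` are hypothetical stand-ins, the "DB" is a plain list, and the real `submit_flow` signature is not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class SketchFlow:
    """Hypothetical stand-in for jobflow.Flow (not the real class)."""

    name: str
    jobs: list = field(default_factory=list)


def submit_flow_sketch(flow, db, flow_metadata=None):
    """Hypothetical stand-in for submit_flow; `db` is a plain list of dicts."""
    # The metadata comes from the explicit argument, so nothing depends on
    # attributes that may or may not exist on the Flow object itself.
    flow_doc = {"name": flow.name, "metadata": dict(flow_metadata or {})}
    db.append(flow_doc)
    return flow_doc


db = []
doc = submit_flow_sketch(SketchFlow("relax"), db, flow_metadata={"formula": "Si"})
```

Because the metadata is passed at submission time, only the top-level Flow is ever involved, at the cost of a slightly clunkier call.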
What do you think?
Thanks @janosh for opening this issue. Indeed, as @gpetretto mentioned, it is currently not possible. Concerning point 1, I propose to include @utf in the discussion. Maybe there is a need (or at least a wish) to have flow metadata. I am not sure how this could be added to jobflow itself in the first place (or how easy that would be) and then "passed down" to jobflow-remote. The tricky point is that in jobflow, the Flow exists at definition time but no longer at execution (nor in the database). I think this was done in order to avoid duplicating outputs of jobs in outputs of flows (if they existed). If there is a strong push towards that, maybe we could have a call all together to discuss options?
I am not sure that it would be particularly tricky to handle this in jobflow and jobflow-remote. Adding the `metadata` attribute to `Flow` should not pose particular problems, except that the behaviour of `update_metadata` should be clarified. The fact that the Flow stops existing after the Flow is stored in the jobflow-remote DB is not really a problem, since that would be enough to add the metadata to the DB in the way suggested by @janosh.
I assumed that @janosh's request was only to ease querying of the Flows in jobflow-remote's DB, not to add that metadata to the outputs. Is this correct?
thanks for the quick replies!
@gpetretto 1 - 4 are excellent points and should be handled intuitively and without pitfalls. i should have formulated my issue more like an RFC (which is what this is now anyway).
i tried `update_metadata` first and was mostly expecting that to be reflected in the submitted `flow` documents in the database. my hacky solution was step 2 after that didn't work
> I assumed that @janosh's request was only to ease querying of the Flows in jobflow-remote's DB, not to add that metadata to the outputs. Is this correct?
that's correct. though in principle, i think both are useful. but adding metadata to the output seems like a pure `jobflow` feature and not something `jf-remote` needs to worry about
> maybe we could have a call all together to discuss options
@davidwaroquiers i could imagine @utf would prefer to discuss on GitHub but happy to do a call and to flex to your schedules if i'm mistaken!
just to get everyone's temperature, is the preference here to go with the clunky but explicit option of adding a `flow_metadata` kwarg to `submit_flow`? or work with @utf on adding a `metadata` attribute to the `Flow` class which then gets picked up by `jf-remote`?
Hi @janosh, we agreed that it would be better to add the `metadata` attribute to `Flow` in jobflow. The `update_metadata` method should preferably be updated as well, in order to allow the modification of the Flow's own metadata, of the Jobs in the Flow, or both. Changes to the API of `update_metadata` should be the same in both `Flow` and `Job`. Changes to jobflow-remote should be minimal once that is available. Would you be available to implement these changes?
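One possible shape for such an `update_metadata`, sketched with stand-in classes rather than jobflow itself. The `target` parameter name and its values (`"flow"`, `"jobs"`, `"both"`) are purely hypothetical, as are `SketchJob` and `SketchFlow`. The only point illustrated is that the same signature can exist on both classes while letting the caller choose what gets updated.

```python
VALID_TARGETS = ("flow", "jobs", "both")


class SketchJob:
    """Hypothetical stand-in for jobflow.Job."""

    def __init__(self, name):
        self.name = name
        self.metadata = {}

    def update_metadata(self, update, target="jobs"):
        if target not in VALID_TARGETS:
            raise ValueError(f"target must be one of {VALID_TARGETS}")
        # A job has no flow-level metadata, so target="flow" is a no-op here.
        if target in ("jobs", "both"):
            self.metadata.update(update)


class SketchFlow:
    """Hypothetical stand-in for jobflow.Flow with its own metadata dict."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.metadata = {}

    def update_metadata(self, update, target="jobs"):
        if target not in VALID_TARGETS:
            raise ValueError(f"target must be one of {VALID_TARGETS}")
        if target in ("flow", "both"):
            self.metadata.update(update)
        if target in ("jobs", "both"):
            for job in self.jobs:
                job.update_metadata(update, target="jobs")


flow = SketchFlow([SketchJob("relax"), SketchJob("static")])
flow.update_metadata({"formula": "Si"}, target="flow")  # Flow only
flow.update_metadata({"tag": "test"}, target="both")    # Flow and its Jobs
```

Defaulting `target` to `"jobs"` in this sketch keeps the existing behaviour (only Jobs are touched) unless the caller opts in, which is one way to avoid the confusion raised earlier in the thread.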
@gpetretto thanks for the update, i like those decisions. i'll get on it
@gpetretto @utf @davidwaroquiers i have a first pass in materialsproject/jobflow#679. any feedback welcome!