pyiron/ironflow

Workflow class in pyiron_contrib as basis for ironflow

Opened this issue · 9 comments

JNmpi commented

Summary

The functionality of @liamhuber's workflow class in pyiron_contrib could be a good starting point for interesting extensions and for merging pyiron_base and ironflow.

@liamhuber, I have only now had a chance to install your workflow demo class and play with it. It is really great work. Even though it is presently only a demo, I see a lot of potential. In the attached Jupyter notebook I have tried to briefly summarize and sketch some of my ideas. It is only a very preliminary sketch, and it would be good to talk about it. The notebook contains many ideas and suggestions. Some of them are probably straightforward to implement, others may be hard or impossible. It is meant more as a collection of ideas than as a list of tasks.

Node-based pyiron concept

Based on the workflow example implemented in pyiron_contrib by Liam, here are a few suggestions/ideas.

%config IPCompleter.greedy=True

import sys
sys.path.insert(1, '..')

sys.path;

import matplotlib.pylab as plt
from pyiron_contrib.workflow.node import Node, node
from pyiron_contrib.workflow.workflow import Workflow
from pyiron_contrib.workflow import nodes

Make Workflow the next generation pyiron Project object

wf = Workflow('my_first_workflow')

Design concepts

A node with a single output should expose this output directly at the object level

  • modify __repr__ etc.
  • allow short notation of link
  • see example below
wf.structure = nodes.bulk_structure(repeat=3, cubic=True, element="Al")

This node object should behave as similarly as possible to the output object, i.e., its representation should be the same

wf.structure.outputs.structure.value;
wf.structure; # has same representation (__str__, __repr__, etc.)
Accessing the output object should also be more concise:
wf.structure.value.plot3d()
wf.structure.plot3d()  # nice, but may be too much overloading

To set a node link, the node reference should be sufficient

  • wf.calc = nodes.calc_md(job=wf.engine.outputs.job) can be replaced by
  • wf.calc = nodes.calc_md(job=wf.engine)

The original code block can then be simplified:

wf.structure = nodes.bulk_structure(repeat=3, cubic=True, element="Al")
wf.engine = nodes.lammps(structure=wf.structure.outputs.structure)
wf.calc = nodes.calc_md(job=wf.engine.outputs.job, 
                        update_on_instantiation=False, 
                        # run_automatically=False
                       )
wf.plot = nodes.scatter(
    x=wf.calc.outputs.steps, 
    y=wf.calc.outputs.temperature
)

New version (pseudocode)

wf = Workflow('my_first_workflow', domain='pyiron.atomistic') # set default domain for nodes

structure = wf.create.structure.bulk(repeat=3, cubic=True, element="Al")  # adds node to wf, structure.wf stores wf
engine = wf.create.engines.lammps(structure=structure.outputs.structure)
calc = wf.create.calc_md(job=engine, 
                        update_on_instantiation=False  # should be set as default in node definition for longer jobs
                       )
wf.create.plot.scatter(
    x=wf.calc.outputs.steps,      # more than one output, explicit statement required
    y=wf.calc.outputs.temperature
)

Execute workflow

from pyiron_contrib.workflow import Executor

exe = Executor(server='my_server', cores=24, queue='my_queue')  # taskmanager
exe.list_servers(), exe.task_table(), ...  # provide utility functions

Utility functions

wf.run()    # corresponds to update in the original version (modal)
wf.submit(executor=exe) # run in background, submit to queue, etc., requires de-(serialization)
wf.status # finished, aborted, unfinished nodes etc.
wf.task_table() # filtered for specific workflow
wf.visualize() # show (static) graph network
wf.ironflow()  # show workflow as graph in ironflow 
wf.save()  # save also local nodes (provide all info to run on any machine)
wf.load()
wf.__repr__() # show output of final node (in the above case the matplotlib graph), 
              # maybe with 'signature' of the workflow (name, status, etc.)

Create new workflow with same workflow but modified parameters

wf(structure=wf.create.structure.bulk(repeat=3, cubic=True, element="Cu"), name='wf_Cu')

Extend data objects

In nodes such as calc_MD the number of output objects/values is rather large. It would therefore be highly attractive to replace them with data objects, which could provide additional functionality (e.g. ontology). The decorator should automatically translate the object to provide all fields. Alternatively, we could use ontology to connect individual elements (e.g. positions) with such a data object (the ontology knows that positions is part of MD_data).

from dataclasses import dataclass
import numpy as np

@dataclass
class MD_data:
    positions: list | np.ndarray = None
    temperature: int = 0
    # ...

out = MD_data()  # create an instance
out.temperature
out.positions = np.linspace(0, 100, 10)
out.positions.shape
(10,)

Extend input and output

File objects

Make sure that files are correctly transferred, copied, etc.

file_out_1 = node1(file_in)
file_out_2 = node2(file_out_1)  

Storage

Include (HDF5) storage in workflow object

When a node gets output, put (filtered) data in the workflow data store. This should replace our present HDF5 storage.
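As a very rough sketch of what such a store hook could look like (all names here, e.g. WorkflowStore and push, are hypothetical and not part of the current contrib code):

import h5py
import numpy as np

class WorkflowStore:
    """Persist (filtered) node output into one HDF5 file per workflow."""

    def __init__(self, path: str):
        self.path = path

    def push(self, node_label: str, outputs: dict, keep: set | None = None):
        # 'keep' filters which output channels actually get stored
        with h5py.File(self.path, "a") as f:
            group = f.require_group(node_label)
            for label, value in outputs.items():
                if keep is not None and label not in keep:
                    continue
                if label in group:
                    del group[label]  # overwrite on re-run
                group[label] = np.asarray(value)

store = WorkflowStore("my_first_workflow.h5")
store.push("calc", {"steps": [0, 100, 200], "temperature": [600.0, 300.0, 295.0]})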

liamhuber commented

Hi @JNmpi, thanks for the feedback!

The functionality of @liamhuber's workflow class in pyiron_contrib could be a good starting point for interesting extensions and for merging pyiron_base and ironflow.

Agreed. In terms of timeframe, I think I want to wait until the stuff in contrib actually gets moved to base before translating ironflow on top of it. Currently it is still evolving too quickly and missing key features the ryven+ironflow set has (e.g. ontological typing).

I am finding it very useful when developing the workflow stuff in contrib to keep the eventual GUI representation in mind -- it helps to streamline the logic.

...Some of them are probably straightforward to implement, others may be hard or impossible. It is meant more as a collection of ideas than as a list of tasks.

So good news: some are just already implemented! After our last conversation, I worked on the Node class to introduce new FastNode(Node) and SingleValueNode(FastNode) classes. There is some trouble with conda on contrib, so these aren't merged yet -- fixing this is on my TODO list today. Because of the conda issues I'm not confident that Binder will work, but you can see the updated demo with executed cells here.

wf.structure.outputs.structure.value;
wf.structure; # has same representation (__str__, __repr__, etc.)

Good idea! I didn't implement this yet, but it's easy. For __str__ I'd probably preface the representation with something like "SingleValueNode output value:\n" so people get a reminder that it's actually a node.

wf.structure.plot3d()  # nice, but may be too much overloading

This works! So does item access, e.g.:

wf.my_array_or_list_node[some_starting_index:some_ending_index]
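For reference, a minimal sketch of how this delegation can be wired up (simplified; the real SingleValueNode internals differ):

class SingleValueNodeSketch:
    """Stand-in showing how a single-output node can mimic its value."""

    def __init__(self, value):
        self.single_value = value  # stands in for the one output channel's value

    def __repr__(self):
        return repr(self.single_value)

    def __str__(self):
        return f"SingleValueNode output value:\n{self.single_value}"

    def __getattr__(self, item):
        # Only called when normal lookup fails, so node.plot3d() etc. fall through
        return getattr(self.single_value, item)

    def __getitem__(self, item):
        return self.single_value[item]

node = SingleValueNodeSketch([1, 2, 3, 4])
node[1:3]  # [2, 3] -- slicing hits the underlying value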

To set a node link, the node reference should be sufficient

structure = wf.create.structure.bulk(repeat=3, cubic=True, element="Al")  # adds node to wf, structure.wf stores wf
engine = wf.create.engines.lammps(structure=structure)

I also like this. It will muddy the source code a little bit, because the code that interprets connections needs an extra elif to handle the assigned object being of class SingleValueNode, but IMO this is a worthwhile complexity to pay for getting that syntax. Should be easy to implement.
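Roughly, that extra branch in the connection-interpreting code would look like this (a sketch only; the class names are stand-ins for the contrib internals):

def interpret_assignment(input_channel, other):
    if isinstance(other, OutputChannel):
        input_channel.connect(other)
    elif isinstance(other, SingleValueNode):
        # New branch: a single-output node stands in for its one output channel
        input_channel.connect(other.single_value_output)
    else:
        input_channel.update(other)  # plain value: just set it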

calc = wf.create.calc_md(job=engine, 
     update_on_instantiation=False  # should be set as default in node definition for longer jobs
)

This is also done! The core Node class now has update_on_instantiation and run_on_updates both set to False, while FastNode (which requires type hints and defaults for all input) sets them to True.

Execute workflow

...

Utility functions

...

Create new workflow with same workflow but modified parameters

...

I like these ideas and they seem achievable, but I haven't done enough work yet on these topics to have anything really concrete to say. The only difficult thing I see is here:

wf.__repr__() # show output of final node (in the above case the matplotlib graph), 
                       # maybe with 'signature' of the workflow (name, status, etc.)

In many graphs, I think it's impossible to say what the "final" node is, so this will be difficult to fully automate. However, we could probably handle edge cases, like when the workflow has only a single un-connected output, or allow users to specify which output to use here manually when creating the workflow/macro.
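A sketch of that single-unconnected-output heuristic (a hypothetical helper; it assumes workflow.nodes and channel.connections exist roughly as in the current demo):

def find_terminal_output(workflow):
    unconnected = [
        channel
        for node in workflow.nodes.values()
        for channel in node.outputs
        if not channel.connections
    ]
    if len(unconnected) == 1:
        return unconnected[0]  # unambiguous: safe to use for __repr__
    return None  # ambiguous graph: fall back to the workflow 'signature'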

Extend data objects

In nodes such as calc_MD the number of output objects/values is rather large. It would therefore be highly attractive to replace them with data objects

Long story short: I agree with the principle, and I think it's possible, but the implementation will take some thinking/cleverness.

This is something that Marvin and I have talked about in this tinybase PR, e.g. in this comment. Overall, I am in complete agreement that collecting IO together in composable classes is desirable. The catch is to make this compatible with a lazy graph-based approach.

E.g. this works quite nicely:

plt.scatter(wf.my_md_node.outputs.md_output.temperatures)

But, if md_output is merely an instance of a @dataclass, this is problematic for defining graphs a priori:

wf.create.node.scatter_node(y=wf.my_md_node.outputs.md_output.temperatures)

In such an example, simply pointing to y=wf.my_md_node.outputs.md_output works by creating a connection to the md_output channel. Similarly, if my_md_node had already been executed and had data, we could statically assign the value like y=wf.my_md_node.outputs.md_output.value.temperatures.

I think the trick will be to get the md_output object to be more like an IO object, where it is itself actually just a collection of Channels and we can make valid connections to its children before they're populated with data. I'm optimistic this is possible, but it will take some thinking to do the definitions in a way that is readable/maintainable.
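To make that concrete, a purely illustrative sketch in which the dataclass fields become child channels that accept connections before any data exists (ChannelSketch and ChannelGroup are invented names):

from dataclasses import dataclass, fields

@dataclass
class MDOutput:
    steps: list = None
    temperature: list = None

class ChannelSketch:
    def __init__(self, label):
        self.label = label
        self.value = None  # populated only after the node runs
        self.connections = []

class ChannelGroup:
    """Mirrors the dataclass, but each field is itself a channel."""

    def __init__(self, data_cls):
        for f in fields(data_cls):
            setattr(self, f.name, ChannelSketch(f.name))

md_output = ChannelGroup(MDOutput)
md_output.temperature.connections.append("scatter.inputs.y")  # valid pre-run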

Extend input and output

File objects

Make sure that files are correctly transfered, copied, etc.

file_out_1 = node1(file_in)
file_out_2 = node2(file_out_1) 

I know we talked about taking file paths as IO, which is a good and necessary idea, but I'm afraid this example is a little sparse for me to get what you're driving at. Could you expand?

Storage

Yep, absolutely! This is another thing where I agree 100% and think it should be very doable, but just haven't done enough to have a particular comment.

liamhuber commented

Ok, the __repr__ and easier connection features are merged in pyiron_contrib #654 👍

JNmpi commented

Hi @liamhuber, thanks for your super fast reply and translating some of the ideas so quickly into code.

Below some thoughts/comments regarding the discussions above:

Agreed. In terms of timeframe, I think I want to wait until the stuff in contrib actually gets moved to base before translating ironflow on top of it. Currently it is still evolving too quickly and missing key features the ryven+ironflow set has (e.g. ontological typing).

Jan is pushing towards keeping pyiron_base small, and actually reducing its functionality, to make it easier to maintain and less complex. Thus, the best strategy would be to develop these concepts independently of pyiron_base until they reach the necessary functionality and can replace pyiron_base or integrate it.

So good news: some are just already implemented!

Thanks! I tried the new features, including the ones you implemented yesterday, and they are really fun to use.

But, if md_output is merely an instance of a @dataclass, this is problematic for defining graphs a priori:

An idea would be that the data object is used in the function definition but 'translated' by the decorator into a real object containing all the subparts of the output. Thus, node.outputs.x would be directly available; no magic is needed in the function call.
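For instance (a sketch of the idea, not existing contrib code), the decorator could derive the output labels directly from the dataclass fields:

from dataclasses import dataclass, fields

@dataclass
class MDData:
    positions: list = None
    temperature: float = 0.0

def output_labels_from(data_cls):
    # A node decorator could call this instead of requiring explicit labels
    return tuple(f.name for f in fields(data_cls))

output_labels_from(MDData)  # ('positions', 'temperature')
# so wf.calc.outputs.positions etc. would exist without per-field boilerplate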

Extend input and output

Below is a sketch of the functionality that I had in mind. Essentially, the workflow/transfer of data should be extended to files.

     working_directory_node1:
         file_node1_in_1
         file_node1_in_2
         node1.exe
         file_node1_out_1
         file_node1_out_2
         
     working_directory_node2:
         file_node2_in_1
         file_node2_in_2
         node2.exe
         file_node2_out_1
         file_node2_out_2   
         
  To achieve the desired behavior the following file tasks have to be executed:
  - mkdir working_directory_node1
  - cp orig_path/file_node1_in_1 working_directory_node1
  - cp orig_path/file_node1_in_2 working_directory_node1
  - run node1.exe 
  
  - mkdir working_directory_node2
  - cp working_directory_node1/file_node1_out_1 working_directory_node2/file_node2_in_1 
  - cp orig_path/file_node1_in_2 working_directory_node2
  - run node2.exe
  
  - *remove intermediate files and working directories (like garbage collection for variables)*
  - rm -r working_directory_node1
  - rm file_node2_in_* 
  - rm *.exe
  - maybe store the interesting output files to a central repository (like our hdf5-files)
  
  For data objects these routines are taken for granted, whereas for file objects we have to introduce/code them. 
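A minimal Python sketch of the staging/cleanup bookkeeping described above (stage_node and cleanup_node are invented names):

import shutil
from pathlib import Path

def stage_node(workdir: str, input_files: dict[str, str]) -> Path:
    """Create the working directory and copy the input files into it."""
    wd = Path(workdir)
    wd.mkdir(parents=True, exist_ok=True)
    for name, source in input_files.items():
        shutil.copy(source, wd / name)
    return wd

def cleanup_node(workdir: str, keep: frozenset = frozenset()):
    """Garbage-collect intermediate files once downstream nodes are done."""
    wd = Path(workdir)
    for path in wd.iterdir():
        if path.name not in keep:
            path.unlink()
    if not any(wd.iterdir()):
        wd.rmdir()  # drop the directory itself if nothing was kept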
JNmpi commented

@liamhuber, one more issue. When playing with your latest workflow_example notebook and extending it I encountered a strange behavior.

I first extended your plot node to also show labels (it is really easy now to make such extensions):

    @single_value_node("fig")
    def scatter(
            x: Optional[list | np.ndarray] = None, y: Optional[list | np.ndarray] = None, plt_label: Optional[str] = None
    ):
        print(x, y, plt_label)
        plt.scatter(x, y, label=plt_label)
        plt.legend()
        return plt

Using this modified function, I then ran your pyiron example workflow:

      wf = Workflow("with_prebuilt")
      
      wf.structure = atomistics.bulk_structure(repeat=3, cubic=True, element="Al")
      wf.engine = atomistics.lammps(structure=wf.structure)
      wf.calc = atomistics.calc_md(
          job=wf.engine, 
          run_on_updates=True, 
          update_on_instantiation=True
      )
      wf.plot = standard.scatter(
          x=wf.calc.outputs.steps, 
          y=wf.calc.outputs.temperature,
          plt_label = wf.structure.inputs.element.value
      )

If I then change the element to Cu I get a superposition of the Al and Cu graphs. This is a super-nice feature. However, in the print statement and in the label this change is somehow not reflected, which is really strange since the correct element and structure are entering the workflow:

 wf.structure.inputs.element='Cu'

The job JUSTAJOBNAME was saved and received the ID: 291
[ 0 100 200 300 400 500 600 700 800 900 1000] [600. 226.848439 302.84353588 294.59303623 327.19569679
320.24427303 331.775404 305.76311216 272.52029955 290.02094578
297.84092042] Al
[ 0 100 200 300 400 500 600 700 800 900 1000] [600. 249.26134615 350.97254999 267.22978955 341.10557608
258.27106016 264.96153672 275.23710414 321.16565472 316.65413348
285.08166452] Al

Any idea what is going on and how to cure it?

JNmpi commented

Another small idea/suggestion:

To effectively work with the new objects/nodes it would be helpful if they had nice repr info. For example, when calling

 wf.structure.inputs

it would be nice to see which labels are available. This also applies to the outputs, as well as to some other objects. Generally, for every user-accessible object it would be nice to get the relevant info about its content rather than the cryptic default message.

One last observation: when using label as a parameter name (e.g. in the plot example above), things were not working. I guess this is related to label being a function for inputs. Since label is a rather common parameter name, it would be nice to enable it, or at least to throw an error message if the user wants to use this name.

liamhuber commented

@JNmpi, thanks for all the great feedback!

Timeframe

Yes, I really like this direction for base. I think we're on the same page then. I mostly meant that while layering ironflow on top of this workflow stuff is absolutely the plan, it's not imminent.

IO classes

Yes, wrapping the raw data classes is a great idea! I wasn't thinking in this direction yet so it's good you brought it up, but the wrappers/decorators are working so well for packaging nodes that I'm very optimistic for a similar attack here.

File-based functionality

Ok, I think I see where you're going now. This sort of thing is definitely necessary for operations like Vasp, and I see also how it reflects your ideas on a SnakeMake node.

Plot extension

When playing with your latest workflow_example notebook and extending it I encountered a strange behavior.
...
If I then change the element to Cu I get a superposition of the Al and Cu graphs. This is a super-nice feature. However, in the print statement and in the label this change is somehow not reflected, which is really strange since the correct element and structure are entering the workflow: wf.structure.inputs.element='Cu'

Yep, I see the issue! The catch is that the plot argument is set to a value rather than a channel: plt_label = wf.structure.inputs.element.value, thus it gets the current string value there permanently.

You're actually disallowed from connecting the input of one node to the input of the other, so just removing .value won't do the trick either.

If you want to re-use the chemistry information, there are two generic possibilities: create an upstream source for this information and pass it to both the structure and plot nodes, OR create an intermediate node that parses the structure output and generates plot label input.

Here's some example code that uses the former approach (should work with the current HEAD of pyiron_contrib):

from typing import Optional

import matplotlib.pylab as plt
import numpy as np

from pyiron_contrib.workflow.workflow import Workflow

@Workflow.wrap_as.single_value_node("fig")
def scatter(
    x: Optional[list | np.ndarray] = None, 
    y: Optional[list | np.ndarray] = None, 
    plt_label: Optional[str] = None
):
    print(x, y, plt_label)
    plt.scatter(x, y, label=plt_label)
    plt.legend()
    return plt

@Workflow.wrap_as.single_value_node("element")
def element(element: str = "Al"):
    return element

wf = Workflow("with_prebuilt")

wf.element_chooser = element()
wf.structure = wf.add.atomistics.BulkStructure(
    repeat=3, 
    cubic=True, 
    element=wf.element_chooser
)
wf.engine = wf.add.atomistics.Lammps(structure=wf.structure)
wf.calc = wf.add.atomistics.CalcMd(
  job=wf.engine, 
  run_on_updates=True, 
  update_on_instantiation=True
)
wf.plot = scatter(
  x=wf.calc.outputs.steps, 
  y=wf.calc.outputs.temperature,
  plt_label=wf.element_chooser
)

Then update the species to generate the layered plot with wf.element_chooser.inputs.element = "Cu"
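For completeness, the latter approach (an intermediate node that parses the structure output into a label) would look roughly like this, assuming the structure behaves like an ASE Atoms object with get_chemical_symbols():

@Workflow.wrap_as.single_value_node("plt_label")
def structure_label(structure=None):
    if structure is None:
        return None
    return structure.get_chemical_symbols()[0]  # e.g. "Al" for a pure bulk cell

# wf.labeler = structure_label(structure=wf.structure)
# wf.plot = scatter(..., plt_label=wf.labeler)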

The fact that the plots get layered was actually not intentional on my part. And now you'll see it can really be a bit of a pain, because we actually get three plots! This is perfectly sensible, albeit inconvenient, because the plot node updates on every input update, and now the label and data are being updated separately.

We could avoid this by manually managing the flow control, e.g. setting wf.plot.run_on_updates = False and later manually calling wf.plot.run(). However, what I really want is an extra option for the channels, something like data_update_required which stops the node from being ready (a requirement to run using run_on_updates=True) until after all inputs so marked have received an update. This way you could easily flag sets of data that you know all need to update together before a new node run will be useful/possible.
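A sketch of that gate (data_update_required is the flag name proposed above; everything else is invented):

class GatedInputChannel:
    def __init__(self, data_update_required=False):
        self.data_update_required = data_update_required
        self.fresh = False  # has this channel updated since the last run?
        self.value = None

    def update(self, value):
        self.value = value
        self.fresh = True

def ready_to_run(input_channels):
    # Only run once every flagged channel has received a new value
    return all(ch.fresh for ch in input_channels if ch.data_update_required)

def after_run(input_channels):
    for ch in input_channels:
        ch.fresh = False  # reset the gate for the next batch of updates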

Representations

...For example, when calling wf.structure.inputs it would be nice to see which labels are available....

This actually surprised me, because I thought it was already there! It turns out there were two problems:

First, Atoms (and I guess some other things) don't JSON well, so I had to stringify them first. Easy.

Second, Channel, IO, Node, and Workflow all inherit from HasToDict, which defines a _repr_json_() method, BUT somehow this is only triggering for Channel and Node!!! I have no idea why, because _repr_json_ simply returns self.to_dict(), and when I manually call JSON(wf.to_dict()) or JSON(wf.some_node.inputs.to_dict()) it works beautifully -- so it's not that the representation is bad, just that Jupyter is not invoking it!

As a temporary stopgap I've added an extra method .repr_json() to force the more beautiful representation.

These fixes are here, and are already available on HEAD.

"label" and naming

One last observation: when using label as a parameter name (e.g. in the plot example above), things were not working. I guess this is related to label being a function for inputs. Since label is a rather common parameter name, it would be nice to enable it, or at least to throw an error message if the user wants to use this name.

IMO, this is a little bit tricky from a UX perspective. I actually switched to using "label" from "name" exactly because I thought it wasn't so common! I still think that's true, but obviously I wasn't in matplotlib-state-of-mind 🤣

The trouble is that this variable name needs to be: short, descriptive, easy to understand/remember, and not commonly used by any other developers. We could solve this pretty easily by adding specificity, i.e. label --> channel_label, node_label, workflow_label. But, users need to invoke this as a kwarg in a number of places, so having multiple different variable names is in tension with the need for it to be easy to remember.

I'm not closed to changing this, I just don't currently have any ideas I like better than "label".

In the meantime, you're absolutely right that a better error is needed when users try to define a node with protected channel names. This is already the case for, e.g., assigning nodes to workflows with names that conflict with workflow, so I'll do something similar for nodes in the next PR. EDIT: here.
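The guard itself could be as simple as this sketch (the protected-name set is illustrative, not the actual list):

PROTECTED_CHANNEL_NAMES = {"label", "value", "connections", "to_dict"}

def check_channel_label(label: str):
    if label in PROTECTED_CHANNEL_NAMES:
        raise ValueError(
            f"'{label}' is reserved by the channel interface; "
            "please choose a different input/output name."
        )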

JNmpi commented

@liamhuber, thanks again for your fast and very helpful reply. Some more thoughts regarding the pyiron plot example. Based on your explanation and example I came up with the following code that behaves as intended:

      @Workflow.wrap_as.single_value_node("fig")
      def scatter(
          x: Optional[list | np.ndarray] = None, 
          y: Optional[list | np.ndarray] = None, 
          plt_label: Optional[str] = None
      ): 
          if y is None:
              return None
          plt.scatter(x, y, label=plt_label)
          plt.legend()
          return plt.show()
      
      @Workflow.wrap_as.single_value_node("element")
      def element(element: str = "Al"):
          return element
      
      wf = Workflow("with_prebuilt")
      
      wf.element_chooser = element()
      wf.structure = wf.add.atomistics.BulkStructure(
          repeat=3, 
          cubic=True, 
          element=wf.element_chooser
      )
      wf.engine = wf.add.atomistics.Lammps(structure=wf.structure)
      wf.calc = wf.add.atomistics.CalcMd(
        job=wf.engine, 
        run_on_updates=True, 
        update_on_instantiation=False
      )
      wf.plot = scatter(
        x=wf.calc.outputs.steps, 
        y=wf.calc.outputs.temperature,
        plt_label=wf.element_chooser
      )
      wf.plot.run_on_updates = False

and calling the workflow for different elements via:

      def run_wf(wf, el):
          wf.element_chooser.inputs.element = el
          return wf.plot.run() 

      run_wf(wf, 'Ni')

While the above code works, the syntax is for my taste not as simple and flexible as I would like. An idea would be to overload __call__ to generate new instances:

      wf_new = wf(label='new_wf_name')
      wf_new.input.structure.element = 'Ni'
      wf_new.run()

The above pseudocode has, in my opinion, several advantages. It allows accessing all input parameters of any node in the workflow in a highly intuitive manner. An additional bonus is that the syntax is similar to our present pyiron notation. Where I am not so sure is the first line. Effectively it combines a copy statement

    wf_new = wf.copy()(label='new_wf_name')

(to create a new workflow instance) and giving it a name. It thus uses a workflow instance as a class definition. This may be a bit unusual, but I think I could get used to it quickly. If the input is on the node level we could consider the following pseudocode:

      new_structure = wf.structure
      new_structure.input.element =  'Ni'
      wf_new = wf(label='new_wf_name', structure=new_structure)
      wf_new.run()

liamhuber commented

While the above code works, the syntax is for my taste not as simple and flexible as I would like. An idea would be to overload __call__ to generate new instances:
...
It thus uses a workflow instance as a class definition. This may be a bit unusual, but I think I could get used to it quickly.
...

I like your wrapper def run_wf(wf, el):, and in particular I think it nicely highlights how this infrastructure is just plug-and-play with regular python development, but I agree that we want to provide our own, more convenient solution.

Using the workflow instance as a class might be possible, e.g. by overloading the __call__ magic method, but since users need to import Workflow anyway, I think we can keep things more pythonic without sacrificing convenience. E.g. what about something like this:

from pyiron_contrib.workflow import Workflow

@Workflow.wrap_as.node("y")
def some_node(x):
    return x

some_node_instance = some_node(label="foo")

wf = Workflow("original", some_node_instance, bar=some_node())
# Initializes the workflow with two nodes
# the arg parsing is already working, and the node keeps the label "foo"
# kwarg parsing is not implemented, but I imagine it adds the node and re-labels it,
# in this case to "bar"

wf.inputs.foo_x = 5
# Currently, workflows dynamically look for unconnected channels
# and expose them as {node.label}_{channel.label}, so this works


new_wf = Workflow(workflow=wf, label="new", bar_x=6, baz=some_node())
# Where...
# the workflow kwarg lets us know we're initializing as a deep copy
# the label kwarg is optional and obvious 
# (might be mandatory later, e.g. if labels set working directories or something)
# the bar_x kwarg gets found in inputs, and is used to initialize the value
# the baz kwarg is _not_ found in inputs and its value is a node, so we add it like usual

One thing you couldn't do this way is something like new_wf = Workflow(workflow=wf, label="new", bar_x=baz.output.y, baz=some_node()), because the baz node doesn't even exist yet. IMO this is a fair constraint, and after instantiating the new workflow you can always do re-wiring like new_wf.inputs.bar_x = new_wf.outputs.baz_y (or, equivalently, new_wf.bar.inputs.x = new_wf.baz.outputs.y)

So for your example above it would just be

new_wf = Workflow(workflow=wf, label="new_wf_name", element_chooser_element="Ni")
new_wf.run()
# (although the Workflow.run stuff still needs to be implemented)

Under the hood, Workflow(workflow=some_wf_instance, label="some_new_label") is probably calling a method some_wf_instance.copy_to(new_label="some_new_label") and using the return value to init the new workflow.
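In that spirit, the constructor's kwarg handling might dispatch roughly like this (a sketch under the assumptions above; adopt_nodes_from, inputs.labels, and add_node are invented names, and Node is the class from the contrib imports):

def parse_workflow_kwargs(wf, workflow=None, **kwargs):
    if workflow is not None:
        # Deep-copy the template's nodes and wiring into the new instance
        wf.adopt_nodes_from(workflow.copy_to(new_label=wf.label))
    for key, value in kwargs.items():
        if key in wf.inputs.labels:
            wf.inputs[key] = value  # e.g. bar_x=6 initializes an exposed input
        elif isinstance(value, Node):
            wf.add_node(value, label=key)  # e.g. baz=some_node() adds a node
        else:
            raise TypeError(f"Don't know what to do with kwarg '{key}'")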

I think this gives a relatively tight and understandable syntax without the implementation headaches of overloading __call__ to return a class. Of course we can keep iterating on it!

I'm a bit under the weather today, so unfortunately I don't think I'll get to the implementation of these ideas until tomorrow or Friday.

JNmpi commented

There is no urgency in discussing these issues. The discussion mainly serves to write down and sort some of my ideas and to check them with you.

With respect to instantiating a new workflow, I see your points. What I am not yet happy with is that a node has a different look and feel in the above-suggested implementation. The picture/use case I have in mind is that the user constructs a workflow with a specific set of default parameters for each node it contains. This workflow can be uploaded to a database (e.g. a node/workflow repository) by providing a semantic path. The user could then access the workflow via

   wf_new = wf.create.semantic_wf_path(label='my_new_wf') 

and providing structured input down to each node:

   wf_new.structure.element = 'Ni'

What I would like to have is that all parameters are available without having to define a single_value_node decorator for each variable that I want to expose to the user. The original workflow, with its specific parameter settings, then acts as a default template, so that I only have to modify a few elements to adapt it. This structure corresponds to our present job structure, while making it much easier to write and implement new job types. A further advantage is that workflow templates could be created both graphically (via ironflow) and directly from code. As you mentioned, we will probably need a few iterations to come up with an intuitive and flexible solution.