Confusion caused by parameter + multiple execution of global section.
BoPeng opened this issue · 19 comments
A user reported a case on Gitter which boils down to

```
[global]
parameter: A=[1, 2]
parameter: B=[]
B.extend(A)

[default]
print(B)
input: for_each=dict(i=range(2))
print(B)
```
and the output is

```
[1, 2]
[1, 2, 1, 2]
[1, 2, 1, 2]
```
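The accumulation can be modeled in plain Python. This is a rough sketch, not SoS internals: it assumes `parameter:` assigns its default only when the variable is not yet defined, and that each substep re-executes the global section on a copy of the step-level environment.

```python
import copy

def run_global(env):
    # Hypothetical model of the global section above:
    # `parameter:` assigns only when the name is still undefined,
    # but a plain statement like B.extend(A) runs on every execution.
    env.setdefault('A', [1, 2])   # parameter: A=[1, 2]
    env.setdefault('B', [])       # parameter: B=[]
    env['B'].extend(env['A'])     # B.extend(A)

env = {}
run_global(env)                   # step-level execution
print(env['B'])                   # [1, 2]

for i in range(2):                # two substeps from for_each
    sub = copy.deepcopy(env)      # each substep copies the step environment
    run_global(sub)               # ... and re-executes the global section
    print(sub['B'])               # [1, 2, 1, 2]
```

Under these assumptions the sketch reproduces the reported output: `B` grows because `B.extend(A)` runs once per execution of the global section, while `parameter: B=[]` does not reset it.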
We should investigate the behavior of `parameter` in the global section more closely.
I think I've raised the concern a few times in the past that the global section gets executed multiple times. It is expected that the global section should only be executed once, but I recall you gave some good reasons why it executes multiple times ... clearly this now causes trouble. Maybe let's revisit those reasons?
The easiest example is

```
[global]
import time

[1]
input: for_each=dict(i=range(2))
time.sleep(0)
```

because the `time` module is not pickleable and the substep would not get `time` unless the global section is executed by the substep each time.
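The distinction can be checked directly: a module object cannot be pickled, while a function imported from it pickles by reference. A minimal sketch:

```python
import pickle
import time
from time import sleep

# Module objects are not pickleable, so `import time` would have to
# be re-executed (or handled separately) for each substep
try:
    pickle.dumps(time)
except TypeError as e:
    print('module:', e)

# ... but a top-level function pickles by reference (module + name),
# so `from time import sleep` poses no problem
restored = pickle.loads(pickle.dumps(sleep))
print(restored is sleep)          # True
```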
Cannot immediately recall the reason for ignoring `parameter` in some cases though.
I see, now I recall this exception for `import`. But it looks like this is an exception we can possibly hack around? Are there other exceptions?
Yes, we can hack and separate `import` statements, but things can get complicated. This document shows what can and what cannot be pickled. For example

```python
from some_module import some_class
a = some_class()
```

and `a` might not be pickleable.
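A concrete illustration of that case, using `threading.Lock` as a stand-in for the hypothetical `some_class`: importing the name is fine, but the instance it creates is not pickleable.

```python
import pickle
from threading import Lock   # importing the name itself is fine

a = Lock()                   # but the created object wraps OS state

try:
    pickle.dumps(a)          # fails: lock objects cannot be pickled
except TypeError as e:
    print(e)
```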
We can enforce the pickleable requirement for the `global` section though. Basically, we can say, other than modules, which we can import separately, everything else needs to be pickleable....
This is a more acceptable behavior than the example above where `B` gets accumulated. I do not see it as very limiting.
For pickled global objects, are they shared with substeps? In other words, will each substep use the environment loaded by the step rather than loading and unpickling the saved object on its own? If objects are shared, are they read-only?
Right now we send the `global` statements to each substep and execute them there, and there is also a `background` variable obtained from the step level... Let me revisit this after I complete #1218.
Use the following to test the pickleability of things:

```python
import pickle

stmt = '''
from time import sleep
def a():
    print('hi')
'''
bundle = {}
exec(stmt, bundle, bundle)
print(bundle.keys())
pickle.dumps(bundle)
```
So

- `import time` is not, but can be easily separated.
- `from time import time` is ok, but other objects might not be.
- `def a()` cannot, and this can be a problem.
- A `class` definition cannot either.
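These cases can be checked name by name instead of pickling the whole dictionary at once; a small extension of the test snippet above (`__builtins__` is skipped because `exec` injects it automatically):

```python
import pickle

stmt = '''
from time import sleep
def a():
    print('hi')
'''
bundle = {}
exec(stmt, bundle, bundle)

# Try to pickle each binding individually to see which ones fail
for name, obj in bundle.items():
    if name == '__builtins__':    # injected by exec, not user code
        continue
    try:
        pickle.dumps(obj)
        print(name, 'ok')
    except Exception:
        print(name, 'not pickleable')
```

Here `sleep` pickles fine (by reference to the `time` module), while `a` fails because pickle cannot locate a function defined in an ad hoc `exec` namespace.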
`def` and `class` can be huge limitations ...
Can we then instead only pickle parameters and variables, and run the rest of the code -- I guess this is not easy to do?
We can do something complicated but I am not sure how reliable it can be... Basically, we can parse the global section and

- Pick out `import`, `def`, and `class` statements.
- Run these statements and get a dictionary `A`.
- Run the entire global section and get another dictionary `B`.
- Remove the elements in `A` from `B`, and we require the rest of the stuff to be pickleable.
Now the global section is separated into `import`/`def`/`class` statements and a pickleable dictionary. Now for each place where the global statement is needed, we

- Execute the imports and definitions.
- Restore from the pickled dictionary.
This should work most of the time but

- It assumes the order of imports and definitions does not matter, which I am sure is not true for some corner cases.
- It will prevent users from creating non-pickleable objects using something like `A = function_call()`, but I guess we can tolerate such limitations.
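The plan above can be sketched with the standard `ast` module (Python 3.9+ for `ast.unparse`). This is an illustration only, not SoS code, and `split_global` is a hypothetical name:

```python
import ast
import pickle

def split_global(section):
    """Pick out import/def/class statements from a global section (a sketch)."""
    defs = [node for node in ast.parse(section).body
            if isinstance(node, (ast.Import, ast.ImportFrom,
                                 ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef))]
    return ast.unparse(ast.Module(body=defs, type_ignores=[]))

section = '''
import time
def greet():
    return 'hi'
A = [1, 2]
'''

# 1. Run only the import/def/class statements to get dictionary A
def_src = split_global(section)
env_A = {}
exec(def_src, env_A)

# 2. Run the entire global section to get dictionary B
env_B = {}
exec(section, env_B)

# 3. Remove the elements of A from B; the rest must be pickleable
payload = {k: v for k, v in env_B.items()
           if k not in env_A and k != '__builtins__'}
blob = pickle.dumps(payload)      # only A = [1, 2] is left here

# Later, e.g. in a substep: re-run the definitions, restore the rest
sub = {}
exec(def_src, sub)
sub.update(pickle.loads(blob))
print(sub['A'])                   # [1, 2]
```

As noted above, this silently assumes the definitions can be hoisted ahead of the other statements without changing their meaning.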
Not sure if you are still using them, but I will have to remove the `%include` and `%from` magics from the syntax, because they include the `global` section from another script into the current one and make the handling of this ticket even more troublesome.
I do not use these any more. I think you managed to convince me to stop using them at some point.
Yes, I think I was under the influence of snakemake, which can manage multiple files, but I frowned upon the complexity of the implementation and especially the use of these features, because they are not in line with our design philosophy ...
I am removing them now.
Done. This also resolves #1155 (execution of global statement on remote host) because now the global statement will only be executed once (locally) and be pickled to all steps, substeps, and tasks.
Great! Since it is pickled, how does each substep use this pickled data? Does each load it on its own, or does the step load it and copy it to all substeps -- by reference or by value?
As far as I can tell, each substep will load it because everything is mixed up on the same worker. This could be improved but right now correctness is more important.
The trunk is not ready as there are still randomly failing tests, and I can now reproduce the DAG problem, but I believe the key parts are done.
> I can now reproduce the DAG problem
Can this be reproduced reliably now on your end with some example? I do not think I get it every time; at least I did not see it in versions before 0.18.7.
No. Only with the complicated example you have with 20+k jobs and many nested workflows. I can possibly try to trim down the workflow to create a test case though.