Disallow parameters in tasks and store common information from subtasks to master task
BoPeng opened this issue · 6 comments
There has been a reported case in which tasks are slow to create and submit. The problem traces back to the following scenario:
[sub]
parameter: par=list
input: loop
task:
[default]
sos_run('sub', par=very_long_list)
SoS passes very_long_list to sub so that parameter: par gets the value directly instead of reading it from the command line. The parameters are then sent to substeps and then to tasks.
There are several problems here. First, do we need to parse parameters in tasks?
[1]
output: 'a.txt'
task:
parameter: value = 'a'
sh: expand=True
echo {value} > {_output}
actually works, but conceptually we do not allow
sos run task_id --par A
because a task should encapsulate all of its information, which determines its unique ID, and allowing --par A
defeats the purpose of this mechanism.
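To illustrate why a command-line override conflicts with a content-derived ID, here is a minimal sketch. It assumes the ID is a hash of the full task definition; the task_id helper, the dictionary layout, and the choice of MD5 are hypothetical and are not taken from the SoS source.

```python
import hashlib
import json


def task_id(task_def: dict) -> str:
    """Hypothetical ID: a hash of the canonicalized task definition."""
    canonical = json.dumps(task_def, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# A task whose definition already contains every value it needs.
base = {
    "script": "echo {value} > {_output}",
    "vars": {"value": "a", "_output": "a.txt"},
}

# The same task with `value` overridden, as `--par A` style overrides would do.
overridden = {**base, "vars": {**base["vars"], "value": "A"}}

same_id = task_id(base) == task_id(dict(base))
changed_id = task_id(base) != task_id(overridden)
print(same_id, changed_id)
```

Under this assumption, an override at execution time would run content that no longer matches the ID under which the task was submitted.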
Therefore we should disallow the use of parameters in tasks. This helps reduce the size of task files because par
is no longer passed to tasks.
eb82512 disallows parameters in tasks.
Another problem:
In the case of
[sub]
parameter: par=list
input: for_each='par'
print(_par)
[default]
sos_run('sub', par=list(range(100)))
The parameter is used by the step but not by any of the substeps, so passing var
to all substeps would be a potentially substantial waste of zmq bandwidth and would slow down the substeps.
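One way to avoid shipping unused variables is to send a substep only the names its code actually references. The sketch below uses Python's ast module for this; the used_names helper and the env dictionary are hypothetical, and this is not necessarily how SoS decides what to pass.

```python
import ast


def used_names(code: str) -> set:
    """Collect every bare identifier that appears in a piece of Python code."""
    return {
        node.id
        for node in ast.walk(ast.parse(code))
        if isinstance(node, ast.Name)
    }


# Variables available at the step level (hypothetical values).
env = {"par": list(range(100)), "_par": 3}

# The substep body from the example above only refers to _par.
substep_code = "print(_par)"

# Keep only the variables the substep refers to.
names = used_names(substep_code)
needed = {key: value for key, value in env.items() if key in names}
print(needed)
```

Here the long list par stays with the step, and only _par travels to the substep.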
The last problem, in case the long variables are really needed in tasks:
[sub]
parameter: par=list
input: for_each='par'
task: trunk_size=5
print(_par)
print(par)
[default]
sos_run('sub', par=list(range(10)))
and then we create a jumbo task with 100 copies of the par
variable. It would be nice to somehow share the variable at the master task level so that there is no need to save several copies of it.
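The idea of sharing a variable at the master task level could be sketched as follows. This assumes each subtask carries a plain dictionary of variables; split_common and merge are hypothetical helpers for illustration, not SoS functions.

```python
def split_common(subtasks):
    """Move variables that are identical in every subtask into one shared dict."""
    if not subtasks:
        return {}, []
    first, rest = subtasks[0], subtasks[1:]
    common = {
        key: value
        for key, value in first.items()
        if all(key in sub and sub[key] == value for sub in rest)
    }
    unique = [
        {key: value for key, value in sub.items() if key not in common}
        for sub in subtasks
    ]
    return common, unique


def merge(common, unique):
    """Rebuild the full variable dict of one subtask before running it."""
    return {**common, **unique}


par = list(range(10))
# One trunk of five subtasks, each carrying its own copy of par.
subtasks = [{"par": par, "_par": value} for value in par[:5]]

common, unique = split_common(subtasks)
restored = [merge(common, item) for item in unique]
print(common.keys(), unique)
```

The master task would then store common once, and each subtask would keep only its own small dictionary.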
Therefore we should disallow the use of parameters in tasks.
But the following scenario will still work,
[1]
parameter: a = 1
task:
sh: expand = True
echo {a}
because it does not use parameters in tasks, right? I think the answer is yes, judging from your "last problem" statement, although we can be smarter about it (by not saving many copies).
so passing var to all substeps would be a potentially substantial waste of zmq bandwidth and would slow down the substeps.
Indeed. Looks like this is fixed?
Yes, the example will work because a is passed as a "used signature var", not as a parameter.
The last patch improves efficiency for the (hopefully not corner) case of large variables in subtasks:
[sub]
parameter: par=list
input: for_each=dict(_par=range(1000))
task: trunk_size=500
print(_par)
assert par
[default]
sos_run('sub', par=[f'a_{i}' for i in range(10000)])
For this particular example, the task file shrank from 6M to 66K, and the run time dropped from 90s to 23s. The 90s was mostly spent compressing the pickled dictionary.
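A rough way to see why sharing helps is to compare the compressed size of per-subtask payloads with and without the large variable duplicated. The sketch below assumes each subtask is pickled and compressed separately and uses zlib as a stand-in compressor; the real task file format, the subtask count, and the resulting numbers will differ.

```python
import pickle
import zlib

par = [f"a_{i}" for i in range(10000)]
n_subtasks = 50  # kept small so that the sketch runs quickly


def packed_size(obj) -> int:
    """Size in bytes of an object after pickling and zlib compression."""
    return len(zlib.compress(pickle.dumps(obj)))


# Every subtask stores its own copy of par.
duplicated = sum(
    packed_size({"par": par, "_par": i}) for i in range(n_subtasks)
)

# par is stored once at the master level, and subtasks keep only _par.
shared = packed_size({"par": par}) + sum(
    packed_size({"_par": i}) for i in range(n_subtasks)
)

print(duplicated, shared)
```

Each payload is pickled separately on purpose: within a single pickle, repeated references to the same list object are memoized, which would hide the duplication.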