Add parallel task option
It is not certain that this will improve performance much, because the fix itself takes very little time. Still, it would be good to include a parallel task option, so that we can test its performance in practice.
So add an option like the one in ece2cmor, with an argument:
--npp N Number of parallel tasks
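A minimal sketch of how such an option could be declared with `argparse`. Only the `--npp N` flag and its help text come from this issue; the positional `datadir` argument, the default of 1 and the function name `build_parser` are assumptions made for illustration, not the actual cmor-fixer code.

```python
import argparse


def build_parser():
    # Hypothetical parser: only --npp is taken from the issue text.
    parser = argparse.ArgumentParser(description="Sketch of a --npp option")
    # Assumed positional argument for the directory that holds the files.
    parser.add_argument("datadir", metavar="DIR", type=str,
                        help="Directory with cmorised files")
    # Defaulting to 1 keeps the serial behaviour unless the user asks for more.
    parser.add_argument("--npp", metavar="N", type=int, default=1,
                        help="Number of parallel tasks")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["--npp", "4", "some/dir"])
print(args.npp, args.datadir)
```

With a default of 1, existing invocations that omit `--npp` would keep running serially.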
This can be tested in the parallel_proc branch
I tried to test the parallel_proc branch with --npp > 1, but without success so far.
What does work is --npp 1 on the head node:
./cmor-fixer.py --verbose --dry --forceid --olist --npp 1 /lustre2/projects/model_testing/reerink/cmorised-results/cmor-cmip-test-all-11/t002/ifs/001/CMIP6/
However, for instance:
./cmor-fixer.py --verbose --dry --forceid --olist --npp 2 /lustre2/projects/model_testing/reerink/cmorised-results/cmor-cmip-test-all-11/t002/ifs/001/CMIP6/
gives the error:
Traceback (most recent call last):
  File "./cmor-fixer.py", line 165, in <module>
    main()
  File "./cmor-fixer.py", line 153, in main
    modifications = pool.map(worker, considered_files)
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.worker'
The run then does not exit by itself and has to be interrupted manually.
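The traceback points at the cause: `multiprocessing.Pool.map` has to pickle the function it sends to the worker processes, and pickle can only serialize functions that are importable by name at module level. A `worker` defined inside `main()` is a local object and cannot be pickled. The actual fix applied in the branch is not shown in this thread; the sketch below shows one common way around the problem, with a module-level function and `functools.partial` to bind extra arguments. The names `fix_file` and `process_files`, and the `dry_run` parameter, are hypothetical stand-ins.

```python
import multiprocessing
from functools import partial


def fix_file(path, dry_run):
    # Hypothetical stand-in for the per-file fix. It is defined at module
    # level, so pickle can refer to it by name. Returns 1 if the file would
    # be modified, 0 otherwise.
    return 1 if path.endswith(".nc") else 0


def process_files(paths, npp, dry_run=True):
    # A partial of a module-level function is picklable as long as the
    # bound arguments are picklable themselves.
    worker = partial(fix_file, dry_run=dry_run)
    if npp <= 1:
        # Serial path: no pool and no pickling involved.
        return [worker(p) for p in paths]
    with multiprocessing.Pool(processes=npp) as pool:
        # pool.map returns the results in the order of the input list.
        return pool.map(worker, paths)


if __name__ == "__main__":
    print(process_files(["a.nc", "b.txt", "c.nc"], npp=2))  # [1, 0, 1]
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module, since unguarded pool creation would then be executed again in every child.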
A better test would be to run on the batch nodes. However, there I run into a netcdf4 error that I do not understand yet, because this should work the same way as with ece2cmor3 itself.
Hi @goord, yes, the fix seems to work, at least on the head node when I run it directly. In that case it also seems to speed things up, for instance with --npp 8. Unfortunately I still have the trouble on the compute nodes, which would provide the best test.
The parallel_proc branch for parallel computing has been merged and seems to scale nicely.
The earlier mentioned trouble with the submit script, which prevented testing on the compute nodes, is solved.