EC-Earth/cmor-fixer

Add parallel task option

Closed this issue · 5 comments

It is not certain that this will improve performance much, because the fix itself takes very little time. Still, it would be good to include a parallel task option, so that we can test its performance in practice.

So the proposal is to add an option like the one in ece2cmor, with an argument:

--npp N                     Number of parallel tasks
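
A minimal sketch of how such an option could be declared with `argparse`. Only the `--npp N` flag and its help text come from this issue; the `build_parser` helper, the `datadir` positional, the description string and the default of 1 are assumptions made for illustration, not the actual cmor-fixer code.

```python
import argparse


def build_parser():
    # Hypothetical parser; only --npp and its help text are taken from the issue.
    parser = argparse.ArgumentParser(description="Illustrative option parser")
    parser.add_argument("datadir", metavar="DIR", type=str,
                        help="Directory containing the cmorised files (assumed name)")
    parser.add_argument("--npp", metavar="N", type=int, default=1,
                        help="Number of parallel tasks")
    return parser


# Parse an example command line instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--npp", "4", "some/output/dir"])
default_args = build_parser().parse_args(["some/output/dir"])
print(args.npp, default_args.npp)
```

With a default of 1 (an assumption here), omitting `--npp` would keep the serial behaviour.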
goord commented

This can be tested in the parallel_proc branch

I tried to test the parallel_proc branch with --npp > 1, but without success so far.

What does work is running with --npp 1 on the head node:

./cmor-fixer.py --verbose --dry --forceid --olist --npp 1 /lustre2/projects/model_testing/reerink/cmorised-results/cmor-cmip-test-all-11/t002/ifs/001/CMIP6/

However, for instance:

./cmor-fixer.py --verbose --dry --forceid --olist --npp 2 /lustre2/projects/model_testing/reerink/cmorised-results/cmor-cmip-test-all-11/t002/ifs/001/CMIP6/

gives the error:

Traceback (most recent call last):
  File "./cmor-fixer.py", line 165, in <module>
    main()
  File "./cmor-fixer.py", line 153, in main
    modifications = pool.map(worker, considered_files)
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.worker'

and the run then has to be interrupted manually.
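
For context, the `AttributeError` in the traceback above comes from `pool.map` pickling the callable in order to send it to the worker processes. `pickle` stores functions by reference to their qualified name, and a function defined inside `main()` cannot be looked up that way. One common remedy, which is not necessarily the fix that ended up in the branch, is to define the worker at module level and bind extra arguments with `functools.partial`. The sketch below shows the difference using `pickle` directly; `fix_file`, `can_pickle` and `make_local_worker` are hypothetical names.

```python
import pickle
from functools import partial


def fix_file(path, dry_run):
    # Hypothetical stand-in for the real per-file fix; defined at module level.
    return (path, dry_run)


def can_pickle(obj):
    # True if obj can be serialised, which multiprocessing needs for pool.map.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


def make_local_worker(dry_run):
    # Mirrors the failing pattern: a worker defined inside another function.
    def worker(path):
        return fix_file(path, dry_run)
    return worker


local_ok = can_pickle(make_local_worker(True))            # nested function
partial_ok = can_pickle(partial(fix_file, dry_run=True))  # module-level function
print(local_ok, partial_ok)
```

A picklable callable such as `partial(fix_file, dry_run=True)` can then be passed to `pool.map` in place of the nested `worker`.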

A better test would be to use the batch nodes. However, there I run into a netcdf4 error that I don't understand so far, because this should work the same way as with ece2cmor3 itself.

goord commented

Hi @treerink, I pushed a fix for this. Hopefully it works now.

Hi @goord, yes, the fix seems to work, at least on the head node when I run it directly. In that case it also seems to speed things up, for instance with --npp 8. Unfortunately I still have trouble on the compute nodes, which would provide the best test.

The parallel_proc branch has been merged, and the parallel option seems to scale nicely.

The trouble mentioned earlier with the submit script, which is needed to test the script on the compute nodes, has been solved.