Add parallel task option

Question

It is not certain that this will improve performance much, because the fix itself takes very little time. Still, it would be good to include a parallel task option, so that its effect on performance can be tested in practice.

So the proposal is to add an option like the one in ece2cmor, taking one argument:

--npp N                     Number of parallel tasks
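
As an illustration of the request, here is a minimal sketch of how such an option could be declared with Python's argparse. Only the `--npp N` flag and its help text come from this issue; the function name `build_parser`, the positional `datadir` argument, and the default of 1 are assumptions made for the example and are not taken from cmor-fixer or ece2cmor.

```python
import argparse


def build_parser():
    """Build a parser with a parallel task option (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Sketch of a CLI with a parallel task option")
    # Hypothetical positional argument: the directory tree to process.
    parser.add_argument("datadir", metavar="DIR", type=str,
                        help="Directory containing the files to process")
    # The option requested in this issue; the default of 1 (serial) is an assumption.
    parser.add_argument("--npp", metavar="N", type=int, default=1,
                        help="Number of parallel tasks")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the sketch self-contained.
args = build_parser().parse_args(["--npp", "4", "some/data/dir"])
print(f"Would process {args.datadir} with {args.npp} parallel task(s)")
```

Defaulting to a single task would keep the existing serial behaviour unless the user asks for more.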

Answer 1 · 2019-12-16T12:01:41.000Z

This can be tested in the parallel_proc branch

Answer 2 · 2019-12-31T12:51:43.000Z

I tried to test the parallel_proc branch with --npp > 1, but without success so far.

What does work is running with --npp 1 on the head node:

./cmor-fixer.py --verbose --dry --forceid --olist --npp 1 /lustre2/projects/model_testing/reerink/cmorised-results/cmor-cmip-test-all-11/t002/ifs/001/CMIP6/

However, for instance:

./cmor-fixer.py --verbose --dry --forceid --olist --npp 2 /lustre2/projects/model_testing/reerink/cmorised-results/cmor-cmip-test-all-11/t002/ifs/001/CMIP6/

gives the error:

Traceback (most recent call last):
  File "./cmor-fixer.py", line 165, in <module>
    main()
  File "./cmor-fixer.py", line 153, in main
    modifications = pool.map(worker, considered_files)
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/lustre2/projects/model_testing/reerink/miniconda3/envs/cmorfixer-par/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.worker'

after which the run has to be interrupted manually.

A better test would be to use the batch nodes. However, there I run into a netcdf4 error that I don't understand so far, because this should work the same way as with ece2cmor3 itself.

Answer 3 · 2020-01-06T10:41:24.000Z

Hi @treerink, I pushed a fix for this. Hopefully it works now.
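
The pushed fix itself is not shown in this thread. For context, the `Can't pickle local object 'main.<locals>.worker'` error generally occurs because `multiprocessing.Pool` pickles the function it sends to the worker processes, and a function defined inside another function cannot be pickled. A common remedy, sketched below under that assumption, is to define the task at module level and bind any per-run options with `functools.partial`. All names here (`process_file`, `make_local_worker`, `is_picklable`, `run`, the `dry` flag) are hypothetical and do not reflect the actual cmor-fixer code.

```python
import pickle
from functools import partial
from multiprocessing import Pool


def process_file(path, dry=True):
    """Hypothetical per-file task, defined at module level so it can be pickled.

    Returns 1 if the file would be modified, 0 otherwise (placeholder logic).
    """
    return 1 if path.endswith(".nc") else 0


def make_local_worker(dry=True):
    """Reproduce the failing pattern: a worker defined inside another function."""
    def worker(path):
        return process_file(path, dry=dry)
    return worker


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


def run(paths, npp=1, dry=True):
    """Apply process_file to every path, using a pool when npp > 1."""
    # partial of a module-level function stays picklable, unlike a closure.
    worker = partial(process_file, dry=dry)
    if npp <= 1:
        return [worker(p) for p in paths]
    with Pool(processes=npp) as pool:
        return pool.map(worker, paths)


if __name__ == "__main__":
    files = ["a.nc", "b.nc", "notes.txt"]
    print("local worker picklable:", is_picklable(make_local_worker()))
    print("modifications:", run(files, npp=2))
```

Keeping a serial path for `npp <= 1` avoids the pool overhead entirely and makes it easy to compare serial and parallel runs.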

Answer 4 · 2020-01-06T16:26:00.000Z

Hi @goord, yes, the fix seems to work, at least on the head node when I run it directly. In that case it also seems to speed things up, for instance with --npp 8. Unfortunately I still have the trouble on the compute nodes, which would provide the best test.

Answer 5 · 2020-01-06T16:27:38.000Z

The parallel_proc branch has been merged, and the parallel processing seems to scale nicely.

The trouble mentioned earlier with the submit script, which prevented testing on the compute nodes, is solved.