beetbox/confuse

config file not properly closed

epifanio opened this issue · 1 comment

Hi,
I may be misusing the confuse library, but I am running a function that uses a method like:

import confuse
import logging
import pathlib

def get_logpath():
    try:
        config = confuse.Configuration("mmdtool", __name__)
        logfilepath = config["paths"]["logs"].get()
    except confuse.NotFoundError:
        logfilepath = "./logs/"
    if not pathlib.Path(logfilepath).exists():
        pathlib.Path(logfilepath).mkdir(parents=True, exist_ok=True)
    return logfilepath

It works fine for a while, but as the number of files processed increases, when I run my script in parallel over thousands of records, at some point the parallel job breaks with the following error:

ubuntu@pycsw-prod:/mnt/csw/dev/py-mmd-tools/script$ python3 convert_all.py -i /mnt/csw/metadata/nbs -t /mnt/csw/dev/mmd/xslt/mmd-to-iso.xsl -o /mnt/csw/metadata/nbs_iso/
os.walk("/mnt/csw/metadata/nbs")
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/confuse/yaml_util.py", line 85, in load_yaml
OSError: [Errno 24] Too many open files: '/home/ubuntu/.config/mmdtool/config.yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
  File "/usr/local/lib/python3.8/dist-packages/parmap/parmap.py", line 104, in _func_star_single
  File "convert_all.py", line 33, in writerecord
  File "/mnt/csw/dev/py-mmd-tools/py_mmd_tools/mmd_to_csw_iso.py", line 40, in mmd_to_iso
  File "/mnt/csw/dev/py-mmd-tools/py_mmd_tools/mmd_util.py", line 31, in setup_log
  File "/mnt/csw/dev/py-mmd-tools/py_mmd_tools/mmd_util.py", line 21, in get_logpath
  File "/home/ubuntu/.local/lib/python3.8/site-packages/confuse/core.py", line 558, in __init__
  File "/home/ubuntu/.local/lib/python3.8/site-packages/confuse/core.py", line 600, in read
  File "/home/ubuntu/.local/lib/python3.8/site-packages/confuse/core.py", line 574, in _add_user_source
  File "/home/ubuntu/.local/lib/python3.8/site-packages/confuse/yaml_util.py", line 88, in load_yaml
confuse.exceptions.ConfigReadError: file /home/ubuntu/.config/mmdtool/config.yaml could not be read: [Errno 24] Too many open files: '/home/ubuntu/.config/mmdtool/config.yaml'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "convert_all.py", line 56, in <module>
    main(metadata=args.input_dir, mmd2iso_xslt=args.input_xslt, outdir=args.output_dir)
  File "convert_all.py", line 42, in main
    y = parmap.map(writerecord, xmlfiles, mmd2iso_xslt=mmd2iso_xslt, outdir=outdir, pm_pbar=False)
  File "/usr/local/lib/python3.8/dist-packages/parmap/parmap.py", line 304, in map
    return _map_or_starmap(function, iterable, args, kwargs, "map")
  File "/usr/local/lib/python3.8/dist-packages/parmap/parmap.py", line 248, in _map_or_starmap
    output = result.get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
confuse.exceptions.ConfigReadError: file file /home/ubuntu/.config/mmdtool/config.yaml could not be read: [Errno 24] Too many open files: '/home/ubuntu/.config/mmdtool/config.yaml' could not be read

I tried to replace my code with:

with confuse.Configuration("mmdtool", __name__) as config:
    logfilepath = config["paths"]["logs"].get()

hoping that the config file would get closed, but that didn't work: I got an AttributeError: __enter__

Hi! Here's that load_yaml function that shows up in your traceback:

with open(filename, 'rb') as f:
    return yaml.load(f, Loader=loader)

We are in fact closing the file after reading it. You mentioned that you are running this program many times in parallel:

when I run my script in parallel over thousands of record

So it seems likely to me that these thousands of parallel processes are simultaneously opening the same file—even if it will shortly be closed again by all of them.

Any chance you can instead find a way to load your config once and share it across all the processes?
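One way to do that, as a rough sketch rather than a tested fix for this project: resolve the log path a single time in the parent process and hand the resulting plain string to each worker, so no worker ever constructs a confuse.Configuration. The writerecord signature, the run_all helper, and the per-record log file name below are hypothetical, and the standard library multiprocessing.Pool stands in for parmap.

```python
import functools
import multiprocessing
import pathlib


def writerecord(xmlfile, logpath):
    # Hypothetical worker: it receives the already-resolved log directory
    # as a plain string, so it never reads the config file itself.
    logfile = pathlib.Path(logpath) / (pathlib.Path(xmlfile).stem + ".log")
    return str(logfile)


def run_all(xmlfiles, logpath, processes=2):
    # Bind the shared value once; every task reuses the same string.
    worker = functools.partial(writerecord, logpath=logpath)
    with multiprocessing.Pool(processes) as pool:
        return pool.map(worker, xmlfiles)
```

In the parent script, get_logpath() would be called once before the pool starts, and its result passed in as logpath. With parmap, the same value could presumably be passed as one more keyword argument next to mmd2iso_xslt and outdir.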