pytroll/satpy

Satpy doesn't offer a way to close files, which causes issues on some filesystems


Describe the bug
This is the same underlying issue as #1889

Satpy Scenes do not offer a mechanism to manually close their open files when you're done interacting with a Scene object. This manifests as errors on Windows, as noted in #1889, but also on Linux when the Scene's files are stored on NFS.

To Reproduce

Expected behavior
Satpy offers a mechanism to close the files opened by a Scene, so we can avoid errors and manual use of the garbage collector, e.g.:

  • New method: scene.close()
  • New context manager: with Scene(...) as scene:

Actual results
An error is raised when the code attempts to delete the directory containing the already-deleted granules:

OSError: [Errno 16] Device or resource busy: '.nfs00000000000018990000000a'

A workaround can be used to avoid this:

del scene
import gc; gc.collect()

This is necessary because satpy offers no mechanism to manually close all files held by a Scene.

Environment Info:

  • OS: Ubuntu 20.04
  • Satpy Version: 0.36.0
  • PyResample Version: 1.23.0
  • Readers and writers dependencies (when relevant): modis_l1b reader

Thanks for making an entire repository for reproducibility. As for the issue, I'm not sure this is strictly possible, at least not in the way your example code wants it to work. I've pasted a stripped down version of your script below:

if __name__ == '__main__':
    download_data()
    scene = create_scene()
    remove_data_dir()  # error

    del scene
    import gc; gc.collect()

    remove_data_dir()  # try again, successful

The main problem is that Satpy is using dask. When you create your Scene, the reader object under the hood may open some files for metadata, but in most cases it hasn't actually loaded any of the image data from disk. Dask has a reference to that file in one form or another. In the most basic case it has the low-level pyhdf/netcdf4/h5py/numpy object wrapping the file handle. In more complex cases it may only have the filename. Either way, the data isn't actually loaded into memory or used until you actually compute something (.compute() on the DataArray object or Scene.save_datasets()).
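
To illustrate the laziness described above, here is a minimal sketch; the file name and dataset name are placeholders, not taken from your repository:

from satpy import Scene

# Creating and loading only builds a dask graph; the image data stays on disk.
scn = Scene(reader='modis_l1b', filenames=['MOD021KM.A2022001.0000.061.hdf'])
scn.load(['1'])

lazy = scn['1'].data   # a dask array still backed by the (open) file
data = lazy.compute()  # only now is the image data actually read from disk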

So while we could add a .close() or similar context manager workflow, it wouldn't actually solve your problem. You can't delete the data until you've used it. Other options or topics:

  1. We could update specific readers to copy the data into memory and then give it to dask, but this would increase memory usage (possibly a lot) and likely hurt overall execution performance.
  2. Similar to 1, we could add the .close() method, which would tell the reader(s) to close all file handlers, but in your case I would also recommend calling Scene.persist() just before that. This would load all data into memory in a dask- and satpy-friendly way at the cost of memory. Theoretically the files could then be closed and deleted (a rough sketch follows this list).
  3. Functionality could be added to read your data files similar to how xarray/fsspec read S3-stored files, so that they are downloaded when they are used, streamed directly into the processing, and then deleted. They never exist on disk, or at least don't have to. Whether this would work, and how well it would perform, is reader and file format specific.
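
To make option 2 concrete, here is a rough sketch of what that workflow could look like, reusing the helper names from your script. scn.persist() is used as described in option 2 and is assumed here to return a Scene whose loaded data is held in memory; scn.close() is the hypothetical method under discussion and does not exist yet:

scn = create_scene()  # create the Scene and load the datasets of interest

scn = scn.persist()   # option 2: pull all loaded data into memory
scn.close()           # hypothetical: readers close/delete their file handlers

remove_data_dir()     # the files on disk are no longer needed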

I may be missing something, but I think this is a good place to start. Maybe you could provide more information about what your end goal is? Why are you deleting the files from NFS before any processing has occurred?

Hey, thanks so much for your response!

I think the thing that I didn't communicate clearly is that the workflow really looks like this. In my example repository I omitted the "processing" step for the sake of making a small and complete example. I've since updated it! What we want to do is close all open file handles after we know we're done using them, but before deleting those files. We're interested in your option 2, but we have no need to persist the data in memory after deletion.

if __name__ == '__main__':
    download_data()
    scene = create_scene()

    # Do some processing... as an example, maybe inside this function `scene.save_dataset` is called.
    create_images_from_scene(scene)

    # At this point, we know for sure all processing is complete, and we want to clean up. But we get an error:
    remove_data_dir()  # error

    del scene
    import gc; gc.collect()

    remove_data_dir()  # try again, successful

I was hoping to do this instead:

download_data()

with create_scene() as scene:
    create_images_from_scene(scene)

remove_data_dir()  # Woohoo!

I think this adds even more weight to some design discussions the @pytroll/satpy-core maintainers have had in the past about separating the reader creation/data loading portion of the Scene from the rest of the Scene. It would make this workflow mirror what you would do with other file reading tools and the stdlib open. For example:

with satpy.create_data_source(reader='abi_l1b', filenames=...) as abi_src:
    scn = Scene(abi_src)
    scn.load([...])
    scn.save_datasets()

Or something similar. But this logic doesn't exist; this is just me brainstorming for the future. For your specific near-term request(s), this all sounds fine to me. The Scene methods would have to be clearly documented, not only to declare what they are doing (telling Readers to delete all file handlers) but also what that means for the user going forward (no further data loading and no computation).
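
Purely to illustrate that documentation point, the shape of the addition might look something like the sketch below; the names and docstring wording are assumptions, not an agreed design:

class Scene:
    """Hypothetical sketch only, not the real satpy.Scene."""

    def close(self):
        """Tell all readers to close/delete their file handlers.

        After calling this, no further file-backed data can be loaded or
        computed from this Scene; persist or save anything needed first.
        """

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()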

@MattF-NSIDC How would you feel about starting a pull request for this type of functionality?

@djhoese I would love to! However, I'm a fairly new satpy user and I'm not sure how much, if any, time $DAY_JOB will give me to work on this, so I may not move very fast. I'll post here once I have time to get started, or once I know more about my availability.

If someone else is motivated to do this work in the near term, please let me know how I can help out.