pangeo-forge/pangeo-forge-recipes

Strange Interactions with Transposition and StoreToZarr

ranchodeluxe opened this issue · 5 comments

Versions:

  • pangeo-forge-runner==0.10.2
  • recipe.py
  • recipe versions

Problem:

Putting up the bat signal on this one 🦇 📡 because it's kept us confused for days. On both the LocalDirectRunner and Flink, we've noticed that this recipe, which transposes coordinates, will either hang/stall or dump zero useful tracebacks about where it's failing.
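For context, the recipe is shaped roughly like the sketch below. This is a minimal sketch only, not the actual recipe (which lives behind the protected bucket): the URLs, dimension names, and the way the transposition is expressed are all assumptions.

```python
# Minimal sketch of the recipe shape (URLs and dims are hypothetical).
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

pattern = pattern_from_file_sequence(
    ["s3://protected-bucket/file_0.nc", "s3://protected-bucket/file_1.nc"],
    concat_dim="time",
)

with beam.Pipeline() as p:
    (
        p
        | beam.Create(pattern.items())
        | OpenURLWithFSSpec()
        | OpenWithXarray(file_type=pattern.file_type)
        # The coordinate transposition happens somewhere in the recipe;
        # here it is sketched as a simple per-dataset step.
        | beam.MapTuple(lambda index, ds: (index, ds.transpose("y", "x", ...)))
        | StoreToZarr(
            store_name="output.zarr",
            combine_dims=pattern.combine_dim_keys,
        )
    )
```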

Looking for ideas about the finicky nature of this beast if you have any 🙇

Unfortunately, the source data is in a protected S3 bucket 😞 and the recipe is written to leverage the implied AssumeRole behind the scenes, but there's a JupyterHub cluster you can be added to if you want to test it out.

Related to #709 and #715 and h5py/h5py#2019

Below is an issue dump of what we've learned, plus a related thread from the past with great detail ✨

MRE:

Multiple different jobs (beyond the one in this issue) seem to hang in apache.beam. The first successful attempt to get out of the situation was to remove layers of interference: we turned off fsspec's "readahead" cache. All jobs that had hangs were able to get quite a bit further before hanging again, and in some cases (like this issue) that change surfaced stack traces that had possibly been swallowed before, though we still need to verify that. Eventually, however, the hangs came back.
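For reference, turning off the readahead cache amounts to passing a different cache_type through the fsspec open kwargs, roughly as below. This is a sketch with a hypothetical URL; where exactly the open kwargs get threaded through the recipe is omitted.

```python
import fsspec

# The issue is about fsspec's "readahead" cache; cache_type="none" disables
# client-side caching so every read goes straight to the object store.
url = "s3://some-bucket/some-file.nc"  # hypothetical URL
with fsspec.open(url, mode="rb", cache_type="none") as f:
    header = f.read(8)
```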

Getting an MRE was hard, so instead we decided to let things hang and take the opportunity to inspect thread traces and dumps to see what we could learn. The environment we were working in didn't give us the privileges to install gdb, gcore, or strace, so we used py-spy instead.

Investigation:

  • we kicked off another job that always hangs on Beam's LocalDirectRunner with a single process reading from S3 via s3fs

  • using ps ax -F --forest, I inspected the most nested process commands until I knew the final Beam process had kicked off and was running (even though we set the runner to use a single process, there are still OS forks of bash, pangeo-forge-runner, and in fact two Beam processes to think about 😓)

  • we waited for memory and CPU usage to drop as a signal that the job was hung

  • I ran ps -L --pid <PID> on the most nested PID from above to get some thread IDs that I wanted to match in the next step

  • Then, using py-spy (which is a great tool), I pointed it at the same PID from above and did a thread stack dump: py-spy dump --pid <PID>

  • The thread stack dump output is great because it shows all the idle gRPC threads plus two idle threads trying to talk to S3. One is the fsspec event loop, where we can see xarray's CachingFileManager.__del__ and the subsequent h5netcdf close calls happening. The other is a thread doing Beam work that is trying to fetch byte ranges from S3 via xarray

  • The docstring for CachingFileManager mentions that GC events can trigger it via __del__. This got us thinking about how the open file handles from #709 could exacerbate a problem where GC wants to run more often

  • we forked xarray and naively added gc.disable() to the xarray/backends/file_manager.py module, and the hangs stopped happening, while disabling GC in other spots didn't quite work (see the sketch after this list)

  • Then, through a series of related issue threads, we wound up on this comment, which smells about right
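For the record, the fork change was essentially just disabling the garbage collector in the file-manager module. A minimal sketch of the same idea expressed outside xarray (not the actual fork diff) looks like this:

```python
import contextlib
import gc

@contextlib.contextmanager
def gc_disabled():
    """Temporarily disable Python's cyclic garbage collector.

    The working theory: a GC run can fire CachingFileManager.__del__ (and the
    h5netcdf close calls behind it) on the fsspec event-loop thread at an
    unfortunate moment, so suppressing collection around the xarray I/O
    avoids the deadlock.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Hypothetical usage around the open/read path:
# with gc_disabled():
#     ds = xr.open_dataset(open_file, engine="h5netcdf")
#     ds.load()
```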

Next Steps and Ideas:

  • reduce the MRE even further now that we have a good theory:

    • however, we need to find which library or interaction is creating the cycles that we think get GC'd. The older MREs from the thread above work fine because the circular references were removed

    • we know we have reference cycles from open fsspec file-like handles, so confirm that with a good visual (see the first sketch after this list)

  • workaround: move the GC disabling as close as possible to the xarray operations (this was attempted before but didn't help, so let's try again)

  • workaround: confirm that runs using non-async fsspec filesystems, such as the local filesystem, work

  • workaround: use the Beam Python SDK versions of the cloud-provider storage APIs, which appear to be synchronous ✅ (see the second sketch after this list)

  • workaround: maybe resurrect the old synchronous fsspec implementation and see if we can use that, or build our own

  • workaround: add weak references in spots so GC isn't so active (once we find where the cycles are)
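For the visual confirmation of cycles involving fsspec file-like objects, one option (an assumption about tooling, not something we've run yet) is to combine gc's DEBUG_SAVEALL mode with objgraph:

```python
import gc
import objgraph  # third-party: pip install objgraph (rendering needs graphviz)

# Force a collection pass and keep everything the collector considers garbage.
gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()

# Any fsspec file objects sitting in gc.garbage are kept alive by a cycle.
cyclic_files = [obj for obj in gc.garbage if type(obj).__name__.endswith("File")]
print(f"{len(cyclic_files)} file-like objects kept alive by cycles")

# Draw the back-reference graph for one of them to see what closes the cycle.
if cyclic_files:
    objgraph.show_backrefs(cyclic_files[0], max_depth=5, filename="fsspec_cycle.png")
```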
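For the Beam Python SDK storage workaround, the idea is to read bytes through apache_beam.io.filesystems, which appears to stay synchronous for s3:// paths. A sketch with a hypothetical path; the S3 filesystem needs the apache-beam[aws] extras installed:

```python
from apache_beam.io.filesystems import FileSystems

path = "s3://some-protected-bucket/some-file.nc"  # hypothetical path

# FileSystems.open returns a plain file-like object, no asyncio event loop.
with FileSystems.open(path) as f:
    header = f.read(8)
    print(header)
```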

Could try with copy_to_local=True in OpenWithXarray.
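In recipe terms, that would look roughly like the following sketch (other arguments elided; `pattern` is whatever the recipe already defines):

```python
from pangeo_forge_recipes.transforms import OpenWithXarray

# copy_to_local=True copies each file to a local path and hands that path to
# xarray, which takes the async fsspec file object (and its event-loop
# thread) out of the read path entirely.
open_with_xarray = OpenWithXarray(
    file_type=pattern.file_type,
    copy_to_local=True,
)
```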

With my new s3fs fork, including S3SyncFileSystem and S3SyncFile, I can get the StoreToPyramid workflows to run without hanging. I did a rough "literalist" translation of the async code, which isn't what we'll want in the end. A second pass will be needed to figure out the best approaches for dealing with async patterns such as asyncio.gather(*[futures]), beyond possibly just looping:
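To illustrate what the "literalist" translation means (an illustrative pattern only, not the actual fork code): where the async filesystem fans out with asyncio.gather, the sync version currently just loops, and a second pass could restore concurrency with a thread pool.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Async original: issue all the byte-range fetches concurrently.
async def fetch_ranges_async(fetch_one, ranges):
    return await asyncio.gather(*[fetch_one(start, end) for start, end in ranges])

# "Literalist" sync translation: the same work, one request at a time.
def fetch_ranges_sync(fetch_one, ranges):
    return [fetch_one(start, end) for start, end in ranges]

# A possible second pass: keep it synchronous but regain concurrency.
def fetch_ranges_threaded(fetch_one, ranges, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: fetch_one(*r), ranges))
```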

Also, look into using this: https://www.hdfgroup.org/solutions/cloud-amazon-s3-storage-hdf5-connector/ instead of any repurposed synchronous tooling
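That connector corresponds to HDF5's read-only S3 (ros3) virtual file driver. If the HDF5 library under h5py is built with ros3 enabled, opening directly from S3 looks roughly like the sketch below; the URL, region, and credentials are placeholders, and our AssumeRole flow would still need to be mapped onto it.

```python
import h5py

# Requires an HDF5 build with the ros3 driver enabled.
f = h5py.File(
    "https://some-protected-bucket.s3.us-west-2.amazonaws.com/some-file.nc",  # hypothetical
    mode="r",
    driver="ros3",
    aws_region=b"us-west-2",
    secret_id=b"<access-key-id>",
    secret_key=b"<secret-access-key>",
)
print(list(f.keys()))
```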