Strange Interactions with Transposition and StoreToZarr
ranchodeluxe opened this issue · 5 comments
Versions:
- pangeo-forge-runner==0.10.2
- recipe.py
- recipe versions
Problem:
Putting up the bat signal on this one b/c it's kept us confused for days. On the LocalDirectRunner and Flink we've noticed that this recipe with transposing coordinates will either hang/stall or dump zero useful tracebacks about where it's failing. Looking for ideas about the finicky nature of this beast if you have any.
Unfortunately, the source data is in a protected s3 bucket and the recipe is written to leverage the implied AssumeRole behind the scenes, but there's a JH cluster you can be added to if you want to test it out.
Related to #709 and #715 and h5py/h5py#2019
Doing an issue dump of what we've learned, along with a related thread from the past that has great detail.
MRE:
Multiple different jobs (beyond the one in this issue) seem to hang in apache.beam. The first successful attempt to get out of the situation was to remove layers of interference: we turned off fsspec's "readahead" cache. All jobs that had hangs were able to get quite a bit further before hangs happened again. In some cases (like this issue) that change possibly surfaced useful stack traces that were previously being swallowed, but we need to verify that. Eventually, however, there were still hangs.
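For context, a minimal sketch of what turning off the readahead cache looks like when opening a remote file with fsspec directly (the bucket/path here are placeholders; exactly where `cache_type` gets threaded through the recipe depends on the transform being used):

```python
import fsspec
import xarray as xr

# Placeholder path -- the real source bucket in this issue is protected.
url = "s3://some-bucket/some-file.nc"

fs = fsspec.filesystem("s3", anon=False)

# cache_type="none" disables fsspec's default "readahead" buffering, so
# every read turns into an explicit byte-range request against s3.
with fs.open(url, mode="rb", cache_type="none") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```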
Getting an MRE was hard and instead we decided to let things hang and take the opportunity to inspect thread traces and dumps to see what we can learn. The environment we were working on didn't give us privilege to install gdb
or gcore
, strace
. We used py-spy
instead.
Investigation:
- we kicked off another job that always hangs on beam's LocalDirectRunner with a single process reading from s3fs
- using `ps ax -F --forest` I inspected the most nested process commands until I knew when the final process for beam had kicked off and was running (even though we set the runner to use one process, there are still os forks of bash, pangeo-forge-runner and in fact two beam processes to think about)
- we waited for memory and cpu to fall to guess when it was hung
- I ran `ps -L --pid <PID>` on the most nested PID from above to get some thread ids that I wanted to match in the next step
- then using py-spy (which is a great tool) I pointed at the same PID and did a thread stack dump: `py-spy dump --pid <PID>`
- the thread stack dump output is great b/c it shows all the idle threads for grpc and two idle threads trying to talk with s3. One thread is the fsspec event loop, where we can see xarray's `CachingFileManager.__del__` and subsequent h5netcdf close calls happening. The other is a thread related to beam work that is trying to fetch byte ranges from s3 using xarray
- the docstring for `CachingFileManager` mentions that GC events can trigger it via `__del__`. This got us thinking about how the open file handles from #709 could exacerbate a problem where GC wants to run more often
- we forked xarray and naively added `gc.disable()` to the `xarray/backends/file_manager.py` module and the hangs stopped happening, while disabling gc in other spots didn't quite work (see the sketch after this list)
- then through a series of related issue threads we wound up on this comment, which smells about right
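To make the GC theory concrete, here is a minimal sketch (not the actual xarray fork) of what the naive experiment amounts to, plus a `gc.callbacks` hook that could help correlate collections with the hangs:

```python
import gc

def _log_gc(phase, info):
    # gc invokes this around every collection; printing the generation makes
    # it easy to line collections up with the moment a worker goes quiet.
    if phase == "start":
        print(f"gc start: generation={info['generation']}")

gc.callbacks.append(_log_gc)

# The fork effectively did the equivalent of this at import time inside
# xarray/backends/file_manager.py: stop automatic, threshold-triggered
# collection so CachingFileManager.__del__ can't fire mid-read on the
# fsspec event-loop thread. Reference counting still frees acyclic garbage;
# cyclic garbage just accumulates until gc.collect() is called explicitly.
gc.disable()
```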
Next Steps and Ideas:
- reduce the MRE even further now that we have a good theory:
  - however, we need to find which lib or interactions are causing the cycles that we think get GC'd. The older MREs from the thread above work fine b/c the circular references were removed
  - we know we have reference cycles from open fsspec file-like handles, so confirm that with a good visual (see the sketch after this list)
- work around: move gc disabling as close as possible to the xarray operations (it was attempted before but didn't help, so let's try again)
- work around: confirm that runs using fsspec non-async file systems, such as local, work
- work around: use beam's python sdk versions of cloud provider storage APIs, which seem to be synchronous
- work around: maybe resurrect the old sync fsspec and see if we can use that, or build our own
- work around: add weak references in spots so GC isn't so active (once we find where the cycles are)
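One possible way to get that visual, sketched below on the assumption that objgraph (plus graphviz) can be installed in the environment; it is not part of the recipe today:

```python
# Find live fsspec file objects that are still reachable and draw what holds
# on to them; reference cycles show up as loops in the rendered graph.
import gc

import objgraph  # third-party; needs graphviz installed for the PNG output
from fsspec.spec import AbstractBufferedFile

gc.collect()  # clear out anything that is trivially collectable first

open_files = [o for o in gc.get_objects() if isinstance(o, AbstractBufferedFile)]
print(f"live fsspec file objects: {len(open_files)}")

if open_files:
    # Renders a back-reference graph for the first file object found.
    objgraph.show_backrefs(open_files[0], max_depth=5, filename="fsspec_backrefs.png")
```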
Could try with `copy_to_local=True` in `OpenWithXarray`.
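A hedged sketch of that suggestion, assuming the usual pangeo-forge-recipes transform chain and a placeholder file pattern (the real recipe's FilePattern and StoreToZarr/StoreToPyramid stages would slot in here):

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray

# Placeholder URLs standing in for the protected source bucket.
pattern = pattern_from_file_sequence(
    ["s3://some-bucket/file_0.nc", "s3://some-bucket/file_1.nc"],
    concat_dim="time",
)

with beam.Pipeline() as p:
    (
        p
        | beam.Create(pattern.items())
        | OpenURLWithFSSpec()
        # copy_to_local=True downloads each file to local disk before opening,
        # so the hot xarray/h5netcdf reads avoid the async s3fs code path.
        | OpenWithXarray(file_type=pattern.file_type, copy_to_local=True)
        # | ... StoreToZarr / StoreToPyramid as in the original recipe ...
    )
```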
With my new s3fs fork and `S3SyncFileSystem` and `S3SyncFile` I can get the StoreToPyramid workflows to run without hanging. I did a rough "literalist" translation of the async code, which isn't what we'll want in the end. A second pass will be needed to go through and figure out the best approaches for dealing with async patterns such as `asyncio.gather(*[futures])` beyond possibly just looping (a sketch of that translation follows the list below). The setup:
- this feedstock
- use this s3fs branch: https://github.com/ranchodeluxe/s3fs/commits/gc/fetch_sync
- these pangeo-forge-recipes changes (also on the ndpyramid branch)
- make sure my config targets `c.TargetStorage.fsspec_class = "s3fs.S3SyncFileSystem"`
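For anyone following along, a toy illustration (assumed, not copied from the fork) of what the "literalist" translation looks like and why a second pass will be needed:

```python
# The async pattern in s3fs looks roughly like:
#
#     results = await asyncio.gather(*[self._cat_file(p, start, end)
#                                      for p, start, end in requests])
#
# A first, literal sync pass replaces it with a plain sequential loop over
# blocking calls, trading concurrency for simplicity.
from typing import Callable

def cat_ranges_sync(fetch_one: Callable[[str, int, int], bytes],
                    requests: list[tuple[str, int, int]]) -> list[bytes]:
    results = []
    for path, start, end in requests:
        # Sequential byte-range fetches: correct, but no overlap of requests,
        # which is why a thread pool or batched ranges is the obvious follow-up.
        results.append(fetch_one(path, start, end))
    return results
```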
Also, look into using this: https://www.hdfgroup.org/solutions/cloud-amazon-s3-storage-hdf5-connector/ instead of any repurposed synchronous tooling
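If the HDF5 ROS3 connector route gets explored, it could look something like the sketch below. This assumes an h5py/libhdf5 build with the ros3 driver enabled (many pip wheels don't ship it) and uses placeholder bucket, credentials, and variable names:

```python
import h5py

# Let libhdf5 issue the S3 range requests itself, bypassing fsspec/s3fs and
# the async event loop entirely.
url = "https://some-bucket.s3.us-west-2.amazonaws.com/some-file.nc"

with h5py.File(
    url,
    mode="r",
    driver="ros3",
    aws_region=b"us-west-2",
    secret_id=b"<ACCESS_KEY_ID>",       # placeholder credentials; how an
    secret_key=b"<SECRET_ACCESS_KEY>",  # AssumeRole flow fits is still TBD
) as f:
    print(list(f.keys()))               # groups/variables in the file
    print(f["some_variable"][:10])      # placeholder variable name
```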