Avoid "Disk cache database error" when running multiple instances on HPC
dekuenstle opened this issue · 8 comments
On our attempt to render many images simultaneously with Mitsuba on an HPC system, Dr.Jit crashes when multiple renderings are scheduled on the same compute node:
Critical Dr.Jit compiler failure: jit_optix_check(): API error 7012 (OPTIX_ERROR_DISK_CACHE_DATABASE_ERROR): "Disk cache database error" in /project/ext/drjit-core/src/optix_api.cpp:382.
I assume that Dr.Jit tries to write a cache to a location where a previously started process has written its cache (and locked it).
Could you please help us debug this, e.g. by answering: (a) where are the caches stored, and (b) is there any way to configure the cache location?
Thanks in advance!
Looks like your `~/.drjit/optix7cache.db` file was corrupted (which should not happen, since OptiX locks the file while it is being accessed concurrently). Is it possible that you are using NFS or a similar network file system? That could defeat the locking mechanism.
Thanks for the prompt response!
I assume the problem is that `~/.drjit` is shared across all nodes and our file system does not handle the locking properly. Is there a configuration option (e.g. an environment variable for the cache location) so that every instance can write to its own directory?
The path is computed here: https://github.com/mitsuba-renderer/drjit-core/blob/master/src/init.cpp#L88, and we don't provide a good way of customizing it at the moment. You could try overwriting the `HOME` environment variable as a workaround.
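For anyone running into this later, a minimal sketch of that `HOME` workaround in a job script. The script name `render.py`, the use of `$TMPDIR`, and `$SLURM_JOB_ID` are illustrative assumptions (not something Dr.Jit requires); adapt them to your scheduler:

```shell
# Point HOME at per-job, node-local storage so Dr.Jit's OptiX cache
# (~/.drjit/optix7cache.db) is not shared over the network file system.
# Falls back to the shell PID if SLURM_JOB_ID is unset.
export HOME="${TMPDIR:-/tmp}/drjit-home-${SLURM_JOB_ID:-$$}"
mkdir -p "$HOME"

# render.py is a placeholder for the actual Mitsuba/Dr.Jit script.
python render.py
```

Note that overriding `HOME` affects anything else in the process that reads it (dotfiles, config lookups), which is why a dedicated cache-path variable would be preferable.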
Thanks, overwriting `HOME` appears to be a workable workaround for us, but it could have side effects for others.
You might consider introducing a dedicated environment variable for the cache location, because this shared-home caching is the only issue we observed when running Mitsuba/Dr.Jit massively in parallel on HPC clusters (and many HPC clusters have such a shared `HOME`). Otherwise, it works like a charm! Thanks for your work and the quick support :-)
Out of interest, this is an HPC system with OptiX-capable GPUs? It sounds really fancy!
It's a cluster that is typically used with deep learning frameworks (TensorFlow, PyTorch), so the nodes are equipped with 2080TI or V100 GPUs; they work for rendering many variations of a scene as well ;-)
Hello, I'm facing the same problem in more or less the same HPC setting. Are you planning to address this in a future Mitsuba/Dr.Jit release, or should I use a workaround instead?
Thanks,
NK