LLNL/scr

Restart with a different number of ranks

SCR currently allows an application to restart with a different number of ranks. However, one cannot call the SCR restart API in that case.

https://scr.readthedocs.io/en/latest/users/integration.html#restart-without-scr

This is awkward for applications that can otherwise use the SCR restart API when restarting with the same number of ranks, since they then need to have two code paths:

  1. if restarting with the same number of ranks --> use the SCR restart API
  2. if restarting with a different number of ranks --> do not use the SCR restart API

It would be nice to merge these. It should be possible when leaving the files on the parallel file system, but there are checks and logic in the fetch process that currently do not support it.

One known problem is in reading the rank2file map. The read scatters the file list using kvtree, and it currently requires that the file be read by exactly the same number of ranks that wrote it.

if (kvtree_read_scatter(rank2file, filelist, scr_comm_world) != KVTREE_SUCCESS) {

We could work around that so the file info is distributed to the ranks in the current run. Either we let kvtree decide how the info gets spread out, or we modify the kvtree API so that the calling ranks can specify the new mapping.
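As a rough illustration of the "let kvtree decide" option, one possible default is a contiguous block mapping from writer ranks to reader ranks. The sketch below is only an assumption about what such a mapping could look like. `block_range` is a hypothetical helper, not an existing kvtree or SCR function, and it only computes which writer ranks' entries a given reader would receive.

```c
#include <assert.h>

/* Hypothetical helper, not part of kvtree or SCR: compute the half-open
 * range [start, end) of writer ranks whose rank2file entries would go to
 * reader `rank`, splitting `old_ranks` entries as evenly as possible
 * across `new_ranks` readers. */
void block_range(int old_ranks, int new_ranks, int rank,
                 int* start, int* end)
{
    assert(old_ranks >= 0 && new_ranks > 0);
    assert(rank >= 0 && rank < new_ranks);

    int base  = old_ranks / new_ranks;  /* entries every reader gets */
    int extra = old_ranks % new_ranks;  /* first `extra` readers get one more */

    /* readers before this one hold `base` entries each, plus one extra
     * entry for each of them whose rank is below `extra` */
    *start = rank * base + (rank < extra ? rank : extra);
    *end   = *start + base + (rank < extra ? 1 : 0);
}
```

With 8 writer ranks and 3 reader ranks, this assigns writers [0, 3) to reader 0, [3, 6) to reader 1, and [6, 8) to reader 2. If there are more readers than writers, the trailing readers get an empty range.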

For the remainder of the function, we stat each file to verify that it exists. It would be nice to keep that, and it's easy to handle.

scr/src/scr_fetch.c

Lines 251 to 258 in 79ff7ed

/* just stat the file to check that it exists */
for (i = 0; i < num_files; i++) {
  if (access(src_filelist[i], R_OK) < 0) {
    /* either can't read this file or it doesn't exist */
    success = 0;
    break;
  }
}
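If the file list is no longer tied to the rank that wrote each file, the same existence check could be split across whatever ranks are present in the current run. The following is a minimal sketch under that assumption. `count_unreadable` is a hypothetical helper, not SCR code, in which each rank checks every `nranks`-th entry of a shared list. In a real run the per-rank counts would still need to be combined, for example with `MPI_Allreduce`, so that all ranks agree on success or failure.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper, not part of SCR: count the files in `files` that
 * this rank cannot read. Rank r checks indices r, r + nranks,
 * r + 2*nranks, and so on, so the ranks together cover the whole list. */
int count_unreadable(const char** files, int num_files,
                     int rank, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);

    int missing = 0;
    for (int i = rank; i < num_files; i += nranks) {
        if (access(files[i], R_OK) < 0) {
            /* either can't read this file or it doesn't exist */
            missing++;
        }
    }
    return missing;
}
```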

The trickier part is that we then fill in the local filemap data structure with info about each file that a rank "owns". It's not clear what to do in this case. One option would be to have each rank register every file as though all files are shared by all ranks. This is not exactly scalable, but perhaps it's the safest option, since we don't know how they will be accessed.

scr/src/scr_fetch.c

Lines 267 to 272 in 79ff7ed

/* create a filemap for the files we just read in */
scr_filemap* map = scr_filemap_new();
for (i = 0; i < num_files; i++) {
  /* get source and destination file names */
  const char* src_file = src_filelist[i];
  const char* dest_file = dest_filelist[i];