upenn-acg/ProcessCache

Directory Handling

gatoWololo opened this issue · 1 comment

Example Problem

Consider the following workload. make is executed with the Makefile:

%: %.c
    gcc -o $@ $<

This says: for every C file, compile it into an executable named after the C file without its extension. Assume our directory has the files one.c, two.c, three.c, and four.c. IOTracker/ProcessCache would detect these four files as inputs to our computation. Now imagine we add an additional file, five.c. The correct behavior is to reexecute the computation, since there is now an additional input to our make process. But with our current design, ProcessCache would see that none of the recorded inputs have changed (all inputs were recorded on the first execution), so it would skip running the process entirely. Wrong!
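To make the failure concrete, here is a minimal sketch (not ProcessCache's actual code; the cache shape and hash values are hypothetical) of a cache check keyed only on the files recorded during the first run. Adding five.c leaves every recorded input unchanged, so the lookup still reports a hit:

```rust
use std::collections::HashMap;

// Hypothetical cache state: each recorded input file mapped to the content
// hash observed during the first execution.
fn is_cache_hit(recorded: &HashMap<String, u64>, current: &HashMap<String, u64>) -> bool {
    // Only files recorded on the first run are ever consulted; files that
    // appeared afterwards are invisible to this check.
    recorded
        .iter()
        .all(|(file, hash)| current.get(file) == Some(hash))
}

fn main() {
    let recorded: HashMap<String, u64> = ["one.c", "two.c", "three.c", "four.c"]
        .iter()
        .map(|f| (f.to_string(), 42)) // fake content hashes
        .collect();

    // Second run: five.c has appeared, but no recorded input changed.
    let mut current = recorded.clone();
    current.insert("five.c".to_string(), 7);

    // The new file never enters the comparison, so we wrongly skip reexecution.
    assert!(is_cache_hit(&recorded, &current));
    println!("cache hit despite new file: {}", is_cache_hit(&recorded, &current));
}
```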


The problem is that we are missing one input to our computation: the reading of all files in the directory. This maps down to the getdents/getdents64 ("get directory entries") system calls. In general, any filesystem system call that does a "for all" read has this issue; thankfully, this is the only one that comes to mind right now.
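For reference, this is the operation as seen from the high-level side: on Linux, std::fs::read_dir is ultimately served by getdents64 (through libc's readdir). A small sketch, using a temporary directory as a stand-in for the build directory:

```rust
use std::fs;
use std::path::Path;

// Return the sorted entry names in `dir`. On Linux this traversal is backed
// by the getdents64 syscall under the hood.
fn list_entries(dir: &Path) -> std::io::Result<Vec<String>> {
    let mut names: Vec<String> = fs::read_dir(dir)?
        .filter_map(|e| e.ok())
        .map(|e| e.file_name().to_string_lossy().into_owned())
        .collect();
    names.sort(); // getdents returns entries in no particular order
    Ok(names)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("getdents_demo");
    let _ = fs::remove_dir_all(&dir); // start from a clean directory
    fs::create_dir_all(&dir)?;
    for f in ["one.c", "two.c"] {
        fs::write(dir.join(f), "")?;
    }
    println!("{:?}", list_entries(&dir)?); // ["one.c", "two.c"]
    fs::remove_dir_all(&dir)
}
```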

Notes on getdents

The getdents API is very low level. It doesn't open file descriptors or allow you to manipulate files. Instead, it just fills a buffer with linux_dirent structs, which contain the inode and name of each entry. This doesn't help us directly, but it is worth thinking about.

getdents is non-recursive, which is nice: we don't have to consider subdirectories or recursively read directories ourselves. From a syscall-interception point of view, reads of subdirectories will simply show up as separate getdents calls.

Solution?

We could handle this by considering any directory that gets read to be an input to the computation. Then, when a directory is modified (as in the example above), this counts as an input change and the computation is reexecuted. Inotify has support for being notified on directory modification, which is nice.
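One way to realize "directory as input" without watching the directory at all is to record a fingerprint of the entry listing observed at getdents time, and compare it on the next run. A sketch under that assumption (the function names are hypothetical, not ProcessCache's):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical input fact: a fingerprint of the directory listing as it was
// observed (via getdents) during the cached execution. If the set of entry
// names changes -- e.g. five.c appears -- the fingerprint changes, and the
// cache check treats the directory input as modified.
fn listing_fingerprint(entries: &[&str]) -> u64 {
    let mut sorted: Vec<&str> = entries.to_vec();
    sorted.sort(); // getdents order is unspecified, so normalize first
    let mut h = DefaultHasher::new();
    sorted.hash(&mut h);
    h.finish()
}

fn main() {
    let first_run = listing_fingerprint(&["one.c", "two.c", "three.c", "four.c"]);
    let second_run = listing_fingerprint(&["one.c", "two.c", "three.c", "four.c", "five.c"]);
    // Adding five.c changes the directory input, forcing reexecution.
    assert_ne!(first_run, second_run);
    // The same entries in a different order are still the same input.
    assert_eq!(first_run, listing_fingerprint(&["four.c", "three.c", "two.c", "one.c"]));
    println!("fingerprints differ: {}", first_run != second_run);
}
```

Note this has exactly the conservatism discussed below: any new entry name changes the fingerprint, whether or not the program ever reads that file.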

This approach may end up being too conservative and reexecuting unnecessarily. For example, if someone adds an unrelated file that is never used by the program (touch omar.txt), under this approach we would have to reexecute the program even though there was no need to, since the input directory has been modified.

I don't know if this will work... When do we start checking for changes to a directory? After the respective exec computation is done? That won't work: I expect any multi-exec computation will write output files to the relevant directory, making it look like the directory has been modified. This issue is long enough, so we can talk about it in person, but it's something to think about.

krs85 commented

Closing as it is no longer relevant.