upenn-acg/ProcessCache

Put the $ in P$ (Caching Discussion)

Closed this issue ยท 4 comments

krs85 commented

UPDATED

With IOTracker being fairly robust to clustal, and the skipping of executions being worked on separately, it is time to start brainstorming about caching. So now we know what the inputs and outputs are of an execution, and we can skip the execution. But should we? I think that's what left to figure out.

What makes something skippable?

  1. We have seen this execution before and have the appropriate output files, output file hashes, and input file hashes stored in our cache.
  2. Its inputs have not changed since the first time we saw them.
  3. It was successful. Executions that failed previously will not be skipped; they will be allowed to run and fail again (literal execve system calls that immediately fail is what I'm talking about here).
  4. It is deterministic / conflict free***** (I don't know what the definition of this is really yet, visions of dependency graphs dance in my head ๐Ÿ’ƒ)

How do we know when the inputs have changed?

We hash the version the execution is trying to use and compare it against our original hash for the input file. If they match, this execution can be skipped, i.e. we copy appropriate output files to appropriate absolute paths. Or maybe we don't even have to do that, if the output files are already there and also match their hashes we have in our cache.

What is an input?

A resource whose starting state contents are utilized in some way by the execution (like a file of data being read in).
Also cwd, environment variables, executables, command line arguments, etc...

What is an output?

A resource whose contents are changed by the execution (like a file being written to or created).

When do we start tracking an input?

When the execution opens a file as read only, we can imagine it's going to be used for reading the contents. We want to hash the file now, as this is the state we are looking for in the future when we see this execution again and are considering skipping it.

When do we start tracking an output?

When the execution opens a file as write only (creating a file also means opening it for write only), we can imagine the contents of that file are going to be changed by this execution. We hash the file at the end of the execution.

I opened another discussion based issue with a "simple" example as a walkthrough of the preliminary cache design (#47).

We might eventually want to build a background inotify daemon to track the inputs and outputs. This is pretty complicated, and we are not familiar with the inotify API, so this is a solution that will be explored if we need the performance boost.

I also think about things in this way:
A unique execution has:

  • inputs and outputs (not going into detail here)
  • "root process" executions (execve calls made by the original process)
  • child process executions (execve calls made by child processes the original process spawned)

Further Thoughts:

Different kinds of programs have different implications for cache implementation:

  1. Programs that just call fork after their first execve. This will be treated like one big execution. So either skip the whole thing or run the whole thing. Kind of a simple case from a caching view.

  2. Next up: a single process program. It can do one or more execve calls.

  • If it only has one execve call or this is the final execve call, we could use the replace with exit method to skip the execution.
  • Otherwise we have to use some other method, maybe replacing the call with a noop? It's kind of weird. I don't think the "replace execve with exit and continue" method will work because the process will just exit. Definitely has to be handled in a different way from fork + execve. It feels like we should be able to pick and choose which execve calls to run in this scenario, just have to get the method right. I think programs that have a mix of "root execution" calls and "child execution" calls will be the trickiest (see 3 below).
  1. Programs that fork + execve (what I expect the most). These programs can come with some mix of fork and execve, like the parent (root) process calling a couple execve calls of its own. But looking just at a straight fork + execve, we skip this by stopping the execve in the prehook and replacing it with exit, and continuing, setting the correct exit code in the correct register for the child process and copying the correct outputs from the cache.

Finally...

Okay, let's say it's a nice fork + execve multiprocess program (i.e. the perfect candidate). What if it has some nondeterminism, how would we know, and what would we do about it? Luckily, the bioinformatics workloads involve each process reading and writing to its own files, and no others. In the future, we may need to do conflict tracing or determinism enforcement, depending on benchmark choice and "best effort" vs. perfect coverage (like DetTrace). I think that perfect coverage will cause performance degradation, so we will have to figure out what "best effort" means for Process Cache.

it would be evicted if an input or output changed

I haven't thought about evictions too much. I thought it would be hard and complicated but this might work pretty well!

It was successful. Executions that failed previously will not be skipped; they will be allowed to run and fail again.

I'm actually very curious about this. In my mind, I think we should cache failing executions just like we do successful executions. Any reason why we treat failing executions special?

It is deterministic***** (I don't know what the definition of this is really yet, visions of dependency graphs dance in my head ๐Ÿ’ƒ)

Yeah. At this point I'm very confused about our deterministic story %)

IOTracker will forward the info about inputs and outputs of the execution to the inotify daemon.

I think there is a question here about when we notify the daemon to avoid race conditions between when we notify the daemon and when the process keeps running. Someone could maybe modify the file before the daemon starts checking? Maybe not.

functionality in IOTracker to communicate with the inotify daemon for sending I/O and other metadata info about new executions to the daemon.

I suspect this will be pretty easy. We can use some IPC, like ipc-channel with some serialized struct to communicate!

The API for interaction between the cache and the IOTracker part of ProcessCache probably won't need to change too much anyway, and this solution doesn't involve teaching myself inotify to start.

Yes defining a cache API which we can first implement with just hashing and writing to a file for now and later implement using inotify is the way to go.

I will comment on the rest later!

Caching and execvs

You have clearly been thinking about how caching works with parents and children!

Maybe the dependency I want to represent isn't the inputs and outputs of child executions, but I would need to know what child executions the parent one will spawn?

Yes. I also believe this is the right way do it.

I think only caring about processes who execve gives us a simpler conceptual framework to think about this question. With execve-ing processes (lets call it an e-process) we can truly think of these e-processes as black boxes with clearly defined inputs and outputs. The fact that a parent e-process spawned a child e-process is mostly irrelevant (except for, as you said, remembering to still exec the child even if skipped the parent). This way we don't have to care about the "dataflow" relations between different e-processes, that is, we don't really care if the child generated a file which the parent uses, or vice versa, or whatever. All we have to consider is: what are our inputs/outputs? Have they changed? (This whole paragraph is me agreeing with you and elaborating, I hope it makes sense ๐Ÿ˜…).

In general being able to separately consider the parent and child for caching will give us the greatest flexibility!

This is feels conservative, and can lead to us never skipping that program's execution, making P$ useless for this program. After some discussion, it probably is not worth trying to handle programs like this, so we'll just say for now that they are bad candidates for P$.

Unfortunately yes.

Determinism

As far as the determinism questions goes. I think it is useful to think about two different problems: One is races the other is nondeterminism. Races are two processes racing to create a file or a read/write conflict, etc. Nondeterminism is reading the time, etc. (I know that races as I have loosely defined them here can lead to nondeterminism but let's keep the concepts separate for now).

Races we can check with with some conflict detection. We will see two processes write/write or read/write to the same file in the same "logical time". I think that is easier to handle.

"Real" nondeterminism is harder to handle. We could certainly do a lightweight version of Dettrace style determinism? I would argue 90% of deterministic issues are caused by the same few sources. So no sense in implementing a bullet proof solution when a lightweight one would get us 90% of the way there.

As far as e-processes that do things that Dettrace couldn't handle, like network IO, threads, etc. We can simply chose not to cache those computations and always execute them?

krs85 commented

I'm actually very curious about this. In my mind, I think we should cache failing executions just like we do successful executions. Any reason why we treat failing executions special?

Allow me to clarify:
A "failing execution" is an execve call that fails immediately.
A "successful execution" is an execve call that succeeds and does stuff. The process that spawned it can exit with an error code, that's fine, but it's still a "successful execution".

The reason I differentiate is they are a little different, and I have them represented differently in the cache. If we look up an execution we want to run in the cache, and it's a "failed execution", so long as everything checks out (no changes) we run it because it will fail immediately essentially doing nothing. If we look up and see a "successful execution", then we use the approach of swapping the first system call with an exit and putting the appropriate exit code in the appropriate register. This approach doesn't even make sense for a "failed execution" so I think it makes sense to have these represented in different ways.

krs85 commented

closing because this issue is old and the problem was solved long ago.