datajoint/datajoint-python

Offline computing

Several users have expressed the need for offline computation. This may be necessary for cluster jobs where cluster nodes do not have access to the database or when users want to work remotely without access to the database.

I propose a feature that allows performing auto-populate jobs without access to the database during the computation. This will first fetch the data and stash it. Then a computing job can perform the computation from the stashed data. Then a separate job will collect the results and insert them into the database. All this will need to be done with the same level of transaction integrity as in the regular populate.

I can envision a few solutions that make the process as unobtrusive as possible.

Using a context manager for stashing fetch results

One solution could rely on a special context manager to be used within the make callback.

The make method might look something like:

    def make(self, key):
        with self.checkout():
            a, b = (Ancestor & key).fetch('a', 'b')
        c, d = compute_something(a, b)
        self.insert1(dict(key, c=c, d=d))

The context manager must wrap the part of the make call that does the fetching. That's all that the user needs to do.
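
As a rough sketch of how checkout() could work (none of this is existing DataJoint API; StashedForOffline and the mode flag are hypothetical names): in stash mode, the context manager snapshots the local variables of the make frame on entry, collects whatever the fetch created on exit, and aborts make by raising an exception that populate catches.

    import inspect


    class StashedForOffline(Exception):
        """Hypothetical signal carrying the fetched variables out of make()."""
        def __init__(self, fetched):
            self.fetched = fetched


    class Checkout:
        """Illustrative checkout() context manager, not actual DataJoint API."""

        def __init__(self, mode='online'):
            # 'online', 'stash', 'replay', or 'collect', set by populate/schema
            self.mode = mode

        def __enter__(self):
            # f_back is the frame of the enclosing make() call (CPython-specific)
            self._before = set(inspect.currentframe().f_back.f_locals)
            return self

        def __exit__(self, exc_type, exc, tb):
            if exc_type is not None or self.mode != 'stash':
                return False  # online mode: behave as a plain pass-through
            caller = inspect.currentframe().f_back.f_locals
            fetched = {k: v for k, v in caller.items() if k not in self._before}
            # Abort make() here; populate catches this and writes the stash file.
            raise StashedForOffline(fetched)

Here self.checkout() would construct Checkout with whatever mode populate has set on the table; in the default online mode the context manager does nothing.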

Stashing for offline computation

Then, calling populate with compute_offline=True will cause the populate call to execute the fetching code within the context and stash all variables created in that context into a file for offline computation. On exit, the context manager will throw an exception so that the actual computation is not performed.
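
A minimal sketch of that stash pass, assuming the Checkout/StashedForOffline names from above and a hypothetical key_hash helper (key_source and fetch('KEY') are the existing idioms for enumerating pending keys; job reservation and error handling are omitted for brevity):

    import hashlib
    import json
    import os
    import pickle


    def key_hash(key):
        # Deterministic stash filename for a primary-key dict (illustrative only)
        return hashlib.md5(
            json.dumps(key, sort_keys=True, default=str).encode()).hexdigest()


    def populate_stash_pass(table, stash_dir):
        """Hypothetical stash pass of populate(compute_offline=True)."""
        for key in (table.key_source - table).fetch('KEY'):
            table._checkout_mode = 'stash'  # read by self.checkout() in make()
            try:
                table.make(key)  # aborts inside checkout() right after fetching
            except StashedForOffline as stashed:
                path = os.path.join(stash_dir, key_hash(key) + '.pkl')
                with open(path, 'wb') as f:
                    pickle.dump(
                        {'key': key, 'fetched': stashed.fetched, 'results': None},
                        f)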

Computing offline

Then, while offline, calling schema.compute_offline() will trawl through the stash directory and call make again for each uncomputed dataset. The insert commands will be redirected to store the results back into the stash. Completing the make call will mark the stashed dataset as completed.
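
A sketch of that offline pass, continuing the hypothetical stash-file layout from above (how checkout() re-injects the stashed variables and how insert1() is redirected into stash['results'] is hand-waved behind the mode flag):

    import os
    import pickle


    def compute_offline_pass(table, stash_dir):
        """Hypothetical offline pass: re-run make() for each uncomputed stash."""
        for fname in sorted(os.listdir(stash_dir)):
            path = os.path.join(stash_dir, fname)
            with open(path, 'rb') as f:
                stash = pickle.load(f)
            if stash['results'] is not None:
                continue  # already computed in an earlier offline run
            table._checkout_mode = 'replay'  # checkout() serves stashed values
            table._stash = stash             # redirected insert1() fills 'results'
            table.make(stash['key'])         # runs the actual computation offline
            with open(path, 'wb') as f:
                pickle.dump(stash, f)        # non-empty 'results' marks completion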

Collecting results

Then, when online again, calling schema.collect_computed_results() will trawl through the stash directory and, for each completed dataset, call make again: it will refetch the inputs, verify their checksum against the stash, skip the computation, and save the stashed results into the database, all inside a transaction.

After the result is collected and inserted, the reserved job is cleared and the stashed dataset is deleted.
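
A sketch of the collection pass, using the existing connection.transaction context manager; the checksum verification inside make() is again hidden behind the hypothetical mode flag:

    import os
    import pickle


    def collect_computed_results(table, stash_dir):
        """Hypothetical collection pass, one transaction per stashed dataset."""
        for fname in sorted(os.listdir(stash_dir)):
            path = os.path.join(stash_dir, fname)
            with open(path, 'rb') as f:
                stash = pickle.load(f)
            if stash['results'] is None:
                continue  # not yet computed offline
            with table.connection.transaction:
                table._checkout_mode = 'collect'  # refetch, verify, insert
                table._stash = stash
                table.make(stash['key'])  # inserts stash['results'], skips compute
                # ...clearing the reserved job would also happen in here...
            os.remove(path)  # delete the stashed dataset once committed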

Note that when populate is called with compute_offline=False (default), then populate will work as before.

ixcat commented

The general idea sounds good -

Would issues w/ upstream changes happening in the database or to other related values while in disconnected mode be handled somehow? Or would this be a caveat of using the 'disconnected' mode?

This is addressed by refetching the data and checking the hash at collection time. If the upstream data have changed, then the computation results are discarded.
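
As a sketch, the check could hash the fetched variables at stash time and compare against a hash of the refetched values at collection time (pickle is used here purely for illustration; its output is not guaranteed byte-stable across versions, so a real implementation would hash a canonical serialization):

    import hashlib
    import pickle


    def data_checksum(fetched):
        """Illustrative checksum over the dict of fetched variables."""
        h = hashlib.sha256()
        for name in sorted(fetched):      # fixed order for reproducibility
            h.update(name.encode())
            h.update(pickle.dumps(fetched[name]))
        return h.hexdigest()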