jonysy/parenchyma

Transfer Matrix

There is a need to handle transfers between devices more easily.

The current approach of syncing from one backend to another is not sufficient and does not scale with more backends.

There are two things to think of, each with a fallback (see the sketch after this list):

  1. Inter-framework transfers
     - Fallback: Framework A -> Native -> Framework B
  2. Inter-device transfers (if the framework does not handle it, i.e. CUDA, AFAIK)
     - Fallback: Framework A/Device A -> Native -> Framework A/Device B
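As a rough sketch of the two cases and their Native fallbacks, assuming hypothetical names (`Framework`, `Location`, `Route`, and `plan` are not parenchyma types, and `direct_supported` stands in for whatever capability query a framework exposes, e.g. CUDA peer access):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Framework { Native, Cuda, OpenCl }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Location { framework: Framework, device: usize }

#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Direct inter-framework transfer.
    InterFramework,
    /// Fallback: Framework A -> Native -> Framework B.
    InterFrameworkViaNative,
    /// Direct inter-device transfer within one framework.
    InterDevice,
    /// Fallback: Framework A/Device A -> Native -> Framework A/Device B.
    InterDeviceViaNative,
}

/// Pick a route for `src -> dst`. `direct_supported` stands in for whatever
/// capability query the framework would expose.
fn plan(src: Location, dst: Location, direct_supported: bool) -> Route {
    match (src.framework == dst.framework, direct_supported) {
        (true, true) => Route::InterDevice,
        (true, false) => Route::InterDeviceViaNative,
        (false, true) => Route::InterFramework,
        (false, false) => Route::InterFrameworkViaNative,
    }
}

fn main() {
    let a = Location { framework: Framework::Cuda, device: 0 };
    let b = Location { framework: Framework::Cuda, device: 1 };
    // No direct peer copy available, so stage through host memory.
    assert_eq!(plan(a, b, false), Route::InterDeviceViaNative);
}
```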

Note that the matrix is supposedly symmetrical, but the transfer functions are not identical! Read is not write after all.

Note that this allows things to scale very quickly: if a particular transfer becomes a bottleneck, special functions can be registered for it. If not, and host memory is sufficient, it will fall back to the default.
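A minimal sketch of that register-or-default idea, again with hypothetical names (`TransferMatrix`, `TransferFn`, and `resolve` are illustrative only, not parenchyma's API): every framework pair resolves to something, and the route through Native is the universal fallback.

```rust
use std::collections::HashMap;

/// Placeholder signature for whatever actually performs a copy.
type TransferFn = fn(&[u8], &mut Vec<u8>) -> Result<(), String>;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Framework { Native, Cuda, OpenCl }

enum Transfer {
    /// A specialized, registered route (e.g. a direct CUDA <-> OpenCL copy).
    Special(TransferFn),
    /// The default: copy src -> Native host memory -> dst.
    ViaNative,
}

struct TransferMatrix {
    special: HashMap<(Framework, Framework), TransferFn>,
}

impl TransferMatrix {
    fn new() -> Self {
        TransferMatrix { special: HashMap::new() }
    }

    /// Register a specialized function once a route becomes a bottleneck.
    fn register(&mut self, from: Framework, to: Framework, f: TransferFn) {
        self.special.insert((from, to), f);
    }

    /// Every pair resolves to *something*: a registered special function if
    /// one exists, the Native fallback otherwise.
    fn resolve(&self, from: Framework, to: Framework) -> Transfer {
        match self.special.get(&(from, to)) {
            Some(f) => Transfer::Special(*f),
            None => Transfer::ViaNative,
        }
    }
}

/// A stand-in "specialized" transfer that only pretends to copy.
fn cuda_to_opencl(_src: &[u8], _dst: &mut Vec<u8>) -> Result<(), String> {
    Ok(())
}

fn main() {
    let mut matrix = TransferMatrix::new();
    matrix.register(Framework::Cuda, Framework::OpenCl, cuda_to_opencl);
    assert!(matches!(matrix.resolve(Framework::Cuda, Framework::OpenCl), Transfer::Special(_)));
    // Nothing registered for OpenCL -> CUDA, so it falls back to Native.
    assert!(matches!(matrix.resolve(Framework::OpenCl, Framework::Cuda), Transfer::ViaNative));
}
```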

Note that the to/from Native entries are, obviously, always populated.

Note that a big framework matrix may be best suited, and then, if necessary, an inter-device matrix within each framework.
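Roughly, that layout could look like the following type sketch (hypothetical aliases; none of these exist in the codebase): an outer matrix keyed by framework pairs, plus an inner device matrix only for the frameworks that need one.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Framework { Native, Cuda, OpenCl }

/// Device index within a framework (e.g. a GPU ordinal).
type DeviceId = usize;

/// Placeholder for whatever actually performs a transfer.
type TransferFn = fn(&[u8], &mut Vec<u8>) -> Result<(), String>;

/// The big inter-framework matrix: (Framework A, Framework B) -> transfer.
type FrameworkMatrix = HashMap<(Framework, Framework), TransferFn>;

/// The inter-device matrix within one framework: (Device A, Device B) -> transfer.
type DeviceMatrix = HashMap<(DeviceId, DeviceId), TransferFn>;

/// Only frameworks that actually need their own device matrix get an entry.
type PerFrameworkDeviceMatrices = HashMap<Framework, DeviceMatrix>;

fn main() {
    // Both levels start empty; entries are added only where they pay off.
    let _frameworks: FrameworkMatrix = HashMap::new();
    let _devices: PerFrameworkDeviceMatrices = HashMap::new();
}
```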

In addition:

The body of the SharedTensor::autosync method contains the logic in question.

@alexandermorozov's original comment:

Backends may define transfers asymmetrically; for example, CUDA may know how to transfer to and from Native backend, while Native may know nothing about CUDA at all. So if the first attempt fails, we change the order and try again.

Removing that would require moving the logic into the Sync implementations, which could increase complexity. Although that's a disadvantage, transferring the responsibility to the frameworks would make adding other frameworks less of a hassle, as the core codebase wouldn't need to be aware of individual frameworks (e.g., how to transfer from CUDA to OpenCL).
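For illustration, here is a sketch of how that could look if the ordering logic lived behind a trait that each framework implements; the `Synchronize` trait, the `sync_in`/`sync_out` names, and `TransferError` are all hypothetical, not parenchyma's actual `Sync`/`autosync` code.

```rust
/// Hypothetical error type; "no route" is distinct from an actual failed copy.
#[derive(Debug)]
enum TransferError {
    /// This backend does not know how to talk to the other one at all.
    NoRoute,
    /// A route exists but the copy itself failed.
    Failed(String),
}

/// Hypothetical trait each framework would implement; the core codebase would
/// only ever call these methods and never name CUDA, OpenCL, etc. directly.
trait Synchronize {
    /// Push this backend's memory out to `other` (the write direction).
    fn sync_out(&self, other: &dyn Synchronize) -> Result<(), TransferError>;
    /// Pull memory in from `other` (the read direction).
    fn sync_in(&self, other: &dyn Synchronize) -> Result<(), TransferError>;
}

/// Try `src -> dst` first; if the source has no route, swap the order and let
/// the destination pull instead. A genuine copy failure is propagated as-is.
fn transfer(src: &dyn Synchronize, dst: &dyn Synchronize) -> Result<(), TransferError> {
    match src.sync_out(dst) {
        Err(TransferError::NoRoute) => dst.sync_in(src),
        other => other,
    }
}
```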

This may be a case of over-engineering, though. Transferring from framework-x to framework-y is rarely, if ever, done.