tensorflow/swift-apis

Implementation of Metal/Accelerate/DirectX backend


I'm splitting this into a third issue for organization purposes. This conversation stems from #1185, but that thread would otherwise get crowded with multiple diverging topics. The following is a comment I copied and deleted from the parent thread (Brad Larson may have been pinged twice as a result):


Quoting this reply by @BradLarson from way above: #1185 (comment)

Broader accelerator support would most likely be possible only through a replacement of the TensorFlow layer with a different underlying runtime. That would open up many options for accelerators that would be challenging to support via TensorFlow.

I now see the meaning behind that statement. My goal is to allow acceleration through Metal, Accelerate, DirectX, and custom backends. I'm currently debating whether I should go for PluggableDevice or something entirely custom that removes the dependency on CTensorFlow.

  1. Using PluggableDevice would be the easiest option, but only if I can get it to work. I'm scanning over TensorFlow's documentation, and there seem to be some pain points with using it for S4TF. One is the LoadLibrary function, which must load a file in a Python-like file structure. Another is that it seems to assume you're using Python TensorFlow. Furthermore, I would have to gain a lot of expertise with TensorFlow's C API, which would cost time. Using PluggableDevice means investing 100% in CTensorFlow, so there's no chance for deep learning on iOS like with DL4S. (See the first sketch after this list for what plugin loading could look like from Swift.)

  2. The alternative is to remove as many connections to CTensorFlow as possible. This is more flexible, but has a greater chance of bugs; I would be maintaining it alone, while TensorFlow has dozens or hundreds of maintainers. There are many ops that I can't synthesize (e.g. cholesky_grad), so I might fall back to TensorFlow eager for them. I would also have to assume the backend has a graph compiler, which puts more burden on the backend creator. That's true for Metal/Accelerate and might be true for DirectML. If there's a way to separate X10 from XLA and use a custom graph compiler instead of XLA, then a custom runtime would be preferable to PluggableDevice. (See the second sketch after this list for what such a backend interface might require.)
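
First sketch: roughly what loading a PluggableDevice plugin through the C API might look like from Swift. It assumes CTensorFlow exposes the experimental `TF_LoadPluggableDeviceLibrary` entry point (added around TF 2.5); the plugin path, error type, and function name `loadPluggableDevice` are made up for illustration.

```swift
import CTensorFlow

enum PluginError: Error {
    case loadFailed(String)
}

/// Loads a PluggableDevice plugin (e.g. a hypothetical Metal plugin) and
/// registers its devices and kernels with the TensorFlow runtime.
func loadPluggableDevice(at path: String) throws {
    let status = TF_NewStatus()
    defer { TF_DeleteStatus(status) }

    // dlopens the shared library and runs its registration hooks;
    // the returned handle must stay alive for the rest of the process.
    let library = TF_LoadPluggableDeviceLibrary(path, status)
    guard TF_GetCode(status) == TF_OK else {
        throw PluginError.loadFailed(String(cString: TF_Message(status)))
    }
    _ = library
}

// Usage (the path is a placeholder):
// try loadPluggableDevice(at: "/usr/local/lib/libmetal_plugin.dylib")
```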
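Second sketch: a purely hypothetical outline of the custom-runtime option. None of these names exist in swift-apis; they only illustrate the requirement that every backend bring its own graph compiler, with TensorFlow eager as an escape hatch for ops that can't be synthesized.

```swift
// Placeholder types so the sketch is self-contained.
struct TracedGraph {}
struct CompiledGraph {}
struct DeviceBuffer {}

/// Hypothetical protocol a CTensorFlow-free runtime could ask each backend
/// (Metal/Accelerate, DirectML, ...) to implement.
protocol ComputeBackend {
    /// Lower a traced graph to the backend's executable form
    /// (e.g. an MPSGraph for Metal, a BNNS graph for Accelerate).
    func compile(_ graph: TracedGraph) throws -> CompiledGraph

    /// Run a compiled graph on device-resident buffers.
    func execute(_ graph: CompiledGraph, inputs: [DeviceBuffer]) throws -> [DeviceBuffer]

    /// Escape hatch for ops with no hand-written kernel (e.g. cholesky_grad),
    /// which could be dispatched to TensorFlow eager where it is available.
    func fallbackToEager(op: String, inputs: [DeviceBuffer]) throws -> [DeviceBuffer]
}
```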

One more limiting factor is that I can't personally cache new CTensorFlow/X10 builds (e.g. TF 2.7.0) and make them downloadable through SwiftPM. X10 binaries seem to be ~40 MB, but the maximum file upload size on GitHub is 25 MB, and Google Drive seems like an even worse option. Removing CTensorFlow entirely means I don't need to store any pre-compiled binaries online, and it fits more nicely with SwiftPM. I would rely on the existing 2.4.0 CTensorFlow binaries for Google Colab and Linux with CUDA, and nothing else.
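
To make the distribution problem concrete, this is roughly what hosting a pre-built X10/CTensorFlow artifact for SwiftPM would require: a remote binaryTarget whose zip lives somewhere without a hard size cap. The URL and checksum below are placeholders, not a real artifact, and remote binary targets currently only cover Apple-platform XCFrameworks, which makes this even less attractive.

```swift
// swift-tools-version:5.3
import PackageDescription

// Hypothetical package that wraps a pre-built X10 binary. The ~40 MB zip
// would have to be hosted somewhere that allows files larger than GitHub's
// 25 MB upload limit.
let package = Package(
    name: "X10Binary",
    products: [
        .library(name: "X10Binary", targets: ["x10"])
    ],
    targets: [
        .binaryTarget(
            name: "x10",
            url: "https://example.com/x10-2.7.0.xcframework.zip",
            // Placeholder; the real value comes from `swift package compute-checksum`.
            checksum: "0000000000000000000000000000000000000000000000000000000000000000"
        )
    ]
)
```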