In Colab using 'x10_training_loop' leads to "error: Couldn't lookup symbols:"
mikowals opened this issue · 6 comments
Opening a blank notebook which I expect is running S4TF v0.9 and entering the following:
import TensorFlow
import x10_training_loop
Device.trainingDevices // Same error for runOnThreads(), HostStatistics(), ... etc.
Gives the error:
error: Couldn't lookup symbols:
static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
Code completion works correctly after import x10_training_loop has been run once, so I think the import line for 'x10_training_loop' is correct. The error appears to occur for everything included in 'x10_training_loop', but I didn't try them all.
I should have added that importing 'x10_training_loop' works as expected using the macOS v0.9 toolchain or the June 12 toolchain for macOS. So I think it is specific to either Colab or the Linux toolchain.
@mikowals I noticed that examples in the swift-models repo, such as BERT-CoLA and others, set the calculations to be run on an accelerator via XLA on the X10 backend with:
...
let device = Device.defaultXLA
...
And eager mode is selected via device = Device.defaultTFEager.
So, if you run Device.defaultXLA instead, it should be error-free and recognize that you have a Colab GPU/TPU with an XLA backend:
import TensorFlow
import x10_training_loop
Device.defaultXLA
Output:
...
▿ Device(kind: .GPU, ordinal: 0, backend: .XLA)
- kind : TensorFlow.Device.Kind.GPU
- ordinal : 0
- backend : TensorFlow.Device.Backend.XLA
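As a minimal sketch of how these two device handles can be used once they resolve, the snippet below places a tensor on the XLA device and copies the result to the eager device. It assumes a Swift for TensorFlow toolchain with X10 support and imports only TensorFlow, so it does not touch the 'x10_training_loop' module that triggers the lookup error. The tensor values and variable names are illustrative, not taken from this thread.

```swift
import TensorFlow

// Assumption: a Swift for TensorFlow toolchain with X10 support (v0.9 or later).
let xlaDevice = Device.defaultXLA        // X10 / XLA backend
let eagerDevice = Device.defaultTFEager  // TensorFlow eager backend

// Create a tensor directly on the XLA device (illustrative values).
let x = Tensor<Float>([1, 2, 3], on: xlaDevice)
let y = x * 2

// X10 records operations lazily; the barrier cuts the trace and dispatches it.
LazyTensorBarrier()

// Copy the result to the eager device and inspect where each tensor lives.
let yEager = Tensor(copying: y, to: eagerDevice)
print(y.device)
print(yEager.device)
```

If this runs cleanly in Colab while the Device.trainingDevices line still fails, that would suggest the problem is limited to symbols defined in 'x10_training_loop' rather than to X10 as a whole.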
However, as you already mentioned, Device.trainingDevices will return:
error: Couldn't lookup symbols:
static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
(I'm sure @BradLarson @saeta and others can explain this better.)
Hi @8bitmp3. Yes, my reduced example with Device.trainingDevices can be worked around, or I can train in x10 using other code. I demonstrated the problem with .trainingDevices only because it was the first line in my code that referenced the 'x10_training_loop' module and showed an error with that module.
I created the issue because I think it shows that something is going wrong in the building or use of the toolchain in Colab leading to a problem using 'x10_training_loop'.
Thank you for reporting! I can reproduce this in the latest Colab build, but it does not occur in the Linux nightlies. I suspect there is a step in the CMake build that is not happening for Colab, so I'll hunt down the relevant code and identify a fix.
We narrowed this down to a problem with the static linking of the CX10 library. This appears to work fine when used with the swift binary, but Colab uses a different dynamic execution approach. @compnerd has a few ideas about how to resolve this for all platforms.
For the sake of documenting bugs:
#1177 (comment) - similar problem, but not the same.