tensorflow/swift-apis

In Colab using 'x10_training_loop' leads to "error: Couldn't lookup symbols:"

mikowals opened this issue · 6 comments

I opened a blank Colab notebook, which I expect is running S4TF v0.9, and entered the following:

import TensorFlow
import x10_training_loop 
Device.trainingDevices // Same error for runOnThreads(), HostStatistics(), ... etc.

Gives the error:

error: Couldn't lookup symbols:
      static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
      static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>

Code completion works correctly once import x10_training_loop has been run, so I think the import line for 'x10_training_loop' is correct. The error appears to occur for everything defined in 'x10_training_loop', but I didn't try every symbol.

I should have added that importing 'x10_training_loop' works as expected with the macOS v0.9 toolchain and with the June 12 toolchain for macOS. So I think the problem is specific to either Colab or the Linux toolchain.

@mikowals I noticed that examples in the swift-models repo, such as BERT-CoLA and others, run their calculations on an accelerator via XLA on the X10 backend with:

...
let device = Device.defaultXLA
...

Eager mode is selected instead via device = Device.defaultTFEager.

So, if you run Device.defaultXLA instead, it should be error-free and should recognize that you have a Colab GPU/TPU with an XLA backend:

import TensorFlow
import x10_training_loop 

Device.defaultXLA

Output:

...
▿ Device(kind: .GPU, ordinal: 0, backend: .XLA)
  - kind : TensorFlow.Device.Kind.GPU
  - ordinal : 0
  - backend : TensorFlow.Device.Backend.XLA
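Building on that device check, here is a minimal sketch of switching between the X10 and eager backends and placing a tensor on the chosen device. It imports only TensorFlow, so it should not need any symbol from 'x10_training_loop'. The useXLA flag and the tensor values are placeholders for illustration, and I have not verified this on the affected Colab build.

```swift
import TensorFlow

// Hypothetical switch for illustration: set to false to use eager mode.
let useXLA = true

// Pick the X10 (XLA) device or the eager device.
let device = useXLA ? Device.defaultXLA : Device.defaultTFEager

// Create the tensor directly on the chosen device.
let x = Tensor<Float>([1, 2, 3], on: device)
let y = x * 2

// On an XLA device, operations are traced lazily. The barrier cuts the
// trace and dispatches it for execution.
LazyTensorBarrier()

print(y.device)
print(y)
```

If the printed device shows backend: .XLA, the computation went through X10 without using Device.trainingDevices.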

However, as you already mentioned, Device.trainingDevices will return:

error: Couldn't lookup symbols:
  static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
  static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>

(I'm sure @BradLarson @saeta and others can explain this better.)

Hi @8bitmp3. Yes, my reduced example with Device.trainingDevices can be worked around or I can train in x10 using other code. I demonstrated the problem with .trainingDevices only because it was the first line in my code that referenced the 'x10_training_loop' module and showed an error with that module.

I created the issue because I think it shows that something is going wrong in the building or use of the toolchain in Colab leading to a problem using 'x10_training_loop'.

Thank you for reporting! I can reproduce this in the latest Colab build, but it does not occur in the Linux nightlies. I suspect there is something in the CMake build that's not happening for Colab, so I'll hunt this down and identify a fix.

We narrowed this down to a problem with the static linking of the CX10 library. This appears to work fine when used with the swift binary, but Colab uses a different dynamic execution approach. @compnerd has a few ideas about how to resolve this for all platforms.

For the sake of documenting bugs:

#1177 (comment) - similar problem, but not the same.