philipturner/swift-colab

Can't compile Swift for TensorFlow quickly

philipturner opened this issue · 14 comments

The main reason I made the overhauls present in Swift-Colab 2.0 was so that in the future, I could run S4TF code without facing bottlenecks that make it virtually unusable. However, I am unable to compile S4TF for use in the interactive experience. This is after avoiding the problems described in #14.

The test notebook S4TF with TF 2.4 shows my effort to compile S4TF for use in the Swift interpreter. Even though that failed, I can technically compile it using %system commands like in s4tf-on-colab-example-1.ipynb and add custom code to the test suite. But that isn't ergonomic or reproducible in any way.

Specifically, the debugger shows an error when I run the following code. Back in the swift-jupyter era, the TensorFlow module was embedded in the toolchain. So the error below was likely never encountered.

import TensorFlow
print(Tensor<Float>.self)
<Cell 1>:2:7: error: cannot find 'Tensor' in scope
print(Tensor<Float>.self)
      ^~~~~~

One simple solution to #14 and #15 is a new magic command: %install-x10. But I have to be 100% sure it is necessary. If I change my mind, it's source-breaking.

It actually does load, but you need to restart the runtime first. I haven't tested it yet because I got sidetracked with a bug on the Python side. Either way, I need to investigate why this won't load into the Swift interpreter without restarting the runtime first. That restriction is not present on PythonKit, SwiftPlot, and other libraries.

Hi @philipturner. I can get past your error above by removing some of the flags you set to install s4tf.

I commented out these flags:

//%install-swiftpm-flags -c release -Xswiftc -Onone

And then TensorFlow is available to import. I believe setting those flags actually breaks the import of any package, not just TensorFlow. I tried some other packages without clearing the flags and they also silently failed to import.

Sadly, though, problems still remain. This code:

import TensorFlow
let x = Tensor(0)

Produces:

Couldn't lookup symbols:
  TensorFlow.Tensor.init(_: τ_0_0, on: TensorFlow.Device) -> TensorFlow.Tensor<τ_0_0>
  TensorFlow.Tensor.init(_: τ_0_0, on: TensorFlow.Device) -> TensorFlow.Tensor<τ_0_0>

It looks similar to swift-apis issue 1016, which I don't believe was ever fixed. But the error here is a generic linking or runtime-availability problem, so it likely has a different cause.

The env var LD_LIBRARY_PATH in Colab looks a bit strange and points to /usr/local/nvidia/lib. I don't think any TensorFlow files end up there, so maybe that is the cause.

Thanks for the work you are doing on this. You are making impressive progress!

Thanks for investigating! I should be able to narrow this problem down to a small reproducer. Other packages like PythonKit behave just fine; there's some specific reason S4TF is being uncooperative.

I have encountered your error "Couldn't lookup symbols" multiple times today when using PythonKit. It always happens when I forget to execute the %install command after restarting the runtime. Did you execute the command that does %install .package(...) TensorFlow before receiving that error?

I have also used PythonKit multiple times with the -c release -Xswiftc -Onone flags. What packages didn't work when you used those flags? Also, remember to $clear the SwiftPM flags when appropriate.

Success! I added the -rpath flag:

%install-swiftpm-flags -Xlinker "-rpath=/content/Library/tensorflow-2.4.0/usr/lib"

The full colab is here.

Now this:

import TensorFlow

print(Device.default) 
let x = Tensor(0)
print(x)
print(x.device)

let y = Tensor(0, on: .defaultXLA)
print(y.device)

Shows this:

Device(kind: .CPU, ordinal: 0, backend: .TF_EAGER)
0.0
Device(kind: .CPU, ordinal: 0, backend: .TF_EAGER)
Device(kind: .CPU, ordinal: 0, backend: .XLA)

I also did some fiddling with -c release -Xswiftc -Onone and determined that -c release causes the problem. The output shows the flag working as documented: building for release when included and for debug when excluded. But the release build produces an import that doesn't actually work.

I actually ran through the entire Model Training Walkthrough tutorial on tensorflow/swift, using -c release -Xswiftc -Onone. That specific set of flags makes it take 2 minutes to compile, while a standard debug build compiles in 3 minutes. I haven't tested the import behavior in debug mode. You're saying that if it's in debug mode, you don't have to restart the runtime to load the library?

I will definitely narrow this down and find the culprit, because I believe that is a bug with SwiftPM or the Swift compiler. SwiftPlot depends on C dependencies and doesn't have that issue.

%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10_optimizers_optimizer.so /usr/lib/libx10_optimizers_optimizer.so
%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10_optimizers_tensor_visitor_plan.so /usr/lib/libx10_optimizers_tensor_visitor_plan.so
%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10.so /usr/lib/libx10.so
%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10_training_loop.so /usr/lib/libx10_training_loop.so

%install-swiftpm-flags $clear
%install-swiftpm-flags -c release -Xswiftc -Onone
%install-swiftpm-flags -Xswiftc -DTENSORFLOW_USE_STANDARD_TOOLCHAIN
%install '.package(url: "https://github.com/philipturner/s4tf", .branch("fan/resurrection"))' TensorFlow

I haven't tried using -Xlinker or -rpath yet; I just copied the binaries to system include paths. If I can use your workaround to fix the issue with linking the binary files, then that solves half of my problem. The other half is encoding the header paths in Clang modulemap files, so the headers don't have to be copied into system header directories. I'm working on narrowing down a SwiftPM bug affecting the latter task right now.

One fruit of this effort, although not the bug I'm tracking down: swiftlang/swift-package-manager#5482 (comment)

The bug I'm tracking is (from #14):

Two module.modulemap files that declare the same Clang module can overwrite each other, even if one is part of the documentation of a Swift package and never actually involved in the build process. This happened with the modulemap currently in the Utilities directory of s4tf/s4tf.

Utilities/module.modulemap shouldn't appear in the "build.db", and whether it appears or not is highly fickle.
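To illustrate the collision (the module and header names below are hypothetical, not the actual s4tf layout): two modulemap files in the same package tree can both declare the same Clang module, like

```
// Sources/CX10/module.modulemap  (part of the build)
module CX10 {
    header "shim.h"
    export *
}

// Utilities/module.modulemap  (documentation only, should be ignored)
module CX10 {
    header "other_shim.h"
    export *
}
```

and the build can pick up the second declaration even though it is never part of any target.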

The reason I initially decided to compile S4TF with the old TF 2.4 binary was to narrow down the source of s4tf/s4tf#14, not to make it accessible on Colab. You are welcome to see if that bug exists on the older X10 binary, or even better - help me fix that bug :)

You're saying that if it's in debug mode, you don't have to restart the runtime to load the library?

Maybe. I did not try restarting after compiling with -c release because TensorFlow was not in the list of libraries shown in the restart instructions. Building in debug mode definitely doesn't require a restart.

The reason I initially decided to compile S4TF with the old TF 2.4 binary was to narrow down the source of s4tf/s4tf#14, not to make it accessible on Colab. You are welcome to see if that bug exists on the older X10 binary, or even better - help me fix that bug :)

Yes, I can see that the methods used in this Colab aren't ideal. Still, it should let me run some old S4TF models using X10 on a TPU if I want. I haven't tried this yet, but it would be handy, however elaborate the methods to get there are.

I actually ran through the entire Model Training Walkthrough tutorial on tensorflow/swift, using -c release -Xswiftc -Onone.

I have no doubt that those flags can work and are useful. In this instance, though, there appears to be some interaction with the Colab runtime, the %install command, or the build process. There are many moving pieces here. Not sure what to say other than that it consistently works with those flags commented out and consistently fails with them enabled.

Also, I'm planning to un-comment x10_training_loop in the package manifest on both the head branch and this TF 2.4 branch. It was deactivated in January 2021 because of a SwiftPM build failure, but I hypothesize that has long since been fixed.

SwiftPlot has started failing to import on the first try if you use -c release -Xswiftc -Onone. You have to restart the runtime and rerun the %include command. This strange import behavior appeared sometime between the 2021-12-06 and 2021-12-23 toolchains. That is a different time frame from when S4TF started experiencing the behavior (??? to 2021-11-12). To clarify, SwiftPlot and S4TF started failing to import at different times.

This is confirmation that the behavior is a bug. Something incorrect started happening in the Swift compiler before 2021-11-12. It was exposed to a greater extent in December, causing SwiftPlot to fail. Hopefully I can fix the compiler bug and integrate a patch into the 5.7 or 5.7.1 release.

Even weirder: you now have to restart 2 times to use S4TF on s4tf/s4tf:main! You only need to restart once when using fan/resurrection. Both branches were tried on the same 2022-05-11 toolchain with a factory-reset Colab instance, but I need to double-check that there are no confounding variables. Doing so is time-intensive because each test takes around 3 minutes, so I don't feel like it right now.

Something is very off here, and I'll instruct the user to avoid -c release -Xswiftc -Onone until this is narrowed down. The nature of the bug (it interacts with LLDB, isn't yet reproducible on macOS, and the only reproducers are massive code bases) makes it time-intensive to track down. The -c release -Xswiftc -Onone flags only reduce compile time by 33%, so the user will just have to deal with it.

v2.2 was released, and the README has instructions for compiling Swift for TensorFlow. I noted the issue with -c release -Xswiftc -Onone, keeping the SwiftPM flags directive commented out. Let me know whether this works for you!

It should allow me to run some old S4TF models using X10 on TPU if I want. I haven't tried this yet but it would be handy. However elaborate the methods to get there are...

@mikowals I just got S4TF to run on a TPU. Look at the "TPU Tests" notebook at the bottom of the README. It was 8 TPUs at once, on the Colab free tier! I had never experienced using a TPU before. Could you provide some old X10 models designed for TPU, so that I can include them in the test suite?