huggingface/candle

bug on aarch64-apple-ios: Buffer Validation Illegal MTLStorageMode 0x10

evilsocket opened this issue · 19 comments

I'm running candle with metal acceleration on iOS via uniffi, specifically:

candle-core = { version = "0.6.0", features = ["metal"] }
candle-nn = { version = "0.6.0", features = ["metal"] }
candle-transformers = { version = "0.6.0", features = ["metal"] }

But the call to Device::new_metal panics with:

-[MTLDebugDevice newBufferWithBytes:length:options:]:670: failed assertion `Buffer Validation
Illegal MTLStorageMode 0x10'

After some debugging, I figured this happens on this line due to iOS rejecting the metal::MTLResourceOptions::StorageModeManaged option.

I could replicate with this simple test:

#[uniffi::export]
pub fn test_metal() {
    let device = metal::Device::all().swap_remove(0);

    println!("device: {:?}", &device);

    println!(
        "MTLResourceOptions::StorageModeManaged = 0x{:x}",
        metal::MTLResourceOptions::StorageModeManaged
    );

    // it panics here
    let seed = device.new_buffer_with_data(
        [299792458].as_ptr() as *const std::ffi::c_void,
        4,
        metal::MTLResourceOptions::StorageModeManaged, // <-- 0x10
    );

    println!("seed: {:?}", &seed);
}

Which prints and then panics:

device: <CaptureMTLDevice: 0x105207820> -> <MTLDebugDevice: 0x105207580> -> <AGXG16Device: 0x10800f400>
    name = Apple A17 Pro GPU
MTLResourceOptions::StorageModeManaged = 0x10
-[MTLDebugDevice newBufferWithBytes:length:options:]:670: failed assertion `Buffer Validation
Illegal MTLStorageMode 0x10
'

Other values such as StorageModeShared do not panic, but i'm not sure of the implications also because StorageModeManaged is used in several other places in the metal implementation.

also mentioned in #1841

After a bit of digging into Apple MLX and especially how they handle buffer allocation on both macOS and iOS, I found this https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/allocator.cpp#L207

You will see that all allocations are centralized there and they always and only use ResourceStorageModeShared (you won't find references to other storage modes in their metal backend). So it seems like on iOS at least, managed buffers are not needed. It makes sense if we think about it as Metal on macOS must support both Intel (where GPU and RAM are not unified) and Apple Silicon, while Metal on iOS only cares about unified memory, hence no syncing needed / supported.

Unless I've got it all wrong, which is possible.

Also related, only the precompiled macOS library is present in this repo #1759

working on it on my fork

My branch compiles and runs on iOS:

Screenshot 2024-07-10 alle 16 08 46

I'm still getting weird errors when I try to have my macOS M1 doing computations with an iPhone (Error: A weight is invalid in distribution), but it works.

"weight is invalid in distribution" usually means that the model generated a nan, so somehow one of the metal kernel probably did not work in the same way on iphone as it would have on macos. It might be good to print the tensors after each op to get a sense of where the nan is coming from.

I have a feeling it comes from this https://codebrowser.dev/tokio/crates/rand-0.8.5/src/distributions/weighted_index.rs.html#454 .. i'm trying to narrow it down but to be honest I have very little experience with Candle so it's taking a while ...

Yes it's coming from the sampling, and the input of the sampling is based on the logits generated on the model, I would think that it's almost always caused by one of these values being a nan (or an inf).

tensor printf debugging ftw

@LaurentMazare you were 100% right, the error is in the logits output vector being full of NaN, the sampling doesn't like that ... disabling the kv_cache on both workers and master seem to be improve things, the point being, this is a problem of my tool, the porting of candle by itself seems to work

@LaurentMazare it seems to depend on some (version?) discrepancy between the libMetalFlashAttention.metallib bundled in this repo vs the iOS and macOS libMetalFlashAttention.metallib bundled here https://github.com/philipturner/metal-flash-attention/releases/tag/v1.0.1

In this issue #1759 @ivarflakstad suggests to use the iOS version from that release - however I found out that if I replace the the macOS metallib in this repo with that release, these tests fail https://github.com/evilsocket/candle-metal-tester/blob/main/core/src/lib.rs (even on macOS), suggesting that candle-metal-kernels makes some assumptions on the implementation of that library, that clearly changed over time.

Bottom line, it would be great if the last person to update the lib on this repo ( @FL33TW00D via b6afb46 ) could document which specific version/commit that library was compiled from and if possible also compile the same version for iOS (or point to the right precompiled release). I suspect that aligning both macOS and iOS libMetalFlashAttention versions would solve the issue.

We're currently using @FL33TW00D 's fork of mfa which iirc has a tiny change in how buffers are accessed.

My plan was actually to fix all these issues as soon as I'm back from vacation.
No need to use xcode 14. No longer be fully dependant on mfa. Introduce candle.metallib.

@ivarflakstad understood, i'll try to compile this for iOS https://github.com/FL33TW00D/metal-flash-attention and wait for proper fix, thanks

@ivarflakstad just tried to compile @FL33TW00D fork (last commit), that is not the one:

found <AGXG13XDevice: 0x13300ca00>
    name = Apple M1 Max

using <AGXG13XDevice: 0x13300ca00>
    name = Apple M1 Max
thread 'main' panicked at core/src/lib.rs:80:5:
assertion `left == right` failed
  left: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 right: [20.0, 23.0, 26.0, 29.0, 56.0, 68.0, 80.0, 92.0]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I've literally tried to compile all branches of that repo and all of them fail the tests

@ivarflakstad i became a little bit obsessed with this so I tried to compile and test every single commit of every branch of both forks of libMetalFlashAttention.metallib, both for macOS and iOS. My results are a bit discouraging:

macOS:

  • 0.1.0 (precompiled from main repo): fails tests and hangs bad the gpu state
  • 0.2.0 (precompiled from main repo): fails tests and hangs bad the gpu state
  • 1.0.0 (precompiled from main repo): fails tests and hangs bad the gpu state
  • 1.0.1 (precompiled from main repo): fails tests
  • batches-2 (branch from FL33TW00D's fork): fails tests
  • larger-batches: (branch from FL33TW00D's fork): fails tests
  • original bundled binary: passes all tests, indicating this version comes from somewhere else.

iOS:

  • 0.1.0 (precompiled from main repo): fails and errors called Result::unwrap() on an Err value: LoadFunctionError("Constant K_splits (212) value is required by function sgemm")
  • 0.2.0 (precompiled from main repo): fails and errors called Result::unwrap() on an Err value: LoadFunctionError("Constant K_splits (212) value is required by function sgemm")
  • 1.0.0 (precompiled from main repo): fails and errors called Result::unwrap() on an Err value: LoadFunctionError("Constant K_splits (212) value is required by function sgemm")
  • 1.0.1 (precompiled from main repo): fails tests
  • batches-2 (branch from FL33TW00D's fork): fails tests
  • larger-batches: (branch from FL33TW00D's fork): fails tests
  • original bundled binary: does not exist :/

It's so frustrating because this single library is the only obstacle to iOS compatibility.

Right. I tried to nudge you away from the rabbit hole, but I get the curiosity hehe.
We won’t be dependent on the precompiled MFA library for much longer.

@ivarflakstad i am so looking forward to it - i don't think i have to tell you guys that there's a lot of potential in unlocking iOS. To my knowledge, Candle is currently the closest thing to a single framework supporting pretty much anything in a single codebase.