hsivonen/simd

any() on boolean vectors on 32-bit ARM likely broken

Closed this issue · 5 comments

Steps to reproduce

  1. Use and ARMv7 + NEON Linux host.
  2. git clone https://github.com/hsivonen/encoding_rs
  3. cd encoding_rs
  4. git checkout 3049251cd80bb8eebc7d8c96057480d4e84fffef
  5. RUSTFLAGS=' -C target-feature=+neon' cargo test --features simd-accel

Expected results

Expected tests to pass, since encoding_rs contains no 32-bit ARM-specific code and the same code that only uses simd-crate facilities and cross-architecture LLVM shuffles works on Aarch64.

Actual results

Various tests fail. Since it's unlikely that LLVM is broken and unlikely that the rustc-to-LLVM part is broken just for 32-bit ARM, I suspect that the implementation for any() on boolean vectors is broken.

The implementation seems to assume that it's OK to transmute a 128-bit vector into a pair of 64-bit vectors (the 128-bit registers are aliased with two 64-bit registers). This is not what clang's arm_neon.h does, so maybe the assumption that the transmute is OK is no longer valid with rustc and LLVM updates.

To use an aliased half-register, arm_neon.h does this:

__ai uint8x8_t vget_high_u8(uint8x16_t __p0) {
  uint8x8_t __ret;
  __ret = __builtin_shufflevector(__p0, __p0, 8, 9, 10, 11, 12, 13, 14, 15);
  return __ret;
}

AFAICT, to fix this, a compiler RFC to extend SIMD shuffles so that the parameter and return value lane number doesn't need to be the same is needed. I'm thinking adding simd_shuffle16to8, etc.

@hsivonen I suspect you don't need an RFC for that. Instead, you can probably just submit a PR. The raw shuffle intrinsics are unstable today and probably will be for the foreseeable future. (So long as the spectre of integer generics looms, I suspect that will be true.) I think the quickest way to stabilization is to provide a layer above the shuffle in std, e.g., by defining vget_high_u8 (assuming that's a vendor intrinsic?) in std (errrmmm, I mean core), which would internally use the appropriate shuffle.

OK. I'll try to go with the direct rustc PR route.

AFAICT, to fix this, a compiler RFC to extend SIMD shuffles so that the parameter and return value lane number doesn't need to be the same is needed. I'm thinking adding simd_shuffle16to8, etc.

I was wrong. rustc already supports N to M shuffles. The number is the shuffle name is the output lane count and does not limit the input lanes.