emscripten-core/emscripten

missing rotl/rotr in wasm

Closed this issue · 9 comments

Using the C code from https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant , I'll see no rotl/rotr instructions after decompiling wasm with wasm2wat ;

but when using assemblyscript

// ...
x0 += x4; x12 ^= x0; x12 = rotl<u32>(x12, 16);
// ...

it does generate rotl in wasm & is 10 times faster than emcc -O3 .

I guess maybe the llvm backend is not generating these instructions. Can you share the C code where you expect them to be generated?

#include <stdio.h>
#define uint32_t unsigned

// https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant

#define ROTL(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
#define QR(a, b, c, d) (             \
	a += b, d ^= a, d = ROTL(d, 16), \
	c += d, b ^= c, b = ROTL(b, 12), \
	a += b, d ^= a, d = ROTL(d,  8), \
	c += d, b ^= c, b = ROTL(b,  7))
#define ROUNDS 20

uint32_t chacha_block(uint32_t out[16], uint32_t const in[16])
{
	int i;
	uint32_t x[16];

	for (i = 0; i < 16; ++i)
		x[i] = in[i];
	// 10 loops × 2 rounds/loop = 20 rounds
	for (i = 0; i < ROUNDS; i += 2) {
		// Odd round
		QR(x[0], x[4], x[ 8], x[12]); // column 1
		QR(x[1], x[5], x[ 9], x[13]); // column 2
		QR(x[2], x[6], x[10], x[14]); // column 3
		QR(x[3], x[7], x[11], x[15]); // column 4
		// Even round
		QR(x[0], x[5], x[10], x[15]); // diagonal 1 (main diagonal)
		QR(x[1], x[6], x[11], x[12]); // diagonal 2
		QR(x[2], x[7], x[ 8], x[13]); // diagonal 3
		QR(x[3], x[4], x[ 9], x[14]); // diagonal 4
	}
	for (i = 0; i < 16; ++i)
		out[i] = x[i] + in[i];

	return out[0];
}

uint32_t in[16]={0x61707865, 0x3320646e, 0x79622d32, 0x6b206574,
	0xfb2e460e, 0xb00a2090, 0x700a1db2, 0x50c0c43f,
	0x713d2225, 0x43334aac, 0x54aff28d, 0x8f1feb3d,
	0, 0, 0, 0};
uint32_t out[16];
int main(int argc, char **argv) {

	while(chacha_block(out, in) > 0x100) in[12]++;

	for(int i=0; i<16; i++) {
		printf("%2d: %08x -> %08x\n", i, in[i], out[i]);
	}
	printf("in[12] = %d\n", in[12]);
	return 0;
}

please note the #define ROTL ...

I see, so you are expecting the compiler (clang/llvm) to lower the #define ROTL(a,b) (((a) << (b)) | ((a) >> (32 - (b)))) sequence into wasm rotl/rotr instructions but it isn't?

@dschuff @tlively do you know if/why LLVM doesn't do this?

It looks like we do have at least some lowering to rotl and rotr in llvm: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/WebAssembly/rotate-i3264.ll

And more in llvm/test/CodeGen/WebAssembly/i32.ll.

As to why the particular sequence in #define ROTL is not getting lowers this far I don't know. Do any other llvm targets lower that macro to single rotl/rotr instructions?

This might be something we can look into, whether this should pattern match. To work around this, can you see if the __builtin_rotateleft clang builtin works for you?

Helps!

> wasm2wat chacha20.wasm|egrep 'rot[lr]'|uniq -c
     32         i32.rotl

Thanks!

This could still be a feature request for LLVM to pattern match this. Do you mind filing an issue against the wasm backend in https://github.com/llvm/llvm-project ? Do you happen to know if this works on other platforms such as ARM or x86? There might very well be a generic pass somewhere that will match this code into an fshl intrinsic (which we would then lower to the wasm rotate instruction).