Just showing what I'm doing with AssemblyScript

Question

Mudloop opened this issue a year ago · 6 comments

Not sure if this is allowed here, but I wanted to show one of my AssemblyScript-based projects: a software synth.

The main brain of the synth is written in AssemblyScript: it handles parameter management, oscillators (wavetables), modulation and routing. The filters and effects are done with Faust (with a little help from the host).

It's currently running in the browser (actually an Electron app at the moment), but the plan is to use iPlug2 and Wasmer to turn it into a VST/AU/... plugin. It's all set up with that in mind, and I've done some tests to make sure that would work.

The main challenge has been performance. AS is pretty good at that, but there's a ton going on: 44100 samples per second with 12 voices, 2 engines with 7 voices unison and 2 generators each, and every generator has multiple stages. It adds up. But after a couple of iterations of the audio engine, it's in a good place. The first time I tried, it took just under a second to generate a second of audio, which isn't acceptable, but the current version manages this in about 100ms, so that's good. And I haven't even vectorized anything yet.
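To put rough numbers on "it adds up", here is a back-of-the-envelope sketch. It assumes the counts quoted above simply multiply (every voice runs both engines, every engine runs 7 unison voices, each with 2 generators); the variable names are made up for this example, and the stages inside each generator are not counted.

```typescript
// Rough per-second workload, assuming the quoted counts simply multiply.
const sampleRate = 44100;       // samples per second
const voices = 12;              // max polyphony
const enginesPerVoice = 2;
const unisonPerEngine = 7;      // max unison
const generatorsPerUnison = 2;  // assumption: 2 generators per unison voice

// Generator instances that each have to produce one sample per frame.
const generators = voices * enginesPerVoice * unisonPerEngine * generatorsPerUnison;

// Generator-samples needed for one second of audio (stages not counted).
const generatorSamplesPerSecond = generators * sampleRate;

// How many times faster than real time the engine renders.
function realtimeFactor(msPerSecondOfAudio: number): number {
	return 1000 / msPerSecondOfAudio;
}

console.log(generators);                // 336
console.log(generatorSamplesPerSecond); // 14817600
console.log(realtimeFactor(100));       // 10
```

So at roughly 100ms per second of audio, the engine has about 10x headroom over real time at max polyphony and max unison, under those assumptions.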

I used Lit to make the UI. It still needs some polish, and there are still a lot of placeholders. I'm not a designer, but I'm pretty happy with what I managed to come up with.

Answer 1 · 2024-07-15T21:18:43.000Z

my gods you've done it, haven't you

Answer 2 · 2024-07-25T22:54:36.000Z

@bitnom, looks like he killed it 👏

Answer 3 · 2024-07-30T17:44:08.000Z

Thanks for the responses!

I spent some time updating the design.

This is all CSS; the only image used is the X.

Processing time was increasing towards 180ms / 1s of audio (that's with max polyphony and max unison), so it was time to vectorize some stuff, and now it's ~70ms again.

I now also do the filters in AssemblyScript, which saves a lot of time. Converting compiled Faust C++ filters to AssemblyScript is trivial, and I plan to vectorize those as well.

So yeah, AssemblyScript is a real champ, if you know how to (ab)use it. Lots of pointer arithmetic going on etc.

Here's some SIMD util code I wrote. I find the interpolateTowards method especially nifty for things like phasors and for interpolating params across a vector.
The lerp3 method could potentially be done more efficiently, but it does the trick and isn't bottlenecking me at the moment.

// Lane fractions 0, 1/4, 2/4, 3/4, used to ramp a value across the four lanes.
export const interpolator: v128 = f32x4(0, 0.25, .5, .75);
export const splat0_5 = f32x4.splat(0.5);
export const splat0 = f32x4.splat(0);
export const splatMinusOne = f32x4.splat(-1);
export const splat1 = f32x4.splat(1);
export const splat2 = f32x4.splat(2);

export class SimdUtil {
	// Per-lane linear interpolation: a + t * (b - a).
	@inline static lerp(a: v128, b: v128, t: v128): v128 {
		return f32x4.add(f32x4.mul(t, f32x4.sub(b, a)), a);
	}
	// Ramps from `from` towards `to` across the lanes: from + (to - from) * {0, .25, .5, .75}.
	// `to` itself is not reached.
	@inline static interpolateTowards(from: f32, to: f32): v128 {
		return f32x4.add(f32x4.mul(interpolator, f32x4.splat(to - from)), f32x4.splat(from));
	}
	// Three-point interpolation: a to b for t in [0, 0.5], then b towards c for t in [0.5, 1].
	@inline static lerp3(a: v128, b: v128, c: v128, t: v128): v128 {
		const ratio = f32x4.mul(t, splat2);
		const firstRatio = f32x4.max(splat0, f32x4.min(splat1, ratio));
		const secondRatio = f32x4.max(splat0, f32x4.min(splat1, f32x4.sub(ratio, splat1)));
		const first = this.lerp(a, b, firstRatio);
		const second = this.lerp(b, c, secondRatio);
		const firstMul = f32x4.mul(first, f32x4.sub(splat1, secondRatio));
		const secondMul = f32x4.mul(second, secondRatio);
		return f32x4.add(firstMul, secondMul);
	}
	// Maps [0, 1] to [-1, 1].
	@inline static makeBipolar(v: v128): v128 {
		return f32x4.sub(f32x4.mul(v, splat2), splat1);
	}
	// Maps [-1, 1] to [0, 1].
	@inline static makeUnipolar(v: v128): v128 {
		return f32x4.mul(f32x4.add(v, splat1), splat0_5);
	}
	// Per-lane clamp of v to [min, max].
	@inline static clamp(v: v128, min: v128, max: v128): v128 {
		return f32x4.min(f32x4.max(min, v), max);
	}
	// Keeps only the fractional part, wrapping each lane into [0, 1).
	@inline static normalize(v: v128): v128 {
		return f32x4.sub(v, f32x4.floor(v));
	}
	// Loads one T from each of the four addresses held in the i32 lanes of `pointers`.
	@inline static gather_v<T>(pointers: v128): v128 {
		let ret: v128 = v128.load_zero<T>(i32x4.extract_lane(pointers, 0));
		ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 1), ret, 1);
		ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 2), ret, 2);
		ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 3), ret, 3);
		return ret;
	}
	// Same as gather_v, with the four addresses passed separately.
	@inline static gather<T>(ptr1: usize, ptr2: usize, ptr3: usize, ptr4: usize): v128 {
		let ret: v128 = v128.load_zero<T>(ptr1);
		ret = v128.load_lane<T>(ptr2, ret, 1);
		ret = v128.load_lane<T>(ptr3, ret, 2);
		ret = v128.load_lane<T>(ptr4, ret, 3);
		return ret;
	}
	// Stores lane i of `value` to the address held in lane i of `pointers`.
	@inline static scatter_v<T>(pointers: v128, value: v128): void {
		v128.store_lane<T>(i32x4.extract_lane(pointers, 0), value, 0);
		v128.store_lane<T>(i32x4.extract_lane(pointers, 1), value, 1);
		v128.store_lane<T>(i32x4.extract_lane(pointers, 2), value, 2);
		v128.store_lane<T>(i32x4.extract_lane(pointers, 3), value, 3);
	}
	// Same as scatter_v, with the four addresses passed separately.
	@inline static scatter<T>(ptr1: usize, ptr2: usize, ptr3: usize, ptr4: usize, value: v128): void {
		v128.store_lane<T>(ptr1, value, 0);
		v128.store_lane<T>(ptr2, value, 1);
		v128.store_lane<T>(ptr3, value, 2);
		v128.store_lane<T>(ptr4, value, 3);
	}
}
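
To make interpolateTowards and lerp3 easier to check by hand, here is a scalar TypeScript model of both, with plain numbers standing in for the f32 lanes. The names Vec4, rampTowards and lerp3Lane are made up for this sketch, and the phasor usage assumes one call per block of four samples, with `to` being the phase at the start of the next block.

```typescript
// Scalar stand-in for a v128 holding four f32 lanes (hypothetical helper type).
type Vec4 = [number, number, number, number];

// Model of SimdUtil.interpolateTowards: lane i = from + (to - from) * i / 4.
// `to` is never reached; it would be lane 0 of the next block.
function rampTowards(from: number, to: number): Vec4 {
	const d = to - from;
	return [from, from + 0.25 * d, from + 0.5 * d, from + 0.75 * d];
}

function clamp01(x: number): number {
	return Math.max(0, Math.min(1, x));
}

function lerp(a: number, b: number, t: number): number {
	return t * (b - a) + a;
}

// Model of one lane of SimdUtil.lerp3: same operations, same order.
function lerp3Lane(a: number, b: number, c: number, t: number): number {
	const ratio = t * 2;
	const firstRatio = clamp01(ratio);
	const secondRatio = clamp01(ratio - 1);
	const first = lerp(a, b, firstRatio);
	const second = lerp(b, c, secondRatio);
	return first * (1 - secondRatio) + second * secondRatio;
}

// Phasor example: four consecutive phases for a per-sample increment of 0.01.
const phases = rampTowards(0.5, 0.5 + 4 * 0.01);
console.log(phases); // approximately 0.5, 0.51, 0.52, 0.53

console.log(lerp3Lane(0, 1, 3, 0.25)); // 0.5, halfway from a to b
console.log(lerp3Lane(0, 1, 3, 0.5));  // 1, exactly b
console.log(lerp3Lane(0, 1, 3, 0.75)); // 1.5
console.log(lerp3Lane(0, 1, 3, 1));    // 3, exactly c
```

One thing the model makes visible: `second` is already interpolated by secondRatio and is then weighted by secondRatio again, so the upper half works out to b + s²(c - b) with s = 2t - 1. It eases in from b rather than following a straight segment (1.5 instead of 2 at t = 0.75 above). If a straight segment is what is wanted, lerp(first, c, secondRatio) would give that with fewer operations; whether the curve is intentional isn't stated here.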

The main issue now is that the UI has become a bit sluggish. There are a lot of filters and drop shadows, so I'll probably need to replace some of them with images to make re-rendering cheaper. Everything you see there is done with CSS (the only image currently is the X in the master section), so there are lots of opportunities to optimize. The catch is that it will be harder to tweak once I replace styled elements with images.

Answer 4 · 2024-07-30T20:13:36.000Z

Interesting find: performance greatly improved by replacing this (which is used in wavetable lookups):

return i32x4(
	load<i32>(i32x4.extract_lane(pointers, 0)),
	load<i32>(i32x4.extract_lane(pointers, 1)),
	load<i32>(i32x4.extract_lane(pointers, 2)),
	load<i32>(i32x4.extract_lane(pointers, 3))
);

with this:

let ret: v128 = f32x4.splat(0);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 0), ret, 0);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 1), ret, 1);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 2), ret, 2);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 3), ret, 3);
return ret;

A bit less pretty, but way faster: it made my entire engine about 20% faster.

This might still not be the optimal way to load a vector from pointers stored in another vector, due to the extraction of the 4 lanes. I tried extracting them all at once into a memory.data slot and using that, but that made things worse.
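
For anyone unsure what the gather is doing semantically, here is a small scalar TypeScript model. It is only a sketch: a DataView stands in for Wasm linear memory, and the names Quad, gatherF32 and scatterF32 are made up for the example. The point it illustrates is that each lane of `pointers` is a byte address, and each lane is loaded (or stored) independently.

```typescript
// Four lanes, used both for byte addresses and for f32 values (hypothetical type).
type Quad = [number, number, number, number];

// Stand-in for Wasm linear memory. Wasm is little-endian, hence the `true` flags.
const memory = new DataView(new ArrayBuffer(32));

// Model of gather for f32: lane i is loaded from byte address pointers[i].
function gatherF32(pointers: Quad): Quad {
	return [
		memory.getFloat32(pointers[0], true),
		memory.getFloat32(pointers[1], true),
		memory.getFloat32(pointers[2], true),
		memory.getFloat32(pointers[3], true),
	];
}

// Model of scatter for f32: lane i of value is stored at byte address pointers[i].
function scatterF32(pointers: Quad, value: Quad): void {
	memory.setFloat32(pointers[0], value[0], true);
	memory.setFloat32(pointers[1], value[1], true);
	memory.setFloat32(pointers[2], value[2], true);
	memory.setFloat32(pointers[3], value[3], true);
}

// Fill a tiny 8-entry "wavetable" of f32 values: entry i holds i * 0.5.
for (let i = 0; i < 8; i++) {
	memory.setFloat32(i * 4, i * 0.5, true);
}

// Look up entries 7, 0, 3 and 1 in one call (addresses are index * 4 bytes).
const looked = gatherF32([7 * 4, 0 * 4, 3 * 4, 1 * 4]);
console.log(looked); // values 3.5, 0, 1.5, 0.5

// Overwrite the first four entries in one call.
scatterF32([0, 4, 8, 12], [9, 8, 7, 6]);
```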

There's not much info out there on SIMD in AssemblyScript, so I'll share what I find in case it helps someone looking for SIMD tips.

EDIT: I added scatter / gather methods to the SIMD util class above, which do this.

Answer 5 · 2024-08-21T12:54:56.000Z

Here are some more SIMD methods that might be useful to someone:

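	// Largest power of two <= n, per i32 lane (for positive n).
	// Smears the highest set bit into every lower bit, then keeps only the top bit.
	@inline static getPreviousPowerOfTwo(n: v128): v128 {
		n = v128.or(n, v128.shr<i32>(n, 1));
		n = v128.or(n, v128.shr<i32>(n, 2));
		n = v128.or(n, v128.shr<i32>(n, 4));
		n = v128.or(n, v128.shr<i32>(n, 8));
		n = v128.or(n, v128.shr<i32>(n, 16));
		return i32x4.sub(n, v128.shr<i32>(n, 1));
	}
	// Count of leading zero bits per i32 lane, done lane by lane.
	@inline static clz_i32(v: v128): v128 {
		return i32x4(
			clz<i32>(i32x4.extract_lane(v, 0)),
			clz<i32>(i32x4.extract_lane(v, 1)),
			clz<i32>(i32x4.extract_lane(v, 2)),
			clz<i32>(i32x4.extract_lane(v, 3))
		);
	}
	// floor(log2(v)) per i32 lane, as 31 - clz(v).
	// splat31_i is not defined in this snippet; presumably i32x4.splat(31).
	@inline static log2_i32(v: v128): v128 {
		return i32x4.sub(splat31_i, this.clz_i32(v));
	}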

If anyone has a better idea on how to handle the clz operation without extracting the lanes, I'd love to hear it. Or the getPreviousPowerOfTwo method for that matter. It's pretty fast, but I suck at bitwise shenanigans, so it might be doable in some better way.

I use these for a wavetable anti-aliasing algorithm, to find "mipmaps" with a max frequency closest to the current frequency.
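
For reference, here is a scalar TypeScript model of one lane of getPreviousPowerOfTwo and log2_i32, plus one possible lane-free idea for the log2 part. The idea is not from the posts above and is untested in AssemblyScript: convert the integer to f32 and read floor(log2(n)) out of the float's exponent bits. The function names are made up for the sketch. The assumption is that in AssemblyScript this would map onto f32x4.convert_i32x4_u, a right shift by 23 and a subtraction of 127, which keeps all four lanes in the vector.

```typescript
// One 32-bit lane of getPreviousPowerOfTwo: smear the top set bit downwards,
// then subtract the lower half so only the top bit remains.
function previousPowerOfTwo(n: number): number {
	n |= n >> 1;
	n |= n >> 2;
	n |= n >> 4;
	n |= n >> 8;
	n |= n >> 16;
	return n - (n >> 1);
}

// One lane of log2_i32: 31 - clz(n).
function log2ViaClz(n: number): number {
	return 31 - Math.clz32(n);
}

// Alternative sketch: floor(log2(n)) taken from the exponent of n as an f32.
// Exact only for 1 <= n < 2^24; larger values can round up to the next power of two.
const scratch = new DataView(new ArrayBuffer(4));
function log2ViaFloatExponent(n: number): number {
	scratch.setFloat32(0, n);                    // integer -> f32
	return (scratch.getUint32(0) >>> 23) - 127;  // biased exponent minus the bias
}

for (const n of [1, 2, 3, 1000, 1024, 65535]) {
	console.log(n, previousPowerOfTwo(n), log2ViaClz(n), log2ViaFloatExponent(n));
}
```

For n >= 1 the two results are tied together by previousPowerOfTwo(n) == 1 << log2(n), so whichever is cheaper can be derived from the other. Whether the float route beats four scalar clz calls would need measuring.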

Edit: might as well share a recent screenshot:

Answer 6 · 2024-09-20T23:33:24.000Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions!