browsermt/bergamot-translator

ARM Support for bergamot-translator matrix-multiplies for Mozilla

jerinphilip opened this issue · 15 comments

[Will be edited as more information is available]

Fitting ruy to C intgemm interface

One clean approach appears to be to compute int8 * int8 -> int32 and then convert the result to float32 afterwards, to match intgemm.
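As a rough illustration of the int8 * int8 -> int32 -> float32 idea, here is a minimal reference sketch in plain C++. It does not use ruy or intgemm; the function name `MultiplyInt8ToFloat`, the row-major layout of both inputs, and the single `unquant_multiplier` scale are assumptions made for the example, not the actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not part of ruy or intgemm).
// A is rows_a x width, B is width x cols_b, both row-major here (an assumption).
// Products are accumulated in int32 and only then scaled to float32.
std::vector<float> MultiplyInt8ToFloat(const std::vector<int8_t>& a,
                                       const std::vector<int8_t>& b,
                                       std::size_t rows_a, std::size_t width,
                                       std::size_t cols_b,
                                       float unquant_multiplier) {
  assert(a.size() == rows_a * width);
  assert(b.size() == width * cols_b);
  std::vector<float> out(rows_a * cols_b);
  for (std::size_t i = 0; i < rows_a; ++i) {
    for (std::size_t j = 0; j < cols_b; ++j) {
      int32_t acc = 0;  // int32 accumulator for the int8 products
      for (std::size_t k = 0; k < width; ++k) {
        acc += static_cast<int32_t>(a[i * width + k]) *
               static_cast<int32_t>(b[k * cols_b + j]);
      }
      // Convert to float32 after the integer multiply, as described above.
      out[i * cols_b + j] = static_cast<float>(acc) * unquant_multiplier;
    }
  }
  return out;
}
```

A slow loop like this is mainly useful as a ground truth to compare a faster backend against.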

Apparently the rough edges in marian that rely on x86-specific calls have already been worked around somehow, so integrating ruy into marian is also an option (connecting at the point where Mozilla wants it).

Playground: https://github.com/jerinphilip/arm-playground

Noticing the following:

  1. Multiply two 8-bit matrices. There is probably some row-major/column-major ordering to handle: only one matrix arrives in column-major format, everything else is row-major. intgemm offers two variants, shifted and not shifted. ARM probably has something better that avoids the hacky workaround (shifted) needed on Intel.
  2. There are choices based on flags (bias, relu) among:
    a. UnquantiseAndAddBiasAndRelu
    b. UnquantiseAndAddBias
    c. JustUnquantiseRelu
    d. JustUnquantise
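The four choices above differ only in whether a bias is added and whether a ReLU is applied after unquantising. A minimal sketch of that post-processing step follows; the function name `UnquantiseWithOptions`, its signature, and the per-column bias layout are assumptions for illustration and are not the intgemm callback API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper covering the four flag combinations:
//   bias != nullptr, relu == true   -> "unquantise, add bias, relu"
//   bias != nullptr, relu == false  -> "unquantise, add bias"
//   bias == nullptr, relu == true   -> "just unquantise, relu"
//   bias == nullptr, relu == false  -> "just unquantise"
// acc holds row-major int32 accumulators with `cols` columns; bias (if given)
// has one float per output column (an assumption).
std::vector<float> UnquantiseWithOptions(const std::vector<int32_t>& acc,
                                         std::size_t cols,
                                         float unquant_multiplier,
                                         const float* bias, bool relu) {
  assert(cols > 0 && acc.size() % cols == 0);
  std::vector<float> out(acc.size());
  for (std::size_t idx = 0; idx < acc.size(); ++idx) {
    float value = static_cast<float>(acc[idx]) * unquant_multiplier;
    if (bias != nullptr) value += bias[idx % cols];
    if (relu) value = std::max(value, 0.0f);
    out[idx] = value;
  }
  return out;
}
```

Folding the flags into one function keeps the sketch short; a real backend would more likely select a specialised routine per combination to avoid branching in the inner loop.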

Integration Background

The approach currently undertaken is the fastest route to the Mozilla-dictated interface, as Mozilla is the primary customer right now (wasm_intgemm_interface.h: browsermt/marian-dev, jerinphilip/arm-playground).

As I understand it, https://github.com/mozilla-extensions/firefox-translations/issues/75#issuecomment-881543045 attempts to bring a relay through WebAssembly to call the native implementation of intgemm (AVX2+ depending on what's available on hardware), which from meetings I understand to be already integrated into gecko-dev. The WebAssembly VM or whatever intercepts the calls from JS and relays them to these AVX2+ functions for better speed.

There is a fallback path, which compiles intgemm to WebAssembly and generates the SSSE3 codepath, the same as the slower WebAssembly implementation before.

I have found the following so far tracked across several issues:

Some autovectorization seems to be helping on ARM, but I am still unable to enable NEON:

https://godbolt.org/z/EsKr7ajvb

What autovectorisation you would see depends on what the WebAssembly VM allows for.

kpu commented

Ok but the point of this work is to have code that runs natively in Firefox (gecko), not WebAssembly.

The implementation is being prepared in multiple parts, with concrete details starting to materialize now. The parts include

When the CI in the pull request above ultimately becomes green, we will have an ARM build. Target is my android phone, so I expect to be able to test it. If somebody has an M1 device and can lend a hand, please feel free to help out with testing. There is a bergamot-translator part of this, but I hope just a submodule update here will be enough.

The position this task finds itself in is weird. The C-interface is written based on intgemm. intgemm assumes x86 with a lot of registers, intrinsics in source not guarded behind an ifdef.

  1. We may try to make ARM a new CPU in intgemm, but I don't expect the outcome to be too pleasant. This will bifurcate the intgemm source into x86 and ARM, and the wiring in intgemm is not necessarily reusable for ARM.
  2. The alternative is to hide intgemm behind ifdefs, bring ruy into marian-dev, and implement the Mozilla-dictated API through ruy. A dumbed-down version of this is available at https://github.com/jerinphilip/arm-playground/blob/87e3c51f4a3f5e71a3b3ff019ca49d4fa4018eee/src/impl_ruy-export.cpp. The next step is to create ifdef guards per platform, creating branches in marian-dev (which is what is about to be pursued in jerinphilip/marian#1). This is not very pleasant either. Marian-dev is already high entropy, so we're probably not making it much worse than it already is. Such an approach will involve someone at Mozilla (cc @abhi-agg, @andrenatal) to get ruy into gecko-dev, same as what was done with intgemm.

Among the above, we're going with (2) for now. This is a mess of ifdef which I will start now, slowly bringing CI to green. Might have to redo this after a first round of experimentation.
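To make the ifdef approach in (2) concrete, here is a minimal sketch of a compile-time backend switch. The macro `SKETCH_USE_RUY` and the functions `BackendName` and `CheckBackendSelection` are hypothetical names for this example only; the guards actually used in marian-dev may look quite different. Only the compiler-predefined architecture macros (`__aarch64__`, `__arm__`, `_M_ARM64`) are real.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical switch: choose ruy on ARM targets, intgemm elsewhere.
#if defined(__aarch64__) || defined(__arm__) || defined(_M_ARM64)
#define SKETCH_USE_RUY 1
#else
#define SKETCH_USE_RUY 0
#endif

// In a real build, each branch would include the matching library header and
// implement the same interface; here it only reports which branch was compiled.
inline const char* BackendName() {
#if SKETCH_USE_RUY
  return "ruy";
#else
  return "intgemm";
#endif
}

// Usage example: the reported name agrees with the compile-time switch.
inline void CheckBackendSelection() {
  assert(std::strcmp(BackendName(), SKETCH_USE_RUY ? "ruy" : "intgemm") == 0);
}
```

Keeping the switch in one place and having every other file depend only on the shared interface limits how far the ifdefs spread through the codebase.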

Please let me know if the original authors of the API have a cleaner entry-point / cut point.

v0

A Ruy-based backend with a slower implementation now builds successfully on ARM. The relevant implementation of the firefox interface via Ruy is:

All except int8PrepareBFromTransposed are currently covered in tests with references coming from intgemm based implementation in x86. int8PrepareBFromTransposed is not covered because the current fallback reference I'm using is aborting with not-implemented.

Tests involve comparing the intgemm path against the ruy path on the same firefox API (on x86) and are currently succeeding as well.
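The comparison between the two paths amounts to checking that both produce the same float output within some tolerance. A minimal sketch of such a check in plain C++ follows (the actual tests use googletest; the helper name `AllClose` and the absolute-tolerance criterion are assumptions for this example).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when both outputs have the same size and
// every pair of elements differs by at most `tolerance` in absolute value.
bool AllClose(const std::vector<float>& expected,
              const std::vector<float>& actual, float tolerance) {
  assert(tolerance >= 0.0f);
  if (expected.size() != actual.size()) return false;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (std::fabs(expected[i] - actual[i]) > tolerance) return false;
  }
  return true;
}
```

In a test, `expected` would come from the reference path (intgemm or a slow loop) and `actual` from the path under test, both fed identical inputs.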

As WebAssembly is intercepting int8* calls to be relayed to functions in firefox binary according to what I understand from https://gist.github.com/yurydelendik/cc02ba86128ed46d622e8c3099c8a510 and discussions with @kpu, marian ARM builds are not a strict requirement.

The next step is improving performance (better transpose, vectorization etc).

FWIW support for ARM 32-bit is not really needed, we are looking for Aarch64 specific support at the moment, which might be easier to test and develop.

As WebAssembly is intercepting int8* calls to be relayed to functions in firefox binary according to what I understand from https://gist.github.com/yurydelendik/cc02ba86128ed46d622e8c3099c8a510 and discussions with @kpu, marian ARM builds are not a strict requirement.

@jerinphilip That's right. The plan is to compile bergamot-translator on wasm and use native code only for gemm calls. Therefore, the only native code that we need is the gemm implementations for intel and arm architecture. => Marian doesn't have to be compiled on arm for our use case.

We are in the process of landing the intel implementation (i.e. intgemm) in gecko. Once the arm implementation is ready, we plan to land that as well.

I just want to clarify the ARM term here. The 32-bit ARM support will not be needed at all in the near future. The Aarch64 support (aka ARM64) is needed in support of Bergamot in Firefox. I cannot see that any differentiation between these platforms is made above.

Target is my android phone, so I expect to be able to test it. If somebody has an M1 device and can lend a hand, please feel free to help out with testing.

Is your phone Aarch64? I see github actions can support "linux/arm64". Can we use that instead of Android NDK?

Raspberry PI4 is a cheap alternative for M1, if the latter is not available.

We're not working at this level, we're consuming ruy. So everything that is available at https://github.com/google/ruy/blob/8c3fd3f266b4a22d542d4aa41329b5018d6b87e1/ruy/path.h#L24-L91 will be available to the firefox-interface implementation (including alternate intel paths, and a fallback implementation). The CI configuration is armv8-a (aarch64) with neon intrinsics (for simd). This is the device I intend to run stuff for a sanity check.

I'll try to add a CI for:

Is your phone Aarch64? I see github actions can support "linux/arm64". Can we use that instead of Android NDK?

My initial attempts ran into trouble with missing libraries (math etc.), so I started off with android, which looked far better tooled and more complete.

kpu commented

ruy abstracts over several ARM variants, so we're just ensuring ruy is called / adapted correctly. Phone support seems to come for free here, so might as well do it if we're doing M1. We should eventually test on M1.

@jerinphilip I can make you 4 cores of ARM-based Ubuntu in the cloud for free if that helps. https://docs.oracle.com/en-us/iaas/Content/FreeTier/freetier_topic-Always_Free_Resources.htm

All except int8PrepareBFromTransposed are currently covered in tests with references coming from intgemm based implementation in x86. int8PrepareBFromTransposed is not covered because the current fallback reference I'm using is aborting with not-implemented.

@jerinphilip Could you please also implement int8PrepareBFromTransposed? I will add the corresponding implementation for intel. The reason I kept it unimplemented was because currently prepareBtransposed doesn't get called anywhere from marian but @XapaJIaMnu pointed out that he will backport his change to start using it in marian (as per comment https://github.com/mozilla-extensions/firefox-translations/issues/75#issuecomment-884060757 and https://github.com/mozilla-extensions/firefox-translations/issues/75#issuecomment-884047347).

https://github.com/jerinphilip/MozIntGemm has been taken to completion by @kpu. This abstracts over two libraries, intgemm and ruy, and switches library based on the target platform. There is additional automation providing a source tarball for integration into Mozilla, optimized for minimal dependencies. Save for a few final adjustments that can be carried over into separate issues, the Mozilla-relevant bits of the task here are completed, so I am closing this issue.

It is possible to make the source transform feed directly into a gecko-dev directory structure if someone could let me know how best to do this. I prefer keeping the source transform because the debugging mechanisms etc. (which depend on googletest) need to be stripped to minimize hurdles for Mozilla partners. The source transform, automated via GitHub Actions, should potentially be able to continuously move the build system from CMake to Mozilla's mach build system, reducing the manual labour that would otherwise be required.

By construction, because it relies on functions visible only inside Firefox via an intrinsic added into WASM, this source cannot be used for anything else such as #324, which will therefore be pursued independently. Because this is software, I expect there will be some issues; most of this repository was developed walking in the dark, given the absence of certain development elements and tooling. Please file maintenance issues/bugs in the MozIntGemm repository.

From #205, the following can now be checked off:

  • Test cases for this (intgemm) implementation
    Via googletest, ground truths from either a slow reference or intgemm
  • Native implementations of the interface for other architectures (Armv8.0/AVX2/Armv8.5? + additional to be decided later)
    Via ruy.
  • Test cases for the (arm) implementations
    Via googletest ground truths from slow reference or intgemm