pytorch/multipy

Future ARM support?

saareliad opened this issue · 6 comments

Hi, do you plan to support ARM in the future?

d4l3k commented

We'd love to add ARM support, but I don't have a specific timeline for when that might happen right now. We'd be happy to work with someone to add support if there's community interest in contributing.

Can you share more details about what type of hardware you want to use multipy/deploy with? Are you targeting mobile/embedded devices or desktop-style hardware, e.g. Macs/Graviton/etc.?

The main changes would be to improve the loaders depending on the environment:

I believe the loader was adapted from the Android linker implementation with the non-x86 bits removed. It's feasible to add the ARM bits back in, though it may be quite a bit of work when dealing with the full end-to-end PyTorch/Python build:

https://android.googlesource.com/platform/bionic/+/android-6.0.1_r1/linker/linker.cpp
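
To give a rough idea of what's involved, here's a minimal illustrative sketch of per-architecture relocation handling for aarch64 -- this is not the actual multipy loader code, and apply_aarch64_reloc is a made-up name:

```cpp
// Illustrative only: a real loader also has to do symbol lookup,
// PLT/GOT setup, and the TLS handling discussed further down.
#include <elf.h>
#include <cstdint>

// base     : load bias of the shared object being loaded
// sym_addr : resolved address of the symbol the relocation refers to (0 if none)
static bool apply_aarch64_reloc(const Elf64_Rela& r, uint8_t* base,
                                uint64_t sym_addr) {
    uint64_t* where = reinterpret_cast<uint64_t*>(base + r.r_offset);
    switch (ELF64_R_TYPE(r.r_info)) {
        case R_AARCH64_RELATIVE:   // load bias + addend
            *where = reinterpret_cast<uint64_t>(base) + r.r_addend;
            return true;
        case R_AARCH64_GLOB_DAT:   // GOT entry
        case R_AARCH64_JUMP_SLOT:  // PLT entry
        case R_AARCH64_ABS64:      // direct 64-bit address
            *where = sym_addr + r.r_addend;
            return true;
        // The TLS relocations (R_AARCH64_TLS_DTPMOD, R_AARCH64_TLS_DTPREL,
        // R_AARCH64_TLS_TPREL, R_AARCH64_TLSDESC) are the hard part and are
        // what the caveats below are about.
        default:
            return false;  // unsupported relocation type
    }
}
```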

Hi @d4l3k , I'm actually targeting Datacenter / near-edge, server.
The ARM cores will mostly run very lightweight pre-/post-processing functions (e.g., tokenization in NLP), which will be added as custom ops (as done in libraries like torchtext/torchaudio/...), while the rest of the compute will be offloaded to computational accelerators.
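
Roughly the kind of op I have in mind -- a minimal sketch using the standard TORCH_LIBRARY registration path; the myops namespace and the toy whitespace "tokenizer" are just placeholders:

```cpp
#include <torch/library.h>
#include <ATen/ATen.h>
#include <sstream>
#include <string>
#include <vector>

// Toy "tokenizer": maps each whitespace-separated token to its length.
at::Tensor toy_tokenize(const std::string& text) {
  std::istringstream in(text);
  std::vector<int64_t> lengths;
  for (std::string tok; in >> tok;) {
    lengths.push_back(static_cast<int64_t>(tok.size()));
  }
  return at::tensor(lengths, at::kLong);
}

// Registered the same way torchtext/torchaudio register their ops;
// "myops" is a placeholder namespace.
TORCH_LIBRARY(myops, m) {
  m.def("toy_tokenize", toy_tokenize);
}
```

After torch.ops.load_library(...) this would show up as torch.ops.myops.toy_tokenize inside the packaged model, the same way the torchtext/torchaudio ops do.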

d4l3k commented

I got multipy working on aarch64 in a bit of a hacky way but we can polish this up so it does things correctly.

(venv-multipy) ubuntu@ip-172-31-38-182 ~/m/m/r/build (main)> ./interactive_embedded_interpreter
Registering torch::deploy builtin library tensorrt (idx 0) with 0 builtin modules
torch::deploy builtin tensorrt contains 0 modules
Registering torch::deploy builtin library cpython_internal (idx 1) with 0 builtin modules
torch::deploy builtin cpython_internal contains 6 modules
Registering torch::deploy builtin library tensorrt (idx 0) with 0 builtin modules
torch::deploy builtin tensorrt contains 0 modules
Registering torch::deploy builtin library cpython_internal (idx 1) with 0 builtin modules
torch::deploy builtin cpython_internal contains 6 modules
[W OperatorEntry.cpp:150] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::get_gradients(int context_id) -> Dict(Tensor, Tensor)
    registered at /home/ubuntu/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: (catch all)
  previous kernel: registered at /home/ubuntu/pytorch/torch/csrc/jit/runtime/register_distributed_ops.cpp:278
       new kernel: registered at /home/ubuntu/pytorch/torch/csrc/jit/runtime/register_distributed_ops.cpp:278 (function registerKernel)
--Return--
> /home/ubuntu/.pyenv/versions/3.9.13/lib/python3.9/pdb.py(1626)set_trace()->None
-> pdb.set_trace(sys._getframe().f_back)
(Pdb) import torch
(Pdb) torch.zeros(100)
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])
(Pdb) import platform
(Pdb) platform.machine()
'aarch64'

Caveats:

  • this only works with Python/Torch/C extensions that are compiled with -mtls-dialect=trad, since the loader doesn't yet handle ARM64's TLSDESC relocations (see the sketch after this list)
  • DTP* relocations are also handled in a pretty hacky way (module_id is just set to 0), but that doesn't seem to be causing any issues
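
To illustrate the first caveat: any extension code that touches thread-local storage has to be built with the traditional TLS dialect. A toy example (hypothetical file name and build line, not something we've validated end to end):

```cpp
// toy_ext.cpp -- hypothetical extension translation unit.
// Built with something like:
//   g++ -shared -fPIC -mtls-dialect=trad toy_ext.cpp -o toy_ext.so
// the object only contains traditional TLS relocations, which the prototype
// loader can handle; with the default TLSDESC dialect on aarch64 it
// currently cannot.
#include <cstdint>

// Every access to this variable from the shared object goes through a TLS
// relocation, so the dialect it was compiled with matters.
thread_local int64_t calls_on_this_thread = 0;

extern "C" int64_t bump_call_count() {
    return ++calls_on_this_thread;
}
```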

I've been testing on a Graviton3 instance.

d4l3k commented

@saareliad can you share what ARM architecture you're using? 64-bit? v8?

Yes, 64-bit v8.2 (N1).

d4l3k commented

@saareliad sounds good, that should work with this prototype code -- 32-bit won't.