Illegal instruction (core dumped) on Raspberry Pi 4B
unwind opened this issue · 11 comments
When running with the latest (1.9.0) wheel from here, as per the installation instructions, my project's Torch code crashes every time with an illegal instruction (SIGILL).
The top stack frames looked like this:
(gdb) where
#0  0x0000ffffd286dfc8 in exec_blas () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#1  0x0000ffffd283f150 in gemm_driver () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#2  0x0000ffffd283fbd0 in sgemm_thread_nn () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#3  0x0000ffffd28385bc in sgemm_ () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#4  0x0000ffffcfc38b8c in at::native::cpublas::gemm(at::native::cpublas::TransposeType, at::native::cpublas::TransposeType, long, long, long, float, float const*, long, float const*, long, float, float*, long) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#5  0x0000ffffcfce5c48 in at::native::addmm_impl_cpu_(at::Tensor&, at::Tensor const&, at::Tensor, at::Tensor, c10::Scalar const&, c10::Scalar const&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#6  0x0000ffffcfce68d0 in at::native::mm_cpu_out(at::Tensor const&, at::Tensor const&, at::Tensor&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#7  0x0000ffffcfce6a34 in at::native::mm_cpu(at::Tensor const&, at::Tensor const&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#8  0x0000ffffd056c784 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#9  0x0000ffffd039b464 in at::redispatch::mm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#10 0x0000ffffd1b5659c in torch::autograd::VariableType::(anonymous namespace)::mm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
Looking at the disassembly at the indicated location, I got:
(gdb) disassemble
Dump of assembler code for function exec_blas:
0x0000ffffd286df70 <+0>: adrp x2, 0xffffd40c6000
0x0000ffffd286df74 <+4>: stp x29, x30, [sp, #-80]!
0x0000ffffd286df78 <+8>: mov x29, sp
0x0000ffffd286df7c <+12>: ldr x3, [x2, #2376]
0x0000ffffd286df80 <+16>: mov x2, x0
0x0000ffffd286df84 <+20>: stp x19, x20, [sp, #16]
0x0000ffffd286df88 <+24>: mov x20, x1
0x0000ffffd286df8c <+28>: ldr w0, [x3]
0x0000ffffd286df90 <+32>: cbz w0, 0xffffd286e000 <exec_blas+144>
0x0000ffffd286df94 <+36>: cmp x2, #0x0
0x0000ffffd286df98 <+40>: ccmp x20, #0x0, #0x4, gt
0x0000ffffd286df9c <+44>: b.eq 0xffffd286dff0 <exec_blas+128> // b.none
0x0000ffffd286dfa0 <+48>: adrp x19, 0xffffd4150000 <memory+1984>
0x0000ffffd286dfa4 <+52>: add x1, sp, #0x38
0x0000ffffd286dfa8 <+56>: add x4, x19, #0x4f0
0x0000ffffd286dfac <+60>: mov w0, #0x1 // #1
0x0000ffffd286dfb0 <+64>: add x4, x4, #0x40
0x0000ffffd286dfb4 <+68>: nop
0x0000ffffd286dfb8 <+72>: nop
0x0000ffffd286dfbc <+76>: nop
0x0000ffffd286dfc0 <+80>: strb wzr, [sp, #56]
0x0000ffffd286dfc4 <+84>: mov w3, #0x0 // #0
=> 0x0000ffffd286dfc8 <+88>: casalb w3, w0, [x4]
0x0000ffffd286dfcc <+92>: cbnz w3, 0xffffd286dfc0 <exec_blas+80>
0x0000ffffd286dfd0 <+96>: adrp x0, 0xffffd286d000 <inner_thread+2192>
0x0000ffffd286dfd4 <+100>: stp x2, x20, [sp, #56]
0x0000ffffd286dfd8 <+104>: add x0, x0, #0xbcc
0x0000ffffd286dfdc <+108>: str xzr, [sp, #72]
This seems to indicate that the culprit is the CASALB instruction, which as far as I can understand is part of ARMv8.1 (the LSE atomics), while the Raspberry Pi 4B has an ARMv8.0-compliant core.
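For what it's worth, here's a quick way to check this on one's own hardware. My understanding is that on 64-bit Linux the v8.1 LSE atomics show up as an "atomics" flag in /proc/cpuinfo (the flag name and file layout are my assumption about the kernel's reporting, so take it with a grain of salt):

# Rough check for the ARMv8.1 LSE atomics (the extension CASALB belongs to).
# Assumes a 64-bit Linux /proc/cpuinfo, where LSE is reported as the "atomics"
# feature flag; on a Pi 4 (ARMv8.0) it should be absent.
with open("/proc/cpuinfo") as f:
    feature_lines = [line for line in f if line.startswith("Features")]

has_lse = any("atomics" in line.split() for line in feature_lines)
print("LSE atomics supported:", has_lse)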
I hope this can be fixed, since building Torch myself seems daunting (and also since, assuming I'm right above, this is not the intended behavior).
Thanks for making this available.
Hi.
Currently the PyTorch wheels for Python 3.6 - 3.9 are installed from the official PyPI source.
It's very likely that the PyTorch team built them on some enterprise cloud VM with ARM CPUs, which are ARMv8.2 based, so the wheels don't disable the v8.1 instructions.
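If you want to double-check which build you're actually running, a quick diagnostic sketch like this should print the installed version and the build settings (including the BLAS backend the wheel was compiled against):

import torch

print(torch.__version__)        # confirms which wheel is actually installed
print(torch.__config__.show())  # build info, including the BLAS backend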
Do you have any sample code?
I don't know much C, but I'll try to compile from source in a few days if the problem reproduces (my board is not available this week).
Thanks!
Hi.
Thanks for the rapid response. I'm not sure if I have code I can share; perhaps I can stitch something together, but it will be a while. This is my last week before going on vacation, and there are other things to do in the project.
Thanks!
Hi again.
Okay, here's an attempt at a reproduction case:
#!/usr/bin/env python3
import torch

# (64, 64) @ (64, 108) -> (64, 108); since 12 * 3 * 3 == 108, the view back
# into the conv-weight shape is valid.
w_bn = torch.randn(64, 64)
w_conv = torch.randn(64, 108)
w = torch.randn(64, 12, 3, 3)
w.copy_(torch.mm(w_bn, w_conv).view(w.size()))
This crashes with a core dump every time I run it. Apologies for the random-seeming dimensions; it's just what our project seemed to be using (I'm not the author of the PyTorch-using code in our project, so I lack deeper understanding).
I did not trace this down to the core dump, but I'd say chances are pretty good this is the same crash. I now understand that the "mm" in the trace above refers to a matrix multiplication, and this line of code (the same as in our project, except of course the data has been replaced with random matrices) calls mm() and never returns.
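If it helps narrow things down, my guess (an assumption on my part, based on the backtrace going mm -> sgemm_) is that a bare mm() call is enough to trigger it, without the copy_ and view:

import torch

a = torch.randn(64, 64)
b = torch.randn(64, 108)
c = torch.mm(a, b)  # should dispatch into the same sgemm_ path as in the backtrace
print(c.shape)      # torch.Size([64, 108]), if it gets this far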
Good luck!
Hi, thank you for your replies!
I just tried your sample code on Python 3.8 (official wheel) and Python 3.10 (wheel from this repo). The results are:
On Python 3.8, bash reported Illegal instruction (core dumped) and exited, and fish reported fish: Job 1, “python3” terminated by signal SIGILL (Illegal instruction) and then exited.
On Python 3.10, it printed
tensor([[[[ 1.0143e+01, -1.2882e+01, -1.1660e+00],
[ 1.1609e+01, 7.4942e+00, 3.3680e-01],
[ 8.7291e+00, 1.7029e+01, -1.6758e+01]],
...
successfully.
I don't really understand what the code does, but I think this supports the assumption above.
I'll build wheels for 3.6 - 3.9 ASAP. Thank you again!
Okay great, feel free to drop me a line when you have wheels available and hopefully I can test, too.
Thanks!
Hi @unwind, the wheels are updated. You may try if it works (link for the Python 3.8 wheel here)!
Hi,
I have the same problem on my Raspberry Pi 4B with Python 3.8.10.
Sadly, after installing the updated wheel, the situation is the same.
Hi, could you provide your error report(s) and sample code?
I've tried the code above on the new wheel, but it worked normally. Could it be some difference in your code that causes the problem?
Thanks!
Hmm. I tested the same code and it's the problematic casalb instruction again. The official torch 1.8.1 is working.
Oh, I see. After upgrading again it is working. Maybe the problem was that I didn't uninstall the official 1.9.0 before installing this version.
Thanks!
Hi!
It does seem to resolve the issue for me on my Raspberry Pi target. I had to (as you say) download your wheel manually and pip install it directly from the file, but that was expected and worked well.
Thanks!
Hi, I have the same issue here, but it happens when I try to feed an image to my model. I'm using one of the existing models in PyTorch.
I tried to download the wheel manually and install it, but I got an error saying torch-1.9.0-cp310-cp310-linux_aarch64.whl is not a supported wheel on this platform. I tried the cp36 linux and cp36 manylinux ones, but they all give the same thing.
I have tried multiple models; one of them is as follows:
from torchvision import models

net = models.quantization.mobilenet_v2(pretrained=True)
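For context, the forward pass that crashes for me looks roughly like this (the 1x3x224x224 shape and the random tensor are stand-ins for my real image preprocessing):

import torch
from torchvision import models

net = models.quantization.mobilenet_v2(pretrained=True)
net.eval()

x = torch.randn(1, 3, 224, 224)  # random stand-in for a preprocessed image
with torch.no_grad():
    out = net(x)  # the illegal instruction hits somewhere in this forward pass
print(out.shape)  # torch.Size([1, 1000]) when it works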
@unwind perhaps you can let me know which wheel file you tried. Thanks!
I'm using Python 3.9.2.