foss-for-synopsys-dwc-arc-processors/linux

HS58 Linux: sshd process crashes on "Invalid Read" when issuing a command using Python's Fabric package

Closed this issue ยท 21 comments

Sometimes when sending commands to Linux over SSH from Python scripts, sshd process on Linux crashes with "Invalid Read" message.

Minimal Python example:

from fabric import Connection

connection = Connection(
    host="10.10.10.10",
    user="root",
    port=22,
    connect_kwargs={"password": "mypassword"})

connection.open()

index = 0
while True:
    for _ in range(10):
        connection.run("uname", hide=True)
        index += 1
    print("Commands count:", index)

After a while of sending uname commands in a loop the sshd crashes. Linux shows the following message:

# potentially unexpected fatal signal 11.
Path: /usr/sbin/sshd
CPU: 2 PID: 22479 Comm: sshd Tainted: G           O      5.15.127 #2
Invalid Read @ 0x0000003c by insn @ 0x20228cee
  @off 0x164cee in [/usr/lib/libcrypto.so.3]  VMA: 0x200c4000 to 0x202f0000
ECR: 0x00050100 EFA: 0x0000003c ERET: 0x20228cee
STAT: 0x80080282 [IE U     ]   BTA: 0x2035fbd0
 SP: 0x5f85f520  FP: 0xbcb494e4 BLK: 0x20229542
r00: 0x400e9540 r01: 0x400e95a8 r02: 0xec61ff54
r03: 0x00000080 r04: 0x27f79973 r05: 0xd830f2ea
r06: 0x9db4cdba r07: 0x6d8e9534 r08: 0x0384c021
r09: 0x4a54ebd5 r10: 0x00000010 r11: 0xa48f14f4
r12: 0x33d6d26f r13: 0x00000000 r14: 0x5f85f588
r15: 0x202e6030 r16: 0x5f85f58c r17: 0x5d1b18c8
r18: 0x49bedbf8 r19: 0x5f85f584 r20: 0x0000001c
r21: 0x25869130 r22: 0x800001f6 r23: 0x81937f17
r24: 0x00000012 r25: 0x00000011

Knowledge so far:

  • issue doesn't reproduce when running Linux on a single core,
  • reproduces only when using Python's Fabric module or underlying Paramiko module, which is a Python SSHv2 implementation. No reproduction with standard SSH tools,
    • Paramiko by default uses aes128-ctr encryption and hmac-sha2-256 mac, the same parameters can be applied in SSH tool but the result is the same,
  • reproduction is somewhat random, sometimes first sent command will break sshd, sometimes it takes couple thousand commands, but most of the time it can be reproduced in a few seconds,
  • command used doesn't matter, in the example uname is sent and the reply is simply Linux,
  • seen on OpenSSH_9.3p2 with OpenSSL 3.0.12. Using OpenSSH_9.6p1 fixes the issue, though the root cause is unknown.

This issue looks similar to #163 but is much easier to trigger.

Setting affinity for the sshd seems to influence this issue. With default affinity set to 0-2 issue reproduces. With affinity set to just single core (0, 1, or 2), issue is not observed.
That's how I set affinity

taskset -apc <affinity> <pid>

So far I've seen the following values in the invalid read address. They are all divisible by 4 and < 0x40:
0x04, 0x08, 0x10, 0x14, 0x1c, 0x24, 0x28, 0x2c, 0x30, 0x34, 0x38, 0x3c.

Full stack traces:
crashes.txt

Using gdbserver I was able to grab the following information:

The invalid read exception is caused by the instruction
ld.as r17,[r16,-17]
with $r16 = 0x58, $r16+4*-17 = 0x14, the address is obviously invalid.

More context:

(gdb) bt
#0  sha256_block_data_order (ctx=0x400e9588, in=<optimized out>, num=<optimized out>)
    at crypto/sha/sha256.c:344
#1  0x2022962a in SHA256_Final (
    md=0x400e6500 "$\367\220\001D\237\221\321UH\274\326\177\375#\032\371\3321u\210Q\316h\310\035\376\001\315oF\215", c=0x400e9588) at include/crypto/md32_common.h:219
#2  0x2026f85a in sha256_internal_final (ctx=0x400e9588,
    out=0x400e6500 "$\367\220\001D\237\221\321UH\274\326\177\375#\032\371\3321u\210Q\316h\310\035\376\001\315oF\215", outl=0x5f8dd5a8, outsz=<optimized out>) at providers/implementations/digests/sha2_prov.c:72
#3  0x201c9c40 in BASIC_CONSTRAINTS_free (a=<optimized out>) at crypto/x509/v3_bcons.c:43
Backtrace stopped: frame did not save the PC

and the assembly:

  165292:	2044 2f10           	and	r16,r16,0x3c
  165296:	2415 2451           	add2	r17,r20,r17
  16529a:	2055 2450           	add2	r16,r16,0x11
  16529e:	11ef a616           	ld.as	r22,[r17,-17]
  1652a2:	2e43 2491           	ror	r17,r22,0x12
  1652a6:	2e43 21d5           	ror	r21,r22,0x7
  1652aa:	2000 2350           	add	r16,r16,r13
  1652ae:	2e41 20d6           	lsr	r22,r22,0x3
  1652b2:	2507 2455           	xor	r21,r21,r17
  1652b6:	10ef a611           	ld.as	r17,[r16,-17]

From the above we see that the r16 value is truncated by the "and" operation so the only way for it to become a valid pointer is the addition of r13, however gdb reports that $r13 = 0x0.

I gathered some debug information which hopefully helps. Linux was stopped in the following place and then the sshd crash was reproduced using the example script:

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index e16649811870..137ac5f56ce7 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -179,6 +179,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
                si_code = BUS_ADRERR;
        }
        else {
+               __asm__("flag 1");
                sig = SIGSEGV;
        }

call_stack_core1.txt
call_stack_core2.txt
call_stack_core3.txt
l2tlb_core1.txt
l2tlb_core2.txt
l2tlb_core3.txt
smart_core1.txt
smart_core2.txt
smart_core3.txt

I took a look at the ftrace log which was stopped inside do_page_fault function just before the bug occurred. There were calls to do_signal and sys_rt_sigreturn. These functions store/restore user registers, which is suspicious as we have r13 corrupted. I then looked at which regs are saved/restored and it's only r0 to r12. Inside user_regs_struct r13 is marked as callee register, which is inconsistent with the ARCv3 documentation which says that r13 should be caller-saved.

See:

struct user_regs_struct {
unsigned long pad;
struct {
unsigned long bta, lp_start, lp_end, lp_count;
unsigned long status32, ret, blink, fp, gp;
unsigned long r12, r11, r10, r9, r8, r7, r6, r5, r4, r3, r2, r1, r0;
unsigned long sp;
} scratch;
unsigned long pad2;
struct {
unsigned long r25, r24, r23, r22, r21, r20;
unsigned long r19, r18, r17, r16, r15, r14, r13;
} callee;
unsigned long efa; /* break pt addr, for break points in delay slots */
unsigned long stop_pc; /* give dbg stop_pc after ensuring brkpt trap */
};

Can You comment on that?

Hi @srusecki, thanks for your debug info and you are right. Pavel also found this issue with r13 reg save/restore and we are on the way of fixing it. We will prepare a patch soon.

I moved the r13 to appropriate place and saved it correctly. The issue seems to be fixed.
@mroz1k also created a test app that reproduces it without the use of ssh and it also works now.

Waiting for an official patch.

Here's the diff with changes I've made to move r13 into scratch registers region:
r13_scratch_diff.txt

Unfortunately this fix doesn't help for #163

And here is the simple program that reproduces the r13 corruption: r13-corruption-repro.zip, it must be compiled with -fzero-call-used-regs=all

The patch from @srusecki fixes the issue.

Is the patch from the other issue https://github.com/foss-for-synopsys-dwc-arc-processors/linux/files/15313249/regs.patch something official, so we can apply it to finally close the issue? Or you plan to deliver additional changes? We are planning SW release in 2 weeks, so would be good to have it applied sooner than later to have time to test it.
Let us know what should we apply to close it.

Please use this regs.patch in your flow right now. This is not official patch since it can impacts other code in other projects and we are discussing on how to prepare official patch, it won't take too much time. I will let you know about official patch.

I checked that this patch works with gdbserver: arcv3_r13.patch

But you also need to change 'callee' to 'scratch' for r13 in case of using the patch and gdbserver. https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2022.09/gdbserver/linux-arc64-low.cc#L188C3-L188C69 and https://github.com/foss-for-synopsys-dwc-arc-processors/binutils-gdb/blob/arc-2022.09/gdbserver/linux-arc64-low.cc#L245

That is correct. I can tuck those changes in our ARCv3 gdbserver as soon as our downstream linux has the patch applied onto its arc64 branch.

Also the user_regs_struct of the ptrace.h inside the linux-headers of arc-gnu-toolchain needs to be adjusted. Those headers are used for building glibc and providing user space headers.

Hi @shahab-vahedi,

We have merged arcv3_r13.patch patch to arc64 branch. Could you fix also GDB? Thanks.

8fbed42

The changes are populated to

arc64 branch of binutils-gdb
arc-2024.06 branch of bintuilts-gdb
arc-2024.06-gdb branch of binutils-gdb
master branch of arc-gnu-tool-chain

Hello @shahab-vahedi,

I see you applied the patch on 3 branches of binutils-gdb repository.
Previously we were using released binutils-gdb in our solution, tag arc-2023.09-release.
Which one should we use now? Can you make a new release or we need to use a specific commit for now?

@srusecki we're currently finalizing work on arc-2024.06 release, which we plan to publish in late July. Given that fix will be included in arc-2024.06 I would suggest using exactly that upcoming release. Will it work for you?

Ok, we can wait for the release.

However, can you tell which branch we can use (arc64, arc-2024.06 or arc-2024.06-gdb) to check if the issue is fixed on our side?

@srusecki we'll use contents of arc-2024.06 branches for the release. At the time of release build generation we'll put git tags on corresponding commits so that the state is known and fixed.

Hello @shahab-vahedi,

I see you applied the patch on 3 branches of binutils-gdb repository. Previously we were using released binutils-gdb in our solution, tag arc-2023.09-release. Which one should we use now? Can you make a new release or we need to use a specific commit for now?

Hi @srusecki . The changes I've mentioned in binutils-gdb (and arc-gnu-toolchain) are about addressing the issue that gdb(server) reports 0 for the r13 register. See this comment from @mroz1k .

The fix regarding your original issue (SSHD crashes) are in arc64 branch of our linux repository.

If you know how to build the toolchain and a vmlinux image, then feel free to use all the arc64 branches of the tools (binutils, gcc, glibc, etc.).