AFLplusplus/LibAFL

SIGSEGV when using QemuForkExecutor with the "arm" feature, and Unknown error: Unix error: ECHILD

Opened this issue · 6 comments

The issue is present in the current main branch

$ git log | head -n 1
commit dfd5609c10da85f32e0dec74a72a432acd85310a

Describe the issue
I am doing some fuzzing practice on the httpd of a Tenda VC 15 router, which is a 32-bit ARM binary. I use a QemuForkExecutor, but got an error when loading the initial inputs:

Failed to load initial corpus at ["./seed/"]

I printed the error:

if state.must_load_initial_inputs() {
    state
        .load_initial_inputs(&mut fuzzer, &mut executor, &mut mgr, &initial_dirs)
        .unwrap_or_else(|e| {
            println!("{}", e);
            println!("Failed to load initial corpus at {:?}", &initial_dirs);
            process::exit(0);
        });
    println!("We imported {} inputs from disk.", state.corpus().count());
}

and it says:

Unknown error: Unix error: ECHILD

I debugged the fuzzer and found that it receives a SIGSEGV in trace_edge_hitcount_ptr:

   715 pub unsafe extern "C" fn trace_edge_hitcount_ptr(_: *const (), id: u64) {
   716     unsafe {
   717         let ptr = LIBAFL_QEMU_EDGES_MAP_PTR.add(id as usize);
   718         *ptr = (*ptr).wrapping_add(1);
   719     }
   720 }
   
pwndbg> p ptr
$1 = (*mut u8) 0x4d55bbdb022cd456
pwndbg> p *ptr
Cannot access memory at address 0x4d55bbdb022cd456

It seems that the value of ptr cannot be dereferenced. I know this function is used to record coverage, but I don't know what id or ptr mean, so I read the related instrumentation code in qemu-libafl-bridge.

//$ git log | head -n 1
//commit 805b14ffc44999952562e8f219d81c21a4fa50b9

// in accel/tcg/cpu_exec.c, cpu_exec_loop
//// --- Begin LibAFL code ---

            bool libafl_edge_generated = false;
            TranslationBlock *edge;

            /* See if we can patch the calling TB. */
            if (last_tb) {
                // tb_add_jump(last_tb, tb_exit, tb);

                if (last_tb->jmp_reset_offset[1] != TB_JMP_OFFSET_INVALID) {
                    mmap_lock();
                    edge = libafl_gen_edge(cpu, last_tb->pc, pc, tb_exit, cs_base, flags, cflags);
                    mmap_unlock();

                    if (edge) {
                        tb_add_jump(last_tb, tb_exit, edge);
                        tb_add_jump(edge, 0, tb);
                        libafl_edge_generated = true;
                    } else {
                        tb_add_jump(last_tb, tb_exit, tb);
                    }
                } else {
                    tb_add_jump(last_tb, tb_exit, tb);
                }
            }

            if (libafl_edge_generated) {
                // execute the edge to make sure to log it the first execution
                // the edge will then jump to the translated block
                cpu_loop_exec_tb(cpu, edge, pc, &last_tb, &tb_exit);
            } else {
                cpu_loop_exec_tb(cpu, tb, pc, &last_tb, &tb_exit);
            }

            //// --- End LibAFL code ---

My understanding is: if a new translation block is generated by libafl_gen_edge, it is executed first, and it records the edge in the coverage map by jumping to trace_edge_hitcount_ptr through the hook. (I use StdEdgeCoverageChildModule, and I remember it uses the edge-type hook.)
I also debugged this part of the code. Going by the layout of the TranslationBlock structure, I found the contents of the edge variable:

// edge->tc.ptr
pwndbg> p/x *itb
$7 = {
  pc = 0x40a23030,
  cs_base = 0x480,
  flags = 0x0,
  cflags = 0x800010,
  size = 0x1,
  icount = 0x1,
  tc = {
    ptr = 0x710ee4e00740,
    size = 0x38
  },
  itree = {
    rb = {
      rb_parent_color = 0xfec7058d4840804b,
      rb_right = 0x48fffff959e9ffff,
      rb_left = 0x4de9fffffebd058d
    },
    start = 0x40a23030,
    last = 0xffffffffffffffff,
    subtree_last = 0x0
  },
  jmp_lock = {
    value = 0x0
  },
  jmp_reset_offset = {0x20, 0xffff},
  jmp_insn_offset = {0x1c, 0xffff},
  jmp_target_addr = {0x710ee4e00500, 0x0},
  jmp_list_head = 0x710ee4e002c0,
  jmp_list_next = {0x0, 0x0},
  jmp_dest = {0x710ee4e00440, 0x0}
}

pwndbg> x/16x 0x710ee4e00740
0x710ee4e00740 <code_gen_buffer+1811>:  0x3456be48      0x43f7dbc5      0xbf484d55      0x7f076fa0

Note the value of tc.ptr here: it is <code_gen_buffer+1811>. The first eight bytes of machine code it points to are 0x43f7dbc53456be48 (read as a little-endian u64), which gdb decodes as movabs rsi, 0x4d5543f7dbc53456.
Tracing the code flow further, I found that the fuzzer jumps to a small stub that prepares the arguments (moving them into rdi and rsi) and then calls trace_edge_hitcount_ptr:

   0x5a457a1901df <cpu_exec_loop.isra+783>    mov    r12, qword ptr [r8 + 0x20]
   0x5a457a1901e3 <cpu_exec_loop.isra+787>    test   eax, 0x120
   0x5a457a1901e8 <cpu_exec_loop.isra+792>    jne    cpu_exec_loop.isra+1720     <cpu_exec_loop.isra+1720>
   0x5a457a1901ee <cpu_exec_loop.isra+798>    lea    rax, [rip + 0x3d2c0cb]         RAX => 0x5a457debc2c0 (tcg_qemu_tb_exec) —▸ 0x710ee4e00000 ◂— push rbp /* 0x5641554154415355 */

// R12 is 0x710ee4e00740 (code_gen_buffer+1811) ◂— movabs rsi, 0x4d5543f7dbc53456 /* 0x43f7dbc53456be48 */

   0x710ee4e00000                           push   rbp
   0x710ee4e00001                           push   rbx
   0x710ee4e00002                           push   r12
   0x710ee4e00004                           push   r13
   0x710ee4e00006                           push   r14
   0x710ee4e00008                           push   r15
   0x710ee4e0000a                           mov    rbp, rdi        RBP => 0x5a457f038920 ◂— 0x123fb400000000
   0x710ee4e0000d                           add    rsp, -0x488     RSP => 0x7ffffcc93560 (0x7ffffcc939e8 + -0x488)
   0x710ee4e00014                           jmp    rsi                         <code_gen_buffer+1811>
    ↓   
   0x710ee4e00740 <code_gen_buffer+1811>    movabs rsi, 0x4d5543f7dbc53456     RSI => 0x4d5543f7dbc53456
   0x710ee4e0074a <code_gen_buffer+1821>    movabs rdi, 0x5a457f076fa0         RDI => 0x5a457f076fa0 ◂— 0
►  0x710ee4e00754 <code_gen_buffer+1831>    call   qword ptr [rip + 0x16]      <libafl_qemu::modules::edges::trace_edge_hitcount_ptr>
        rdi: 0x5a457f076fa0 ◂— 0
        rsi: 0x4d5543f7dbc53456

This seems to indicate that the immediate following movabs rsi becomes the id. But the value I have here doesn't look right.

My questions now are as follows:

  1. What does id actually represent?
  2. How is it calculated?
  3. How can I solve this problem?
  4. Do I need to provide any additional information?

Thank you very much!

Hi, I have already found that the id is generated through libafl_qemu_hook_edge_gen -> create_gen_wrapper -> gen_hashed_edge_ids (in StdEdgeCoverageChildModule). Now I am debugging this part of the code...

I traced the calculation of the id and its intermediate values. The calculated id is indeed 0x4d5543f7dbc53456. Do you think there is a problem?

// src is 0x40a23030, dest is 0x40a23058
*RAX  0x4a265dc83567e8c8 hash_me(src)
*RAX  0x7731e3feea2dc9e  hash_me(dest)

 ► 0x5af412b8763e <libafl_qemu::modules::edges::gen_hashed_edge_ids+174>    xor    rax, rcx                        RAX => 0x4d5543f7dbc53456 (0x4a265dc83567e8c8 ^ 0x7731e3feea2dc9e)
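
A quick check of the xor shown above (plain arithmetic, nothing LibAFL-specific):

fn main() {
    // The two hash_me values from the register dump, xored together:
    assert_eq!(
        0x4a26_5dc8_3567_e8c8u64 ^ 0x0773_1e3f_eea2_dc9eu64,
        0x4d55_43f7_dbc5_3456
    );
}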

Considering that LIBAFL_QEMU_EDGES_MAP_PTR is 0x761cc2856000, maybe the SIGSEGV happens because the pointer goes far out of the map's range after the addition?

   715 pub unsafe extern "C" fn trace_edge_hitcount_ptr(_: *const (), id: u64) {
   716     unsafe {
   717         let ptr = LIBAFL_QEMU_EDGES_MAP_PTR.add(id as usize);
 ► 718         *ptr = (*ptr).wrapping_add(1);
   719     }
   720 }
$25 = 0x4d5543f7dbc53456
pwndbg> p/x LIBAFL_QEMU_EDGES_MAP_PTR
$26 = 0x761cc2856000
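
For intuition, here is a minimal, self-contained sketch (not LibAFL's actual code; toy_hash is only a stand-in for its internal hash) of the invariant this debug session suggests is being violated: the xor of two 64-bit block hashes must be reduced into the map's range before it is used as a byte offset. Unmasked, base + id points far beyond any mapped page, hence the SIGSEGV.

// A sketch of hashed-edge-id generation with the mask applied.
fn toy_hash(mut x: u64) -> u64 {
    // splitmix64-style mixer, for illustration only
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

fn edge_offset(src: u64, dest: u64, mask_max: u64) -> usize {
    let raw = toy_hash(src) ^ toy_hash(dest);
    // Without this mask, a raw value like 0x4d5543f7dbc53456 added to the
    // map base lands far outside the mapping -> SIGSEGV on dereference.
    (raw & mask_max) as usize
}

fn main() {
    // src/dest taken from the gdb session above; a 64 KiB map as an example.
    assert!(edge_offset(0x40a2_3030, 0x40a2_3058, 0xffff) <= 0xffff);
}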

I tried changing id as usize to (id as u16).try_into().unwrap(). That part is fine for now (this is only a temporary workaround). But when I continued, the error Unknown error: Unix error: ECHILD still occurred. This seems to be because, in the run_target method of GenericInProcessForkExecutorInner, the parent process does not correctly observe the exit of the child process. I will debug further.
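
For reference, the temporary patch described above looks like this in context (a sketch of the edited hook, with the static map pointer replaced by a parameter so it stands alone):

// Sketch of the workaround: truncating the id to 16 bits keeps the offset
// inside a 0x10000-byte map, at the cost of silently aliasing all edges
// whose ids differ only above bit 15.
unsafe fn patched_trace_edge_hitcount(map: *mut u8, id: u64) {
    let ptr = map.add((id as u16) as usize);
    *ptr = (*ptr).wrapping_add(1);
}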

I found that in the parent method of GenericInProcessForkExecutorInner, waitpid returns this error. Unknown error: Unix error: ECHILD means there is no child process to wait for.

pub(super) fn parent(&mut self, child: Pid) -> Result<ExitKind, Error> {
    let res = waitpid(child, None)?;
    //...
}
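
For context, here is a self-contained demo (an illustration of POSIX behavior, not the fuzzer's code) of one way waitpid can fail with ECHILD even though a child was just forked: if the SIGCHLD disposition is SIG_IGN, terminated children are reaped automatically by the kernel, and a later waitpid finds nothing to wait for.

use nix::sys::wait::waitpid;
use nix::unistd::{fork, ForkResult};
use std::{thread, time::Duration};

fn main() {
    // With SIGCHLD ignored, the kernel auto-reaps exiting children.
    unsafe { libc::signal(libc::SIGCHLD, libc::SIG_IGN) };
    match unsafe { fork() }.expect("fork failed") {
        ForkResult::Child => std::process::exit(0),
        ForkResult::Parent { child } => {
            // Give the child time to exit and be auto-reaped.
            thread::sleep(Duration::from_millis(100));
            println!("{:?}", waitpid(child, None)); // Err(ECHILD)
        }
    }
}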

The waitpid in nix calls libc::waitpid. I searched around, and only this question seems to be somewhat related.
The first fork happens in SimpleRestartingEventManager::launch; the second fork happens in the run_target function of the Executor, called inside load_initial_inputs.

pub fn launch(mut monitor: MT, shmem_provider: &mut SP) -> Result<(Option<S>, Self), Error>
where
    S: DeserializeOwned + Serialize + HasCorpus + HasSolutions,
    MT: Debug,
{
    // ...
    // Client->parent loop
    loop {
        log::info!("Spawning next client (id {ctr})");

        // On Unix, we fork
        #[cfg(all(unix, feature = "fork"))]
        let child_status = {
            shmem_provider.pre_fork()?;
            match unsafe { fork() }? {
                ForkResult::Parent(handle) => {
                    unsafe {
                        libc::signal(libc::SIGINT, libc::SIG_IGN);
                    }
                    shmem_provider.post_fork(false)?;
                    handle.status()
                }
                ForkResult::Child => {
                    shmem_provider.post_fork(true)?;
                    break staterestorer;
                }
            }
        };
        // ...
    }
    // ...
}

fn run_target(/* ... */) -> Result<ExitKind, Error> {
    *state.executions_mut() += 1;
    unsafe {
        self.inner.shmem_provider.pre_fork()?;
        match fork() {
            Ok(ForkResult::Child) => {
                // Child
                self.inner.pre_run_target_child(fuzzer, state, mgr, input)?;
                (self.harness_fn)(input, &mut self.exposed_executor_state);
                self.inner.post_run_target_child(fuzzer, state, mgr, input);
                Ok(ExitKind::Ok)
            }
            Ok(ForkResult::Parent { child }) => {
                // Parent
                self.inner.parent(child)
            }
            Err(e) => Err(Error::from(e)),
        }
    }
}
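
As a sanity check on the structure itself, here is a minimal nested-fork demo (plain POSIX semantics, not LibAFL code): two levels of fork are fine on their own, because each parent waits only for its direct child. ECHILD therefore suggests the client's direct child was already gone, or reaped by someone else, when waitpid ran.

use nix::sys::wait::waitpid;
use nix::unistd::{fork, ForkResult};

fn main() {
    // Outer fork: plays the role of SimpleRestartingEventManager::launch.
    match unsafe { fork() }.expect("outer fork failed") {
        ForkResult::Parent { child } => {
            // The restarting manager waits for its direct child: fine.
            println!("outer: {:?}", waitpid(child, None)); // Ok(Exited(..))
        }
        ForkResult::Child => {
            // Inner fork: plays the role of run_target in the fork executor.
            match unsafe { fork() }.expect("inner fork failed") {
                ForkResult::Parent { child } => {
                    // The client waits for the per-run child: also fine.
                    println!("inner: {:?}", waitpid(child, None)); // Ok(Exited(..))
                    std::process::exit(0);
                }
                ForkResult::Child => std::process::exit(0),
            }
        }
    }
}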

I am still somewhat confused as to why this happened...

Thank you for the detailed report.

> I traced the calculation of the id and its intermediate values. The calculated id is indeed 0x4d5543f7dbc53456. Do you think there is a problem? [...] Considering that LIBAFL_QEMU_EDGES_MAP_PTR is 0x761cc2856000, maybe the SIGSEGV happens because the pointer goes far out of the map's range after the addition?

About the first bug: could you print the value of LIBAFL_QEMU_EDGES_MAP_MASK_MAX once in the gen_hashed_edge_ids function?

> I found that in the parent method of GenericInProcessForkExecutorInner, waitpid returns this error. Unknown error: Unix error: ECHILD means there is no child process to wait for. [...] The first fork happens in SimpleRestartingEventManager::launch; the second fork happens in the run_target function of the Executor, called inside load_initial_inputs. [...] I am still somewhat confused as to why this happened...

Maybe it's about the order in which the processes die?

@rmalmain Thank you for your reply.
For the first question: my LIBAFL_QEMU_EDGES_MAP_MASK_MAX is 0xffff.
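
A quick arithmetic check (assuming the mask is meant to bound the id): if gen_hashed_edge_ids applied it, the id observed above would land in range.

fn main() {
    // The id from gdb, reduced by the reported mask:
    assert_eq!(0x4d55_43f7_dbc5_3456u64 & 0xffff, 0x3456); // a valid offset
}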

About the second one: I wrote this fuzzer for an httpd program, and there are many places in the program that use fork to process commands. Maybe I should find a more appropriate way to write the fuzzer.
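
One hedged guess that could connect the forking target to the ECHILD (an assumption, not a confirmed diagnosis): QEMU usermode forwards guest signal dispositions to the host, so if the guest httpd sets SIGCHLD to SIG_IGN before the fork point, the fuzzer parent could inherit a disposition under which waitpid fails with ECHILD, as in the demo above. A cheap way to test this:

// Hypothetical diagnostic helper (not a LibAFL API): restore the default
// SIGCHLD disposition in the parent right before the executor forks, and
// see whether the ECHILD disappears.
fn reset_sigchld_for_test() {
    unsafe {
        libc::signal(libc::SIGCHLD, libc::SIG_DFL);
    }
}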