nasa-jpl/jsd

SDO Thread refactor

alex-brinkman opened this issue · 1 comments

    I spent some time looking and thinking about this function and, although this works, I think we are overburdening the SDO thread with error handling. I think it would be cleaner to have a separate thread for the latter. It seems to me that handling SDO read/write requests is relatively independent of handling errors such as the EMCYs from the Elmos.

A couple of points that make me think this way:

  • Ideally, sdo_thread_cond should be an indication that the predicate (queue_is_empty(&self->jsd_sdo_req_cirq), i.e. state of queue) might have changed. This is the case when we signal from jsd_sdo_push_async_request, but it is not the case when we signal from the EGD driver in the fault state (indirectly through jsd_read).
  • When the signal comes from jsd_sdo_push_request we go over the error handling code before we can get to the point of servicing the SDO request queue, even if there are no errors present.

I am imagining something like this:

void* sdo_thread_loop(void* void_data)
{
  while (true)
  {
    pthread_mutex_lock(&self->jsd_sdo_req_cirq.mutex);

    while (queue_is_empty(&self->jsd_sdo_req_cirq))
      pthread_cond_wait(&self->sdo_thread_cond, &self->jsd_sdo_req_cirq.mutex);

    jsd_sdo_req_t req = queue_pop(&self->jsd_sdo_req_cirq);

    pthread_mutex_unlock(&self->jsd_sdo_req_cirq.mutex);

    if (self->sdo_join_flag) {
      return NULL;
    }

    // Submit the SDO request with ecx_SDOwrite/ecx_SDOread

    // Queue the response.
    jsd_sdo_req_cirq_push(&self->jsd_sdo_res_cirq, req);
  }
}

We can prescind from pthread_cond_timedwait above because even if we miss a signal while servicing the one received, the queue will not be empty after that second signal, so when we get to the while loop on the predicate, we will skip the wait.

Similarly for error handling:

void* error_thread_loop(void* void_data)
{
  while (true)
  {
    pthread_mutex_lock(&self->error_handling_mutex);

    while (!self->errors_available)
      // Set ts based on how frequent we want to finish servicing the errors beyond MAX_ERRORS_PER_LOOP
      pthread_cond_timedwait(&self->error_handling_cond, &self->error_handling_mutex, &ts);

    pthread_mutex_unlock(&self->error_handling_mutex);

    if (self->error_handling_join_flag) {
      return NULL;
    }

    // Parse through errors with ecx_mbxreceive, ecx_iserror, and ecx_poperror, and queue them with jsd_error_cirq_push(&self->slave_errors[err.Slave], err)

    if (handled_errors < MAX_ERRORS_PER_LOOP) {
      // We are sure we handled all errors present.
      pthread_mutex_lock(&self->error_handling_mutex);
      self->errors_available =  false;
      pthread_mutex_unlock(&self->error_handling_mutex);
    }
  }
}

Above, if ecx_mbxreceive is real-time safe, we could have a single pthread_mutex_unlock(&self->error_handling_mutex) at the end of the outer while loop (and one before the return NULL).

And in egd.c:
...

if(state->pub.actual_state_machine_state == JSD_EGD_STATE_MACHINE_STATE_FAULT){
  pthread_mutex_lock(&self->error_handling_mutex);
  self->errors_available =  true;
  pthread_cond_signal(&self->error_handling_cond);
  pthread_mutex_unlock(&self->error_handling_mutex);

...

In my opinion having the functions separate is simpler to read/understand and probably maintain.

Originally posted by @d-loret in #46 (comment)

I agree the current implementation is messy. I also thought of another third pthread for error handling but felt the risk of locking was too high so kept the ball-of-mud since it performs well.

I'd accept contribution here so long as it tests out well on hardware.