codership/galera

A deadlock bug due to the cyclic acqusition order?

ycaibb opened this issue · 0 comments

Hi, developers, thank you for your checking. It seems there is a (potential) deadlock bug in the below code. The function gu_to_grab is not thread-safe, because there is a cyclic acquisition order between the to->lock and w->mtx. However, it seems gu_to_grab could be concurrently executed by different theads?

long gu_to_grab (gu_to_t* to, gu_seqno_t seqno)
{
    ...;

    if ((err = gu_mutex_lock(&to->lock))) {
        gu_fatal("Mutex lock failed (%d): %s", err, strerror(err));
        abort();
    }
      
    ...;

    switch (w->state) {
    ...;
    case RELEASED:
        else { /* seqno > to->seqno, wait for my turn */
            ...;
            pthread_mutex_lock (&w->mtx); // to->mtx ---->w->mtx
            pthread_mutex_unlock (&to->lock);
            pthread_mutex_lock (&w->mtx); // wait for unlock by other thread
            pthread_mutex_lock (&to->lock);  // w->mtx ---->to->mtx
            pthread_mutex_unlock (&w->mtx);
            ...; 
        }
        break;
    }
    
    gu_mutex_unlock(&to->lock);
    return err;
}

Lock locations:

pthread_mutex_lock (&w->mtx);
pthread_mutex_unlock (&to->lock);
pthread_mutex_lock (&w->mtx); // wait for unlock by other thread
pthread_mutex_lock (&to->lock);
pthread_mutex_unlock (&w->mtx);

Thread locations:

galera/gcs/src/gcs_test.cpp

Lines 555 to 576 in d992932

switch (pool->type)
{
case GCS_TEST_REPL:
thread_routine = gcs_test_repl;
break;
case GCS_TEST_SEND:
thread_routine = gcs_test_send;
break;
case GCS_TEST_RECV:
thread_routine = gcs_test_recv;
break;
default:
fprintf (stderr, "Bad repl type %u\n", pool->type);
return -1;
}
for (i = 0; i < pool->n_threads; i++)
{
if ((err = pthread_create (&pool->threads[i].thread, NULL,
thread_routine, &pool->threads[i])))
break;
}

Best,