mpi-forum/mpi-issues

Concurrent start and completion of persistent requests

jprotze opened this issue · 6 comments

Problem

It is valid to pass an inactive persistent request into any completion function. In multi-threaded execution, the same request could concurrently be passed to a start function. This leads to issues similar to concurrently waiting on the same request (which is already defined to be erroneous).

This can lead to a situation where the request is initially inactive when the completion function is entered, but might be active at the time the completion function returns.

Proposal

Extend the restriction for concurrent use of a request to include starting functions. (Paragraph "Multiple threads completing the same request")

Changes to the Text

Impact on Implementations

Allows a reasonable implementation of correct waiting calls.

Impact on Users

Sets clear expectations and ensures semantically reasonable behavior.

References and Pull Requests

This issue is related to #846.

The following test demonstrates the problematic semantics that are allowed without the restriction. Sending the message only after the start ensures that the persistent recv is activated before the waitall can return. The test after the waitall checks whether the persistent request is active after the waitall returns.

#include <assert.h>
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int provided=0, flag=0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    MPI_Abort(MPI_COMM_SELF, 1);
  int bufa=0, bufb=0, bufc = 4;
  MPI_Request req[2];
  MPI_Recv_init(&bufa, 1, MPI_INT, 0, 42, MPI_COMM_SELF, req); // persistent recv, starts out inactive
  MPI_Irecv(&bufb, 1, MPI_INT, 0, 23, MPI_COMM_SELF, req + 1); // ordinary nonblocking recv
  MPI_Wait(req, MPI_STATUS_IGNORE);        // waiting on an inactive persistent request returns immediately
  MPI_Test(req, &flag, MPI_STATUS_IGNORE); // flag must be true: the request is inactive
  if (flag) {
    printf("flag is true\n");
  } else {
    printf("flag is false\n");
  }
  assert(flag);
#pragma omp parallel sections
  {
#pragma omp section
    {
      printf("Thread %i of %i: start Section 1\n", omp_get_thread_num(),
             omp_get_num_threads());
      printf("Thread %i of %i: Waitall\n", omp_get_thread_num(),
             omp_get_num_threads());
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE); // req[0] is inactive on entry, req[1] is active
      printf("Thread %i of %i: Done waiting\n", omp_get_thread_num(),
             omp_get_num_threads());
      printf("Thread %i of %i: Test persistent req\n", omp_get_thread_num(),
             omp_get_num_threads());
      MPI_Test(req, &flag, MPI_STATUS_IGNORE); // is the persistent request active again after the Waitall?
      if (!flag) {
        printf("flag is false\n");
      } else {
        printf("flag is true\n");
      }
      assert(flag);
      printf("Done? %i\n", bufa);
      printf("Thread %i of %i: end Section 1\n", omp_get_thread_num(),
             omp_get_num_threads());
    }
#pragma omp section
    {
      printf("Thread %i of %i: start Section 2\n", omp_get_thread_num(),
             omp_get_num_threads());
      usleep(100000); // give Section 1 time to enter MPI_Waitall first
      MPI_Start(req); // activate the persistent recv while the other thread is in MPI_Waitall
      usleep(1000);
      printf("Thread %i of %i: Sending bufc\n", omp_get_thread_num(),
             omp_get_num_threads());
      MPI_Send(&bufc, 1, MPI_INT, 0, 23, MPI_COMM_SELF); // matches req[1], allows MPI_Waitall to return
      printf("Thread %i of %i: Sent bufc\n", omp_get_thread_num(),
             omp_get_num_threads());
    }
  }
  MPI_Send(&bufc, 1, MPI_INT, 0, 42, MPI_COMM_SELF); // matching send for the persistent recv (tag 42)
  MPI_Wait(req, MPI_STATUS_IGNORE);
  MPI_Request_free(req);
  printf("Done: %i\n", bufa);
  MPI_Finalize();
}
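
For reference, one way to build and run the reproducer (assuming a GCC-backed mpicc with OpenMP support; the file name is arbitrary):

mpicc -fopenmp concurrent_start.c -o concurrent_start
OMP_NUM_THREADS=8 mpirun -np 1 ./concurrent_start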

Surprisingly, MPICH and OpenMPI on Ubuntu 24.04 produce the same result:

flag is true
Thread 2 of 8: start Section 2
Thread 6 of 8: start Section 1
Thread 6 of 8: Waitall
Thread 2 of 8: Sending bufc
Thread 2 of 8: Sent bufc
Thread 6 of 8: Done waiting
Thread 6 of 8: Test persistent req
flag is true
Done? 0
Thread 6 of 8: end Section 1
Done: 4

The persistent request appears inactive in the test (flag=true), but the receive buffer of the persistent request is still 0. Only after the matching message is finally sent is the receive buffer actually updated. So the MPI implementation writes to the buffer at a point where, according to the MPI API semantics, the application does not expect it.

If there is any serious concern about allowing concurrent waiting on the same request (which the existing restriction forbids), the same concern should be raised for activating a request while another thread is waiting on that request.

Did some digging in OMPI. The culprit is here. We assume that a request stays inactive throughout the call to MPI_Waitall if we find it inactive at the beginning. We don't recognize a state change that we did not cause, and once we believe all requests have completed we simply set all active requests to inactive. We would have to remember that the request was inactive when we first saw it in this call and not touch it again afterwards.
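
A minimal, self-contained sketch of that idea (this is not Open MPI code; all names below are made up for illustration): take a snapshot of which requests are inactive when the waitall logic is entered, and never touch those entries again, even if another thread starts them concurrently.

#include <stdbool.h>
#include <stddef.h>

/* Toy request object; a real implementation tracks much more state. */
typedef struct {
  bool active;   /* started and not yet completed */
  bool complete; /* underlying operation finished */
} toy_request;

static void toy_progress(void) { /* stand-in for the progress engine */ }

static void waitall_sketch(size_t count, toy_request *reqs[]) {
  bool ignore[count]; /* snapshot taken on entry */
  size_t pending = 0;
  for (size_t i = 0; i < count; ++i) {
    ignore[i] = !reqs[i]->active; /* inactive on entry: never touch again */
    if (!ignore[i])
      ++pending;
  }
  while (pending > 0) {
    toy_progress();
    for (size_t i = 0; i < count; ++i) {
      if (!ignore[i] && reqs[i]->complete) {
        reqs[i]->active = false; /* only reset requests found active on entry */
        ignore[i] = true;
        --pending;
      }
    }
  }
}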

Maybe that answers the question about who would benefit from adding a constraint that prohibits this behavior: implementations wouldn't have to code around a narrow edge case. @jprotze already answered the question of who would stand to lose: no one, because both major implementations don't handle that edge case correctly, so it's unlikely that there is code out there doing such shenanigans.

I don't understand why you expect the MPI_Waitall to wait on your recv_init request to complete. With your timing constraint both the MPI_Waitall and the MPI_Test are testing an inactive request (one that has never been started).

But the standard could be clearer by replacing "Multiple threads completing the same request ..." with "Multiple threads manipulating the same request ...".

> I don't understand why you expect the MPI_Waitall to wait on your recv_init request to complete. With your timing constraint both the MPI_Waitall and the MPI_Test are testing an inactive request (one that has never been started).

You missed: MPI_Start -- happens before --> MPI_Send -- happens before --> req[1] completes --> MPI_Waitall can return

The call to MPI_Start definitely happens before the Test in the middle of Section 1. The second thread activates the persistent request before sending the message that matches the second request. The test after the Waitall is there to check the state of the persistent request. Given the true flag, the request must be inactive, although the Start must have happened before the test. The message matching the persistent request was not sent at that point, so the MPI implementation has the request in the wrong state when testing.

Although the request tests as inactive, the receive nevertheless stores the data in the expected place once the matching send is issued.

I did not, but I understand that the order in which MPI_Waitall handles the requests in the array is implementation-defined, and, as an implementer, monotonically increasing seems like a sensible choice, which means that the testing done for req[0] will most likely precede the delayed MPI_Start.

Why would you expect MPI_Waitall to behave atomically with regard to the content of the array of requests at a random moment in time (a moment that is not the moment of the call to MPI_Waitall)?

Also, your code allows concurrent read/write access to the array of requests (because MPI_Start can change the request pointed to by its argument), leading to segfaults on any platform where an 8-byte write is not atomic.

> Also, your code allows concurrent read/write access to the array of requests (because MPI_Start can change the request pointed to by its argument), leading to segfaults on any platform where an 8-byte write is not atomic.

As an API consumer, I don't see a problem. The library cannot change the address. Anything that the library does with the opaque objects must comply with the semantics described for the API. If I request the implementation to be thread-safe, it should comply with that requirement. The implementation needs to do whatever is necessary to make access to the opaque objects thread-safe.