Handle conversions should be symbols
jeffhammond opened this issue · 14 comments
Problem
The MPI standard states: "An implementation is allowed to implement ... and the handle-conversion functions (MPI_Group_f2c, etc.) in Section 19.3.4, and no others, as macros in C."
If these are macros rather than symbols, they cannot be invoked directly from third-party languages; each language binding instead requires a compiled C shim.
This is not a hypothetical issue. There is an mpi4py bug report from a user who wants to invoke Fortran MPI code directly from Python (link below). Even if mpi4py solves this, it should not be necessary every time someone writes a new language wrapper.
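To make the failure mode concrete, here is a minimal sketch of why a macro-only conversion is invisible to a foreign-function interface. It assumes an MPICH-like header in which handles are plain integers; the typedefs and the macro body are illustrative, not copied from any real mpi.h.

# sketch: a macro-only conversion is invisible to dlsym
#include <dlfcn.h>

typedef int MPI_Fint;                        /* illustrative */
typedef int MPI_Datatype;                    /* illustrative */
#define MPI_Type_f2c(f) ((MPI_Datatype)(f))  /* macro only: no function symbol is emitted */

/* What a Python/Julia/Rust wrapper effectively does at load time. */
void * lookup_f2c(void * libmpi)
{
    return dlsym(libmpi, "MPI_Type_f2c");    /* returns NULL when only the macro exists */
}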
Proposal
Remove the phrase "and the handle-conversion functions (MPI_Group_f2c, etc.) in Section 19.3.4" from the aforementioned text.
Changes to the Text
Obvious.
Impact on Implementations
MPICH implements these as macros and would need to change.
Impact on Users
Pro
Third-party languages know that the symbol exists and can call it directly, without writing their own C shim.
Con
Users of MPICH will find that the overhead of handle conversion is no longer zero cycles. It is unclear whether this is noticeable in any real application, as function call overhead is generally accepted by the C community as a necessary cost in complex software. Furthermore, link-time optimization may be able to eliminate most or all of the overhead in some cases.
References and Pull Requests
I estimated the performance impact on MPICH-based implementations to be less than half a nanosecond per call (in a throughput test).
I can imagine a scenario where saving and restoring the stack is very expensive and the function call costs more, but that same function-call overhead appears as soon as the program that uses handle conversions passes those handles to some MPI library call. The MPI call that consumes the handle is certain to be significantly more expensive than the function-call overhead of the conversion.
Modern computing platforms optimize for function call overhead to be as small as possible. The last platform where function call overhead was detectable was Blue Gene/P, and the HPC market has made a clear decision in favor of CPUs with more general purpose performance.
jhammond@nuclear:~/abi-tests$ /opt/intel/oneapi/mpi/2021.7.1/bin/mpirun -n 1 ./c2f2c.intel.macro 10000000
n=10000000
c[i] = MPI_DOUBLE : 0.002706
MPI_Type_contiguous : 0.632046
MPI_Type_commit : 0.509431
MPI_Type_c2f : 0.010041
MPI_Type_f2c : 0.002142
MPI_Type_free : 0.354253
jhammond@nuclear:~/abi-tests$ /opt/intel/oneapi/mpi/2021.7.1/bin/mpirun -n 1 ./c2f2c.intel.nomacro 10000000
n=10000000
c[i] = MPI_DOUBLE : 0.002674
MPI_Type_contiguous : 0.634852
MPI_Type_commit : 0.513848
MPI_Type_c2f : 0.013281
MPI_Type_f2c : 0.007788
MPI_Type_free : 0.354782
# c2f2c.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
double t0, t1;
/* With NO_MACRO defined, route conversions through external functions
   (symbols); otherwise use the MPI implementation's macros directly. */
#if NO_MACRO
MPI_Datatype ABI_Type_f2c(MPI_Fint);
MPI_Fint ABI_Type_c2f(MPI_Datatype);
#else
#define ABI_Type_f2c MPI_Type_f2c
#define ABI_Type_c2f MPI_Type_c2f
#endif
int main(int argc, char* argv[])
{
    MPI_Init(&argc,&argv);

    const int n = (argc > 1) ? atoi(argv[1]) : 1000*1000;
    printf("n=%d\n", n);

    MPI_Datatype * c = malloc( n * sizeof( MPI_Datatype ) );
    MPI_Fint * f = malloc( n * sizeof( MPI_Fint ) );
    for (int i=0; i<n; i++) c[i] = MPI_BYTE;

    /* baseline: the cost of the loop and assignment alone */
    t0 = MPI_Wtime();
    for (int i=0; i<n; i++) c[i] = MPI_DOUBLE;
    t1 = MPI_Wtime();
    printf("c[i] = MPI_DOUBLE : %lf\n", t1-t0);

    t0 = MPI_Wtime();
    for (int i=0; i<n; i++) {
        MPI_Type_contiguous(1, MPI_DOUBLE, &c[i]);
    }
    t1 = MPI_Wtime();
    printf("MPI_Type_contiguous : %lf\n", t1-t0);

    t0 = MPI_Wtime();
    for (int i=0; i<n; i++) {
        MPI_Type_commit(&c[i]);
    }
    t1 = MPI_Wtime();
    printf("MPI_Type_commit : %lf\n", t1-t0);

    /* the conversions under test: macro vs. symbol */
    t0 = MPI_Wtime();
    for (int i=0; i<n; i++) {
        f[i] = ABI_Type_c2f(c[i]);
    }
    t1 = MPI_Wtime();
    printf("MPI_Type_c2f : %lf\n", t1-t0);

    t0 = MPI_Wtime();
    for (int i=0; i<n; i++) {
        c[i] = ABI_Type_f2c(f[i]);
    }
    t1 = MPI_Wtime();
    printf("MPI_Type_f2c : %lf\n", t1-t0);

    t0 = MPI_Wtime();
    for (int i=0; i<n; i++) {
        MPI_Type_free(&c[i]);
    }
    t1 = MPI_Wtime();
    printf("MPI_Type_free : %lf\n", t1-t0);

    free(c);
    free(f);
    MPI_Finalize();
    return 0;
}
# libc2f2c.c
#include <mpi.h>
MPI_Datatype ABI_Type_f2c(MPI_Fint f) { return MPI_Type_f2c(f); }
MPI_Fint ABI_Type_c2f(MPI_Datatype c) { return MPI_Type_c2f(c); }
# build and run
/opt/intel/oneapi/mpi/2021.7.1/bin/mpicc -Os -c libc2f2c.c
/opt/intel/oneapi/mpi/2021.7.1/bin/mpicc -Ofast -DNO_MACRO c2f2c.c libc2f2c.o -o c2f2c.intel.nomacro
/opt/intel/oneapi/mpi/2021.7.1/bin/mpicc -Ofast c2f2c.c -o c2f2c.intel.macro
/opt/intel/oneapi/mpi/2021.7.1/bin/mpirun -n 1 ./c2f2c.intel.macro 10000000
/opt/intel/oneapi/mpi/2021.7.1/bin/mpirun -n 1 ./c2f2c.intel.nomacro 10000000
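As an aside on the link-time-optimization point from the Con section above, the same sources can be compiled together with LTO enabled, which should let the compiler inline the shim and recover the macro's performance. The -flto flag below is the GCC/Clang/ICC spelling, and the output name is made up for illustration:

/opt/intel/oneapi/mpi/2021.7.1/bin/mpicc -Ofast -flto -DNO_MACRO c2f2c.c libc2f2c.c -o c2f2c.intel.lto
/opt/intel/oneapi/mpi/2021.7.1/bin/mpirun -n 1 ./c2f2c.intel.lto 10000000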
For the sketched use case, it would be sufficient to require a linkable implementation of these functions, while the MPI implementation could still define macros in mpi.h that preserve the current performance. A sketch of that compromise follows below.
Alternatively, the functions could be defined as static inline functions in mpi.h, which should allow the compiler to inline them and generate the same code as the macros. The latter approach might improve maintainability, since it avoids duplicating the code.
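A minimal sketch of the header-macro-plus-library-symbol compromise, using the same shadowing trick the C standard library uses for functions like getc. The cast-only macro and function bodies assume integer handles and are illustrative:

# mpi.h (sketch)
MPI_Datatype MPI_Type_f2c(MPI_Fint f);       /* the symbol the library must export */
#define MPI_Type_f2c(f) ((MPI_Datatype)(f))  /* zero-overhead path for C callers */

# libmpi source (sketch)
#undef MPI_Type_f2c                          /* expose the real function here */
MPI_Datatype MPI_Type_f2c(MPI_Fint f)
{
    return (MPI_Datatype)(f);                /* assumes integer handles */
}

C code that spells the call as MPI_Type_f2c(f) gets the macro; dlsym, or a call through a function pointer, gets the symbol.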
I am not opposed to that, but is half a nanosecond worth the effort to design that solution? It sounds reasonable to say there is a macro in the header and a symbol in the library, but I have not considered every aspect of this.
I agree with Jeff here - and I appreciate the understanding of the history. As Jeff notes, function calls weren't always so relatively cheap, and optimization for these cases was a consideration. Times have changed, and we should take advantage of that to simplify MPI just a little.
I don't have a problem with this, but I want to note that this would be a backward incompatibility, which requires calling out in the appropriate chapter (Chapter 18 in MPI 4.0/4.1).
I believe this would be the first such incompatibility that actively requires changes from some or all mainstream MPI implementations.
Combined with the other ticket to make timing routines symbols as well, this would then only leave AINT_ADD and AINT_DIFF as macros. At this point, is it worth having this exception?
I agree - given the motivation for making everything else a symbol instead of a macro, and that these are unlikely to be on a critical performance path, let's eliminate the exception.
@jdinan added the AINT macros because Fortran can't do unsigned arithmetic, and when we did it, we all knew nobody would ever use those functions anyways, so no one will notice if they get slower.
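For reference, the MPI standard describes those operations in terms of char* arithmetic, which signed Fortran integer arithmetic cannot express safely. A sketch of the macro forms, where the MPI_Aint typedef is illustrative:

# sketch of the address-arithmetic macros
typedef long MPI_Aint;  /* illustrative; real headers pick an address-sized integer */

#define MPI_Aint_add(base, disp) ((MPI_Aint)((char *)(base) + (disp)))
#define MPI_Aint_diff(a1, a2)    ((MPI_Aint)((char *)(a1) - (char *)(a2)))

Turning these into functions with the same bodies is mechanical; the only cost is the call overhead discussed above.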
Would folks prefer I do another ticket for those, or combine everything into one ticket and kill all the macros at once?
Given that there's already a reading scheduled for this issue and the other one, I think you should keep what's here and add another ticket for March to finish the series. I'd hate to miss 4.1 for the stuff that's already in progress because the last thing is holding it up.
I agree with Wesley - we should hold a straw vote, though, on the entire removal.
Had a reading on 2022-12-08, but may need to be re-read to add text in Chapter 18.