madgraph5/madgraph4gpu

DIVBYZERO and INVALID FPEs in CMS tests (pp_dy012j.mad in P2_uc_epemuc/G2)

valassi opened this issue · 22 comments

This is an issue reported by Jin @choij1589 (thanks!) during yesterday's meeting with CMS https://indico.cern.ch/event/1373473/ (the issue is not in the slides; it was reported in the later discussion)

Details are here https://github.com/choij1589/madgraph4gpu/tree/dev_cms_integration

See in particular this commit master...choij1589:madgraph4gpu:dev_cms_integration


Description and analysis

IIUC there were some DIVBYZERO (and INVALID?) FPEs during the CMS madevent tests - note in particular that this was during CUDA runs, so they cannot come from vectorized C++ code.

My initial guess is that this comes from Fortran code, probably from auto-vectorized Fortran code. We saw something similar in #855 for rotxxx, worked around in #857 with a volatile.

It would be annoying if these issues keep popping up, because decorating the full Fortran code with volatile declarations does not look like a scalable solution.
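For reference, this is roughly the pattern involved (an illustrative standalone sketch, not the actual #857 patch): a division guarded by an IF test, which aggressive optimization may evaluate speculatively, plus a VOLATILE declaration on the guard variable to inhibit that transformation.

      program volatile_sketch
c     Illustration only, not the #857 patch: the VOLATILE attribute on
c     the guard variable discourages speculative evaluation of the
c     guarded division under aggressive optimization (e.g. -O3).
      implicit none
      double precision q1, q2, p1, qt2, qt, ratio
      volatile :: qt2
      q1 = 0d0
      q2 = 0d0
      p1 = 1d0
      qt2 = q1**2 + q2**2
      if (qt2.ne.0d0) then
        qt = dsqrt(qt2)
        ratio = p1/qt
      else
        ratio = p1
      endif
      write(6,*) 'ratio =', ratio
      end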

@oliviermattelaer : is there an easy way in which @choij1589 can remove -O3 from fortran and replace it with -O1 when running madgraph in his CMS environment? (just as a test to see if this makes it disappear)

@oliviermattelaer : is there an easy way in which @choij1589 can remove -O3 from fortran and replace it with -O1 when running madgraph in his CMS environment? (just as a test to see if this makes it disappear)

@choij1589 I imagine that you probably have this in the runcard

-O3 -ffast-math -fbounds-check = global_flag ! build flags for all Fortran code (for a fair comparison to cudacpp; default is -O)

just try removing the -O3 and see if there is a difference with and without it

Also let me know if it goes much slower without -O3... (it should not, if you only do this in the cuda or cpp runs, but you never know)
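Concretely, reusing the flags quoted above, the global_flag line in the run_card would become one of:

  -ffast-math -fbounds-check = global_flag ! as above but without -O3
  -O1 -ffast-math -fbounds-check = global_flag ! or with -O1 instead of -O3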

In my run_card.dat, there is no global_flag assigned, so it's using the default one

#*********************************************************************
# Compilation flag.
#*********************************************************************
  -O    = global_flag ! fortran optimization flag use for the all code.
  --fast-math   = aloha_flag ! fortran optimization flag for aloha function. Suggestions: '-ffast-math'
  -O3   = matrix_flag ! fortran optimization flag for matrix.f function. Suggestions: '-O3'
  16384 = vector_size ! size of fortran arrays allocated in the multi-event API for SIMD/GPU (VECSIZE_MEMMAX)

I have tried both -O3 -ffast-math -fbounds-check = global_flag and -O -ffast-math -fbounds-check = global_flag flags, but the FPE_* warning remains.

Thanks Jin! Can you check if you have a GLOBAL_FLAG in Source/make_opts anyway please?

Yes, I can find it in the process/madevent/Source/make_opts file:
(screenshot of Source/make_opts showing the GLOBAL_FLAG line)

Thanks Jin I am having a look.

One point Jin (also for Sapta): are you aware of ccache support in cudacpp? I saw that DY012j has now been running builds for more than 1h, and it might take 3h, so I do not even want to imagine DY+4jets... I imagine that normally you will run these builds only once, or at least only once per software version, but I can very much imagine that sometimes you will rerun the same test with the same build. In that case, using ccache helps enormously. You should have ccache installed, export USECCACHE=1, and have CCACHE_DIR point to a directory that contains your build caches. I never use AFS or EOS because (with or without ccache) builds are horrible there, so my CCACHE_DIR is on a local disk on my machine. I never managed to configure this with a network CCACHE_DIR, but in principle that is also possible. Maybe this is one thing to investigate for your gen productions.
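As a sketch (the cache path here is just a placeholder), the environment setup would look something like:

  export USECCACHE=1
  export CCACHE_DIR=/path/to/a/local/disk/ccache   # a local disk, not AFS or EOS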

I have tried both -O3 -ffast-math -fbounds-check = global_flag and -O -ffast-math -fbounds-check = global_flag flags, but the FPE_* warning remains.

Thanks Jin again. A couple more questions

  • You mention 'FPE warning'. Is this just a warning 'FPE_xx have been reported' or is this a crash?

  • Do you get this FPE warning only in cuda runs or also in fortran runs? (You said cpp does not build right? What was the problem?)

  • Would you be able to tell me in which subdirectory this happens by any chance? I mean which P0, P1 or P2 subprocess, and possibly which G* subdirectory, and what is the input_app.txt file in that G?

Thanks
Andrea

Thanks @valassi. Regarding the FPE warning, there are no such warnings in fortran; it only happens in CUDA. If I comment out the fpeEnable() function, it gives the 'FPE_xx' warnings, which are triggered from MatrixElementKernel.cc. Otherwise, it crashes at runtime.

Regarding the subprocess, this happens in most of the subdirectories from DY+0j to DY+4j. I quickly checked with the u u~ > e+ e- process and it also generates the same warning. Here is the input_app.txt for P1_uux_epem:

         2000         8           3      !Number of events and max and min iterations
  0.01    !Accuracy
  2       !Grid Adjustment 0=none, 2=adjust
  1       !Suppress Amplitude 1=yes
  0        !Helicity Sum/event 0=exact

Thanks Jin I am having a look.

One point Jin (also for Sapta): are you aware of ccache support in cudacpp? I saw that DY012j has now been running builds for more than 1h, and it might take 3h, so I do not even want to imagine DY+4jets... I imagine that normally you will run these builds only once, or at least only once per software version, but I can very much imagine that sometimes you will rerun the same test with the same build. In that case, using ccache helps enormously. You should have ccache installed, export USECCACHE=1, and have CCACHE_DIR point to a directory that contains your build caches. I never use AFS or EOS because (with or without ccache) builds are horrible there, so my CCACHE_DIR is on a local disk on my machine. I never managed to configure this with a network CCACHE_DIR, but in principle that is also possible. Maybe this is one thing to investigate for your gen productions.

Regarding using cached directories: I am currently using a condor environment for testing productions, so I would have to build the processes somewhere (probably the EOS area) and xrdcp them to the condor job's own disk... but a first build on EOS from a local node (lxplus8-gpu) would be painful due to interference from other users :(. Still, I believe it is a good option to check. I will test it. Thanks!

Regarding using cached directories: I am currently using a condor environment for testing productions, so I would have to build the processes somewhere (probably the EOS area) and xrdcp them to the condor job's own disk... but a first build on EOS from a local node (lxplus8-gpu) would be painful due to interference from other users :(. Still, I believe it is a good option to check. I will test it. Thanks!

Thanks Jin, OK, I understand this is not very usable now. But let's keep this in mind. I opened #954.

Thanks @valassi. Regarding the FPE warning, there are no such warnings in fortran; it only happens in CUDA. If I comment out the fpeEnable() function, it gives the 'FPE_xx' warnings, which are triggered from MatrixElementKernel.cc. Otherwise, it crashes at runtime.

Thanks Jin! Ok so I understand these are warnings and not crashes, only because you modified the code and disabled the FPEs...

On my side I am redoing some tests of DY+2jets. I did see something that looks very much like your FPE issues

Error when reading /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc/G2/results.dat
Command "generate_events -f" interrupted with error:
Exception : Reported error: End code 136.0 
         Full associated log: 
        (ompnumthreadsNotSetMeansOneThread) DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
        (ompnumthreadsNotSetMeansOneThread) omp_get_max_threads() = 1
        __GpuRuntime: calling GpuSetDevice(0)
        WARNING! Instantiate device Bridge (nevt=16384, gpublocks=64, gputhreads=256, gpublocks*gputhreads=16384)
        INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
         FBRIDGE_MODE (default) =            1
         VECSIZE_USED (default) =        16384
         Process in group number            2
         A PDF is used, so alpha_s(MZ) is going to be modified
         Old value of alpha_s from param_card:   0.11799999999999999     
          ****************************************
         
               NNPDFDriver version 1.0.3
           Grid: NNPDF23_lo_as_0130_qed_mem0.grid
          ****************************************
         New value of alpha_s from PDF nn23lo1:  0.13000000000000000     
         Warning! ptj set to xqcut=   30.000000000000000       to improve integration efficiency
         Note that this might affect non-radiated jets,
         e.g. from decays. Use cut_decays=F in run_card.
         Warning! mmjj set to xqcut=   30.000000000000000       to improve integration efficiency
         Note that this might affect non-radiated jets,
         e.g. from decays. Use cut_decays=F in run_card.
         Define smin to   4000.0000000000000     
         *****************************************************
         *               MadGraph/MadEvent                   *
         *        --------------------------------           *
         *          http://madgraph.hep.uiuc.edu             *
         *          http://madgraph.phys.ucl.ac.be           *
         *          http://madgraph.roma2.infn.it            *
         *        --------------------------------           *
         *                                                   *
         *          PARAMETER AND COUPLING VALUES            *
         *                                                   *
         *****************************************************

          External Params
          ---------------------------------
          
         mdl_MT =    173.00000000000000     
         mdl_MTA =    1.7769999999999999     
         mdl_MZ =    91.188000000000002     
         mdl_MH =    125.00000000000000     
         aEWM1 =    132.50700000000001     
         mdl_Gf =    1.1663900000000000E-005
         aS =   0.11799999999999999     
         mdl_ymt =    173.00000000000000     
         mdl_ymtau =    1.7769999999999999     
         mdl_WT =    1.4915000000000000     
         mdl_WZ =    2.4414039999999999     
         mdl_WW =    2.0476000000000001     
         mdl_WH =    6.3823389999999999E-003
          Internal Params
          ---------------------------------
          
         mdl_conjg__CKM3x3 =    1.0000000000000000     
         mdl_conjg__CKM1x1 =    1.0000000000000000     
         mdl_CKM3x3 =    1.0000000000000000     
         mdl_complexi =                (0.0000000000000000,1.0000000000000000)
         mdl_MZ__exp__2 =    8315.2513440000002     
         mdl_MZ__exp__4 =    69143404.913893804     
         mdl_sqrt__2 =    1.4142135623730951     
         mdl_MH__exp__2 =    15625.000000000000     
         mdl_aEW =    7.5467711139788835E-003
         mdl_MW =    80.419002445756163     
         mdl_sqrt__aEW =    8.6872153846781555E-002
         mdl_ee =   0.30795376724436879     
         mdl_MW__exp__2 =    6467.2159543705357     
         mdl_sw2 =   0.22224648578577766     
         mdl_cw =   0.88190334743339216     
         mdl_sqrt__sw2 =   0.47143025548407230     
         mdl_sw =   0.47143025548407230     
         mdl_g1 =   0.34919219678733299     
         mdl_gw =   0.65323293034757990     
         mdl_vev =    246.21845810181637     
         mdl_vev__exp__2 =    60623.529110035903     
         mdl_lam =   0.12886910601690263     
         mdl_yt =   0.99366614581500623     
         mdl_ytau =    1.0206617000654717E-002
         mdl_muH =    88.388347648318430     
         mdl_I2x33 =               (0.99366614581500623,0.0000000000000000)
         mdl_I3x33 =               (0.99366614581500623,0.0000000000000000)
         mdl_ee__exp__2 =    9.4835522759998875E-002
         mdl_sw__exp__2 =   0.22224648578577769     
         mdl_cw__exp__2 =   0.77775351421422245     
          Internal Params evaluated point by point
          ----------------------------------------
          
         mdl_sqrt__aS =   0.34351128074635334     
         mdl_G__exp__2 =    1.4828317324943823     
          Couplings of sm-no_b_mass
          ---------------------------------
          
                GC_10 =  -0.12177E+01   0.00000E+00
                GC_11 =   0.00000E+00   0.12177E+01
                 GC_1 =  -0.00000E+00  -0.10265E+00
                 GC_2 =   0.00000E+00   0.20530E+00
                 GC_3 =  -0.00000E+00  -0.30795E+00
                GC_50 =  -0.00000E+00  -0.28804E+00
                GC_58 =  -0.00000E+00  -0.27437E-01
                GC_59 =   0.00000E+00   0.82310E-01

         Collider parameters:
         --------------------

         Running at P P   machine @    13000.000000000000       GeV

        Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

        Backtrace for this error:
        #0  0x7f2e29623860 in ???
        #1  0x7f2e29622a05 in ???
        #2  0x7f2e29254def in ???
        #3  0x433257 in ???
        #4  0x434487 in ???
        #5  0x43507e in ???
        #6  0x45047a in ???
        #7  0x430937 in ???
        #8  0x40371e in ???
        #9  0x7f2e2923feaf in ???
        #10  0x7f2e2923ff5f in ???
        #11  0x403844 in ???
        #12  0xffffffffffffffff in ???

        ls status:
        input_app.txt
        run1_app.log

Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/run_01_tag_1_debug.log'.
Please attach this file to your report.

I actually get very many FPEs. I was chaining several backends (fortran, the cpp variants, cuda), so I am not 100% sure where it happened, but I imagine it was cuda.

Note

more /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc/G2/input_app.txt
         8192 1 1      !Number of events and max and min iterations
  0.1    !Accuracy
  2       !Grid Adjustment 0=none, 2=adjust
  1       !Suppress Amplitude 1=yes
  0        !Helicity Sum/event 0=exact
        2

Ok here is a reproducer

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc> ./madevent_cpp < /tmp/avalassi/input_app.txt 

And through gdb

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc> gdb ./madevent_cpp
...
(gdb) run < /tmp/avalassi/input_app.txt 
...
Program received signal SIGFPE, Arithmetic exception.
0x0000000000433257 in dsig1_vec (all_pp=<error reading variable: value requires 3145728 bytes, which is more than max-value-size>, 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384)
    at auto_dsig1.f:395
395               R=R-DABS(ALL_PD(IPSEL,IVEC))/ALL_PD(0,IVEC)
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x0000000000433257 in dsig1_vec (all_pp=<error reading variable: value requires 3145728 bytes, which is more than max-value-size>, 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384)
    at auto_dsig1.f:395
#1  0x0000000000434488 in dsigproc_vec (all_p=..., 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=2, 
    symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384)
    at auto_dsig.f:1032
#2  0x000000000043507f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=2, 
    all_out=..., vecsize_used=16384) at auto_dsig.f:327
#3  0x000000000045047b in sample_full_ ()
#4  0x0000000000430938 in driver () at driver.f:257
#5  0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
#6  0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#7  0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#8  0x0000000000403845 in _start ()
(gdb) p ivec
$1 = 11
(gdb) p all_pd(0,ivec)
$2 = 0

It seems that the issue happens both with -O3 and with -O, -O1 or no -O flag at all... (on the Fortran side)

Poor man's debugging

      write(6,*) 'maxproc=', maxproc
      DO IVEC=1,VECSIZE_USED
        write(6,*) 'ivec, all_pd(0,ivec) =', ivec, all_pd(0,ivec) 
C       Do not need those three here do I?	 

gives

 maxproc=           2
 ivec, all_pd(0,ivec) =           1   71.066572418427214     
 ivec, all_pd(0,ivec) =           2   365.96232532682160     
 ivec, all_pd(0,ivec) =           3   1296.3895857186428     
 ivec, all_pd(0,ivec) =           4   132.50043236269377     
 ivec, all_pd(0,ivec) =           5   15.597965701852193     
 ivec, all_pd(0,ivec) =           6   3942.0292625352567     
 ivec, all_pd(0,ivec) =           7   238.75633877329895     
 ivec, all_pd(0,ivec) =           8   143.34436759532593     
 ivec, all_pd(0,ivec) =           9   586.74521438125407     
 ivec, all_pd(0,ivec) =          10   422.52108709808942     
 ivec, all_pd(0,ivec) =          11   0.0000000000000000     

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f5e9b023860 in ???
#1  0x7f5e9b022a05 in ???
#2  0x7f5e9ac54def in ???
#3  0x48c764 in dsig1_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc/auto_dsig1.f:397
#4  0x48f3d0 in dsigproc_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc/auto_dsig.f:1032
#5  0x490906 in dsig_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc/auto_dsig.f:327
#6  0x4ae2ea in ???
#7  0x488896 in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc/driver.f:257
#8  0x488e62 in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc/driver.f:302
Floating point exception (core dumped)

In other words: this does not look like a SIMD issue (so it makes sense that -O3 or -O1 makes no difference)

It seems that there is a real issue: one event has a 0 weight? Why only in cpp and not in fortran?...

Oops... also in fortran it is 0, but then why does it not crash?

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc> ./madevent_fortran < /tmp/avalassi/input_app.txt  | more
...
 maxproc=           2
 ivec, all_pd(0,ivec) =           1   71.066572418427214     
 ivec, all_pd(0,ivec) =           2   365.96232532682160     
 ivec, all_pd(0,ivec) =           3   1296.3895857186428     
 ivec, all_pd(0,ivec) =           4   132.50043236269377     
 ivec, all_pd(0,ivec) =           5   15.597965701852193     
 ivec, all_pd(0,ivec) =           6   3942.0292625352567     
 ivec, all_pd(0,ivec) =           7   238.75633877329895     
 ivec, all_pd(0,ivec) =           8   143.34436759532593     
 ivec, all_pd(0,ivec) =           9   586.74521438125407     
 ivec, all_pd(0,ivec) =          10   422.52108709808942     
 ivec, all_pd(0,ivec) =          11   0.0000000000000000     
 ivec, all_pd(0,ivec) =          12   0.0000000000000000     
 ivec, all_pd(0,ivec) =          13   1563.2708732525296     
 ivec, all_pd(0,ivec) =          14   21.762972077135977     
...

In principle ALL_PD(0,IVEC) cannot be zero...
(Without looking at the code, to avoid bias:) that variable should be a sum of the absolute values of ALL_PD(IPSEL,IVEC)
over the values of IPSEL, which should be 1 or 2 (corresponding to IMIRROR).
Since we are running an LHC setup, ALL_PD(1,IVEC) = ALL_PD(2,IVEC)
(since this corresponds to the PDF contribution of the event...).

Now clearly, given your update, I am wrong somewhere...

Thanks @oliviermattelaer I was going to ask you in fact ;-)

More debugging

      write(6,*) 'maxproc=', maxproc
      DO IVEC=1,VECSIZE_USED
        write(6,*) 'ivec, all_pd(0,ivec) =', ivec, all_pd(0,ivec)
        write(6,*) 'all_pd(0,ivec) == 0?', all_pd(0,ivec).eq.0

The variable IS 0 both in cpp and fortran. The former prints T and crashes, the latter prints T and does not crash

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/pp_dy012j.mad/SubProcesses/P2_uc_epemuc> ./madevent_fortran < /tmp/avalassi/input_app.txt  | more
...
 maxproc=           2
 ivec, all_pd(0,ivec) =           1   71.066572418427214     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           2   365.96232532682160     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           3   1296.3895857186428     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           4   132.50043236269377     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           5   15.597965701852193     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           6   3942.0292625352567     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           7   238.75633877329895     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           8   143.34436759532593     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =           9   586.74521438125407     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =          10   422.52108709808942     
 all_pd(0,ivec) == 0? F
 ivec, all_pd(0,ivec) =          11   0.0000000000000000     
 all_pd(0,ivec) == 0? T
 ivec, all_pd(0,ivec) =          12   0.0000000000000000     
 all_pd(0,ivec) == 0? T
 ivec, all_pd(0,ivec) =          13   1563.2708732525296     
 all_pd(0,ivec) == 0? F

I imagine that the difference in behaviour (crash or no crash) is just because in cpp there is an explicit call to fpeEnable that makes the program abort on DIVBYZERO.
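To illustrate the point with a standalone sketch (not project code): by default gfortran does not trap floating point exceptions, so a division by zero silently yields Infinity; it is the trapping enabled on the cudacpp side (fpeEnable, or equivalently gfortran's -ffpe-trap option) that turns it into a SIGFPE abort.

      program fpe_sketch
c     Standalone illustration: without FPE trapping enabled, a double
c     precision division by zero does not raise SIGFPE, it silently
c     yields Infinity (compile with gfortran and no -ffpe-trap).
      implicit none
      double precision zero, r
      zero = 0d0
      r = 1d0/zero
      write(6,*) 'r =', r
      end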

Now why this is 0 is probably for Olivier to debug...

What I will look at in the meantime is whether there is a possible workaround to skip some computations if that value is zero (i.e. cure the symptoms, not the root cause)
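For the record, the kind of guard I have in mind would be something like this around auto_dsig1.f:395 (a sketch only, clearly curing the symptom rather than the root cause):

C     Sketch only: avoid the division when the PDF sum ALL_PD(0,IVEC)
C     is zero, so FE_DIVBYZERO is not raised (the underlying zero PDF
C     remains to be understood).
          IF (ALL_PD(0,IVEC).NE.0D0) THEN
            R=R-DABS(ALL_PD(IPSEL,IVEC))/ALL_PD(0,IVEC)
          ENDIF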

I would say that it is pointless to cure the symptoms.
Let me understand the real issue, which is anyway crucial and required.

I would say that it is pointless to cure the symptoms.
Let me understand the real issue, which is anyway crucial and required.

Hi yes I had a look and I agree :-)
Thanks Olivier for taking a look!

Indeed the crash is BEFORE the ME calculation. So what is 0 is not the ME, it is the PDF?!

ALL_PD is a product of U1 and C2, which are the PDFs. So this indicates a zero PDF, which clearly looks like a bug...

      DO IVEC=1,VECSIZE_USED
        IF (ABS(LPP(IB(1))).GE.1) THEN
            !LP=SIGN(1,LPP(IB(1)))
          U1(IVEC)=PDG2PDF(LPP(IB(1)),2, IB(1),ALL_XBK(IB(1),IVEC)
     $     ,DSQRT(ALL_Q2FACT(IB(1), IVEC)))
        ENDIF
        IF (ABS(LPP(IB(2))).GE.1) THEN
            !LP=SIGN(1,LPP(IB(2)))
          C2(IVEC)=PDG2PDF(LPP(IB(2)),4, IB(2),ALL_XBK(IB(2),IVEC)
     $     ,DSQRT(ALL_Q2FACT(IB(2), IVEC)))
        ENDIF
      ENDDO
      ALL_PD(0,:) = 0D0
      IPROC = 0
      IPROC=IPROC+1  ! u c > e+ e- u c
      DO IVEC=1, VECSIZE_USED
        ALL_PD(IPROC,IVEC)=U1(IVEC)*C2(IVEC)
        ALL_PD(0,IVEC)=ALL_PD(0,IVEC)+DABS(ALL_PD(IPROC,IVEC))

      ENDDO
      IPROC=IPROC+1  ! u c > mu+ mu- u c
      DO IVEC=1, VECSIZE_USED
        ALL_PD(IPROC,IVEC)=U1(IVEC)*C2(IVEC)
        ALL_PD(0,IVEC)=ALL_PD(0,IVEC)+DABS(ALL_PD(IPROC,IVEC))

      ENDDO

Looks like my fortran code is not the same as yours (I'm in the middle of a merge, which is likely the cause -- but that means that such merging is super problematic --), but your fortran code makes more sense than mine (and corresponds to my explanation above). (Sorry, but before focusing on this I will need to fix the merging, one issue at a time.)

Now, given that ALL_PD is a local variable, the only possible explanation is indeed that the values are ... zero because of the PDF call. This likely means that the momenta are "bad"; the question is obviously why.

One option would be that this is because of the black box vetoing events; this is possible here, but I doubt that this is what is happening (this would likely be REWGT set to zero).

The second hypothesis would be that this is due to a reset of the code after 10 events. Those can occur, but typically this happens after the call to the matrix element... But can you try to set in the run_card: False = mc_grouped_subproc
to see if this might be related to that?

(I'm in the middle of a merge, which is likely the cause -- but that means that such merging is super problematic --), but your fortran code makes more sense than mine (and corresponds to my explanation above). (Sorry, but before focusing on this I will need to fix the merging, one issue at a time.)

Thanks Olivier... maybe open another issue for the merging and we can follow that up there.

Note, I merged a lot in june24, I am waiting for that. (And I know you are waiting for other things from me... sorry).

The second hypothesis would be that this is due to a reset of the code after 10 events. Those can occur, but typically this happens after the call to the matrix element... But can you try to set in the run_card: False = mc_grouped_subproc
to see if this might be related to that?

OK, I will try later this week (I will be off this afternoon)

Just to follow up on the merging (I will open 2 PRs on that): this was my mistake (or bias, since I read what "I wanted to see"). So here there is no imirror (it is actually handled in auto_dsig.f), but you still have iproc=2 due to the e+ e- and mu+ mu- flavour difference in the final state (where both have identical matrix elements). Sorry for the noise here.