gschramm/parallelproj

OpenMP parallel for allocates new variable at each loop

Closed this issue · 12 comments

Here (but also in all other forward and backward methods) a large number of variables is defined inside the #pragma omp parallel for section, forcing the variables to be reallocated at each iteration of a presumably very large for loop.

I did not have the chance to check the implications for the execution time of the code, but I presume that with many iterations the effect will be noticeable. I suggest splitting the #pragma omp into two sections:

#pragma omp parallel
{
    float d0, d1, d2, d0_sq, d1_sq, d2_sq; 
    float lsq, cos0_sq, cos1_sq, cos2_sq;
    unsigned short direction; 
    int i0, i1, i2;
    int i0_floor, i1_floor, i2_floor;
    int i0_ceil, i1_ceil, i2_ceil;
    float x_pr0, x_pr1, x_pr2;
    float tmp_0, tmp_1, tmp_2;
   
    float u0, u1, u2, d_norm;
    float x_m0, x_m1, x_m2;    
    float x_v0, x_v1, x_v2;    

    short it = tof_bin[0];
    float dtof, tw;

    // correction factor for cos(theta) and voxsize
    float cf;
    float toAdd;

    float sig_tof   = sigma_tof[0];
    float tc_offset = tofcenter_offset[0];

    float xstart0 = xstart[0];
    float xstart1 = xstart[1];
    float xstart2 = xstart[2];

    float xend0 = xend[0];
    float xend1 = xend[1];
    float xend2 = xend[2];

    float voxsize0 = voxsize[0];
    float voxsize1 = voxsize[1];
    float voxsize2 = voxsize[2];

    float img_origin0 = img_origin[0];
    float img_origin1 = img_origin[1];
    float img_origin2 = img_origin[2];

    unsigned char intersec;
    float t1, t2;
    float istart_f, iend_f, tmp;
    int   istart, iend;
    float istart_tof_f, iend_tof_f;
    int   istart_tof, iend_tof;

# pragma omp for schedule(static)
  for(i = 0; i < nlors; i++)
  {
    it = tof_bin[i];
    
    sig_tof   = sigma_tof[i];
    tc_offset = tofcenter_offset[i];

    xstart0 = xstart[i*3 + 0];
    xstart1 = xstart[i*3 + 1];
    xstart2 = xstart[i*3 + 2];

    xend0 = xend[i*3 + 0];
    xend1 = xend[i*3 + 1];
    xend2 = xend[i*3 + 2];

    ...
  }
}

This way, the variables are allocated once per thread rather than nlors times.

Also, I do not think redefining variables like xstart0 really simplifies the code for the reader. I'd suggest creating an index variable for the location in the array, e.g. pos0 = i * 3 + 0 (see the sketch below).
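For illustration, a minimal sketch of that indexing idea; the helper name and signature below are made up and not code from the repository:

/* hypothetical helper, not part of parallelproj: compute the array
   offsets once per LOR instead of repeating i*3 + k in every read */
static void read_lor_endpoints(const float *xstart, const float *xend,
                               long long i, float start[3], float end[3])
{
    long long pos0 = i * 3 + 0;
    long long pos1 = i * 3 + 1;
    long long pos2 = i * 3 + 2;

    start[0] = xstart[pos0];
    start[1] = xstart[pos1];
    start[2] = xstart[pos2];

    end[0] = xend[pos0];
    end[1] = xend[pos1];
    end[2] = xend[pos2];
}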

I implemented your suggested splitting of the omp parallel section in the omp_edo branch.

Unfortunately, for the nonTOF fwd+back and the TOF back projection, the results get worse (see below).
Any idea why that is?

                                         sino fwd     sino back
nonTOF, old (0.7, without splitting)     3.1314 s     3.5869 s
nonTOF, new (with omp parallel split)    4.8537 s     5.4478 s
TOF, old (0.7, without splitting)        17.4382 s    31.7857 s
TOF, new (with omp parallel split)       7.6153 s     61.1725 s

Of course not!

How many threads and lors do you have?

What does the schedule(static) do?

Looking at the implemented code I see that the parallel for loop uses an iterator index i that is a shared variable!

My understanding of OpenMP is that the iteration variable should be private to each thread. If I understand correctly, your threads are not running efficiently because they need to change the value of a shared variable atomically. Also, i is only declared but never given a value, so all the following checks and definitions will not make much sense.

I'd keep the assumption that i=0 in the initial #pragma omp parallel declarations; then I'd try to remove the declaration of i at the top and declare it directly in the 2 for loops as

#pragma omp for schedule(static, omp_get_num_threads())
for (long long i = 0; i < nlors; i++) {
...
}

Notice that I have explicitly set the chunk size of the static schedule to the total number of threads. I don't know what the default is (probably something like this), but it may not be guaranteed, so it is better to be explicit.

Lastly, I'd try to remove the schedule(static) and see what happens...

Maybe @gfardell could give you some more hints.

My tests above used 20 threads and ca. 10^8 geometrical LORs (in the nonTOF case).
I will test your suggested changes later this week.

I implemented your suggested changes once again on the omp_edo branch.
But the results do not seem to improve.

(1) Can you have a look at the nonTOF fwd and back projectors to see whether I overlooked something?

(2) If you want to benchmark it yourself, you can run this example like

python projector_benchmark.py --counts 0 --nsubsets 28 --nontof

(this of course needs pyparallelproj on your PYTHONPATH)

That's interesting. First, the long long i must be defined outside the parallel block, so my comment was wrong there.

What might be going on here is that the compiler is very clever and does not do what you, the programmer, instructed it to do. So maybe it allocates the variables only once anyway?

If I find time I'll test it!

Allocation of these variables takes essentially zero time: float, int etc. are put on the stack, so it is just an increment of the stack pointer, done! The compiler probably does a better job optimising things if it knows that the variables are local. Generally, I suggest to:

  • declare (and initialise) variables as locally as possible
  • make as many of them const as you can; that way, the compiler knows they won't be modified (and can even optimise them away). A minimal sketch follows below.
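For example, a small sketch along those lines, assuming flat xstart/xend arrays of length 3*nlors; this is not the actual projector code:

void fwd_sketch(const float *xstart, const float *xend, long long nlors)
{
    /* per-LOR values declared inside the loop body are automatically
       private to each thread; const tells the compiler they never change */
#pragma omp parallel for schedule(static)
    for (long long i = 0; i < nlors; i++)
    {
        const float xstart0 = xstart[i*3 + 0];
        const float xstart1 = xstart[i*3 + 1];
        const float xstart2 = xstart[i*3 + 2];

        const float d0 = xend[i*3 + 0] - xstart0;
        const float d1 = xend[i*3 + 1] - xstart1;
        const float d2 = xend[i*3 + 2] - xstart2;

        /* ... ray tracing using the const local values ... */
    }
}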

Finally, as to why your backprojection is so slow: all threads will compete for memory access. In STIR, we allocate one image per thread and add them up at the end (which ideally we'd do with a reduce operation, but that came after OpenMP 2.0, which is still the only version supported by VC). Of course, it needs more memory.
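Roughly, the one-image-per-thread strategy could look like the sketch below (only a hedged illustration, not the STIR or parallelproj code; nvox, nlors and the commented-out backproject_lor helper are placeholders):

#include <stdlib.h>
#include <omp.h>

/* sketch: every thread accumulates into its own image copy, and the
   copies are summed into the shared output image at the end */
void back_project_per_thread(float *img, long long nvox, long long nlors)
{
    int nthreads = omp_get_max_threads();

    /* one zero-initialised private image per thread */
    float *tmp = (float *)calloc((size_t)nthreads * (size_t)nvox, sizeof(float));

#pragma omp parallel
    {
        float *my_img = tmp + (size_t)omp_get_thread_num() * (size_t)nvox;

#pragma omp for schedule(static)
        for (long long i = 0; i < nlors; i++)
        {
            /* backproject_lor(my_img, i);  placeholder for the per-LOR update */
            (void)my_img;  /* the real code writes into my_img here */
        }
    }

    /* reduce: add all per-thread images into the output image */
    for (int t = 0; t < nthreads; t++)
        for (long long v = 0; v < nvox; v++)
            img[v] += tmp[(size_t)t * (size_t)nvox + v];

    free(tmp);
}

The final summation could of course also be parallelised over voxels; the sketch keeps it serial for brevity.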

Initially, I had two OpenMP-based backprojection functions: one that backprojects into a single image with atomic operations, and one where every thread backprojects into a separate image, followed by reducing them.
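For comparison, the atomic-update variant could look roughly like this (again only a sketch; lor_values and the single-voxel update stand in for the real inner loop over intersected voxels, they are not the removed code):

/* sketch: all threads write into one shared image, with each voxel
   update protected by an atomic add */
void back_project_atomic(float *img, long long nvox, long long nlors,
                         const float *lor_values)
{
#pragma omp parallel for schedule(static)
    for (long long i = 0; i < nlors; i++)
    {
        /* the real projector loops over all voxels intersected by LOR i;
           a single made-up voxel index stands in for that inner loop here */
        long long ivox = i % nvox;
        float toAdd = lor_values[i];

#pragma omp atomic
        img[ivox] += toAdd;
    }
}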

I decided to remove the latter here, since I thought that on a machine with many cores and bigger volumes we might run into memory problems. E.g. with 64 threads on a 500x500x500 volume (ca. 0.5 GB per image copy in 32bit precision) we would need ca. 32GB of RAM.

I'm starting to have a déjà-vu feeling now... Did we have this conversation already?

Of course, memory becomes a problem (but if you have a 64-thread machine, maybe you have 32GB as well...). Did you ever measure the timing difference? I'd be (mildly) interested.

Regarding @paskino's question on schedule(static): doc is here. It works best if all chunks have the same load. Not so sure that is the case here, though. dynamic could work better if different LORs have different computational load.
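For anyone who wants to try it, a minimal sketch of swapping the schedule clause (the chunk size 64 and the do_lor placeholder are arbitrary choices, not from the repository):

/* sketch comparing the two schedule clauses */
void schedule_comparison(long long nlors)
{
    /* static: iterations are divided into equal chunks up front; lowest
       overhead, works best when every LOR costs roughly the same */
#pragma omp parallel for schedule(static)
    for (long long i = 0; i < nlors; i++)
    {
        /* do_lor(i); */
    }

    /* dynamic: threads grab the next chunk when they finish one; more
       scheduling overhead, but better load balance when LOR costs differ */
#pragma omp parallel for schedule(dynamic, 64)
    for (long long i = 0; i < nlors; i++)
    {
        /* do_lor(i); */
    }
}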

Anyway, people are probably mostly interested in the GPU version...

I haven't done a systematic comparison of OpenMP's atomic adds, but I know that the ones implemented in CUDA are surprisingly fast. Of course, this all depends on what you backproject ("ordered" views or "random" LM LORs).
Something that could be investigated in the frontiers paper.

If I find some time next week, I can run some benchmark tests comparing the two backprojection strategies on CPUs using OpenMP.