Parallelize permutation inversion in LLP?

@zacchiro points out that the LLP process spends some non-negligeable time single-threaded

A quick look at perf top shows it is in this function:

Line 194 in b4e4176

invert_in_place(&mut update_perm);

Would it make sense to allocate an array for another permutation, and invert the permutation not-in-place into that array so it can be parallelized?

Some numbers:

2023-09-20T08:19:28+02:00 - INFO - Completed.                                                                                                                                                                        
2023-09-20T08:19:28+02:00 - INFO - Elapsed: 3m 38s [27,397,574,122 nodes, 125233018.78 nodes/s, 7.99 ns/node]                                                                                                        
2023-09-20T08:19:28+02:00 - INFO - 7 updates, 4h 32m 10s, 1.54 updates/h, 38.88 m/update                                                                                                                             
2023-09-20T08:19:28+02:00 - INFO - Gain: 0.0009318041063451512                                                                                                                                                       
2023-09-20T08:19:28+02:00 - INFO - Modified: 124751634                                                                                                                                                               
2023-09-20T08:19:28+02:00 - INFO - Completed.                                                                                                                                                                        
2023-09-20T08:19:28+02:00 - INFO - Elapsed: 4h 32m 10s [7 updates, 1.54 updates/h, 38.88 m/update]                                                                                                                   
2023-09-20T11:29:34+02:00 - INFO - Log-gap cost: 1638690340843                                                                                                                                                       
2023-09-20T11:32:19+02:00 - INFO - 2 gammas, 21h 32m 1s, 2.23 gammas/d, 10.77 h/gamma; 16.67% done, 2d 23h 46m 43s to end; used/avail/free/total mem 1.27TB/870.12GB/190.86GB/4.33TB

the 2.2h jump in timestamps is the permutation inversion

WTF seriously? I would have never thought that it would be so slow. Yes, if you have the space we can do an out-of-place inversion. I don't think there's any easy way to do parallel in-place inversion because you cannot predict how cycles will develop. It might be possible that since most of the cost is moving, if you look at the cycle and mark it as "to move", you might have some threads marking and some threads permuting. But probably since the cost in accessing the memory, not reading or writing, that wouldn't be advantageous.