compdyn/partmc

Adapting time_derivative to multicells

Opened this issue · 7 comments

Currently, the derivative function in CAMP-MONARCH takes ~50% of the execution time.

An optimization was proposed some time ago to improve vectorization and memory access (#116), based on swapping the RXN loop and the cells loop (the inner loop would be the cells loop instead of the rxn_type loop).

However, this optimization is harder to implement today because of the time_deriv implementation. time_deriv stores partial results for each reaction until the cell calculation is finished, and then combines these partial contributions to obtain deriv. So, to compute this with the loops swapped, we would need to store all partial contributions over all cells. Taking into account that these contributions are long double, this is a considerable amount of data (especially in the GPU case).

So, I'm thinking about some alternatives. First of all, I will add an option to skip the time_deriv calculation and use the old approach (production_rates - loss_rates) when a debug flag is ON. The idea is to better analyze the effect of time_deriv on the cb05-only CAMP-MONARCH version.

The second option would be to make time_deriv of size n_cells and add all the contributions to species X over all cells, save the result in deriv_data, and repeat for all species. However, I'm not sure whether this approach would be consistent with the idea behind time_deriv. Can you shed some light on this question, @mattldawson? A general explanation of time_deriv would be enough (we can also reuse this description for the CAMP article if you see fit).

The reason for introducing the TimeDerivative struct and functions was to reduce the effects of catastrophic cancellation of production and loss terms for processes that rapidly approach some equilibrium value - HL partitioning, SIMPOL VP partitioning, aqueous equilibrium, or even sets of fast gas-phase reactions that lead to large production and loss rates for individual species that nearly cancel each other out. I think the most important part is storing the derivative as long double - I'm not sure if saving separate production and loss arrays makes a big difference. We could try changing TimeDerivative to have one long double net rate array. Do you think this would help? We would have to make sure it doesn't make the system harder to solve (the previous approach resulted in CVODE convergence errors).

I'm not sure if trying to calculate the derivative one species at a time would be efficient - each reaction can affect any of the species in the system, so you would end up recalculating rates for each reaction for every species that it affects (which can be > 10 species for some gas-phase reactions).

I don't think mixing the two separate arrays in one long double array would make any difference.

And what about storing only n_species contributions for all the cells? Instead of computing TimeDerivative for each cell, it would be computed over all cells. Or would that give a completely different result?

ok, but we should test it to make sure - the solver was very sensitive to these artifacts. We need to make sure that the change to a single array doesn't cause the solver to fail and that it doesn't increase the number of solver steps CVODE needs to solve the system. We will quickly lose any efficiency gained by changing the TimeDerivative struct if a less accurate derivative increases the number of solver steps CVODE takes.

no - they are completely different for each cell, and the solver is very sensitive to getting accurate derivative calculations

By the way, it is strange that creating a long double from a double and truncating it back later produces a correct result. Normally such truncations can cause bugs.

Well, for the moment I will run some quick tests on the CPU with both options, and later I will port both options to the GPU and check the impact on performance. We can look for more optimizations after the GPU version (since, after all, the GPU is the most interesting case).

the point is to subtract the production and loss rates as long double so that the result, after the precision loss, still has the precision of a double (or as close to double as we can get):

#include <stdio.h>

int main(void) {

  /* subtract in long double: the +1.0 survives the subtraction */
  long double long_loss = 1.0e18l;
  long double long_production = 1.0e18l + 1.0l;
  double rate = long_production - long_loss;

  printf("\n rate = %lg", rate);

  /* subtract in double: the +1.0 is rounded away before the subtraction */
  double dbl_loss = 1.0e18;
  double dbl_production = 1.0e18 + 1.0;
  rate = dbl_production - dbl_loss;

  printf("\n rate = %lg", rate);
  printf("\n");
  return 0;
}

output:

 rate = 1
 rate = 0

(Near 1e18 the gap between consecutive doubles is 2^7 = 128, so adding 1.0 in double precision is rounded away; with x86's 80-bit long double the gap there is 2^-4, so the 1.0 survives the subtraction.)

sounds good!