Reduce large floating point accumulation error in high photon simulations
fangq opened this issue · 1 comment
MCX has used atomic operations for fluence accumulation by default for several years. However, a drop in fluence intensity has been observed in large photon simulations. For example, running the script below with the current MCX GitHub code produces the plot below.
clear cfg
cfg.vol=uint8(ones(60,60,60));      % 60x60x60 homogeneous volume
cfg.srcpos=[30 30 1];               % pencil beam entering at the top surface
cfg.srcdir=[0 0 1];
cfg.gpuid=1;
cfg.autopilot=1;
cfg.prop=[0 0 1 1;0.005 1 0 1.37];  % [mua mus g n]: row 1 background, row 2 medium
cfg.tstart=0;
cfg.tend=5e-9;
cfg.tstep=cfg.tend;                 % a single time gate
figure
hold on;
for i=6:9                           % 1e6 to 1e9 photons
    cfg.nphoton=10^i;
    flux=mcxlab(cfg);
    plot(log10(abs(squeeze(flux.data(30,30,:)))),'-','color', rand(1,3));  % depth profile under the source
end
The drop in intensity is not caused by data races, as was the case when non-atomic operations were used, but by the accumulation of round-off errors. In the region near the source, the deposited energy quickly grows to a large value. When a new energy deposit (a very small value) is added on top of this large value in single precision, accuracy becomes a problem.
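To illustrate, here is a minimal standalone sketch (plain C, not MCX code) showing that once a single-precision accumulator has grown large, a small new deposit is rounded away entirely, while a double-precision accumulator keeps it:

#include <stdio.h>

int main(void) {
    float  facc = 1.0e8f;   /* a voxel accumulator that has already grown large */
    float  w    = 1.0f;     /* a new, comparatively tiny energy deposit */
    double dacc = 1.0e8;

    /* the spacing between adjacent floats near 1e8 is 8, so adding 1.0f is rounded away */
    printf("float  accumulator: %.1f\n", facc + w);   /* prints 100000000.0: deposit lost */
    printf("double accumulator: %.1f\n", dacc + w);   /* prints 100000001.0: deposit kept */
    return 0;
}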
This is a serious problem because, as GPU computing capacity increases, most people will choose to run large photon simulations. We must be able to run large photon numbers without losing accuracy.
There are a few possible solutions to this problem.
The easiest solution is to change the energy storage to double. However, consumer GPUs have very poor double-precision throughput, so switching to double-precision accumulation is likely to cause a noticeable drop in speed.
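For reference, below is a sketch of what double-precision atomic accumulation could look like in a CUDA kernel. Devices of compute capability 6.0 and newer support atomicAdd() on double natively; older devices need the well-known atomicCAS()-based emulation from the CUDA C Programming Guide. The function name atomicAddDouble and the variable names are illustrative only, not MCX's actual code:

/* double-precision atomic add: compute capability >= 6.0 provides this natively
   via atomicAdd(double*, double); the atomicCAS() loop below is the standard
   emulation for older devices (see the CUDA C Programming Guide) */
__device__ double atomicAddDouble(double *address, double val) {
    unsigned long long int *p = (unsigned long long int *)address;
    unsigned long long int old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

/* hypothetical use during an absorption step, with illustrative names:
   atomicAddDouble(&fluence[idx], (double)w);  -- accurate, but slow on consumer GPUs */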
The standard way to add small values to a large floating-point accumulator is Kahan (compensated) summation, which is what we use in MMC. However, it requires multi-step operations and additional storage, and when combined with atomic operations, an atomic Kahan summation is very difficult to implement on the GPU.
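For context, this is what serial (non-atomic) Kahan summation looks like; it is a generic sketch, not the MMC implementation. The difficulty on the GPU is that the sum and the compensation term must be committed together as one indivisible read-modify-write, which a single 32-bit atomic operation cannot provide:

/* generic compensated (Kahan) summation sketch: 'comp' carries the low-order
   bits that were rounded off in previous additions */
__device__ void kahan_add(float *sum, float *comp, float val) {
    float y = val - *comp;     /* re-inject the previously lost low-order bits */
    float t = *sum + y;        /* large + small: low-order bits of y are lost here */
    *comp = (t - *sum) - y;    /* recover exactly what was lost in this addition */
    *sum  = t;                 /* sum and comp must be updated as a unit */
}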
Another idea is to use repetitions (-r) to split a large simulation into smaller chunks and sum the solutions together. For example, for 1e9 photons with 10 respins, we run 10 simulations of 1e8 photons each. This can reduce the round-off error, but repeatedly launching the kernel adds a large overhead, sometimes significantly higher than the kernel execution time itself. In addition, even at 1e8 photons, the drop in intensity in the above plot remains noticeable.
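As a host-side sketch of this idea (with a hypothetical run_one_chunk() placeholder standing in for a single MCX run, not a real API), the per-chunk single-precision outputs could be accumulated into a double-precision buffer:

/* hypothetical host-side splitting sketch: run nrepeat smaller simulations and
   sum their single-precision outputs into a double-precision buffer */
void run_split(long long nphoton, int nrepeat, int nvoxel, double *fluence,
               float *partial, void (*run_one_chunk)(long long nphoton, float *out)) {
    for (int r = 0; r < nrepeat; r++) {
        run_one_chunk(nphoton / nrepeat, partial);  /* each launch adds driver/kernel overhead */
        for (int i = 0; i < nvoxel; i++)
            fluence[i] += (double)partial[i];       /* per-chunk sums stay small, reducing round-off */
    }
}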
A robust method is needed to obtain a stable and convergent solution, especially at large photon numbers.