Efficient pRNG implementation for AdePT
Closed this issue · 4 comments
The criteria defining an efficiency "score" for a pRNG implementation in AdePT are described in the pRNG Wiki section. This tracker follows up on implementations and/or reuses of random number generators in AdePT.
To expand a bit on the discussion this morning regarding comparison with cuRAND: The documentation contains a page about "Testing" that discusses results from BigCrush, see https://docs.nvidia.com/cuda/curand/testing.html. Some particularly interesting statements:
The XORWOW generator passes all of the tests on most runs, but does produce occasional suspect statistics.
(this is the default generator if you just use `curandState_t`, which is `typedef`ed to `curandStateXORWOW_t`)
The MRG32k3a generator [...] passes all "BigCrush" tests frequently, with occasional marginal results [...]
The MTGP32 generator exhibits some marginal results on "BigCrush".
Finally, the Philox4_32_10 generator passes all tests of "BigCrush". (The MT19937 generator also passes all tests, but AFAICT it's host-only and you cannot use it on the device, so I'll skip that.)
For performance, I think the interesting metric here is parallel throughput. For testing, I let all launched threads (`<<<1024, 128>>>`) generate a fair amount (1e6) of `double`-precision random numbers (using `curand_uniform_double`) from the same seed. Then I can calculate the time it takes for every CUDA core to spit out one random number as `wall-clock time / (#numbers * #threads / #cores)`.
- XORWOW generator: `3.06s / (1e6 * 1024 * 128 / 2304) = 53.8ns`
- MRG32k3a generator: `25.05s / (1e6 * 1024 * 128 / 2304) = 440ns`
- Philox4_32_10 generator: `9.52s / (1e6 * 1024 * 128 / 2304) = 167ns`
(I haven't yet figured out how to initialize the MTGP32 generator...)
For the implementation of RANLUX++, I measure `11.03s / (1e6 * 1024 * 128 / 2304) = 194ns`.
In terms of state size, the `struct`s for cuRAND contain 2x `int`, 1x `float`, and 1x `double` for Box-Muller transformations. If we don't plan on using `curand_normal` / `curand_log_normal`, we could drop them for our own implementation.
Other than that, the implementations of XORWOW and MRG32k3a in cuRAND need a state of 6x `unsigned int`, or 24 bytes each. Philox4_32_10 needs 11x `unsigned int` (via some vector types), so 44 bytes. RANLUX++ needs 9x `uint64_t` plus one `int` position, so a total of 76 bytes.
For your information, I discuss criteria for high-quality PRNGs and how to test PRNGs for quality. In particular, I define a high-quality PRNG as one that "generates bits that behave like independent uniform random bits (at least for nearly all practical purposes outside of information security)", among other things; one piece of evidence for this is showing no failures in Chris Doty-Humphrey's PractRand tests at 1 TiB.
Forgot to comment here / mention this in #46, where I added the RANLUX++ implementation. Its quality should be really good, it works on both the host and the GPU, and you can transfer the state via a `cudaMemcpy`. For efficiency, it remains to be seen how large the fraction is in a realistic scenario.
#114 implemented an efficient branching scheme (fixed in #129), and #130 updated RANLUX++ to avoid a bias in the generated numbers. Here's a validation plot comparing RANLUX++ and cuRAND: cuRAND_CaloEDep.pdf. cuRAND's XORWOW clearly shows a higher relative error compared to results from Geant4, so we should stick to RANLUX++. I consider this issue done for now.