Efficient pRNG implementation for AdePT
Closed this issue · 4 comments
The criteria defining an efficiency "score" for a pRNG implementation in AdePT are described in the pRNG Wiki section. This tracker follows up on implementations and/or reuses of random number generators in AdePT.
To expand a bit on the discussion this morning regarding comparison with cuRAND: The documentation contains a page about "Testing" that discusses results from BigCrush, see https://docs.nvidia.com/cuda/curand/testing.html. Some particularly interesting statements:
The XORWOW generator passes all of the tests on most runs, but does produce occasional suspect statistics.
(this is the default generator if you just use `curandState_t`, which is `typedef`ed to `curandStateXORWOW_t`)
The MRG32k3a generator [...] passes all "BigCrush" tests frequently, with occasional marginal results [...]
The MTGP32 generator exhibits some marginal results on "BigCrush".
Finally, the Philox4_32_10 generator passes all tests of "BigCrush". (The MT19937 generator also passes all tests, but AFAICT it's host-only and you cannot use it on the device, so I'll skip that.)
For performance, I think the interesting metric here is parallel throughput. For testing, I let all launched threads (`<<<1024, 128>>>`) generate a fair amount (1e6) of `double`-precision random numbers (using `curand_uniform_double`) from the same seed. Then I can calculate the time it takes for every CUDA core to spit out one random number as `wall-clock time / (#numbers * #threads / #cores)`.
- XORWOW generator: `3.06s / (1e6 * 1024 * 128 / 2304) = 53.8ns`
- MRG32k3a generator: `25.05s / (1e6 * 1024 * 128 / 2304) = 440ns`
- Philox4_32_10 generator: `9.52s / (1e6 * 1024 * 128 / 2304) = 167ns`
(I haven't yet figured out how to initialize the MTGP32 generator...)
For the implementation of RANLUX++, I measure `11.03s / (1e6 * 1024 * 128 / 2304) = 194ns`.
In terms of state size, the `struct`s for cuRAND contain 2x `int`, 1x `float`, and 1x `double` for Box-Muller transformations. If we don't plan on using `curand_normal` / `curand_log_normal`, we could drop them for our own implementation.
Other than that, the implementations of XORWOW and MRG32k3a in cuRAND need a state of 6x `unsigned int`, or 24 bytes each. Philox4_32_10 needs 11x `unsigned int` (via some vector types), so 44 bytes. RANLUX++ needs 9x `uint64_t` plus one `int` position, so a total of 76 bytes.
For your information, I discuss criteria for high-quality PRNGs and how to test PRNGs for quality. In particular, I define a high-quality PRNG as one that "generates bits that behave like independent uniform random bits (at least for nearly all practical purposes outside of information security)", among other things; one piece of evidence for this is showing no failures in Chris Doty-Humphrey's PractRand tests at 1 TiB.
Forgot to comment here / mention this in #46, where I added the RANLUX++ implementation. Its quality should be really good, it works on both the host and the GPU, and you can transfer the state via a `cudaMemcpy`. For efficiency, it remains to be seen how large the fraction is in a realistic scenario.
#114 implemented an efficient branching scheme (fixed in #129), and #130 updated RANLUX++ to avoid a bias in the generated numbers. Here's a validation plot comparing RANLUX++ and cuRAND: cuRAND_CaloEDep.pdf. cuRAND's XORWOW clearly shows a higher relative error compared to results from Geant4, so we should stick to RANLUX++. I consider this issue done for now.