brian-team/brian2cuda

Optimize `StateMonitor`

denisalevi opened this issue · 0 comments

There are a few straightforward optimizations for our current `StateMonitor` implementation:

  1. We currently use dynamic device vectors (thrust) for monitor data, one vector per recorded neuron. We don't need that at all: the monitor's clock and the number of recorded neurons tell us in advance how much data each monitored variable will need. So instead of dynamic vectors and repeated resizing, allocate the full size once at the beginning (see the first sketch after this list).
  2. Currently, monitor data is stored on the GPU and only copied to the CPU at the end of a simulation. We should implement GPU -> CPU copies at user-defined (or heuristically chosen) intervals. I think a global (or per-monitor) preference that sets a fixed amount of GPU memory for the monitor would be good. Whenever that GPU buffer is full, we copy the data to the host. Optionally, the data could then also be written directly to disk. This would allow recording a lot of data even with little RAM (see the second sketch after this list).
  3. Transpose the 2D monitor arrays in GPU memory, such that writing the state variables of all recorded neurons in a single time step is coalesced. This also requires modifying the loop in object.cu that writes the data to disk, so that the written format stays unchanged (for Brian to read it back correctly). That basically needs another transpose, I guess (see the kernel sketch at the end of this issue).
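For 1., here is a minimal sketch of the pre-allocation idea (struct and member names are hypothetical, not the actual brian2cuda code). Since the monitor's clock and the number of recorded indices are known before the run, the total number of samples can be computed up front and allocated as a single flat device vector:

```cpp
// Hypothetical sketch: size the monitor buffer exactly once.
#include <thrust/device_vector.h>

struct StateMonitorBuffer
{
    size_t n_timesteps;  // known from the monitor's clock (duration / dt)
    size_t n_recorded;   // number of recorded neuron indices
    thrust::device_vector<double> data;  // flat 2D array, allocated once

    StateMonitorBuffer(size_t n_timesteps, size_t n_recorded)
        : n_timesteps(n_timesteps),
          n_recorded(n_recorded),
          data(n_timesteps * n_recorded)  // no resizing during the simulation
    {}
};
```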
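For 2., a rough sketch of the buffering scheme, assuming a fixed per-monitor GPU memory budget (all names are hypothetical). The device buffer holds a fixed number of time steps; whenever it fills up, it is flushed to a host-side store, which could just as well append directly to a file on disk:

```cpp
// Hypothetical sketch: fixed-size GPU buffer, flushed to the host when full.
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

struct ChunkedMonitor
{
    size_t n_recorded;
    size_t buffer_timesteps;                   // derived from the memory budget
    size_t filled = 0;                         // time steps currently buffered
    thrust::device_vector<double> gpu_buffer;  // buffer_timesteps * n_recorded
    std::vector<double> host_data;             // or: stream to disk instead

    ChunkedMonitor(size_t n_recorded, size_t gpu_budget_bytes)
        : n_recorded(n_recorded),
          buffer_timesteps(gpu_budget_bytes / (n_recorded * sizeof(double))),
          gpu_buffer(buffer_timesteps * n_recorded) {}

    // Called once per recorded time step, after the kernel wrote one row.
    void after_step()
    {
        if (++filled == buffer_timesteps)
            flush();
    }

    // Device -> host copy of everything buffered so far; a final flush()
    // at the end of the run copies any partially filled buffer.
    void flush()
    {
        size_t n = filled * n_recorded;
        size_t old_size = host_data.size();
        host_data.resize(old_size + n);
        thrust::copy(gpu_buffer.begin(), gpu_buffer.begin() + n,
                     host_data.begin() + old_size);
        filled = 0;
    }
};
```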

For 3., here is the corresponding comment from #201 and #50:

And the global memory writes are not coalesced. Currently we have a 2D data structure of dimensions indices x record_times (a vector of vectors) for each variable monitor, and we fill it in the kernel like this:

    monitor[tid][current_iteration] = ...

For coalesced writes we could just "transpose" the monitor data structure so that we can use:

    monitor[current_iteration][tid] = ...

We might have to reorder the monitor data at the end, though, since the transposed layout might not match the format that Brian expects to read back.

Originally posted by @denisalevi in #50 (comment)
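
To make this concrete, here is a hypothetical kernel sketch (not the actual generated code) using a flat row-major layout of dimensions record_times x indices, i.e. `monitor[current_iteration * n_recorded + tid]`. Consecutive threads then write to consecutive addresses, so the store is coalesced:

```cuda
// Hypothetical sketch: one thread per recorded neuron, coalesced writes.
__global__ void record_state(double* monitor, const double* state,
                             const int* recorded_indices,
                             int n_recorded, int current_iteration)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_recorded)
        return;
    // Row = time step, column = recorded neuron: threads in a warp write
    // adjacent elements of the same row.
    monitor[(size_t)current_iteration * n_recorded + tid] =
        state[recorded_indices[tid]];
}
```

The loop that writes the data to disk would then iterate over recorded neurons in the outer loop and over time steps in the inner loop, reading the flat array with a stride of `n_recorded`, so the on-disk format that Brian reads back stays unchanged. That strided host-side read is the extra "transpose" mentioned above.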