emer/axon

Flexible neuron, synapse data layout

Closed this issue · 6 comments

This seems very straightforward, now that Neurons and Syns are all just giant arrays of structs on the Network:

  • Neurons and Syns are just single giant []float32 arrays (likewise on GPU).
  • Vars are just enums (put desc tooltips into separate array of strings)
  • Access is via methods with indexes to layer, neuron, variable, etc., that take Context as the first arg (nearly ubiquitous). In CPU mode, Context has a Network index, which allows access to a global list of networks and thereby to these arrays. On GPU, the accessor methods just access the global arrays defined in the kernel.
  • It is then entirely trivial to reorganize the memory layout any which way.
  • Context is shared between GPU and CPU and can't contain pointers, hence the need for the network index. Could alternatively have a pointer that is meaningless to the GPU -- isn't needed there anyway.
  • Also one of the indexes can be data index so data parallel can be interleaved in sequence.

More context: the GPU has to access global arrays directly, which are allocated for each kernel, so we need global functions that are defined differently on CPU vs. GPU:

func NeuronVarIdx(ctx *Context, neurIdx, dataIdx int32, nvar NeuronVars) int32 {
    return int32(nvar)*ctx.Strides.NeuronVar + neurIdx*ctx.Strides.Neuron + dataIdx*ctx.Strides.NeuronData
}

// CPU version
func NeuronVar(ctx *Context, neurIdx, dataIdx int32, nvar NeuronVars) float32 {
    return Networks[ctx.NetIdx].Neurons[NeuronVarIdx(ctx, neurIdx, dataIdx, nvar)]
}

// GPU version (HLSL)
[[vk::binding(1, 2)]] RWStructuredBuffer<float> Neurons;
float NeuronVar(in Context ctx, int neurIdx, int dataIdx, int nvar) {
    return Neurons[NeuronVarIdx(ctx, neurIdx, dataIdx, nvar)];
}

Have a separate SetNeuronVar, and versions that go via layer if needed (layer has starting global neuron index), etc.

Can just switch out NeuronVarIdx functions to see impact of different layouts.

The non-float32 vars (flags, indexes) in Neuron would be stored separately, so type conversions don't complicate anything. Also, SynCa would be separated from Synapses memory, per #168

Context requires data parallel state for all the PVLV, NeuroMod stuff.

Plan: add a GlobalVars enum in globals.go, with its memory stored in Network, holding all the NeuroMod and PVLV state. Can parameterize with NDrives, with computed offsets etc. for flexible storage; maybe have an offset lookup table for each var (above a given enum value). Data-parallel inner-loop indexing for each var. GPU just exposes the Globals directly as usual.

This worked as expected and massively improves GPU performance, and even CPU performance is significantly improved in NData > 1 cases.