ShopRunner/collie

Add architecture that allows embeddings to be stored on the CPU but training done on the GPU

nathancooperjones opened this issue · 0 comments

Is your feature request related to a problem? Please describe.


When we have many users and/or items, the size of the embedding tables quickly increases the amount of GPU memory we consume, leading to an OOM error before training even starts.

Describe the solution you'd like

Ideally, the embedding tables would instead live on the CPU and, during training, be indexed into so that only the subset of embeddings needed for a batch is brought to the GPU for the model's forward and backward passes. Then, after the batch, that subset of embeddings is released from GPU memory.
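A minimal sketch of what this could look like in PyTorch (the class and parameter names here are hypothetical, not existing collie APIs): the table is a regular `nn.Embedding` that stays on the CPU with `sparse=True`, the batch's rows are looked up on the CPU and copied to the GPU, and autograd routes gradients back through the copy so the optimizer updates only the touched rows on the CPU:

```python
import torch
from torch import nn


class CPUEmbedding(nn.Module):
    """Hypothetical sketch: the embedding table lives in CPU RAM; only the
    rows needed for the current batch are copied to the GPU.

    Autograd backprops through the ``.to()`` copy, so gradients accumulate
    in the CPU-resident weight and the optimizer step runs on the CPU.
    """

    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        # ``sparse=True`` so the backward pass only produces gradients for
        # the rows that were actually looked up, not the full table.
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, sparse=True)

    def forward(self, ids: torch.Tensor, device: torch.device) -> torch.Tensor:
        # Index on the CPU, then ship only that slice of rows to the GPU.
        return self.embedding(ids.cpu()).to(device)


# Usage: the rest of the model lives on the GPU, the table does not.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
table = CPUEmbedding(num_embeddings=10_000_000, embedding_dim=64)
optimizer = torch.optim.SparseAdam(table.parameters())

ids = torch.randint(0, 10_000_000, (1024,))
batch_embeddings = table(ids, device)  # only 1024 rows on the GPU
loss = batch_embeddings.sum()          # stand-in for the real model + loss
loss.backward()                        # sparse gradient flows back to the CPU table
optimizer.step()                       # CPU-side update of only the touched rows
```

The GPU copy of the rows is freed once the batch's autograd graph is released, so peak GPU memory scales with the batch size rather than the table size; the tradeoff is a CPU-to-GPU transfer on every batch.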

Describe alternatives you've considered

It's also possible to accomplish this by splitting the embeddings up across many GPUs and using a model-parallel, multi-GPU solution, but the approach outlined above allows a scalable model on a single node, which may be more desirable to more users of the library.
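For reference, that model-parallel alternative might look something like the (hypothetical) row-wise sharding sketch below, where each lookup is routed to the GPU that owns the row:

```python
import torch
from torch import nn


class ShardedEmbedding(nn.Module):
    """Hypothetical sketch of the model-parallel alternative: the table is
    split row-wise across several GPUs and each lookup is routed to the
    shard that owns the row."""

    def __init__(self, num_embeddings: int, embedding_dim: int, devices: list):
        super().__init__()
        self.devices = devices
        self.shard_size = -(-num_embeddings // len(devices))  # ceiling division
        self.shards = nn.ModuleList(
            nn.Embedding(self.shard_size, embedding_dim).to(device)
            for device in devices
        )

    def forward(self, ids: torch.Tensor, out_device: torch.device) -> torch.Tensor:
        ids = ids.cpu()
        out = torch.empty(
            ids.shape[0], self.shards[0].embedding_dim, device=out_device
        )
        shard_idx = torch.div(ids, self.shard_size, rounding_mode='floor')
        for i, (shard, device) in enumerate(zip(self.shards, self.devices)):
            mask = shard_idx == i
            if mask.any():
                # Translate global ids to shard-local ids, look up on the
                # owning GPU, and gather results onto the output device.
                local_ids = (ids[mask] - i * self.shard_size).to(device)
                out[mask] = shard(local_ids).to(out_device)
        return out


# Usage (assumes at least two GPUs):
# table = ShardedEmbedding(10_000_000, 64,
#                          [torch.device('cuda:0'), torch.device('cuda:1')])
```

This caps per-GPU memory at roughly the table size divided by the number of shards, but it requires a multi-GPU node, which is the main reason the CPU-resident approach above seems more broadly useful.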

Additional context

See here for some related discussion on this.