harvard-acc/DeepRecSys

Can these implemented models represent real scenarios?

Closed this issue · 4 comments

I found that in the config files, the embedding table sizes of RMC-1, RMC-2, and RMC-3 are 4GB, 4GB, and 2GB, which are relatively small, since we are told that embedding tables are usually on the order of tens of GBs. (Actually, I am also a bit confused because, in another of your papers, you said the sizes should be 100MB, 10GB, and 1GB, which is still far from large.) This makes me wonder whether results on these small models carry over to the large models used in industry.
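For context, my rough back-of-envelope for those sizes looks like the following; the table count, row count, and dimension here are placeholders I made up, not the actual RMC settings:

```python
# Rough embedding-memory estimate for a recommendation model.
# All numbers below are hypothetical placeholders, not the real RMC-1/2/3 configs.

def embedding_bytes(num_tables, rows_per_table, dim, bytes_per_elem=4):
    """Total bytes held by the embedding tables (fp32 by default)."""
    return num_tables * rows_per_table * dim * bytes_per_elem

# e.g. 8 tables x 4M rows x 32-dim fp32 vectors ~= 4 GB
size = embedding_bytes(num_tables=8, rows_per_table=4_000_000, dim=32)
print(f"{size / 1e9:.2f} GB")
```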

I know you have tested this in Facebook's production environment and things worked well, but since we do not have access to those real systems, I would appreciate it if you could share any insights on this. Thanks!

Hi @EtoDemerzel0427, thanks for your great question. The configuration files provided are examples for the experiments in the paper. One of the reasons the configuration files represent smaller models is that, to study the impact of co-locating models, DeepRecSys uses multiple processes. As such, model parameters are not shared across processes, and given the limited DRAM capacity of the systems studied, we used models on the smaller end. The impact of irregular memory accesses in embedding operations on caches and DRAM should hold despite the smaller configurations.

As you point out, industry-scale production models may be larger and are evolving rapidly. To model these larger models, you should be able to increase the number of sparse features (i.e., embedding tables, from tens to hundreds), the number of vectors per table (e.g., up to tens or hundreds of millions of entries), and the latent dimension of the vectors (e.g., from 32 to 64 to 128) in the configuration files. Depending on the use case(s) you wish to target, you may also consider studying one-hot versus multi-hot encoded lookups.
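As an illustration only, a scaled-up model description along those lines might look like the sketch below; the field names and values are placeholders, not DeepRecSys's actual configuration keys:

```python
# Hypothetical sketch of a scaled-up model configuration.
# Field names and values are illustrative; map them onto the actual
# DeepRecSys config file format for your experiments.
scaled_up_model = {
    "num_embedding_tables": 100,        # sparse features: tens -> hundreds
    "rows_per_table": 50_000_000,       # tens to hundreds of millions of entries
    "embedding_dim": 128,               # latent dimension: 32 -> 64 -> 128
    "lookups_per_table": 20,            # >1 lookup models multi-hot encoded features
    "bottom_mlp": [512, 256, 128],      # dense-feature MLP layer sizes
    "top_mlp": [1024, 512, 256, 1],     # prediction MLP layer sizes
}

# Approximate fp32 embedding footprint implied by this configuration:
bytes_fp32 = (scaled_up_model["num_embedding_tables"]
              * scaled_up_model["rows_per_table"]
              * scaled_up_model["embedding_dim"] * 4)
print(f"~{bytes_fp32 / 1e12:.1f} TB of embedding parameters")
```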

Some other recent work you may want to look at:

  1. https://arxiv.org/pdf/2104.05158.pdf
  2. https://arxiv.org/pdf/2011.05497.pdf

@alugupta Thank you for the almost instant response. Yes, I had guessed these were toy examples, but I got confused after seeing that the configs were identical to Table I in the DeepRecSys paper. Thank you for the clarification.

I still have a few questions:

  1. When you say "co-locating models", I think you are referring to https://arxiv.org/pdf/1906.03109.pdf? As I remember, DeepRecSys didn't discuss this topic. So does that mean the aforementioned 100MB, 10GB, and 1GB sizes are also toy examples?

  2. Another follow-up question: when increasing the model size, how should I configure the ratio between the MLP and embedding table sizes? In https://arxiv.org/pdf/1906.03109.pdf, you normalized the sizes per model, which leaves the relationship between the MLP size and embedding table size unclear. I have no idea how this ratio will impact performance.

Finally, thank you again for sharing these papers; although I am more interested in inference, I believe they will give me some insights too.

@EtoDemerzel0427

  1. DeepRecSys does consider "co-locating" models. You can configure the number of parallel models processed concurrently on the CPU by setting the number of inference engines, which is crucial for achieving high latency-bounded throughput. By configuring the per-core batch size (or the number of sub-batches to split each query into), each sub-batch is processed on a separate core running one of the inference engines/models; see the sketch after this list.
  2. The ratio of MLP sizes and embedding table sizes may be use-case and dataset dependent. I would suggest taking a look at some of the industry publications (e.g., the ones in my previous reply) as a guide, as well as open-source datasets such as MovieLens and Criteo (https://github.com/mlcommons/training/tree/master/recommendation).
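To make the query-splitting pattern from point 1 concrete, here is a minimal sketch; it uses plain Python multiprocessing as a stand-in for DeepRecSys's inference-engine processes, and the constants and function names are arbitrary placeholders:

```python
# Illustrative sketch (not the DeepRecSys code) of splitting a query into
# per-core sub-batches, each handled by a separate inference-engine process.
from multiprocessing import Pool

NUM_INFERENCE_ENGINES = 4   # parallel model instances co-located on the CPU
QUERY_SIZE = 256            # items to rank in one query
SUB_BATCH = QUERY_SIZE // NUM_INFERENCE_ENGINES

def inference_engine(sub_batch):
    """Stand-in for one model instance scoring its slice of the query."""
    return [0.5 for _ in sub_batch]   # dummy click-through-rate predictions

if __name__ == "__main__":
    query = list(range(QUERY_SIZE))
    sub_batches = [query[i:i + SUB_BATCH] for i in range(0, QUERY_SIZE, SUB_BATCH)]
    with Pool(processes=NUM_INFERENCE_ENGINES) as pool:
        results = pool.map(inference_engine, sub_batches)   # one sub-batch per engine
    scores = [s for r in results for s in r]
    print(len(scores))   # all predictions reassembled from the sub-batches
```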

Hope this helps.

Thanks, your answer resolved my questions well.