Semantically Constrained Memory Allocation (SCMA) for Embedding in Efficient Recommendation Systems
Description:
This project applies similarity-based shared-memory (SCMA) embeddings to DLRM. It is a modified version of Facebook's DLRM repository (https://github.com/facebookresearch/dlrm).
Implementation
DLRM + SCMA: PyTorch. SCMA is currently supported only for the PyTorch version of DLRM.
dlrm_s_pytorch.py - main script for the DLRM model
lsh_pp_pretraining.py - pretraining script for SCMA; run this file to generate the min-hash table.
min_hash_generator.py - min-hash table generator. Most of the SCMA + DLRM experiments in the paper, including the hyperparameter tuning experiments, use min-hash tables generated by this script.
min_hash_generator_universal_hash.py - min-hash table generator using updated universal hash functions. The 540M-parameter experiment on the Criteo dataset uses a min-hash table generated by this updated script.
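For intuition about what these generator scripts produce, here is a minimal sketch of how a min-hash table mapping categorical values to shared embedding rows can be built. The function name, the universal-hash form, and the way the signature is folded into one row index are assumptions for illustration, not the repository's actual implementation.

```python
import numpy as np

def build_min_hash_table(occurrences, num_rows, num_hash, seed=0):
    """Illustrative sketch (not the repository's script): map each categorical
    value to a shared embedding row via MinHash over the set of datapoints in
    which it occurs. `occurrences` maps value_id -> iterable of datapoint
    indices; `num_rows` is the size of the shared (compressed) table."""
    rng = np.random.default_rng(seed)
    p = 2_147_483_647  # large prime for universal hashes h(x) = (a*x + b) mod p
    a = rng.integers(1, p, size=num_hash)
    b = rng.integers(0, p, size=num_hash)
    table = {}
    for value_id, points in occurrences.items():
        pts = np.asarray(list(points), dtype=np.int64)
        if pts.size == 0:
            continue  # values that never occur get no mapping in this sketch
        # MinHash signature: the minimum hash value under each hash function
        sig = ((a[:, None] * pts[None, :] + b[:, None]) % p).min(axis=1)
        # Fold the signature into a single shared row index
        table[value_id] = int(sig.sum() % num_rows)
    return table
```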
How to run our code?
Prerequisites:
pytorch-nightly (6/10/19)
scikit-learn
numpy
onnx (optional)
pydot (optional)
torchviz (optional)
cuda-dev (optional)
tqdm
Data preparation
Download the dataset from https://www.kaggle.com/c/criteo-display-ad-challenge and put the train.txt and test.txt files in /dlrm_ssm/input/.
Run SCMA embeddings with DLRM:
Note: this code provides a proof of concept of LMA-DLRM. It precomputes the LSH-based mapping and uses it in the code. We are in the process of releasing the efficient version, where the mapping is computed on the fly.
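As a rough illustration of what "precomputes the mapping and uses it in the code" means, here is a minimal sketch, assuming a precomputed array that maps original category IDs to rows of a smaller shared table; the class SharedMemoryEmbedding and its interface are hypothetical, not the repository's API.

```python
import torch
import torch.nn as nn

class SharedMemoryEmbedding(nn.Module):
    """Hypothetical sketch: look up embeddings through a precomputed
    LSH/MinHash mapping into a smaller shared table."""
    def __init__(self, mapping, num_shared_rows, emb_dim):
        super().__init__()
        # mapping[i] = shared row assigned to original category i (precomputed offline)
        self.register_buffer("mapping", torch.as_tensor(mapping, dtype=torch.long))
        self.emb = nn.EmbeddingBag(num_shared_rows, emb_dim, mode="sum")

    def forward(self, indices, offsets):
        # Remap raw categorical indices, then do a normal EmbeddingBag lookup
        return self.emb(self.mapping[indices], offsets)
```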
1. Pre-training
Use tricks/lsh_pp_pretraining.py to generate the min_hash_table for SCMA embeddings. Three hyperparameters are needed:
command line arguments | usage |
---|---|
EMBEDDING | embedding dimension |
NUM_HASH | number of hash functions used to generate min_hash_table |
NUM_PT | number of datapoints used to represent the dataset |
For example:
python tricks/lsh_pp_pretraining.py 128 2 125000
generates a min-hash table with embedding dimension 128, using 2 hash functions and 125k datapoints.
2. Training
Use dlrm_s_pytorch.py to train the DLRM model with SCMA embeddings.
command line arguments
command line arguments | type | usage |
---|---|---|
lsh-emb-flag | - | enable SCMA (SSM) embeddings for DLRM |
lsh-emb-compression-rate | float | 1/expansion rate |
rand-emb-flag | - | enable HashNet embedding for DLRM |
rand-emb-compression-rate | float | 1/expansion rate |
arch-sparse-feature-size | string | we use "13-512-256-128" which is the default value for DLRM |
arch-mlp-bot | string | we use "1024-1024-512-256-1" which is the default value for DLRM |
data-generation | string | use "dataset" to run our code, default is "random" |
data-set | string | use "kaggle" to run our code |
raw-data-file | string | please use "./input/train.txt" |
processed-data-file | string | please use "./input/kaggleAdDisplayChallenge_processed.npz" |
loss-function | string | please use "bce" |
round-targets | boolean | please use "True" |
learning-rate | double | default learning rate is 0.1 |
mini-batch-size | int | the batch size is set to 2048 in our experiments |
nepochs | int | the number of epochs is set to 15 in our experiments |
print-freq | int | the frequency to print log |
print-time | - | print timing information in the log |
test-mini-batch-size | int | the batch size for test, please use 16384 to run our code |
test-num-workers | int | we use 16 for our experiments |
A sample run of the DLRM with SCMA (SSM) embedding code
A sample script that runs our code is in:
bench/dlrm_s_criteo_kaggle.sh
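For illustration only (the bench script above is the reference), a direct invocation assembling the flags from the table might look like the following; the compression rate 0.5 is just an example value, and the architecture flags from the table would be added analogously:
python dlrm_s_pytorch.py --lsh-emb-flag --lsh-emb-compression-rate=0.5 --data-generation=dataset --data-set=kaggle --raw-data-file=./input/train.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --nepochs=15 --test-mini-batch-size=16384 --test-num-workers=16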
Run HashNet embeddings with DLRM:
1. Install HashNet
cuda-dev is a prerequisite for our implementation of HashNet embeddings.
python tricks/hashedEmbeddingBag/setup.py install
installs HashNet.
We also have a separate HashNet repo here: https://github.com/wanmeihuali/HashedEmbeddingBag
2. Run DLRM with HashNet embedding
Use --rand-emb-flag in bench/dlrm_s_criteo_kaggle.sh to enable HashNet embeddings, and use --rand-emb-compression-rate to set the compression rate (1 / expansion rate). Other settings can stay the same as for SCMA (SSM) embeddings.
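For intuition, here is a minimal sketch of the hashing-trick idea behind HashNet-style embeddings, where each (category, column) weight is read from a small shared parameter vector through a cheap hash so that no full table is stored. The class HashedEmbedding, the multiply-mod hash, and all parameter names are illustrative assumptions, not the interface of tricks/hashedEmbeddingBag; the size of the shared vector plays the role that --rand-emb-compression-rate controls.

```python
import torch
import torch.nn as nn

class HashedEmbedding(nn.Module):
    """Hypothetical sketch of HashNet-style (hashing-trick) embeddings:
    weights are read from a small shared vector, so no full table is stored."""
    def __init__(self, emb_dim, compressed_size, seed=0):
        super().__init__()
        self.compressed_size = compressed_size
        self.compressed = nn.Parameter(0.01 * torch.randn(compressed_size))
        g = torch.Generator().manual_seed(seed)
        # One random multiplier per embedding column for a simple multiply-mod hash
        self.register_buffer("a", torch.randint(1, 2**31 - 1, (emb_dim,), generator=g))

    def forward(self, indices):
        # Hash each (category index, column) pair into the shared vector
        cols = torch.arange(self.a.numel(), device=indices.device)
        h = (indices.unsqueeze(-1) * self.a + cols) % self.compressed_size
        return self.compressed[h]
```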
License
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.