Example implementation of the DENSITY sampler proposed in "How to Train Data-Efficient LLMs" (Sachdeva et al., 2024).
A first pass over the dataset is performed to construct a kernel density estimator (KDE) for the data (the `sketch`):

- Use `R` independent hash functions to hash the input data (embeddings of training data documents).
- Use a `sketch` matrix of shape `(R, B)` to count how many times each hash value appears for a particular hash function over the training data (see the sketch-construction example after this list).
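The text does not pin down the hash family, but given the Datar et al. (2004) citation, p-stable (L2) LSH is a natural choice. Below is a minimal sketch of the first pass under that assumption; `R`, `B`, the bin width `r`, and the embedding dimension `d` are illustrative values, not taken from the paper.

```python
import numpy as np

# Illustrative hyperparameters (assumptions, not from the paper).
R = 16     # number of independent hash functions (rows of the sketch)
B = 4096   # number of hash buckets per function (columns of the sketch)
r = 1.0    # quantization bin width for the p-stable LSH
d = 768    # embedding dimensionality of the training documents

rng = np.random.default_rng(0)

# p-stable (L2) LSH per Datar et al. (2004): h(x) = floor((w . x + b) / r),
# taken modulo B so each hash indexes a fixed-width sketch row.
W = rng.standard_normal((R, d))   # one Gaussian projection per hash function
b = rng.uniform(0.0, r, size=R)   # one random offset per hash function

def hash_all(x):
    """Return the R bucket indices for embedding x (one per hash function)."""
    return np.floor((W @ x + b) / r).astype(np.int64) % B

def build_sketch(embeddings):
    """First pass: count hash-value occurrences in an R x B sketch."""
    sketch = np.zeros((R, B), dtype=np.int64)
    for x in embeddings:
        sketch[np.arange(R), hash_all(x)] += 1   # one increment per row
    return sketch
```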
A second pass over the data computes a score/weight for each data point (see the scoring example after this list):

- Initialize an empty list of `scores`.
- Initialize `score = 0` for each data point.
- For a given data point, hash it using the same `R` hash functions.
- For each hash function, look up the count stored at the hash value's index in the `sketch`.
- Add the count from each hash function to `score` to compute a sum of counts for the data point.
- Instead of averaging `score` over the `R` hash functions to get the final score, normalize it by the total count of the sketch matrix: `score / (R * N)`. Note that `np.sum(sketch)` equals `N * R`, since each of the `N` data points increments exactly one counter in each of the `R` rows.
- Append the data point's score to the list of `scores`.
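A matching sketch of the second pass, reusing `hash_all` and the hyperparameters from the block above; the loop body follows the steps in the list:

```python
def compute_scores(embeddings, sketch):
    """Second pass: normalized KDE score for every data point."""
    N = len(embeddings)
    scores = []
    for x in embeddings:
        idx = hash_all(x)                        # same R hash functions as pass one
        score = sketch[np.arange(R), idx].sum()  # sum of counts over the R rows
        scores.append(score / (R * N))           # normalize; np.sum(sketch) == R * N
    return scores
```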
Finally, calculate the sampling weights by inverse propensity (i.e., take the reciprocal of each score):

```python
weights = 1 / np.array(scores)
```
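Putting the passes together, one possible end-to-end usage, assuming an `(N, d)` NumPy array `embeddings` of document embeddings and the helpers sketched above (the 10% sampling rate is purely illustrative):

```python
sketch = build_sketch(embeddings)
scores = compute_scores(embeddings, sketch)

weights = 1 / np.array(scores)    # inverse propensity
probs = weights / weights.sum()   # normalize into a sampling distribution

# Draw a data-efficient subsample without replacement, e.g. 10% of the corpus.
N = len(embeddings)
idx = np.random.default_rng(1).choice(N, size=N // 10, replace=False, p=probs)
subset = embeddings[idx]
```

Because the weights are reciprocals of density scores, low-density points receive larger sampling probabilities, biasing the subsample toward under-represented regions of the embedding space.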
"How to Train Data-Efficient LLMs" (Sachdeva et al., 2024)
"Locality-Sensitive Hashing Scheme Based on p-Stable Distributions" (Datar et al., 2004)
"Implementing LSH Functions in Tensorflow" (Benjamin Coleman)
Coleman, Benjamin, et al. "One-pass diversified sampling with application to terabyte-scale genomic sequence streams." International Conference on Machine Learning. PMLR, 2022.