Example implementation of the DENSITY sampler proposed in "How to Train Data-Efficient LLMs" (Sachdeva et al., 2024).
A first pass over the dataset is performed to construct a kernel density estimator (KDE) for the data (the `sketch`):

- Use `R` independent hash functions to hash the input data (embeddings of training data documents).
- Use a `sketch` matrix of shape `(R, B)` to count how many times each hash value appears for a particular hash function over the training data (see the sketch-construction example after this list).
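The text does not pin down the hash family, but given the Datar et al. (2004) citation, p-stable (L2) LSH is a natural choice. Below is a minimal sketch of the first pass under that assumption; `R`, `B`, the bin width `r`, and the embedding dimension `d` are illustrative values, not taken from the paper.

```python
import numpy as np

# Illustrative hyperparameters (assumptions, not from the paper).
R = 16     # number of independent hash functions (rows of the sketch)
B = 4096   # number of hash buckets per function (columns of the sketch)
r = 1.0    # quantization bin width for the p-stable LSH
d = 768    # embedding dimensionality of the training documents

rng = np.random.default_rng(0)

# p-stable (L2) LSH per Datar et al. (2004): h(x) = floor((w . x + b) / r),
# taken modulo B so each hash indexes a fixed-width sketch row.
W = rng.standard_normal((R, d))   # one Gaussian projection per hash function
b = rng.uniform(0.0, r, size=R)   # one random offset per hash function

def hash_all(x):
    """Return the R bucket indices for embedding x (one per hash function)."""
    return np.floor((W @ x + b) / r).astype(np.int64) % B

def build_sketch(embeddings):
    """First pass: count hash-value occurrences in an R x B sketch."""
    sketch = np.zeros((R, B), dtype=np.int64)
    for x in embeddings:
        sketch[np.arange(R), hash_all(x)] += 1   # one increment per row
    return sketch
```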
A second pass over the data computes a score/weight for each data point (see the scoring example after this list):

- Initialize an empty list of `scores`.
- Initialize `score = 0` for each data point.
- For a given data point, hash it using the same `R` hash functions.
- For each hash function, look up the count stored at the hash value's index in the `sketch`.
- Add the count from each hash function to `score` to compute a sum of counts for the data point.
- Instead of averaging `score` over the `R` hash functions to get the final score, normalize it by the total count of the sketch matrix: `score / (R * N)`. Note that `np.sum(sketch)` equals `N * R`, since each of the `N` data points increments exactly one counter in each of the `R` rows.
- Append the data point's score to the list of `scores`.
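A matching sketch of the second pass, reusing `hash_all` and the hyperparameters from the block above; the loop body follows the steps in the list:

```python
def compute_scores(embeddings, sketch):
    """Second pass: normalized KDE score for every data point."""
    N = len(embeddings)
    scores = []
    for x in embeddings:
        idx = hash_all(x)                        # same R hash functions as pass one
        score = sketch[np.arange(R), idx].sum()  # sum of counts over the R rows
        scores.append(score / (R * N))           # normalize; np.sum(sketch) == R * N
    return scores
```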
Finally, calculate the sampling weights by inverse propensity (i.e., take the reciprocal of each score):

```python
weights = 1 / np.array(scores)
```
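Putting the passes together, one possible end-to-end usage, assuming an `(N, d)` NumPy array `embeddings` of document embeddings and the helpers sketched above (the 10% sampling rate is purely illustrative):

```python
sketch = build_sketch(embeddings)
scores = compute_scores(embeddings, sketch)

weights = 1 / np.array(scores)    # inverse propensity
probs = weights / weights.sum()   # normalize into a sampling distribution

# Draw a data-efficient subsample without replacement, e.g. 10% of the corpus.
N = len(embeddings)
idx = np.random.default_rng(1).choice(N, size=N // 10, replace=False, p=probs)
subset = embeddings[idx]
```

Because the weights are reciprocals of density scores, low-density points receive larger sampling probabilities, biasing the subsample toward under-represented regions of the embedding space.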
"How to Train Data-Efficient LLMs" (Sachdeva et al., 2024)
"Locality-Sensitive Hashing Scheme Based on p-Stable Distributions" (Datar et al., 2004)
"Implementing LSH Functions in Tensorflow" (Benjamin Coleman)
Coleman, Benjamin, et al. "One-pass diversified sampling with application to terabyte-scale genomic sequence streams." International Conference on Machine Learning. PMLR, 2022.