Check whether the test set of your data is duplicated in your pretraining data ☠️ If your test set is small, this should be a bit easier than checking your training dataset for duplicates: you only have to index the test set.
See query_infinigram.py.
Run with
pdm install
pdm run query_infinigram.py
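
The check described above boils down to looking up each test example in the pretraining data and flagging any that appear. Below is a minimal sketch of that loop, not the contents of query_infinigram.py. The names `find_contaminated` and `make_local_counter` are hypothetical, and the counter here is a plain in-memory substring count standing in for a real index query (for example, a count query against an infini-gram index).

```python
from typing import Callable, Iterable, List, Tuple


def make_local_counter(corpus: str) -> Callable[[str], int]:
    """Hypothetical stand-in for an index lookup.

    Returns a function that counts non-overlapping exact occurrences
    of a query string in an in-memory corpus.
    """
    def count(query: str) -> int:
        return corpus.count(query)
    return count


def find_contaminated(
    examples: Iterable[str],
    count_fn: Callable[[str], int],
    min_count: int = 1,
) -> List[Tuple[str, int]]:
    """Return (example, count) pairs for test examples found in the corpus.

    count_fn is any callable mapping a query string to its occurrence
    count; swapping in a real index query is assumed to work the same way.
    """
    hits = []
    for text in examples:
        n = count_fn(text)
        if n >= min_count:
            hits.append((text, n))
    return hits


if __name__ == "__main__":
    corpus = "the cat sat on the mat. the dog barked."
    test_set = ["the cat sat on the mat", "a bird sang"]
    print(find_contaminated(test_set, make_local_counter(corpus)))
```

Passing the counter in as a function keeps the loop independent of where the counts come from, so the same code can be tried on a toy corpus before pointing it at a real index. Exact-match counting misses near-duplicates (reformatted or paraphrased test items), so a clean result here is not proof of no contamination.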