decontamination

This repository contains code for removing benchmark data from your training data, to help combat data snooping.

This repository is heavily inspired by the BigCode repository and is mostly a refactoring of their code, the bulk of which was written by Chenghao Mou (awesome work!).

Install

pip install decontamination

How to use

First you need to specify which benchmarks you want to clean your data of. You do this by listing the benchmark names as they appear on Hugging Face's datasets hub. For example, if you want to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:

import os

from datasets import load_dataset
from decontamination.core import BenchmarkCleaner

# set the Hugging Face token via the environment
# (a notebook `!export` runs in a subshell and would not persist)
os.environ["HF_ACCESS_TOKEN"] = "<TOKEN>"

# load your dataset
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

# benchmark dataset names as they appear on the Hugging Face hub
benchmarks = ["openai_humaneval", "lambada"]
cleaner = BenchmarkCleaner(benchmarks, "/tmp/benchmarks", threshold=0.1, num_perm=128)

# clean the dataset, re-checking flagged matches for false positives
cleaned_dataset = cleaner.clean(dataset, column="content", check_for_fp=True)
[01/24/23 00:27:37] INFO     Benchmark datasets already exist. Skipping hashing.                        core.py:181
Checking for false positives...: 100%|██████████| 8636/8636 [00:33<00:00, 261.25it/s]
Checking for false positives...: 100%|██████████| 8805/8805 [06:58<00:00, 21.06it/s]
Checking for false positives...: 100%|██████████| 8722/8722 [06:39<00:00, 21.82it/s]
Filtering duplicates... #0: 100%|██████████| 1/1 [00:00<00:00, 140.36ba/s]
Filtering duplicates... #1: 100%|██████████| 1/1 [00:00<00:00, 123.28ba/s]
...
Filtering duplicates... #31: 100%|██████████| 1/1 [00:00<00:00, 69.66ba/s]
[01/24/23 00:41:50] INFO     Data Number                   : 10000                                      core.py:277
                    INFO     Duplicate Number              : 3932                                       core.py:278
                    INFO     Duplicate Rate                : 39.32%                                     core.py:279
                    INFO     Total Time                    : 853.66 seconds                             core.py:280
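The threshold and num_perm arguments control the MinHash-based near-duplicate detection this code inherits from BigCode's deduplication work: each document is reduced to a MinHash signature built from num_perm permutations, and a training document is flagged when its estimated Jaccard similarity to any benchmark document exceeds threshold, roughly speaking. The sketch below illustrates that general idea with the datasketch library; it is not decontamination's internal code, and the document contents and names in it are made up.

# Minimal sketch of MinHash LSH matching, using datasketch
# (an illustration of the technique, not this library's implementation).
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    # Hash whitespace tokens into a fixed-size MinHash signature.
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

# Index every benchmark document in an LSH table.
benchmark_docs = {"humaneval_0": "def add(a, b):\n    return a + b"}
lsh = MinHashLSH(threshold=0.1, num_perm=128)
for key, doc in benchmark_docs.items():
    lsh.insert(key, signature(doc))

# A training document is flagged if it matches any benchmark document.
train_doc = "def add(a, b): return a + b"
if lsh.query(signature(train_doc)):
    print("flagged: near-duplicate of a benchmark document")

A low threshold such as 0.1 casts a deliberately wide net, which is presumably why the example above passes check_for_fp=True, so candidate matches are re-checked before being filtered out.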