allenai/dolma

Deduplication / Decontamination

Hi,

dolma is a wonderful tool, and I'm successfully using it for many steps of my pipeline.

Strangely, I can get it working for (paragraph-level) deduplication. However, when I apply it in a similar setting for decontamination, it never assigns any attributes.

What is the problem?

Compared to the "normal" paragraph deduplication, when I try to just apply an existing bloom filter, there are no dedupe attributes in the resulting attribute files. I have already experimented with the desired_false_positive_rate and overlap_threshold parameters, but without any success. The attribute files look like this:

{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL1"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL2"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL3"}
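
To confirm that every record is empty rather than just the first few, the attribute files can be scanned in bulk. This is a minimal sketch, assuming the attribute files are gzipped JSON lines with the layout shown above; the helper name count_flagged is hypothetical and not part of dolma.

```python
import gzip
import json
from pathlib import Path


def count_flagged(attr_dir, attribute_name="paragraphs_bff_duplicates"):
    """Return (total_records, records_with_non_empty_attribute).

    Assumes every *.gz file in attr_dir holds one JSON object per line,
    shaped like {"attributes": {attribute_name: [...]}, "id": "..."}.
    """
    total = 0
    flagged = 0
    for path in sorted(Path(attr_dir).glob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # tolerate blank lines
                record = json.loads(line)
                total += 1
                # An empty list means no paragraph was marked as a duplicate.
                if record.get("attributes", {}).get(attribute_name):
                    flagged += 1
    return total, flagged
```

Calling count_flagged("tmp/v0/attributes/decontaminate") on the output directory from the logs below returns both counts; a second value of 0 means no paragraph was flagged anywhere.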

Information about my setup:

I am using the latest dolma release, 1.0.3. My latest minimal working example is based on configs/dolma-v1_5/decontamination.

Here are my config files. create-bloomfilter.yaml:
documents:
  - benchmarks.jsonl.gz  # these are the files I want to filter with the decontamination step

dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true

bloom_filter:
  read_only: false
  estimated_doc_count: 73543
  #size_in_bytes: 104857  # 100 MB; smaller causes too many FPs
  desired_false_positive_rate: 1e-3  # TOD: 1e-15
  file: decontamination_bloom_filter.bin

processes: 4 
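
As a rough sanity check on the numbers in this config, the textbook Bloom filter sizing formula relates the item count and the false positive rate to a filter size. This is only a sketch: it assumes the standard formula m = -n ln(p) / (ln 2)^2, and I have not verified that dolma derives the size from estimated_doc_count and desired_false_positive_rate in exactly this way. The function names are hypothetical.

```python
import math


def bloom_size_bytes(n_items, fp_rate):
    """Textbook optimal Bloom filter size in bytes for n_items at fp_rate."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


def bloom_num_hashes(n_items, size_bytes):
    """Textbook optimal number of hash functions for a filter of size_bytes."""
    return max(1, round(size_bytes * 8 / n_items * math.log(2)))


# Values taken from create-bloomfilter.yaml above.
size = bloom_size_bytes(73543, 1e-3)
hashes = bloom_num_hashes(73543, size)
print(f"{size} bytes, {hashes} hash functions")
```

Under these assumptions the filter needs on the order of 130 KB. Note that for paragraph-level matching the number of inserted items is the number of paragraphs, not documents, so the real requirement may be larger.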

decontaminate.yaml:

documents:
  - tmp/v0/documents/*.gz

work_dir:
  input: work/para/input
  output: work/para/output

dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 288347 
  desired_false_positive_rate: 1e-3
  file: decontamination_bloom_filter.bin

processes: 3

Here is the output of dolma -c create-bloomfilter.yaml dedupe:

bloom_filter:
  desired_false_positive_rate: 0.001
  estimated_doc_count: 73543
  file: decontamination_bloom_filter.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '

      '
  skip_empty: true
documents:
- benchmarks.jsonl.gz
processes: 4
work_dir:
  input: /tmp/dolma-input-1rmq0gbx
  output: /tmp/dolma-output-ky8van2k
[2024-06-27T12:34:26Z INFO  dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO  dolma::deduper] Skipping "/disk/cschroeder/workspaces/dolma/benchmarks.jsonl.gz" because it already exists
[2024-06-27T12:34:26Z INFO  dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO  dolma::deduper] Bloom filter written.
[2024-06-27T12:34:26Z INFO  dolma::deduper] Done!

And here is the output of dolma -c decontaminate.yaml dedupe:

bloom_filter:
  desired_false_positive_rate: 0.1
  estimated_doc_count: 288347
  file: decontamination_bloom_filter.bin
  read_only: true
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '

      '
  skip_empty: true
documents:
- tmp/v0/documents/*.gz
processes: 3
work_dir:
  input: work/para/input
  output: work/para/output
[2024-06-27T12:38:17Z INFO  dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0000.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0001.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0002.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0003.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Keeping local file "tmp/v0/documents/part-0004.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:22Z INFO  dolma::deduper] Bloom filter written.
[2024-06-27T12:38:22Z INFO  dolma::deduper] Done!

Am I missing something?