allenai/dolma

dtype option is not working as expected


I'm trying to tokenize with a vocabulary that doesn't fit in uint16 by setting dtype to uint32 when running the dolma tokens command. However, the dtype option does not seem to be propagated correctly.

Command:
dolma tokens --documents output/*/ --destination dataset --tokenizer.name_or_path meta-llama/Meta-Llama-3-8B --tokenizer.bos_token_id 128000 --dtype uint32

When I try to read the generated file using numpy.memmap, I get a ValueError:

import numpy as np
indices = np.memmap("dataset/part-0-00000.npy", mode="r+", dtype=np.uint32)

Error message: ValueError: Size of available data is not a multiple of the data-type size.

Sometimes, I'm able to read the file, but the output contains junk values with token IDs exceeding the vocab size of 128256.
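
Both symptoms would make sense if the writer fell back to a 2-byte dtype (e.g. uint16) despite the flag: the byte count of such a file is only a multiple of 4 when it holds an even number of tokens, and even when the memmap succeeds, each uint32 element packs two unrelated 2-byte token IDs, producing values far above the vocab size. A minimal check along those lines (just a sketch, using the file path from above):

import os
import numpy as np

path = "dataset/part-0-00000.npy"  # the shard from the memmap call above
n_bytes = os.path.getsize(path)

# uint32 needs the byte count to be a multiple of 4; a 2-byte dtype only
# satisfies that when the shard holds an even number of tokens, which would
# explain why the memmap read sometimes works and sometimes raises ValueError
print(n_bytes, "bytes; multiple of 4:", n_bytes % 4 == 0)

if n_bytes % 4 == 0:
    as_u32 = np.memmap(path, mode="r", dtype=np.uint32)
    # values far above the 128256-token vocab suggest each element packs two
    # 2-byte token IDs, i.e. the file was not actually written as uint32
    print("max value read as uint32:", int(as_u32.max()))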

Config:
Here is the full config, as printed with the --dryrun option:

batch_size: 10000
debug: false
destination: dataset
documents:
- output/A/
- output/B/
- output/C/
dryrun: true
dtype: uint32
files_per_process: null
max_size: 1073741824
processes: 1
ring_size: 8
sample_ring_prop: false
seed: 3920
tokenizer:
  bos_token_id: 128000
  eos_token_id: null
  name_or_path: meta-llama/Meta-Llama-3-8B
  pad_token_id: null
  segment_before_tokenization: false
tokenizer_name_or_path: null
work_dir:
  input: null
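
Since reading sometimes works and sometimes doesn't, presumably depending on the shard, here is a small sketch (assuming the shards sit directly under the destination directory, as in the path above) that runs the same check over all of them:

import glob
import os
import numpy as np

# check every shard written to the destination directory from the command above
for path in sorted(glob.glob("dataset/*.npy")):
    n_bytes = os.path.getsize(path)
    aligned = n_bytes % 4 == 0
    msg = f"{path}: {n_bytes} bytes, uint32-aligned={aligned}"
    if aligned:
        # token IDs above 128256 indicate the bytes on disk are not uint32 IDs
        msg += f", max id as uint32={int(np.memmap(path, mode='r', dtype=np.uint32).max())}"
    print(msg)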

Since the dry-run output above does show dtype: uint32, the flag appears to be parsed correctly; I suspect the value is being lost somewhere between the parsed config and the code that writes the memmap files.

Thank you for the report! This has been fixed in main.