`dolma stat` crashes because the number of bins overflows a Python integer
peterbjorgensen commented
The `NUM_BINS` constant in `python/dolma/core/analyzer.py` is 100,000 by default, and this value causes the `10**NUM_BINS` expression in `FixedBucketsValTracker` to overflow when the result is converted to a float. The `_make_tracker` function does not use the number of bins from the config but uses the constant instead. I guess this is because the counts are summarised down to the configured number of bins at the end?
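The failure mode can be reproduced in isolation. This is a minimal sketch, assuming `NUM_BINS = 100_000` as in `analyzer.py`; Python integers are arbitrary precision, so `10**100_000` itself is fine, but multiplying it by a float forces a float conversion, which overflows:

```python
# Minimal reproduction of the crash seen in binning.py's add():
# k = int(m * self.n), where self.n is 10**NUM_BINS.
n = 10 ** 100_000   # arbitrary-precision int, no problem on its own
m = 0.5             # a float, e.g. a mantissa from math.frexp

try:
    k = int(m * n)  # float * int converts n to float first -> OverflowError
except OverflowError as exc:
    print(exc)      # matches the traceback: "int too large to convert to float"
```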
I have tried using the `InferBucketsValTracker` instead, and it seems to work. However, the `bins` array in the results is sometimes one element larger than the `counts` array, which is expected if `bins` holds the bin edges, but sometimes `bins` and `counts` have the same length, so I am not sure what `bins` in the final result represents.
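For reference, the edge convention that would explain the off-by-one can be sketched as follows. This is a hypothetical minimal histogram for illustration, not dolma's implementation: with N bins there are N counts but N + 1 edges.

```python
# Hypothetical sketch of the usual "bins are edges" convention:
# num_bins counts, num_bins + 1 edges.
def histogram(values, num_bins, lo, hi):
    width = (hi - lo) / num_bins
    edges = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:
        # clamp values on the top edge into the last bin
        i = min(int((v - lo) / width), num_bins - 1)
        counts[i] += 1
    return counts, edges

counts, edges = histogram([1.0, 2.0, 2.5, 3.0], num_bins=4, lo=0.0, hi=4.0)
assert len(edges) == len(counts) + 1  # the +1 seen in some results
```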
dolma stat --attributes "mC4_da/attributes/v0tags/*.json.gz" --bins 100 --processes 12 --report v0tags_report2
attributes:
- mC4_da/attributes/v0tags/*.json.gz
bins: 100
debug: false
processes: 12
regex: null
report: v0tags_report2
seed: 0
work_dir:
input: null
output: null
Found 1,024 files to process
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/home/peter/kode/dolma_clean/python/dolma/core/parallel.py", line 174, in _process_single_and_save_status
cls.process_single(
File "/home/peter/kode/dolma_clean/python/dolma/core/analyzer.py", line 120, in process_single
trackers.setdefault(f"{attr_name}/score", _make_tracker()).add(score)
File "/home/peter/kode/dolma_clean/python/dolma/core/binning.py", line 245, in add
k = int(m * self.n), e
~~^~~~~~~~
OverflowError: int too large to convert to float
"""