OSError: Memory mapping file failed: Cannot allocate memory
Hi, thanks for the awesome tool!
It is so convenient. I have used it to dedup several OSCAR datasets.
However, when I tried to dedup the Hindi dataset, it failed; I don't know if that is because it is larger than the other, smaller languages I tried. It gives the following error:
File "~/conda/envs/mv2t/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Memory mapping file failed: Cannot allocate memory
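In case it helps, the traceback points at pyarrow's memory_map call on the cached Arrow files, so a minimal check like the sketch below (the cache path is just a guess at my setup) should show whether that call already fails outside of text_dedup:

# Sketch: try to memory-map the cached Arrow files directly, to see
# whether the plain pyarrow call already fails outside of text_dedup.
# The cache location is an assumption; adjust it to the real cache_dir.
import glob
import os
import pyarrow as pa

for path in glob.glob(os.path.expanduser("~/.cache/**/*.arrow"), recursive=True):
    try:
        with pa.memory_map(path) as f:  # same call as in datasets/table.py above
            print(f"mapped {path}: {f.size()} bytes")
    except OSError as err:
        print(f"failed on {path}: {err}")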
Any feedback would be appreciated!
Thank you!
Thanks for opening the issue, and glad you find the tool useful.
It does look like a memory issue, since the memory mapping happens in the data-loading stage. If you don't mind, could you share some code so I can reproduce it? If that's not possible, could you share your hardware specs (mainly memory size) and data stats (physical size, etc.) instead?
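Something like the rough sketch below would already be enough; it uses only the standard library, and the cache path is an assumption, so point it at the cache_dir you used:

# Rough sketch to collect the two numbers asked for above: total RAM
# and the on-disk size of the Arrow cache produced for this dataset.
import glob
import os

ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
cache_bytes = sum(
    os.path.getsize(p)
    for p in glob.glob(os.path.expanduser("~/.cache/**/*.arrow"), recursive=True)
)
print(f"physical RAM: {ram_bytes / 2**30:.1f} GiB")
print(f"arrow cache : {cache_bytes / 2**30:.1f} GiB")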
Hi,
Thanks for answering so promptly.
Apparently this only occurs when I am using my server in interactive mode, where I guess the memory is very limited (a quick check of the session limits is sketched after the command). I used the example code:
python -m text_dedup.minhash --path oscar-corpus/OSCAR-2301 \
--name ${LANG} --cache_dir "~/.cache" --split "train" \
--column "text" --batch_size 10000 --output output/${LANG} --use_auth_token true
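To check whether the interactive session is really the culprit, here is a small sketch that prints the per-process memory limits of the current session; mmap often fails with "Cannot allocate memory" when the virtual address-space limit is lower than the file being mapped, even if physical RAM would be enough:

# Sketch: print the memory limits imposed on the current (interactive) process.
import resource

def fmt(value):
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value / 2**30:.1f} GiB"

for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_RSS"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")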
I deduped OSCAR data in dozens of languages; only for Chinese and Japanese did I get a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: unexpected end of data, in case you want to look into it. A tiny repro of the message is below.
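The exact message can be reproduced by decoding a multi-byte UTF-8 sequence that has been cut off partway, so my guess (only a guess) is that text is being sliced at the byte level somewhere; 0xe6 is the lead byte of many 3-byte CJK sequences:

# '日' encodes to b'\xe6\x97\xa5'; keeping only the first byte and decoding
# raises the same error as reported above.
truncated = "日".encode("utf-8")[:1]  # b'\xe6'
truncated.decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0:
# unexpected end of data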
Thanks again :)
@siebeniris Thanks! I will look into the decode error in the next few days.