OSError: Memory mapping file failed: Cannot allocate memory
Hi, thanks for the awesome tool!
It is so convenient. I have used it to dedup several OSCAR datasets.
However, when I tried to dedup the Hindi dataset, it failed; I don't know if that is because it is larger than the other, smaller languages I tried. It gives the following error:
File "~/conda/envs/mv2t/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Memory mapping file failed: Cannot allocate memory
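In case it helps, the traceback points at pyarrow's memory_map call on the cached Arrow files, so a minimal check like the sketch below (the cache path is just a guess at my setup) should show whether that call already fails outside of text_dedup:

# Sketch: try to memory-map the cached Arrow files directly, to see
# whether the plain pyarrow call already fails outside of text_dedup.
# The cache location is an assumption; adjust it to the real cache_dir.
import glob
import os
import pyarrow as pa

for path in glob.glob(os.path.expanduser("~/.cache/**/*.arrow"), recursive=True):
    try:
        with pa.memory_map(path) as f:  # same call as in datasets/table.py above
            print(f"mapped {path}: {f.size()} bytes")
    except OSError as err:
        print(f"failed on {path}: {err}")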
Any feedback would be appreciated!
Thank you!
Thanks for opening the issue, and glad you find the tool useful.
It does look like a memory issue, since the memory mapping happens in the data-loading stage. If you don't mind, could you share some code so I can reproduce it? If that's not possible, could you share your hardware specs (mainly memory size) and data stats (physical size, etc.) instead?
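Something like the rough sketch below would already be enough; it uses only the standard library, and the cache path is an assumption, so point it at the cache_dir you used:

# Rough sketch to collect the two numbers asked for above: total RAM
# and the on-disk size of the Arrow cache produced for this dataset.
import glob
import os

ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
cache_bytes = sum(
    os.path.getsize(p)
    for p in glob.glob(os.path.expanduser("~/.cache/**/*.arrow"), recursive=True)
)
print(f"physical RAM: {ram_bytes / 2**30:.1f} GiB")
print(f"arrow cache : {cache_bytes / 2**30:.1f} GiB")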
Hi,
Thanks for answering so promptly.
Apparently this only occurs when I am using my server in interactive mode, where I guess the memory is very limited (a quick check of the session limits is sketched after the command). I used the example code:
python -m text_dedup.minhash --path oscar-corpus/OSCAR-2301 \
--name ${LANG} --cache_dir "~/.cache" --split "train" \
--column "text" --batch_size 10000 --output output/${LANG} --use_auth_token true
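To check whether the interactive session is really the culprit, here is a small sketch that prints the per-process memory limits of the current session; mmap often fails with "Cannot allocate memory" when the virtual address-space limit is lower than the file being mapped, even if physical RAM would be enough:

# Sketch: print the memory limits imposed on the current (interactive) process.
import resource

def fmt(value):
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value / 2**30:.1f} GiB"

for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_RSS"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")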
I deduped OSCAR data in dozens of languages; only for Chinese and Japanese did I get a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: unexpected end of data, in case you want to look into it. A tiny repro of the message is below.
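The exact message can be reproduced by decoding a multi-byte UTF-8 sequence that has been cut off partway, so my guess (only a guess) is that text is being sliced at the byte level somewhere; 0xe6 is the lead byte of many 3-byte CJK sequences:

# '日' encodes to b'\xe6\x97\xa5'; keeping only the first byte and decoding
# raises the same error as reported above.
truncated = "日".encode("utf-8")[:1]  # b'\xe6'
truncated.decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0:
# unexpected end of data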
Thanks again :)
@siebeniris Thanks! I will look into the decode error in the next few days.