nanoporetech/pod5-file-format

Pod5 convert fast5 killed midway.

Closed this issue · 16 comments

Hi, I am trying to convert some fast5 files to pod5. The command checks the files and starts converting them, but it crashes after a while.

pod5 convert fast5 fast5/*.fast5 -r -o pod5/ -O fast5/
Converting 886 Fast5s:   3%|##1                                                                | 111250/3543065 [00:47<22:58, 2488.94Reads/s]
Killed

I get the same error. It seems to get killed by an out-of-memory error. (I'm currently testing this on a 32GB VM.)

martin@pod5-instance:~$ time pod5 convert fast5 -t 2 testsample/*.fast5 --output testsample.pod5
Converting 476 Fast5s:   0%|1                                                                                                                                                | 1500/1902741 [00:03<1:01:10, 517.97Reads/s]

Converting 476 Fast5s:  24%|##################################7                                                                                                            | 462750/1902741 [16:48<1:34:07, 254.97Reads/s]
Killed

real    21m19.411s
user    1m6.860s
sys     1m19.523s
martin@pod5-instance:~$ sudo dmesg | grep pod5 | grep -i kill | tail
. . . [CUT FOR CONCISENESS] . . .
[23430.025367] pod5 invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[23430.025926] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1004.slice/session-1.scope,task=pod5,pid=27807,uid=1004
[23430.026149] Out of memory: Killed process 27807 (pod5) total-vm:29833352kB, anon-rss:24706004kB, file-rss:0kB, shmem-rss:12kB, UID:1004 pgtables:49604kB oom_score_adj:0

martin@pod5-instance:~$ pod5 -v
Pod5 version: 0.1.16

Yep, I have just confirmed the same thing. I wonder if it's possible to tweak the memory usage.

Hello,

This sounds like a similar issue to one @sklages experienced whilst converting for dorado (over on the dorado project: nanoporetech/dorado#88).

We've identified the internal cause of the issue and are reworking the convert script in the most recent releases of pod5.

We are working on a fix internally and will have it out as soon as possible.

Thanks,

George

Yes, it's the same issue (concerning pod5): [112608.885203] Out of memory: Killed process 44683 (pod5)

We've got a fix for this issue in review and it will be released soon (hopefully tomorrow).

In the meantime, if this issue is blocking your work, please try reducing the number of --threads. This will stop the queue of pending converted reads from filling memory, as the writer will be better able to keep up. An appropriate value of --threads will depend on your system specifications, but 1 should be quite safe. Of course this will increase runtime significantly, but it should mitigate the immediate issue.
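For example, applied to the command from the top of this issue (same paths and flags as that report; just a sketch, not a verified invocation), the suggestion would look something like:

pod5 convert fast5 fast5/*.fast5 -r -o pod5/ -O fast5/ --threads 1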

Kind regards,
Rich

In case it helps anyone, I had the OOM kill happening as described above. After reading this issue, I copied my fast5 files to a more performant drive on our server, and now the same command on the same data is running without problems. It is non-intuitive that IO performance would apparently lead to memory accumulation. Looking forward to the fix, but regardless, IO performance is important in ONT land...

Addendum: spoke too soon. It got notably farther with faster IO, but the parsers still overtook the writer and it was OOM-killed, even with quite a lot of memory assigned.

@HalfPhoton - can you estimate when you will release the fixed version of pod5? Just to (roughly) plan data processing tasks here ..

@jorj1988 @HalfPhoton -

We've got a fix for this issue in review and it will be released soon (hopefully tomorrow).

In the meantime, if this issue is blocking your work, please try reducing the number of --threads. This will stop the queue of pending converted reads from filling memory, as the writer will be better able to keep up. An appropriate value of --threads will depend on your system specifications, but 1 should be quite safe. Of course this will increase runtime significantly, but it should mitigate the immediate issue.

Even with --threads 1 it fails to generate a pod5 file:

pod5 convert \
  fast5 $FAST5_DIR \
  --strict \
  --threads 1 \
  --output ${SAMPLE_ID}.pod5

Error:

### [2023-04-18 12:12:03] START: A549 (mod_bases=5mCG)
Error processing: /path/to/reads/PAG97913_2a1c7820_1636.fast5

Sub-process trace:
A process in the process pool was terminated abruptly while the future was running or pending.

Converting 2444 Fast5s:  29%|##9       | 2838500/9774489 [2:16:22<5:33:14, 346.89Reads/s]

An unexpected error occurred: A process in the process pool was terminated abruptly while the future was running or pending.

concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
  File "/pkg/python-3.10.8-0/lib/python3.10/concurrent/futures/process.py", line 387, in wait_result_broken_or_wakeup
    result_item = result_reader.recv()
  File "/pkg/python-3.10.8-0/lib/python3.10/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/pkg/python-3.10.8-0/lib/python3.10/multiprocessing/connection.py", line 426, in _recv_bytes
    return self._recv(size)
  File "/pkg/python-3.10.8-0/lib/python3.10/multiprocessing/connection.py", line 391, in _recv
    buf.write(chunk)
MemoryError
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/to/venv3/bin/pod5", line 8, in <module>
    sys.exit(main())
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/main.py", line 56, in main
    return run_tool(parser)
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/parsers.py", line 41, in run_tool
    raise exc
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/parsers.py", line 38, in run_tool
    return tool_func(**kwargs)
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/parsers.py", line 129, in run
    return convert_from_fast5(**kwargs)
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 658, in convert_from_fast5
    raise exc
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 649, in convert_from_fast5
    write_converted_reads(
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 566, in write_converted_reads
    if futures_exception(path, future, status, strict):
  File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 550, in futures_exception
    raise exc
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Any ideas are welcome :-)

System: 300GB RAM

pod5 --version
Pod5 version: 0.1.16

The only workaround I found for now was to break the fast5 directory into chunks of files, process each chunk in its own call to pod5 convert, and then merge the pod5 files back into a single directory after the fact.
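A rough sketch of that chunking approach, in case it helps anyone (untested; the chunk size, list file names and output directory are arbitrary placeholders, and the fast5 paths are assumed to contain no spaces):

# Split the list of fast5 files into batches of 100 paths each.
ls fast5/*.fast5 | split -l 100 - chunk_
mkdir -p pod5_chunks
# Convert each batch in its own pod5 process so memory is released
# between batches; the per-batch pod5 files all land in one directory.
for list in chunk_*; do
    pod5 convert fast5 $(cat "$list") --output "pod5_chunks/$(basename "$list").pod5"
done
rm chunk_*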

The memory used by one pod5 process (I presume it is the writer, but I don't know) accumulates in a linear fashion for as long as the job runs, up to 100s of GB. If you run a smaller number of files at a time, the job will finish before hitting your system's memory wall.

I'm not convinced this has to do with read/write pacing at this point; I wonder if there is a memory leak in the writer?

Well thanks for the hint!
That's what I will probably do (but wanted to avoid) .. :-)
Hopefully there is a quick update to pod5. I have quite a few prom runs waiting to be basecalled with dorado .. the same error has been reported in #34 ..

Sorry for the delays with these fixes. We ran into issues in testing on very large datasets and had to make a number of revisions.

We'll let you know as soon as we get a release ready. Thank you for your patience.

Kind regards,
Rich

Untested, but using xargs or parallel to call pod5 convert on one fast5 file (not directory) at a time might be the simplest current workaround, unless you are starting with one monolithic fast5 file.
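Something like this, for instance (untested sketch; the paths are placeholders, and the -o/-O flags simply mirror the command at the top of this issue so that each fast5 ends up as its own pod5 file under pod5/):

find fast5/ -name '*.fast5' -print0 \
    | xargs -0 -I{} pod5 convert fast5 {} -o pod5/ -O fast5/

Spawning a fresh pod5 process per file keeps memory bounded, at the cost of per-file startup overhead.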

@caonetto , @martinaryee, @sklages & @wilsonte-umich ,

I'm pleased to say that 0.1.19 was released last night. Could you please let us know if this patch has resolved your issues?

Kind regards,
Rich

@HalfPhoton - Great, thank you! 👍

Currently running on three datasets: a small one and two large ones, on AMD/A100.
One dataset has finished converting from fast5 without error:

Converting 2444 Fast5s: 100%|##<..>##| 9774489/9774489 [1:39:48<00:00, 1632.21Reads/s]

@HalfPhoton - two other (large) datasets have been successfully converted .. for me this issue is solved.

@sklages , thank you very much for the update :)