Pod5 convert fast5 killed midway.
Closed this issue · 16 comments
Hi, I am trying to convert some fast5 files to pod5. The command checks the files and starts converting them, but it crashes after a while.
pod5 convert fast5 fast5/*.fast5 -r -o pod5/ -O fast5/
Converting 886 Fast5s: 3%|##1 | 111250/3543065 [00:47<22:58, 2488.94Reads/s]
Killed
I get the same error. It seems to get killed due to an out of memory error. (I'm currently testing this on a 32GB VM).
martin@pod5-instance:~$ time pod5 convert fast5 -t 2 testsample/*.fast5 --output testsample.pod5
Converting 476 Fast5s: 0%|1 | 1500/1902741 [00:03<1:01:10, 517.97Reads/s]
Converting 476 Fast5s: 24%|##################################7 | 462750/1902741 [16:48<1:34:07, 254.97Reads/s]
Killed
real 21m19.411s
user 1m6.860s
sys 1m19.523s
martin@pod5-instance:~$ sudo dmesg | grep pod5 | grep -i kill | tail
. . . [CUT FOR CONCISENESS] . . .
[23430.025367] pod5 invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[23430.025926] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1004.slice/session-1.scope,task=pod5,pid=27807,uid=1004
[23430.026149] Out of memory: Killed process 27807 (pod5) total-vm:29833352kB, anon-rss:24706004kB, file-rss:0kB, shmem-rss:12kB, UID:1004 pgtables:49604kB oom_score_adj:0
martin@pod5-instance:~$ pod5 -v
Pod5 version: 0.1.16
Yep, I have just confirmed the same thing. I wonder if it's possible to tweak the memory usage.
Hello,
This sounds like a similar issue to the one @sklages experienced whilst converting for dorado (over on the dorado project: nanoporetech/dorado#88).
We've identified the internal cause of the issue and are reworking the convert script for the most recent releases of pod5.
We are working on a fix internally and will have it out as soon as possible.
Thanks,
- George
Yes, it's the same issue (concerning pod5):
[112608.885203] Out of memory: Killed process 44683 (pod5)
We've got a fix for this issue in review and it will be released soon (hopefully tomorrow).
In the meantime, if this issue is blocking your work, please try reducing the number of --threads. This will stop the queue of pending converted reads from filling memory, as the writer will be more able to keep up. An appropriate value of --threads will depend on your system specifications, but 1 should be quite safe. Of course this will increase runtime significantly, but it should mitigate the immediate issue.
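For example, a minimal sketch of this workaround applied to the command from the original report (paths are the reporter's own placeholders):
# Same conversion as in the original report, but with a single worker
# thread so the writer can keep pace with the converted-read queue.
pod5 convert fast5 fast5/*.fast5 -r -o pod5/ -O fast5/ --threads 1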
Kind regards,
Rich
In case it helps anyone, I had the OOM kill happening as described above. After reading this issue, I copied my fast5 files to a more performant drive on our server, and now the same command on the same data is running without problems. It is non-intuitive that IO performance would apparently lead to memory accumulation. I look forward to the fix, but regardless, IO performance is important in ONT land...
Addendum: I spoke too soon. The conversion got notably farther with faster IO, but the parsers still overtook the writer and the process was OOM-killed, even with quite a lot of memory assigned.
@HalfPhoton - can you estimate when you will release the fixed version of pod5? Just to (roughly) plan data processing tasks here ..
@jorj1988 @HalfPhoton -
Even with --threads 1 it fails to generate a pod5 file:
pod5 convert \
fast5 $FAST5_DIR \
--strict \
--threads 1 \
--output ${SAMPLE_ID}.pod5
Error:
### [2023-04-18 12:12:03] START: A549 (mod_bases=5mCG)
Error processing: /path/to/reads/PAG97913_2a1c7820_1636.fast5
Sub-process trace:
A process in the process pool was terminated abruptly while the future was running or pending.
Converting 2444 Fast5s: 29%|##9 | 2838500/9774489 [2:16:22<5:33:14, 346.89Reads/s]
An unexpected error occurred: A process in the process pool was terminated abruptly while the future was running or pending.
concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
File "/pkg/python-3.10.8-0/lib/python3.10/concurrent/futures/process.py", line 387, in wait_result_broken_or_wakeup
result_item = result_reader.recv()
File "/pkg/python-3.10.8-0/lib/python3.10/multiprocessing/connection.py", line 255, in recv
buf = self._recv_bytes()
File "/pkg/python-3.10.8-0/lib/python3.10/multiprocessing/connection.py", line 426, in _recv_bytes
return self._recv(size)
File "/pkg/python-3.10.8-0/lib/python3.10/multiprocessing/connection.py", line 391, in _recv
buf.write(chunk)
MemoryError
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/to/venv3/bin/pod5", line 8, in <module>
sys.exit(main())
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/main.py", line 56, in main
return run_tool(parser)
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/parsers.py", line 41, in run_tool
raise exc
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/parsers.py", line 38, in run_tool
return tool_func(**kwargs)
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/parsers.py", line 129, in run
return convert_from_fast5(**kwargs)
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 658, in convert_from_fast5
raise exc
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 649, in convert_from_fast5
write_converted_reads(
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 566, in write_converted_reads
if futures_exception(path, future, status, strict):
File "/path/to/venv3/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 550, in futures_exception
raise exc
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Any ideas are welcome :-)
System: 300GB RAM
pod5 --version
Pod5 version: 0.1.16
The only workaround I found for now was to break the fast5 directory into chunks of files and process each chunk in its own call to pod5 convert, then merge the pod5 files back into a single directory after the fact.
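A minimal sketch of that chunked workaround, assuming bash, placeholder directory names (fast5_dir/ and pod5_out/), and an arbitrary chunk size of 100 files:
#!/usr/bin/env bash
# Convert fast5 files in fixed-size chunks so each pod5 process exits
# before its memory footprint grows too large; the resulting
# part_*.pod5 files all land in a single output directory.
set -euo pipefail
FAST5_DIR=fast5_dir   # placeholder input directory
OUT_DIR=pod5_out      # placeholder output directory
CHUNK=100             # files per conversion call; tune to your memory
mkdir -p "$OUT_DIR"
files=( "$FAST5_DIR"/*.fast5 )
part=0
for (( i = 0; i < ${#files[@]}; i += CHUNK )); do
    pod5 convert fast5 "${files[@]:i:CHUNK}" \
        --threads 1 \
        --output "$OUT_DIR/part_${part}.pod5"
    part=$(( part + 1 ))
done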
The memory used by one pod5 process (I presume it is the writer, but I don't know) accumulates in a linear fashion for as long as the job runs, up to 100s of GB. If you run a smaller number of files at a time, the job will end before hitting your system's memory wall.
I'm not convinced this has to do with the read/write pacing at this point; I wonder if there is a memory leak in the writer?
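If you want to confirm that linear growth on your own system, a small watch loop like this hypothetical one (the process name pod5 is taken from the dmesg output above) logs the converter's resident set size:
# Print the pod5 process's RSS every 10 seconds while it is running;
# a steadily climbing number confirms the accumulation described above.
while pid=$(pgrep -o -x pod5); do
    ps -o rss= -p "$pid" | awk '{printf "%.1f GiB\n", $1 / 1024 / 1024}'
    sleep 10
done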
Well, thanks for the hint!
That's what I will probably do (but wanted to avoid) .. :-)
Hopefully there is a quick update of pod5. I have quite a few prom runs waiting to be basecalled with dorado .. the same error has been reported in #34 ..
Sorry for the delays with these fixes. We ran into issues in testing on very large datasets and had to make a number of revisions.
We'll let you know as soon as we get a release ready. Thank you for your patience.
Kind regards,
Rich
Untested, but using xargs or parallel to call pod5 convert one fast5 file (not directory) at a time might be the simplest current workaround - unless you are starting with one monolithic fast5 file.
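A hedged sketch of that idea, reusing the short flags from the original report with placeholder directory names (fast5_dir/ and pod5_out/):
# Convert one fast5 file per pod5 invocation; -O mirrors the input
# layout so each fast5 produces its own pod5 in the output directory.
mkdir -p pod5_out
ls fast5_dir/*.fast5 | xargs -I{} \
    pod5 convert fast5 {} -o pod5_out/ -O fast5_dir/ --threads 1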
@caonetto, @martinaryee, @sklages & @wilsonte-umich,
I'm pleased to say that 0.1.19 was released last night. Could you please let us know if this patch has resolved your issues?
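A hedged upgrade sketch, assuming a pip-based installation of the pod5 package:
# Upgrade to the patched release and confirm the version afterwards.
pip install --upgrade pod5
pod5 --version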
Kind regards,
Rich
@HalfPhoton - Great, thank you!
Currently running on three datasets, a small one and two large ones, on AMD/A100.
One dataset has finished converting from fast5, without error:
Converting 2444 Fast5s: 100%|##<..>##| 9774489/9774489 [1:39:48<00:00, 1632.21Reads/s]
@HalfPhoton - two other (large) datasets have been successfully converted .. for me this issue is solved.
@sklages, thank you very much for the update :)