pod5 subset cannot shard a large aggregate pod5 file
billytcl opened this issue · 6 comments
Using Python 3.7 and pod5 0.2.4, I'm trying to use pod5 subset to break up a large pod5 file into smaller ones, but it crashes with a strange error.
I used pod5 inspect reads to generate a reads table, intending to split on the "channel" column:
nohup pod5 inspect reads converted.pod5 > inspect_reads.txt &
First 10k lines (only the first few shown here):
read_id filename read_number channel mux end_reason start_time start_sample duration num_samples minknow_events sample_rate median_before predicted_scaling_scale predicted_scaling_shift tracked_scaling_scale tracked_scaling_shift num_reads_since_mux_change time_since_mux_change run_id sample_id experiment_id flow_cell_id pore_type
001e35eb-1c55-4c7b-8886-dcdcb6dca15f converted.pod5 74306 148 4 signal_positive 22092.86475000 88371459 0.64550000 2582 144 4000 205.11199951 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
00356992-b31c-49a9-bb54-b5f04e8feccb converted.pod5 67712 850 4 signal_positive 22086.55175000 88346207 0.71600000 2864 142 4000 208.46253967 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
00399263-1258-42df-be8e-623435a5e3b0 converted.pod5 68784 1138 3 signal_positive 22092.13000000 88368520 0.93625000 3745 195 4000 205.83134460 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
0046e652-310e-42e8-a9eb-a97afe61cd6c converted.pod5 71090 2653 3 signal_positive 22093.76100000 88375044 0.94250000 3770 201 4000 204.88726807 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
004f3d41-1063-43d5-a2fe-ab70094d784c converted.pod5 74144 1178 3 signal_positive 22089.63250000 88358530 0.90700000 3628 194 4000 208.21510315 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
00625954-411f-4c18-ad6d-f7d0b5234f84 converted.pod5 65143 1218 1 signal_positive 22094.15175000 88376607 0.78825000 3153 163 4000 213.10104370 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
007ff853-7128-4f74-bb12-88a31ac23205 converted.pod5 46458 2338 1 signal_positive 22088.29975000 88353199 0.67450000 2698 152 4000 210.64157104 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
0080d4a1-8abd-4a06-9a9a-e6b8d856122a converted.pod5 74703 1140 4 signal_positive 22090.70975000 88362839 1.09300000 4372 258 4000 206.32232666 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
0089e982-2261-4a08-9945-07b25a7f7214 converted.pod5 67244 56 1 signal_positive 22091.08575000 88364343 1.03300000 4132 196 4000 211.88562012 NaN NaN NaN NaN 0 0.00000000 389c96a8-a692-4592-85de-b4c93643740e Seq_Output not_set PAM91261 not_set
Then I run subset:
nohup pod5 subset pod5/*.pod5 -o pod5_subset/ -f -r -s pod5/inspect_reads.head.txt -c channel &
It crashes with:
Parsed 9999 targets
Traceback (most recent call last):
File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
POD5 has encountered an error: ''
For detailed information set POD5_DEBUG=1'
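From the traceback, the failure looks like a Python pickle limitation rather than anything pod5-specific: pickle protocols below 4 cannot serialize a bytes object larger than 4 GiB, and as far as I can tell Python 3.7's multiprocessing feeds its queues with the default protocol (3). A rough illustration (hypothetical, and it needs more than 4 GiB of RAM to actually run):

import pickle

payload = b"\x00" * (4 * 1024 ** 3 + 1)  # just over 4 GiB
pickle.dumps(payload, protocol=3)        # OverflowError: cannot serialize a bytes object larger than 4 GiB
pickle.dumps(payload, protocol=4)        # works: protocol 4 added support for objects over 4 GiB

So whatever subset is handing between its worker processes here is apparently topping 4 GiB in a single object.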
That's a new one - thanks @billytcl,
We will take a look and try to work out what's going on internally...
- George
A short-term workaround might be to work in smaller batches, if that helps?
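If the oversized object comes from the target list, one way to batch (just a sketch - the chunk size and file names are placeholders, and if the large object actually comes from parsing the input pod5 itself this won't avoid the error) would be to split the summary table and run subset once per chunk, each into its own output directory so batches don't overwrite each other:

import subprocess
from pathlib import Path

summary = Path("pod5/inspect_reads.head.txt")     # the table passed via -s
header, *rows = summary.read_text().splitlines()

chunk = 100_000                                   # arbitrary batch size
inputs = [str(p) for p in Path("pod5").glob("*.pod5")]
for n, start in enumerate(range(0, len(rows), chunk)):
    batch = Path(f"pod5/subset_batch_{n}.txt")
    batch.write_text("\n".join([header] + rows[start:start + chunk]) + "\n")
    # one subset run per batch; separate output directories, because reads from
    # the same channel can land in different batches
    subprocess.run(
        ["pod5", "subset", *inputs,
         "-o", f"pod5_subset_{n}/", "-f", "-r",
         "-s", str(batch), "-c", "channel"],
        check=True,
    )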
If you are able to rerun with POD5_DEBUG=1 set during execution, it may provide us with more information to debug.
How large was the input dataset (and what were the approximate read lengths)? That way I can ensure we have an equivalent internal dataset to test on.
Thanks,
- George
Ok - no worries.
Could you try using view rather than inspect? This will cut down the size of the input csv given to subset:
pod5 view --include "read_id, channel" converted.pod5
This should produce a significantly smaller file for subset to process.
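For example (the file name here is just a placeholder - view writes its table to stdout, so redirect it to a file and pass that file to subset via -s):
pod5 view --include "read_id, channel" converted.pod5 > read_channels.tsv
pod5 subset pod5/*.pod5 -o pod5_subset/ -f -r -s read_channels.tsv -c channel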
Going to try pod5 view with the reduced columns now. In the meantime, here are the logs from running with debug mode enabled:
2023-09-18--00-55-28-p-3088300-pod5.log
2023-09-18--00-55-26-main-pod5.log
Alongside is the error captured in the nohup output:
Parsed 9999 targets
Traceback (most recent call last):
File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
Traceback (most recent call last):
File "/home/billylau/.conda/envs/pod5/bin/pod5", line 8, in <module>
sys.exit(main())
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/main.py", line 60, in main
return run_tool(parser)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 41, in run_tool
raise exc
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 38, in run_tool
return tool_func(**kwargs)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 564, in run
return subset_pod5(**kwargs)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
raise exc
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
ret = func(*args, **kwargs)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 647, in subset_pod5
force_overwrite=force_overwrite,
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
raise exc
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
ret = func(*args, **kwargs)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 585, in subset_pod5s_with_mapping
sources_df = parse_sources(inputs, duplicate_ok, threads)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
raise exc
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
ret = func(*args, **kwargs)
File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 321, in parse_sources
items.append(parsed_sources.get(timeout=60))
File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 105, in get
raise Empty
_queue.Empty