nanoporetech/pod5-file-format

pod5: Empty queue or timeout

Closed this issue · 20 comments

Hi,
When using the command "pod5 convert fast5", I consistently encounter an error that causes the program to stop running. However, upon inspecting my data, I can confirm that the data exists and is stored correctly. How can I resolve this issue?

pod5 convert fast5 ./fast5/*.fast5 --output pod5/

[screenshot: error output]

Hi @HITzhongyu ,

Could you set the POD5_DEBUG=1 environment variable and run the same command again? The converter will now generate a number of log files which show the state of the Queue at runtime. I can use these to help resolve this issue.

POD5_DEBUG=1 pod5 convert fast5 ./fast5/*.fast5 --output debug_pod5/

Kind regards,
Rich

Hi @HalfPhoton

I changed to a new set of test data and reran the command. This time, I encountered an error right from the beginning, as follows:
[screenshots: error messages at startup]

However, the program continues to run, but it gets stuck at 99% and throws the following error:
[screenshot: error at 99%]

When I run POD5_DEBUG=1 pod5 convert fast5 ./test/*.fast5 --output debug_pod5/, the errors are as follows:
[screenshot: errors with POD5_DEBUG=1]

Kind regards,
Zhongyu

@HITzhongyu

The first report shows you using -t/--threads 40, which gives a different error from the second report. You might be requesting too many resources, which is why the tool fails to create a new process or thread, resulting in "resource temporarily unavailable". I would suggest reducing the value given to --threads.
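For example (choose a value that suits your machine's limits):

pod5 convert fast5 ./fast5/*.fast5 --output pod5/ --threads 8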

For the second report, which is related to the original issue raised: now that POD5_DEBUG=1 is set, there should be .log files created. Can you share those with me, please?

It looks like the Queue that contains the conversion tasks is becoming empty somehow, or timing out after 600 seconds on a single conversion task (which should be plenty of time for a small chunk of reads).
The log files will help me track down why this happens. Either the process is getting stuck or the queue logic is failing in your example.
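For context, here's a minimal sketch (not the actual converter code) of the kind of queue handling involved; queue.Empty is what surfaces when no task arrives within the timeout:

import queue
from multiprocessing import Queue

def worker_loop(tasks, timeout_s=600.0):
    # `tasks` is a multiprocessing.Queue of conversion work items.
    while True:
        try:
            task = tasks.get(timeout=timeout_s)
        except queue.Empty:
            # No task arrived within the timeout window.
            raise TimeoutError("Empty queue or timeout") from None
        if task is None:  # sentinel: no more work
            return
        # ... convert one chunk of reads here ...

if __name__ == "__main__":
    q = Queue()
    q.put(None)  # sentinel so this demo exits immediately
    worker_loop(q)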

Kind regards,
Rich

Hi @HITzhongyu

Thank you very much for the logs. They've been very helpful.

From the main-pod5.log we can see that one of the worker processes has been killed by a segmentation fault:

2023-06-27 20:21:44,357 DEBUG 66:'terminate_processes': ... SpawnProcess-11, stopped[SIGSEGV] daemon ...

and in the worker 11513-pod5.log we see that the log ends abruptly here:

--- Finishing previous file FAQ32498_pass_09083b73_65.fast5
2023-06-27 20:11:05,414 DEBUG 53:'convert_fast5_file':Done:37.193s
2023-06-27 20:11:05,425 DEBUG 53:'convert_fast5_file':Returned:4000
2023-06-27 20:11:05,427 INFO Enqueueing file end: FAQ32498_pass_09083b73_65.fast5 reads: 4000
2023-06-27 20:11:05,428 DEBUG c7:'enqueue_data'

--- Getting next file FAQ32498_pass_09083b73_71.fast5
2023-06-27 20:11:05,430 DEBUG 56:'get_input':(<pod5.tools.pod5_convert_from_fast5.QueueManager object at 0x7f8b6c3b5b10>,), {}
2023-06-27 20:11:05,430 DEBUG 56:'get_input':Done:0.000s
2023-06-27 20:11:05,430 DEBUG 56:'get_input':Returned:test/FAQ32498_pass_09083b73_71.fast5

--- Testing is_multi_read_fast5 on FAQ32498_pass_09083b73_71.fast5
2023-06-27 20:11:05,431 DEBUG 72:'is_multi_read_fast5':(PosixPath('test/FAQ32498_pass_09083b73_71.fast5'),), {}

--- Segfault

We'd expect to see:

2023-06-27 20:10:26,479 DEBUG fd:'is_multi_read_fast5':(PosixPath('test/FAQ32498_pass_09083b73_65.fast5'),), {}
2023-06-27 20:10:28,220 DEBUG fd:'is_multi_read_fast5':Done:1.741s
2023-06-27 20:10:28,220 DEBUG fd:'is_multi_read_fast5':Returned:True

Can you please check that this file test/FAQ32498_pass_09083b73_71.fast5 is not corrupt in some way?
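If you have the HDF5 command-line tools installed (an assumption on my part), a quick first check would be:

h5ls -r test/FAQ32498_pass_09083b73_71.fast5

which should error out on a corrupted file rather than listing its contents.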

Kind regards,
Rich

Hi @HalfPhoton
I tried to open test/FAQ32498_pass_09083b73_71.fast5, but HDFView can't open it.
I can upload the file to you so you can test it.

Kind regards,
Zhongyu

@HITzhongyu ,
Can you open it with Python?

Using the same environment where pod5 is installed, these module imports should exist:

# Get a path to the file
from pathlib import Path
path = Path("test/FAQ32498_pass_09083b73_71.fast5")
assert path.exists()

# Can we open the file with h5py? If it fails here then the HDF5 file is corrupted somehow
import h5py
h5 = h5py.File(path)

# Is the file empty? If it fails here there's nothing to do anyway and the file should be deleted 
assert len(h5) > 0

# Can pod5 check the file? If it fails here then there might be something we can do
from pod5.tools.pod5_convert_from_fast5 import is_multi_read_fast5
is_multi_read_fast5(path)

@HalfPhoton
It reports an error: Segmentation fault (core dumped)
[screenshot: Segmentation fault (core dumped)]

Can you add a few print statements between tests or run it line-by-line in an interpreter to determine where the segfault occurs?

@HalfPhoton
sure!

from pathlib import Path
path = Path("/home/user/ydliu/hitbic/HG002/test/FAQ32498_pass_09083b73_71.fast5")
assert path.exists()
print("666")

import h5py
h5 = h5py.File(path)
print("777")

assert len(h5) > 0
print("888")

from pod5.tools.pod5_convert_from_fast5 import is_multi_read_fast5
print(is_multi_read_fast5(path))

[screenshot: script output]

Ok,

Please try this (it's essentially the body of is_multi_read_fast5, so the return statements won't run as-is at module level):

print("start")
with h5py.File(path) as _h5:
  print("open")           
  print(_h5)

  _h5.attrs
  print("can access_h5.attrs")
  print(_h5.attrs)

  # The "file_type" attribute might be present on supported multi-read fast5 files.
  if _h5.attrs.get("file_type") == "multi-read":
    return True
  print( "is not multi-read file type")

  if len(_h5) == 0:
    return True
  print( "is not len 0")

  # if there are "read_x" keys, this is a multi-read file
  if any(key for key in _h5 if key.startswith("read_")):
    print("found a read")
    return True

  print("closing handle")
print("everything is fine?!")

I modified your code, because the return statements caused errors outside a function:

print("start")
with h5py.File(path) as _h5:
    print("open")           
    print(_h5)

    _h5.attrs
    print("can access_h5.attrs")
    print(_h5.attrs)

    # The "file_type" attribute might be present on supported multi-read fast5 files.
    if _h5.attrs.get("file_type") == "multi-read":
        print("True")
        # return True
    print( "is not multi-read file type")

    if len(_h5) == 0:
        print("True")
        # return True
    print( "is not len 0")

    # if there are "read_x" keys, this is a multi-read file
    if any(key for key in _h5 if key.startswith("read_")):
        print("found a read")
        # return True

    print("closing handle")
print("everything is fine?!")

It reports an error:

start
open
<HDF5 file "FAQ32498_pass_09083b73_71.fast5" (mode r)>
can access_h5.attrs
<Attributes of HDF5 object at 139974581599904>
is not multi-read file type
is not len 0
Traceback (most recent call last):
  File "test.py", line 40, in <module>
    if any(key for key in _h5 if key.startswith("read_")):
  File "test.py", line 40, in <genexpr>
    if any(key for key in _h5 if key.startswith("read_")):
  File "/home/user/ydliu/miniconda3/envs/remora/lib/python3.8/site-packages/h5py/_hl/group.py", line 499, in __iter__
    for x in self.id.__iter__():
  File "h5py/h5g.pyx", line 128, in h5py.h5g.GroupIter.__next__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5l.pyx", line 316, in h5py.h5l.LinkProxy.iterate
RuntimeError: Link iteration failed (incorrect metadata checksum after all read attempts)

Hi @HITzhongyu ,

It does appear that your fast5 file is corrupt. This is the same issue as seen here: megalodon#279

I'm not sure what we can do other than to recommend that you check your files, and drop those that are corrupt before continuing with pod5 convert. [Edit: subset -> convert]
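If it helps, here's a minimal sketch of such a pre-flight check (not part of pod5; the helper names are mine). Since a corrupt file can segfault the interpreter outright, each file is probed in a child process so a crash only kills the child:

import subprocess
import sys
from pathlib import Path
from typing import List

# Probe run in a child process: open the file and force link iteration,
# which is where the corruption surfaced in your traceback.
PROBE = (
    "import sys, h5py\n"
    "with h5py.File(sys.argv[1], 'r') as f:\n"
    "    list(f.keys())\n"
)

def find_bad_fast5(directory: Path) -> List[Path]:
    bad = []
    for path in sorted(directory.glob("*.fast5")):
        result = subprocess.run(
            [sys.executable, "-c", PROBE, str(path)],
            capture_output=True,
        )
        # A nonzero exit covers both Python exceptions and a child
        # killed by a signal (negative returncode, e.g. SIGSEGV).
        if result.returncode != 0:
            bad.append(path)
    return bad

if __name__ == "__main__":
    for path in find_bad_fast5(Path("test")):
        print(f"corrupt or unreadable: {path}")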

Apologies that we don't have a better solution.

Kind regards,
Rich

Hi @HalfPhoton

Thank you very much for your patient explanation.

I have another question. If the Fast5 data is corrupted, why is there no issue with it during Guppy processing, but problems arise specifically with pod5?

Regarding this issue, could pod5 perform a filtering step before converting, skipping any damaged fast5 files (such as ones wrongly recognized as single-read fast5), without affecting the rest of the run? If only a few files are damaged, it should not impact the results of large-scale methylation detection.

Or, if it's convenient for you, could you please let me know which part of the code needs to be modified? I can make the changes on my end.

Kind regards,
Zhongyu

@HITzhongyu

pod5 convert will try to ignore bad fast5 files unless --strict is set. We removed the up-front fast5 checking because it was so slow.
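For example, to fail on the first bad file instead of skipping it:

pod5 convert fast5 ./fast5/*.fast5 --output pod5/ --strict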

In your case, the files cause a prompt segfault, which kills the worker process immediately instead of allowing it to handle the error gracefully. This is an issue with h5py.

There are potential changes we could make to how we handle dead workers, which we might investigate.
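As a sketch of what that might look like (illustrative only, not pod5's code): a process killed by a signal reports a negative exitcode, so a supervisor can at least detect the segfault instead of waiting on the queue:

import signal
from multiprocessing import Process

def check_worker(worker: Process) -> None:
    worker.join()
    if worker.exitcode == -signal.SIGSEGV:
        # The worker died mid-task; re-queue its file or abort with a
        # clear message rather than timing out after 600 seconds.
        print(f"{worker.name} was killed by SIGSEGV")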

As for how Guppy can handle this file when pod5 cannot; I'm not sure, but Guppy is not using python / h5py which is where I believe the issue is caused.

Kind regards,
Rich

Edit: subset -> convert

@HalfPhoton

Thank you very much for your patient explanation!

Kind regards,
Zhongyu

Hi @HalfPhoton
I looked at pod5 subset, but it checks pod5 files, not fast5:

usage: pod5 subset [-h] [-o OUTPUT] [-r] [-f] [-t THREADS] [--csv CSV]
                   [-s TABLE] [-R READ_ID_COLUMN] [-c COLUMNS [COLUMNS ...]]
                   [--template TEMPLATE] [-T] [-M] [-D]
                   inputs [inputs ...]

Given one or more pod5 input files, take subsets of reads into one or more pod5 output files by a user-supplied mapping.

positional arguments:
  inputs                Pod5 filepaths to use as inputs

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Destination directory to write outputs (default:
                        /home/user/ydliu/hitbic/HG002)
  -r, --recursive       Search for input files recursively matching `*.pod5`
                        (default: False)
  -f, --force-overwrite
                        Overwrite destination files (default: False)
  -t THREADS, --threads THREADS
                        Number of subsetting workers (default: 8)

direct mapping:
  --csv CSV             CSV file mapping output filename to read ids (default:
                        None)

table mapping:
  -s TABLE, --summary TABLE, --table TABLE
                        Table filepath (csv or tsv) (default: None)
  -R READ_ID_COLUMN, --read-id-column READ_ID_COLUMN
                        Name of the read_id column in the summary (default:
                        read_id)
  -c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        Names of --summary / --table columns to subset on
                        (default: None)
  --template TEMPLATE   template string to generate output filenames (e.g.
                        "mux-{mux}_barcode-{barcode}.pod5"). default is to
                        concatenate all columns to values as shown in the
                        example. (default: None)
  -T, --ignore-incomplete-template
                        Suppress the exception raised if the --template string
                        does not contain every --columns key (default: None)

content settings:
  -M, --missing-ok      Allow missing read_ids (default: False)
  -D, --duplicate-ok    Allow duplicate read_ids (default: False)

Example: pod5 subset inputs.pod5 --output subset_mux/ --summary summary.tsv --columns mux

Sorry, my error. I meant to say pod5 convert, not pod5 subset, when explaining the --strict option above.

Are you happy with the solution, @HITzhongyu? Can we close this issue?

Hi @HalfPhoton

sure!
Thank you very much for your patient explanation!

Kind regards,
Zhongyu