nanoporetech/pod5-file-format

fast5 to pod5 conversion issue

Closed this issue · 11 comments

Created from Nanopore Community Post:

https://community.nanoporetech.com/posts/not-even-convertion-of-fa

Hello,

We aim to convert our old .fast5 files into the .pod5 format to enable basecalling with the latest version of Dorado. To achieve this, we utilize the following command:

$ pod5 convert fast5 path/*.fast5 --output path/Pod5

However, we encountered an issue where only 86% of the .fast5 files were successfully converted. The program concluded, displaying the following message on the screen:

"Unable to synchronously open object (object 'Signal' doesn't exist)"
"Unable to synchronously open object (object 'Signal' doesn't exist)"
"Unable to synchronously open object (object 'Signal' doesn't exist)"
Converting 720 Fast5s: 86%|#########################################################7 | 2481101/2879901 [14:53<02:23, 2777.96Reads/s]

I've attempted this process multiple times, even dividing the .fast5 files into separate subfolders to decrease the total number of files. However, it consistently converts the same number of reads. Has anyone else encountered a similar issue?

regards,


Hi Francisco Astigueta,

This issue looks like a problem with a FAST5 file where the "Raw/Signal" dataset is missing.
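For context, a healthy multi-read FAST5 is an HDF5 file laid out roughly like this (a simplified sketch; exact groups and attributes vary by MinKNOW version):

```
/
├── read_<uuid-1>/
│   ├── Raw/
│   │   └── Signal        (int16 dataset of raw current samples)
│   ├── channel_id/
│   ├── context_tags/
│   └── tracking_id/
├── read_<uuid-2>/
│   └── ...
```

The converter raises the error above when a read_<uuid> group exists but its Raw/Signal dataset is absent.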

Here is a short script which should help identify which fast5 file is causing the problem.

from pathlib import Path
from sys import argv
from typing import List
from pod5.tools.pod5_convert_from_fast5 import collect_inputs, is_multi_read_fast5
from h5py import File


def test_file(f5_path: Path) -> List[str]:
    issues = []
    if not is_multi_read_fast5(f5_path):
        issues.append(f"Not a multi-read fast5: {f5_path}")
        return issues

    try:
        with File(str(f5_path), "r") as f5:
            read_ids = [k for k in f5.keys() if k.startswith("read_")]

            if not read_ids:
                issues.append(f"Is empty fast5: {f5_path}")

            for read_id in read_ids:
                try:
                    read = f5[read_id]
                except Exception:
                    issues.append(f"Cannot read {read_id} from: {f5_path}")
                    continue

                if not read:
                    issues.append(f"Null read {read_id} from: {f5_path}")
                    continue

                if "Raw" not in read:
                    issues.append(f"Missing 'Raw' group {read_id} from: {f5_path}")
                else:
                    raw = read["Raw"]
                    if "Signal" not in raw:
                        issues.append(f"Missing 'Raw/Signal' group {read_id} from: {f5_path}")

    except Exception as exc:
        issues.append(f"Other exception: {exc} from: {f5_path}")

    return issues

def main(search_dir: Path = Path.cwd()):
    """Checks all FAST5 files in `search_dir` for issues"""

    print(f"Searching {search_dir.resolve()} for fast5 files")

    fast5_paths = collect_inputs([search_dir], False, "*.fast5", threads=1)
    if not fast5_paths:
        raise RuntimeError(
            f"Found no fast5 inputs to process in: {search_dir.resolve()}"
        )

    issues = []
    for f5_path in fast5_paths:
        issue = test_file(f5_path)
        if issue:
            issues.extend(issue)

    print(f"Found {len(issues)} issues from {len(fast5_paths)} input paths")
    if issues:
        try:
            outfile = Path.cwd() / "issues.txt"
            with outfile.open("w") as out:
                print(f"Writing {len(issues)} issues to {outfile}")
                for issue in issues:
                    out.write(issue + "\n")
        except Exception as exc:
            print(f"Failed to write to output file: {exc}")
            for issue in issues:
                print(issue)


if __name__ == "__main__":
    main(Path(argv[1]))

This script must be run in an environment with pod5 installed.

If this file is saved as find_fas5_issues.py, it can be used to find issues in all fast5 files in the current directory, like this:

python find_fas5_issues.py .

Any issues will be written to issues.txt, which may look like this:

❯ python find_fas5_issues.py .
  Searching /your/search/path for fast5 files
  Found N issues from Y input paths
  Writing N issues to issues.txt

❯ cat issues.txt 
  Missing 'Raw/Signal' group read_id0 from: bad_1.fast5
  Other exception: Unable to open file (file signature not found) from: bad_2.fast5
  ...

Edit 07-06-2024: Script now collects all errors when missing 'Raw' group

Hello HalfPhoton,
Quick update. I've executed the script, and here are a few of the reported issues from the output file:

Missing 'Raw/Signal' group read_b775948c-f165-46b8-aa1f-202b9aec3928 from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_129.fast5
Missing 'Raw/Signal' group read_c96181fe-a0ab-45cc-af60-6846ee0f637e from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_198.fast5

All problematic reads presented the same description. Could this be fixed?

regards,

Hi @astifran ,

This is a very unusual error that I have not seen before.

Are these files empty, or unusual in some other way? Perhaps they're custom generated?

The following command will show the unique filenames with issues and count how many issues appear for each file, if that is helpful. A typical complete fast5 file will contain about 4000 records.

❯ awk -F'from: ' '{print $NF}' issues.txt | sort | uniq -c
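The same per-file tally can also be done in a few lines of Python (a sketch, assuming each issue line ends with `from: <path>` as written by the finder script; the sample lines below are illustrative, not real output):

```python
from collections import Counter

def count_issues_per_file(lines):
    """Tally issue lines by the path after the final 'from: ' marker."""
    counts = Counter()
    for line in lines:
        if "from: " in line:
            counts[line.rsplit("from: ", 1)[1].strip()] += 1
    return counts

lines = [
    "Missing 'Raw/Signal' group read_aaa from: /data/bad_1.fast5",
    "Missing 'Raw/Signal' group read_bbb from: /data/bad_1.fast5",
    "Other exception: truncated file from: /data/bad_2.fast5",
]
for path, n in count_issues_per_file(lines).most_common():
    print(n, path)  # prints: 2 /data/bad_1.fast5, then 1 /data/bad_2.fast5
```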

My recommendation would be to exclude these files from conversion as they're potentially corrupt.

Best regards,
Rich

Rich, here are some of the results: 1 Missing 'Raw/Signal' group read_b775948c-f165-46b8-aa1f-202b9aec3928 from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_129.fast5
1 Missing 'Raw/Signal' group read_c96181fe-a0ab-45cc-af60-6846ee0f637e from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_198.fast5
1 Missing 'Raw/Signal' group read_29831ee8-5614-49b3-ab47-a18d068b4f71 from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_174.fast5
1 Missing 'Raw/Signal' group read_4e851c3b-08a4-4063-85e4-96cd684d9c0b from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_119.fast5
1 Missing 'Raw/Signal' group read_2d88cdb3-83d0-4005-8b27-9d21b788edc0 from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_148.fast5
1 Missing 'Raw/Signal' group read_52cb75cc-9a99-4bc3-894c-ce4d3051375e from: /mnt/tools/RUN001/fast5/fast5_14/FAV30123_2e7199c6_cdef5928_112.fast5

Do you think all ~4000 records per fast5 file could be damaged? Could we not save some of them?

Does this mean that there is only 1 read with the missing "Raw/Signal" group per .fast5 file? If so, perhaps we could save the rest of the reads in each file. Moreover, here is the error shown when trying to convert the .fast5 files in the online tool. It processes some files, but when a broken file appears, it stops the conversion:

[screenshot: error message from the online conversion tool]

Could we edit the .fast5 files in order to do this?

Hi @astifran, yes, we can remove these problematic read ids using the information contained in issues.txt.

The following will extract all the problematic read ids from issues.txt into bad_reads.txt:

grep -o "read_[0-9a-f-]*" issues.txt > bad_reads.txt
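Equivalently, here is a small Python sketch of the same extraction (it assumes read ids keep the hyphenated hex UUID shape after the `read_` prefix, matching the grep pattern):

```python
import re

# Matches 'read_' followed by hyphenated hex characters, same as the grep pattern.
READ_ID = re.compile(r"read_[0-9a-f-]+")

def extract_bad_reads(text: str) -> list:
    """Return the unique read_<uuid> tokens found in issues.txt content."""
    seen = []
    for match in READ_ID.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

text = "Missing 'Raw/Signal' group read_b775948c-f165-46b8-aa1f-202b9aec3928 from: x.fast5"
print(extract_bad_reads(text))  # prints: ['read_b775948c-f165-46b8-aa1f-202b9aec3928']
```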

This Python snippet will delete matching read ids in all fast5 files, given a list of read ids. Let's call this file delete_fas5_reads.py.

Caution

This will delete reads and should be used with caution!

from pathlib import Path
from sys import argv
from typing import Optional, Set
from pod5.tools.pod5_convert_from_fast5 import collect_inputs, is_multi_read_fast5
from h5py import File


def remove_reads(f5_path: Path, bad_ids: Set[str]) -> Optional[str]:
    if not is_multi_read_fast5(f5_path):
        return f"Not a multi-read fast5: {f5_path}"

    try:
        with File(str(f5_path), "r+") as f5:
            read_ids = {k for k in f5.keys() if k.startswith("read_")}

            if not read_ids:
                return f"Is empty fast5: {f5_path}"

            common_ids = bad_ids & read_ids

            for read_id in common_ids:
                try:
                    # del f5[read_id]  # UNCOMMENT THIS LINE TO DELETE READS
                    print(f"Deleted {read_id} from: {f5_path}")
                except Exception:
                    print(f"Cannot delete {read_id} from: {f5_path}")

    except Exception as exc:
        return f"Other exception: {exc} from: {f5_path}"


def main(search_dir: Path = Path.cwd(), bad_reads: Path = Path.cwd() / "bad_reads.txt"):
    """Remove bad reads from fast5"""

    print(f"Searching {search_dir.resolve()} for fast5 files")

    fast5_paths = collect_inputs([search_dir], False, "*.fast5", threads=1)
    if not fast5_paths:
        raise RuntimeError(
            f"Found no fast5 inputs to process in: {search_dir.resolve()}"
        )

    bad_ids = {
        line.strip()
        for line in bad_reads.read_text().splitlines()
        if line.startswith("read_")
    }
    if not bad_ids:
        raise RuntimeError(f"Found no read ids in: {bad_reads.resolve()}")

    for f5_path in fast5_paths:
        issue = remove_reads(f5_path, bad_ids)
        if issue: 
            print(issue)


if __name__ == "__main__":
    main(Path(argv[1]), Path(argv[2]))

Caution

This will delete reads and should be used with caution!

Takes a fast5 path or a directory of fast5s and a list of read ids to delete (default: bad_reads.txt).
The script above has the line del f5[read_id] commented out and in its current state does nothing other than report what the script would do. Please test this script in its inert state first.

Important

Please back-up all your data before running this script and store it away from where this tool will be run.
Please review this code and only use it once you're satisfied that it will do as you intend.

Kind regards,
Rich

Rich, I believe the script works OK. This is the result:
$ python3 /mnt/tools/RUN001/delete_fas5_reads.py /mnt/tools/RUN001/fast5_for_pod5 /mnt/tools/RUN001/bad_reads.txt
Searching /mnt/tools/RUN001/fast5_for_pod5 for fast5 files
Deleted read_48262472-b561-4d05-a438-c0a2dbd77259 from: /mnt/tools/RUN001/fast5_for_pod5/FAV30123_2e7199c6_cdef5928_220.fast5
Deleted read_2bc6adbb-6694-43b0-8ce3-5a6751902cf2 from: /mnt/tools/RUN001/fast5_for_pod5/FAV30123_2e7199c6_cdef5928_144.fast5
Deleted read_055ca1da-162f-4efa-ae85-1647fc8e4213 from: /mnt/tools/RUN001/fast5_for_pod5/FAV30123_2e7199c6_cdef5928_157.fast5

etc. The same number of erroneous reads detected were 'excluded' by the script. However, as I understood from your previous message, this was merely a dry run to verify the script. I'll await your reply before converting to .pod5 again.

Hi @astifran ,
This looks as I'd expect.

If you're happy to continue, uncomment the del f5[read_id] line and it should delete those reads on the next run.

Please back-up your data first if possible.

Kind regards,
Rich

Hi Rich, here's an update. It appears that some reads were deleted. However, after attempting to convert the .fast5 files (excluding the problematic reads), the same error persists. Nonetheless, the conversion rate improved significantly, from 86% without removing the problematic reads:

Converting 720 Fast5s: 86% | ########################################################7 |
2481101/2879901 [14:53<02:23, 2777.96 Reads/s]

To 96% after isolating the problematic reads:

Converting 719 Fast5s: 95% | ########################################################5 |
2725842/2875842 [20:01<01:06, 2268.29 Reads/s]

The error message remains consistent:
"Unable to synchronously open object (object 'Signal' doesn't exist)"

This led me to suspect there might have been undetected problematic reads initially. I re-ran the find_fas5_issues.py analysis and found a few new problematic reads. I then used the deletion script to eliminate them and re-ran the conversion to .pod5.

Again, the conversion reported a few errors but almost reached 100% completion:

"Unable to synchronously open object (object 'Signal' doesn't exist)"
Converting 719 Fast5s: 99% | ################################################################## | 2834130/2875730 [20:42<00:18, 2281.58 Reads/s].

It seems the pipeline functions quite well, though it requires running it 3 to 4 times to catch all problematic reads.

Looking forward to your comments. Thank you, Rich!

Hi @astifran, this was my mistake:
The finder script only reports the first bad read in a file.

I'll retroactively update the initial script to fix this. Please see the updates when they're made.

I'm glad to hear your issue is mostly resolved, though.

Out of interest, do you know how this might have happened in the first place?

Best regards,
Rich

Rich, thank you for your help. I'm looking forward to the final script.
I'm not entirely sure why this happened. These reads were created several months ago as .fast5 files using a minION and the latest chemistry, so they're not extremely old files. Subsequently, they were moved to an external disk for backup purposes. I don't think the issue arose during the transfer process, but one can never be certain.
best regards,
Francisco

Francisco, the script in the first post was updated in-place.
This should now find all issues in a file in one go, allowing all bad records to be deleted at once.

Thanks for your patience with this issue and I'll raise this strange result with the appropriate teams to investigate.

Best regards,
Rich