nanoporetech/pod5-file-format

pod5 filter freeze

diego-rt opened this issue · 3 comments

Hello,

When running pod5 filter, the process often freezes at random. Rerunning it usually results in successful completion. This is annoying because the process does not exit with an error code or anything similar; it just hangs until timeout.

This is the command:

pod5 filter ${pod5_dir} -t ${task.cpus} -r --ids filtered.channel_\${channel}.txt --missing-ok --output ./filtered.channel_\${channel}.pod5 

This is the output:

Parsed 98 reads_ids from: filtered.channel_1381.txt
terminate called without an active exception

Thanks!

Hi @diego-rt ,

We're reworking subset (the underlying process used by filter) to significantly lower resource usage and improve performance. This is also mentioned here: #93 (comment)

We'll hopefully get this out before year end.


However to help out in the meantime:

It looks like you're running in Nextflow, based on the syntax of your command. I would recommend trying / exploring the following, which will hopefully improve reliability:

  • Reduce -t ${task.cpus} - this option has only a small effect in filter and does not meaningfully affect its runtime performance.
  • Increase memory allocated for the task.
  • Use maxForks to limit the number of concurrent tasks.
    • Reducing the number of parallel tasks might improve stability, especially if there is a large number of input files, as there can be a very large number of open file descriptors during filtering / subsetting.
  • Use errorStrategy 'retry'
    • This retries failing jobs automatically.
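The Nextflow-side suggestions above could be sketched in a process config like this. This is only a starting point, not something from the pod5 docs: the process name `pod5_filter` and all resource values are hypothetical placeholders to tune for your own pipeline.

```groovy
// Hypothetical nextflow.config sketch illustrating the directives above.
process {
    withName: 'pod5_filter' {       // assumed process name - match yours
        cpus          = 1           // pod5 filter gains little from extra threads
        memory        = '6 GB'      // raise if tasks hang or die under memory pressure
        maxForks      = 50          // cap concurrent tasks to limit open file descriptors
        errorStrategy = 'retry'     // retry failing jobs automatically
        maxRetries    = 3
    }
}
```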

I hope these points help in the meantime and we'll get back to you soon with an update.

Kind regards,
Rich

Hi @HalfPhoton ,

Yes, indeed, I'm using Nextflow with only one thread and 3 GB of memory. I think the issue is that I've heavily parallelized the workflow, so several hundred jobs are simultaneously accessing the same file, which leads to an understandable I/O error. I should probably reduce the number of forks, that's true.

But I think the main problem is that the process hangs without exiting. It would be fine if it just died with an error exit code, because the job would then simply be retried; since it does not actually exit, the process just sits there until timeout.
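As a stopgap for the hang-until-timeout behaviour, one option (my own suggestion, not something the pod5 tooling provides) is to wrap the command in coreutils `timeout`, which kills a stuck process and returns exit code 124, giving the retry strategy a failure it can act on. Here `sleep` stands in for the real `pod5 filter ...` invocation:

```shell
#!/usr/bin/env bash
# Sketch of the timeout pattern; `sleep` is a stand-in for `pod5 filter ...`.

run_with_limit() {
    # Kill the command if it exceeds the limit; GNU timeout exits 124 on timeout.
    timeout "$1" "${@:2}"
}

# A command that finishes within its limit succeeds normally.
run_with_limit 5s sleep 1 && echo "fast command: ok"

# A hung command is killed, yielding a non-zero code a scheduler can retry on.
rc=0
run_with_limit 1s sleep 30 || rc=$?
echo "slow command exit code: $rc"   # 124 indicates a timeout
```

Used inside a Nextflow script block together with errorStrategy 'retry', a frozen task would then fail fast and be resubmitted instead of sitting idle until the walltime limit.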

Yes, you're absolutely correct. These changes will be incorporated into the new design of filter and subset, which will be more stable for large numbers of inputs / outputs and scale better for use cases like yours.

Best regards,
Rich