nanoporetech/pod5-file-format

pod5 subset - terminate called without an active exception with slurm submissions

Closed this issue · 4 comments

bef22 commented

I'm using pod5 0.2.2 (miniconda environment) and I'm trying to run pod5 subset on a couple of pod5 files on a computing cluster via SLURM. Quite often I get the following errors: terminate called without an active exception and RuntimeError: Resource temporarily unavailable. Some jobs complete normally, some keep running until the end of the allowed time without producing any output, and some produce only part of the output.

The relevant SLURM and code snippets below are from some of the jobs that failed the first time and that I reran (again only ~50% of the jobs completed normally; the others failed again):
#! Number of nodes to be allocated for the job (for single core jobs always leave this at 1)
#SBATCH --nodes=1
#! Number of tasks. By default SLURM assumes 1 task per node and 1 CPU per task. (for single core jobs always leave this at 1)
#SBATCH --ntasks=1
#! How many cores will be allocated per task? (for single core jobs always leave this at 1)
#SBATCH --cpus-per-task=9
#SBATCH --array=16,23-27,29,31-36,38-41
export OMP_NUM_THREADS=8

POD5FILE="../pod5batches/batch_${SLURM_ARRAY_TASK_ID}.pod5"
CSVFILE="summary_${SLURM_ARRAY_TASK_ID}.csv"
CMD="pod5 subset --threads 8 --force-overwrite --csv $CSVFILE --duplicate-ok $POD5FILE"

source ~/.bashrc
source ~/pod5/bin/activate
eval $CMD

Attached are some error reports (including from first attempt)
subset_23024531_34.err.txt
subset_23059021_27.err.txt
subset_23059021_32.err.txt

Hello @bef22

Thank you for reporting this issue. From the reports you've shared (thank you, they're very helpful!) we can see that your tasks are crashing during thread creation in several different places:

32 - Polars (Rust style panic):
thread '<unnamed>' panicked at 'could not spawn threads: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }', /home/runner/work/polars/polars/polars/polars-core/src/lib.rs:61:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
33 - Importing the lib_pod5 c api (which uses multiple threads)
  File "/home/bef22/pod5/lib/python3.7/site-packages/lib_pod5/__init__.py", line 4, in <module>
    from .pod5_format_pybind import (
ImportError: Resource temporarily unavailable

The curious log is 27's terminate called without an active exception, but based on what we've seen above and from some googling this also appears to be a threading issue.

Under the hood, pod5 creates a large number of threads and processes, especially when subsetting. The --threads argument is a bit of a misnomer here, as it actually controls the number of additional processes being created. For each process, the lib_pod5 c_api creates 10 threads and Polars creates as many as it can get away with.
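The "Resource temporarily unavailable" errors (errno 11, EAGAIN) raised during thread creation usually mean the job has hit a process/thread limit on the compute node. As a rough diagnostic (not an official pod5 procedure), you could print the relevant limits from inside a job step before launching pod5, for example:

#! Hypothetical diagnostic lines to add near the top of the SLURM script
echo "max user processes/threads: $(ulimit -u)"                # RLIMIT_NPROC as seen by the job
echo "threads currently owned by $USER: $(ps -eLf | awk -v u="$USER" '$1==u' | wc -l)"
grep -i processes /proc/self/limits                            # same limit, per-process view

If the thread count is already close to that limit, reducing the numbers below should help.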

My recommendation to improve the stability of pod5 while running on your SLURM cluster is to apply one or more of the following (a combined sketch follows the list):

  • Reduce pod5 subset --threads
  • Increase requested slurm resources
  • Reduce the number of threads spawned by Polars, e.g. via the POLARS_MAX_THREADS environment variable (polars.threadpool_size() reports the current pool size)
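As a rough illustration of how these could be combined, here is a sketch of the submission script with fewer pod5 worker processes and the Polars thread pool capped to one thread per process; the specific numbers are assumptions and should be tuned against your cluster's limits:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#! Request enough CPUs to cover the worker processes and their internal threads
#SBATCH --cpus-per-task=9
export OMP_NUM_THREADS=1        # keep OpenMP from oversubscribing
export POLARS_MAX_THREADS=1     # cap the Polars thread pool in each process

POD5FILE="../pod5batches/batch_${SLURM_ARRAY_TASK_ID}.pod5"
CSVFILE="summary_${SLURM_ARRAY_TASK_ID}.csv"
#! Fewer worker processes than before; each still spawns ~10 lib_pod5 threads
pod5 subset --threads 4 --force-overwrite --csv "$CSVFILE" --duplicate-ok "$POD5FILE"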

I hope this helps resolve the issue. Please keep us informed with any progress.

Kind regards,
Rich

@bef22, are there any updates on this issue or can we close it?

bef22 commented

I processed my second library, which had a higher read length and therefore fewer pod5 batches to process, and changed the slurm script to:
#! Number of nodes to be allocated for the job (for single core jobs always leave this at 1)
#SBATCH --nodes=1
#! Number of tasks. By default SLURM assumes 1 task per node and 1 CPU per task. (for single core jobs always leave this at 1)
#SBATCH --ntasks=1
#! How many cores will be allocated per task? (for single core jobs always leave this at 1)
#SBATCH --cpus-per-task=9
#SBATCH --array=1-14
export OMP_NUM_THREADS=1

POD5FILE="../pod5batches/batch_${SLURM_ARRAY_TASK_ID}.pod5"
CSVFILE="summary_${SLURM_ARRAY_TASK_ID}.csv"
CMD="pod5 subset --threads 8 --force-overwrite --csv $CSVFILE --duplicate-ok $POD5FILE"

source ~/.bashrc
source ~/pod5/bin/activate
eval $CMD

This actually ran without any errors!

You can close this issue; it might have been the "export OMP_NUM_THREADS" setting in my first script that caused the problem.

Thank you for the update. I'm glad to hear your pipeline is working again!

Kind regards,
Rich