pod5 subset - terminate called without an active exception with slurm submissions
Closed this issue · 4 comments
I'm using pod5 0.2.2 (miniconda environment) and trying to run `pod5 subset` on a couple of pod5 files on a computing cluster via slurm. Quite often I get the following errors: `terminate called without an active exception` and `RuntimeError: Resource temporarily unavailable`. Some jobs complete normally, some keep running until the end of the allowed time without producing any output, and some produce only part of the output.
The relevant slurm and code snippets are below; they are from some of the jobs that failed on the first attempt and were rerun (again, only ~50% of the jobs completed normally; the others failed again):
#! Number of nodes to be allocated for the job (for single core jobs always leave this at 1)
#SBATCH --nodes=1
#! Number of tasks. By default SLURM assumes 1 task per node and 1 CPU per task. (for single core jobs always leave this at 1)
#SBATCH --ntasks=1
#! How many cores will be allocated per task? (for single core jobs always leave this at 1)
#SBATCH --cpus-per-task=9
#SBATCH --array=16,23-27,29,31-36,38-41
export OMP_NUM_THREADS=8
POD5FILE="../pod5batches/batch_${SLURM_ARRAY_TASK_ID}.pod5"
CSVFILE="summary_${SLURM_ARRAY_TASK_ID}.csv"
CMD="pod5 subset --threads 8 --force-overwrite --csv $CSVFILE --duplicate-ok $POD5FILE"
source ~/.bashrc
source ~/pod5/bin/activate
eval $CMD
Attached are some error reports (including from first attempt)
subset_23024531_34.err.txt
subset_23059021_27.err.txt
subset_23059021_32.err.txt
Hello @bef22
Thank you for reporting this issue. From the reports you've shared (thank you, they're very helpful!) we can see that your tasks are crashing during thread creation in several different places:
32 - Polars (Rust-style panic):
thread '<unnamed>' panicked at 'could not spawn threads: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }', /home/runner/work/polars/polars/polars/polars-core/src/lib.rs:61:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
33 - Importing the `lib_pod5` c api (which uses multiple threads):
File "/home/bef22/pod5/lib/python3.7/site-packages/lib_pod5/__init__.py", line 4, in <module>
from .pod5_format_pybind import (
ImportError: Resource temporarily unavailable
The curious log is 27:
terminate called without an active exception
but based on what we've seen above and from some googling, this also appears to be a threading issue.
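For context, "Resource temporarily unavailable" with OS code 11 is `EAGAIN`, which thread creation returns when a per-user process/thread or memory limit is hit. A diagnostic you could run inside a slurm job (my suggestion, not something from the reports above) to see the limits the job actually inherits:

```shell
# Limits that commonly cause EAGAIN on thread creation. Run these inside
# a slurm job, since the limits there may differ from the login node.
ulimit -u   # max user processes/threads
ulimit -v   # max virtual memory (thread stacks count against this)
```

If `ulimit -u` is small relative to the thread count the job needs, that would explain why some array tasks crash while others succeed.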
Under the hood, a large number of threads and processes are created by pod5, especially when subsetting. The `--threads` argument here is a misnomer, as it actually controls the number of additional processes being created. For each process the `lib_pod5` c_api creates 10 threads, and polars creates as many as it can get away with.
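As a rough illustration (my own back-of-envelope arithmetic, not an exact accounting), with `--threads 8` the `lib_pod5` workers alone account for on the order of 80 threads, before polars adds its own pool per process:

```shell
# Rough lower bound on threads created by `pod5 subset --threads 8`,
# assuming ~10 lib_pod5 threads per worker process as described above.
WORKERS=8
LIB_POD5_THREADS_PER_WORKER=10
echo $(( WORKERS * LIB_POD5_THREADS_PER_WORKER ))   # prints 80
```

That total can easily exceed the limits implied by a 9-CPU slurm allocation.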
My recommendation to improve the stability of pod5 while running on your slurm cluster is to apply one or more of the following:
- Reduce `pod5 subset --threads`
- Increase the requested slurm resources
- Reduce the number of threads spawned by polars (see the `POLARS_MAX_THREADS` environment variable and `polars.threadpool_size`)
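Putting those together, a more conservative submission might look like the sketch below. It is based on your script; the specific values (`--threads 4`, `POLARS_MAX_THREADS=2`) are illustrative starting points, not tuned recommendations.

```shell
# Cap polars' thread pool before pod5 starts (POLARS_MAX_THREADS must be
# set before polars is imported), and reduce the worker process count.
export POLARS_MAX_THREADS=2
export OMP_NUM_THREADS=1

POD5FILE="../pod5batches/batch_${SLURM_ARRAY_TASK_ID}.pod5"
CSVFILE="summary_${SLURM_ARRAY_TASK_ID}.csv"
pod5 subset --threads 4 --force-overwrite --csv "$CSVFILE" --duplicate-ok "$POD5FILE"
```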
I hope this helps resolve the issue. Please keep us informed with any progress.
Kind regards,
Rich
@bef22 , are there any updates on this issue or can we close it?
I processed my second library, which had a longer read length and therefore fewer pod5 batches to process, and changed the slurm script to:
#! Number of nodes to be allocated for the job (for single core jobs always leave this at 1)
#SBATCH --nodes=1
#! Number of tasks. By default SLURM assumes 1 task per node and 1 CPU per task. (for single core jobs always leave this at 1)
#SBATCH --ntasks=1
#! How many cores will be allocated per task? (for single core jobs always leave this at 1)
#SBATCH --cpus-per-task=9
#SBATCH --array=1-14
export OMP_NUM_THREADS=1
POD5FILE="../pod5batches/batch_${SLURM_ARRAY_TASK_ID}.pod5"
CSVFILE="summary_${SLURM_ARRAY_TASK_ID}.csv"
CMD="pod5 subset --threads 8 --force-overwrite --csv $CSVFILE --duplicate-ok $POD5FILE"
source ~/.bashrc
source ~/pod5/bin/activate
eval $CMD
This actually ran without any errors!
You can close this issue; it might have been the `export OMP_NUM_THREADS` setting in my first script that caused the problem.
Thank you for the update. I'm glad to hear your pipeline is working again!
Kind regards,
Rich