hzi-bifo/RiboDetector

ribodetector_cpu hangs with SLURM

gianfilippo opened this issue · 4 comments

Hi,

I tried your package on an interactive SLURM session, and it worked.

I then tried to submit it as a job via SLURM and it hangs at

2023-03-09 16:13:36 : INFO Using high MCC model file: /home/conda_envs/ribodetector/lib/python3.9/site-packages/ribodetector/data/ribodetector_600k_variable_len70_101_epoch47.onnx on CPU

I already tried reinstalling, and nothing changed.

The command I issued in both sessions is
ribodetector_cpu -t 8 -l 92 -i $FASTQ1.fq.gz $FASTQ1.fq.gz -e rrna -o $outFASTQ1.nonrrna.1.fq $outFASTQ2.nonrrna.2.fq

What can I do?

Thanks

Could you post the SLURM script or command you used to submit the job? You need to set --cpus-per-task to the number of CPU cores you need and set --threads-per-core to 1.

I'm running into the same issue here. I submit it with sbatch, and it runs within a singularity container from here.
At the start there are two active processes on the node, but after 5 minutes nothing is running anymore.

This is my script:

#!/usr/bin/env bash

#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=12
#SBATCH --threads-per-core=1

cd /workdir

MEAN_READ_LENGTH=`zcat results/fastp/MP_35_R1_trimmed.fastq.gz | head -1000 | awk '{if(NR%4==2) {count++; bases += length} } END {print int(bases/count)}' || true`

echo "Estimated read length: $MEAN_READ_LENGTH" 

singularity exec containers/ribodetector_0.2.7-cpu.sif \
ribodetector_cpu \
--len "$MEAN_READ_LENGTH" \
--threads "$SLURM_CPUS_PER_TASK" \
--input results/fastp/MP_35_R1_trimmed.fastq.gz results/fastp/MP_35_R2_trimmed.fastq.gz \
--output results/ribodetector/MP_35_R1.fastq.gz results/ribodetector/MP_35_R2.fastq.gz \
--rrna results/ribodetector/MP_35_R1_rrna.fastq.gz results/ribodetector/MP_35_R2_rrna.fastq.gz \
--ensure rrna

It works now. The issue was not setting --chunk_size which led to memory issues.

RTFM.....

dawnmy commented

> It works now. The issue was not setting --chunk_size which led to memory issues.
>
> RTFM.....

It is great that you figured out the solution. This will help other users. I will add it to the FAQ in the README.
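
For anyone landing here with the same hang, a minimal sketch of a submission script with --chunk_size set might look like the following. It reuses the SBATCH settings and flags from the script posted in this thread. The file names, READ_LEN, and the CHUNK_SIZE value of 256 are placeholders and assumptions, not recommendations, and the Singularity wrapper is left out for brevity. Check `ribodetector_cpu --help` for how --chunk_size relates to memory use and thread count before picking a value.

```shell
#!/usr/bin/env bash
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=12
#SBATCH --threads-per-core=1

# Inside a job SLURM sets SLURM_CPUS_PER_TASK; default to 1 when run elsewhere.
THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Illustrative values only (assumptions): tune both for your data and node memory.
READ_LEN=100
CHUNK_SIZE=256

# Guard so the sketch fails soft when the tool is not on PATH.
if command -v ribodetector_cpu >/dev/null 2>&1; then
    # Same flags as the script above, plus --chunk_size; file names are placeholders.
    ribodetector_cpu \
        --len "$READ_LEN" \
        --threads "$THREADS" \
        --chunk_size "$CHUNK_SIZE" \
        --input sample_R1.fastq.gz sample_R2.fastq.gz \
        --output sample_R1.nonrrna.fastq.gz sample_R2.nonrrna.fastq.gz \
        --ensure rrna
else
    echo "ribodetector_cpu not found on PATH" >&2
fi
```

If the job still stalls or is killed for memory, lowering CHUNK_SIZE or raising --mem-per-cpu are the two knobs to try first.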