hzi-bifo/RiboDetector

ribodetector_cpu classifies and outputs more reads in the end than there were in the input data to begin with

erzakiev opened this issue · 13 comments

Downloaded and installed ribodetector through conda on an Apple M1 (Apple silicon) machine running macOS.

When running ribodetector_cpu it continuously prints out its progress, which is very cool, but at some point I realized that the number of classified reads reported by the logger (for instance 93.8 million in the latest log message below), as well as the number of reads in the output files, is drastically larger than the number of reads in the input files. What gives?

% ribodetector_cpu -t 10 \
  -l 100 \
  -i inputs/raw_data/124N_S1_L003_R1_001.fastq.gz inputs/raw_data/124N_S1_L003_R2_001.fastq.gz \
  -e rrna \
  --chunk_size 256 \
  -o outputs/reads.nonrrna.1.fq outputs/reads.nonrrna.2.fq

The input fastq files have ~12 million lines each, so approximately 4 million reads.

wc -l inputs/raw_data/124N_S1_L003_R1_001.fastq.gz
 12378859 inputs/raw_data/124N_S1_L003_R1_001.fastq.gz

and

wc -l inputs/raw_data/124N_S1_L003_R2_001.fastq.gz
 12840144 inputs/raw_data/124N_S1_L003_R2_001.fastq.gz

While it's running:

2023-02-22 00:40:05 : INFO  Using high MCC model file: /Users/administrateur/miniconda3/envs/ribodetector/lib/python3.8/site-packages/ribodetector/data/ribodetector_600k_variable_len70_101_epoch47.onnx on CPU
2023-02-22 00:40:05 : INFO  Classify reads with chunk size 256
2023-02-22 00:40:05 : INFO  Writing output non-rRNA sequences into file: outputs/reads.nonrrna.1.fq, outputs/reads.nonrrna.2.fq
2023-02-22 00:41:06 : INFO  262144 reads classified!
2023-02-22 00:42:07 : INFO  524288 reads classified!
2023-02-22 00:43:09 : INFO  786432 reads classified!
.......................................................................................
... ###after several hours of runtime###......................
.......................................................................................
2023-02-22 08:21:08 : INFO  91750400 reads classified!
2023-02-22 08:22:07 : INFO  92012544 reads classified!
2023-02-22 08:23:07 : INFO  92274688 reads classified!
2023-02-22 08:24:07 : INFO  92536832 reads classified!
2023-02-22 08:25:06 : INFO  92798976 reads classified!
2023-02-22 08:26:06 : INFO  93061120 reads classified!
2023-02-22 08:27:07 : INFO  93323264 reads classified!
2023-02-22 08:28:08 : INFO  93585408 reads classified!
2023-02-22 08:29:09 : INFO  93847552 reads classified!

The number of classified reads in the logger is more than 10x higher, but in the output files there are even more of them: they suddenly both have exactly the same number of 379 million reads (which is suspicious, I would assume?).

wc -l outputs/reads.nonrrna.1.fq  
 379258656 outputs/reads.nonrrna.1.fq
wc -l outputs/reads.nonrrna.2.fq  
 379258656 outputs/reads.nonrrna.2.fq

Thank you for reporting this issue. This is interesting. I have never tested it on an Apple silicon CPU, but it works on an older Apple MacBook with an Intel CPU. I assume multiprocessing works a bit differently on Apple silicon. I will investigate it. This software is designed more for x86-based servers, as real input data are usually large. If you have a Linux server, you could also try it there.
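
For what it's worth, one quick way to see which multiprocessing start method Python picks on a given machine (this diagnostic is just a suggestion and is not part of RiboDetector) is:

python -c "import multiprocessing as mp; print(mp.get_start_method())"

macOS has defaulted to "spawn" since Python 3.8, while Linux defaults to "fork"; whether that platform difference is actually related to the behaviour seen here is only a guess.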

Thank you for the response. I will test it on a Linux server shortly and get back to you.

How would you launch it on a SLURM scheduler in order to parallelize it across, say, 50 jobs? The script below seems to launch only one job on a single node...
runRiboDetector.sh:

#!/bin/bash
## SLURM parameters
#SBATCH --cpus-per-task=50
#SBATCH --threads-per-core=1
#SBATCH -t 0-09:59 # time (D-HH:MM) # 0-00:00 denotes absence of time limit
#SBATCH -o ./ribodetector_output/slurm.%A.%a.out # STDOUT
#SBATCH -e ./ribodetector_output/slurm.%A.%a.err # STDERR

srun ribodetector_cpu -t 1 \
  -l 100 \
  -i 124N_S1_L003_R1_001.fastq.gz 124N_S1_L003_R2_001.fastq.gz \
  -e rrna \
  --chunk_size 256 \
  -o reads.nonrrna.1.fq reads.nonrrna.2.fq

Launched with sbatch runRiboDetector.sh

@erzakiev I think you just need to change -t 1 to -t 50. This is what I usually do.
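
For reference, a minimal sketch of the adjusted script under that suggestion (same file names as above; only -t changes, so that the single RiboDetector process actually uses the 50 CPUs the job requests):

#!/bin/bash
#SBATCH --cpus-per-task=50
#SBATCH --threads-per-core=1
#SBATCH -t 0-09:59
#SBATCH -o ./ribodetector_output/slurm.%A.%a.out
#SBATCH -e ./ribodetector_output/slurm.%A.%a.err

# -t matches --cpus-per-task so all allocated CPUs go to one ribodetector_cpu process
srun ribodetector_cpu -t 50 \
  -l 100 \
  -i 124N_S1_L003_R1_001.fastq.gz 124N_S1_L003_R2_001.fastq.gz \
  -e rrna \
  --chunk_size 256 \
  -o reads.nonrrna.1.fq reads.nonrrna.2.fq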

On the cluster, when I run it with -t 50, it feels much faster, but it also overproduces: the output files are ~10x larger and the logfile again reports ~90 million classified reads instead of ~8 million. Just like it does on the local machine...

But with srun ribodetector_cpu -t 1 it seems to work correctly, albeit slowly.

If I combine -t 1 with sbatch --array=0-49 runRiboDetector.sh, the results also seem to overshoot, some jobs fail, and in the end the output is all over the place.

I will probably stick to running it with -t 1 and not use --array mode, at least for now.
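
For the record, a hedged sketch of how --array mode could be made to work: with the script above, every array task reads the same inputs and writes into the same two output files, which alone would explain inflated and garbled output. One option is to first split the pair of fastq files into 50 paired chunks (seqkit split2 is used here purely as an example tool; it is not part of RiboDetector, and the .part_NNN names below assume seqkit's default naming) and then let each task pick its own chunk and write its own outputs:

# one-time pre-split into 50 paired chunks
seqkit split2 -1 124N_S1_L003_R1_001.fastq.gz -2 124N_S1_L003_R2_001.fastq.gz -p 50 -O split/

# inside runRiboDetector.sh, launched with: sbatch --array=1-50 runRiboDetector.sh
i=$(printf "%03d" "$SLURM_ARRAY_TASK_ID")   # adjust if the actual chunk names differ
srun ribodetector_cpu -t 1 -l 100 -e rrna --chunk_size 256 \
  -i split/124N_S1_L003_R1_001.part_${i}.fastq.gz split/124N_S1_L003_R2_001.part_${i}.fastq.gz \
  -o reads.${i}.nonrrna.1.fq reads.${i}.nonrrna.2.fq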

The architectures of the compute nodes on which I tried RiboDetector are as follows:

(ribodetector) [zakiev@frontal ~]$ scontrol show node n005
NodeName=n005 Arch=x86_64 CoresPerSocket=32 
   CPUAlloc=50 CPUTot=64 CPULoad=1.05
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=n005 NodeHostName=n005 Version=20.11.2
   OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 
   RealMemory=510000 AllocMem=0 FreeMem=45476 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=production,research 
   BootTime=2021-10-29T09:02:48 SlurmdStartTime=2021-10-29T09:03:08
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=cpu=50
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

(ribodetector) [zakiev@frontal ~]$ scontrol show node n008
NodeName=n008 Arch=x86_64 CoresPerSocket=32 
   CPUAlloc=20 CPUTot=64 CPULoad=0.96
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=n008 NodeHostName=n008 Version=20.11.2
   OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 
   RealMemory=510000 AllocMem=0 FreeMem=1992 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=production,research 
   BootTime=2021-10-29T09:02:40 SlurmdStartTime=2021-10-29T09:03:00
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=cpu=20
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

(ribodetector) [zakiev@frontal ~]$ scontrol show node n012
NodeName=n012 Arch=x86_64 CoresPerSocket=32 
   CPUAlloc=20 CPUTot=64 CPULoad=1.12
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=n012 NodeHostName=n012 Version=20.11.2
   OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 
   RealMemory=510000 AllocMem=0 FreeMem=21325 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=production,research 
   BootTime=2021-10-29T09:02:43 SlurmdStartTime=2021-10-29T09:03:03
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=cpu=20
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

(ribodetector) [zakiev@frontal ~]$ scontrol show node n011
NodeName=n011 Arch=x86_64 CoresPerSocket=32 
   CPUAlloc=56 CPUTot=64 CPULoad=2.83
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=n011 NodeHostName=n011 Version=20.11.2
   OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 
   RealMemory=510000 AllocMem=0 FreeMem=28392 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=production,research 
   BootTime=2021-10-29T09:02:39 SlurmdStartTime=2021-10-29T09:03:02
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=cpu=56
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

This is strange. Which version did you use? I usually allocate the resources first with srun --pty -p cpu --qos verylong --cpus-per-task=40 --mem=256g --time=600:00:00 and then run:

ribodetector_cpu -t 10 -l 80 -i datasets/s1.{1,2}.fq -r outputs/s1.ribodetector.rrna.{1,2}.fq.gz -o outputs/s1.ribodetector.norrna.{1,2}.fq.gz -e rrna

Never had such an issue.

If you are using the latest version and still have this issue, could you send me your fq file for a test?

Sure, I will re-run with the parameters you suggested:

srun --pty -p cpu --qos verylong --cpus-per-task=40 --mem=256g --time=600:00:00

ribodetector_cpu -t 10 -l 80 -i datasets/s1.{1,2}.fq -r outputs/s1.ribodetector.rrna.{1,2}.fq.gz -o outputs/s1.ribodetector.norrna.{1,2}.fq.gz -e rrna

The version should be the latest; I installed it from conda this morning:

(ribodetector) [zakiev@frontal ~]$ ribodetector -v
ribodetector 0.2.7

send me your fq file for a test?

In the meantime, while I re-run RiboDetector with the suggested parameters, that would indeed be great, but please feel no pressure; it may just be bad parameters on my part, and I will double-check first. But if you do wish to take a look yourself, feel free to do so; could you post your email or some other means of communication so I can send you a link to download the fastq files (~6 Gb)?

wc -l inputs/raw_data/124N_S1_L003_R1_001.fastq.gz
12378859 inputs/raw_data/124N_S1_L003_R1_001.fastq.gz

Could you try to count the number of reads with

zcat fq.gz | wc -l

wc -l outputs/reads.nonrrna.1.fq
379258656 outputs/reads.nonrrna.1.fq

This should be 379258656/4 = 94814664 reads
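
For anyone landing here later, a small shell sketch for getting read counts rather than line counts (gzip -cd is the portable spelling of zcat; the paths are the ones from this thread):

# reads in a gzipped fastq: wc -l on the .gz itself counts lines of compressed binary, which is meaningless
echo $(( $(gzip -cd inputs/raw_data/124N_S1_L003_R1_001.fastq.gz | wc -l) / 4 ))

# reads in an uncompressed fastq
echo $(( $(wc -l < outputs/reads.nonrrna.1.fq) / 4 ))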

download the fastq files (~6 Gb)

Do you mean 6 gigabytes? A compressed fastq file with 4 million reads should not be so large, right? My email is dawnmsg(at)gmail.com

Hey, sorry for the delay in response.

zcat fq.gz | wc -l

Oops, what a blunder on my part! Yes, I was running wc -l on the files without unzipping them... Sorry about that.
In reality there are 371 million lines (~92.9 million reads):

zcat 124T_S5_L003_R1_001.fastq.gz | wc -l
371452316

Let me double-check everything and then we can close the issue, as I believe it was an error on my part. But it's good that we discussed this, because someone else might make the same mistake one day and hopefully this thread will be helpful to them.

OK, it works; sorry for the false alarm.

Hey, just a quick question: how crucial is it to set the -l parameter exactly to the length of my reads? FastQC, for example, reports the length of all my reads as 76 in a given experiment:
[Screenshot 2023-02-27 at 15:46:47: FastQC report showing a read length of 76]
In this particular instance I ran the algorithm with the default setting of -l 80, and I wonder whether I need to re-run it with -l 76 for better results.

I would assume -l 76 gives a more accurate result, but the difference should be small.
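
If it helps, the read length can also be checked straight from the fastq without FastQC — a small sketch (the file name is just a placeholder) that tallies the lengths seen in the first 1000 reads:

gzip -cd sample_R1.fastq.gz | head -4000 | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c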