ribodetector_cpu classifies and outputs more reads in the end than there were in the input data to begin with
erzakiev opened this issue · 13 comments
I downloaded and installed ribodetector through conda on an Apple M1 (Apple silicon) machine running macOS.
When running ribodetector_cpu, it continuously prints out its progress, which is very cool, but at some point I realized that the number of classified reads reported by the logger (for instance, 93.8 million in the latest log message below), as well as the number of reads in the output files, is drastically larger than the number of reads in the input files. What gives?
% ribodetector_cpu -t 10 \
-l 100 \
-i inputs/raw_data/124N_S1_L003_R1_001.fastq.gz inputs/raw_data/124N_S1_L003_R2_001.fastq.gz \
-e rrna \
--chunk_size 256 \
-o outputs/reads.nonrrna.1.fq outputs/reads.nonrrna.2.fq
The input fastq files have ~12 million lines each, so approximately 3 million reads each:
wc -l inputs/raw_data/124N_S1_L003_R1_001.fastq.gz
12378859 inputs/raw_data/124N_S1_L003_R1_001.fastq.gz
and
wc -l inputs/raw_data/124N_S1_L003_R2_001.fastq.gz
12840144 inputs/raw_data/124N_S1_L003_R2_001.fastq.gz
While it's running:
2023-02-22 00:40:05 : INFO Using high MCC model file: /Users/administrateur/miniconda3/envs/ribodetector/lib/python3.8/site-packages/ribodetector/data/ribodetector_600k_variable_len70_101_epoch47.onnx on CPU
2023-02-22 00:40:05 : INFO Classify reads with chunk size 256
2023-02-22 00:40:05 : INFO Writing output non-rRNA sequences into file: outputs/reads.nonrrna.1.fq, outputs/reads.nonrrna.2.fq
2023-02-22 00:41:06 : INFO 262144 reads classified!
2023-02-22 00:42:07 : INFO 524288 reads classified!
2023-02-22 00:43:09 : INFO 786432 reads classified!
... ### after several hours of runtime ### ...
2023-02-22 08:21:08 : INFO 91750400 reads classified!
2023-02-22 08:22:07 : INFO 92012544 reads classified!
2023-02-22 08:23:07 : INFO 92274688 reads classified!
2023-02-22 08:24:07 : INFO 92536832 reads classified!
2023-02-22 08:25:06 : INFO 92798976 reads classified!
2023-02-22 08:26:06 : INFO 93061120 reads classified!
2023-02-22 08:27:07 : INFO 93323264 reads classified!
2023-02-22 08:28:08 : INFO 93585408 reads classified!
2023-02-22 08:29:09 : INFO 93847552 reads classified!
The number of classified reads in the logger is 10x higher than expected, but the output files contain even more: they both suddenly have exactly the same number, 379 million (which is suspicious, I would assume?):
wc -l outputs/reads.nonrrna.1.fq
379258656 reads.nonrrna.1.fq
wc -l outputs/reads.nonrrna.2.fq
379258656 outputs/reads.nonrrna.2.fq
Thank you for reporting this issue. This is interesting. I have never tested it on an Apple silicon CPU, but it works on an older Apple MacBook with an Intel CPU. I assume multiprocessing works a bit differently on Apple silicon; I will investigate. This software is designed more for x86-based servers, as the real input data are usually large. If you have a Linux server, you could also try it out there.
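For background on that hypothesis: Python's default multiprocessing start method differs between macOS ('spawn' since Python 3.8) and Linux ('fork'), which is one generic way worker behavior can diverge across platforms. A quick check, purely as an illustration and not RiboDetector's actual code:
python -c 'import multiprocessing as mp; print(mp.get_start_method())'
# prints 'spawn' on macOS and 'fork' on Linux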
Thank you for the response. I will test it on a Linux server shortly and get back to you.
How would you launch it on a SLURM scheduler in order to parallelize it across, say, 50 jobs? The script below seems to launch only one job on a single node...
runRiboDetector.sh:
#!/bin/bash
## SLURM parameters
#SBATCH --cpus-per-task=50
#SBATCH --threads-per-core=1
#SBATCH -t 0-09:59 # time (D-HH:MM) # 0-00:00 denotes absence of time limit
#SBATCH -o ./ribodetector_output/slurm.%A.%a.out # STDOUT
#SBATCH -e ./ribodetector_output/slurm.%A.%a.err # STDERR
srun ribodetector_cpu -t 1 \
-l 100 \
-i 124N_S1_L003_R1_001.fastq.gz 124N_S1_L003_R2_001.fastq.gz \
-e rrna \
--chunk_size 256 \
-o reads.nonrrna.1.fq reads.nonrrna.2.fq
Launched with sbatch runRiboDetector.sh
@erzakiev I think you just need to change -t 1 to -t 50. This is what I usually do.
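For reference, the script above with only that change applied (a sketch; paths and SLURM parameters unchanged from the original):
#!/bin/bash
## SLURM parameters
#SBATCH --cpus-per-task=50
#SBATCH --threads-per-core=1
#SBATCH -t 0-09:59 # time (D-HH:MM)
#SBATCH -o ./ribodetector_output/slurm.%A.%a.out # STDOUT
#SBATCH -e ./ribodetector_output/slurm.%A.%a.err # STDERR
# -t now matches --cpus-per-task so the job uses all allocated CPUs
srun ribodetector_cpu -t 50 \
-l 100 \
-i 124N_S1_L003_R1_001.fastq.gz 124N_S1_L003_R2_001.fastq.gz \
-e rrna \
--chunk_size 256 \
-o reads.nonrrna.1.fq reads.nonrrna.2.fq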
On the cluster, when I run it with -t 50, it feels much faster, but it also overproduces ~10x the expected output file size and number of classified reads in the logfile (~90 million instead of ~8), just like it does on the local machine...
But with srun ribodetector_cpu -t 1 it seems to work correctly, albeit slowly.
If I combine -t 1 with sbatch --array=0-49 runRiboDetector.sh, the results also overshoot, some jobs fail, and in the end the output is all over the place (presumably because all 50 array tasks process the same input and write to the same output files).
I will probably stick to running it with -t 1 and not use --array mode, at least for now.
The architectures of the compute nodes on which I tried RiboDetector are as follows:
(ribodetector) [zakiev@frontal ~]$ scontrol show node n005
NodeName=n005 Arch=x86_64 CoresPerSocket=32
CPUAlloc=50 CPUTot=64 CPULoad=1.05
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=n005 NodeHostName=n005 Version=20.11.2
OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020
RealMemory=510000 AllocMem=0 FreeMem=45476 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
Partitions=production,research
BootTime=2021-10-29T09:02:48 SlurmdStartTime=2021-10-29T09:03:08
CfgTRES=cpu=64,mem=510000M,billing=64
AllocTRES=cpu=50
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
(ribodetector) [zakiev@frontal ~]$ scontrol show node n008
NodeName=n008 Arch=x86_64 CoresPerSocket=32
CPUAlloc=20 CPUTot=64 CPULoad=0.96
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=n008 NodeHostName=n008 Version=20.11.2
OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020
RealMemory=510000 AllocMem=0 FreeMem=1992 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
Partitions=production,research
BootTime=2021-10-29T09:02:40 SlurmdStartTime=2021-10-29T09:03:00
CfgTRES=cpu=64,mem=510000M,billing=64
AllocTRES=cpu=20
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
(ribodetector) [zakiev@frontal ~]$ scontrol show node n012
NodeName=n012 Arch=x86_64 CoresPerSocket=32
CPUAlloc=20 CPUTot=64 CPULoad=1.12
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=n012 NodeHostName=n012 Version=20.11.2
OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020
RealMemory=510000 AllocMem=0 FreeMem=21325 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
Partitions=production,research
BootTime=2021-10-29T09:02:43 SlurmdStartTime=2021-10-29T09:03:03
CfgTRES=cpu=64,mem=510000M,billing=64
AllocTRES=cpu=20
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
(ribodetector) [zakiev@frontal ~]$ scontrol show node n011
NodeName=n011 Arch=x86_64 CoresPerSocket=32
CPUAlloc=56 CPUTot=64 CPULoad=2.83
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=n011 NodeHostName=n011 Version=20.11.2
OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020
RealMemory=510000 AllocMem=0 FreeMem=28392 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
Partitions=production,research
BootTime=2021-10-29T09:02:39 SlurmdStartTime=2021-10-29T09:03:02
CfgTRES=cpu=64,mem=510000M,billing=64
AllocTRES=cpu=56
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
This is strange. Which version did you use? I usually allocate the resources first with:
srun --pty -p cpu --qos verylong --cpus-per-task=40 --mem=256g --time=600:00:00
Then:
ribodetector_cpu -t 10 -l 80 -i datasets/s1.{1,2}.fq -r outputs/s1.ribodetector.rrna.{1,2}.fq.gz -o outputs/s1.ribodetector.norrna.{1,2}.fq.gz -e rrna
I have never had such an issue.
If you are using the latest version and still have this issue, could you send me your fq file for a test?
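(As an aside, the {1,2} in those paths is shell brace expansion, so each pair of mates is passed as two arguments; a quick way to preview the expansion:)
echo datasets/s1.{1,2}.fq
# datasets/s1.1.fq datasets/s1.2.fq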
Sure, I will re-run with the parameters you suggested.
The version should be the latest; I installed it from conda this morning:
(ribodetector) [zakiev@frontal ~]$ ribodetector -v
ribodetector 0.2.7
send me your fq file for a test?
While I re-run RiboDetector with the suggested parameters, that would indeed be great, but please feel no pressure; it may just be bad parameters on my part, and I will double-check first. If you do wish to take a look yourself, could you post your email or some other means of contact so I can send you a link to download the fastq files (~6 Gb)?
wc -l inputs/raw_data/124N_S1_L003_R1_001.fastq.gz
12378859 inputs/raw_data/124N_S1_L003_R1_001.fastq.gz
Could you try to count the number of reads with
zcat fq.gz | wc -l
wc -l outputs/reads.nonrrna.1.fq
379258656 reads.nonrrna.1.fq
This should be 379258656/4 = 94814664 reads
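(To count reads directly, one could run something like the following — the division by 4 reflects fastq's four lines per record; the file name is the one from this thread:)
# reads = decompressed lines / 4
echo $(( $(zcat inputs/raw_data/124N_S1_L003_R1_001.fastq.gz | wc -l) / 4 ))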
download the fastq files (~6 Gb)
Do you mean 6 gigabytes? A compressed fastq file with ~4 million reads should not be that large, right? My email is dawnmsg(at)gmail.com
Hey, sorry for the delay in responding.
zcat fq.gz | wc -l
Oops, what a blunder on my part! Yes, I was running wc -l on the files without decompressing them first... Sorry about that.
In reality there are 371 million lines (371452316 / 4 ≈ 92.9 million reads):
zcat 124T_S5_L003_R1_001.fastq.gz | wc -l
371452316
Let me double-check everything, and then we can close the issue, as I believe it was an error on my part. But it's good that we discussed this, because someone else might make the same mistake one day, and this thread will hopefully be helpful to them.
OK, it works. Sorry for the false alarm.
Hey, just a quick question: how crucial is it to set the -l parameter to the exact length of my reads? FastQC, for example, reports the length of all my reads as 76 in a given experiment. In this particular instance I ran the algorithm with the default setting of -l 80, and I wonder if I need to re-run it with -l 76 for better results.
I would assume -l 76 gives a more accurate result, but the difference should be small.
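(One way to check the actual read-length distribution before choosing -l — a sketch with a placeholder file name:)
# print each read's length (line 2 of every 4-line fastq record), then tally
zcat sample_R1.fastq.gz | awk 'NR % 4 == 2 { print length($0) }' | sort -n | uniq -c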