ConFindr :: problem with bbtools
Closed this issue · 16 comments
Hello, while running the example test set with bbtools version 38.01 and the --rmlst option on (not sure whether that matters), I got the following error message.
Tested with bbtools versions bbmap/37.78 and bbmap/38.91.
2021-12-08 15:19:05 Encountered error when attempting to run ConFindr on sample example. Skipping...
2021-12-08 15:19:05 Error encounted was:
Traceback (most recent call last):
File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/confindr.py", line 1051, in confindr
find_contamination(pair=fastq,
File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/confindr.py", line 623, in find_contamination
out, err, cmd = bbtools.bbduk_bait(reference=sample_database,
File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/wrappers/bbtools.py", line 258, in bbduk_bait
out, err = run_subprocess(cmd)
File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/wrappers/bbtools.py", line 16, in run_subprocess
raise subprocess.CalledProcessError(x.returncode, cmd=command)
subprocess.CalledProcessError: Command 'bbduk.sh in=example-data/example_R1.fastq.gz in2=example-data/example_R2.fastq.gz outm=example-out/example/rmlst_R1.fastq.gz outm2=example-out/example/rmlst_R2.fastq.gz ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta threads=56' returned non-zero exit status 1.
That was due to the following BBDuk failure:
rpm_maker:ConFindr/0.7.4 > bbduk.sh in=example-data/example_R1.fastq.gz in2=example-data/example_R2.fastq.gz outm=example-out/example/rmlst_R1.fastq.gz outm2=example-out/example/rmlst_R2.fastq.gz ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta threads=56
java -ea -Xmx52354m -Xms52354m -cp /opt/gensoft/exe/bbmap/38.91/libexec/current/ jgi.BBDuk in=example-data/example_R1.fastq.gz in2=example-data/example_R2.fastq.gz outm=example-out/example/rmlst_R1.fastq.gz outm2=example-out/example/rmlst_R2.fastq.gz ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta threads=56
Executing jgi.BBDuk [in=example-data/example_R1.fastq.gz, in2=example-data/example_R2.fastq.gz, outm=example-out/example/rmlst_R1.fastq.gz, outm2=example-out/example/rmlst_R2.fastq.gz, ref=/opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta, threads=56]
Version 38.91
Set threads to 56
0.038 seconds.
Initial:
Memory: max=52610m, total=52610m, free=47943m, used=4667m
java.lang.Exception:
An input file appears to be misformatted:
The character with ASCII code 39 appeared where a base was expected: '''
Sequence #0
Sequence ID: 'BACT000001_10671'
regards
Eric
Are you able to take a look at /opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta to confirm that the file hasn't been corrupted (or something else is wrong with it)? BACT000001_10671 is the first entry in the file, and should be 1674 bp long.
Also, just to confirm, you do have the credentials required to download the rMLST databases as mentioned in the docs?
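The check suggested above (is the first entry intact, and is it the expected length?) can be sketched without any third-party library. This is only a minimal sketch: `read_fasta` and `first_record_report` are hypothetical helper names, not part of ConFindr, and the set of accepted characters is an assumption (IUPAC nucleotide codes plus the gap character).

```python
from typing import Iterator, Set, Tuple

# Assumption: IUPAC nucleotide codes plus "-" are the only valid characters.
VALID_CHARS = set("ACGTUNRYSWKMBDHV-")


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def first_record_report(path: str) -> Tuple[str, int, Set[str]]:
    """Return the ID, length, and unexpected characters of the first record."""
    first = next(read_fasta(path), None)
    if first is None:
        raise ValueError(f"no FASTA records found in {path}")
    header, seq = first
    unexpected = set(seq.upper()) - VALID_CHARS
    return header, len(seq), unexpected
```

For a healthy Escherichia_db.fasta, the first record should report the ID BACT000001_10671, the expected length, and an empty set of unexpected characters. A stray quote character in that set would match the ASCII code 39 that BBDuk complains about.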
Hmm.
rpm_maker:ConFindr/ConFindr-0.7.4 > head -n 3 /opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db.fasta
>BACT000001_10671
b'ATGACTGAATCTTTTGCTCAACTCTTTGAAGAGTCCTTAAAAGAAATCGAAACCCGCC
CGGGTTCTATCGTTCGTGGCGTTGTTGTTGCTATCGACAAAGATGTAGTACTGGTTGACG
Where does the b' come from?
Yes, the rMLST databases are already downloaded and available in CONFINDR_DB:
rpm_maker:ConFindr/ConFindr-0.7.4 > ls $CONFINDR_DB
Escherichia_db.fasta Listeria_db_cgderived.fasta
Escherichia_db_cgderived.fasta Salmonella_db_cgderived.fasta
Escherichia_db_cgderived.fasta.fai download_date.txt
Escherichia_db_cgderived_kma.comp.b gene_allele.txt
Escherichia_db_cgderived_kma.length.b profiles.txt
Escherichia_db_cgderived_kma.name rMLST_combined.fasta
Escherichia_db_cgderived_kma.seq.b refseq.msh
Bytes. I wonder if the encoding has changed by default in one of the downloading or formatting libraries used to create those files.
Can you check rMLST_combined.fasta to see if it is also in bytes?
Yes, bytes in Python, but they should not appear in FASTA files.
I'll try to find out what is responsible.
We can already rule out the download step, as the original files are OK.
Yes, rMLST_combined.fasta is also in bytes repr.
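For readers unfamiliar with the symptom: in Python, calling `str()` on a `bytes` object returns its repr, including the `b'` prefix, so writing that string to a file produces lines like the one shown above. The sketch below illustrates the general mechanism and a way to scan a FASTA file for it. It is not confirmed to be the exact code path in ConFindr, and `to_text` and `scan_fasta_for_bytes_repr` are hypothetical helper names.

```python
from typing import List, Union


def to_text(seq: Union[bytes, str]) -> str:
    """Convert sequence data to plain text without leaking a bytes repr."""
    if isinstance(seq, bytes):
        # str(seq) would give "b'ATG...'"; decoding gives the bare letters.
        return seq.decode("ascii")
    return str(seq)


def scan_fasta_for_bytes_repr(path: str, limit: int = 5) -> List[int]:
    """Return up to `limit` line numbers whose sequence line starts with b' or b\"."""
    hits: List[int] = []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            if line.startswith(">"):
                continue  # header lines are not sequence data
            if line.startswith(("b'", 'b"')):
                hits.append(lineno)
                if len(hits) >= limit:
                    break
    return hits
```

Running `scan_fasta_for_bytes_repr` on each file in the database directory gives a quick way to tell which files need to be regenerated. An empty list means no sequence line starts with a bytes-repr prefix.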
Which versions of Python and Biopython are you using?
Here: Python 3.8.1 and biopython-1.79.
Trying right now with biopython-1.78.
Yes.... now I know why this sounded familiar. I believe that issue #27 is related.
Yes, I have already patched line 209 of database_setup.py.
With biopython-1.78 there is no trouble:
head -n 3 /opt/gensoft/exe/ConFindr/0.7.4/share/rMLST_combined.fasta
>BACT000001_1
ATGGAAAATTTTGCTCAGCTGTTGGAAGAAAGCTTTACCCTGCAAGAAATGAACCCGGGT
GAGGTGATTACCGCTGAAGTAGTGGCAATCGACCAAAACTTCGTTACCGTAAACGCAGGT
Waiting for Escherichia_db.fasta to be generated.
Escherichia_db.fasta is OK too.
confindr.py -i example-data -o example-out --rmlst also ran successfully.
Hi,
I'm getting an error message similar to Eric's when running
db="path_to_confindr_db"
confindr.py -i confindr_test -o out_test -d $db --rmlst -t 10 -Xmx 4g
Error message:
Traceback (most recent call last):
File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/confindr.py", line 1067, in confindr
fasta=args.fasta)
File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/confindr.py", line 638, in find_contamination
returncmd=True)
File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 258, in bbduk_bait
out, err = run_subprocess(cmd)
File "/home/schlae0003/st0001/mambaforge/envs/confindr/lib/python3.7/site-packages/confindr_src/wrappers/bbtools.py", line 16, in run_subprocess
raise subprocess.CalledProcessError(x.returncode, cmd=command)
subprocess.CalledProcessError: Command 'bbduk.sh in=confindr_test/F207_R1.fastq.gz in2=confindr_test/F207_R2.fastq.gz outm=out_test/F207/rmlst_R1.fastq.gz outm2=out_test/F207/rmlst_R2.fastq.gz ref=/home/schlae0003/GROUP/taxanomy_databases/confindr_db/rMLST_combined.fasta threads=10 Xmx=4g' returned non-zero exit status 1.
However, confindr_log.txt says:
Command used: mash screen /home/schlae0003/GROUP/taxanomy_databases/confindr_db/refseq.msh confindr_test/F207_R1.fastq.gz confindr_test/F207_R2.fastq.gz -p 10 -w -i 0.85 | sort -gr > out_test/F207/screen.tab
STDERR: Loading /home/schlae0003/GROUP/taxanomy_databases/confindr_db/refseq.msh...
1023303 distinct hashes.
Streaming from 2 inputs...
Estimated distinct k-mers in mixture: 59041145
Summing shared...
Reallocating to winners...
Computing coverage medians...
Writing output...
refseq.msh seems to be a binary file in my case...
I'm using confindr 0.7.4, mash 2.3 and BBMap 38.45
My confindr_db looks like this
download_date.txt
gene_allele.txt
profiles.txt
rMLST_combined.fasta
Escherichia_db_cgderived.fasta
Listeria_db_cgderived.fasta
refseq.msh
Salmonella_db_cgderived.fasta
Thanks,
Lea
I believe refseq.msh should be a binary file.
What versions of BioPython and Python are you using?
Can you run the BBDuk command separately to see if there's any useful output?
bbduk.sh in=confindr_test/F207_R1.fastq.gz in2=confindr_test/F207_R2.fastq.gz outm=out_test/F207/rmlst_R1.fastq.gz outm2=out_test/F207/rmlst_R2.fastq.gz ref=/home/schlae0003/GROUP/taxanomy_databases/confindr_db/rMLST_combined.fasta threads=10 Xmx=4g
Thanks,
A
I'm using Python=3.7 and BioPython=1.78
When running the BBDuk command separately, I realised that there was a memory issue. Normally this would give me a core dump or an "out of memory" message on the cluster, but for some reason it didn't... I was able to fix the problem by going overboard with memory: threads=2 Xmx=64g (I'm working with bacteria, fastq.gz files around 500M). Now confindr also runs without errors.
Thanks!!!
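Both reports in this thread were diagnosed by re-running the failing bbduk.sh command by hand, because the traceback only shows the exit status. When scripting around a tool like this, capturing stderr makes the underlying message visible right away. A minimal sketch, assuming Python 3.7 or later; `run_and_report` is a hypothetical helper, not ConFindr's own `run_subprocess`:

```python
import shlex
import subprocess
from typing import Tuple


def run_and_report(command: str, tail_chars: int = 2000) -> Tuple[int, str, str]:
    """Run a command, capture its output, and print stderr if it fails."""
    completed = subprocess.run(
        shlex.split(command),  # split the command string the way a shell would
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        print(f"Command failed with exit status {completed.returncode}:")
        # The end of stderr usually holds the actual error message.
        print(completed.stderr[-tail_chars:])
    return completed.returncode, completed.stdout, completed.stderr
```

Passing the exact command string from the CalledProcessError message to `run_and_report` should surface whatever the tool wrote to stderr, such as a misformatted-input or memory message, without a separate manual run.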
Fixed by 19d0d1d in v0.8.1.