AAI calculation error: tmp file issue?
bluegenes opened this issue · 6 comments
Hi folks,
I'm trying to use a snakemake workflow to run a number of EzAAI jobs. However, I'm getting an error during `extract` that makes me a bit worried about whether the files will always be correct.
For `extract`, I'm generating a temporary gunzipped file, since I have all references available as `.fna.gz`. Here's an example command:
```
gunzip -c /path/to/ref/files/GCA_009909065.1_genomic.fna.gz > GCA_009909065.1.tmp.fna
java -jar EzAAI_latest.jar extract -i GCA_009909065.1.tmp.fna -o GCA_009909065.1.db -l GCA_009909065.1 > GCA_009909065.1.extract.log
rm GCA_009909065.1.tmp.fna
```
The error I'm (sometimes!) getting is the following:
```
java.io.FileNotFoundException: /tmp/prodigal.faa (No such file or directory)
    at java.base/java.io.FileInputStream.open0(Native Method)
    at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
    at java.base/java.io.FileReader.<init>(FileReader.java:75)
    at leb.process.ProcCDSPredictionByProdigal.execute(ProcCDSPredictionByProdigal.java:147)
    at leb.main.EzAAI.runExtract(EzAAI.java:224)
    at leb.main.EzAAI.run(EzAAI.java:482)
    at leb.main.EzAAI.main(EzAAI.java:518)
```
If I'm running more than one EzAAI job on the same node, how does that affect the `/tmp/prodigal.faa` file for each extraction? Will the later job always fail, or could it silently overwrite the first job's `/tmp/prodigal.faa`? Running failed jobs a second time usually resolves the issue.
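For what it's worth, the failure pattern is consistent with a classic shared-temp-file race. The sketch below is not EzAAI code, just a deterministic shell replay of the interleaving I suspect (the job names and the `prodigal.faa` stand-in path are illustrative):

```shell
#!/usr/bin/env bash
set -eu
workdir=$(mktemp -d)             # scratch dir so the demo never touches a real run
shared="$workdir/prodigal.faa"   # stand-in for the hardcoded /tmp/prodigal.faa

echo ">jobA" > "$shared"         # job A: prodigal writes its proteins
echo ">jobB" > "$shared"         # job B starts and clobbers the same path
cat "$shared" >/dev/null         # job A reads back -- silently gets job B's data!
rm "$shared"                     # job A cleans up, deleting job B's file
if cat "$shared" >/dev/null 2>&1; then   # job B now tries to read its own output
  echo "job B read OK"
else
  echo "job B: No such file or directory"  # Java surfaces this as FileNotFoundException
fi > race_demo.out
cat race_demo.out
rm -r "$workdir"
```

Note that the same interleaving would also explain silently wrong results, not just crashes: in the step before the `rm`, job A reads job B's proteins without any error.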
Note, I am running EzAAI in the following conda environment:
```
conda list -p /home/ntpierce/miniconda3/7841dd127abab0c21fbc5a5b78f2aefd
# packages in environment at /home/ntpierce/miniconda3/7841dd127abab0c21fbc5a5b78f2aefd:
#
# Name             Version    Build          Channel
_libgcc_mutex      0.1        conda_forge    conda-forge
_openmp_mutex      4.5        1_gnu          conda-forge
bzip2              1.0.8      h7f98852_4     conda-forge
ca-certificates    2021.10.8  ha878542_0     conda-forge
gawk               5.1.0      h7f98852_0     conda-forge
gettext            0.19.8.1   h73d1719_1008  conda-forge
libffi             3.4.2      h7f98852_5     conda-forge
libgcc-ng          11.2.0     h1d223b6_12    conda-forge
libgomp            11.2.0     h1d223b6_12    conda-forge
libidn2            2.3.2      h7f98852_0     conda-forge
libstdcxx-ng       11.2.0     he4da1e4_12    conda-forge
libunistring       0.9.10     h7f98852_0     conda-forge
libzlib            1.2.11     h36c2ea0_1013  conda-forge
mmseqs2            13.45111   h95f258a_1     bioconda
openssl            3.0.0      h7f98852_2     conda-forge
prodigal           2.6.3      h779adbc_3     bioconda
wget               1.20.3     ha35d2d1_1     conda-forge
zlib               1.2.11     h36c2ea0_1013  conda-forge
```
Upon further testing with `calculate`, I do think something is going wrong with `.db` generation using this strategy. When I use the `.db` files generated with simultaneous snakemake jobs distributed across the cluster, I often get the error below. When I regenerate the `.db` files in an interactive session, without running simultaneous jobs, `calculate` works as intended.
So it seems multiple `extract` jobs cannot be run simultaneously. Please let me know if you have any suggestions.
`calculate` error:
```
java.lang.ArithmeticException: / by zero
    at leb.process.ProcCalcPairwiseAAI.calcIdentityWithDetails(ProcCalcPairwiseAAI.java:462)
    at leb.process.ProcCalcPairwiseAAI.pairwiseMmseqs(ProcCalcPairwiseAAI.java:643)
    at leb.process.ProcCalcPairwiseAAI.calculateProteomePairWithDetails(ProcCalcPairwiseAAI.java:250)
    at leb.main.EzAAI.runCalculate(EzAAI.java:351)
    at leb.main.EzAAI.run(EzAAI.java:483)
    at leb.main.EzAAI.main(EzAAI.java:518)
```
Follow-up question: I assume this error means there is no shared similarity, so AAI cannot be calculated. Is that the case, or do you report no similarity as `0` in the output file? It would be great to have a `0` value reported (and a normal program exit) if it's not a different error! Happy to move this to a separate issue if that would be helpful.
Hello,
Thank you so much for these detailed reports.
The first issue was caused by a simple code mistake that gave the temporary files common names. Because of this, the error you mentioned occurred when one of your sessions tried to access and wipe the prodigal output produced by a completely different session.
I fixed the code to give temporary files properly randomized names, so sessions won't interfere with each other.
For the second issue, I added a few lines of fail-safe code for the zero-division case.
The new version of EzAAI has been uploaded to our website, and can also be downloaded using the following link: Download
Thanks!
Maybe it is something about `.fna` files: I got similar errors with `.fna` files, but when I tried `.faa` files from prokka it worked without errors, although I was not running simultaneous jobs. I hope this is helpful.
Thanks for the fixes, folks! I ended up running all my `extract` steps independently, but I will make sure to go back and double-check your fix with some additional files.
One more follow-up question: temp filenames seem random for `calculate`, but I am occasionally running into a similar error:
```
java.io.FileNotFoundException: /tmp/4b5f93eeab2eefee_faa/j0.faa (No such file or directory)
    at java.base/java.io.FileInputStream.open0(Native Method)
    at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
    at java.base/java.io.FileReader.<init>(FileReader.java:60)
    at leb.process.ProcCalcPairwiseAAI.pairwiseMmseqs(ProcCalcPairwiseAAI.java:579)
    at leb.process.ProcCalcPairwiseAAI.calculateProteomePairWithDetails(ProcCalcPairwiseAAI.java:250)
    at leb.main.EzAAI.runCalculate(EzAAI.java:361)
    at leb.main.EzAAI.run(EzAAI.java:493)
    at leb.main.EzAAI.main(EzAAI.java:528)
```
Do you think a similar issue could be happening, e.g. if the filenames are not fully randomized? I was running ~30 jobs at once, and this error was cropping up pretty often. Again, re-running usually "solves" the issue (the program exits without error).
Note this is with `EzAAI_v1.11.jar`, and running a single instance at a time results in no errors.
The `.faa` generation issue can occur when you run multiple `calculate` modules simultaneously, because the module was not designed to handle multiple sessions.
A `.db` file, which is the input of the `calculate` module, is simply a compressed tarball of `mmseqs` output files with commonly named contents. Multiple simultaneous `calculate` sessions will extract their `.db` files into a single directory, so any session can easily overwrite or remove files belonging to another session.
To prevent this from happening, please run the `calculate` module with the `-t [THREAD]` argument instead of running multiple sessions simultaneously; the multi-threading option removes the risk of this issue while maintaining the throughput of the analysis.
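A sequential driver along these lines is one way to follow that suggestion: run one `calculate` session at a time and get parallelism from `-t` instead of concurrent jobs. The `-t [THREAD]` flag is quoted from the reply above, but the `-i`/`-j`/`-o` flag names and the `EzAAI_latest.jar` filename are assumptions here, so check the jar's own `calculate` usage message before running. The commands are printed as a dry run rather than executed:

```shell
#!/usr/bin/env bash
set -euo pipefail
THREADS=8
mkdir -p demo_dbs aai
touch demo_dbs/GCA_A.db demo_dbs/GCA_B.db   # placeholder profile DBs for the dry run

for query in demo_dbs/*.db; do
  for ref in demo_dbs/*.db; do
    [ "$query" = "$ref" ] && continue
    out="aai/$(basename "$query" .db)__$(basename "$ref" .db).tsv"
    # Dry run: each planned command is printed, not executed.
    echo java -jar EzAAI_latest.jar calculate \
      -i "$query" -j "$ref" -o "$out" -t "$THREADS"
  done
done | tee calc_plan.txt
```

Dropping the `echo` turns the plan into real runs; because everything goes through one shell loop, only one `calculate` session ever touches the shared temp directory at a time.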
I have a plan to develop a multi-threading option for the `extract` module as well, to make our pipeline consistent.
Thanks again, and any further feedback is welcome!
Ok, thanks - this is very important to know!
I'm not sure how common my use case is relative to others': I have a series of specific pairwise comparisons I'm interested in, rather than a large all-vs-all comparison. In any case, multithreading and then running each process sequentially worked, though it was slower than spamming jobs across a large cluster :). Thanks!