Java Runtime Error when running test databases on VM
dajsfiles opened this issue · 43 comments
Hi,
I'm currently trying to get hecatomb working on a VM, but I've run into a Java-related error. The support team for the VM and storage servers told me that this error was not related to the VM and that I should contact the hecatomb developers. I've attached the error log. Please let me know if there's any further information I should provide.
hs_err_pid170.log
It appears to me that the JVM consumed all the system memory and was killed.
Can you please check the # Run Parameters # section of hecatomb.config.yaml and make sure that you are not requesting too much memory?
Let us know if that fixes the problem.
I'm not sure how much memory I should use. I've been running hecatomb from a Docker container, which requires a specific memory allocation; I've requested 16 GB in most cases. Should I request more memory in the Docker container itself?
This is my Hecatomb.config.yaml file, by the way.
##################
# Run Parameters #
##################
# Database installation location, leave blank = use hecatomb install location
Databases:
# STICK TO YOUR SYSTEM'S CPU:RAM RATIO FOR THESE
BigJobMem: 64000 # Memory for MMSeqs in megabytes (e.g 64GB = 64000, recommend >= 64000)
BigJobCpu: 24 # Threads for MMSeqs (recommend >= 16)
BigJobTimeMin: 5760 # Max runtime in minutes for MMSeqs (this is only enforced by the Snakemake profile)
MediumJobMem: 32000 # Memory for Megahit/Flye in megabytes (recommend >= 32000)
MediumJobCpu: 16 # CPUs for Megahit/Flye (recommend >= 16)
SmallJobMem: 16000 # Memory for BBTools etc. in megabytes (recommend >= 16000)
SmallJobCpu: 8 # CPUs for BBTools etc. (recommend >= 8)
# default CPUs = 1
defaultMem: 2000 # Default memory in megabytes (for use with --profile)
defaultTime: 1440 # Default time in minutes (for use with --profile)
defaultJobs: 100 # Default concurrent jobs (for use with --profile)
# Some jobs need more RAM; go over your CPU:RAM ratio if needed
MoreRamMem: 16000 # Memory for slightly RAM-hungry jobs in megabytes (recommend >= 16000)
MoreRamCpu: 2 # CPUs for slightly RAM-hungry jobs (recommend >= 2)
According to the log, you have 64 cores and 32 GB of RAM. Is that correct? If so, by default Hecatomb will be spinning up 3 or 4 BBTools jobs, each reserving 16GB which will put you over your system's available memory.
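To put rough numbers on it: 3-4 concurrent BBTools jobs x 16000 MB (SmallJobMem) = 48-64 GB requested, well over the 32 GB available, which is why the Java process gets killed.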
Change this part of the config like so:
BigJobMem: 32000 # Memory for MMSeqs in megabytes (e.g 64GB = 64000, recommend >= 64000)
BigJobCpu: 64 # Threads for MMSeqs (recommend >= 16)
BigJobTimeMin: 5760 # Max runtime in minutes for MMSeqs (this is only enforced by the Snakemake profile)
MediumJobMem: 32000 # Memory for Megahit/Flye in megabytes (recommend >= 32000)
MediumJobCpu: 64 # CPUs for Megahit/Flye (recommend >= 16)
SmallJobMem: 16000 # Memory for BBTools etc. in megabytes (recommend >= 16000)
SmallJobCpu: 32 # CPUs for BBTools etc. (recommend >= 8)
Thank you!
This initially seemed to solve my problem, but after 64%, another process failed. I've attached the log here.
2022-01-25T213119.818853.snakemake.log
It's progress at least. This is an MMSeqs error, and I can't see anything helpful related to bus errors on the mmseqs github issues page https://github.com/soedinglab/MMseqs2/issues
Try rerunning it and I'll see how the newest version of MMSeqs2 works with Hecatomb.
Isn't the specific version specified in the mmseqs.yaml, though? It should be, as the newer versions of mmseqs (13 and above) changed almost everything about the mmseqs output, so we want to make sure no version other than the one specified in the env is used, or there will be loads of downstream parsing issues.
Scott, the new mmseqsUpdate branch seems to be working for me on the test dataset, and we should be good to migrate to the new version whenever we want. It includes a couple of bugfixes that I'll need to cherry-pick into dev and master for now. In the end I only needed to tweak the AA taxonomy steps. The NT tax steps and the assembly mmseqs step worked fine (though I still need to check the assembly contig annotations to make sure they're correct). The bigtable looks fine though.
Jason, let me know if you want to try this version and need help checking out the mmseqsUpdate branch and running it.
Hi @beardymcjohnface we should really take a deeper look. When MMseqs2 updated to release 13-45111 they changed not only how the algorithm works (it works primarily as a contig annotator and less well as a short-read annotator) but also all of the output files. The columns are not the same, and I don't think I was able to sort out how to dissect the LCA results. It really wasn't an incremental version release so much as a release of an entirely new software package.
I agree, I made it a separate branch so we could make a pull request, review it there and make any necessary changes before merging it with the main branch (assuming it works fine).
Hi,
I've tried running it again, and this time it hit a different error. I'm not sure if these two are related.
2022-02-01T211755.954089.snakemake.log
Hi, sorry for the late reply. This is an MMSeqs issue, I think. You could try running the commands manually and see if they work, but I would also add a memory limit to the search command (which I've done below). I'll patch this into the next release of Hecatomb just to be safe. If it does work, rerun Hecatomb and it should continue after this step.
mmseqs createdb \
hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/assembly.fasta \
hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB \
--dbtype 2
mmseqs search \
hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/queryDB \
/storage1/fs1/leyao.wang/Active/jason_test/hecatomb/snakemake/workflow/../../databases/nt/virus_primary_nt/sequenceDB \
hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/results/result \
hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/mmseqs_nt_tmp \
--start-sens 2 -s 7 --sens-steps 3 --min-length 90 -e 1e-5 --search-type 3 \
--split-memory-limit 24000
It says that the mmseqs command is not found. Should I install MMseqs? Could that be what's causing the issue?
Oh, my bad. You could install it, or you could use the conda env that snakemake created. The easiest way is to just install mmseqs2:
# don't run this from your base env; your hecatomb env should be fine
mamba install mmseqs2=12.113e3=h2d02072_2
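If you'd rather reuse the environment snakemake already built (so the MMseqs2 version matches exactly), the conda prefix from your log is the workflow's conda directory; you can list it and activate whichever env contains mmseqs (the hash-named directory varies, so <env-hash> below is a placeholder):
ls /storage1/fs1/leyao.wang/Active/jason_test/hecatomb/snakemake/workflow/conda/
conda activate /storage1/fs1/leyao.wang/Active/jason_test/hecatomb/snakemake/workflow/conda/<env-hash>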
Hi Michael,
When I ran the aforementioned commands, I got another issue: there's a file in hecatomb that couldn't be opened for writing.
2-10-2022-output.txt
I'm not sure why MMSeqs is failing here. You could try deleting the mmseqs directories hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/mmseqs_nt_tmp and hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/results and rerunning hecatomb. Otherwise we'll have to pester the MMSeqs developers for some ideas.
If you're not worried about the contig annotations, you can rerun hecatomb and add the option --snake=-k. The pipeline will still "fail" but it should create everything except these files (the assembly, seqtable, bigtable, read-based annotations, etc.).
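Something like this, run from the same directory you launched hecatomb from (paths as in your run; --snake=-k is optional):
rm -r hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/mmseqs_nt_tmp
rm -r hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/FLYE/results
hecatomb run --test --snake=-k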
I was able to delete mmseqs_nt_tmp but it looks like results didn't exist in the first place.
When I ran hecatomb again, it exited almost instantly. This was the error log:
2022-02-11T202054.110585.snakemake.log
Also, we do need contig annotations.
I would suggest deleting the hecatomb_out/PROCESSING/ASSEMBLY/CONTIG_DICTIONARY/ directory and making the pipeline regenerate those files; something has been corrupted at some point, I think. You should also include the snakemake 'keep going' flag by adding --snake=-k to the end of your hecatomb command. That should hopefully make the pipeline finish the read annotations if nothing else.
Ok! Would the pipeline regenerate them if I simply ran hecatomb run --test --snake=-k?
Yes, any files that are missing should be regenerated, as well as any subsequent files that depend on them. I'm just looking back through the thread; is this the test dataset that is failing?
Yes.
After trying to regenerate them, I ran it again, but it still keeps hitting errors. When I tried to regenerate again after deleting CONTIG_DICTIONARY, I was greeted with this message:
(/storage1/fs1/leyao.wang/Active/jason_test/hecatomb) j.m.li@compute1-exec-132:~$ hecatomb run --test --snake=-k
Config file hecatomb.config.yaml already exists.
Running Hecatomb
Running snakemake command:
snakemake -j 32 --use-conda --conda-frontend mamba --rerun-incomplete --printshellcmds --nolock --show-failed-logs --conda-prefix /storage1/fs1/leyao.wang/Active/jason_test/hecatomb/snakemake/workflow/conda --configfile hecatomb.config.yaml -k -s /storage1/fs1/leyao.wang/Active/jason_test/hecatomb/snakemake/workflow/Hecatomb.smk -C Reads=/storage1/fs1/leyao.wang/Active/jason_test/hecatomb/test_data Host=human Output=hecatomb_out SkipAssembly=False Fast=False Report=False
Building DAG of jobs...
WorkflowError:
Unable to obtain modification time of file hecatomb_out/RESULTS/assembly.fasta although it existed before. It could be that a concurrent process has deleted it while Snakemake was running.
File "/storage1/fs1/leyao.wang/Active/jason_test/hecatomb/lib/python3.10/asyncio/runners.py", line 44, in run
File "/storage1/fs1/leyao.wang/Active/jason_test/hecatomb/lib/python3.10/asyncio/base_events.py", line 641, in run_until_complete
I've also attached my error file from the normal run.
2022-02-17T210726.314687.snakemake.log
That modification time error can occur during reruns of failed/killed snakemake pipelines. I think you can just touch the file and it should be ok. Alternatively, you should be able to delete the .snakemake/ directory.
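For example, using the file named in your error message:
touch hecatomb_out/RESULTS/assembly.fasta
Or, to wipe snakemake's bookkeeping entirely:
rm -r .snakemake/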
The normal-run error is back to mmseqs running out of memory. I don't think we actually got the mmseqs commands to run manually, did we?
The fix for the memory issue is here: 14625b0
You just need to add --split-memory-limit {MMSeqsMemSplit} to the mmseqs command in the 03_contig_annotation.smk rules file, for the mmseqs_contig_annotation rule. Your file should be in /storage1/fs1/leyao.wang/Active/jason_test/hecatomb/snakemake/workflow/rules/03_contig_annotation.smk. Otherwise, we could try installing the GitHub version and checking out the dev branch, or wait for the next release.
Regarding the modification time error, do you mean I should open and close the two python files mentioned?
For mmseqs, there are multiple sections in the annotation rule. Which one should I put the option under?
For reference, this is what the file lists for rule mmseqs_contig_annotation:
rule mmseqs_contig_annotation:
"""Contig annotation step 01: Assign taxonomy to contigs in contig_dictionary using mmseqs
Database: NCBI virus assembly with taxID added
"""
input:
contigs=os.path.join(ASSEMBLY,"CONTIG_DICTIONARY","FLYE","assembly.fasta"),
db=os.path.join(NCBIVIRDB, "sequenceDB")
output:
queryDB=os.path.join(ASSEMBLY,"CONTIG_DICTIONARY","FLYE","queryDB"),
result=os.path.join(ASSEMBLY,"CONTIG_DICTIONARY","FLYE","results","result.index")
params:
respath=os.path.join(ASSEMBLY,"CONTIG_DICTIONARY","FLYE","results","result"),
tmppath=os.path.join(ASSEMBLY,"CONTIG_DICTIONARY","FLYE","mmseqs_nt_tmp")
benchmark:
os.path.join(BENCH, "mmseqs_contig_annotation.txt")
log:
os.path.join(STDERR, "mmseqs_contig_annotation.log")
resources:
mem_mb=MMSeqsMem
threads:
MMSeqsCPU
conda:
os.path.join("../", "envs", "mmseqs2.yaml")
shell:
"""
{{
mmseqs createdb {input.contigs} {output.queryDB} --dbtype 2;
mmseqs search {output.queryDB} {input.db} {params.respath} {params.tmppath} \
{MMSeqsSensNT} {config[filtNTsecondary]} \
--search-type 3 ; }} &> {log}
rm {log}
"""
You can use the touch command to update the timestamps of the files, which is what Snakemake uses to keep track of what it does and does not need to do. If you open the commit link (14625b0) you can make the same changes in your file.
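For reference, the search command in the shell block of your rule would end up looking roughly like this (a sketch based on the rule you posted; MMSeqsMemSplit is the variable added in that commit):
mmseqs search {output.queryDB} {input.db} {params.respath} {params.tmppath} \
    {MMSeqsSensNT} {config[filtNTsecondary]} \
    --split-memory-limit {MMSeqsMemSplit} \
    --search-type 3 ; }} &> {log}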
I've been able to update the modification time of the symlink, but even then, I'm still encountering this error.
Ok, that's weird. I renamed the file I was supposed to touch so that hecatomb wouldn't find it, ran it again, and it completed a test run successfully.
Now that it successfully completed, do I have to run it again with any other modifications, or is it good to use?
There was a problem when I was running actual datasets, but I don't know if these are issues with hecatomb itself or with the datasets. I've attached all 3 error logs.
The main issue refers to an error reading "Invalid header line: must start with @HD/@SQ/@RG/@PG/@CO".
Did my version of hecatomb become corrupted due to numerous failed runs?
2022-03-08T220712.552501.snakemake.log
2022-03-08T222137.634342.snakemake.log
hecatomb.crashreport.log
The error here is with samtools view in host_removal_mapping. minimap2 maps the reads to the host genome, samtools view filters the mapped reads, and samtools fastq converts the bam format back to fastq. I'm not sure why the header isn't being passed on by minimap2.
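For context, the rule is essentially a pipe along these lines (a rough sketch of the idea only; not necessarily the exact flags hecatomb uses):
minimap2 -ax sr --secondary=no masked_ref.fa.gz.idx R1.fastq R2.fastq \
    | samtools view -h -f 4 - \
    | samtools fastq -1 unmapped_R1.fastq -2 unmapped_R2.fastq -
Here -f 4 keeps only reads flagged as unmapped (i.e. the non-host reads), and samtools fastq writes them back out as paired fastq files.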
Can you run ls -lh hecatomb_out/PROCESSING/TMP/p06/ to make sure the input fastq files aren't empty?
Then you could run
minimap2 -ax sr -t 8 --secondary=no \
/storage1/fs1/leyao.wang/Active/jason_test/hecatomb/snakemake/workflow/../../databases/host/human/masked_ref.fa.gz.idx \
hecatomb_out/PROCESSING/TMP/p06/M667_I8470_32876_Wang_Asthma_A07_NEBNext_Index_5_ACAGTGATCT_S5_L001_R1.s6.out.fastq \
hecatomb_out/PROCESSING/TMP/p06/M667_I8470_32876_Wang_Asthma_A07_NEBNext_Index_5_ACAGTGATCT_S5_L001_R2.s6.out.fastq \
> A07.minimap.test.sam
to see if minimap is outputting any alignments.
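Then check whether the SAM file actually has a header and alignment records, e.g.:
head -n 5 A07.minimap.test.sam        # should show @HD/@SQ header lines
grep -vc '^@' A07.minimap.test.sam    # rough count of alignment lines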
The first command returns an error: "ls: cannot access 'hecatomb_out/PROCESSING/TMP/p06/': Operation not permitted"
The second command returns the error "bash: minimap2: command not found"
You might need to create and activate an environment with minimap2: conda create -n minimap2 -c bioconda minimap2 && conda activate minimap2
Does the hecatomb_out/PROCESSING/TMP/p06/ directory exist?
The directory exists. I'll get back to you on the minimap issue.
Also, should I be running the program while inside of the hecatomb folder or would it be ok if I just cd-ed to the folder of the inputs and then ran it?
Run it in a clean folder. When I'm running it, I'll create a new folder someAnalysis and a subfolder for the reads, someAnalysis/reads. I would copy or link the reads to the reads folder, then cd to someAnalysis and run hecatomb from there. Don't run it from the hecatomb installation folder.
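Something like this (directory names are just placeholders; point hecatomb at the reads folder the same way you normally do):
mkdir -p someAnalysis/reads
cp /path/to/your/reads/*.fastq.gz someAnalysis/reads/   # or ln -s to link instead of copying
cd someAnalysis
hecatomb run ...   # your usual hecatomb command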
I moved to a clean folder, ran it, got an error, then proceeded to run the minimap command and then run it again. Unfortunately, it looks like I'm still hitting errors. Here's what I got:
2022-03-10T232525.690659.snakemake.log
Was I supposed to cd into hecatomb_out/PROCESSING/TMP/p06/ and then run the command?
Because if I do that I get an error: "ERROR: failed to open file 'hecatomb_out/PROCESSING/TMP/p06/M667_I8470_32876_Wang_Asthma_A07_NEBNext_Index_5_ACAGTGATCT_S5_L001_R1.s6.out.fastq'"
Thanks!
Here's the file that's unable to be opened.
M667_I8470_32876_Wang_Asthma_A07_NEBNext_Index_5_ACAGTGATCT_S5_L001.s6.stats.zip
That looks like the same error as before. I'm guessing that sample doesn't have any reads following QC and host removal. I'll have to add an update to check for this.
Is this work urgent? Do you want me to try running your samples for you?
That would be great, thanks! However, the samples, even after being zipped, are still about 4 GB. How should I send them to you?
Thanks for the email. The dataset ran fine on our system using the current conda version of hecatomb. I wish I knew why it was causing so much grief, but we'll probably have to test Hecatomb in some cloud VMs at some point.
Ok, thank you.
Do you know what the error message
Logfile hecatomb_out/STDERR/host_removal_mapping.M667_I8470_32876_Wang_Asthma_A07_NEBNext_Index_5_ACAGTGATCT_S5_L001.samtoolsView.log:
[E::sam_hdr_create] Invalid header line: must start with @HD/@SQ/@RG/@PG/@CO
[main_samview] fail to read the header from "-".
Logfile hecatomb_out/STDERR/host_removal_mapping.M667_I8470_32876_Wang_Asthma_A07_NEBNext_Index_5_ACAGTGATCT_S5_L001.samtoolsFastq.log:
Failed to read header for "-"
is referring to? This seems to be a local problem.
cont'd via email.