Failed to run with custom data
shenwei356 opened this issue · 4 comments
Firstly, all files were prepared and checked.
$ ls cobs/ | head -n 3
achromobacter_xylosoxidans__01.cobs_classic.xz
acinetobacter_baumannii__01.cobs_classic.xz
acinetobacter_baumannii__02.cobs_classic.xz
$ ls asms/ | head -n 3
achromobacter_xylosoxidans__01.tar.xz
acinetobacter_baumannii__01.tar.xz
acinetobacter_baumannii__02.tar.xz
$ grep batches config.yaml
# batches to consider during search
batches: "data/batches_2m.txt"
$ wc -l data/batches_2m.txt
640 data/batches_2m.txt
$ head -n 3 data/batches_2m.txt
acinetobacter_nosocomialis__01
aeromonas_salmonicida__01
acinetobacter_baumannii__02
$ grep achromobacter_xylosoxidans data/batches_2m.txt
achromobacter_xylosoxidans__01
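The manual spot checks above can be automated. The following is a minimal sketch, assuming the file naming seen in the listings (`<batch>.cobs_classic.xz` under cobs/ and `<batch>.tar.xz` under asms/); the function name `missing_batch_files` is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def missing_batch_files(batches_file, cobs_dir="cobs", asms_dir="asms"):
    """Return {batch: [missing paths]} for batches lacking an index or an assembly archive.

    Assumption: file names follow the pattern seen in the directory listings.
    """
    missing = {}
    for line in Path(batches_file).read_text().splitlines():
        batch = line.strip()
        if not batch:
            continue  # ignore blank lines in the batch list
        expected = [
            Path(cobs_dir) / f"{batch}.cobs_classic.xz",
            Path(asms_dir) / f"{batch}.tar.xz",
        ]
        absent = [str(p) for p in expected if not p.is_file()]
        if absent:
            missing[batch] = absent
    return missing


# Example call (paths taken from the listing above):
# for batch, paths in missing_batch_files("data/batches_2m.txt").items():
#     print(batch, "->", ", ".join(paths))
```

An empty result would mean every batch in the list has both files, which rules out a simple naming mismatch.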
Run on a cluster node with make clean; make
Query files: ['input/t.sm.MutL.fasta']
Building DAG of jobs...
MissingInputException in rule translate_matches in file /hps/nobackup/iqbal/shenwei/2kk/mof-search.all/Snakefile, line 485:
Missing input files for rule translate_matches:
output: intermediate/04_filter/t.sm.MutL.fa
wildcards: qfile=t.sm.MutL
affected files:
intermediate/03_match/salmonella_enterica__125____t.sm.MutL.gz
intermediate/03_match/salmonella_enterica__131____t.sm.MutL.gz
intermediate/03_match/salmonella_enterica__126____t.sm.MutL.gz
....
intermediate/03_match/salmonella_enterica__101____t.sm.MutL.gz
intermediate/03_match/salmonella_enterica__102____t.sm.MutL.gz
intermediate/03_match/salmonella_enterica__123____t.sm.MutL.gz
Traceback (most recent call last):
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.all/scripts/benchmark.py", line 82, in <module>
main()
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.all/scripts/benchmark.py", line 58, in main
raise subprocess.CalledProcessError(return_code,
subprocess.CalledProcessError: Command '/usr/bin/time -o logs/benchmarks/match_2024_03_15T16_43_51.txt.tmp -f "%e %S %U %P %M %I %O" snakemake match --cores all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=51200' returned non-zero exit status 1.
make[1]: *** [Makefile:89: match] Error 1
make[1]: Leaving directory '/hps/nobackup/iqbal/shenwei/2kk/mof-search.all'
make: *** [Makefile:32: all] Error 2
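The affected files in the message above appear to follow the pattern `<batch>____<qfile>.gz` under intermediate/03_match. As a rough sketch under that assumption (the pattern is inferred from the log, and `missing_match_files` is a hypothetical helper, not part of the Snakefile), the expected inputs can be listed and compared against what is on disk:

```python
from pathlib import Path


def missing_match_files(batches, qfile, match_dir="intermediate/03_match"):
    """List the per-batch match files expected for one query file that do not exist.

    Assumption: the naming pattern '<batch>____<qfile>.gz' is inferred from the log.
    """
    expected = [Path(match_dir) / f"{batch}____{qfile}.gz" for batch in batches]
    return [str(p) for p in expected if not p.is_file()]


# Example call with names from the error message:
# missing_match_files(["salmonella_enterica__125"], "t.sm.MutL")
```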
In fact, the intermediate directory is empty:
$ dirsize intermediate/
intermediate/: 76.00 B
14.00 B 00_queries_preprocessed
14.00 B 01_queries_merged
14.00 B 02_cobs_decompressed
14.00 B 03_match
14.00 B 05_map
6.00 B 04_filter
$ tree intermediate/
intermediate/
├── 00_queries_preprocessed
├── 01_queries_merged
├── 02_cobs_decompressed
├── 03_match
├── 04_filter
└── 05_map
6 directories, 0 files
My first suggestion would be to create data/batches_2m_small.txt with ~3 small batches, possibly exactly the same ones as here: https://github.com/karel-brinda/Phylign/blob/main/data/batches_small.txt, which will be used for testing.
Then we can look at the messages with these (currently it looks like the issue is with just Salmonella?).
When I created and used a small batch file, it began installing the cobs and minimap conda environments, but then other errors appeared:
Traceback (most recent call last):
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/benchmark.py", line 82, in <module>
main()
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/benchmark.py", line 58, in main
raise subprocess.CalledProcessError(return_code,
subprocess.CalledProcessError: Command '/usr/bin/time -o logs/benchmarks/translate_matches/translate_matches___t2.sm.MutL.txt.tmp -f "%e %S %U %P %M %I %O" ./scripts/filter_queries.py \
-n 1000 \
-q intermediate/01_queries_merged/t2.sm.MutL.fa \
intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz intermediate/03_match/acinetobacter_nosocomialis__01____t2.sm.MutL.gz intermediate/03_match/aeromonas_salmonicida__01____t2.sm.MutL.gz \
> intermediate/04_filter/t2.sm.MutL.fa 2>logs/04_filter/t2.sm.MutL.log' returned non-zero exit status 1.
[Mon Mar 18 07:55:41 2024]
Error in rule translate_matches:
jobid: 1
input: intermediate/01_queries_merged/t2.sm.MutL.fa, intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz, intermediate/03_match/acinetobacter_nosocomialis__01____t2.sm.MutL.gz, intermediate/03_match/aeromonas_salmonicida__01____t2.sm.MutL.gz
output: intermediate/04_filter/t2.sm.MutL.fa
log: logs/04_filter/t2.sm.MutL.log (check log file(s) for error details)
conda-env: /hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/.snakemake/conda/4224e0d82ee1aa3330a9bb10ca65cbea_
shell:
./scripts/benchmark.py --log logs/benchmarks/translate_matches/translate_matches___t2.sm.MutL.txt \
'./scripts/filter_queries.py \
-n 1000 \
-q intermediate/01_queries_merged/t2.sm.MutL.fa \
intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz intermediate/03_match/acinetobacter_nosocomialis__01____t2.sm.MutL.gz intermediate/03_match/aeromonas_salmonicida__01____t2.sm.MutL.gz \
> intermediate/04_filter/t2.sm.MutL.fa 2>logs/04_filter/t2.sm.MutL.log'
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job translate_matches since they might be corrupted:
intermediate/04_filter/t2.sm.MutL.fa
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-18T075225.813254.snakemake.log
Traceback (most recent call last):
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/scripts/benchmark.py", line 82, in <module>
main()
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/scripts/benchmark.py", line 58, in main
raise subprocess.CalledProcessError(return_code,
subprocess.CalledProcessError: Command '/usr/bin/time -o logs/benchmarks/match_2024_03_18T07_52_25.txt.tmp -f "%e %S %U %P %M %I %O" snakemake match --cores all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=102400' returned non-zero exit status 1.
make[1]: *** [Makefile:89: match] Error 1
make[1]: Leaving directory '/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin'
make: *** [Makefile:32: all] Error 2
$ more logs/04_filter/t2.sm.MutL.log
Translating matches intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz
Processing batch acinetobacter_baumannii__02 query #0 (None)
Traceback (most recent call last):
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 180, in process_cobs_file
_ = self._query_dict[qname]
~~~~~~~~~~~~~~~~^^^^^^^
KeyError: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 240, in <module>
main()
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 236, in main
process_files(args.query_fn, args.match_fn, args.keep)
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 203, in process_files
sift.process_cobs_file(fn)
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 182, in process_cobs_file
self._query_dict[qname] = SingleQuery(qname, self._keep_matches)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: SingleQuery.__init__() missing 1 required positional argument: 'keep_matches'
acinetobacter_baumannii__02____t2.sm.MutL.gz and the other two match files are empty.
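The final TypeError in the log above is worth reading closely: the call passes two arguments (`qname` and `self._keep_matches`), yet Python reports `keep_matches` as still missing, which suggests the constructor takes at least one more positional parameter before `keep_matches`. The sketch below reproduces that class of error with a hypothetical stand-in; the parameter name `qseq` is invented for illustration, and the real signature in filter_queries.py may differ.

```python
class SingleQuery:
    """Hypothetical stand-in; NOT the real class from filter_queries.py."""

    def __init__(self, qname, qseq, keep_matches):
        # 'qseq' is an assumed extra parameter, used only to reproduce the error.
        self.qname = qname
        self.qseq = qseq
        self.keep_matches = keep_matches


def construct_error(*args):
    """Return the TypeError message raised by SingleQuery(*args), or None on success."""
    try:
        SingleQuery(*args)
    except TypeError as err:
        return str(err)
    return None


# Two arguments, as in the traceback: the message names 'keep_matches' as missing.
print(construct_error(None, 1000))
# Three arguments: construction succeeds and None is returned.
print(construct_error("q1", "ACGT", 1000))
```

If that reading is right, the fallback branch taken after `KeyError: None` calls the constructor with too few arguments, so it would fail for any query name that is not already in the dictionary.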
$ tree -sh intermediate/
intermediate/
├── [ 4.0K] 00_queries_preprocessed
│ └── [ 2.0K] t2.sm.MutL.fa
├── [ 4.0K] 01_queries_merged
│ └── [ 2.0K] t2.sm.MutL.fa
├── [ 4.0K] 02_cobs_decompressed
├── [ 4.0K] 03_match
│ ├── [ 20] acinetobacter_baumannii__02____t2.sm.MutL.gz
│ ├── [ 20] acinetobacter_nosocomialis__01____t2.sm.MutL.gz
│ └── [ 20] aeromonas_salmonicida__01____t2.sm.MutL.gz
├── [ 4.0K] 04_filter
└── [ 4.0K] 05_map
$ zcat intermediate/03_match/*
$
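A size of 20 bytes is consistent with a gzip stream that wraps empty input (header, empty deflate block, and trailer), which matches the empty `zcat` output. A small sketch for flagging such files before the filtering step; the helper names are hypothetical and not part of the pipeline:

```python
import gzip


def gzip_is_empty(path):
    """Return True if the gzip file decompresses to zero bytes."""
    with gzip.open(path, "rb") as fh:
        return fh.read(1) == b""


def empty_match_files(paths):
    """Return the subset of paths whose decompressed content is empty."""
    return [str(p) for p in paths if gzip_is_empty(p)]


# Example call:
# from pathlib import Path
# print(empty_match_files(sorted(Path("intermediate/03_match").glob("*.gz"))))
```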
I thought this was because there were no matches, but after adding the species that the query belongs to (as a positive control), the error is the same.
$ cat -A data/batches_2m_small.txt
acinetobacter_nosocomialis__01$
aeromonas_salmonicida__01$
acinetobacter_baumannii__02$
streptococcus_mutans__01$
I think this is the principal error message:
$ more logs/04_filter/t2.sm.MutL.log
Translating matches intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz
Processing batch acinetobacter_baumannii__02 query #0 (None)
Traceback (most recent call last):
File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 180, in process_cobs_file
_ = self._query_dict[qname]
~~~~~~~~~~~~~~~~^^^^^^^
KeyError: None
I think this suggests that there might be some old intermediate data?
Try running make clean before rerunning everything.
I ran make clean before make.
The intermediate directory is empty after running make clean, as mentioned before.