karel-brinda/Phylign

Failed to run with custom data

shenwei356 opened this issue · 4 comments

Firstly, all files were prepared and checked.

$ ls cobs/ | head -n 3
achromobacter_xylosoxidans__01.cobs_classic.xz
acinetobacter_baumannii__01.cobs_classic.xz
acinetobacter_baumannii__02.cobs_classic.xz

$ ls asms/ | head -n 3
achromobacter_xylosoxidans__01.tar.xz
acinetobacter_baumannii__01.tar.xz
acinetobacter_baumannii__02.tar.xz

$ grep batches config.yaml 
# batches to consider during search
batches: "data/batches_2m.txt"

$ wc -l data/batches_2m.txt 
640 data/batches_2m.txt

$ head -n 3  data/batches_2m.txt 
acinetobacter_nosocomialis__01
aeromonas_salmonicida__01
acinetobacter_baumannii__02

$ grep achromobacter_xylosoxidans  data/batches_2m.txt 
achromobacter_xylosoxidans__01
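
The spot checks above can be automated. The following is a minimal sketch, not part of Phylign, that reports every batch in the batch list lacking a matching file in cobs/ or asms/. It assumes the file naming seen in the listings above (`<batch>.cobs_classic.xz` and `<batch>.tar.xz`); the function name `find_missing` is hypothetical.

```python
from pathlib import Path


def find_missing(batches_file, cobs_dir="cobs", asms_dir="asms"):
    """Return {batch: [missing paths]} for batches lacking an index or an assembly archive."""
    missing = {}
    for line in Path(batches_file).read_text().splitlines():
        batch = line.strip()
        if not batch:
            continue  # skip blank lines
        # Expected names follow the pattern visible in the `ls` output (an assumption).
        expected = [
            Path(cobs_dir) / f"{batch}.cobs_classic.xz",
            Path(asms_dir) / f"{batch}.tar.xz",
        ]
        absent = [str(p) for p in expected if not p.is_file()]
        if absent:
            missing[batch] = absent
    return missing


# Example: print(find_missing("data/batches_2m.txt"))
```

An empty result means every listed batch has both files, which would rule out a simple naming mismatch as the cause of the missing inputs.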

Run on a cluster node with make clean; make

Query files: ['input/t.sm.MutL.fasta']
Building DAG of jobs...
MissingInputException in rule translate_matches in file /hps/nobackup/iqbal/shenwei/2kk/mof-search.all/Snakefile, line 485:
Missing input files for rule translate_matches:
    output: intermediate/04_filter/t.sm.MutL.fa
    wildcards: qfile=t.sm.MutL
    affected files:
        intermediate/03_match/salmonella_enterica__125____t.sm.MutL.gz
        intermediate/03_match/salmonella_enterica__131____t.sm.MutL.gz
        intermediate/03_match/salmonella_enterica__126____t.sm.MutL.gz
        ....
        intermediate/03_match/salmonella_enterica__101____t.sm.MutL.gz
        intermediate/03_match/salmonella_enterica__102____t.sm.MutL.gz
        intermediate/03_match/salmonella_enterica__123____t.sm.MutL.gz
Traceback (most recent call last):
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.all/scripts/benchmark.py", line 82, in <module>
    main()
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.all/scripts/benchmark.py", line 58, in main
    raise subprocess.CalledProcessError(return_code,
subprocess.CalledProcessError: Command '/usr/bin/time -o logs/benchmarks/match_2024_03_15T16_43_51.txt.tmp -f "%e       %S      %U      %P      %M      %I      %O" snakemake match --cores all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=51200' returned non-zero exit status 1.
make[1]: *** [Makefile:89: match] Error 1
make[1]: Leaving directory '/hps/nobackup/iqbal/shenwei/2kk/mof-search.all'
make: *** [Makefile:32: all] Error 2

In fact, the intermediate directory is empty:

$ dirsize  intermediate/

intermediate/: 76.00 B
   14.00 B      00_queries_preprocessed
   14.00 B      01_queries_merged
   14.00 B      02_cobs_decompressed
   14.00 B      03_match
   14.00 B      05_map
    6.00 B      04_filter
$ tree intermediate/
intermediate/
├── 00_queries_preprocessed
├── 01_queries_merged
├── 02_cobs_decompressed
├── 03_match
├── 04_filter
└── 05_map

6 directories, 0 files

My first suggestion would be to create data/batches_2m_small.txt with ~3 small batches, possibly exactly the same ones as here: https://github.com/karel-brinda/Phylign/blob/main/data/batches_small.txt, which will be used for testing.
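
One way to produce such a file is sketched below. It simply copies the first few batch names from the full list, which is an assumption for illustration: it does not guarantee that the chosen batches are small, and it is not a Phylign tool. The function name `write_small_batches` is hypothetical.

```python
from pathlib import Path


def write_small_batches(full_list, small_list, n=3):
    """Copy the first n non-empty batch names from full_list into small_list."""
    names = [ln.strip() for ln in Path(full_list).read_text().splitlines() if ln.strip()]
    subset = names[:n]
    # One batch name per line, each terminated by a newline.
    Path(small_list).write_text("".join(f"{name}\n" for name in subset))
    return subset


# Example: write_small_batches("data/batches_2m.txt", "data/batches_2m_small.txt")
```

After writing the file, the `batches:` entry in config.yaml would need to point to it.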

Then we can look at the messages produced with these (currently the issue appears to involve only Salmonella?).

When I created and used a small batch file, it began installing the cobs and minimap conda environments, but then other errors appeared.

Traceback (most recent call last):
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/benchmark.py", line 82, in <module>
    main()
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/benchmark.py", line 58, in main
    raise subprocess.CalledProcessError(return_code,
subprocess.CalledProcessError: Command '/usr/bin/time -o logs/benchmarks/translate_matches/translate_matches___t2.sm.MutL.txt.tmp -f "%e        %S      %U      %P      %M      %I      %O" ./scripts/filter_queries.py \
                    -n 1000 \
                    -q intermediate/01_queries_merged/t2.sm.MutL.fa \
                    intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz intermediate/03_match/acinetobacter_nosocomialis__01____t2.sm.MutL.gz intermediate/03_match/aeromonas_salmonicida__01____t2.sm.MutL.gz \
                > intermediate/04_filter/t2.sm.MutL.fa 2>logs/04_filter/t2.sm.MutL.log' returned non-zero exit status 1.
[Mon Mar 18 07:55:41 2024]
Error in rule translate_matches:
    jobid: 1
    input: intermediate/01_queries_merged/t2.sm.MutL.fa, intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz, intermediate/03_match/acinetobacter_nosocomialis__01____t2.sm.MutL.gz, intermediate/03_match/aeromonas_salmonicida__01____t2.sm.MutL.gz
    output: intermediate/04_filter/t2.sm.MutL.fa
    log: logs/04_filter/t2.sm.MutL.log (check log file(s) for error details)
    conda-env: /hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/.snakemake/conda/4224e0d82ee1aa3330a9bb10ca65cbea_
    shell:
        
        ./scripts/benchmark.py --log logs/benchmarks/translate_matches/translate_matches___t2.sm.MutL.txt \
            './scripts/filter_queries.py \
                    -n 1000 \
                    -q intermediate/01_queries_merged/t2.sm.MutL.fa \
                    intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz intermediate/03_match/acinetobacter_nosocomialis__01____t2.sm.MutL.gz intermediate/03_match/aeromonas_salmonicida__01____t2.sm.MutL.gz \
                > intermediate/04_filter/t2.sm.MutL.fa 2>logs/04_filter/t2.sm.MutL.log'
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job translate_matches since they might be corrupted:
intermediate/04_filter/t2.sm.MutL.fa
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-18T075225.813254.snakemake.log
Traceback (most recent call last):
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/scripts/benchmark.py", line 82, in <module>
    main()
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/scripts/benchmark.py", line 58, in main
    raise subprocess.CalledProcessError(return_code,
subprocess.CalledProcessError: Command '/usr/bin/time -o logs/benchmarks/match_2024_03_18T07_52_25.txt.tmp -f "%e       %S      %U      %P      %M      %I      %O" snakemake match --cores all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=102400' returned non-zero exit status 1.
make[1]: *** [Makefile:89: match] Error 1
make[1]: Leaving directory '/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin'
make: *** [Makefile:32: all] Error 2
$ more logs/04_filter/t2.sm.MutL.log
Translating matches intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz
Processing batch acinetobacter_baumannii__02 query #0 (None)
Traceback (most recent call last):
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 180, in process_cobs_file
    _ = self._query_dict[qname]
        ~~~~~~~~~~~~~~~~^^^^^^^
KeyError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 240, in <module>
    main()
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 236, in main
    process_files(args.query_fn, args.match_fn, args.keep)
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 203, in process_files
    sift.process_cobs_file(fn)
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 182, in process_cobs_file
    self._query_dict[qname] = SingleQuery(qname, self._keep_matches)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: SingleQuery.__init__() missing 1 required positional argument: 'keep_matches'
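
The log shows two chained exceptions: a lookup with the key `None` fails, and the fallback constructor call then fails because it passes one argument fewer than the constructor requires. The sketch below reproduces that pattern in isolation. It is not Phylign's code: the three-parameter signature and the parameter name `qseq` are assumptions, since the real `SingleQuery` class is not shown in this thread.

```python
class SingleQuery:
    # Hypothetical signature; only `keep_matches` is confirmed by the error message.
    def __init__(self, qname, qseq, keep_matches):
        self.qname = qname
        self.qseq = qseq
        self.keep_matches = keep_matches


def reproduce():
    """Return the TypeError message raised while handling a KeyError for key None."""
    query_dict = {}
    qname = None  # the log prints "query #0 (None)"
    try:
        _ = query_dict[qname]
    except KeyError:
        try:
            # Two arguments for a three-parameter constructor, as in the traceback.
            query_dict[qname] = SingleQuery(qname, 1000)
        except TypeError as err:
            return str(err)
    return None


print(reproduce())
```

If the real class behaves like this sketch, the `TypeError` is a secondary failure in the fallback path, and the underlying question is why the query name is `None` in the first place.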

acinetobacter_baumannii__02____t2.sm.MutL.gz and the other two files are empty.

$ tree -sh intermediate/
intermediate/
├── [ 4.0K]  00_queries_preprocessed
│   └── [ 2.0K]  t2.sm.MutL.fa
├── [ 4.0K]  01_queries_merged
│   └── [ 2.0K]  t2.sm.MutL.fa
├── [ 4.0K]  02_cobs_decompressed
├── [ 4.0K]  03_match
│   ├── [   20]  acinetobacter_baumannii__02____t2.sm.MutL.gz
│   ├── [   20]  acinetobacter_nosocomialis__01____t2.sm.MutL.gz
│   └── [   20]  aeromonas_salmonicida__01____t2.sm.MutL.gz
├── [ 4.0K]  04_filter
└── [ 4.0K]  05_map

$ zcat intermediate/03_match/*
$
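
A size of 20 bytes is consistent with a gzip stream that contains no data (10-byte header, 2-byte empty deflate block, 8-byte trailer). The following minimal sketch lists which match files decompress to nothing; the function name `empty_gzip_files` is hypothetical, and the default directory is taken from the paths above.

```python
import gzip
from pathlib import Path


def empty_gzip_files(match_dir="intermediate/03_match"):
    """Return the names of .gz files in match_dir whose decompressed content is empty."""
    empty = []
    for path in sorted(Path(match_dir).glob("*.gz")):
        with gzip.open(path, "rb") as handle:
            # Reading a single byte is enough to tell empty from non-empty.
            if not handle.read(1):
                empty.append(path.name)
    return empty


# Example: print(empty_gzip_files())
```

Running this after the match step would show whether every batch produced empty output or only some of them.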

I thought this was because there was no match, but after adding the positive species that the query belongs to, I get the same error.

$ cat -A data/batches_2m_small.txt 
acinetobacter_nosocomialis__01$
aeromonas_salmonicida__01$
acinetobacter_baumannii__02$
streptococcus_mutans__01$

I think this is the principal error message:

$ more logs/04_filter/t2.sm.MutL.log
Translating matches intermediate/03_match/acinetobacter_baumannii__02____t2.sm.MutL.gz
Processing batch acinetobacter_baumannii__02 query #0 (None)
Traceback (most recent call last):
  File "/hps/nobackup/iqbal/shenwei/2kk/mof-search.no_dustbin/./scripts/filter_queries.py", line 180, in process_cobs_file
    _ = self._query_dict[qname]
        ~~~~~~~~~~~~~~~~^^^^^^^
KeyError: None

I think this suggests that there might be some old intermediate data?

Try running make clean before rerunning everything.

I ran make clean before make.
The intermediate directory is empty after running make clean, as mentioned before.