DerrickWood/kraken2

`k2mask` is only single-threaded even with `--threads` and `OMP_NUM_THREADS` set properly

Closed this issue · 8 comments

During kraken2-build --download-library, in the step that masks low-complexity sequences, k2mask always appears to run single-threaded, even though I specified --threads n and set OMP_NUM_THREADS=n in my environment before running. Do you know why this is happening?

Looks like only the --build step uses threads. All other steps seem to be single-threaded.

kraken2/CHANGELOG.md

Lines 9 to 10 in 4cbdc5f

- Low complexity masking for nucleotide sequences now performed by our own
multithreaded code (k2mask) instead of the dustmasker program

kraken2/scripts/k2

Lines 457 to 470 in 8f82a7d

def spawn_masking_subprocess(output_file, protein=False):
    masking_binary = "segmasker" if protein else "k2mask"
    if "MASKER" in os.environ:
        masking_binary = os.environ["MASKER"]
    masking_binary = find_kraken2_binary(masking_binary)
    argv = masking_binary + " -outfmt fasta | sed -e '/^>/!s/[a-z]/x/g'"
    if masking_binary.find("k2mask") >= 0:
        # k2mask can run multithreaded
        argv = masking_binary + " -outfmt fasta -threads 4 -r x"
    p = subprocess.Popen(
        argv, shell=True, stdin=subprocess.PIPE, stdout=output_file
    )

But for some reason the k2 wrapper script does not show up in my process list, not even as a parent process, and you can see that the k2mask options don't include -threads n as they should:

hermida+    9734  0.0  0.0 223352  3328 pts/7    S+   Jul29   0:00         /bin/bash /home/hermidalc/soft/miniforge3/envs/tcga-wgs-kraken-microbial-quant/share/kraken2-2.1.3-1/libexec/download_genomic_library.sh bact
hermida+   61824  0.0  0.0 223220  3328 pts/7    S+   04:02   0:00           /bin/bash /home/hermidalc/soft/miniforge3/envs/tcga-wgs-kraken-microbial-quant/share/kraken2-2.1.3-1/libexec/mask_low_complexity.sh .
hermida+   61827 13.7  0.0  41020 32204 pts/7    S+   04:02  68:30             k2mask -in ./library.fna -outfmt fasta
hermida+   61828 13.2  0.0 221760  2176 pts/7    S+   04:02  66:19             sed -e /^>/!s/[a-z]/x/g

Yep, the k2 wrapper script that was added in v2.1.3 is never called when you run kraken2-build --download-library. Instead, it runs download_genomic_library.sh, which spawns mask_low_complexity.sh, which seems to have older k2mask spawning code that is only single-threaded!

https://github.com/DerrickWood/kraken2/blob/v2.1.3/scripts/mask_low_complexity.sh
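As a possible workaround while the shell script stays single-threaded, k2mask could be invoked directly with an explicit thread count. The sketch below is only an illustration: it uses just the flags that appear in this thread (`-in`, `-outfmt fasta`, `-threads`, `-r x`), and the function names `build_k2mask_command` and `mask_library` are hypothetical. It assumes that `-r x` replaces the `sed` step, as the k2 snippet above suggests. How the masked file is then swapped in for library.fna is left out, since that depends on what mask_low_complexity.sh expects.

```python
import shlex
import subprocess


def build_k2mask_command(input_fasta, threads):
    """Build a k2mask argv list using only flags seen in this thread."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "k2mask",
        "-in", input_fasta,
        "-outfmt", "fasta",
        "-threads", str(threads),
        # Assumption: "-r x" writes masked bases as "x", so no sed pipe is needed.
        "-r", "x",
    ]


def mask_library(input_fasta, output_fasta, threads):
    """Run k2mask and write the masked sequences to output_fasta."""
    cmd = build_k2mask_command(input_fasta, threads)
    print("running:", shlex.join(cmd))
    with open(output_fasta, "w") as out:
        # check=True raises CalledProcessError if k2mask exits non-zero.
        subprocess.run(cmd, stdout=out, check=True)
```

For example, `mask_library("library.fna", "library.masked.fna", 16)` would run the masking step with 16 threads, provided k2mask is on the `$PATH`.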

I just found k2. I think this is meant to be used as an independent script.

Can you try this?

$ k2 download-library --db viral --library viral

-threads 4

Not sure why this is hardcoded. Shouldn't this be configurable?

The problem is that in the bioconda kraken2 package none of these independent scripts are on the $PATH, only the binaries described in the manual: kraken2, kraken2-build, kraken2-inspect, etc. Looking at the bioconda kraken2 package install, all these scripts are in a libexec subfolder of the package; they aren't copied or linked back to the conda environment's bin folder, which they should be if they are meant to be visible. I'm building a pipeline that is intended for others to use and to be reproducible, so conda is a must.
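To check whether a given script is present in a conda install, one could search the libexec folder under the environment prefix. This is a minimal sketch that assumes the `share/kraken2-*/libexec` layout visible in the process list above and that `CONDA_PREFIX` points at the active environment. The helper name `find_kraken2_scripts` is hypothetical, and whether k2 itself ships in that folder for a given package build is not confirmed here.

```python
import glob
import os


def find_kraken2_scripts(name, prefix=None):
    """Return sorted paths matching <prefix>/share/kraken2-*/libexec/<name>."""
    # Assumption: CONDA_PREFIX is set by conda for the active environment.
    prefix = prefix or os.environ.get("CONDA_PREFIX")
    if not prefix:
        return []
    pattern = os.path.join(prefix, "share", "kraken2-*", "libexec", name)
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


if __name__ == "__main__":
    for script in ("k2", "mask_low_complexity.sh"):
        print(script, "->", find_kraken2_scripts(script) or "not found")
```

An empty result for `k2` would mean the wrapper is not shipped in that location, in which case only the older shell scripts are available.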

It's fixed in master but not yet available in a release.

kraken2/scripts/k2

Lines 476 to 487 in 4cbdc5f

def spawn_masking_subprocess(output_file, protein=False):
    masking_binary = "segmasker" if protein else "k2mask"
    if "MASKER" in os.environ:
        masking_binary = os.environ["MASKER"]
    masking_binary = find_kraken2_binary(masking_binary)
    argv = masking_binary + " -outfmt fasta | sed -e '/^>/!s/[a-z]/x/g'"
    if masking_binary.find("k2mask") >= 0:
        # k2mask can run multithreaded
        argv = masking_binary + " -outfmt fasta -threads {} -r x".format(
            multiprocessing.cpu_count() // 2
        )

The new features in the k2 script fix this issue.