algbio/themisto

27092 Killed: 9

karel-brinda opened this issue · 4 comments

I tested the method with simplitigs of the human genome (HG38, k=31, https://zenodo.org/record/3770419/files/hg38-simplitigs31.fa.gz?download=1). I always get the error Killed: 9, which happens after the program gets stuck on Dumping k-mers to disk for a very long time (>1h).

  • OS: OS X
  • Command /Users/karel/github/themisto/build/bin/build_index --mem-megas 10000 --k 31 --input-file hg38-simplitigs31.fasta --n-threads 8 --index-dir index --temp-dir tmp
0.0230 Mon Sep 14 17:25:06 2020 Themisto-v0.2.0-1-gd8e44f5
Input file = hg38-simplitigs31.fasta
Input format = fasta
Index directory = index
Temporary directory = tmp
k = 31
Number of threads = 8
Memory megabytes = 10000
Automatic colors = false
Load BOSS = false
0.0250 Mon Sep 14 17:25:06 2020 Starting
0.0250 Mon Sep 14 17:25:06 2020 Making all characters upper case and replacing non-{A,C,G,T} characters with random characeters from {A,C,G,T}
69.9690 Mon Sep 14 17:26:16 2020 Replaced 0 characters
69.9750 Mon Sep 14 17:26:16 2020 Building BOSS
69.9750 Mon Sep 14 17:26:16 2020 Listing (k+2)-mers
Calling KMC with: kmc -fm -k33 -b -m10 -ci1 -cs1 -cx4294967295 -t8 tmp/seqs-AetcbrX4gJpCZDtoyYuBdjcAh tmp/KMCfyJijWik8oHkSTcWJIAzu3wE9 tmp 
**********************
Stage 1: 100%
Stage 2: 100%
Dumping k-mers to disk
./construct.sh: line 13: 27092 Killed: 9               /Users/karel/github/themisto/build/bin/build_index --mem-megas 10000 --k 31 --input-file hg38-simplitigs31.fasta --n-threads 8 --index-dir index --temp-dir tmp

Confirmed that the issue reproduces on both macOS 10.14.6 and Linux (CentOS 7) when using the v0.2.0 prebuilt binaries. build_index dumps 100G worth of k-mers to disk and then hangs doing nothing. This hang also seems to introduce a memory leak which eventually crashes the program.

I suspect the issue is somehow related to the rather large number of simplitigs (10 210 401) in hg38-simplitigs31.fasta, or how KMC is called from themisto. Unfortunately I can't think of a quick fix/workaround right now for this specific dataset without further investigation into the issue.
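
To check how many sequences an input file contains before indexing it, something like the following should work. This is only a sketch: `count_fasta_records` is a helper name made up for illustration, not part of themisto, and it assumes a well-formed FASTA file in which every record starts with a `>` header line.

```python
import gzip


def count_fasta_records(path: str) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    # Transparently handle gzip-compressed input such as *.fa.gz.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Example call (file name taken from the command in this thread):
# count_fasta_records("hg38-simplitigs31.fasta")
```

The file is streamed line by line, so memory use stays small even for inputs of this size.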

I will investigate.

This seems to be caused by a lazy implementation of a subroutine on my part, which is slow and uses on the order of mk^2 bytes of memory, where m is the number of sequences and k is the k-mer length. This becomes a problem when the number of reference sequences is large, as it is in this use case. The Killed: 9 message is probably due to excessive memory usage. A fix is coming very soon.
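
For a rough sense of scale, the mk^2 estimate can be evaluated with the numbers from this thread (m = 10 210 401 simplitigs, k = 31). This is a back-of-the-envelope sketch only: "on the order of" hides a constant factor that is not stated here, so the `constant` parameter below is a hypothetical placeholder set to 1.

```python
def estimate_bytes(m: int, k: int, constant: float = 1.0) -> float:
    """Rough m * k^2 memory estimate in bytes.

    `constant` stands in for the unknown hidden factor (assumed 1 here).
    """
    return constant * m * k * k


m = 10_210_401  # number of simplitigs reported above
k = 31          # k-mer length used in the command

est = estimate_bytes(m, k)
print(f"{est / 1e9:.1f} GB")  # 9.8 GB with constant = 1
```

Even with a constant factor of 1, this is about the same size as the 10000 MB passed via --mem-megas, so any larger factor would exceed it.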

Fixed in 44703e8.