attractivechaos/klib

Maximum capacity of hash table?

conchoecia opened this issue · 1 comment

I implemented a DNA de Bruijn graph with canonical k-mers on top of khash, but my program keeps freezing when I try to insert more than 2.1 billion k-mers.

The freezing is not due to the system running out of memory (there is plenty): the process holding the de Bruijn graph only occupies ~24 GB (8 bytes per key, I presume, plus 4 bytes for the struct I store per key). The freezing is also not caused by errors in the input data, which I verified with some tests.

I'm representing k-mers as uint64_t values. The hash function is one provided by @lh3:

static inline uint64_t hash_64(uint64_t key)
{ // more sophisticated hash function to reduce collisions
  key = (~key + (key << 21)); // key = (key << 21) - key - 1;
  key = key ^ key >> 24;
  key = ((key + (key << 3)) + (key << 8)); // key * 265
  key = key ^ key >> 14;
  key = ((key + (key << 2)) + (key << 4)); // key * 21
  key = key ^ key >> 28;
  key = (key + (key << 31));
  return key;
}

The data type I am storing is simple:

typedef struct DBnode {
  uint16_t count = 0;
  uint16_t flag = 0;
} DBnode;

I am initializing the hash table with KHASH_INIT(64, khint64_t, DBnode, 1, hash_64, kh_int64_hash_equal).
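
For completeness, the table type and the table itself are set up in the usual khash way. A minimal sketch, assuming the DBnode struct and hash_64() above (class_h is the table used in the snippet below):

#include <stdint.h>
#include "khash.h"

/* DBnode and hash_64() as defined above */
KHASH_INIT(64, khint64_t, DBnode, 1, hash_64, kh_int64_hash_equal)

int main(void)
{
  khash_t(64) *class_h = kh_init(64); // allocate an empty table
  /* ... insert k-mers as in the snippet below ... */
  kh_destroy(64, class_h);            // free the table
  return 0;
}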

Here is a snippet of code where I insert k-mers into the hash table. lookup is the canonical representation of the k-mer as a uint64_t:

  khint_t k;
  k = kh_get(64, class_h, lookup); // query the hash table; returns kh_end() if absent
  int is_missing, absent;
  is_missing = (int)(k == kh_end(class_h));
  if (put == 1){ // insert
    if (is_missing){
      k = kh_put(64, class_h, lookup, &absent);
      kh_val(class_h, k).count = 1;
      kh_val(class_h, k).flag = 0;
    } else {
      kh_val(class_h, k).count += 1;
    }
  }

Any ideas why my program might be consistently freezing when I add over 2.1 billion keys to the hash table? Thank you.

khash can only keep 2 billion keys. The right (and better) way to implement a huge hash table for a de Bruijn graph is to use an ensemble of smaller hash tables. For example, you can use 2**14 = 16384 hash tables. On insertion, you compute the hash value first, use its first 14 bits to decide which sub-table to insert into, and then insert the key there. This strategy also simplifies parallel insertion, because the chance of two threads inserting into the same sub-table at the same time is small; you can simply lock a sub-table during insertion. An even better approach is to use an invertible integer hash function.
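
A sketch of this sharding scheme, assuming the DBnode and hash_64() definitions from the question above (routing by the top 14 bits of the hash is one reasonable choice, since khash indexes its buckets with the low bits; per-table locks for multi-threaded insertion are omitted):

#include <stdint.h>
#include "khash.h"

/* DBnode and hash_64() as in the question above */
KHASH_INIT(64, khint64_t, DBnode, 1, hash_64, kh_int64_hash_equal)

#define SHARD_BITS 14
#define N_SHARDS   (1 << SHARD_BITS)   /* 2**14 = 16384 sub-tables */

static khash_t(64) *shard[N_SHARDS];

static void shards_init(void)
{
  for (int i = 0; i < N_SHARDS; ++i) shard[i] = kh_init(64);
}

static void shard_insert(uint64_t lookup)
{
  uint64_t h = hash_64(lookup);
  khash_t(64) *t = shard[h >> (64 - SHARD_BITS)]; /* top 14 bits pick the sub-table */
  int absent;                                     /* (lock t here when threading)  */
  khint_t k = kh_put(64, t, lookup, &absent);
  if (absent) {                                   /* first time this k-mer is seen */
    kh_val(t, k).count = 0;
    kh_val(t, k).flag = 0;
  }
  kh_val(t, k).count += 1;
}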