jermp/fulgor

Warning due to 64-bit hash codes

tmaklin opened this issue · 9 comments

Hi, I'm trying to index a large and diverse collection of bacterial genomes (~300k sequences from ~4000 different species) but keep getting the following error:

fulgor build -k 31 -m 20 -l multifasta_paths.txt -o index -d tmp-fulgor -g 780 -t 20 --verbose
terminate called after throwing an instance of 'std::runtime_error'
  what():  Using 64-bit hash codes with more than 2^30 keys can be dangerous due to collisions: use 128-bit hash codes instead.
/var/spool/slurmd/job273252918/slurm_script: line 35: 2880844 Aborted                 (core dumped)

I found a fix for this in sshash (jermp/sshash#16) but even after uncommenting the relevant line in a fresh clone of fulgor after running git submodule update --init --recursive I keep getting the same error.

I forgot to mention that this happens after fulgor enters the following step:

building minimizers MPHF (PTHash) with 8 threads...
jermp commented

Hi,
yes it means there are more than 1B minimizer for m=20, hence the minimal perfect hash function use in SSHash (which is PTHash) would require 128-bit hash codes to be safe.
It should be enough to change the default hasher in SSHash for minimizers, here: https://github.com/jermp/sshash/blob/master/include/hash_util.hpp#L51.

jermp commented

Should you have some update on the Themisto index, I'd love to know. Thanks!

thanks for the quick reply! I think I changed the wrong line (https://github.com/jermp/sshash/blob/master/include/kmer.hpp#L6), will let you know if this works.

Just testing fulgor vs themisto v3 with mSWEEP for a certain downstream application for now.

rob-p commented

Thanks for the info @tmaklin :). One thing (of which I'm certain you are aware) is that the current approaches for reporting the pseudoalignment results are far from space optimal.

I'm aware this is something you've worked on in alignment-writer. If it makes sense to design and converge on a more standard and compact output format for these tools, this is something that we'd certainly be interested in. For example, we have a binary format (the RAD) format that we use in alevin-fry that addresses a variant of this problem. However, it would be nice to generalize and to understand what information makes sense for different use cases.

Cheers,
Rob

Indexing the data overnight worked so closing this. Thanks!

Re the file format, a standardized and compact output for all the different tools sounds great! Alignment-writer (think I need to figure out a better name) is a wrapper around BitMagic and achieves anything from 10x to 100x compression on the test cases I used while developing but the efficiency naturally depends on the complexity of the alignment. I'll be adding support for the format fulgor currently uses soon.

Some issues I've noticed with the formats while developing and using the various tools, roughly in order of headaches caused

  • Total number of alignment targets can't be inferred with certainty from the format.
  • Not printing empty lines for no alignments.
  • Fragment names instead of the position of the read in the fastq files (makes sorting the file difficult and slows down matching the alignments with the reads if the file is not sorted).
  • Total number of reads can't be inferred.
  • Multiple files to store the results (for example unique alignments + their counts).
  • I tend to prefer formats that support streaming the results rather than having to wait for the whole alignment to finish, or conversely read in the whole file before processing the results.

Would be nice to discuss/work on this at some point.

jermp commented

Hi @tmaklin,
yes, a common alignment format would be very nice. Happy to work on this together if you like.

rob-p commented

Likewise. I have some thoughts on this as well :). Shall we open a separate issue for it? Or, even better, @jermp, if you can enable “discussions” on this repo.

jermp commented

Good idea, discussions enabled!