bcgsc/ntHash

Multispaced seeds

JustinChu opened this issue · 6 comments

@mohamadi

I have an old modified version of rollinghash.h that has a function for multiple spaced seeds written for hitseq 2016.

The main part of the modification looks like this:

// mask out specific bases given the masking value positions
// the string iterator is the unmasked k-mer (random access without copying)
// returns cannonical value
inline uint64_t maskValues(uint64_t fkVal, uint64_t rkVal, const unsigned k,
		const std::vector<unsigned> &values,
		const std::string::const_iterator &kmerItr) {
	for (std::vector<unsigned>::const_iterator itr = values.begin();
			itr != values.end(); ++itr) {
		fkVal ^= rol(msTab[(unsigned char) *(kmerItr + *itr)], k - 1 - *itr);
		rkVal ^= rol(msTab[(unsigned char) *(kmerItr + *itr) + cpOff], *itr);
	}
	return (rkVal < fkVal) ? rkVal : fkVal;
}

And a rolling hash iterator that can then take in spaced seed patterns (in the constructor it takes a set of spaced seed patterns). I was planning on updating nthash with this since I'm also using that right now in the current code.

Due to time constraints, I'll be rolling the both versions (nthash for k-mers, rollinghash for multiple spaced seeds) out in the next version of BBT but was wondering if you could update nthash to include a multiple spaced seeds iterator.

You could start with this crude method that simply masks out bases but I guess ideally you would use some of the scheme that was presented in this paper: http://drops.dagstuhl.de/opus/volltexte/2017/7650/.

When you are done, I could update the BBT and the btl_bloomfilter repo to support multiple spaced seeds.

What do you think?

@JustinChu I'll go through it soon, after completing the revision of the paper.

Hamid

Added Branch spacedseeds and we can make another nttest.cpp or something that compares:

  • normal ntHash
  • masking spaced seed ntHash
  • optimal spaced seed ntHash (based on paper implementation)

First testing pattern can be:

000001001100101010010011000000101101110101100010111010011011
111110101000011101100000000101010111001010000001000000000000
001001110011000110001100010010010010001100011000110011100100
000000000000100000010100111010101000000001101110000101011111
110110010111010001101011101101000000110010010101001100100000

We can also take the 3 from the from the paper:

1111011101110010111001011011111
1111101011100101101110011011111
1111101001110101101100111011111
1111010111010011001110111110111
1110111011101111010010110011111
1111101001011100111110101101111
1111011110011010111110101011011
1110101011101100110100111111111
1111110101101011100111011001111

I pushed a branch with a test case:
https://github.com/bcgsc/ntHash/spacedseeds

To run tests

ssTest/sstest -s "1000010111110110111101110011000111010011111011111111110001100100001011101111011 0111001111110111110111111110111000111111011101111101011110011111111101011110111 0011111000001001101110111011111010111111010111011111111001101101111110111011111 1111110111111111101111111100100111101010101011110010011111111011111111110111111 1111101110111111011011001111111101110101111110101111101110111011001000001111100 1110111101011111111100111101011111011101111110001110111111110111110111111001110 1101111011101000010011000111111111101110100101110001100111011110110111110100001" filename.fa

Feel free to update it as you see fit

Closing, done in ntHash2.