barrust/pyprobables

Math domain error

Glfrey opened this issue · 31 comments

Hello,

I'm getting the following error when using print(bloom_filter).

File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 127, in __str__
    self.estimate_elements(),
  File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 350, in estimate_elements
    log_n = math.log(1 - (float(setbits) / float(self.number_bits)))
ValueError: math domain error

I'm running the latest version, downloaded from pip only the other day, and I'm using Python version 3.8.6.

Sorry you ran into an issue. Can you provide either a code sample that recreates the issue or an exported bloom filter to use for testing?

Thanks!

Hi @barrust, sure thing. I'm rerunning the code now (minus the print command) to get that bloom filter for you; the error from the print command stopped the bloom filter from being exported, since the export came after the error.

I can also provide the code sample just in case that's quicker:

import sys
import time
from pathlib import Path

from Bio.SeqIO.FastaIO import SimpleFastaParser  # Biopython FASTA parser
from probables import BloomFilter


def bloom_construction(infasta, max_elem, max_err, k_size):
    start = time.time()
    input_file = Path(infasta).stem
    bloom = BloomFilter(est_elements=max_elem, false_positive_rate=max_err)
    with open(infasta) as build_fast:
        for title, seq in SimpleFastaParser(build_fast):
            kmers = kmerize(seq, k_size)  # my own helper; returns a dict keyed by k-mer
            for k in kmers.keys():
                bloom.add(k)
    end = time.time()
    print("Bloom filter constructed and populated for", input_file, "took", end - start, "seconds", file=sys.stderr)
    print(bloom)
    print("Writing bloom filter to file")
    bloom.export(input_file + "_bf.export")
    return

That function takes in data (in this case DNA strings), fragments it into k-mers (kmerize(seq, k_size)), and then adds those fragments to the bloom filter.
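For reference, kmerize is my own helper and not part of pyprobables; a rough sketch of what it does (sliding a window of length k_size over the sequence and collecting each substring) would be something like:

def kmerize(seq, k_size):
    # Hypothetical sketch only; the real helper may also handle counts,
    # reverse complements, ambiguous bases, etc.
    kmers = {}
    for i in range(len(seq) - k_size + 1):
        kmer = seq[i:i + k_size]
        kmers[kmer] = kmers.get(kmer, 0) + 1
    return kmers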

Thanks, what parameters (max_elem and max_err) are you using?

And I am glad you are using this for Bioinformatics; I use it for bioinfo myself!

Ah that's so great, hello fellow bioinformatician!
I'm so excited to use this for my work thank you so much for creating and maintaining it.

As for the parameters, I'm using max_elem 210415 and max_err 0.1, which I know is rather high; it's for experimental purposes. I've got other max_err values running at the moment, but they're taking days thanks to the sheer size of my data. I've considered parallelizing it, but I'm not overly convinced it would provide much benefit given that all the threads would need access to the same data structure.

So, I took the line of code that is failing (line 350) and wrote a quick script to see where it would fail based on your parameters:

import math

for i in range(37946 + 1):
    try:
        log_n = math.log(1 - (float(i) / float(37946)))
    except ValueError:
        print("failed on {}".format(i))

This showed that it only fails on 37946. How did you estimate the number of elements you were adding to the BloomFilter? I wonder if the bloom filter is "full" (i.e., all bits are set, so it is well over the desired false positive rate).

Ah, potentially. Do you want me to run that script somewhere in my code?
I got JellyFish to count the number of unique k-mers in the file and used that to set the maximum.
I've got the others running now; I should be able to tell you how they do in the next day or so. They're running at a 0.01 and 0.001 error rate and are taking much longer (we're at almost 48 hours now).

If the one that failed is "quick", can you print these after it is built?

print(bloom.number_bits)

# count number of bits set
setbits = 0
for i in range(0, self.bloom_length):
    setbits += bin(self._bloom[i]).count("1")
print(setbits)

If they are the same, I think something is wrong with the number of kmers you calculated to insert (i.e., too many are being inserted).
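For a rough sense of why over-inserting causes the error, here is a sketch using the textbook Bloom filter sizing formulas (the library's exact sizing may differ slightly): once the number of inserted items is far beyond the estimate, essentially every bit gets set, and log(1 - setbits / number_bits) becomes log(0).

import math

est_elements = 210415   # your max_elem
fpr = 0.1               # your max_err
num_bits = math.ceil(-est_elements * math.log(fpr) / (math.log(2) ** 2))
num_hashes = round((num_bits / est_elements) * math.log(2))

for n_actual in (est_elements, 10 * est_elements, 70 * est_elements):
    # expected fraction of bits set after n_actual insertions
    fill = 1 - math.exp(-num_hashes * n_actual / num_bits)
    print(n_actual, round(fill, 6))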

I've got that running now and I'll update you as soon as it's done. Both of the other max_err settings (0.01, 0.001) hit the same math domain error as posted. I'll get them running the code you've given and we'll see what happens.
I'll also use a different k-mer counting software and see if the numbers line up.
There's no chance identical k-mers are being hashed to different places, is there?

Let me know if the kmer count is different using the different software.

As for the same k-mer hashing differently: no, that shouldn't happen. The following code is the "default" hashing algorithm; a for loop adjusts the seed used for successive hashes until the required number of hashes is reached.

def fnv_1a(key: KeyT, seed: int = 0) -> int:
    max64mod = UINT64_T_MAX + 1
    hval = (14695981039346656037 + (31 * seed)) % max64mod
    fnv_64_prime = 1099511628211
    tmp = list(key) if not isinstance(key, str) else list(map(ord, key))
    for t_str in tmp:
        hval ^= t_str
        hval *= fnv_64_prime
        hval %= max64mod
    return hval
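Roughly, that seed-varying loop builds on the fnv_1a function above; a simplified sketch of the library's default hashing helper (the actual helper may differ in details):

def default_fnv_1a(key, depth=1):
    # one fnv_1a call per required hash, bumping the seed each time
    return [fnv_1a(key, seed) for seed in range(depth)]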

One question, though: does JellyFish use the reverse complement to reduce the total number of sequence k-mers?

I pushed some code to handle the exception that occurs when the number of bits set is equal to the number of bits. Since the estimate can't be calculated in that case, it now returns -1 to signify that the Bloom filter is full. I also added tests to demonstrate this and to catch it in the future.

setbits = self._cnt_number_bits_set()
if setbits >= self.number_bits:
    return -1  # not sure this is the "best", but it would signal something is wrong
log_n = math.log(1 - (float(setbits) / float(self.number_bits)))

You mentioned speed as an issue; there are ways you can parallelize building a Bloom filter without all the workers needing access to the same bloom. You can build two different blooms (using the same total estimated elements and false positive rate) and then union them together after completion.

blm1 = BloomFilter(est_elements=10000000, false_positive_rate=0.01)
blm2 = BloomFilter(est_elements=10000000, false_positive_rate=0.01) 

# parallel build things

blm3 = blm1.union(blm2)

You can also build multiple Blooms from the same input in a single pass using add_alt(): build the hashes once and pass them into each Bloom filter.
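A rough sketch of that approach, assuming both filters are built with the same est_elements and false_positive_rate so they need the same number of hashes (the kmers list here is just a placeholder):

from probables import BloomFilter

blm1 = BloomFilter(est_elements=10000000, false_positive_rate=0.01)
blm2 = BloomFilter(est_elements=10000000, false_positive_rate=0.01)

kmers = ["ACGTACGT", "CGTACGTA"]  # placeholder input
for kmer in kmers:
    hashes = blm1.hashes(kmer)  # compute the hashes once
    blm1.add_alt(hashes)        # ...and reuse them for every filter
    blm2.add_alt(hashes)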

I believe this is resolved in v0.5.6; please let me know if your testing finds that it is resolved!

Hi @barrust, sorry for the late reply, I've been completely bogged down with another project.

I got the following error with the code you wrote:

Traceback (most recent call last):
  File "/home/b56j439/prog/bloom_filter/bloom_test.py", line 70, in <module>
    pop_bloom_filter = bloom_construction(file, m, e, k)
  File "/home/b56j439/prog/bloom_filter/bloom_test.py", line 64, in bloom_construction
    for i in range(0, self.bloom_length):
NameError: name 'self' is not defined

I'll run the new version of pyprobables and I'll let you know what happens.

I haven't had a chance to rerun the k-mer tests; I'm doing that now.

As for your parallel ideas, thank you so much! I'll give them a go.

Dang! I meant to clean that up. This is what I meant. Let me know what happens with the new version.

print(bloom.number_bits)

# count number of bits set
setbits = 0
for i in range(0, bloom.bloom_length):
    setbits += bin(bloom.bloom[i]).count("1")
print(setbits)

Hi @barrust.

I got a different error with that code this time which still involves "self":

Traceback (most recent call last):
  File "/prog/bloom_filter/bloom_test.py", line 70, in <module>

  File "/prog/bloom_filter/bloom_test.py", line 64, in bloom_construction
    # count number of bits set
NameError: name 'self' is not defined

Also, I did get different counts using different software. The counts I got back were greater than those JellyFish reported. I'm not sure why this would be, as all of my parameters were appropriate, but I guess I should add a buffer to any reported count to stop the filter becoming full.

However, does the new version print anything to the console in the event the filter becomes full? I ask because even though I haven't adjusted my parameters, I'm not getting anything back to indicate the filter is full.

Not sure why the second snippet didn't fix the self issue; I'm not sure what is happening there.

The new version of pyprobables will now show estimated elements added: -1 where the error happened before (i.e., all the bits were set). You should also see that the false positive rate (again, when printing to the console) is at or near 100%, and number bits set will be the same as bits.

This is the print out of my example to test the fix:

BloomFilter:
	bits: 63
	estimated elements: 10
	number hashes: 4
	max false positive rate: 0.050000
	bloom length (8 bits): 8
	elements added: 100
	estimated elements added: -1
	current false positive rate: 0.993026
	export size (bytes): 28
	number bits set: 63
	is on disk: no

Thinking about it more, I should probably force the current false positive rate to 100%, since in this case there is no possible way for it not to actually be 100%.

Hi @barrust,

Unfortunately I'm still getting the math domain error on the reruns with the new version.

Traceback (most recent call last):
  File "/home/user/prog/bloom_filter/bloom_filter_construct.py", line 64, in <module>
    pop_bloom_filter = bloom_construction(file, m, e, k)
  File "/home/user/prog/bloom_filter/bloom_filter_construct.py", line 60, in bloom_construction
    print(bloom)
  File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 127, in __str__
    self.estimate_elements(),
  File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 350, in estimate_elements
    log_n = math.log(1 - (float(setbits) / float(self.number_bits)))
ValueError: math domain error

I also got the error when I multiplied my expected number of insert elements by 10 as a buffer in case the bloom filter was full.

That is odd. The line number is still the same as the previous error. Can you confirm the version of pyprobables you are running?

import probables

print(probables.__version__)

Hi @barrust, I got the following error:

Traceback (most recent call last):
  File "/home/user/prog/bloom_filter/bloom_filter_construct.py", line 13, in <module>
    print(probables.__version__)
NameError: name 'probables' is not defined

However I can confirm 0.5.6 is installed:

Defaulting to user installation because normal site-packages is not writeable
Collecting pyprobables
  Downloading pyprobables-0.5.6-py3-none-any.whl (35 kB)
Installing collected packages: pyprobables
  Attempting uninstall: pyprobables
    Found existing installation: pyprobables 0.5.5
    Uninstalling pyprobables-0.5.5:
      Successfully uninstalled pyprobables-0.5.5
Successfully installed pyprobables-0.5.6

That is really strange, on both fronts. Not sure why this is happening if it isn't the all-bits-set issue. My tests to pin down the edge case that caused the error showed that it only errored out when the number of set bits >= the number of bits. The line of code the issue was on changed in the latest version, so I'm not sure why you are getting the same line number with the new version, unless your conda env differs from the environment where 0.5.6 was installed.

To be able to help more, I may need a code sample that can be used to replicate the issue: perhaps a modified input file and the code to kmerize it for insertion, etc., along with the estimate of the number of k-mers in the modified input sample. If you would not like to share those items on GitHub, we can find another place to share that data.

I'm more than happy to share my data and code with you, what's the best way of facilitating that?

I'm going to remove and reinstall everything and try again. I'm also going to try on a different server; I just want to make sure the problem isn't something on my end. I'll get back to you as soon as I've run some tests.

Hi @barrust, I got the same error with a reinstall. I'm having memory issues running it on the other server, so I might not be able to test it there.

@barrust I managed to get it to run successfully on a smaller dataset on the other server without the math domain error. I'll try a smaller dataset on the original server too. If the smaller dataset works but the larger datasets fail, even when allocated a buffer 10X greater than any estimated amount, and no memory issue is flagged but the math domain error is, do you have any idea what's going on?

At this point, I can't really track down what would be happening without the code and perhaps an example data set. If you can email me the code you are using, I might be able to find what is happening. barrust at gmail dot com

Also, are the memory issues related to this library?

Hi @barrust, the memory issues were my fault for not allocating enough memory for the program, so it was terminated. Also, I have some good news! I've managed to get it to run successfully without any math domain error. It turns out I was using the wrong estimate for my max elements parameter and was off by miles, which is what you suspected all along. I only realized this when I ran the smaller dataset, checked the estimated number of inserted elements against my max elements parameter, and saw a massive discrepancy. When I corrected this it ran beautifully, although I still think I need to experiment with allowing a larger buffer, as the error rate printed by pyprobables still exceeds what I specified.
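For anyone else who hits this, the check that exposed my mistake was roughly the following (bloom is the filter built in my function above; the attribute and method names are as I understand the pyprobables API):

print(bloom.estimated_elements)            # what I told the filter to expect
print(bloom.elements_added)                # how many adds actually happened
print(bloom.estimate_elements())           # the filter's own estimate
print(bloom.current_false_positive_rate())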

This still means the math domain error is unexplained, but it's likely related to the bloom filter being full, although that should be caught by the modifications you added in the new version, so I've no idea.

Thank you for all your help through this, and apologies for not catching my error sooner; I double- and triple-checked my parameter settings, but it turned out I was simply using the incorrect value.

When you say it was off by miles, can you give me an idea of how far off so that I can try to reproduce it?

Can you run this as a test on the system that found the issue? I am trying to replicate it by adding TONS more items to the bloom than expected (without it taking more than a few minutes to run):

import sys
import probables


if __name__ == '__main__':

    version = tuple([int(i) for i in probables.__version__.split(".")])
    if version < (0, 5, 6):
        print("Probables version < 0.5.6!\nEXITING!!!")
        sys.exit()

    est_elms = 1000
    elms_to_add = 10000000

    blm = probables.BloomFilter(est_elms, 0.1)
    for i in range(elms_to_add):
        blm.add(str(i))

    print(blm)

Hi @barrust sorry for the delay. I'll certainly get that tested for you over the weekend.

As for how far off I was, try not to smack your head on the desk when I tell you, but it was over 14 million! I was mistakenly using the max k-mer count instead of the actual number of elements.

Hello @Glfrey any update on the test? Thanks!

Hi @barrust, sorry for the delay; I'm currently an inpatient in hospital. As soon as I can get access to the tests I ran, I'll update you. Hopefully that should be in the next day or so!

@Glfrey I am so sorry you are an inpatient. Your health is MUCH more important. I hope you get well soon!

Hi @barrust, thank you so much for your understanding! I'm back at work now and I'll get that test running for you today.

Closing for now. If this is still an issue, please re-open!