n-th prime is too slow. Consider using primecount as a backend
galanakis opened this issue · 21 comments
The n-th prime function of Julia is actually really slow. For example, in Julia prime(10^7) runs in 14 seconds on my machine. There is a well-known C++ library with a 2-clause BSD license called primecount which can calculate prime(10^14) in 2 seconds (and of course prime(10^7) is instantaneous). Mathematica's Prime function is comparable to primecount (it is actually about 2 times slower).
So given that, why not use primecount as a backend for the n-th prime function?
I could help with a patch if something like this sounds interesting.
It would be much better (in my opinion) to get a fastish Julia implementation. The current implementation is a loop that calls nextprime, which is very bad. A decent simple implementation would be a segmented prime sieve that counts up as it finishes each segment. This should be relatively simple to implement and fast.
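To make the counting idea concrete, here is a minimal sketch (not code from any PR in this thread) that finds the n-th prime by sieving fixed-size blocks and keeping a running count, using Primes.jl's existing primes(lo, hi) as a stand-in for a real segmented sieve kernel:

using Primes

# Hypothetical helper, not part of Primes.jl: walk blocks of fixed size,
# count the primes in each, and only materialize the block containing the answer.
function nth_prime_by_segments(n::Integer; segment_size::Int = 10^6)
    n >= 1 || throw(ArgumentError("n must be positive"))
    count = 0
    lo = 2
    while true
        hi = lo + segment_size - 1
        ps = primes(lo, hi)              # primes in [lo, hi]
        if count + length(ps) >= n       # the n-th prime lies in this block
            return ps[n - count]
        end
        count += length(ps)
        lo = hi + 1
    end
end

# nth_prime_by_segments(10^7) == 179424673

A dedicated sieve kernel would avoid allocating the intermediate prime vectors, but the control flow is the same.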
Thanks for the response. I am interested in implementing the Meissel-Lehmer algorithm. It will take months of part-time work.
Is anyone else working on an implementation of prime-count?
#102 is my 1-hour effort. It computes prime(10^7) in 0.14 seconds, which isn't optimal, but is a lot better.
Yes, it is certainly much better. However, primecount is instantaneous even at 10^14, which is really impressive. I am wondering if there is interest in an implementation of the Meissel-Lehmer method.
https://www.ams.org/journals/mcom/1996-65-213/S0025-5718-96-00674-6/S0025-5718-96-00674-6.pdf
Is having best-in-class algorithms within the scope of this project?
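For anyone skimming the paper: the whole Meissel-Lehmer family builds on Legendre's identity pi(x) = phi(x, a) + a - 1 with a = pi(sqrt(x)), where phi(x, a) counts the integers in [1, x] with no prime factor among the first a primes. A naive, purely illustrative sketch of that base case (nowhere near the paper's algorithm, which adds the P2 correction term, tabulated phi values and much more) could look like this:

using Primes

# phi(x, a): how many integers in 1:x have no prime factor among the first a
# primes ps[1:a]. Legendre's recurrence splits on divisibility by ps[a].
function phi(x::Int, a::Int, ps::Vector{Int})
    a == 0 && return x
    x == 0 && return 0
    x <= ps[a] && return 1                  # only 1 survives when x <= p_a
    return phi(x, a - 1, ps) - phi(x ÷ ps[a], a - 1, ps)
end

# Legendre's formula: pi(x) = phi(x, a) + a - 1 with a = pi(isqrt(x)).
function legendre_pi(x::Int)
    x < 2 && return 0
    ps = primes(isqrt(x))
    a = length(ps)
    return phi(x, a, ps) + a - 1
end

# legendre_pi(10^6) == 78498

This naive recursion does far more work than the paper's algorithm; Meissel, Lehmer and later authors reduce the number of phi terms dramatically.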
I'd consider it in scope (although finding a reviewer with sufficient knowledge to approve it might be hard). Also, it seems a little premature to me given that we don't even have a segmented prime sieve yet (or efficient primality checking).
It is true that the algorithm needs a segmented sieve and it is a priority. I could work on that.
Is there a specification or a potential interface for a segmented sieve? Also, it seems that the author of the following project
https://github.com/haampie/FastPrimeSieve.jl
is considering pushing it to Primes.jl but is currently lacking an interface. In general, what would be the plan for a segmented prime sieve?
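To make the interface question concrete, here is a purely hypothetical sketch (neither Primes.jl nor FastPrimeSieve.jl defines this function): a higher-order function that hands each sieved block to a callback, so that nthprime, prime counting and "collect all primes" could share one kernel. Primes.primes(lo, hi) stands in for the real segmented sieve internals:

using Primes

function foreach_prime_segment(f, lo::Int, hi::Int; segment_size::Int = 2^17)
    seg_lo = max(lo, 2)
    while seg_lo <= hi
        seg_hi = min(seg_lo + segment_size - 1, hi)
        f(primes(seg_lo, seg_hi))        # primes in [seg_lo, seg_hi]
        seg_lo = seg_hi + 1
    end
    return nothing
end

# Example consumer: count primes up to 10^8 without storing them all.
# c = Ref(0)
# foreach_prime_segment(ps -> c[] += length(ps), 2, 10^8)
# c[] == 5761455

An iterator over segments would be another natural shape for the same API.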
@haampie would it be reasonable to make Primes.jl depend on FastPrimeSieve?
If I might, could I ask again what the objection would be to implementing an API to the primecount library? It is very actively maintained (>4000 commits in 5 days) and it is faster than Mathematica itself. What are the chances that a good implementation in pure Julia will be within an order of magnitude of this?
Honestly, I think it is probably too big of a dependency, but if you want to use this in Julia, https://github.com/JuliaBinaryWrappers/primecount_jll.jl already exists as a wrapper library. I don't think there is any technical reason why a Julia version would be slower, although it would require a significant amount of effort. One potentially interesting approach would be a GPU-based segmented sieve (e.g. https://github.com/curtisseizert/CUDASieve).
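For reference, a rough sketch of what using that wrapper could look like. This assumes the JLL exports a libprimecount library product and that primecount's C API provides int64_t primecount_nth_prime(int64_t n); both names should be checked against the shipped primecount.h before relying on them:

using primecount_jll

# Assumed C entry point from primecount.h: int64_t primecount_nth_prime(int64_t n)
nth_prime_pc(n::Integer) =
    ccall((:primecount_nth_prime, libprimecount), Int64, (Int64,), n)

# nth_prime_pc(10^7)   # expected to return 179424673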
Notice this PR: #87. But I did not add many tests :( so that's probably why it was not merged.
I'm also afraid that the code is too hard to follow, so if there's ever an issue with it, I'm the one who will have to fix it, and I may not have the time for that.
I think it's possible to hit a better point on the speed/complexity graph. I'll take a shot at simplifying this.
The segmented sieve + wheel concept is very nice though :) and so is the hand-written, macro-based unrolled sieving loop. I vaguely remember, though, that codegen is not optimal compared to primesieve's inner loop (something where LLVM can't be forced to generate a jump table). That may have to be revisited now that Julia uses LLVM 12.
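As a small illustration of the wheel idea mentioned above (generic textbook material, not FastPrimeSieve's code): with a mod-30 wheel only the eight residues coprime to 2*3*5 are ever touched, so multiples of 2, 3 and 5 are skipped for free:

# Residues mod 30 coprime to 2*3*5, and the gaps between consecutive
# candidates (the last gap, 2, wraps from 29 to 31, which is 1 mod 30).
const WHEEL30_RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)
const WHEEL30_GAPS = (6, 4, 2, 4, 2, 4, 6, 2)

# Enumerate the wheel-30 candidates up to n; every prime > 5 is among them.
function wheel30_candidates(n::Int)
    out = Int[]
    k, i = 7, 2        # start at 7; its gap to the next candidate is WHEEL30_GAPS[2]
    while k <= n
        push!(out, k)
        k += WHEEL30_GAPS[i]
        i = i == 8 ? 1 : i + 1
    end
    return out
end

# wheel30_candidates(50) == [7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 49]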
Do you have a copy of the non-macro-based version? IMO, that would already be a major simplification.
If you replace the first line with function segmented_sieve(limit::Int64, L1D_CACHE_SIZE=128*1024), it's 70x faster. Your version has L1D_CACHE_SIZE as a non-const global variable.
With that change, I'm seeing this code as 5-10x slower than FastPrimeSieve, which is pretty good given that this can be significantly optimized.
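(Side note on the Julia performance rule at play: a non-const global has no fixed type, so code reading it inside the hot loop cannot be type-stable. Besides passing the value as an argument, as suggested above, a hypothetical alternative would have been to make the global const:)

# Hypothetical alternative: keep the cache size global but declare it const,
# so the compiler can treat it as a compile-time constant.
const L1D_CACHE_SIZE = 128 * 1024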
I deleted my previous post, because I found a mistake (L1D_CACHE_SIZE must be an argument, not a global)
I translated a C++ implementation of a simple segmented prime sieve from primesieve.org into Julia:
https://github.com/kimwalisch/primesieve/wiki/Segmented-sieve-of-Eratosthenes
The translation is not very thorough, but the results are encouraging: it is 70% slower than the C++ version. I am wondering if this can be optimized to reach the speed of the C++ version. I guess the question is: does it make sense to implement something complex from scratch only to find out that Julia gives you a 70% penalty?
function segmented_sieve(limit::Int64, L1D_CACHE_SIZE = 128*1024)
    sqrt_limit = isqrt(limit)
    # Keep the segment size even so that, with 1-based indexing, the offsets
    # n - low of odd candidates always land inside the sieve array below.
    segment_size = max(sqrt_limit, L1D_CACHE_SIZE)
    segment_size += isodd(segment_size)

    count = (limit < 2) ? 0 : 1          # counts the prime 2 up front
    i = 3                                # sieving index into is_prime
    n = 3                                # next odd candidate to count
    s = 3                                # next odd number to test as a sieving prime

    sieve = zeros(Bool, segment_size)    # sieve[k] represents the number low + k
    is_prime = ones(Bool, sqrt_limit)    # small sieve of the primes up to sqrt(limit)
    primes = Vector{Int64}()             # sieving primes found so far
    multiples = Vector{Int64}()          # next multiple of each sieving prime, as a segment offset

    for low in 0:segment_size:limit
        fill!(sieve, true)
        # The current segment covers [low, high].
        high = min(limit, low + segment_size - 1)

        # Extend the small sieve up to sqrt(high).
        while i*i <= high
            if is_prime[i]
                is_prime[i*i : i : sqrt_limit] .= false
            end
            i += 2
        end

        # Collect new sieving primes and their first multiple in this segment.
        while s*s <= high
            if is_prime[s]
                push!(primes, s)
                push!(multiples, s*s - low)
            end
            s += 2
        end

        # Cross off multiples of each sieving prime inside the segment.
        for a in eachindex(primes)
            j = multiples[a]
            k = 2*primes[a]
            while j < segment_size
                sieve[j] = false
                j += k
            end
            multiples[a] = j - segment_size   # carry the offset into the next segment
        end

        # Count the odd numbers that survived.
        while n <= high
            if sieve[n - low]
                count += 1
            end
            n += 2
        end
    end

    println("Found ", count, " primes")
    return count
end
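As a quick sanity check on the translation (the expected count comes from standard prime-counting tables, not from this thread), pi(10^8) = 5761455, so

segmented_sieve(10^8)    # should print "Found 5761455 primes"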
Adding @inbounds to the outer loop (@inbounds for low in 0:segment_size:limit) makes this about 2x faster as a simple optimization.
It is true that it is a bit faster, but not 2 times. However, I am now within 50% of the C++ version, which is encouraging. I guess if I can get within 10%, then it is a deal.
What number are you testing up to? I saw a reduction from 15 to 8 seconds for segmented_sieve(10^10).