number_of_partitions returns wrong results under the Windows Subsystem for Linux

Question

number_of_partitions returns wrong results under the Windows Subsystem for Linux

soehms opened this issue 6 years ago · 28 comments

Currently, installing Sage under WSL 2 (Ubuntu 18.04) produces 8 doctests failures in sage.combinat.partitions.number_of_partitions for example:

File "src/sage/combinat/partitions.pyx", line 72, in sage.combinat.partitions.number_of_partitions
Failed example:
    number_of_partitions( n - (n % 385) + 369) % 385 == 0
Expected:
    True
Got:
    False
...
File "src/sage/combinat/partitions.pyx", line 95, in sage.combinat.partitions.number_of_partitions
Failed example:
    len([n for n in [1..500] if number_of_partitions(n) != Partitions(n).cardinality(algorithm='pari')])
Expected:
    0
Got:
    21

In more detail:

Failures for 9.1.rc0 and 9.0 (installed from binaries):

sage: [(i, number_of_partitions(i, algorithm='bober') - number_of_partitions(i)) for i in range(17000) if number_of_partitions(i, algorithm='bober') != number_of_partitions(i)]
[(241, 1),
 (243, 1),
 (244, 1),
 (245, 1),
 (246, 1),
 (248, 1),
 (250, 1),
 (252, 1),
 (253, 1),
 (256, 2),
 (258, 2),
 (259, 3),
 (262, 1),
 (263, 1),
 (264, 3),
 (266, 4),
 (267, 4),
 (268, 1),
 (269, 5),
 (270, -1),
 (271, -1),
 (9894, -1),
 (9908, -1),
 (16564, -1),
 (16573, -1),
 (16672, -1),
 (16771, -1),
 (16789, -1),
 (16798, -1),
 (16802, -1),
 (16816, -1),
 (16820, -1),
 (16825, -1),
 (16914, -1),
 (16942, -1),
 (16946, -1),
 (16951, -1),
 (16969, -1)]

Failures for 8.1 (installed using apt):

sage: [(i, number_of_partitions(i, algorithm='bober') - number_of_partitions(i)) for i in range(17000) if number_of_partitions(i, algorithm='bober') != number_of_partitions(i)]
[(239, 1),
 (242, 1),
 (248, 2),
 (249, 1),
 (250, 1),
 (251, 2),
 (252, 1),
 (253, 1),
 (256, -1),
 (258, 2),
 (259, 3),
 (260, 3),
 (261, 4),
 (262, -3),
 (263, -4),
 (264, 4),
 (265, 6),
 (266, -2),
 (267, 4),
 (268, 1),
 (269, 5),
 (270, -1),
 (271, 7),
 (9894, -1),
 (9908, -1),
 (16564, -1),
 (16573, -1),
 (16672, -1),
 (16771, -1),
 (16789, -1),
 (16798, -1),
 (16802, -1),
 (16816, -1),
 (16820, -1),
 (16825, -1),
 (16914, -1),
 (16942, -1),
 (16946, -1),
 (16951, -1),
 (16969, -1)]

See also discussions on sage-devel and ticket #28549.

CC: @sagetrac-tmonteil @tscrim @mezzarobba @jdemeyer @dimpase @fredrik-johansson @sagetrac-bober

Component: combinatorics

Keywords: number of partitions WSL

Author: Thierry Monteil

Branch/Commit: 2c37f3d

Reviewer: Matthias Koeppe

Issue created by migration from https://trac.sagemath.org/ticket/29600

Answer 1 · 2020-04-27T19:37:24.000Z

comment:1

This might be related to: microsoft/WSL#830 (long double floating point calculations give inexact results)

Answer 2 · 2020-04-27T23:26:42.000Z

comment:3

The ./src/sage/combinat/partitions_c.cc piece of C code was added into Sage in 2007. I guess it was invented because there was no fast enough existing wheel for that: the doctest of number_of_partitions function says it is *very* fast. Now, there is flint, which is still maintained and much faster:

sage: from sage.misc.sage_timeit import sage_timeit
sage: tf = lambda i : sage_timeit('_=number_of_partitions({})'.format(i), globals() , preparse=True, number=1, seconds=True)
sage: tb = lambda i : sage_timeit('_=number_of_partitions({}, algorithm="bober")'.format(i), globals() , preparse=True, number=1, seconds=True)
sage: [(i,tb(10^i)/tf(10^i)) for i in range(9)]
[(0, 1.076654167643153),
 (1, 51.60920756939386),
 (2, 51.13886939147276),
 (3, 122.66713137190546),
 (4, 471.61390056849905),
 (5, 2741.5143212699077),
 (6, 18715.222483310186),
 (7, 170662.0538251338),
 (8, 1930629.845842312)]

So, let me propose a backtrack version of Sage's motto: "To maintain the bike, forget your own old wheels, replace them with subsequently invented ones".

Why not just remove that piece of code ?

Note that arb and flint implementations are inspired by Bober Sage implementation, see https://arxiv.org/abs/1205.5991

Answer 3 · 2020-04-27T23:47:13.000Z

comment:4

+1

Answer 4 · 2020-04-27T23:53:17.000Z

comment:5

If everyone agree, I can do the removal (let me CC people involved in #24667).

Answer 5 · 2020-04-27T23:58:17.000Z

comment:6

-1

I think it is good to keep it for testing purposes. In this case, I am pretty sure it is exposing an underlying bug rather than an issue with this code itself as nearly all of the main computations are outsourced. There might be some issue with a rounding mode not being handled correctly (a la point 9 in #24667). I think this warrants further investigation rather than sweeping the issue under the rug (or maybe out the door is the better analogy, but you should check the house to make sure things are getting in some other way).

Answer 6 · 2020-04-28T00:21:11.000Z

comment:7

The bug is clearly in the ad-hoc float precision code in that file! It has ifdefs for the behavior of long double.

Answer 7 · 2020-04-28T01:23:57.000Z

comment:8

It doesn't appear to be ad-hoc but due to architecture differences and the code assuming a particular behavior. So if that is indeed the problem, this could be showing any code within Sage that is using long double. From this, I think it is related to this WSL issue.

cc-ing Fredrik to also get his thoughts on this.

Answer 8 · 2020-04-28T01:58:56.000Z

comment:9

I dont get the argument. I am not suggesting to remove a doctest, but the actual code, so that if there is a bug in that code, well, it will disapear as well. If the proposal is to keep a test for flint implementation (besides self-tests), we should perhaps compare with arb implementation instead, which is proven to be correct (or even Gap implementation, to be sure that it was written by unrelated developers), with explicit doctests. If the issue comes from a known WSL bug, i am not sure that this legacy code should be kept just to show another instance of it.

Answer 9 · 2020-04-28T02:13:58.000Z

comment:10

I won't be surprised if the bug in question also shows up on non-x86(_64) hardware. e.g. on ARM (which lacks that 80-bit extended precision).

Answer 10 · 2020-04-28T02:15:31.000Z

comment:11

+1 for removal of outdated code, which is better served by a maintained external library.

Answer 11 · 2020-04-28T03:22:27.000Z

comment:12

Replying to @tscrim:

... this could be showing any code within Sage that is using long double.

That's a very short list of places.

./libs/mpfr/__init__.pxd:23:    int mpfr_set_ld(mpfr_t rop, long double op, mpfr_rnd_t rnd)
./libs/mpfr/__init__.pxd:42:    int mpfr_init_set_ld(mpfr_t rop, long double op, mpfr_rnd_t rnd)
./libs/mpfr/__init__.pxd:50:    long double mpfr_get_ld(mpfr_t op, mpfr_rnd_t rnd)
./libs/mpfr/__init__.pxd:53:    long double mpfr_get_ld_2exp(long *exp, mpfr_t op, mpfr_rnd_t rnd)
./libs/mpfr/__init__.pxd:120:    int mpfr_cmp_ld(mpfr_t op1, long double op2)
./libs/mpc/__init__.pxd:27:    int  mpc_pow_ld(mpc_ptr, mpc_srcptr, long double, mpc_rnd_t)
./libs/mpc/__init__.pxd:48:    int  mpc_set_ld(mpc_ptr, long double, mpc_rnd_t)
./libs/mpc/__init__.pxd:49:    int  mpc_set_ld_ld(mpc_ptr, long double, long double, mpc_rnd_t)
./libs/gsl/math.pxd:36:    long double GSL_MAX_LDBL(long double a, long double b)
./libs/gsl/math.pxd:37:    long double GSL_MIN_LDBL(long double a, long double b)
./matrix/matrix_integer_dense.pyx:142:                'ld': 'long double',
./matrix/matrix_integer_dense.pyx:2810:            - ``'ld'`` -- long doubles (fpLLL only)
./matrix/matrix_integer_dense.pyx:3039:          - ``'ld'`` -- long doubles (fpLLL only)

Answer 12 · 2020-04-28T08:02:08.000Z

comment:13

I propose removing the 'bober' implementation and supporting algorithm='pari' instead.

Pari's numbpart() was improved in the recent past and it's now twice as fast as Bober's code on my computer.

Answer 13 · 2020-04-28T12:03:25.000Z

comment:14

I don't think it is fair to call it outdated code. Since Pari's implementation is at least as fast as this (so we can use that to test against each other) and everyone seems intent on just sweeping the problem away, then go ahead and remove it.

Answer 14 · 2020-04-28T20:37:29.000Z

comment:15

Removing sounds hard in my ears, as well. Wouldn't it be enough to print a warning on the first invocation of a functionality that uses long double? That could be done once per session which is running under WSL or some other platforms having similar problems.

Indeed, there are not many of them:

function number_of_partitions with option algorithm='bober'
method BKZ of dense integer matrices with option fp='ld'
method LLL of dense integer matrices with option fp='ld'

Such a warning should refer to the WSL issue 830 and make clear that wrong results are possible.

Answer 15 · 2020-04-29T12:04:13.000Z

comment:16

I don't see the point of maintaining obsolete code in Sage when there are multiple superior implementations in its standard libraries. It's one thing to have a 20-line reference implementation written in Sage, but in this case it's 2000 lines of C++ code that will be a future maintenance burden if it's a maintenance burden right now.

Don't get me wrong, this code was a terrific addition to Sage when Jon wrote it (lots of credit to him), but it has already more than served its purpose by inspiring even better implementations :-)

Of course, if Jon himself thinks the code should stay, I wouldn't argue with him.

Answer 16 · 2020-04-29T14:22:13.000Z

comment:17

Replying to @fredrik-johansson:

I don't see the point of maintaining obsolete code in Sage when there are multiple superior implementations in its standard libraries.

In principal I agree! But that's a more general question which applies to many other cases, as well. I think, it's not necessary to decide this in the context of this ticket. At the moment, we are not maintaining the code. We are maintaining a WSL issue.

What I want to say is: Shall we kill the messenger of the bad news, or shall we do something against the bad news?

Anyway, I would agree if everyone would like to take the occasion to get rid of that code.

Answer 17 · 2020-08-18T15:39:33.000Z

comment:18

Nobody has worked on fixing this code, so let's please remove it for 9.2. Any takers?

Answer 18 · 2020-08-18T16:21:59.000Z

comment:19

I can take that one.

Answer 19 · 2020-08-19T07:26:59.000Z

comment:20

I'm a bit late to this conversation.

In principle I agree that there is not much reason to leave this code in Sage given that it is so flaky and there is just about no good reason to use it.

On the other hand, if you actually want to trust WSL, it may still be worth trying to figure out what the actual problem is. The WSL ticket you point to seems to claim that floating point issues were fixed in WSL2, which is what you specifically mention as failing the tests. If they haven't been completely fixed then I would expect issues elsewhere.

If you get rid of the 4 lines starting at line 922, (long double ld_partial_sum = ...) and change level_two_precision to level_five_precision on line 873 (for (k = 1; current_precision > level_two_precision; k++)), I think you'll get a version that should work and doesn't use long doubles. (Maybe you get the same effect by setting level_two_precision = level_five_precision or long_double_precision = double_precision towards the top of the file, which is what appears to be done on some systems.)

I don't think I ever put in any of those ifdefs, and the ternary operator that checks if LDBL_MANT_DIG == 106 looks quite strange. (Also, the whole thing reads rather strangely if you don't know that once upon a time it used quad doubles and double doubles.)

Answer 20 · 2020-08-19T17:56:58.000Z

Branch: u/tmonteil/number_of_partitions_returns_wrong_results_under_the_windows_subsystem_for_linux

Answer 21 · 2020-08-19T18:27:13.000Z

Commit: 2c37f3d

Answer 22 · 2020-08-19T18:27:13.000Z

comment:22

I uploaded a branch that removes the code and redirects its uses to the faster flint implementation. To be complete, i am thinking of an additional refactor commit on top of that branch that merges the code of sage.combinat.partition.Partition_n.cardinality and sage.combinat.partition.number_of_partitions which are kind of duplicated (with some algorithms missing in the later), and adds some cross-algorithm testing (note that we have flint, pari and gap implementations).

Now, regarding whether we should merge that branch into Sage, i still do not understand how maintaining that code could benefit to the rest of Sage, even with respect to WSL2, and i do not understand how that piece of code could improve our trust of WSL2 with respect to the rest of Sage code. If the only point is to understand some floating-point subtleties of WSL2, then i am pretty sure that maintaining this code within Sage is not the right way (people who want to understand can think out of Sage source tree). If someone sees how this codes illustrates some WSL2 bug, it might be better to report upstream.

The ticket could be re-purposed to "Refactor the computation of the number of partitions".

New commits:

`f566a3e`	`#29600 : remove partitions_c.cc C code`
`2c37f3d`	`#29600 : remove sage.combinat.partitions.number_of_partitions and redirect calls to it`

Answer 23 · 2020-08-19T18:27:13.000Z

Author: Thierry Monteil

Answer 24 · 2020-08-19T18:43:46.000Z

comment:23

Replying to @sagetrac-tmonteil:

I uploaded a branch that removes the code and redirects its uses to the faster flint implementation.

Looks good to me.

To be complete, i am thinking of an additional refactor commit on top of that branch that merges the code of sage.combinat.partition.Partition_n.cardinality and sage.combinat.partition.number_of_partitions which are kind of duplicated (with some algorithms missing in the later), and adds some cross-algorithm testing (note that we have flint, pari and gap implementations).

I would suggest to use a follow-up ticket for this.

Now, regarding whether we should merge that branch into Sage, i still do not understand how maintaining that code could benefit to the rest of Sage [...]

I would think it's definitely out of the scope of the Sage project to investigate hypothetical errors of other upstream libraries on WSL2 because of ABI problems.

Answer 25 · 2020-08-19T18:44:47.000Z

Reviewer: Matthias Koeppe

Answer 26 · 2020-08-19T18:44:47.000Z

comment:25

If the patchbot light shows green, you can set this to positive review on my behalf

Answer 27 · 2020-08-19T23:03:29.000Z

comment:26

The patchbot doctest failure seems unrelated.

Answer 28 · 2020-08-23T17:51:08.000Z

Changed branch from u/tmonteil/number_of_partitions_returns_wrong_results_under_the_windows_subsystem_for_linux to 2c37f3d