libtom/libtommath

tfm/ltm relation

minad opened this issue · 16 comments

minad commented

I think this has been discussed before. I've seen tfm also considers adding multi precision ints.

Another approach I find interesting is the one gmp takes where ones high-level and low-level functions (which take pre allocated digits as arguments).

Is there some plan on how things will evolve in the future?

minad commented

Some questions:

  • Is there some list of software which uses tfm like you made for ltm?
  • How does code quality compare, maybe tfm has some improvements due to lessons learned.
  • Testing/test coverage
  • Maybe it is possible to just import a few of the fast functions into ltm?
  • Tfm contains asm in contrast to ltm?

Is there some plan on how things will evolve in the future?

No, there's no real plan.

The original idea was to merge ltm into tfm, then reality happened and now ltm has evolved a lot and we should probably think about merging tfm's optimizations into ltm.

Is there some list of software which uses tfm like you made for ltm?

No, there's no list. The ones that come immediately to my mind are ClamAV and libtomcrypt

minad commented

@sjaeckel I updated the list of questions I am having.

  • How does code quality compare, maybe tfm has some improvements due to lessons learned.

I'd say it's similar

  • Testing/test coverage

not that good

  • Maybe it is possible to just import a few of the fast functions into ltm?

definitely possible

  • Tfm contains asm in contrast to ltm?

yes, that's where the "fast" comes from ... but it's not maintained ... e.g. ClamAV uses the plain C version of tfm w/o asm

minad commented

So the algorithms and representation are the same (60 bits of 64 bits etc)? Since I think this is sth where gmp differs for example. They work with the full width afaik.
So the plain c part should be mostly equivalent to ltm with the difference of not allocating?

So the algorithms

tfm has less operations & algorithms implemented

and representation are the same (60 bits of 64 bits etc)?

No, tfm uses the full width.

So the plain c part should be mostly equivalent to ltm with the difference of not allocating?

operation-wise it was equivalent up to some point where ltm started to get more people working on it.

minad commented

@sjaeckel Regarding #420 I looked a bit into the comba multiplier, since this is the one restricting ltm to 60 bits. It seems to me that the restriction is possible to lift as is done in fp_mul_comba. Instead of _W, there three fp_digits c0, c1, c2 are used.

So if one wants to import some of the fast functions to ltm one should at first lift the MP_DIGIT_BIT restriction. Basically the fast functions are comba loop unrolling + primitives in assembly.

Ping @czurnieden

The original idea was to merge ltm into tfm, then reality happened and now ltm has evolved a lot and we should probably think about merging tfm's optimizations into ltm.

It would be great if there are optimizations that can be merged into ltm. Do you think any significant speedups are possible?

minad commented

It would be great if there are optimizations that can be merged into ltm. Do you think any significant speedups are possible?

As far as I see it there are speedups of maybe factor 2 for multiplication etc, since tfm uses some unrolled implementations in that case. Look here for some encryption benchmark graphs #447.

However merging the libraries is not trivial since they are relying on different integer representations. I did some preparatory experiments to unify the representations in #447 but I have given up on that for now.

Do you have specific performance issues with the library as of now? Unrelated question, do you use an unboxed representation for small integers in your scheme implementation?

No specific performance issues, but Cyclone performs poorly on the bigint benchmarks pi and chudnovsky. The latest ltm release did help a bit in this regard.

Overall I like ltm and your licensing terms and prefer to keep using this library, especially if there will be more performance enhancements going forward. I understand it is unlikely to ever be as fast as gmp, though.

Yes, unboxed integers are used when the values are small enough.

minad commented

@justinethier I wonder what other scheme implementations use if they perform better. GMP or some custom implementation? If they are using custom implementations it would be interesting to know, since we might be able to improve some things here. But competing with GMP is out of the question.

Apparently Gambit and Chicken use custom bignum implementations.

There is a series of articles on the new Chicken 5 implementation that may be of interest:
https://www.more-magic.net/posts/numeric-tower-part-2.html

Not sure how helpful this would be:
https://github.com/gambit/gambit/blob/c65e6b33d2d81e7abc4df35c6d49ecce52d7930f/lib/_num.scm#L5546

minad commented

Interesting. Also interesting that gambit/gerbil performs so well. The last time I looked, chez was still best. And cyclone is quickly improving 👍. Maybe @czurnieden would like to take a look at the bignum code, he usually likes fiddling with benchmarks etc.

This is off-topic here, but is it right that cyclone/chicken use cheney on the mta and gambit uses a separate scheme stack? And chez has their own codegen. It would be nice to have a comparison table of the different implementations. I am also interested in the gc, is there a scheme without stop-the-world pauses? For example on the cyclone readme, you write that the major heap can be collected without pauses (probably concurrent mark and sweep?). But you still need stop the world pauses to promote objects from the minor generation to the major generation? Or do you have something like local heaps (e.g., like Doligez/Leroy GC)? @justinethier please let me know if there is a more appropriate forum to ask such questions.

@minad Regarding OT - That is correct for Chicken/Cyclone; I am not sure about Gambit. There is an overview of the Cyclone GC here and here. There is no stop-the-world for major collections. Minor collections are limited to the current thread so yes, that thread is stopped, but other application threads continue to run. Anyway... I need to setup a slack or IRC channel for this sort of discussion. If you would like just create an issue in Cyclone or email me (see my profile) and we can discuss further.

Is there some list of software which uses tfm like you made for ltm?

No, there's no list. The ones that come immediately to my mind are ClamAV and libtomcrypt

You can add 8th as a user of TFM. But since I want unlimited integers, I may switch to LTM... on the other hand, the bigint speed is impressive with TFM.

The entire reason why LTM uses 60-bits per digit (or 28 on 32-bit platforms) is because in order to "multiply and accumulate" inside the mul/sqr code without access to carry bits you need to emulate them in C.

The entire point of TFM was that we would use assembly to access carries (also it was focused more on crypto tasks where "varying precision" isn't that useful normally).

If you make LTM use 64-bit digits then you lose the ability to MAC fast in C. So ya you end up with fewer muls but you end up with more bit twiddling.