smuellerDD/jitterentropy-library

rng-tools + jitterentropy always hangs when the internal timer is used


On some boards with a coarse timer, the library falls back to the internal timer. But that makes rngd hang indefinitely (rngd -f -x hwrng -x rdrand). We can reproduce this issue on any board by forcing use of the internal timer as below:

diff --git a/jitterentropy-base.c b/jitterentropy-base.c
index 6dbb484..07397cb 100644
--- a/jitterentropy-base.c
+++ b/jitterentropy-base.c
@@ -1387,7 +1387,8 @@ int jent_entropy_init(void)
        if (sha3_tester())
                return EHASH;
 
-       ret = jent_time_entropy_init(0);
+       //ret = jent_time_entropy_init(0);
+       ret = 1;
 
 #ifdef JENT_CONF_ENABLE_INTERNAL_TIMER
        jent_force_internal_timer = 0;
-- 
2.17.1
# cat /proc/cpuinfo | grep processor
processor	: 0
processor	: 1
processor	: 2
processor	: 3
processor	: 4
processor	: 5
processor	: 6
processor	: 7

@smuellerDD
Yes, the hang on boards with more than one CPU is gone after applying the patch nhorman/rng-tools@4f79ad1 as mentioned in nhorman/rng-tools#123. I also look forward to your guidance on a better way than nhorman/rng-tools@4f79ad1 to address the hang on a system with more than one CPU.

BTW, on a system with only one CPU, the patch nhorman/rng-tools@4f79ad1 doesn't help. @nhorman

On Friday, 30.04.2021 at 01:00 -0700, Mingli-Yu wrote:

> Yes, the hang on boards with more than one CPU is gone after applying the patch nhorman/rng-tools@4f79ad1 as mentioned in nhorman/rng-tools#123.

Ah, good, so the internal noise source still works as intended.

> But on a system with only one CPU, the patch nhorman/rng-tools@4f79ad1 doesn't work.

Yes, that is an issue I am currently working on. Even the separate jitterentropy-rngd application sometimes suffers from it on systems with one CPU. For immediate use, I can only recommend disabling the internal timer on such boards. But that means the Jitter RNG will not work due to the coarse timer.

Thanks very much for your information! Looking forward to a solution for the Jitter RNG on boards with a coarse timer.

@smuellerDD what if you modified the library such that, instead of providing an option to simply enable a timer, you provided a method to register a software timer thread? Such a method would only be usable if jitter_entropy_init returned ECOARSETIME, and it would allow the application to create, manage, and define the timer thread's attributes (like affinity) fairly easily.

I'd be OK with that approach. The only note I would add is that we need a way to detect whether an external timer function is warranted, i.e. whether jitter_entropy_init fails with ECOARSETIME or some such.

Neil, can you please have a look at https://github.com/smuellerDD/jitterentropy-library/tree/external_threading

Effectively you need to implement the functions in struct jent_notime_thread and register them with jent_entropy_switch_notime_impl as the first call.

I have updated the internal threading code to use the same logic. Thus the internal threading implementation defined with jent_notime_thread_builtin should serve as a guide.
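
For illustration, here is a minimal sketch of such a registration, assuming the callback set mirrors the builtin pthread implementation (the exact member names and signatures should be checked against jitterentropy.h on the external_threading branch; everything below named my_* is hypothetical application code):

#include <stdlib.h>
#include <pthread.h>
#include <jitterentropy.h>

/* Allocate the per-context thread handle. */
static int my_notime_init(void **ctx)
{
        pthread_t *thread = calloc(1, sizeof(pthread_t));

        if (!thread)
                return 1;
        *ctx = thread;
        return 0;
}

static void my_notime_fini(void *ctx)
{
        free(ctx);
}

/* The application controls the thread here; attributes such as CPU
 * affinity could be set via a pthread_attr_t before creation. */
static int my_notime_start(void *ctx, void *(*start_routine)(void *),
                           void *arg)
{
        return pthread_create((pthread_t *)ctx, NULL, start_routine, arg);
}

static void my_notime_stop(void *ctx)
{
        pthread_join(*((pthread_t *)ctx), NULL);
}

static struct jent_notime_thread my_notime_thread = {
        .jent_notime_init  = my_notime_init,
        .jent_notime_fini  = my_notime_fini,
        .jent_notime_start = my_notime_start,
        .jent_notime_stop  = my_notime_stop
};

/* Must be the first call into the library, before jent_entropy_init(). */
int my_register_timer_thread(void)
{
        return jent_entropy_switch_notime_impl(&my_notime_thread);
}

With something like this in place, the library would use the application-provided thread as the soft timer whenever the hardware timer is too coarse.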

Ok, for reference I just pushed this for testing:
https://github.com/nhorman/rng-tools/tree/notime

It makes use of the new timer registration work @smuellerDD has done here. I don't have a system with a coarse enough timer to test it thoroughly, but @Mingli-Yu, if you could test it, I would appreciate it. Note that building my branch requires that libjitterentropy be updated such that -fvisibility=hidden is removed from the CFLAGS, as mentioned above.

Done, I've updated my branch to allow for the use of your referenced flag, and exported it as an option to the jitter source. I've done some testing with that, and it seems to work reasonably well. One thing I noticed, which I hadn't considered previously: the API you export seems to assume that we will use a separate timer thread per entropy context. Since rngd allocates an entropy context per CPU, that's a difficult model to work with, as it implies the creation of an equal number of timer threads, which amplifies the thread contention issue and continues to drive the hang condition. Currently I work around that by limiting rngd's thread count to 1 when using soft timers, but I wonder if it would be worth altering the model to allow for the creation of a single thread and a reference count that tracks the number of timer starts and stops, so that we can use a single thread for all soft timer functions. Thoughts?

I'm not sure why we have to force a 1:1 relationship between RNG threads and timer threads. All the timer thread is doing is simulating a free-running counter. Multiple RNG threads can use a single timer thread to do that if they simply read the timer thread's counter when they want to 'start' a timer, and again when they 'stop' it, then compute the delta. That seems like it would be much more efficient to me.
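
As a rough sketch of that shared-counter model (hypothetical application-level code, not what the library currently implements), assuming C11 atomics:

#include <stdatomic.h>
#include <stdint.h>

/* One free-running counter shared by all RNG threads. */
static _Atomic uint64_t soft_counter;
static _Atomic int soft_counter_stop;

/* The single timer thread does nothing but increment the counter. */
static void *soft_counter_thread(void *unused)
{
        (void)unused;
        while (!atomic_load_explicit(&soft_counter_stop,
                                     memory_order_relaxed))
                atomic_fetch_add_explicit(&soft_counter, 1,
                                          memory_order_relaxed);
        return NULL;
}

/* Any RNG thread "starts" a timer by sampling the counter ... */
static uint64_t soft_timer_start(void)
{
        return atomic_load_explicit(&soft_counter, memory_order_relaxed);
}

/* ... and "stops" it by sampling again and taking the delta. */
static uint64_t soft_timer_delta(uint64_t start)
{
        return atomic_load_explicit(&soft_counter, memory_order_relaxed)
               - start;
}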

I figured out the jent_notime_init trick.

I concur that on a single-core system, there's just no way this mechanism is going to work, at least not without additional cooperation. The threads will contend with each other and one will wind up blocking the other. We could perhaps finagle some yielding mechanism, but I think it's probably best to just fail the init for now.

@Mingli-Yu any test result confirmation?

@nhorman @smuellerDD
Thanks very much for your update! It works on systems with more than one CPU.

So when will https://github.com/nhorman/rng-tools/tree/notime and https://github.com/smuellerDD/jitterentropy-library/tree/external_threading be merged to the master branches? And which release version will include these changes?

I just merged my changes. I've got a few features that could benefit from a tagged release, but why can't you just build from the head of the git tree?

@nhorman
Thanks for your update! I can build from the head for testing, but it's better to have it in a release version; I can see it's included in 6.13 now, thanks very much!

@smuellerDD
BTW, it seems the change for jitterentropy-library hasn't been merged yet; waiting for the update!

Hello, Stephan,

I've got a report that rng-tools-6.13 (with the notime patchset) linked against the jitterentropy-3.0.2 (a0c51e2, tag: v3.0.2) library consumes 100% CPU on a single-core older AMD Athlon 64 CPU. I've got a process core dump, and it looks like the problem is the one discussed in this issue #37. Could you please have a look at the thread stacks and confirm whether I'm correct or whether we're seeing a separate new issue?

Stacks are:

(gdb) info threads
  Id   Target Id                        Frame 
  1    Thread 0x7f65f3481840 (LWP 7401) __libc_read (nbytes=16, buf=0x55da04aadd00 <key.lto_priv>, fd=4)
    at ../sysdeps/unix/sysv/linux/read.c:26
^^^^^ sleeps on read()

* 2    Thread 0x7f65f3480640 (LWP 7405) 0x00007f65f4105f80 in jent_get_nstime_internal (ec=0x55da0557bed0, 
    out=0x7f65f347fb98) at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:760

  3    Thread 0x7f65f2c7f640 (LWP 7406) jent_notime_sample_timer (arg=0x55da0557bed0)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:702
^^^^^ these two

(gdb) t 2
[Switching to thread 2 (Thread 0x7f65f3480640 (LWP 7405))]
#0  0x00007f65f4105f80 in jent_get_nstime_internal (ec=0x55da0557bed0, out=0x7f65f347fb98)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:760
760			while (ec->notime_timer == ec->notime_prev_timer)

(gdb) bt
#0  0x00007f65f4105f80 in jent_get_nstime_internal (ec=0x55da0557bed0, out=0x7f65f347fb98)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:760
#1  0x00007f65f410634d in jent_measure_jitter (ec=0x55da0557bed0, loop_cnt=0, ret_current_delta=0x0)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:1036
#2  0x00007f65f4106410 in jent_random_data (ec=0x55da0557bed0)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:1068
#3  0x00007f65f410649e in jent_read_entropy (ec=0x55da0557bed0, 
    data=0x7f65ec000b60 "UV\376\213˷Y/\236\346o\231\352\205\r\aױC-\220", len=16375)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:1124

(gdb) t 3
[Switching to thread 3 (Thread 0x7f65f2c7f640 (LWP 7406))]
#0  jent_notime_sample_timer (arg=0x55da0557bed0)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:702
702			if (ec->notime_interrupt)

(gdb) bt 
#0  jent_notime_sample_timer (arg=0x55da0557bed0)
    at /usr/src/debug/jitterentropy-3.0.2-1.fc34.x86_64/jitterentropy-base.c:702
#1  0x00007f65f3d26299 in start_thread (arg=0x7f65f2c7f640) at pthread_create.c:481

which are:

static inline void jent_get_nstime_internal(struct rand_data *ec, uint64_t *out)
{
        if (ec->enable_notime) {
                while (ec->notime_timer == ec->notime_prev_timer) // <<< LOOPS HERE
                        ;
        ...

static void *jent_notime_sample_timer(void *arg)
{
        struct rand_data *ec = (struct rand_data *)arg;
        ec->notime_timer = 0;

        while (1) {
                if (ec->notime_interrupt) // <<< LOOPS HERE
                        return NULL;
                ec->notime_timer++;
        }
        ...

In case the core itself is needed, please see the public bz https://bugzilla.redhat.com/show_bug.cgi?id=1974132. The core is attached there:

requested core dump for faulty jitterentropy-3.0.2-1.fc34 and rng-tools-6.13-2.fc34 (762.93 KB, application/gzip)

Admittedly, I don't have deep knowledge of the jitterentropy lib, but to me the above looks much like this issue #34.

Thank you and best regards,
Vladis

That's... odd. I agree that the issue looks similar, but I think it's subtly different. Looking at the bz, it appears this system has a single CPU and is (possibly) returning an error from jitterentropy during init indicating its timer is too coarse, and as such is starting the software timer. That much is identical to this problem, but the initial report here was on a system with multiple cores. In a multi-core situation a soft timer is helpful because the timer thread can run in parallel to the rngd thread, allowing progress to be made; in this case that's impossible (due to there being only a single core). Here jitterentropy probably has no chance of working: the hardware counter is too slow to provide a reasonable timer base, and the software timer will just interfere with normal operation. I think, unfortunately, that jitter, in this case, should just fail to initialize. Let me create a branch to test that use case for you.

https://github.com/nhorman/rng-tools/tree/onecpu

Not sure if that's the final solution (and it's untested), but it should add code that causes jitterentropy to fail if the hw counter is too coarse and there's only a single CPU to execute on. It won't fully solve the reporter's issue, but it will prevent jitter from running on a system in this pessimal case and prevent CPU saturation. They'll likely need to find another source to gather entropy from (maybe the rdrand source, assuming their CPU isn't affected by the AMD rdrand issue).

Hello, @smuellerDD, @nhorman,

Thanks a ton for the help here. I believe "the software timer will just interfere with normal operation ... jitter, in this case, should just fail to initialize" is indeed the correct approach. I still have a number of small concerns; can we please discuss them?

  1. It is just a little strange to me that version 2.2.0 of the lib was working on the reporter's old single-CPU system, since 4b062d0 was only added in 3.0.0. Do I understand correctly that the lib was actually not working properly on this system?

  2. Thanks for the onecpu patch. I will add it and build. A note, though: shouldn't it be sysconf(_SC_NPROCESSORS_ONLN) in rngd_notime_init(), since we only care about the CPUs we can schedule on?

  3. I believe this check should also be added to the jitterentropy lib itself. The lib should not do this check if an app registers its own timer thread via {init,start,stop}_func(); in that case, yes, the app should be responsible for the check. But when the default (the lib's own) timer function is used, the lib should do the check. For now rng-tools is the only user in Fedora, but other apps may later want to use the lib without providing an app-side timer thread. @smuellerDD, I would be happy to hear your thoughts about this suggestion.

Thank you.

@nefigtut in response to your points above:

  1. This is a very interesting datapoint, one that I didn't recognize before. While it doesn't confirm it, it suggests that this system has a hardware timer that is fine-grained enough not to need a software timer (which would be my expectation). That in turn suggests we are trying to initialize the software timer when we don't need to. It would be worth investigating what the return code from jent_time_entropy_init(0) in jent_entropy_init is. We attempt to start the software timer infrastructure there if that call fails, and if it used to work, it shouldn't be failing now.

  2. Yes, you're right. I borrowed that code from the init function, and it likely should be changed in both places. Please feel free to do so.

  3. No strong opinion on this in either direction. We can use the patch I have for testing purposes, and decide on the correct solution when we fully understand the problem, IMO.

The threading code has been merged into the master branch.

I have added a helper that detects the number of available CPUs. Only when 2 or more CPUs are available is the internal timer thread started.
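
A minimal sketch of such a gate, assuming POSIX sysconf(); the helper name and the error handling below are illustrative, not the library's actual symbols:

#include <unistd.h>

static long my_ncpu(void)
{
        /* _SC_NPROCESSORS_ONLN counts only the CPUs we can currently
         * be scheduled on, per the suggestion earlier in this thread. */
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

        return (ncpu <= 0) ? 1 : ncpu;
}

static int my_start_internal_timer_thread(void)
{
        /* On a single CPU the sampling thread and the RNG thread would
         * starve each other, so refuse to start the soft timer. */
        if (my_ncpu() < 2)
                return 1;

        /* ... create and start the timer thread as before ... */
        return 0;
}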

With this change I think the discussed issues should be resolved. I am closing this issue - if there are still issues lurking, please feel free to reopen or open a new issue.

Thanks to all for support.

I do not intend to re-open this issue; I just want to thank Stephan. Thank you, Stephan, for your responses. I'll build jitterlib-v3 and newer rng-tools for Fedora and ask the people reporting issues if they are willing to test. Thank you.

@smuellerDD - Would you consider tagging a 3.0.3 release now that this issue has been fixed? It will make it easier for major distros to rebuild.

@smuellerDD I would like to request accepting one more fix first, if possible. Unfortunately, there is one more timer-related issue. I'll post the PR on Monday if the issue reporter replies quickly, or I'll create a GitHub issue if he does not (in that case I cannot suggest proper fixing code, only describe the issue). On the other hand, this is quite a corner case, so it may very well wait for 3.0.4/.5.

I am going to tag 3.1.0 shortly. All changes I wanted to have in are now in.

@smuellerDD - Ping :D