hashicorp/vault-examples

Problematic code in renew token go sample?

georgeharley opened this issue · 4 comments

Hi,

I found that the renew token sample code for go did not work as hoped when putting it to use in a project. Instead of providing periodic renewing of the current token it triggered an endless series of Vault logins in a very tight loop.

AKAICT te unwelcome behaviour is caused by the inclusion of a couple of lines in the select block of the manageTokenLifecycle function that cause a return from the function and a new login attempt when a nil value is read from the done channel. I've tried to flag the two lines below with the // ???????? comment.

	for {
		select {
		// `DoneCh` will return if renewal fails, or if the remaining lease duration is
		// under a built-in threshold and either renewing is not extending it or
		// renewing is disabled.  In any case, the caller needs to attempt to log in again.
		case err := <-watcher.DoneCh():
			if err != nil {
				log.Printf("Failed to renew token: %v. Re-attempting login.", err)
				return nil
			}
			log.Printf("Token can no longer be renewed. Re-attempting login.") // ????????
			return nil                                                         // ????????

		// Successfully completed renewal
		case renewal := <-watcher.RenewCh():
			log.Printf("Successfully renewed: %#v", renewal)
		}
	}

The sample code only seems to work as expected when the flagged lines are removed (appreciating that it is the return nil is the real differentiator here).

Interestingly, the source of the lifetime watcher contains a block of example usage code that also ignores nil reads from the done channel.

Thank you for pointing this out! Let me explain what's going on here and see if there's a way we can get this working better for you.

The DoneCh returns an actual error if something unexpected went wrong with renewal, but it will return nil if something occurs that is not necessarily an error, just an expected failure because of something like the max TTL for the token being reached. (Once the max TTL of the token is reached, we cannot renew the token anymore.)

The inclusion of the return nil there is meant to ensure that we exit back to the login function once our token has reached its max TTL, so that it can attempt to reauthenticate from scratch (and not just block the for loop forever).

(The example usage snippet from LifetimeWatcher doesn't include it because it's just showing the code as if the renewal loop were in main or something at the top level. It doesn't show what to actually do with the common scenario of the token reaching max TTL and being unable to be renewed anymore, so that's what I was trying to add with this code sample.)

That said, it sounds like this is all happening too quickly in your situation, which makes it sound like the max TTL of your token is being reached immediately (and thus instantly exits to login again), or renewing is disabled or something.

Do you know what the max TTL on your token is?

@digivava Thank you for coming back so quick on this.

When the problems occured for me (i.e. endless tight loop of multiple Vault logins per second) the token TTL was 24 hours.

After removing the two lines mentioned in the issue description what I observe is a periodic token renewal that is consistent with the token TTL configured in the remote Vault server. For instance, when the TTL was adjusted to 5 minutes the renewals were occurring roughly every 3.5 minutes; when the TTL was 10 minutes the renewals were roughly every 7 minutes and so on.

I guess I am a little confused by the statement that the DoneCh returns a nil to signify some kind of failure has taken place that the user code needs to respond to (in this case force a new login). Apologies in advance for missing something obvious but in that scenario I would probably expect a non-nil error to be returned on the channel that contained details of the problem.

As I said, the current behaviour (minus the return nil) is exactly what I was looking for and has worked stable for the past few hours for several different token TTL values. Perhaps that is just dumb good fortune and the code has yet to encounter one of the not-quite-an-error scenarios referred to?

Incidentally, just in case it is relevant, my code is using v1.3.0 of the Vault API module.

Giant comment time! Forgive me if this is all old news to you.

It's worth noting that a token's lease TTL is different from its max lease TTL, as you can read about here. A token cannot be renewed forever, it can only be renewed until its max TTL is reached, at which point you must perform a full login again. Max TTL is by default 768h, and is configurable system-wide in the Vault server config or overridden for particular auth backends.

So if you remove those two lines you mentioned, your code will block infinitely once the token's max TTL has been reached (which is a situation in which DoneCh returns nil, at least in my testing.) The fact that it's worked stable for you for several hours means your max TTL is likely set to something high, so it's just currently only entering the case renewal := <-watcher.RenewCh(): part of the switch statement and happily renewing, but once it inevitably reaches the max TTL (i.e. the point where it's not allowed to be renewed anymore), that manageLifetimeWatcher function will block because it won't have returned from the for loop. The return nil is there to make sure that doesn't happen.

It looks like these are the constants that determine how the DoneCh behaves. Note the top one is the default used by LifetimeWatcherInput. (I'm going to check with some folks who are more familiar with the history behind the LifetimeWatcher and see if they know why it works that way, because I agree that it's kinda awkward that the DoneCh returns nil instead of an error when the token can't be renewed anymore. Either way, you'd want the check for nils coming through that channel though.)

I would love help reproducing your problem where the DoneCh seems to trigger immediately though, as I can't seem to make it do that unless I deliberately force the token to be non-renewable, such as by setting the max TTL to something super tiny.

Hey @digivava thank you for the additional information which has really helped give me a better understanding of the issues around the max TTL. I really appreciate it.

Turns out that when the return nil statement was reinstated in my code the originally described problem did not reoccur! Obviously there was something else fishy in my environment at the time but darned if I can recreate it now. Awkward.

Using the information you supplied I tuned the max and default TTL values in my development Vault server and was able to bring about the scenario you mentioned, where further token renewals were no longer possible, and saw the difference the return nil made.

Going to close this issue now. Once again, sincere thanks for all of your help and for kindly sharing your wisdom on the matter.