Nitrokey/nextbox

Letsencrypt certificate not renewing

Closed this issue · 12 comments

After running my Nextbox since it was first released, I suddenly get E-Mails from Letsencrypt that my certificate is expiring. Checking the system, everything seems fine:

nextuser@nextbox:~ $ sudo systemctl status certbot.service 
● certbot.service - Certbot
   Loaded: loaded (/lib/systemd/system/certbot.service; static; vendor preset: enabled)
   Active: inactive (dead) since Wed 2024-07-10 04:42:05 BST; 7h ago
     Docs: file:///usr/share/doc/python-certbot-doc/html/index.html
           https://letsencrypt.readthedocs.io/en/latest/
  Process: 16876 ExecStart=/usr/bin/certbot -q renew (code=exited, status=0/SUCCESS)
 Main PID: 16876 (code=exited, status=0/SUCCESS)

Jul 10 04:42:01 nextbox systemd[1]: Starting Certbot...
Jul 10 04:42:05 nextbox systemd[1]: certbot.service: Succeeded.
Jul 10 04:42:05 nextbox systemd[1]: Started Certbot.

However, it says /usr/bin/certbot -q renew in the service. So what about the certificate?

nextuser@nextbox:~ $ sudo /usr/bin/certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
No certs found.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Looks like it does not find it and thus cannot renew it, because it is not in the default place. certbot is called with a specific config dir in this python script, so shouldn't that also do it in the service?

How did that work in the past anyway? My certificate was renewed last time on April 28th. Was there some update that changed this, recently?

  • NextBox Daemon Version: 1.1.9-1
  • NextBox Debian Version: 10.13

Please note that we can't give support for anything you do directly on the NextBox via ssh.
You can check the certificate status in the NextBox App under HTTPS / TLS.
Also your browser should warn you if you tried to access your NextBox over HTTPS with an invalid certificate.

Under the hood NextBox does not use the service and instead renews certificates manually.
As you found out correctly the certbot config directory used is also not default, so when you look up certificates in the default config directory it obviously can't find any.
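The point about the non-default config directory can be sketched as follows. This is a minimal illustration, not NextBox code: it assumes the certbot data lives under `/srv/letsencrypt` (the path named further down in this thread) and that this path is what would be passed as `--config-dir`. The helper name `certbot_list_command` is hypothetical. The command is only printed, not executed.

```python
import shlex

# Assumption: path mentioned later in this thread; it may differ on your box.
ASSUMED_CONFIG_DIR = "/srv/letsencrypt"


def certbot_list_command(config_dir=ASSUMED_CONFIG_DIR):
    """Build a certbot call that lists certificates from a non-default config dir."""
    # "certificates" lists known certificates; --config-dir is a standard certbot option.
    return ["certbot", "certificates", "--config-dir", config_dir]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing on the box is touched.
    print(" ".join(shlex.quote(part) for part in certbot_list_command()))
```

Without `--config-dir`, certbot looks in its default location, which would explain the "No certs found." output above.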

Under the hood NextBox does not use the service and instead renews certificates manually.

But this seems to not work anymore for about half a year now. I am getting E-Mails from Letsencrypt warning that my Nextbox certificate is about to expire. I guess that is not expected behaviour, right?

But this seems to not work anymore

How do you determine this apart from the e-mails ? is your TLS connection still valid/verified ? Please don't rely on these e-mails...

Let's encrypt sometimes sends those emails w/o reason - please check: https:///apps/nextbox/# there "HTTPS / TLS" - there the certificate info should be visible. Generally, please don't mess with the system and/or the services on the system. As for your initial question: no, the service is not used, therefore the config is not applied for the service.

Sorry for not making this any clearer, but of course I did check the actual certificate. I have indeed already had to renew it manually (via SSH 🤫 ) shortly after filing this issue, because there was no response and I could not wait for it to expire. It is currently valid for less than three more weeks, while it is supposed to be renewed 30 days before expiration.

[openssl-3.0.14 openssl-3.0.14] ~ echo | openssl s_client -connect cloud.mydomain.de:443 2>/dev/null | openssl x509 --dates --noout
notBefore=Jul 20 09:15:38 2024 GMT
notAfter=Oct 18 09:15:37 2024 GMT

and that's what it also says on the "HTTPS / TLS" page you are referring to.
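As a rough cross-check of the "less than three weeks" statement, the `notAfter` line can be turned into a number of remaining days. This is only a sketch: the function names are hypothetical, the 30-day threshold is certbot's default renewal window, and the "now" value in the usage part is an illustrative date chosen for the example.

```python
from datetime import datetime, timezone

RENEW_WINDOW_DAYS = 30  # certbot's default: renew once fewer than 30 days remain


def parse_openssl_date(line):
    """Parse a line such as 'notAfter=Oct 18 09:15:37 2024 GMT' into a UTC datetime."""
    value = line.split("=", 1)[1].strip()
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def days_left(not_after_line, now):
    """Whole days between `now` and the certificate's expiry."""
    return (parse_openssl_date(not_after_line) - now).days


if __name__ == "__main__":
    # Illustrative 'now'; replace with datetime.now(timezone.utc) for a live check.
    now = datetime(2024, 9, 30, 12, 0, tzinfo=timezone.utc)
    remaining = days_left("notAfter=Oct 18 09:15:37 2024 GMT", now)
    print(remaining, "days left; inside renewal window:", remaining < RENEW_WINDOW_DAYS)
```

A value below the renewal window means an automatic renewal should already have happened.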

ok, did you check /var/log/letsencrypt/letsencrypt.log for any errors .... best case you check /var/log/nextbox.log for entries like starting worker job: RenewCertificates and check the log messages inside the letsencrypt.log during this timeframe - there you should find some error which will tell us what went wrong.
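Narrowing letsencrypt.log down to the timeframe of a job run could look roughly like this. It is a sketch under assumptions: the timestamp prefix format is taken from the letsencrypt.log excerpt in this thread, the format of nextbox.log is not shown here, so the job start time has to be supplied by the reader, and `entries_near` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def entries_near(lines, job_start, window=timedelta(minutes=10)):
    """Return log lines stamped within `window` after `job_start`."""
    hits = []
    for line in lines:
        try:
            # letsencrypt.log lines start with e.g. '2024-09-30 11:40:48,675:'
            stamp = datetime.strptime(line[:19], TS_FORMAT)
        except ValueError:
            continue  # continuation lines (tracebacks etc.) carry no timestamp
        if job_start <= stamp <= job_start + window:
            hits.append(line)
    return hits


if __name__ == "__main__":
    # job_start would come from the matching 'RenewCertificates' entry in nextbox.log.
    with open("/var/log/letsencrypt/letsencrypt.log") as handle:
        for entry in entries_near(handle, datetime(2024, 9, 30, 11, 40, 0)):
            print(entry.rstrip())
```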

also there are good chances you broke the mechanism by updating it by hand if you didn't use the same method as the automated way - but no idea - this is non-deterministic territory essentially aaaaand technically also not supported xD

Thanks for the clues. According to the logs, the RenewCertificates job starts well, but errors are found in the letsencrypt.log:

2024-09-30 11:40:48,675:WARNING:certbot._internal.auth_handler:Challenge failed for domain cloud.mydomain.de
2024-09-30 11:40:48,675:INFO:certbot._internal.auth_handler:http-01 challenge for cloud.mydomain.de
2024-09-30 11:40:48,676:DEBUG:certbot._internal.reporter:Reporting to user: The following errors were reported by the server:

Domain: cloud.mydomain.de
Type:   connection
Detail: 123.123.123.123: Fetching https://cloud.mydomain.de/.well-known/acme-challenge/jNtyr9I5TsTwZIIMtDjpciGehhYC_1_wR0niZYs8ioA: Connection refused

To fix these errors, please make sure that your domain name was entered correctly and the DNS A/AAAA record(s) for that domain contain(s) the right IP address. Additionally, please check that your computer has a publicly routable IP address and that no firewalls are preventing the server from communicating with the client. If you're using the webroot plugin, you should also verify that you are serving files from the webroot path you provided.
2024-09-30 11:40:48,682:DEBUG:certbot._internal.error_handler:Encountered exception:
Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/certbot/_internal/auth_handler.py", line 91, in handle_authorizations
    self._poll_authorizations(authzrs, max_retries, best_effort)
    File "/usr/lib/python3/dist-packages/certbot/_internal/auth_handler.py", line 180, in _poll_authorizations
    raise errors.AuthorizationError('Some challenges have failed.')
certbot.errors.AuthorizationError: Some challenges have failed.

So according to this log, it looks like a connection error, right? This must be a temporary issue then, because both port 80 and 443 are open for IPv4 and v6.
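To see whether such failures recur or were a one-off, the relevant lines can be pulled out of the log with a small script. This is a sketch only: the two patterns are derived from the excerpt above, other certbot versions may word these messages differently, and `summarize_failures` is a hypothetical name.

```python
import re

FAILED = re.compile(r"Challenge failed for domain (\S+)")
DETAIL = re.compile(r"^Detail:\s*(.+)$")


def summarize_failures(lines):
    """Collect failed domains and the server-reported detail lines."""
    domains, details = [], []
    for line in lines:
        match = FAILED.search(line)
        if match:
            domains.append(match.group(1))
        match = DETAIL.match(line.strip())
        if match:
            details.append(match.group(1))
    return domains, details


if __name__ == "__main__":
    with open("/var/log/letsencrypt/letsencrypt.log") as handle:
        failed_domains, failure_details = summarize_failures(handle)
    print(len(failed_domains), "failed challenges")
    for detail in failure_details:
        print(" ", detail)
```

Several "Connection refused" details spread over different days would speak against a purely temporary problem.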

Is it possible that this happens, when the Nextcloud is in maintenance mode? I have described an issue about this in the support forum, which I am not sure yet whether it is a bug.

(I removed your IP from the log - sorry for modifying your post - or was this a letsencrypt ip oops)

are you using dedyn.io - so the guided dns ? or are you using the static domain config ? (you might want to

because the challenge for dedyn.io is DNS based, if this is the case then there is something massively wrong!
but I assume you use static dns config from the nextbox app - then connectivity is tested - which would mean that your nextbox is not available from the internet.... did you check your port forwarding ? are you sure this is not a private/CNAT/ds-lite/fake IPv4 ?
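The private/CGNAT/DS-Lite question can be checked against the WAN address the router reports. A minimal sketch, assuming the address is read manually from the router's status page; the function name and return labels are hypothetical, and a "likely-public" result does not prove that port forwarding actually works.

```python
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")  # RFC 6598 shared address space
DSLITE = ipaddress.ip_network("192.0.0.0/29")  # RFC 6333 DS-Lite range


def classify_ipv4(text):
    """Roughly classify an IPv4 address as seen on the router's WAN side."""
    addr = ipaddress.IPv4Address(text)
    if addr in CGNAT:
        return "cgnat"
    if addr in DSLITE:
        return "ds-lite"
    if addr.is_private:
        return "private"
    return "likely-public"


if __name__ == "__main__":
    for candidate in ("100.64.0.1", "192.168.1.10", "8.8.8.8"):
        print(candidate, "->", classify_ipv4(candidate))
```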

I also answered in the forums - so far this issue should spawn at least the following changes:

  • come up with a mechanism to avoid the jobs overlapping too aggressively
    or
  • change interval times for jobs to less "perfect" values ....

and

  • check the other file scanning methods, --all seems to be too excessive
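Why less "perfect" interval values reduce overlap can be shown with a small calculation. The interval values below are made up for illustration, since the actual NextBox job intervals are not stated in this thread; `coincidence_period` is a hypothetical helper.

```python
from math import gcd


def coincidence_period(interval_a, interval_b):
    """Seconds until two periodic jobs that started together fire together again."""
    # This is the least common multiple of the two intervals.
    return interval_a * interval_b // gcd(interval_a, interval_b)


if __name__ == "__main__":
    # "Perfect" values: an hourly and a two-hourly job collide on every second hour.
    print(coincidence_period(3600, 7200))
    # Slightly odd value: the jobs collide far less often.
    print(coincidence_period(3600, 7213))
```

Intervals that share no common factor only line up once per product of the two values, so collisions become rare without any locking.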

The plan is to release an nc30 upgrade next week, so I suppose these will be included there. It would be great if you could share more details on what (if anything) is working for your setup.

I use INWX dyndns service, updated by a curl request in /etc/cron.d/inwx on the Nextbox, no dedyn.io. Port forwardings for 80 and 443 on both IPv4 and IPv6 are set and functional.


After I got rid of the maintenance mode issues, the certificate still would not auto-renew, even though the RenewCertificate job ran multiple times. However, suddenly there were no more "connection refused" errors in the log, instead it said "No renewals were attempted". That was weird, because the nextbox was still serving the certificate that would expire on Oct 18.

So I thought that maybe I have messed up something while renewing the certificate manually in July. So I was like yolo and hit "Disable HTTPS" in the Nextbox app and then "Enable HTTPS" again.

Very unpleasantly I must report, that the browser frontend then got stuck at counting the seconds for the pending process and never got out of that. After the counter passed 1000s, I shift-refreshed the browser window and found the setting back at "Enable HTTPS" position. Tried two more times to enable it, but it never did any more than showing "pending" and counting seconds. Meanwhile, journalctl -f never had anything about the entire operation, besides the 200s for the /apps/nextbox/forward/status requests.

A short while later I accidentally hit refresh in an old open tab of the NC UI that still used HTTPS and found that it actually did refresh. So I rechecked the certificate using OpenSSL and found that the one now served was valid until next January. After refreshing the Nextbox app HTTPS section once more, it no longer showed "Enable HTTPS" there either, but the new certificate information with the new dates.

Obviously the UI went out of sync with the renewal process. But for now this seems to work again (not saying it's fixed). Will be back here in December, if it won't renew then again.

However, it is still an open question why the auto renewal first stopped working in March this year. I will happily share more logs as long as they have not been rotated yet. Please get in touch if you want to track this down further. Thanks for the support here anyway.


Side note: #36 is still an issue over here. However, the problem described here only started this March, so probably not related.

pfew, ok this is really going off the rails - as predicted - going towards a generic linux administration channel here. Please don't get me wrong, but you cannot just set up your own tooling on top of what's there, ignore the existing mechanisms (e.g., guided dyndns) and then finally expect us to debug the problems you get if you trigger another functionality that is built on top of the stock nextbox features (TLS/HTTPS).

The browser is more or less expected to freeze after you hit "Disable TLS" - this is a security mechanism, that your browser will usually deny loading a domain without TLS if it had TLS activated just before - the other way around should work. Also you again fail to explain how you tricked the nextbox to think you have set up dynDNS. Did you use the static domain setup? If yes, then the reachability problem and the non-renew certificate problem are the same: Let's encrypt cannot connect to your nextbox. The reachability test is strictly dependent on a proper nextbox configuration, given the many changes and modifications you've done it's really hard to tell where it fails - but MOST likely due to its inability to connect to your nextbox, as it says. Are you sure your nextbox is really reachable from an outside network, and HOW do you determine this?

On top - why your certificate information is broken now is also very likely because your certbot configuration is messed up by now.

Obviously the UI went out of sync with the renewal process.

yes, of course it did - because most likely it mixes up your certificate storage(s), of which you likely now have more than one as you triggered the service by hand before ...

There are some accusations in your reply, that I did not expect. Let's try to get some structure in what's going on here:

Please don't get me wrong, but you cannot just set up your own tooling on top of what's there, ignore the existing mechanisms (e.g., guided dyndns) and then finally expect us to debug the problems you get if you trigger another functionality that is built on top of the stock nextbox features (TLS/HTTPS).

Are you considering that curl cronjob "my own tooling"? Fair. But what existing mechanisms that I ignored are you referring to exactly? (besides guided dyndns, which I dismissed because it would make that entire project depend on some third party service I had never heard of before)

The browser is more or less expected to freeze after you hit "Disable TLS" - this is a security mechanism, that your browser will usually deny loading a domain without TLS if it had TLS activated just before

Not sure if I understand this correctly. Is this a Nextbox feature or are you explaining HSTS here? I personally consider the browser freezing an unexpected UI behaviour and not a security mechanism whatsoever.

the other way around should work

Are you referring to the state of not having HTTPS turned on? If yes, then that is what I meant, when I said that I saw a "pending" banner and a counter. It "should work" is what I expected, too.

Also you again fail to explain how you tricked the nextbox to think you have set up dynDNS. Did you use the static domain setup?

Yes, I do use static domain setup, because it runs on my own domain. Didn't feel much like I had to "trick" it into doing that though. However, I set this up three and a half years ago. Is it possible that back then there were not as many configuration options as there are today?

If yes, then the reachability problem and the non-renew certificate problem are the same: Let's encrypt cannot connect to your nextbox.

Well, it obviously can connect, because it just did, as I described above.

The reachability test is strictly dependent on a proper nextbox configuration, given the many changes and modifications you've done it's really hard to tell where it fails

Sounds like the software you are providing still has some weak points. I am here to help you hunt them down.

Are you sure your nextbox is really reachable from an outside network, and HOW do you determine this?

Yes. I live in a different city than where the Nextbox is at, and I connect to the browser interface via WWW.

Are you sure this issue is completed? I am not.

nope - not done, the PR closed this issue as I referred (checkboxed) the two points from above.

Are you considering that curl cronjob "my own tooling"? Fair. But what existing mechanisms that I ignored are you referring to exactly? (besides guided dyndns, which I dismissed because it would make that entire project depend on some third party service I had never heard of before)

Yes, clearly yes. In all good faith I can only assume you've done this right and did not affect any other services/configurations. So the point is not that you've done it, but rather that I have to take your word that you've done it correctly - and the latter is next to impossible to determine, because you most likely don't know whether you affected NextBox services/mechanisms in any way (if you knew, you would not have had to create this issue). Don't get me wrong, this is no accusation, just a bare observation - sorry if I sounded rude - that wasn't my intent.

But let's stop this meta discussion - and let's try to debug why your linux machine cannot properly answer a let'sencrypt reachability request. One point before that: yes it's essentially HSTS, which leads to the browser freezing. This is a mechanism we have been working a lot with - essentially the enable/disable calls lead to window.location.replace calls here - if you have a better way to achieve this w/o most of the current browsers freezing feel free to suggest how this can be improved - we'd happily improve this.

But now the main topic - let me summarize what we know so far:

  • you set up your own dynamic dns update service on the nextbox
  • you configured the nextbox to "use static domain dns"
  • you initially found out that your certificate was not refreshed
  • you tried fixing this with certbot renew and/or the accompanied service, didn't work (expected as letsencrypt root is in /srv/letsencrypt)
  • in the meantime your letsencrypt logs show that no renewals are needed (this is super weird and hints at a now misconfigured letsencrypt due to manual modifications through e.g., unexpected cmds)
  • you then switched TLS off and on through the NextBox App, which worked apart from some UX ugliness

so there are at least these weird things to check:

  • why does letsencrypt report the wrong cert (hard to tell, I don't know how strict letsencrypt handles multiple configuration roots)
  • disable/enable TLS weird behavior: all timeouts are set to 120 secs as far as I remember; if these are exceeded without the page reloading, there might be something blocking this behavior? javascript blocker? adblocker? etc.?
  • you can find the nextbox used certbot calls here - if you'd like to look into this deeper it would be interesting to see if these commands lead to expected outputs.

we could wait until december; alternatively you could disable TLS from the nextbox ui and then delete the letsencrypt (contents!) by hand, then reacquire those - but bottom line, especially as it works for you, this highly depends on what you'd like to invest right now (time-wise, ...)