sous-chefs/nginx

Validating nginx configs to prevent sites from going down on reload of bad config

axelrtgs opened this issue · 5 comments

🗣️ Foreword

Thank for taking the time to fill this feature request fully. Without it we may not be able to , and the issue may be closed without resolution.

:person_frowning: Problem Statement

So we got hit by a small typo in one of our site configs that made the nginx config invalid. This meant that after we fixed it in the template and reran chef before it even got to copying the fixed config it tries to start the service as part of the install which fails due to the bad config and chef exits without even getting a chance to fix the bad config. Solution was to manually fix it but it would be great if we could handle it in the cookbook somehow.

❔ Possible Solution

verifying the config using nginx -t on the config and site templates seemed like a good idea at first glance and i started to type out the PR but quickly realized that this would not work because the site fragment is not a full and valid config and we copy templates to sites-available so when verifying the entire config the files never get checked anyway making the verify option on the template resource not a viable solution here.

⤴️ Describe alternatives you've considered

I have thought about delegating the task to the start/reload of the service and adding an only_if "nginx -t -c #{::File.join(nginx_dir, 'nginx.conf')}" to the service definition so that it would only ever get started/reloaded if the config is valid avoiding taking down an otherwise working web server due to a config error.

The down side to this approach is it would fail silently and you would no longer know the config was bad/not loaded until much later wheras in the current state the chef run will fail and the service will go down so you would get alerted pretty quick.

I feel this can be solved by using a shell-out to verify the config and then only_if { shell_out_result_pass } and have a raise "nginx config error" unless shell_out_result_pass in a delayed notification to run at the end which would avoid taking down the entire server for a config error and raise a chef error so you can see the failed run and fix it quickly hopefully without causing an outage like we did.

➕ Additional context

I am opening an issue before starting the work as i wanted to avoid working on the changes if my shellout verify to block service start/reload and raise an error is overly complex and no one else sees any value in this. For me there is tremendous value in not loading a bad config taking down nginx and being able to use chef to deploy the fixed config without manually having to adjust things when its broken.

Also up for discussion is do i only validate the site configs or do we validate the config resource also. i think both would be good to validate in case someone is manually managing site configs and only using chef to deploy the main config.

This pattern in dhcp might be useful here too. @bmhughes implemented it there and can maybe provide some feedback there.

Another option is to the go the route we went with in haproxy where we manage the systemd unit file and just add a ExecStartPre parameter which does the same thing.

As for as the pattern I've implemented in dhcp and syslog-ng I've had good results with it thus far, no more erroneous service restarts to a bad config. Using guards alone I still ran into situations where the service resource may still get notified depending on resource ordering, hence adding the explicit resource_delete! for the service to ensure it can't run.

My only additional thought around the use of ExecStartPre is how it handles restarts when the config test fails, the docs mention that the unit is marked as failed so without testing I'm not sure whether that'd end up killing the original running process or not? Also come to think of it I'm not sure how it'd work with a reload rather than a restart either actually thinking about it.

copying in a suggestion from slack:

Tensibai  2 days ago
IMHO replicating what's done on apache2 cookbook is the way to go, guard the service by validating the config, this way it doesn't restart/reload a bad Config and chef can fix the config, the drawback is that you end up with a successful run and ninx running on its old Config without something warning you :/
An alternative could be an execute resource notified before the service restart, if the conf check fail that should break the run and avoid the restart

Unless we delay the service restart/reload to the end of the play, the issue is that if the config is broken Chef can't run to fix it.
If we can delay the restart/reload as a delayed notification and still have another notification running before to check if the config is valid (and cancel the reload if needed) then we'll be okay.

Otherwise, on the path of using ExecStartPre and ExecReload, looks like a systemd unit can have multiple ExecReload and they'll be executed sequentially.

https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecReload=

This argument takes multiple command lines, following the same scheme as described for ExecStart= above.

https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStart=

If more than one command is specified, the commands are invoked sequentially in the order they appear in the unit file. If one of the commands fails (and is not prefixed with "-"), other lines are not executed, and the unit is considered failed.

But that would then join @bmhughes question of "I'm not sure whether that'd end up killing the original running process or not?"

I went with a simple solution to avoid raising errors. I guard the service actions to not run unless the config is valid (apache2 style) and log an error message if the config is invalid so that anyone catching these in log monitoring can see them and address hosts with invalid configs that didn't get loaded.

This should prevent the runs from failing when trying to do service actions and allow for chef to continue and be able to replace the invalid config with a hopefully valid one. It also follows a similar pattern to apache2 which i thought was good for consistency.