Financial-Times/dotcom-reliability-kit

We don't differentiate between deployment environment and Node.js environment

Closed this issue · 3 comments

In app-info (and, by extension, opentelemetry) we don't have a way of differentiating between the deployment environment (e.g. development, staging, production) and the Node.js environment (the value of NODE_ENV). This means that we can't tell the difference between production and staging metrics, because in both cases the NODE_ENV environment variable is set to production.

Node.js version: All
Impacted package(s): app-info, opentelemetry

Steps to reproduce

This isn't very easy to reproduce as you need to set up a system in OpenTelemetry that has a long-lived staging environment. It's easier to highlight the code in question.

  • Here you can see that we use process.env.NODE_ENV for appInfo.environment

  • Here we use that same variable for semanticConventions.deployment.environment

  • Here we use this alias to set the OpenTelemetry deployment.environment resource attribute.

  • Check staging vs production environment variables for one of your apps. The NODE_ENV environment variable is almost certainly set to production in both

  • It's not possible to set the NODE_ENV environment variable to something like staging because a lot of our underlying tooling changes behaviour based on the value being production (e.g. Express caching rendered views).

Expected behaviour

I expect us to be able to set the deployment environment to something other than NODE_ENV which has a different purpose.

We do need this value in app-info to ensure that our logs and metrics have the same value, and to make this available to apps that want to detect whether they're running in production or staging. Because of this, I don't think using the OTEL_RESOURCE_ATTRIBUTES environment variable is enough.

Unfortunately there are no built-in resource detectors for this value except for Microsoft Azure which uses the WEBSITE_SLOT_NAME environment variable. I don't think this makes sense for us.

Across the FT we use a few different environment variables:

  • RELEASE_ENV is fairly common
  • DEPLOY_STAGE is uncommon but is closer to the semantic resource attribute name
  • ... there may be more, please let me know!

As we haven't settled on a value, I'd be tempted to try and standardise on a more descriptive one, e.g. DEPLOYMENT_ENVIRONMENT or DEPLOYMENT_ENV.

It'd be possible to cater for all of these using the following:

const environment =
	process.env.DEPLOYMENT_ENVIRONMENT ||
	process.env.DEPLOY_STAGE ||
	process.env.RELEASE_ENV ||
	process.env.NODE_ENV ||
	'development';

I think this is what I'm leaning towards as a solution.

Related things I've briefly thought about, we've got environments in Doppler that might be helpful in some way, plus in Heroku review apps you have app.json to lean on perhaps.

If there is a vote, DEPLOYMENT_ENVIRONMENT would have mine, it aligns well to the OTel attribute name so might eventually be a bit intuitive as to it's purpose.

Unfortunately there are no built-in resource detectors for this value except for Microsoft Azure which uses the WEBSITE_SLOT_NAME environment variable. I don't think this makes sense for us.

Yep, from what I've seen, deployment.environment is always documented to be set via OTEL_RESOURCE_ATTRIBUTES, and I can see why, there just doesn't seem to be much out of the box to lean on for a value across most hosting platforms. So it does feel like a good area to figure out our own convention for.

In Engineering Insights we use two variables to indicate how we want our systems to behave (using NODE_ENV) and which environment they are running in (using ENVIRONMENT)

  • NODE_ENV is left unset locally, set to "test" when running test suites (forced by Jest) and to "production" when the code is deployed anywhere. If we're deploying to Heroku then they will set NODE_ENV to "production" by default.
  • ENVIRONMENT is usually left unset locally and set to "test" or "prod" when deployed to either environment.

There are two issues we find with this:

  1. As Sam alludes to above, it's not clear what the difference is between the two variables
  2. The value "test" to indicate deployment environment should probably be "staging" to avoid confusion.

OK I think it makes sense to choose a main one, I prefer DEPLOYMENT_ENVIRONMENT too because it's explicit.

RELEASE_ENV is used quite extensively and it's also named fairly sensibly so I'll include this one. I'm going to exclude DEPLOY_STAGE because it's very specific to two Customer Products serverless apps and I'm happy to change them.

Although I dislike ENVIRONMENT, it's actually quite widely used so I'll include this too.

We'll end up with this:

const environment =
	process.env.DEPLOYMENT_ENVIRONMENT ||
	process.env.RELEASE_ENV ||
	process.env.ENVIRONMENT ||
	process.env.NODE_ENV ||
	'development';

@sjparkinson @i-like-robots I'll open a PR and we can discuss further there if you're not keen on any decisions I make.