- Remove NixOS AMI generation dependency
- PR to nixpkgs adding all the regions currently available, in order to get NixOS AMIs available on them
- Make sure that the AMIs are available
- Refactor the Terraform config to just use the available AMIs instead of generating them
- Figure out a way to hide the secrets
- Change personal AWS tokens (due to commit history)
- Confirm that IOHK tokens have never been committed
- Add 2 different cardano-node versions to niv
- Add a let-in replacement for all the servers' targetHost
- Get #4196 merged
- Add monitoring node configuration
There are regions that do not have NixOS AMIs, so one needs to generate an AMI and upload it to those regions:
- jp
- sg
- au
- br
- Export AWS Secrets
For that we need an S3 bucket in each region and a set of specific IAM roles (https://docs.aws.amazon.com/vm-import/latest/userguide/vmimport-image-import.html). It is not possible to copy official NixOS AMIs from other regions to the ones we need, so we need to generate our own and upload them. The following links can help:
- https://nixos.wiki/wiki/Install_NixOS_on_Amazon_EC2
- NixOS/nixpkgs#85857
Note: change the home_region, bucket and regions vars, and edit the lines related to make_image_public if needed.
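As a rough illustration of that note, the variables would be adjusted near the top of create-ami.sh; the names below come from the note itself, but the example values are assumptions, not the script's actual contents:

```bash
# Hypothetical excerpt of create-ami.sh; values are placeholders.
home_region=eu-west-1                 # region where the image is built/imported
bucket=my-nixos-ami-import-bucket     # S3 bucket used by the vmimport flow
regions="ap-northeast-1 ap-southeast-1 ap-southeast-2 sa-east-1"   # jp, sg, au, br

# If the resulting AMIs should be public, keep (or re-enable) the
# make_image_public lines mentioned in the note.
```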
After that, get the AMIs for each region and add them to the Terraform configuration.
NOTE: As of the NixOS 22.05 release, AMIs for all AWS regions are available, so this step is no longer needed.
In the dev-deployer-terraform folder there's a main.tf file that holds the Terraform
configuration to deploy NixOS machines on different AWS regions. This Terraform config
also runs two bash commands: one to create a NixOS image, and a script to upload that
image to an AWS S3 bucket and import it as an AMI, making it available in all the
necessary regions.
To run the terraform config from a clean AWS configuration do the following:
- Make sure all the regions you want are enabled;
- At the current time we do not have any instance in Bahrain, for example. If we did, then we'd also need to enable global permissions, see: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html?icmpid=docs_iam_console
- Make sure the account has the necessary permissions;
- Make sure your credentials are correctly configured:
aws configure
- Make sure you edit main.tf and create-ami.sh to be in sync (e.g. S3 bucket name, home-regions, etc.);
- Do terraform init;
- Do terraform plan to check you haven't forgotten anything;
- If everything looks good, do terraform apply and let it run;
- After it finishes you should have (a quick CLI check follows this list):
  - An S3 bucket;
  - A role and policy called vmimport;
  - A security role for each machine's region enabling traffic;
  - A key pair for each machine's region;
  - An EC2 instance.
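A minimal way to double-check those resources from the CLI, assuming your AWS credentials are already configured (the bucket name and region below are placeholders):

```bash
# Confirm the import bucket exists
aws s3 ls

# Confirm the vmimport role was created
aws iam get-role --role-name vmimport

# Confirm the EC2 instances are up in a given region
aws ec2 describe-instances --region eu-west-1 \
  --query 'Reservations[].Instances[].[InstanceId,State.Name,PublicIpAddress]' \
  --output table
```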
Due to the way S3 buckets work, if something goes wrong during the plan execution
you might not be able to perform terraform destroy. If that is the case you will have to
delete everything by hand if you want to rerun the script from a clean state; you might
get away with deleting only the S3 bucket and then running terraform destroy. On the
other hand, if everything finishes successfully you will be able to perform terraform
destroy; just make sure the bucket is empty beforehand. NOTE: the AMIs and their
respective snapshots won't get deleted, so you will have to delete those by hand (a
sketch of this cleanup follows).
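A hedged sketch of that manual cleanup; the bucket name, AMI ID, snapshot ID and region are placeholders you'd look up first:

```bash
# Empty the S3 bucket so terraform destroy can remove it
aws s3 rm s3://my-nixos-ami-import-bucket --recursive

# AMIs and their snapshots are not removed by terraform destroy,
# so deregister/delete them by hand in each region
aws ec2 deregister-image --image-id ami-0123456789abcdef0 --region eu-west-1
aws ec2 delete-snapshot  --snapshot-id snap-0123456789abcdef0 --region eu-west-1
```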
I believe we'll only need to run the deployment once and, if needed, only rerun the script
to make NixOS AMIs available in new regions. For deployment we should run something like
terraform apply -target=resource.
NOTE: As of the NixOS 22.05 release, AMIs for all AWS regions are available, so this information is no longer accurate.
THINGS TO KEEP IN MIND:
- The create-ami.sh script will cache things in $PWD/ami/ec2-images, so you might want to delete that directory (as shown below) when trying to obtain a clean state;
- You ought to rename the bucket if you want to rerun the deployment after a previously used bucket was deleted, since AWS might take some time to recognize that the bucket was deleted.
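For the first point, clearing the cache is just a directory removal (the path is the one quoted above):

```bash
# Remove the create-ami.sh cache to start from a clean state
rm -rf "$PWD/ami/ec2-images"
```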
After having run the Terraform configuration, you have to manually get the public IPs
for each machine and add them to the machine-ips.nix file. Then you should create a
new nixops network with nixops create -d my-network network.nix and deploy it with
nixops deploy -d my-network. You should be able to get the IPs for each region by
running the following command:
terraform show -json | jq '.values.root_module.child_modules[].resources[].values | "\(.availability_zone) : \(.public_ip)"' | grep -v "null : null"
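Putting those steps together, a possible sequence looks like this (the deployment name my-network matches the text above; editing machine-ips.nix remains a manual step):

```bash
# 1. Print availability zone / public IP pairs for the deployed instances
terraform show -json | jq '.values.root_module.child_modules[].resources[].values | "\(.availability_zone) : \(.public_ip)"' | grep -v "null : null"

# 2. Copy those IPs into machine-ips.nix by hand

# 3. Create the nixops network and deploy it
nixops create -d my-network network.nix
nixops deploy -d my-network
```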
If one updates the NixOS version of the AMIs, be sure to also update the nixpkgs version in niv to the same one.
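For example, assuming the niv source is called nixpkgs and the AMIs were moved to the 22.05 release (the branch name here is an assumption):

```bash
# Point the pinned nixpkgs at the same release the AMIs were built from
niv update nixpkgs -b release-22.05
```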
nixops will try to ssh into the machines as root so you might need to run:
> eval `ssh-agent`
> ssh-add ssh-keys/id_rsa_aws
Please NOTE that if the machine you're deploying from (the local machine) has a nixpkgs version that is different from or incompatible with the one on the remote side (the remote machine that is going to get deployed), you will need to modify your deployment to use a different nix path. You will notice this through strange errors such as "service.zfs.expandOnBoot does not exist". So, after creating the deployment, if you get weird errors like the one described above:
- Run niv show to get the nixpkgs version and URL;
- Copy the nixpkgs URL being used;
- Run nixops modify -I nixpkgs=<url> -d my-network network.nix;
- Try again (see the sketch after this list).
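As a sketch, the whole recovery sequence with the deployment name used above (the <url> placeholder is the nixpkgs URL reported by niv show):

```bash
niv show                                          # note the nixpkgs URL
nixops modify -I nixpkgs=<url> -d my-network network.nix
nixops deploy -d my-network                       # try the deployment again
```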
If you want to further configure each individual server you can look into: https://github.com/input-output-hk/cardano-node/blob/master/nix/nixos/cardano-node-service.nix#L136 to see all the options available for configuration.
After deploying you can set up cardano-tracer to monitor the nodes with RTView or with
Prometheus and EKG servers (please see more here).
Currently this deployment repository is configured to work with cardano-tracer, monitoring the nodes via RTView (please check the specific RTView page). Succinctly, there are 2 ways of running RTView:
- Local
- Distributed
In the local setup, cardano-tracer and cardano-node run on the same
machine. In a distributed setup cardano-tracer can be used to monitor multiple cardano-nodes which all run on different machines.
The one we want is the distributed one, where cardano-tracer runs on the deployer
machine and the nodes on the deployed AWS instances.
There are two particular things about the distributed way of setting up cardano-tracer:
it can act as a Server or as a Client. As a Server, cardano-tracer waits for connections
from cardano-node instances (so nodes are clients); and as a Client, cardano-tracer
connects to cardano-node instances (so nodes are servers). We pick which way we want by
configuring cardano-tracer with an AcceptAt or ConnectTo attribute and running the
nodes with a --tracer-socket-connect or --tracer-socket-accept flag.
For our particular situation, we are going to use cardano-tracer as a client and
cardano-node instances as servers, since we can only establish a one-way connection
from the deployer machine to the AWS instances via SSH. As our configuration is static and
we won't be adding/removing nodes very often this shouldn't be much of a problem.
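A minimal sketch of that client-side setup, assuming cardano-tracer reads a JSON config and that the field names below match its documented ConnectTo format (treat them as assumptions and check the cardano-tracer documentation; the socket paths would be the local ends of the SSH forwardings):

```bash
# Hypothetical cardano-tracer config: the tracer connects out to each forwarded
# node socket (ConnectTo), so the cardano-node instances act as servers.
cat > tracer-config.json <<'EOF'
{
  "networkMagic": 764824073,
  "network": {
    "tag": "ConnectTo",
    "contents": [
      "/tmp/forwarder-server-us-west-1.sock",
      "/tmp/forwarder-server-eu-west-1.sock"
    ]
  }
}
EOF

# The nodes, in turn, are started with the accept-style flag mentioned above
# (check cardano-node run --help for its exact spelling).
cardano-tracer --config tracer-config.json
```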
More details about this tool can be found on its homepage.
This repository has a folder called dev-deployer-cardano-tracer where you can find the
config.json and make-tunnels.nix files. If you have filled in ./machine-ips.nix
and you have the machines running, then all that is needed is to run:
> nix-build makeScripts.nix
This is going to generate make-tunnels-mainnet.sh and make-tunnels-testnet.sh, which are
responsible for launching cardano-tracer and starting up all the needed ssh local port
forwardings. These scripts are stored in the nix store, but symlinks to them named
result and result-2 can be found in the directory where that last command was run:
result is for mainnet and result-2 is for testnet.
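Assuming the result symlinks are directly executable wrappers (an assumption; they may need to be invoked through bash explicitly), starting everything then looks like:

```bash
./result      # mainnet tunnels + cardano-tracer
./result-2    # testnet tunnels + cardano-tracer
```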
Lastly, one can set up one more local port forwarding from one's personal
computer to the dev-deployer machine in order to access the WebUI:
> ssh -nNT \
-L 3100:0.0.0.0:3100 \
-L 3101:0.0.0.0:3101 \
-L 3200:0.0.0.0:3200 \
-L 3300:0.0.0.0:3300 \
-o "ServerAliveInterval 60" \
-o "ServerAliveCountMax 120" \
-o "StreamLocalBindUnlink yes" \
dev-deployer
for the mainnet monitor. And
> ssh -nNT \
-L 4100:0.0.0.0:4100 \
-L 4101:0.0.0.0:4101 \
-L 4200:0.0.0.0:4200 \
-L 4300:0.0.0.0:4300 \
-o "ServerAliveInterval 60" \
-o "ServerAliveCountMax 120" \
-o "StreamLocalBindUnlink yes" \
dev-deployer
for the testnet monitor.
NOTE: this is going to launch several programs in the background. If you wish to completely terminate them, make a quick search with:
> ps aux | grep 'ssh -nNT'
> ps aux | grep 'cardano-tracer'
and kill whichever processes you wish.
Alternatively, you can run clean.sh, but be careful: there might be other unrelated
processes matching the same grep regex.
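A hedged cleanup sketch; pkill -f matches against the full command line, so inspect the output of the greps above first to avoid killing unrelated processes:

```bash
# Terminate the background SSH tunnels and the tracer started by the scripts
pkill -f 'ssh -nNT'
pkill -f 'cardano-tracer'
```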
There are a couple of things one should note about the current deployment. The first is that the cardano-node service depends on a particular commit, as one can read in the network.nix file:
# Common configuration shared between all servers
defaults = { config, lib, ... }: {
# import nixos modules:
# - Amazon image configuration (that was used to create the AMI)
# - The cardano-node-service nixos module
imports = [
"${sources.nixpkgs.outPath}/nixos/modules/virtualisation/amazon-image.nix"
# It should not matter if we use the mainnet or testnet ones since we are going to
# overwrite the cardano-node packages in the cardano-node service if needed.
#
# NOTE that currently we need to be running the mainnet one since it is the version
# that is pinned to the bolt12/cardano-node-service-release - this branch has currently:
# - node version 1.35.x with a needed bug fix
# - is rebased on top of bolt12/cardano-node-service which extends the cardano-node-service
# with much needed improvements
#
# While this is the case be sure to include commit 9642ffec16ac51e6aeef6901d8a1fbb147751d72
# (https://github.com/input-output-hk/cardano-node/pull/4196) in the most recent master version
cardano-node-mainnet.nixosModules.cardano-node
];

Current release (1.35.5) does not contemplate PR #4196; for this reason one has to cherry-pick the changes from that PR on top of the release 1.35.5 tag (see the sketch below). If the cardano-node version used already has these changes, ignore this paragraph.
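A sketch of that cherry-pick, using the release tag and the commit hash quoted in the comment above (the branch name is just an example):

```bash
# Start a branch from the 1.35.5 release tag of cardano-node
git clone https://github.com/input-output-hk/cardano-node.git
cd cardano-node
git checkout -b 1.35.5-with-4196 1.35.5

# Apply the PR #4196 change referenced above
git cherry-pick 9642ffec16ac51e6aeef6901d8a1fbb147751d72
```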
The second thing to note is that currently server-us-west is overwriting services.cardano-node.cardanoNodePackages to test a particular cardano-node revision. One should take this into consideration when wanting to update the node.
If you want to test how a given node version/branch/revision behaves on mainnet/testnet,
all you have to do is change the services.cardano-node.cardanoNodePackages attribute for the server instance of your choosing.
In network.nix you will find more details:
# If you wish to overwrite the cardano-node package to a different one.
# By default it runs the cardano-node-mainnet one.
# You ought to put this on a particular server instead of in the default attribute
# cardanoNodePackages =
#   cardano-node-mainnet.legacyPackages.x86_64-linux.cardanoNodePackages;

A good way to do this is to add a new cardano-node version with niv:
niv add input-output-hk/cardano-node -n <name>
niv update <name> -b <branch>
or
niv update <name> -r <rev>
And then add it at the top-level of the network.nix file:
let
...
<name> = (import sources.<name> {});
in
...

and use it:

cardanoNodePackages =
  <name>.legacyPackages.x86_64-linux.cardanoNodePackages;

If you want to have a different configuration for a particular server instance, e.g. enable a set of traces on testnet but not on mainnet, you can do that by changing the services.cardano-node.extraNodeInstanceConfig attribute.
In network.nix you can find examples of this, e.g.:
- In the common services.cardano-node configuration:
extraNodeInstanceConfig =
# Custom common node configuration for both mainnet and testnet
# instances.
let custom = i : {
...
};
in
ifMainnetC lib.recursiveUpdate
custom
config.services.cardano-node.environments.mainnet.nodeConfig
config.services.cardano-node.environments.testnet.nodeConfig;

- In a particular server's services.cardano-node configuration:
# Add particular RTView Config
extraNodeInstanceConfig = i : { TraceOptionNodeName = "server-us-west-${toString i}"; };

In the scripts folder you'll find scripts/collect-resources.sh. Running ./scripts/collect-resources.sh will generate a file called combined.png, containing a 2x4 montage with heap consumption information for all 8 deployed machines.
- https://zimbatm.com/notes/deploying-to-aws-with-terraform-and-nix
- https://github.com/tweag/terraform-nixos
- https://github.com/colemickens/nixos-flake-example
- https://github.com/edolstra/flake-compat
- https://github.com/serokell/pegasus-infra
- https://github.com/serokell/deploy-rs