status-im/infra-nim-waku

Deploy websockify with nim-waku nodes in test fleet

Closed this issue · 31 comments

D4nte commented

Problem

  • nim-waku does not support secure websocket connectivity.
  • js-waku in the browser only supports secure websocket connections.

Solution

  1. Get certificates using letsencrypt
  2. Deploy and run websockify to proxy nim-waku tcp connections and wrap them with SSL+Websocket

Steps

For each wakunode2 binary deployed in the fleet, assuming wakunode2 runs on <testmachine.statusim.net> and listens on TCP port <tcp_port>, do the following on that same machine:

  1. Install certbot:
     sudo apt install certbot
  2. Get certificates:
     sudo certbot certonly -d <testmachine.statusim.net>
  3. Install websockify:
     sudo apt install websockify
  4. Start websockify:
     sudo websockify --cert /etc/letsencrypt/live/<testmachine.statusim.net>/fullchain.pem --key /etc/letsencrypt/live/<testmachine.statusim.net>/privkey.pem 0.0.0.0:443 127.0.0.1:<tcp_port>
  5. Register the websocket address on https://fleets.status.im/

The value should be /dns4/<testmachine.statusim.net>/tcp/443/wss/p2p/<peer-id>, reusing the peer-id value from the current JSON file.
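Per host, the steps above come down to two derived values: the websockify command line and the websocket multiaddr to register. A minimal Python sketch of both, assuming the default Let's Encrypt paths from the certbot step and that the existing JSON entry is a multiaddr ending in /p2p/<peer-id>; the function names are hypothetical:

```python
import shlex

LETSENCRYPT_LIVE = "/etc/letsencrypt/live"  # default certbot output directory


def websockify_argv(host: str, tcp_port: int, listen_port: int = 443) -> list[str]:
    """Build the websockify invocation for a wakunode2 listening on tcp_port."""
    cert_dir = f"{LETSENCRYPT_LIVE}/{host}"
    return [
        "websockify",
        "--cert", f"{cert_dir}/fullchain.pem",
        "--key", f"{cert_dir}/privkey.pem",
        f"0.0.0.0:{listen_port}",  # public TLS + WebSocket listener
        f"127.0.0.1:{tcp_port}",   # local wakunode2 TCP port being proxied
    ]


def wss_multiaddr(host: str, tcp_multiaddr: str, listen_port: int = 443) -> str:
    """Reuse the peer ID of the existing TCP multiaddr in a /dns4/.../wss address."""
    _, sep, peer_id = tcp_multiaddr.rpartition("/p2p/")
    if not sep or not peer_id or "/" in peer_id:
        raise ValueError(f"no trailing /p2p/<peer-id> in {tcp_multiaddr!r}")
    return f"/dns4/{host}/tcp/{listen_port}/wss/p2p/{peer_id}"


if __name__ == "__main__":
    host = "node-01.do-ams3.wakuv2.test.statusim.net"
    tcp = "/ip4/134.209.139.210/tcp/30303/p2p/16Uiu2HAmPLe7Mzm8TsYUubgCAW1aJoeFScxrLj8ppHFivPo97bUZ"
    print("sudo " + shlex.join(websockify_argv(host, 30303)))
    print(wss_multiaddr(host, tcp))
```

Deriving the wss address from the existing TCP entry, rather than typing the peer ID again, keeps the two entries for a node from drifting apart.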

Notes:

a. If the file system is preserved between deployments, steps 1, 2 & 3 may not be necessary.
b. I am not sure what the best way is to embed this new value in https://fleets.status.im/.
c. Once this is confirmed as working as expected on the test fleet, I'll open an issue for a similar setup on the prod fleet.
d. When run, certbot detects that a web server is present and proposes several methods to complete the challenge.

Cc @oskarth @jm-clius

Already commented in Discord, but: seems fine. I'd probably put this under a separate namespace from the current nim-waku test nodes, but still under wakuv2.test, and then connect it to the other nim-waku nodes. This means one can choose whether to connect to nim-waku, go-waku or all of them, and triggers aren't messed up.

put this under a separate namespace from current nim-waku test nodes but still under wakuv2.test

How would that work? If it's a separate fleet it should have a separate fleet name. Like gowaku.test or something.

then connect to the other nim-waku nodes

By "other nim-waku nodes" do you mean all the ones we run, or just the specific fleet: wakuv2.test?

How many hosts does it need? Does it need to be spread across multiple DCs? How beefy should the hosts be?

D4nte commented

How many hosts does it need? Does it need to be spread across multiple DCs? How beefy should the hosts be?

I imagine one is enough; it would act as a bridge letting websocket-only clients access the Waku test fleets. At this stage the only client is a webapp I use for dogfooding. Other users would only use the webapp as a showcase for js-waku.

Like gowaku.test or something.

Yes

just the specific fleet: wakuv2.test?

How many hosts does it need? Does it need to be spread across multiple DCs? How beefy should the hosts be?

1 seems good enough for now

D4nte commented

After further review, we need secure websocket support too (wss) to make the most out of a cluster deployment of go-waku.
We are currently tracking that in waku-org/go-waku#21.
Until this is done, the cluster deployment would not be very useful, and if it cannot be done in Go, the deployment would need a web server for the TLS layer.

Closing this until waku-org/go-waku#21 is resolved.

D4nte commented

Thanks to @richard-ramos, who already did the wss investigation on the Go side, we know that for now we have to use a web server to add the TLS layer, as go-libp2p does not yet support wss natively. See waku-org/go-waku#21 (comment) for details.

By any chance, would it be possible to include a webserver (nginx or other) as part of the deployment of go-waku?

D4nte commented

@richard-ramos mentioned that we should be able to set up subdomains for go-waku instances, like go-waku1.status.im, go-waku2.status.im, etc., and use the status.im certificate, so that we could have HTTPS without having to deal with self-signed certificates.

Hm, if possible I'd go for nginx + letsencrypt script. Making it part of status.im is kind of going against the p2p architecture, and these types of design decisions tend to stick around in a bad way. What are the downsides of self-signed certificates here? From a protocol design POV, status.im should have no privileged position whatsoever.

D4nte commented

Hm, if possible I'd go for nginx + letsencrypt script. Making it part of status.im is kind of going against the p2p architecture, and these types of design decisions tend to stick around in a bad way. What are the downsides of self-signed certificates here? From a protocol design POV, status.im should have no privileged position whatsoever.

Self-signed certificates are fine. However, AFAIK, you cannot get a Let's Encrypt certificate for an IP address, which means you would need a domain name. Is that an option?

Please note that we tried out a self-signed certificate for an IP address, generated via openssl (@richard-ramos please confirm), and it worked fine without having to install a CA.

Please note that we tried out a self-signed certificate for an IP address, generated via openssl (@richard-ramos please confirm), and it worked fine without having to install a CA.

Yes, I created the self-signed certificate using openssl. I was expecting Chrome to complain about the validity of the certificate, but surprisingly it worked with no issues.

D4nte commented

Please note that I am now investigating the use of websockify directly with nim-waku. This could add an SSL+websocket layer to nim-waku, removing the need to deploy go-waku and nginx.

Best to hold off on this present issue for now.

D4nte commented

FYI @oskarth @jm-clius @jakubgs I have updated the description for a new strategy.

At the moment I am proposing to use letsencrypt certificates, simply because I find it more straightforward and I already know how to do it.
@oskarth if you feel strongly about the ip certificates, let me know and I can revise the proposal after I test them out myself.

Whatever is easiest for now. The guiding principle should be that, eventually, we want this to work for end users running Status Desktop or similar. Alternatively, a KISS setup for a VPS/spare server.

Does it have to use LetsEncrypt? It would be probably simpler if we just used CloudFlare certs.

D4nte commented

Does it have to use LetsEncrypt? It would be probably simpler if we just used CloudFlare certs.

Any valid SSL certificate should do. Please note that websockify takes the full chain in the --cert argument (--cert /etc/letsencrypt/live/<testmachine.statusim.net>/fullchain.pem), so you may have to concatenate the certificates. Let me know if you hit any issues.
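If a provider hands out the leaf certificate and the intermediate(s) as separate PEM files, the file passed to --cert can be assembled by concatenating them, leaf first and then each issuer in order. A minimal sketch under that assumption; the helper names and file names are hypothetical:

```python
from pathlib import Path

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"


def build_fullchain(leaf_pem: str, intermediate_pems: list[str]) -> str:
    """Concatenate PEM certificates: leaf first, then each issuing intermediate."""
    blocks = [leaf_pem, *intermediate_pems]
    for block in blocks:
        if PEM_BEGIN not in block:
            raise ValueError("input does not look like a PEM certificate")
    # Normalise whitespace so every certificate ends with exactly one newline.
    return "".join(block.strip() + "\n" for block in blocks)


def write_fullchain(leaf: Path, intermediates: list[Path], out: Path) -> None:
    """Read the given PEM files and write the combined chain to out."""
    chain = build_fullchain(leaf.read_text(), [p.read_text() for p in intermediates])
    out.write_text(chain)
```

Usage would be along the lines of write_fullchain(Path("cert.pem"), [Path("intermediate.pem")], Path("fullchain.pem")). The order matters: TLS servers are expected to send their own certificate first, followed by the certificates that issued it.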

Does it have to use LetsEncrypt? It would be probably simpler if we just used CloudFlare certs.

Keep in mind that we want this to be deployable by random users. CloudFlare might make sense for the specific cluster setup, but in general we want to make this simple to use from a random computer/VPS. AFAIK, letsencrypt is more scriptable and doesn't require one to set up an account, etc.

If either works, then of course go with whichever is more convenient.

This repo is for our internal use, and not for external use. I have limited time and CloudFlare setup requires less work which is why I will use it.

D4nte commented

Keep in mind that we want to have this be deployable by random users.

I am not sure this is the right place for this discussion. Do you want me to add some documentation on how to use websockify for nim-waku users?

Re the repo: right, agreed. This is specifically about the cluster setup, not generic setup scripts for nim-waku, so it's probably the wrong place to bring this up.

Do you want me to add some documentation on how to use websockify for nim-waku users?

Probably a good idea, since we should optimize for normal users first and cluster second. It can be a basic Markdown doc in the docs folder, I suppose?

I have deployed a setup where I pass the origin certs directly to Websockify via --cert and --key, but that fails with:

EOF occurred in violation of protocol (_ssl.c:1131)

And wsping returns:

Error: unable to verify the first certificate.

Related issues:

And nothing of any value in them. Maybe Nginx setup makes more sense.

Oh, I see what's happening. This would require a nested wildcard certificate covering multiple levels of subdomains, which CloudFlare doesn't seem to support, at least not in a format that would work for all hosts, like *.*.*.*.statusim.net.
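The underlying constraint comes from how TLS clients match wildcard names: a "*" may only be the whole left-most label and stands for exactly one label (RFC 6125, section 6.4.3, which browsers follow), so *.statusim.net does not cover node-01.do-ams3.wakuv2.test.statusim.net, and a pattern like *.*.*.*.statusim.net is not accepted at all. A simplified sketch of that rule; it ignores partial-label wildcards such as f*.example.com, and the function name is hypothetical:

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if certificate name pattern covers hostname.

    Simplified rule: "*" is only honoured as the entire left-most label,
    and it matches exactly one label of the hostname.
    """
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    if len(p_labels) != len(h_labels):
        return False  # a wildcard never spans several labels
    first, rest = p_labels[0], p_labels[1:]
    if "*" in rest or rest != h_labels[1:]:
        return False  # wildcards outside the left-most label are rejected
    return first == "*" or first == h_labels[0]
```

This is why each host depth (node-01.<dc>.wakuv2.test.statusim.net) would need its own wildcard certificate, while a per-host Let's Encrypt certificate sidesteps the problem.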

So I guess it is LetsEncrypt anyway. But this is useful to know for future setups involving full hostnames.

There's this role, which uses the standalone method: https://github.com/geerlingguy/ansible-role-certbot

As seen here:

certbot_create_command: >-
  {{ certbot_script }} certonly --standalone --noninteractive --agree-tos
  --email {{ cert_item.email | default(certbot_admin_email) }}
  -d {{ cert_item.domains | join(',') }}

https://github.com/geerlingguy/ansible-role-certbot/blob/fdba1c435251341af7fbdfc44b276daafdea632f/defaults/main.yml#L20-L23
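For reference, this is the command the template renders for a single cert_item, expressed as a Python argument list. It only reproduces the flags in the template above; the domain and email address are placeholders and the function name is hypothetical:

```python
import shlex


def certbot_create_argv(domains: list[str], email: str, certbot: str = "certbot") -> list[str]:
    """Mirror the role's certbot_create_command for one certificate item."""
    if not domains:
        raise ValueError("at least one domain is required")
    return [
        certbot, "certonly", "--standalone", "--noninteractive", "--agree-tos",
        "--email", email,
        "-d", ",".join(domains),  # Jinja: cert_item.domains | join(',')
    ]


if __name__ == "__main__":
    argv = certbot_create_argv(["node-01.do-ams3.wakuv2.test.statusim.net"], "admin@example.org")
    print(shlex.join(argv))
```

Because of --standalone, certbot itself has to bind the HTTP port for the challenge, which is why the role stops Nginx while it runs.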

It appears to work by temporarily stopping Nginx, and auto-renewal is done via cron:
https://github.com/geerlingguy/ansible-role-certbot/blob/master/tasks/renew-cron.yml

I'd prefer a systemd timer. I might do our own role.

The tricky thing is that if I run certbot on the host, the certificates are created with root as owner, and the private key cannot be accessed by the dockremap user within the container. But if I run certbot as a non-root user, it cannot bind to ports 80 and 443 in standalone mode.

An alternative to allowing certbot to take the 80 and 443 ports is using --manual and exposing the right URL yourself.

This is possible using the --manual-auth-hook and --manual-cleanup-hook flags:
https://certbot.eff.org/docs/using.html#pre-and-post-validation-hooks

But for that to work a non-root user would have to be able to enable and disable an Nginx site config.
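One way around toggling site configs is to have Nginx permanently serve a fixed directory under /.well-known/acme-challenge/, so the hooks only need write access to that directory. A minimal sketch of such a hook pair, assuming that setup; the directory /var/www/acme, the script name and its auth/cleanup arguments are hypothetical. certbot passes the HTTP-01 challenge to manual hooks through the CERTBOT_TOKEN and CERTBOT_VALIDATION environment variables:

```python
import os
import re
import sys
from pathlib import Path

# Hypothetical directory that an Nginx location serves as
# http://<domain>/.well-known/acme-challenge/
WEBROOT = Path("/var/www/acme")
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]+")  # ACME tokens are base64url strings


def challenge_path(webroot: Path, token: str) -> Path:
    """Path of the challenge file; rejects tokens that could escape the webroot."""
    if not TOKEN_RE.fullmatch(token):
        raise ValueError(f"unexpected token {token!r}")
    return webroot / ".well-known" / "acme-challenge" / token


def auth(webroot: Path, token: str, validation: str) -> Path:
    """Publish the key authorization so the ACME server can fetch it over HTTP."""
    path = challenge_path(webroot, token)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(validation)
    return path


def cleanup(webroot: Path, token: str) -> None:
    """Remove the challenge file once validation is over."""
    challenge_path(webroot, token).unlink(missing_ok=True)


if __name__ == "__main__" and len(sys.argv) == 2:
    token = os.environ["CERTBOT_TOKEN"]
    if sys.argv[1] == "auth":
        auth(WEBROOT, token, os.environ["CERTBOT_VALIDATION"])
    elif sys.argv[1] == "cleanup":
        cleanup(WEBROOT, token)
```

It would be wired up with something like certbot certonly --manual --preferred-challenges http --manual-auth-hook "python3 acme_hook.py auth" --manual-cleanup-hook "python3 acme_hook.py cleanup" -d <host>.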

Here's the Websockify config without SSL: f4a2674c

I forked geerlingguy/ansible-role-certbot and created https://github.com/status-im/infra-role-certbot, and made some changes:

And then configured the nim-waku role to run certbot via Docker, so the certificates are owned by dockremap:

I've created the certificates using certbot in a Docker container and yet I'm still getting this error:

 EOF occurred in violation of protocol (_ssl.c:1131)

It's always something...

I tried adding py3-openssl py3-asn1 py3-ndg_httpsclient packages to the Docker image as suggested but it didn't work.

Ah, there we have it, you have to pass fullchain.pem to the --cert flag, instead of the cert.pem from LetsEncrypt:

 > ~/temp/node_modules/.bin/wsping wss://node-01.do-ams3.wakuv2.test.statusim.net/
Successfully connected to wss://node-01.do-ams3.wakuv2.test.statusim.net/.
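With Let's Encrypt, cert.pem contains only the leaf certificate, while fullchain.pem also contains the intermediate. Clients that do not already have the intermediate cannot build a chain from the leaf alone, which matches wsping's "unable to verify the first certificate". A small pre-flight check could catch this before websockify starts; a sketch, with the function names hypothetical:

```python
from pathlib import Path

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"


def count_pem_certs(pem_text: str) -> int:
    """Number of PEM certificate blocks in the text."""
    return pem_text.count(PEM_BEGIN)


def check_cert_chain(path: Path) -> None:
    """Abort if the file meant for --cert holds only the leaf certificate."""
    n = count_pem_certs(path.read_text())
    if n < 2:
        raise SystemExit(f"{path}: {n} certificate(s); pass fullchain.pem, not cert.pem")
```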

Here are the changes that implement SSL using the new infra-role-certbot: 0de8519e

The Consul definition also includes the multiaddress:

admin@node-01.do-ams3.wakuv2.test:~ % jq '.services[] | select(.name == "nim-waku-v2-websocket").meta' /etc/consul/service_nim_waku_v2.json
{
  "node_enode": "/dns4/node-01.do-ams3.wakuv2.test.statusim.net/tcp/443/wss/p2p/16Uiu2HAmPLe7Mzm8TsYUubgCAW1aJoeFScxrLj8ppHFivPo97bUZ"
}
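The jq query above can be mirrored in Python when a script needs the advertised address, for example to check it against fleets.status.im. A sketch assuming the file layout implied by the jq path (a top-level services list whose entries carry name and meta); the function name is hypothetical:

```python
import json
from pathlib import Path


def websocket_enode(service_def: dict, name: str = "nim-waku-v2-websocket") -> str:
    """Equivalent of: jq '.services[] | select(.name == NAME).meta.node_enode'"""
    for service in service_def.get("services", []):
        if service.get("name") == name:
            return service["meta"]["node_enode"]
    raise KeyError(f"no service named {name!r}")


if __name__ == "__main__":
    path = Path("/etc/consul/service_nim_waku_v2.json")
    if path.exists():
        print(websocket_enode(json.loads(path.read_text())))
```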

And here's the update to the Python script that generates contents of https://fleets.status.im/: https://github.com/status-im/infra-eth-cluster/commit/9cf182c1

Looks good:

 > curl -s https://fleets.status.im/ | jq '.fleets["wakuv2.test"]'
{
  "waku": {
    "node-01.ac-cn-hongkong-c.wakuv2.test": "/ip4/47.242.210.73/tcp/30303/p2p/16Uiu2HAmSyrYVycqBCWcHyNVQS6zYQcdQbwyov1CDijboVRsQS37",
    "node-01.do-ams3.wakuv2.test": "/ip4/134.209.139.210/tcp/30303/p2p/16Uiu2HAmPLe7Mzm8TsYUubgCAW1aJoeFScxrLj8ppHFivPo97bUZ",
    "node-01.gc-us-central1-a.wakuv2.test": "/ip4/104.154.239.128/tcp/30303/p2p/16Uiu2HAmJb2e28qLXxT5kZxVUUoJt72EMzNGXB47Rxx5hw3q4YjS"
  },
  "waku-websocket": {
    "node-01.ac-cn-hongkong-c.wakuv2.test": "/dns4/node-01.ac-cn-hongkong-c.wakuv2.test.statusim.net/tcp/443/wss/p2p/16Uiu2HAmSyrYVycqBCWcHyNVQS6zYQcdQbwyov1CDijboVRsQS37",
    "node-01.do-ams3.wakuv2.test": "/dns4/node-01.do-ams3.wakuv2.test.statusim.net/tcp/443/wss/p2p/16Uiu2HAmPLe7Mzm8TsYUubgCAW1aJoeFScxrLj8ppHFivPo97bUZ",
    "node-01.gc-us-central1-a.wakuv2.test": "/dns4/node-01.gc-us-central1-a.wakuv2.test.statusim.net/tcp/443/wss/p2p/16Uiu2HAmJb2e28qLXxT5kZxVUUoJt72EMzNGXB47Rxx5hw3q4YjS"
  }
}
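A browser client such as js-waku would take its bootstrap peers from the waku-websocket section. The same selection in Python, assuming the document layout shown above; the helper names are hypothetical:

```python
import json
import urllib.request

FLEETS_URL = "https://fleets.status.im/"


def fetch_fleets(url: str = FLEETS_URL) -> dict:
    """Download and parse the fleets document."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def websocket_peers(fleets_doc: dict, fleet: str = "wakuv2.test") -> list[str]:
    """Return the wss multiaddrs of a fleet, ordered by node name."""
    nodes = fleets_doc["fleets"][fleet].get("waku-websocket", {})
    return [nodes[name] for name in sorted(nodes)]
```

Usage: websocket_peers(fetch_fleets()) returns the three /dns4/.../tcp/443/wss/p2p/... addresses listed above. A fleet without a waku-websocket section yields an empty list rather than an error.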

@D4nte please check it out.

D4nte commented

Yes, looks great, thanks!

Can you please point me to where I can open an issue to get a CORS header on fleets.status.im, so that any dapp can query fleets.status.im to retrieve the addresses?

Well, as you saw in my previous comment the change that added those services to fleets.status.im was https://github.com/status-im/infra-eth-cluster/commit/9cf182c1 done in the infra-eth-cluster repo, so that's where an issue would go.

But I already fixed that: https://github.com/status-im/infra-eth-cluster/commit/b3554e59

 > curl -sLi fleets.status.im | grep access-control-allow-origin 
access-control-allow-origin: *
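The same check can be scripted, for example in a monitoring job. A sketch using only the standard library; the function names are hypothetical, and the header lookup is case-insensitive because HTTP header names are:

```python
import urllib.request


def allows_any_origin(headers) -> bool:
    """True if the headers carry Access-Control-Allow-Origin: *."""
    for key, value in headers.items():
        if key.lower() == "access-control-allow-origin":
            return value.strip() == "*"
    return False


def check_cors(url: str = "https://fleets.status.im/") -> bool:
    # urlopen follows redirects by default, similar to curl -L above.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return allows_any_origin(resp.headers)
```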