ooni/sysadmin

WDC migration and re-numbering

Closed this issue · 29 comments

darkk commented

GH / WDC migrates to Miami.
Hardware will be physically moved, IP addresses will be re-numbered.
So everything that is there should be killed, frozen, or moved.
This migration will happen during the week of 26 Nov ... 02 Dec.

  • munin.ooni.io
  • shwdc.ooni.io
  • shark.ooni.nu
  • a.echo.th.ooni.io
  • c.web-connectivity.th.ooni.io
  • b.collector.ooni.io
  • analytics.ooni.io
darkk commented
  • munin — gone, disk image will be removed in a while, Tor TH will be re-deployed preserving keys
  • shwdc — gone, replaced with another bastion host, disk image is deleted as it had no files in home directories
darkk commented
  • shark.ooni.nu — VM deleted, but the disk image is kept for a while. I believe this VM was used for one-off testing, and it has not seen any shell logins since Oct 16, so I'm unsure whether the allocated resources are still needed. @hellais, please tell me if you still need the data stored there.
darkk commented

a.echo.th.ooni.io

It was unclear to me whether it gets any load (as it's not mentioned in the bouncer config).

It does. E.g. the 2018-09-01 bucket mentions it once and the 2018-11-20 one also once; both cases mention it by FQDN, not by IP.

The tcp-echo test helper 37.218.247.110 from the bouncer config gets a bit more hits -- 360 in the 2018-11-20 bucket, 800 in 2018-09-01.

Given an uptime of 300 days and the following stats from the DOCKER iptables chain, I believe that dport:80 is the only service that actually gets any load for that FQDN. Also, the IP address had no DNS names other than a.echo.th.ooni.io pointing to it in the ooni zones.

Chain DOCKER (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    7   339 ACCEPT     tcp  --  !docker0 docker0  0.0.0.0/0            172.17.0.2           tcp dpt:57005
    0     0 ACCEPT     udp  --  !docker0 docker0  0.0.0.0/0            172.17.0.2           udp dpt:57004
    6   299 ACCEPT     tcp  --  !docker0 docker0  0.0.0.0/0            172.17.0.2           tcp dpt:57003
   45  2535 ACCEPT     tcp  --  !docker0 docker0  0.0.0.0/0            172.17.0.2           tcp dpt:57001
1156K   63M ACCEPT     tcp  --  !docker0 docker0  0.0.0.0/0            172.17.0.2           tcp dpt:80
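
A quick way to confirm that reading is to rank the chain by packet count; a minimal sketch replaying the sample counters above (on a live host you would pipe in `iptables -L DOCKER -n -v -x` as root; `-x` prints exact counters instead of the K/M suffixes):

```shell
# Rank DOCKER-chain rules by packet count; the input lines replay the
# counters from the table above (packets, bytes, target, proto, dport).
printf '%s\n' \
  '7 339 ACCEPT tcp dpt:57005' \
  '0 0 ACCEPT udp dpt:57004' \
  '6 299 ACCEPT tcp dpt:57003' \
  '45 2535 ACCEPT tcp dpt:57001' \
  '1156000 63000000 ACCEPT tcp dpt:80' |
sort -k1,1 -rn | head -n 1
# -> 1156000 63000000 ACCEPT tcp dpt:80   (dpt:80 carries essentially all load)
```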

I believe the way to handle that host is to switch the a.echo.th.ooni.io A record to another test helper doing echo at tcp/80 and just kill the VM now (keeping the disk for some time).
Also, the VM has 3 GiB of RAM but uses only ~300 MiB.

So I'm pointing a.echo.th.ooni.io to 37.218.247.110 (aka c.echo.th.ooni.io), the address from the bouncer config.

darkk commented

analytics.ooni.io

I believe it can go down for a week and come back online after the migration. Moreover, it uses only 1.2 GB of RAM (470 MB without disk cache) out of the 6 GB allocated. The IP will change from 199.119.112.42 to 37.218.241.42.

darkk commented

c.web-connectivity.th.ooni.io & b.collector.ooni.io

I decided to avoid migrating these services to another DC as I'm not 100% sure what could go wrong. So I just disabled the corresponding collector and test helper at the bouncer and will re-enable them when the VMs are renumbered.

darkk commented

b.collector.ooni.io is hardcoded, so the bouncer reply is probably not respected (in any case, traffic did not go away after the bouncer configuration change) -- https://github.com/measurement-kit/measurement-kit/blob/v0.9.0-beta/src/libmeasurement_kit/ooni/collector_client.hpp#L18-L19
It seems the endpoint should be migrated somewhere.

Somewhere is 103.104.244.110 (aka a.collector.ooni.io); the nginx frontend is a convenience.
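
A hypothetical sketch of that nginx frontend (server name and upstream are from the text above; the listen/TLS details and paths are assumptions):

```nginx
# Answer for the b.collector.ooni.io name and hand requests to
# a.collector.ooni.io (103.104.244.110); certificate paths omitted.
server {
    listen 443 ssl;
    server_name b.collector.ooni.io;
    location / {
        proxy_pass https://103.104.244.110;
        proxy_set_header Host $host;
    }
}
```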

darkk commented

UPD from @bassosimone: the collector is indeed hardcoded, but the hardcoded value is used as a fallback when the bouncer does not reply. That explains both the existence of traffic and the very low number of measurements hitting the b.collector.ooni.io endpoint.

darkk commented

Unfortunately, c.web-connectivity.th.ooni.io still gets some traffic to y3zq5fwelrzkkv3s.onion. The source is unclear.

shark.ooni.nu — VM deleted, but the disk image is kept for a while. I believe this VM was used for one-off testing, and it has not seen any shell logins since Oct 16, so I'm unsure whether the allocated resources are still needed. @hellais, please tell me if you still need the data stored there

This is the machine I am using for building probe-cli on linux. The disk image has several build artifacts that would be useful to preserve (so I don't have to rebuild them), and it would be good to re-instate it at the new location.

darkk commented

it would be good to re-instate it at the new location

Do you need computing power or just the build artifacts?

Do you need computing power or just the build artifacts?

Both. I am OK with keeping it stopped as I don't use it on a daily basis, but I do need access to a Linux-based host to build probe-cli.

darkk commented

The VMs went down half an hour ago (at ~14:00 UTC).
I've moved the y3zq5fwelrzkkv3s.onion frontend to b.web-connectivity.th.ooni.io; it seems to be working.
I hope that's enough for the clients that had that onion address hardcoded to survive.
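
For the record, moving an onion frontend like this amounts to copying the hidden-service key material to the new host and pointing tor at it; a torrc sketch (the directory path and local port are assumptions):

```
# torrc fragment, assuming the keys for y3zq5fwelrzkkv3s.onion were copied
# into this directory on b.web-connectivity.th.ooni.io:
HiddenServiceDir /var/lib/tor/web_connectivity_th
HiddenServicePort 80 127.0.0.1:80
```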

darkk commented

FTR, the number of test-helper entries in the 2018-11-20 bucket:

$ ( for f in *.tar.lz4; do tar -I lz4 -xf - --to-stdout < $f; done; for f in *.json.lz4; do lz4cat <$f; done ) | jq -c .test_helpers | sort | uniq -c
  55748 {}
    360 {"backend":"37.218.247.110"}
    387 {"backend":"http://37.218.247.95:80"}
      1 {"backend":"http://a.echo.th.ooni.io"}
     20 {"backend":{"type":"https","address":"https://b.web-connectivity.th.ooni.io"}}
   9268 {"backend":{"type":"https","address":"https://b.web-connectivity.th.ooni.io:443"}}
   9767 {"backend":{"type":"https","address":"https://c.web-connectivity.th.ooni.io:443"}}
 235919 {"backend":{"type":"onion","address":"httpo://y3zq5fwelrzkkv3s.onion"}}
      1 null
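
The tally pattern itself (extract the field, sort, count) can be illustrated without the bucket files; a minimal sketch on inline sample lines, using grep -o instead of jq so it needs only coreutils:

```shell
# Count distinct backend values across measurement lines, as the real
# jq -c .test_helpers | sort | uniq -c pipeline does for the full bucket.
printf '%s\n' \
  '{"test_helpers":{"backend":"37.218.247.110"}}' \
  '{"test_helpers":{}}' \
  '{"test_helpers":{"backend":"37.218.247.110"}}' |
grep -o '"backend":"[^"]*"' | sort | uniq -c
# -> 2 "backend":"37.218.247.110"
```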
darkk commented
  • a.collector.ooni.io handles traffic "correctly", but the report files are for some reason not moved from the archive to the renamed directory.
darkk commented

Also, I'm pointing analytics.ooni.io to the IP address of amsmetadb.ooni.nu to get an RST fast and prevent the "loading...." stall in the OONI Web & OONI Explorer UI.
It seems the JS is loaded in blocking mode, and I hope the frontend will survive the inability to load that specific JS asset (which is unreachable anyway).

a.collector.ooni.io handles traffic "correctly", but the report files are for some reason not moved from the archive to the renamed directory

It seems not to be working as expected, because the daily task is not running properly due to:

Traceback (most recent call last):
  File "/data/collector/bin/archive-to-renamed.py", line 5, in <module>
    import yaml
ImportError: No module named yaml
darkk commented

Does it mean that it may actually drop some ancient traffic in yaml format as well?...

Does it mean that it may actually drop some ancient traffic in yaml format as well?...

That should not be a problem, because the actual collector code runs inside a docker container. The daily cronjob is scheduled to run outside of docker.

I am re-running the playbook with the addition of an apt-get install of python-yaml.

So it actually seems that something changed in the base image of this host, because when I use the apt module I get the following error:

TASK [ooni-collector : Install pyyaml for the daily-task] ***********************************************
failed: [hkgcollectora.ooni.nu] (item=[u'python-yaml']) => {"changed": false, "item": ["python-yaml"], "msg": "Could not import python modules: apt, apt_pkg. Please install python-apt package."}
	to retry, use: --limit @/Users/x/code/ooni/sysadmin/ansible/deploy-collector.retry

For the time being, to unbrick a.collector, I ran the apt-get install command manually.

In relation to b.collector.ooni.io, I suggest that if we don't want to migrate that one, we at least update the A record to point to the IP of a.collector.ooni.io so we don't risk losing measurements.

darkk commented

The daily cronjob is scheduled to run outside of docker

Sidenote: a slightly more robust way to handle that may be to schedule docker exec or docker run as the cronjob (or as a systemd timer).
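
Sketched as a cron entry in /etc/cron.d style (container name, in-container script path, and schedule are assumptions); the job would then run with the container's Python environment, avoiding host-side import surprises:

```
# m h dom mon dow user command
0 4 * * * root docker exec collector /data/collector/bin/archive-to-renamed.py
```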

Here is the checklist of the second leg of the migration:

  • keep a.echo.th.ooni.io pointed to c.echo.th.ooni.io
  • check b.collector.ooni.io domain name, it should point to miacollector.ooni.nu
  • clean-up 103.104.244.110 (aka a.collector.ooni.io) from b.collector.ooni.io domain name and such
  • ensure that migrated c.web-connectivity.th.ooni.io gets traffic for y3zq5fwelrzkkv3s.onion
  • clean y3zq5fwelrzkkv3s.onion from b.web-connectivity.th.ooni.io
  • check if analytics.ooni.io points to correct VM
  • downsize analytics.ooni.io VM to reasonable shape: 1 or 2 GiB instead of 6
  • update airflow WRT have_collector
  • restore shark disk image somewhere
  • restore onion keys from a.echo.th.ooni.io to check if it gets traffic
  • create MIA Tor TH (as munin.ooni.io taking that role is gone) with 14D2FDC6ABDCD1A27CA32D13AA2C68566D1E8223 fingerprint
  • update bouncer configuration
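
For the Tor TH item, identity preservation can be sanity-checked against the expected fingerprint; a self-contained sketch (it simulates the tor data directory's `fingerprint` file, which holds "nickname FINGERPRINT" and on a real host lives at e.g. /var/lib/tor/fingerprint):

```shell
expected="14D2FDC6ABDCD1A27CA32D13AA2C68566D1E8223"
# Simulate the tor "fingerprint" file; on the real TH, read the actual file.
printf 'miaTorTH %s\n' "$expected" > /tmp/fingerprint.sample
actual=$(awk '{print $2}' /tmp/fingerprint.sample)
[ "$actual" = "$expected" ] && echo "fingerprint preserved"
# -> fingerprint preserved
```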
darkk commented

keep a.echo.th.ooni.io pointed to c.echo.th.ooni.io

It is. The second echo test helper is b.echo.th.ooni.io, hosted on the VM in the bigv location (donated by bytemark?).

check b.collector.ooni.io domain name, it should point to miacollector.ooni.nu

Good. I'm also renaming the VM in the GH control panel to miacollector.ooni.nu.

clean-up 103.104.244.110 (aka a.collector.ooni.io) from b.collector.ooni.io domain name and such

Yes, hkgcollectora.ooni.nu is cleaned. nginx logs show that the last report with the Host: b.collector.ooni.io domain was uploaded to that collector VM at ~06/Dec/2018 03:00 UTC.

check if analytics.ooni.io points to correct VM, downsize analytics.ooni.io VM to reasonable shape: 1 or 2 GiB instead of 6

OK, 1 GiB it is. Memory usage was 730 MiB with disk caches (302 MiB without them) after some warm-up.
It also has too beefy a disk allocation: 200 GiB with 1.5 GiB used. But that's a bit harder to fix.

clean y3zq5fwelrzkkv3s.onion from b.web-connectivity.th.ooni.io

OK.

ensure that migrated c.web-connectivity.th.ooni.io gets traffic for y3zq5fwelrzkkv3s.onion

It does (after a container restart to re-publish the descriptors -- thanks to Facebook for sharing some bits of knowledge on onion service operations).

update airflow WRT have_collector

Done.

Also.

  • Update ntp rules and rename wdc group to mia.
darkk commented

update bouncer configuration

Done.

Other pending sub-tasks in this ticket are waiting for the mail thread with GH to converge (as we're trying to be good netizens).

It seems that this one was done. Can we close this issue?

@hellais Do we need more documentation on this one?

restore shark disk image somewhere

I am OK with the shark disk image not being restored at this point as I now have another build system for linux.

darkk commented

It seems that this one was done.

It was not. There are still two items left. They were waiting for resources to be allocated, and the resources were allocated on 2019-01-17. So it's time to strike those items :)

  • restore onion keys from a.echo.th.ooni.io to check if it gets traffic
  • create MIA Tor TH (as munin.ooni.io taking that role is gone) with 14D2FDC6ABDCD1A27CA32D13AA2C68566D1E8223 fingerprint

Can we close it now? Do we need @hellais to review this?

@darkk There are a few items in your list in this issue that were not ticked off. Were those done?

darkk commented

Yep.
No traffic was noticed for the onion counterpart of a.echo.th.ooni.io, and it's not listed in the bouncer config, so it's gone.
Tor TH in MIA was rolled out as part of 69892e3

This looks good to me. Moving it to the done column.