ooni/sysadmin

Incident: very slow rsync between ams-ps1 and datacollector

FedericoCeratto opened this issue · 4 comments

The pipeline was rsyncing files very slowly (< 900 Kbps). The slowdown could be reproduced manually only occasionally (in less than about 10% of attempts). The pipeline was then restarted.

See: https://mon.ooni.nu/grafana/d/AE8tFfxWk/collectors-disk-activity?orgId=1&from=1574881731778&to=1574968131778

Socket status according to ss -nta -p -e -m -i:

	 skmem:(r0,rb359040,t0,tb963072,f32660,w712812,o0,bl0,d1) ts sack reno wscale:7,7 rto:380 rtt:179.943/10.855 ato:40 mss:1448 pmtu:1500 rcvmss:1408 advmss:1448 cwnd:11 ssthresh:5 bytes_acked:1925246329 bytes_received:772097 segs_out:1363340 segs_in:402957 data_segs_out:1349397 data_segs_in:16430 send 708.1Kbps lastsnd:140 lastrcv:490 lastack:140 pacing_rate 849.8Kbps delivery_rate 664.3Kbps busy:24227780ms rwnd_limited:170ms(0.0%) unacked:11 retrans:0/19671 rcv_rtt:175.875 rcv_space:28960 rcv_ssthresh:178060 notsent:675124 minrtt:173
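One way to read the output above, offered as a rough sketch rather than a diagnosis: the reported send rate is what you get from cwnd * mss * 8 / rtt, so with a congestion window of only 11 segments and an RTT of about 180 ms the connection appears to be window-limited at roughly 700 Kbps. The helper names below (cwnd_rate_kbps, retrans_pct) are hypothetical and only redo the arithmetic on the numbers shown by ss.

```shell
# Upper bound on the send rate imposed by the congestion window:
#   rate = cwnd * mss * 8 / rtt
# Arguments: cwnd (segments), mss (bytes), rtt (milliseconds). Prints Kbps.
cwnd_rate_kbps() {
    awk -v cwnd="$1" -v mss="$2" -v rtt_ms="$3" \
        'BEGIN { printf "%.1f\n", cwnd * mss * 8 / (rtt_ms / 1000) / 1000 }'
}

# Share of data segments that were retransmitted, as a percentage.
# Arguments: total retransmits, data_segs_out.
retrans_pct() {
    awk -v retrans="$1" -v segs="$2" \
        'BEGIN { printf "%.2f\n", 100 * retrans / segs }'
}

cwnd_rate_kbps 11 1448 179.943   # cwnd, mss, rtt from the ss output; prints 708.1
retrans_pct 19671 1349397        # retrans total, data_segs_out; prints 1.46
```

The first figure matches the "send 708.1Kbps" field, and the second suggests that about 1.5% of data segments were retransmitted over the life of the connection, which would be consistent with reno keeping cwnd and ssthresh low on a long path.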

To resolve the incident we did the following:

  1. Ran pkill rsync on the datacollector host and waited for Airflow to mark the fetcher DAG task as failed.
  2. Cleared the state of the fetcher DAG root task as well as the state of the hist_canning DAG root task.
  3. rsync had left behind some temporary files, which ended up inside the 2019-11-27 bucket, and the hist_canning DAG was complaining with:
[2019-11-28 18:15:16,623] {base_task_runner.py:95} INFO - Subtask: [2019-11-28 18:15:16,622] {bash_operator.py:94} INFO - __main__.CannerError: ('Unable to parse report filename', '.20191127T060731Z-RU-AS48642-web_connectivity-20191127T060731Z_AS48642_e5Gm6bzrFVhDIlwXGYVmPNdCGpgZdNpnfMLcymVBvacqZMXUXW-0.2.0-probe.json.otkpY7')

Initially we tried to just delete the file in question from /data/ooni/private/reports-raw/2019-11-27, but hist_canning then complained that a shasum did not match.

The mismatch came from the shasum stored in /data/ooni/private/reports-raw-shals/2019-11-27.
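The leftover file in the error above follows rsync's default temporary-file naming: a leading dot, the destination file name, and a six-character random suffix. A minimal sketch for listing such files in a bucket directory before deciding what to delete; the function name list_rsync_temp_files is hypothetical, and the pattern assumes that real report files never start with a dot.

```shell
# List files that look like rsync temporaries: a leading dot plus a
# six-character random suffix, e.g. ".<report name>.json.otkpY7".
# Argument: bucket directory to scan.
list_rsync_temp_files() {
    find "$1" -type f -name '.*.??????'
}

# Example (path taken from the incident; review the list before deleting):
#   list_rsync_temp_files /data/ooni/private/reports-raw/2019-11-27
```

Listing first and deleting by hand keeps the operation reviewable, which matters here because any change to the bucket contents also invalidates the stored shasum.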

  4. From datacollector, we ran the following command to rewrite the shals file:
find /data/ooni/private/reports-raw/2019-11-27 -type f -printf '%f %s\n' | LC_ALL=C sort --buffer-size=96M | sha256sum > 2019-11-27
  5. We cleared the state of hist_canning and the pipeline restarted normally.
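Before clearing the DAG state again, it may help to check that the stored shals file agrees with the bucket contents. The sketch below reuses the same find | sort | sha256sum pipeline as the command above; the function names bucket_digest and shals_matches are hypothetical, and it assumes the shals file holds exactly the single line written by sha256sum, which is what that command produces. How hist_canning itself performs the comparison is not shown in this issue.

```shell
# Compute the bucket digest the same way as the command above:
# one "<name> <size>" line per file, sorted bytewise, hashed with sha256sum.
# Argument: bucket directory.
bucket_digest() {
    find "$1" -type f -printf '%f %s\n' | LC_ALL=C sort --buffer-size=96M | sha256sum
}

# Succeed if the stored shals file matches a freshly computed digest.
# Arguments: bucket directory, shals file.
shals_matches() {
    [ "$(bucket_digest "$1")" = "$(cat "$2")" ]
}

# Example (paths taken from the incident):
#   shals_matches /data/ooni/private/reports-raw/2019-11-27 \
#                 /data/ooni/private/reports-raw-shals/2019-11-27 \
#       && echo "shals OK" || echo "shals mismatch"
```

Because the digest covers only file names and sizes, adding or removing any file in the bucket, including a stray rsync temporary, changes the result.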

We can close this as it has been resolved.

Screenshots showing how to resolve this incident: four taken on 2020-01-14 (17:04:45 to 17:05:01) and one taken on 2020-02-20 (14:33:39).