Incident: very slow rsync between ams-ps1 and datacollector
FedericoCeratto opened this issue · 4 comments
FedericoCeratto commented
The pipeline was rsyncing files very slowly (< 900 Kbps). The issue could be reproduced manually only with a low success rate (roughly less than 10% of attempts). The pipeline was then restarted.
FedericoCeratto commented
Socket status according to ss -nta -p -e -m -i:
skmem:(r0,rb359040,t0,tb963072,f32660,w712812,o0,bl0,d1) ts sack reno wscale:7,7 rto:380 rtt:179.943/10.855 ato:40 mss:1448 pmtu:1500 rcvmss:1408 advmss:1448 cwnd:11 ssthresh:5 bytes_acked:1925246329 bytes_received:772097 segs_out:1363340 segs_in:402957 data_segs_out:1349397 data_segs_in:16430 send 708.1Kbps lastsnd:140 lastrcv:490 lastack:140 pacing_rate 849.8Kbps delivery_rate 664.3Kbps busy:24227780ms rwnd_limited:170ms(0.0%) unacked:11 retrans:0/19671 rcv_rtt:175.875 rcv_space:28960 rcv_ssthresh:178060 notsent:675124 minrtt:173
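As a rough cross-check of the figures above, the "send" value reported by ss can be reproduced from the congestion window, the MSS and the smoothed RTT. The sketch below is only illustrative: the function name tcp_send_rate_kbps is hypothetical, and the formula (cwnd × mss × 8 / rtt) is the usual window-per-round-trip estimate, which is an assumption about how the number was derived.

```shell
# Hypothetical helper, not part of the pipeline: estimate the TCP send rate
# from the congestion window (segments), the MSS (bytes) and the RTT (ms).
tcp_send_rate_kbps() {
    # bits sent per round trip, divided by the RTT in milliseconds,
    # gives kilobits per second
    awk -v cwnd="$1" -v mss="$2" -v rtt="$3" \
        'BEGIN { printf "%.1f\n", cwnd * mss * 8 / rtt }'
}

# Values copied from the ss output: cwnd:11 mss:1448 rtt:179.943
tcp_send_rate_kbps 11 1448 179.943   # prints 708.1
```

The result matches the reported "send 708.1Kbps". Together with rwnd_limited at 0.0%, this suggests (but does not prove) that the transfer was bounded by a small congestion window over a path with a round trip of about 180 ms rather than by the receiver.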
hellais commented
To resolve the incident we did the following:
- Ran pkill rsync from the datacollector host and waited for Airflow to mark the fetcher DAG task as failed
- Cleared the state of the fetcher DAG root task as well as the state of the hist_canning DAG root task
rsync had left behind some temporary files, which had ended up inside the 2019-11-27 bucket, and the hist_canning DAG was complaining with:
[2019-11-28 18:15:16,623] {base_task_runner.py:95} INFO - Subtask: [2019-11-28 18:15:16,622] {bash_operator.py:94} INFO - __main__.CannerError: ('Unable to parse report filename', '.20191127T060731Z-RU-AS48642-web_connectivity-20191127T060731Z_AS48642_e5Gm6bzrFVhDIlwXGYVmPNdCGpgZdNpnfMLcymVBvacqZMXUXW-0.2.0-probe.json.otkpY7')
Initially we tried to just delete the file in question from /data/ooni/private/reports-raw/2019-11-27, but then the hist_canning DAG complained that a shasum did not match.
This was triggered by the shasum stored in /data/ooni/private/reports-raw-shals/2019-11-27.
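The file in the error message has the shape of an rsync temporary file: a leading dot, the destination file name, then a dot and six random characters. A minimal sketch for listing such leftovers in a bucket follows. The function name list_rsync_tempfiles is hypothetical, and the glob can also match unrelated dotfiles with a six-character extension, so the output should be reviewed before anything is deleted.

```shell
# Hypothetical helper: list files that look like rsync temporary files,
# i.e. ".<name>.<six characters>". It only lists, it never deletes.
list_rsync_tempfiles() {
    find "$1" -type f -name '.*.??????'
}

# Example with the bucket path from this incident:
# list_rsync_tempfiles /data/ooni/private/reports-raw/2019-11-27
```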
- We ran from datacollector the following command to rewrite the shals file:
find /data/ooni/private/reports-raw/2019-11-27 -type f -printf '%f %s\n' | LC_ALL=C sort --buffer-size=96M | sha256sum > 2019-11-27
- Cleared the state of hist_canning and the pipeline restarted normally
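Before clearing the DAG state, the rewritten shals file could be checked against the bucket with something like the sketch below. It reuses the find | sort | sha256sum pipeline from the command above. The function names are hypothetical, and it assumes the shals file contains exactly the output of that pipeline and nothing else.

```shell
# Hypothetical helper: digest of a bucket's file listing (base name and size
# of every regular file), using the same pipeline as the command above.
bucket_listing_digest() {
    find "$1" -type f -printf '%f %s\n' | LC_ALL=C sort --buffer-size=96M | sha256sum
}

# Hypothetical check: succeeds only if the stored shals file ($2) matches
# the digest recomputed from the bucket directory ($1).
shals_matches() {
    [ "$(bucket_listing_digest "$1")" = "$(cat "$2")" ]
}

# Example with the paths from this incident:
# shals_matches /data/ooni/private/reports-raw/2019-11-27 \
#               /data/ooni/private/reports-raw-shals/2019-11-27 && echo "shals OK"
```

Because the digest covers only file names and sizes, a stray temporary file changes it even when every real report is intact, which is consistent with the mismatch seen after the leftover file was deleted.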
hellais commented
We can close this as it has been resolved.
hellais commented