Failure "chunk 0 failed" must be more detailed when environment is full

Question


Closed this issue 7 years ago · 1 comments

Alba version: 1.3.6

Situation: the environment filled up and the policy (12/4/14/2, 10 nodes) could no longer be satisfied (disks at 99% on 5 nodes). I checked the log files and found the error below in the alba-proxy.

(Failure "chunk 0 failed")
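For context, a rough sketch of why the policy cannot be met in this situation. It assumes the four policy numbers are read as (k, m, c, x): k data fragments, m parity fragments, at least c fragments written per chunk, at most x fragments per node. It also assumes a node whose disks are at 99% accepts no new fragments. The function name is hypothetical and not part of Alba.

```python
def policy_satisfiable(k, m, c, x, usable_nodes):
    """Return True if at least c fragments of a chunk can still be placed.

    Assumed reading of the policy: k data + m parity fragments exist per
    chunk, at least c of them must be stored, at most x on any single node.
    """
    placeable = min(k + m, usable_nodes * x)
    return placeable >= c


total_nodes = 10
full_nodes = 5  # nodes with disks at 99%, assumed unable to take fragments

# 5 usable nodes * 2 fragments per node = 10 placeable, fewer than c = 14
print(policy_satisfiable(12, 4, 14, 2, total_nodes - full_nodes))

# With all 10 nodes usable: min(16, 20) = 16 placeable, which covers c = 14
print(policy_satisfiable(12, 4, 14, 2, total_nodes))
```

Under these assumptions at least 7 nodes need free space for the upload to succeed, so 5 full nodes out of 10 is enough to make every chunk fail.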

Detailed logview:

2017-02-20 10:32:43 981517 +0100 - stor-01.be-g8-3 - 5973/0 - alba/proxy - 1654778 - error - Unexpected exception in proxy while handling request: (Failure "chunk 0 failed"); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
2017-02-20 10:32:43 981637 +0100 - stor-01.be-g8-3 - 5973/0 - alba/proxy - 1654779 - error - Request ApplySequence ("75450521-1cde-4f82-88bf-2332ffff016d",false,[(Nsm_model.Assert.ObjectHasChecksum ("owner_tag",;     Sha1 c66c65175fecc3103b3b587be9b5b230889c8628));   ],[(Proxy_protocol.Protocol.Update.UploadObjectFromFile;     ("00_00000049_00",;      "/mnt/ssd4/vmstor_write_sco_1/75450521-1cde-4f82-88bf-2332ffff016d/00_00000049_00",;      (Some Crc32c 0xdb988818)));   ]) errored and took 0.189443

The first "chunk 0 failed" error appeared right after a lot of the following messages:

2017-02-19 20:04:18 686779 +0100 - stor-01.be-g8-3 - 5973/0 - alba/proxy - 1408944 - warning - fragment upload failed:Asd_protocol.Protocol.Error.Exn(1)

Log of the ASD on one of the nodes that filled up to 99%:

2017-02-19 20:04:22 994301 +0100 - cpu-01.be-g8-3 - 11150/0 - alba/asd - 35688 - info - returning error Asd_protocol.Protocol.Error.Full
2017-02-19 20:04:23 302203 +0100 - cpu-01.be-g8-3 - 11150/0 - alba/asd - 35690 - info - returning error Asd_protocol.Protocol.Error.Full
2017-02-19 20:04:23 848663 +0100 - cpu-01.be-g8-3 - 11150/0 - alba/asd - 35691 - info - returning error Asd_protocol.Protocol.Error.Full
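To link the proxy failure back to full ASDs, the log lines above can be counted per host. This is a minimal sketch that assumes every log line has the seven " - "-separated fields seen in the excerpts (timestamp, host, pid/thread, component, sequence number, level, message). The helper names are hypothetical.

```python
from collections import Counter

FULL_MARKER = "Asd_protocol.Protocol.Error.Full"


def parse_line(line):
    """Split one log line into its fields, or return None if it does not fit."""
    parts = line.strip().split(" - ", 6)
    if len(parts) != 7:
        return None
    timestamp, host, pid, component, seq, level, message = parts
    return {
        "timestamp": timestamp,
        "host": host,
        "pid": pid,
        "component": component,
        "seq": seq,
        "level": level,
        "message": message,
    }


def count_full_errors(lines):
    """Count 'Full' errors returned by ASDs, grouped by host name."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if record["component"] == "alba/asd" and FULL_MARKER in record["message"]:
            counts[record["host"]] += 1
    return counts


sample = [
    "2017-02-19 20:04:22 994301 +0100 - cpu-01.be-g8-3 - 11150/0 - alba/asd - 35688 - info - returning error Asd_protocol.Protocol.Error.Full",
    "2017-02-19 20:04:23 302203 +0100 - cpu-01.be-g8-3 - 11150/0 - alba/asd - 35690 - info - returning error Asd_protocol.Protocol.Error.Full",
    "2017-02-19 20:04:23 848663 +0100 - cpu-01.be-g8-3 - 11150/0 - alba/asd - 35691 - info - returning error Asd_protocol.Protocol.Error.Full",
]
print(count_full_errors(sample))
```

Note that the ASD logs these events at info level, so a filter on warning or error alone would miss them.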

Answer 1 · 2017-05-30T12:14:35.000Z

The proxy can't write the chunk, so the error is correct. CheckMK (based on e.g. the healthcheck) needs to report/alert when the environment is almost completely full. OPS team: if needed, create the necessary tickets on the respective repos (if that has not been done yet).
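As an illustration of the kind of alert meant here, a minimal sketch of a fill-level classification. The 80% and 90% thresholds and the function name are assumptions for the example, not values taken from CheckMK or the healthcheck.

```python
def classify_fill(used_bytes, total_bytes, warn=0.80, crit=0.90):
    """Map disk usage to a monitoring state; the thresholds are hypothetical."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= crit:
        return "CRIT"
    if ratio >= warn:
        return "WARN"
    return "OK"


# A disk at 99%, as on the 5 full nodes in this issue
print(classify_fill(99, 100))
```

The point of warning well before 99% is that operators get time to add capacity before the ASDs start returning Full.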