Test HAProxy Healthcheck
User: curl/browser
Site: local docker
Platform: NA
Test Steps:
1. Get the two docker compose files:

   ```sh
   wget https://staging.dev.medicmobile.org/_couch/builds/medic:medic:master/docker-compose/cht-core.yml
   wget https://staging.dev.medicmobile.org/_couch/builds/medic:medic:master/docker-compose/cht-couchdb-clustered.yml
   ```
2. Start a cluster:

   ```sh
   COUCHDB_SERVERS="couchdb.1,couchdb.2,couchdb.3" CHT_COMPOSE_PROJECT_NAME=test5 COUCHDB_SECRET=test5 DOCKER_CONFIG_PATH=./ COUCHDB_DATA=./couchdb CHT_COMPOSE_PATH=./ COUCHDB_USER=medic COUCHDB_PASSWORD=password docker-compose -f cht-core.yml -f cht-couchdb-clustered.yml up
   ```
3. Ensure CHT is up in the browser and with curl:
   - Browse to https://localhost/
   - Find the IP of haproxy (see the convenience snippet after this list):

     ```sh
     docker inspect -f '{{.Name}} - {{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)
     ```

   - Run the curl command and see JSON for a healthy medic DB:

     ```sh
     curl -s http://medic:password@172.27.0.3:5984/medic | jq
     ```
4. Check the container names, then stop a couch node:

   ```sh
   docker ps --format '{{.Names}}' | grep couchdb
   docker stop deleteme_couchdb.2_1
   ```
5. Ensure CHT is DOWN in the browser and with curl:
   - Browse to https://localhost/
   - Find the IP of haproxy (172.27.0.3 in this case) and run the curl command to see JSON for a SAD medic DB:

     ```sh
     curl -s http://medic:password@172.27.0.3:5984/medic | jq
     ```
6. Start the couch node again:

   ```sh
   docker start deleteme_couchdb.2_1
   ```
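If you'd rather not scan every container for the haproxy IP, here is a small sketch that grabs just the haproxy container's address (the `name=haproxy` filter is an assumption about your container naming; check `docker ps` first):

```sh
# Resolve the haproxy container's IP directly, then hit CouchDB through it.
# Assumes exactly one running container matches the name filter.
HAPROXY_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \
  "$(docker ps -q --filter name=haproxy)")
curl -s "http://medic:password@${HAPROXY_IP}:5984/medic" | jq .
```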
Expected Result:
CHT should resume correctly. Check in the browser and with curl:
- Browse to https://localhost/
- Find the IP of haproxy (172.27.0.3 in this case) and run the curl command to see JSON for a healthy medic DB:

  ```sh
  curl -s http://medic:password@172.27.0.3:5984/medic | jq
  ```
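To script the recovery check, a small polling sketch (this assumes the root page answers 200 once healthy; adjust the expected code if your instance redirects to a login page):

```sh
# Poll until CHT answers 200 over HTTPS again (-k skips verification of the
# self-signed local cert)
until [ "$(curl -sk -o /dev/null -w '%{http_code}' https://localhost/)" = "200" ]; do
  echo "still down..."
  sleep 5
done
echo "CHT is back up"
```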
AT Status: Failing
Details
After step 5, both browser and curl requests return HTML and a `502` or `503`. I would expect HAProxy's port `5984` to return JSON, even if it's serving a `5xx` error:
```
➜ deleteme git:(7852-cht-4.x-docker-helper) ✗ curl -s http://medic:password@172.27.0.3:5984/medic | jq
{
  "db_name": "medic",
  "purge_seq": "0-g1AAAAH5eJyFk-SNIP-X4s_qEk",
  "update_seq": "115-g1AAAAH5eJ-SNIP-yF0LENwjQI_K0wLfgO3P6i8",
  "sizes": {
    "file": 1264034,
    "external": 1486394,
    "active": 798908
  },
  "other": {
    "data_size": 1486394
  },
  "doc_del_count": 6,
  "doc_count": 38,
  "disk_size": 1264034,
  "disk_format_version": 7,
  "data_size": 798908,
  "compact_running": false,
  "cluster": {
    "q": 12,
    "n": 1,
    "w": 1,
    "r": 1
  },
  "instance_start_time": "0"
}
$ docker stop deleteme_couchdb.2_1
deleteme_couchdb.2_1
$ curl -s http://medic:password@172.27.0.3:5984/medic | jq
parse error: Invalid numeric literal at line 1, column 20
$ curl -s http://medic:password@172.27.0.3:5984/medic
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
$ curl -k https://localhost/
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.19.6</center>
</body>
</html>
```
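For completeness, you can confirm from the host that only the one CouchDB node is down:

```sh
# The stopped node shows an "Exited" status; -a includes stopped containers
docker ps -a --format '{{.Names}}\t{{.Status}}' | grep couchdb
```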
However, after step 6, only HAProxy is working again; CHT is not:
```
$ curl -s http://medic:password@172.27.0.3:5984/medic | jq
{
  "db_name": "medic",
  "purge_seq": "0-g1AAAAH5eJzLY-SNIP-qEk",
  "update_seq": "115-g1AAAAH5eJy-SNIP-P6i8",
  "sizes": {
    "file": 1264034,
    "external": 1486394,
    "active": 798908
  },
  "other": {
    "data_size": 1486394
  },
  "doc_del_count": 6,
  "doc_count": 38,
  "disk_size": 1264034,
  "disk_format_version": 7,
  "data_size": 798908,
  "compact_running": false,
  "cluster": {
    "q": 12,
    "n": 1,
    "w": 1,
    "r": 1
  },
  "instance_start_time": "0"
}
$ curl -k https://localhost/
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.19.6</center>
</body>
</html>
```
When we look at the API and Sentinel containers, we see errors (below).
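To pull just the error lines from both services, a quick sketch using the container names above:

```sh
# docker logs mirrors the container's stderr to our stderr, hence the 2>&1
for c in deleteme_api_1 deleteme_sentinel_1; do
  echo "== $c =="
  docker logs --tail 200 "$c" 2>&1 | grep -i error
done
```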
Workaround
The workaround was to manually restart API and Sentinel, after which CHT was responsive again:

```
$ docker restart deleteme_api_1 deleteme_sentinel_1
deleteme_api_1
deleteme_sentinel_1
```
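A hypothetical variant if you don't want to type names, assuming only this project's containers are running on the host:

```sh
# Restart every project container except the couchdb nodes and haproxy itself
docker ps --format '{{.Names}}' | grep -Ev 'couchdb|haproxy' | xargs docker restart
```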
API and Sentinel errors
```
$ docker logs deleteme_api_1
2022-11-02 22:03:28 INFO: xforms updated successfully
REQ 44c12316-befb-483f-9fa9-fd43d80ba47f 172.27.0.1 - GET /medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAH5eJyF0LENwjAUBFALIkACCgpmQKJAAgaAFSDxAN9JEX1FUFEzBStA4iWYIktkBpLPhSCqxFRX-Ml3dqKUmsT9SM3C8yWMI7P_5mqd4KhHysy1Dpj6rWj7QTXRZmltxjRsdZvGgVizE6m6XFNaEzG40__TCxIYtjZl8ly9IJm5ipRMA1cvSHXy1FHrG9PY0QzkAyEeeBLTyNEOlAIhXiJP9wKgEghx0Drvuve3AUNzxN3agmnq3oChBQI_K0wLfgO3P6i8&limit=25 HTTP/1.0
REQ fb164bf8-04b6-4852-9b88-06216e2f2993 172.27.0.1 - GET /medic-user-medic-meta/_changes?include_docs=true&feed=longpoll&heartbeat=10000&since=13-g1AAAAH5eJyF0LERgkAQBdATDU0MbMLAGbUAbUFhC7iFgLlhNDK2CltQ2CasgiaoQVg_Yih30Q_uzf1_Vxhj5vk0M4v0ck3zjA-_XG8KHEXW8JIocXbyF-2-qCfEK5FqzG0HByK8V-3G3FDaE2XcGQd6QRJ2ImWgF6Tim2ob6AXpzjNzIrr7m4FiIMQTT_K3A5VAiLfqy9nIswCoBUIciergBgytEQ-RJrgBQxsEflZh3QeAu6hW&limit=25 HTTP/1.0
REQ 545be762-0993-4957-8e2e-1f7b4d233548 ::ffff:172.27.0.8 - GET /setup/poll HTTP/1.1
RES 545be762-0993-4957-8e2e-1f7b4d233548 ::ffff:172.27.0.8 - GET /setup/poll HTTP/1.1 200 108 13.054 ms
2022-11-02 22:04:28 DEBUG: Checking for a configured outgoing message service
2022-11-02 22:05:28 DEBUG: Checking for a configured outgoing message service
2022-11-02 22:06:28 DEBUG: Checking for a configured outgoing message service
2022-11-02 22:07:28 DEBUG: Checking for a configured outgoing message service
REQ bc09ad8b-4600-4b63-9981-f52328e4f403 172.27.0.1 - GET /medic/_design/medic-client/_view/data_records_by_type?group=true HTTP/1.0
REQ cf3edf66-d012-44cf-a530-a1b9c893f2ff 172.27.0.1 - GET /medic-user-medic-meta/_design/medic-user/_view/read?group=true HTTP/1.0
RES bc09ad8b-4600-4b63-9981-f52328e4f403 172.27.0.1 - GET /medic/_design/medic-client/_view/data_records_by_type?group=true HTTP/1.0 200 - 26.449 ms
RES cf3edf66-d012-44cf-a530-a1b9c893f2ff 172.27.0.1 - GET /medic-user-medic-meta/_design/medic-user/_view/read?group=true HTTP/1.0 200 - 28.299 ms
2022-11-02 22:08:28 DEBUG: Checking for a configured outgoing message service
RES fb164bf8-04b6-4852-9b88-06216e2f2993 172.27.0.1 - GET /medic-user-medic-meta/_changes?include_docs=true&feed=longpoll&heartbeat=10000&since=13-g1AAAAH5eJyF0LERgkAQBdATDU0MbMLAGbUAbUFhC7iFgLlhNDK2CltQ2CasgiaoQVg_Yih30Q_uzf1_Vxhj5vk0M4v0ck3zjA-_XG8KHEXW8JIocXbyF-2-qCfEK5FqzG0HByK8V-3G3FDaE2XcGQd6QRJ2ImWgF6Tim2ob6AXpzjNzIrr7m4FiIMQTT_K3A5VAiLfqy9nIswCoBUIciergBgytEQ-RJrgBQxsEflZh3QeAu6hW&limit=25 HTTP/1.0 200 - 10262.491 ms
2022-11-02 22:08:45 ERROR: Error watching changes, restarting: FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0
    at /api/node_modules/node-fetch/lib/index.js:272:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  message: 'invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0',
  type: 'invalid-json',
  [stack]: 'FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0\n' +
    '    at /api/node_modules/node-fetch/lib/index.js:272:32\n' +
    '    at processTicksAndRejections (node:internal/process/task_queues:96:5)',
  name: 'FetchError'
}
```
```
$ docker logs deleteme_sentinel_1
2022-11-02 22:08:35 INFO: Task dueTasks completed
2022-11-02 22:08:35 INFO: Background cleanup batch: 114 -> 114 (0)
2022-11-02 22:08:35 INFO: Task backgroundCleanup completed
2022-11-02 22:09:54 ERROR: Error watching changes, restarting: FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0
    at /sentinel/node_modules/node-fetch/lib/index.js:272:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  message: 'invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0',
  type: 'invalid-json',
  [stack]: 'FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0\n' +
    '    at /sentinel/node_modules/node-fetch/lib/index.js:272:32\n' +
    '    at processTicksAndRejections (node:internal/process/task_queues:96:5)',
  name: 'FetchError'
}
```
I made this work by manually editing the HAProxy config in my live `haproxy` container. First I edited `default_frontend.cfg` and added this stanza:
```
http-errors json_503_errors
    errorfile 503 /usr/local/etc/haproxy/errors/503.json
```
And then, at the end of the `frontend http-in` config in that same file, I added this line:

```
errorfiles json_503_errors
```
I created the JSON error file at `/usr/local/etc/haproxy/errors/503.json` with these contents (an errorfile must be a complete HTTP response, so note the blank line between the headers and the body):

```
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: application/json

{
  "code": 503,
  "error": "temporarily_unavailable",
  "error_description": "Sorry, the service is temporarily unavailable"
}
```
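For reference, the errorfile step can be scripted from the host. A sketch, assuming the container is named deleteme_haproxy_1 (hypothetical, following the deleteme_* names above) and that the errors/ directory already exists in the container:

```sh
# Write the errorfile locally, copy it into the running container, restart.
# NB: the blank line after the headers is required for a valid errorfile.
cat > 503.json <<'EOF'
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: application/json

{"code": 503, "error": "temporarily_unavailable", "error_description": "Sorry, the service is temporarily unavailable"}
EOF
docker cp 503.json deleteme_haproxy_1:/usr/local/etc/haproxy/errors/503.json
docker restart deleteme_haproxy_1
```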
I then restarted the `haproxy` container. Now, when I pull a node down, I get a much nicer error on the CLI from HAProxy:
```
$ curl -s http://medic:password@172.27.0.3:5984/medic | jq
{
  "error": "nodedown",
  "reason": "progress not possible"
}
```
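To check what status and content type are being served without dumping the body, curl's `-w` format variables help:

```sh
# Print just the HTTP status code and Content-Type of the response
curl -s -o /dev/null -w '%{http_code} %{content_type}\n' \
  http://medic:password@172.27.0.3:5984/medic
```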
And API doesn't die either. It does log an `ERROR`, though:
```
2022-11-02 23:24:09 ERROR: Server error: StatusCodeError: 503 - {"code":503,"error":"temporarily_unavailable","error_description":"Sorry, the service is temporarily unavailable"}
    at new StatusCodeError (/api/node_modules/request-promise-core/lib/errors.js:32:15)
    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:104:33)
    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at Request.self.callback (/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.<anonymous> (/api/node_modules/request/request.js:1154:10)
    at Request.emit (node:events:513:28)
    at IncomingMessage.<anonymous> (/api/node_modules/request/request.js:1076:12)
    at Object.onceWrapper (node:events:627:28)
    at IncomingMessage.emit (node:events:525:35) {
  name: 'StatusCodeError',
  statusCode: 503,
  message: '503 - {"code":503,"error":"temporarily_unavailable","error_description":"Sorry, the service is temporarily unavailable"}',
  error: {
    code: 503,
    error: 'temporarily_unavailable',
    error_description: 'Sorry, the service is temporarily unavailable'
  },
  [stack]: 'StatusCodeError: 503 - {"code":503,"error":"temporarily_unavailable","error_description":"Sorry, the service is temporarily unavailable"}\n' +
    '    at new StatusCodeError (/api/node_modules/request-promise-core/lib/errors.js:32:15)\n' +
    '    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:104:33)\n' +
    '    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)\n' +
    '    at Request.self.callback (/api/node_modules/request/request.js:185:22)\n' +
    '    at Request.emit (node:events:513:28)\n' +
    '    at Request.<anonymous> (/api/node_modules/request/request.js:1154:10)\n' +
    '    at Request.emit (node:events:513:28)\n' +
    '    at IncomingMessage.<anonymous> (/api/node_modules/request/request.js:1076:12)\n' +
    '    at Object.onceWrapper (node:events:627:28)\n' +
    '    at IncomingMessage.emit (node:events:525:35)'
}
```
The same goes for Sentinel: it doesn't die, but it does log an error:
```
2022-11-02 23:20:42 ERROR: Task backgroundCleanup completed with error: {
  code: 503,
  error: 'temporarily_unavailable',
  error_description: 'Sorry, the service is temporarily unavailable',
  status: 503,
  name: 'temporarily_unavailable',
  message: undefined,
  docId: '_local/background-seq'
}
```
Of note, the front end just shows a full-screen spinner for online admin users, with no indication that there's an error. Maybe the old full-page `<center><h1>502 Bad Gateway</h1></center>` output was better?
Starting the CouchDB node back up fixes everything; there's no need to restart API or Sentinel because they never died in the first place.
With the recent PR/ticket merged over here, this is good to go: AT passes.