medic/cht-release-testing

Test HAProxy Healthcheck


User: curl/browser
Site: local docker
Platform: NA
Test Steps:

  1. get your two docker compose files:

    wget https://staging.dev.medicmobile.org/_couch/builds/medic:medic:master/docker-compose/cht-core.yml
    wget https://staging.dev.medicmobile.org/_couch/builds/medic:medic:master/docker-compose/cht-couchdb-clustered.yml
    
  2. Start a cluster

    COUCHDB_SERVERS="couchdb.1,couchdb.2,couchdb.3" CHT_COMPOSE_PROJECT_NAME=test5 COUCHDB_SECRET=test5 DOCKER_CONFIG_PATH=./ COUCHDB_DATA=./couchdb CHT_COMPOSE_PATH=./ COUCHDB_USER=medic COUCHDB_PASSWORD=password docker-compose -f cht-core.yml -f cht-couchdb-clustered.yml up
    
  3. ensure CHT is up in browser and with curl

    1. browse to https://localhost/
    2. find the IP of haproxy with docker inspect -f '{{.Name}} - {{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q) (see the one-liner after this list for a shortcut), then run curl against it and confirm you see JSON from a healthy medic DB: curl -s http://medic:password@172.27.0.3:5984/medic | jq
  4. check the container names with docker ps --format '{{.Names}}' | grep couchdb and then stop a couch node: docker stop deleteme_couchdb.2_1

  5. ensure CHT is DOWN in browser and with curl

    1. browse to https://localhost/
    2. find the IP of haproxy (172.27.0.3 in this case), run the curl command, and confirm you see JSON from the now-unhealthy medic DB: curl -s http://medic:password@172.27.0.3:5984/medic | jq
  6. start couch node again: docker start deleteme_couchdb.2_1
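
For steps 3 and 5, if you only want the haproxy container's IP, the same docker inspect template can be filtered by container name (a sketch; it assumes the HAProxy container has "haproxy" in its name, as it does with the compose files above):

    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -qf name=haproxy)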

Expected Result:
CHT should resume correctly. Check in browser and with curl:

  1. browse to https://localhost/
  2. find the IP of haproxy (172.27.0.3 in this case), run the curl command, and confirm you see JSON from a healthy medic DB: curl -s http://medic:password@172.27.0.3:5984/medic | jq

AT Status: Failing

Details

After step 5, both the browser requests and the curl requests return HTML with a 502 or 503. I would expect HAProxy's port 5984 to return JSON, even if it's serving a 5xx error:

$ curl -s http://medic:password@172.27.0.3:5984/medic | jq
{
  "db_name": "medic",
  "purge_seq": "0-g1AAAAH5eJyFk-SNIP-X4s_qEk",
  "update_seq": "115-g1AAAAH5eJ-SNIP-yF0LENwjQI_K0wLfgO3P6i8",
  "sizes": {
    "file": 1264034,
    "external": 1486394,
    "active": 798908
  },
  "other": {
    "data_size": 1486394
  },
  "doc_del_count": 6,
  "doc_count": 38,
  "disk_size": 1264034,
  "disk_format_version": 7,
  "data_size": 798908,
  "compact_running": false,
  "cluster": {
    "q": 12,
    "n": 1,
    "w": 1,
    "r": 1
  },
  "instance_start_time": "0"
}

docker stop deleteme_couchdb.2_1                        
deleteme_couchdb.2_1

$ curl -s http://medic:password@172.27.0.3:5984/medic | jq
parse error: Invalid numeric literal at line 1, column 20

$ curl -s http://medic:password@172.27.0.3:5984/medic     
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

$ curl -k https://localhost/
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.19.6</center>
</body>
</html>
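
A quick way to see exactly what HAProxy is returning in this state is to include the response headers, which should show the text/html content type alongside the 5xx status (same IP as above):

curl -si http://medic:password@172.27.0.3:5984/medic | head -n 6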

However, after step 6, only HAProxy is working again; CHT is not:

curl -s http://medic:password@172.27.0.3:5984/medic | jq
{
  "db_name": "medic",
  "purge_seq": "0-g1AAAAH5eJzLY-SNIP-qEk",
  "update_seq": "115-g1AAAAH5eJy-SNIP-P6i8",
  "sizes": {
    "file": 1264034,
    "external": 1486394,
    "active": 798908
  },
  "other": {
    "data_size": 1486394
  },
  "doc_del_count": 6,
  "doc_count": 38,
  "disk_size": 1264034,
  "disk_format_version": 7,
  "data_size": 798908,
  "compact_running": false,
  "cluster": {
    "q": 12,
    "n": 1,
    "w": 1,
    "r": 1
  },
  "instance_start_time": "0"
}

curl -k https://localhost/                              
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.19.6</center>
</body>
</html>

When we look at the API and Sentinel containers, we see errors (below).

Workaround

The workaround was to manually restart API and Sentinel, after which CHT was responsive again:

docker restart deleteme_api_1 deleteme_sentinel_1 
deleteme_api_1
deleteme_sentinel_1
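
To confirm both services came back cleanly after the restart, their recent logs can be checked (plain docker logs usage; container names as above):

docker logs --tail 20 deleteme_api_1
docker logs --tail 20 deleteme_sentinel_1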

API and Sentinel errors

docker logs deleteme_api_1

2022-11-02 22:03:28 INFO: xforms updated successfully 
REQ 44c12316-befb-483f-9fa9-fd43d80ba47f 172.27.0.1 - GET /medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAH5eJyF0LENwjAUBFALIkACCgpmQKJAAgaAFSDxAN9JEX1FUFEzBStA4iWYIktkBpLPhSCqxFRX-Ml3dqKUmsT9SM3C8yWMI7P_5mqd4KhHysy1Dpj6rWj7QTXRZmltxjRsdZvGgVizE6m6XFNaEzG40__TCxIYtjZl8ly9IJm5ipRMA1cvSHXy1FHrG9PY0QzkAyEeeBLTyNEOlAIhXiJP9wKgEghx0Drvuve3AUNzxN3agmnq3oChBQI_K0wLfgO3P6i8&limit=25 HTTP/1.0
REQ fb164bf8-04b6-4852-9b88-06216e2f2993 172.27.0.1 - GET /medic-user-medic-meta/_changes?include_docs=true&feed=longpoll&heartbeat=10000&since=13-g1AAAAH5eJyF0LERgkAQBdATDU0MbMLAGbUAbUFhC7iFgLlhNDK2CltQ2CasgiaoQVg_Yih30Q_uzf1_Vxhj5vk0M4v0ck3zjA-_XG8KHEXW8JIocXbyF-2-qCfEK5FqzG0HByK8V-3G3FDaE2XcGQd6QRJ2ImWgF6Tim2ob6AXpzjNzIrr7m4FiIMQTT_K3A5VAiLfqy9nIswCoBUIciergBgytEQ-RJrgBQxsEflZh3QeAu6hW&limit=25 HTTP/1.0
REQ 545be762-0993-4957-8e2e-1f7b4d233548 ::ffff:172.27.0.8 - GET /setup/poll HTTP/1.1
RES 545be762-0993-4957-8e2e-1f7b4d233548 ::ffff:172.27.0.8 - GET /setup/poll HTTP/1.1 200 108 13.054 ms
2022-11-02 22:04:28 DEBUG: Checking for a configured outgoing message service 
2022-11-02 22:05:28 DEBUG: Checking for a configured outgoing message service 
2022-11-02 22:06:28 DEBUG: Checking for a configured outgoing message service 
2022-11-02 22:07:28 DEBUG: Checking for a configured outgoing message service 
REQ bc09ad8b-4600-4b63-9981-f52328e4f403 172.27.0.1 - GET /medic/_design/medic-client/_view/data_records_by_type?group=true HTTP/1.0
REQ cf3edf66-d012-44cf-a530-a1b9c893f2ff 172.27.0.1 - GET /medic-user-medic-meta/_design/medic-user/_view/read?group=true HTTP/1.0
RES bc09ad8b-4600-4b63-9981-f52328e4f403 172.27.0.1 - GET /medic/_design/medic-client/_view/data_records_by_type?group=true HTTP/1.0 200 - 26.449 ms
RES cf3edf66-d012-44cf-a530-a1b9c893f2ff 172.27.0.1 - GET /medic-user-medic-meta/_design/medic-user/_view/read?group=true HTTP/1.0 200 - 28.299 ms
2022-11-02 22:08:28 DEBUG: Checking for a configured outgoing message service 
RES fb164bf8-04b6-4852-9b88-06216e2f2993 172.27.0.1 - GET /medic-user-medic-meta/_changes?include_docs=true&feed=longpoll&heartbeat=10000&since=13-g1AAAAH5eJyF0LERgkAQBdATDU0MbMLAGbUAbUFhC7iFgLlhNDK2CltQ2CasgiaoQVg_Yih30Q_uzf1_Vxhj5vk0M4v0ck3zjA-_XG8KHEXW8JIocXbyF-2-qCfEK5FqzG0HByK8V-3G3FDaE2XcGQd6QRJ2ImWgF6Tim2ob6AXpzjNzIrr7m4FiIMQTT_K3A5VAiLfqy9nIswCoBUIciergBgytEQ-RJrgBQxsEflZh3QeAu6hW&limit=25 HTTP/1.0 200 - 10262.491 ms
2022-11-02 22:08:45 ERROR: Error watching changes, restarting: FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0
    at /api/node_modules/node-fetch/lib/index.js:272:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  message: 'invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0',
  type: 'invalid-json',
  [stack]: 'FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0\n' +
    '    at /api/node_modules/node-fetch/lib/index.js:272:32\n' +
    '    at processTicksAndRejections (node:internal/process/task_queues:96:5)',
  name: 'FetchError'
} 
docker logs deleteme_sentinel_1

2022-11-02 22:08:35 INFO: Task dueTasks completed 
2022-11-02 22:08:35 INFO: Background cleanup batch: 114 -> 114 (0) 
2022-11-02 22:08:35 INFO: Task backgroundCleanup completed 
2022-11-02 22:09:54 ERROR: Error watching changes, restarting: FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0
    at /sentinel/node_modules/node-fetch/lib/index.js:272:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  message: 'invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0',
  type: 'invalid-json',
  [stack]: 'FetchError: invalid json response body at http://haproxy:5984/medic/_changes?feed=longpoll&heartbeat=10000&since=115-g1AAAAOReJyV0sFNwzAUgGGLVBQJOHBgBiQOyHFsJz3BCtB6AD_bUhQVOHFmClaA1kswRZfoDLTmOUEKByuST09K5E_-X7ImhFy0hSVX5vXNtBYe_uZducZXJ5rAtVKrri108YwP5tRoEI1MHUgyVc9ERMGt99sozXvJUVqVzKbOJCU2SIh4uA_hOEqCgbZNnTozkRaRAHiv5b86UVsrRG4dIivovN9EaTZIUmoHKWmyDpEtvIdwiNJpL0lWl6B5bh0ix5cZeVLqI1rnvQW2EcAXmX3ILJHB8YXLj9pZrzVlU0lwmY3IbJDB8RPC99i5MFI6ntrYVCcyB2RwPCq1G-_GqXG01vmluLAdjk_v91G7HEo1A-A0vxQXtseBf1qI2s3wRcEwXqU01v0CQ9kbEg&limit=25 reason: Unexpected token < in JSON at position 0\n' +
    '    at /sentinel/node_modules/node-fetch/lib/index.js:272:32\n' +
    '    at processTicksAndRejections (node:internal/process/task_queues:96:5)',
  name: 'FetchError'
} 

I made this work by manually editing the HAProxy config in my live haproxy container. First I edited the default_frontend.cfg and added this stanza:

http-errors json_503_errors
  errorfile 503 /usr/local/etc/haproxy/errors/503.json

And then at the end of the frontend http-in config in that same file, I added this line:

  errorfiles json_503_errors 
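
Put together, the relevant parts of default_frontend.cfg end up looking roughly like this (a sketch only; the bind and default_backend lines are placeholders for whatever the stock frontend already contains):

http-errors json_503_errors
  errorfile 503 /usr/local/etc/haproxy/errors/503.json

frontend http-in
  bind *:5984                      # placeholder - keep the existing bind/option lines
  default_backend couchdb-servers  # placeholder - keep the existing backend name
  errorfiles json_503_errors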

I created the JSON error file at /usr/local/etc/haproxy/errors/503.json with the following contents:

HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

{
  "code": 503,
  "error": "temporarily_unavailable",
  "error_description": "Sorry, the service is temporarily unavailable"
}
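
Since these edits were made inside the running container, the config can be sanity-checked before restarting (a sketch; the container name deleteme_haproxy_1 and the config path are assumptions based on the naming used elsewhere in this issue):

docker exec deleteme_haproxy_1 haproxy -c -f /usr/local/etc/haproxy/   # -c only validates the config and exits
docker restart deleteme_haproxy_1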

I then restarted the haproxy container. Now, when I pull a node down, I get a much nicer error on the CLI from HAProxy:

curl -s http://medic:password@172.27.0.3:5984/medic  | jq

{                                                     
  "error": "nodedown",                                                                                       
  "reason": "progress not possible"
}  

And API doesn't die either. It does log an ERROR though:

2022-11-02 23:24:09 ERROR: Server error: StatusCodeError: 503 - {"code":503,"error":"temporarily_unavailable","error_description":"Sorry, the service is temporarily unavailable"}
    at new StatusCodeError (/api/node_modules/request-promise-core/lib/errors.js:32:15)
    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:104:33)
    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at Request.self.callback (/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.<anonymous> (/api/node_modules/request/request.js:1154:10)
    at Request.emit (node:events:513:28)
    at IncomingMessage.<anonymous> (/api/node_modules/request/request.js:1076:12)
    at Object.onceWrapper (node:events:627:28)
    at IncomingMessage.emit (node:events:525:35) {
  name: 'StatusCodeError',
  statusCode: 503,
  message: '503 - {"code":503,"error":"temporarily_unavailable","error_description":"Sorry, the service is temporarily unavailable"}',
  error: {
    code: 503,
    error: 'temporarily_unavailable',
    error_description: 'Sorry, the service is temporarily unavailable'
  },
  [stack]: 'StatusCodeError: 503 - {"code":503,"error":"temporarily_unavailable","error_description":"Sorry, the service is temporarily unavailable"}\n' +
    '    at new StatusCodeError (/api/node_modules/request-promise-core/lib/errors.js:32:15)\n' +
    '    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:104:33)\n' +
    '    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)\n' +
    '    at Request.self.callback (/api/node_modules/request/request.js:185:22)\n' +
    '    at Request.emit (node:events:513:28)\n' +
    '    at Request.<anonymous> (/api/node_modules/request/request.js:1154:10)\n' +
    '    at Request.emit (node:events:513:28)\n' +
    '    at IncomingMessage.<anonymous> (/api/node_modules/request/request.js:1076:12)\n' +
    '    at Object.onceWrapper (node:events:627:28)\n' +
    '    at IncomingMessage.emit (node:events:525:35)'
} 

The same goes for Sentinel - it doesn't die, but it does log an error:

2022-11-02 23:20:42 ERROR: Task backgroundCleanup completed with error: {
  code: 503,
  error: 'temporarily_unavailable',
  error_description: 'Sorry, the service is temporarily unavailable',
  status: 503,
  name: 'temporarily_unavailable',
  message: undefined,
  docId: '_local/background-seq'
} 

Noteworthy, the front end just shows a full-screen spinner for online admin users, with no indication that there's an error. Maybe the old full-page <center><h1>502 Bad Gateway</h1></center> output was better?

Starting the couchdb node back up fixes everything; there's no need to restart API or Sentinel because they never died in the first place.

With the recent PR/ticket merged over here, this is good to go - AT passes