anchore/anchore-engine

anchore-engine won't sync nvdv2 feed on k8s environment with ipvs kube-proxy mode

jasiam opened this issue

Is this a request for help?:
No

Is this a BUG REPORT or a FEATURE REQUEST?:
BUG REPORT

The --ipvs-tcp-timeout flag for kube-proxy (IPVS mode) defaults to 900 seconds, while the sysctl net.ipv4.tcp_keepalive_time defaults to 7200 seconds on most kernels. Because the nvdv2 feed is large, its first download takes longer than 900 seconds, and the database session carries no traffic during that interval, so IPVS drops the connection before a keepalive probe is ever sent. As a result, the first sync of the nvdv2 feed fails with a (psycopg2.OperationalError) server closed the connection unexpectedly error in the anchore policy-engine container.

There is no universal fix for this through chart template changes alone (we could set sysctl net.ipv4.tcp_keepalive_time to a value below 900 via securityContext, but kubelet has to allow that sysctl explicitly; it is forbidden by default), so I think this should be considered a bug (only reproducible in k8s environments with IPVS kube-proxy mode) and some code could be added to the sync process in anchore-engine to keep the database connection alive and avoid this kind of error.
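As an illustration only, here is a minimal sketch of the kind of mitigation that could be applied on the database-connection side: psycopg2/libpq accept TCP keepalive parameters, so enabling keepalives with an idle time below the 900-second IPVS timeout would keep the session alive while the long feed download runs. This is not anchore-engine's actual code; the DSN and the values are placeholders, and where it would hook into the sync process is left open.

# Illustrative sketch only, not anchore-engine code. Enables TCP keepalives on the
# DB connection so an idle session survives kube-proxy's 900 s IPVS tcp timeout.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://anchore:anchore@anchore-db:5432/anchore",  # placeholder DSN
    pool_pre_ping=True,              # re-validate pooled connections before use
    connect_args={
        "keepalives": 1,             # libpq keepalive parameters, passed through by psycopg2
        "keepalives_idle": 600,      # first probe after 600 s idle (< 900 s IPVS timeout)
        "keepalives_interval": 60,
        "keepalives_count": 3,
    },
)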

It should be reproducible on any anchore-engine version (including brand new 1.0.0) using a database deployed in the same k8s cluster

What happened:
nvdv2 feed sync failed right after downloading all the pages

What did you expect to happen:
nvdv2 feed syncs successfully

Any relevant log output from /var/log/anchore:
[service:policy-engine] 2021-10-04 13:45:24+0000 [-] [Thread-71] [anchore_engine.services.policy_engine.engine.feeds.download/_fetch_group_data()] [INFO] Completed data download of for feed group: nvdv2/nvdv2:cves. Total pages: 172
[service:policy-engine] 2021-10-04 13:45:24+0000 [-] [Thread-71] [anchore_engine.utils/timer()] [INFO] Execution of data download for group nvdv2/nvdv2:cves took: 1499.7355272769928 seconds
[service:policy-engine] 2021-10-04 13:45:24+0000 [-] [Thread-71] [anchore_engine.services.policy_engine.engine.feeds.download/execute()] [INFO] Feed data download process ending
[service:policy-engine] 2021-10-04 13:45:24+0000 [-] [Thread-71] [anchore_engine.services.policy_engine.engine.feeds.sync/sync()] [INFO] Download complete. Syncing to db (feed=nvdv2, group=nvdv2:cves, operation_id=0316a89b3c6348c9bcb31ee9a4d74c38)
[service:policy-engine] 2021-10-04 13:45:24+0000 [-] [Thread-71] [anchore_engine.services.policy_engine.engine.feeds.sync/sync()] [ERROR] Error syncing nvdv2/nvdv2:cves (operation_id=0316a89b3c6348c9bcb31ee9a4d74c38)
[service:policy-engine] 2021-10-04 13:45:24+0000 [-] [Thread-71] [anchore_engine.services.policy_engine.engine.feeds.sync/notify_event()] [INFO] Event: {"type": "system.feeds.sync.group_failed", "level": "error", "message": "Feed group sync failed", "details": {"cause": "(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n\n[SQL: SELECT feeds.name AS feeds_name, feeds.description AS feeds_description, feeds.access_tier AS feeds_access_tier, feeds.last_full_sync AS feeds_last_full_sync, feeds.last_update AS feeds_last_update, feeds.created_at AS feeds_created_at, feeds.enabled AS feeds_enabled \nFROM feeds \nWHERE feeds.name = %(name_1)s]\n[parameters: {'name_1': 'nvdv2'}]\n(Background on this error at: http://sqlalche.me/e/e3q8)"}, "timestamp": "2021-10-04T13:45:24.336789", "resource": {"user_id": "admin", "type": "feed_group", "id": "nvdv2/nvdv2:cves"}, "source": {"request_id": null, "servicename": "policy_engine", "hostid": "keos-anchore-anchore-engine-policy-5564cf78f6-lrwv4", "base_url": "http://keos-anchore-anchore-engine-policy:8087"}} (operation_id=0316a89b3c6348c9bcb31ee9a4d74c38)
[service:policy-engine] 2021-10-04 13:45:24+0000 [-] [Thread-71] [anchore_engine.services.policy_engine.engine.feeds.sync/notify_event()] [INFO] Event: {"type": "system.feeds.sync.feed_failed", "level": "error", "message": "Feed sync failed", "details": {"cause": "One or more groups failed to sync"}, "timestamp": "2021-10-04T13:45:24.618451", "resource": {"user_id": "admin", "type": "feed", "id": "nvdv2"}, "source": {"request_id": null, "servicename": "policy_engine", "hostid": "keos-anchore-anchore-engine-policy-5564cf78f6-lrwv4", "base_url": "http://keos-anchore-anchore-engine-policy:8087"}} (operation_id=0316a89b3c6348c9bcb31ee9a4d74c38)

What docker images are you using:

docker.io/anchore/anchore-engine:v1.0.0

How to reproduce the issue:

In a k8s environment with IPVS kube-proxy mode, just deploy anchore-engine using its helm chart with the database in the same cluster: the nvdv2 feed never syncs and the trace above appears in the anchore policy-engine logs.
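To observe the failure, the feed sync status can also be polled from the engine API, for example with the rough sketch below. It assumes the external API has been port-forwarded to localhost:8228, that the default admin credentials are in use, and that /v1/system/feeds (the call behind anchore-cli system feeds list) is available; the response field names are assumptions as well.

# Rough status check via the anchore-engine API. Endpoint path, response fields,
# credentials and URL are assumptions/placeholders for a port-forwarded external API.
import requests

resp = requests.get(
    "http://localhost:8228/v1/system/feeds",
    auth=("admin", "foobar"),   # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
for feed in resp.json():
    for group in feed.get("groups", []):
        print(feed.get("name"), group.get("name"), "last_sync:", group.get("last_sync"))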

Anything else we need to know:

There are 2 possible workarounds until anchore-engine is able to recover from this timeout during the sync process:
1- You can allow the net.ipv4.tcp_keepalive* sysctls in your kubelet configuration by editing your kubelet-config.yaml file and adding:
allowedUnsafeSysctls:
- "net.ipv4.tcp_keepalive_time"
- "net.ipv4.tcp_keepalive_intvl"
- "net.ipv4.tcp_keepalive_probes"

Restart your kubelet and then set the following sysctls via securityContext in the anchore policy-engine pod:
securityContext:
  sysctls:
    - name: net.ipv4.tcp_keepalive_time
      value: "600"
    - name: net.ipv4.tcp_keepalive_intvl
      value: "60"
    - name: net.ipv4.tcp_keepalive_probes
      value: "3"

With this configuration I have verified that the sync works correctly.

2- You can avoid using the k8s Service for the database and point the policy-engine directly at the anchore-db Pod IP. Traffic then bypasses IPVS and is not affected by its timeout.
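For reference, a minimal sketch of how that Pod IP could be looked up with the official kubernetes Python client; the namespace and label selector are assumptions that depend on how the chart was deployed, and the Pod IP changes whenever the database Pod is rescheduled, so this is only a stopgap.

# Sketch: find the database Pod IP so policy-engine can point at it directly,
# bypassing the IPVS-backed Service. Namespace and label selector are assumptions.
from kubernetes import client, config

config.load_kube_config()   # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("anchore", label_selector="app=postgresql")
for pod in pods.items:
    print(pod.metadata.name, pod.status.pod_ip)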