Problem with get_resources_for_api_version
Canta opened this issue · 9 comments
Hi there.
I'm working with Kubernetes (not OpenShift) and Ansible. Ansible uses the `k8s` module to deal with Kubernetes, and in turn that module uses this client.
Doing that, I get an error from the openshift client, and I believe it may be a bug in the client itself.
Here's what I'm doing:
Ansible code:
```yaml
- name: Creating secrets
  k8s:
    definition: "{{ lookup('template', 'secret.yaml.j2') | from_yaml }}"
    state: present
  with_items: "{{ kubernetes_secrets }}"
  when:
    - item.stuff is defined
    - item.stuff | bool is true
  tags:
    - kube_secrets
    - kubernetes
```
The template is just any valid secret yaml. The ones in the kubernetes docs (https://kubernetes.io/docs/concepts/configuration/secret/) should be fine.
When I run that, I get this error (edited for pretty print):
```
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 49: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 24951
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
Traceback (most recent call last):
  File "/tmp/ansible_n56vlM/ansible_module_k8s.py", line 165, in <module>
    main()
  File "/tmp/ansible_n56vlM/ansible_module_k8s.py", line 161, in main
    KubernetesRawModule().execute_module()
  File "/tmp/ansible_n56vlM/ansible_modlib.zip/ansible/module_utils/k8s/raw.py", line 70, in execute_module
  File "/tmp/ansible_n56vlM/ansible_modlib.zip/ansible/module_utils/k8s/common.py", line 161, in get_api_client
  File "/usr/lib/python2.7/site-packages/openshift/dynamic/client.py", line 108, in __init__
    self.__init_cache()
  File "/usr/lib/python2.7/site-packages/openshift/dynamic/client.py", line 121, in __init_cache
    self.__resources.update(self.parse_api_groups())
  File "/usr/lib/python2.7/site-packages/openshift/dynamic/client.py", line 178, in parse_api_groups
    new_group[version] = self.get_resources_for_api_version(prefix, group['name'], version, preferred)
  File "/usr/lib/python2.7/site-packages/openshift/dynamic/client.py", line 194, in get_resources_for_api_version
    resources_raw = list(filter(lambda resource: '/' not in resource['name'], resources_response))
TypeError: 'NoneType' object is not iterable
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 1
```
I've looked at the code the error points to, and found the problem here:
```python
def get_resources_for_api_version(self, prefix, group, version, preferred):
    """ returns a dictionary of resources associated with provided groupVersion"""
    resources = {}
    subresources = {}
    path = '/'.join(filter(None, [prefix, group, version]))
    resources_response = load_json(self.request('GET', path))['resources']
    resources_raw = list(filter(lambda resource: '/' not in resource['name'], resources_response))
    # code continues, but the error happens on the line above
```
Debugging it, I found that `resources_response` is `None` after `load_json`, and checking `self.request('GET', path)` I saw that `.resources` is indeed `null`. This leads to the `NoneType...` error later.
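The failure mode can be reproduced standalone, without the openshift client at all: JSON `null` parses to Python `None`, and `filter()` over `None` raises exactly the `TypeError` from the traceback. A minimal sketch:

```python
import json

# Simulate an APIResourceList response whose "resources" field is null,
# as returned by the misbehaving endpoint.
api_body = '{"kind": "APIResourceList", "resources": null}'
resources_response = json.loads(api_body)['resources']
assert resources_response is None  # JSON null -> Python None

# Filtering over None raises the same error seen in the traceback.
try:
    list(filter(lambda resource: '/' not in resource['name'], resources_response))
except TypeError as exc:
    print(exc)  # 'NoneType' object is not iterable
```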
So, I did this little change, and after that the code worked as expected:
```python
def get_resources_for_api_version(self, prefix, group, version, preferred):
    """ returns a dictionary of resources associated with provided groupVersion"""
    resources = {}
    subresources = {}
    path = '/'.join(filter(None, [prefix, group, version]))
    resources_response = load_json(self.request('GET', path))['resources'] or []
    # note the 'or []' clause at the end of the line
```
It's a simple default-value fallback when the field is missing or null.
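The fix relies on Python's short-circuiting `or`: `None or []` evaluates to `[]`, so the subsequent filter simply yields nothing instead of raising. A quick standalone check:

```python
import json

# With the `or []` fallback, a null "resources" field becomes an empty list,
# and filtering it yields no resources instead of raising TypeError.
resources_response = json.loads('{"resources": null}')['resources'] or []
resources_raw = list(filter(lambda resource: '/' not in resource['name'], resources_response))
print(resources_raw)  # []
```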
Now, I'm certainly not working with the latest client version, since I installed it from the CentOS repos:
```
[user@server]$ yum whatprovides openshift
(...)
1:python2-openshift-0.8.6-1.el7.noarch : Python client for the OpenShift API
Repo         : epel
Matched from:
Filename     : /usr/lib/python2.7/site-packages/openshift
```
However, if I check the current version of the code, I see the problem is still there.
I don't know whether `null` is a valid value for that `resources` property or whether there's some other problem in my Kubernetes cluster, but I guess the fix is harmless anyway. With that in mind, I believe this is a bug, which is why I'm filing this issue. Please let me know otherwise.
Thanks.
Well, I don't think it should be getting `null` for `resources`, but clearly it can, and we should definitely be handling that properly. Thanks for the detailed bug report and suggestion for fixing! Did you want to submit a PR with your fix, or would you prefer that I pull it in?
I've sent a PR using GitHub's UI. It's my first time doing it, out of curiosity, so I kinda don't know what I'm doing. Feel free to close it if I'm doing it wrong and patch the code yourself.
Thanks!
I still don't understand why we've never hit this before. I'd like to understand that before deciding on a fix.
Can you distill this down to a more minimal test case that we could actually run? (i.e. without the `when` or `with_items`)
My guess would be that either (1) an aggregated API server is messed up and returning this strange response, or (2) if a CRD is created and then deleted, maybe the group can hang around. Regardless, an aggregated API server could certainly do it. If the issue is (2) it's hopefully easily testable, but if this only occurs when an aggregated API server is messed up, a test case may be hard to create.
Sorry for the delay. I'd been struggling to reproduce the problem since last Friday, until a few minutes ago when I figured out the cache mechanism: when I added my fix the code ran fine, but that left behind a cache file, so after that it also ran fine without the fix, as if the problem weren't there anymore.
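The masking effect described above is worth spelling out. This is an illustrative sketch only (not the client's actual code, and the cache path here is made up): once one run succeeds and writes a discovery cache file, later runs read the file and never exercise the failing code path again.

```python
import json
import os
import tempfile

# Hypothetical cache location, for illustration only.
CACHE = os.path.join(tempfile.gettempdir(), 'example-discovery-cache.json')

def discover(fetch):
    """Return discovery data, preferring the on-disk cache when present."""
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)  # cached: the failing code path never runs
    data = fetch()               # may raise on a "resources": null response
    with open(CACHE, 'w') as f:
        json.dump(data, f)
    return data

# Start clean for the demo.
if os.path.exists(CACHE):
    os.remove(CACHE)

# First run succeeds and populates the cache...
first = discover(lambda: {'api/v1': ['secrets']})
# ...so a later run returns the cached data and never calls fetch at all,
# even if fetch would now fail.
def failing_fetch():
    raise TypeError("'NoneType' object is not iterable")
second = discover(failing_fetch)
print(second == first)  # True
```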
> I still don't understand why we've never hit this before. I'd like to understand that before deciding on a fix.
> Can you distill this down to a more minimal test case that we could actually run? (i.e. without the `when` or `with_items`)
The problem is not in the data I'm passing to the endpoint, but in the cache-construction mechanism of the openshift client when it deals with my cluster. I put a debug line in the client file, like this:
```python
def get_resources_for_api_version(self, prefix, group, version, preferred):
    """ returns a dictionary of resources associated with provided groupVersion"""
    resources = {}
    subresources = {}
    path = '/'.join(filter(None, [prefix, group, version]))
    resources_response = load_json(self.request('GET', path))['resources']
    # That's my debug line:
    print("resources_response: ", resources_response, "path", path, "data:", self.request('GET', path).data)
    resources_raw = list(filter(lambda resource: '/' not in resource['name'], resources_response))
    # code goes on
```
That way I could see that `get_resources_for_api_version` is called several times for different API paths: `api/v1`, `apis/apps/v1alpha1`, etc. One of those API paths has a null `resources` property, and I found exactly which one:
```
# curl -s "http://localhost:8080/apis/batch/v2alpha1" | python -m json.tool
{
    "apiVersion": "v1",
    "groupVersion": "batch/v2alpha1",
    "kind": "APIResourceList",
    "resources": null
}
```
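To confirm which endpoint is the culprit without wiring up the whole client, a small hedged helper can check a groupVersion's discovery document. The `probe` function and the `localhost:8080` address are assumptions (unauthenticated access as in the curl above, e.g. via a local API proxy); the parsing half can be checked offline:

```python
import json
import urllib.request

def has_null_resources(body):
    """True if an APIResourceList body carries "resources": null."""
    return json.loads(body).get('resources') is None

def probe(base_url, group_version):
    # Network call; assumes unauthenticated access as in the curl example.
    url = '%s/apis/%s' % (base_url, group_version)
    with urllib.request.urlopen(url) as resp:
        return has_null_resources(resp.read())

# Offline check against the response shown above:
sample = ('{"apiVersion": "v1", "groupVersion": "batch/v2alpha1", '
          '"kind": "APIResourceList", "resources": null}')
print(has_null_resources(sample))  # True
```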
I'm testing against an old Kubernetes version (`v1.3.10`), so I bet that's also part of the problem. Perhaps that's why you never had to deal with this: maybe you've always tested against newer/different versions, which behave differently. But I don't understand what that `null` value is doing there, or even whether it's normal.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting `/lifecycle frozen`.
If this issue is safe to close now please do so with `/close`.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting `/lifecycle frozen`.
If this issue is safe to close now please do so with `/close`.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting `/reopen`.
Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
Exclude this issue from closing again by commenting `/lifecycle frozen`.
/close
@openshift-bot: Closing this issue.
In response to this:
> Rotten issues close after 30d of inactivity.
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.