openshift/openshift-restclient-python

Problem with get_resources_for_api_version

Canta opened this issue · 9 comments

Canta commented

Hi there.

I'm working with Kubernetes (not OpenShift) and Ansible.
Ansible uses the "k8s" module to talk to Kubernetes, and in turn that module uses this client.
Doing that, I'm getting an error from the openshift client, and I believe it may be a bug in the client itself.

Here's what I'm doing:
Ansible code:

- name: Creating secrets
  k8s:
    definition: "{{ lookup('template', 'secret.yaml.j2') | from_yaml }}"
    state: present
  with_items: "{{ kubernetes_secrets }}"
  when:
    - item.stuff is defined
    - item.stuff | bool is true
  tags:
  - kube_secrets
  - kubernetes

The template is just any valid Secret YAML. The ones in the Kubernetes docs (https://kubernetes.io/docs/concepts/configuration/secret/) should be fine.

When I run that, I get this error (edited for pretty print):

OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 49: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 24951
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
Traceback (most recent call last):
  File \"/tmp/ansible_n56vlM/ansible_module_k8s.py\", line 165, in <module>
    main()
  File \"/tmp/ansible_n56vlM/ansible_module_k8s.py\", line 161, in main
    KubernetesRawModule().execute_module()
  File \"/tmp/ansible_n56vlM/ansible_modlib.zip/ansible/module_utils/k8s/raw.py\", line 70, in execute_module
  File \"/tmp/ansible_n56vlM/ansible_modlib.zip/ansible/module_utils/k8s/common.py\", line 161, in get_api_client
  File \"/usr/lib/python2.7/site-packages/openshift/dynamic/client.py\", line 108, in __init__
    self.__init_cache()
  File \"/usr/lib/python2.7/site-packages/openshift/dynamic/client.py\", line 121, in __init_cache
    self.__resources.update(self.parse_api_groups())
  File \"/usr/lib/python2.7/site-packages/openshift/dynamic/client.py\", line 178, in parse_api_groups
    new_group[version] = self.get_resources_for_api_version(prefix, group['name'], version, preferred)
  File \"/usr/lib/python2.7/site-packages/openshift/dynamic/client.py\", line 194, in get_resources_for_api_version
    resources_raw = list(filter(lambda resource: '/' not in resource['name'], resources_response))
TypeError: 'NoneType' object is not iterable
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 1

I've looked at the code the error points to, and found the problem here:

    def get_resources_for_api_version(self, prefix, group, version, preferred):
        """ returns a dictionary of resources associated with provided groupVersion"""

        resources = {}
        subresources = {}

        path = '/'.join(filter(None, [prefix, group, version]))
        resources_response = load_json(self.request('GET', path))['resources']

        resources_raw = list(filter(lambda resource: '/' not in resource['name'], resources_response))
        """ code continues, but the error happens there """

Debugging it, I found that resources_response is None after load_json, and when checking self.request('GET', path) I saw that .resources is indeed null. This leads to the 'NoneType' object is not iterable error later.

So, I did this little change, and after that the code worked as expected:

    def get_resources_for_api_version(self, prefix, group, version, preferred):
        """ returns a dictionary of resources associated with provided groupVersion"""

        resources = {}
        subresources = {}

        path = '/'.join(filter(None, [prefix, group, version]))
        resources_response = load_json(self.request('GET', path))['resources'] or []
        """ note the 'or' clause at the end of the line  """

It's a simple data validation: fall back to an empty list as the default value.
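To make the failure mode concrete, here's a small standalone sketch (not the client code itself, just an illustration using the same line) of why the original code blows up on a null resources field and why the 'or []' default avoids it:

    # Standalone illustration of the failure and the proposed default.
    resources_response = None  # what load_json(...)['resources'] yields when the API returns null

    try:
        list(filter(lambda resource: '/' not in resource['name'], resources_response))
    except TypeError as err:
        print(err)  # 'NoneType' object is not iterable

    # With the proposed default, a null response just becomes an empty resource list:
    resources_response = resources_response or []
    print(list(filter(lambda resource: '/' not in resource['name'], resources_response)))  # []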

Now, I'm certainly not working with the latest client version, as I've installed it from CentOS repos:

[user@server]$ yum whatprovides openshift  
  (...)
  1:python2-openshift-0.8.6-1.el7.noarch : Python client for the OpenShift API
  Repo        : epel
  Matched from:
  Filename    : /usr/lib/python2.7/site-packages/openshift

However, if I check the current version, I see the problem is still there:

resources_response = load_json(self.client.request('GET', path))['resources']

IDK if "null" is a valid output for that .resources property or there's some other problem in my kubernetes cluster, but I guess the fix is harmless anyway. With that in mind, I guess this is then a bug, and that's why I fill this issue. Please let me know otherwise.

Thanks.

Well, I don't think it should be getting null for resources, but clearly it can, and we should definitely be handling that properly. Thanks for the detailed bug report and suggested fix! Would you like to submit a PR with your fix, or would you prefer that I pull it in?

Canta commented

I've sent a PR using GitHub's UI. It's my first time doing it, out of curiosity, so I don't quite know what I'm doing. Feel free to close it if I'm doing it wrong and patch the code yourself.

Thanks!

I still don't understand why we've never hit this before. I'd like to understand that before deciding on a fix.

Can you distill this down to a more minimal test case that we could actually run? (i.e. without the when or with_items)

My guess would be that either 1. an aggregated API server is messed up and returning this strange response, or 2. if a CRD is created and then deleted, maybe the group can hang around? Regardless, an aggregated API server could certainly do it. If the issue is #2 it's hopefully easy to test, but if this only occurs when an aggregated API server is messed up, a test case may be hard to create.

Canta commented

Sorry for the delay. I've been struggling to reproduce the problem since last Friday, until a few minutes ago when I figured out the cache mechanism: when I added my fix the code ran fine, but that left behind a cache file, so afterwards it also ran fine without the fix, as if the problem weren't there anymore.
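In case it helps anyone else reproduce this, here is a rough sketch of getting back to a cold cache. The cache file name pattern and location below are assumptions (they may differ between client versions), so point the glob at wherever your client actually writes its discovery cache:

    import glob
    import os
    import tempfile

    # Remove the dynamic client's discovery cache so the next run re-queries the
    # API server instead of reusing the cached resource list.
    # NOTE: the 'osrcp*.json' pattern and the temp-dir location are guesses and
    # may not match your client version; adjust the glob to your actual cache file.
    for cache_file in glob.glob(os.path.join(tempfile.gettempdir(), 'osrcp*.json')):
        print("removing %s" % cache_file)
        os.remove(cache_file)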

I still don't understand why we've never hit this before. I'd like to understand that before deciding on a fix.

Can you distill this down to a more minimal test case that we could actually run? (i.e. without the when or with_items)

The problem is not in the data I'm passing to the endpoint, but in the cache construction mechanism of the openshift client when it talks to my cluster. I put a debug line in the client file, like this:

    def get_resources_for_api_version(self, prefix, group, version, preferred):
        """ returns a dictionary of resources associated with provided groupVersion"""

        resources = {}
        subresources = {}

        path = '/'.join(filter(None, [prefix, group, version]))
        resources_response = load_json(self.request('GET', path))['resources']

        print("resources_response: ", resources_response, "path", path, "data:", self.request('GET', path).data)
        """ That's my debug line """

        resources_raw = list(filter(lambda resource: '/' not in resource['name'], resources_response))
        """ Code goes on """

That way I could see that get_resources_for_api_version is called several times, once per API path: api/v1, apis/apps/v1alpha1, etc.

One of those API paths has a null resources property, and I found exactly which one:

# curl -s "http://localhost:8080/apis/batch/v2alpha1" | python -m json.tool
{
    "apiVersion": "v1",
    "groupVersion": "batch/v2alpha1",
    "kind": "APIResourceList",
    "resources": null
}

I'm testing against an old Kubernetes version (v1.3.10), so I bet that's also part of the problem. Perhaps that's why you never had to deal with this: maybe you've always tested against newer/different versions, which behave differently. But I don't understand what that "null" value is doing there, or even whether it's normal.
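For completeness, here's a rough sketch (Python 2, matching the environment in the traceback) of how one could scan every groupVersion for the same symptom, using the same unauthenticated local endpoint as the curl above. The http://localhost:8080 address is just what my cluster exposes; adjust as needed:

    # Walk every groupVersion on the unauthenticated endpoint and report any
    # whose 'resources' field comes back as null.
    import json
    import urllib2

    BASE = "http://localhost:8080"

    def get_json(path):
        return json.load(urllib2.urlopen(BASE + path))

    group_versions = ["/api/v1"]
    for group in get_json("/apis")["groups"]:
        for version in group["versions"]:
            group_versions.append("/apis/" + version["groupVersion"])

    for gv in group_versions:
        if get_json(gv).get("resources") is None:
            print("null resources for: " + gv)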

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.
