coryodaniel/k8s

Surprising result when watching a resource

ebengt opened this issue · 24 comments

Greetings,

Erlang/OTP 24 [erts-12.3]
Interactive Elixir (1.12.3)
The documentation (K8s 1.1.0) makes me think that for watch/3 the current resource version will be looked up automatically.
What am I doing wrong?

More information: When I get the configmap directly with K8s.Client.run(connection, get_configmap) it has
<<"resourceVersion">> => <<"153760003">>,
Is there another way to get the current resource version?
Am I getting the wrong resource ?

{:ok, connection} = K8s.Conn.from_service_account()
get_configmap = K8s.Client.get("v1", :configmap, namespace: "x", name: "x-config")
{:ok, _} = K8s.Client.watch(connection, get_configmap, stream_to: Kernel.self())
result = loop_receive(:ignore)

Result is
{:ok, #{<<"apiVersion">> => <<"v1">>,<<"code">> => 410,
<<"kind">> => <<"Status">>,
<<"message">> =>
<<"too old resource version: 153760003 (153768835)">>,
<<"metadata">> => #{},<<"reason">> => <<"Expired">>,
<<"status">> => <<"Failure">>}}

  defp loop_receive(result) do
    receive do
      %HTTPoison.AsyncStatus{code: 200} -> loop_receive(result)
      %HTTPoison.AsyncStatus{code: c} -> {:error, c}
      %HTTPoison.AsyncHeaders{} -> loop_receive(result)
      %HTTPoison.AsyncChunk{chunk: c} -> loop_receive(Jason.decode(c))
      %HTTPoison.AsyncEnd{} -> result
    end
  end

Hi @ebengt

This should indeed be handled better by the library! However, I think this case boils down to a basic misunderstanding of the Kubernetes watch API. While working on #121, I had to read the API concepts up and down to figure out that you can't watch a single resource; you can only watch a list of resources. E.g. if you change your get_configmap to the following, you watch changes on all configmaps in the default namespace:

get_configmap = K8s.Client.list("v1", :configmap, namespace: "default")

What this library does is try to turn a get into a list operation. In your case this does not seem to work correctly. Maybe watch should only accept list operations instead of trying to sanitize... (@coryodaniel ?)

I am sorry. I do not know the API. I will try with a list instead.

Actually, I think the lib does the right thing and it should work even for a single resource. (The lib turns the get into a list operation and uses a field selector to "filter" the resources by name.) I'm not sure why it's not working in your case. But your issue led me to a bug in watch_and_stream (which is the same as watch but returns an Elixir Stream).
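For illustration, that internal conversion can be sketched by hand. This is a hedged approximation, not the library's actual code; in particular, it assumes watch/3 forwards a :params option to the request the way run/3 does, and the namespace/name are placeholders:

```elixir
# Hand-rolled version of what the lib does for a single resource: watch
# the ConfigMap *list*, narrowed down to one name via a field selector.
# Assumption: watch/3 accepts a :params option like run/3 does.
{:ok, connection} = K8s.Conn.from_service_account()

list_op = K8s.Client.list("v1", :configmap, namespace: "x")

{:ok, _} =
  K8s.Client.watch(connection, list_op,
    params: [fieldSelector: "metadata.name=x-config"],
    stream_to: self()
  )
```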

You can try the following with the latest commit on the develop branch:

{:ok, connection} = K8s.Conn.from_service_account()
get_configmap = K8s.Client.get("v1", :configmap, namespace: "x", name: "x-config")
{:ok, stream} = K8s.Client.watch_and_stream(connection, get_configmap) 
stream |> Stream.map(&IO.inspect/1) |> Stream.run()     

Thank you very much for such a quick update.
Would it be possible, for the sake of education, to ask a follow-up to my original question? New code:

    {:ok, connection} = K8s.Conn.from_service_account()
    list_configmaps = K8s.Client.list("v1", :configmap, namespace: "x")
    {:ok, _} = K8s.Client.watch(connection, list_configmaps, stream_to: Kernel.self())
    loop_receive( {connection, list_configmaps} )

I get a steady printout from loop_receive of {closed,timeout}.
Even if I change a config map in namespace x, there are only timeouts.
Perhaps I am making another simple mistake?

  defp loop_receive(result) do
    receive do
      %HTTPoison.AsyncStatus{code: 200} ->
        loop_receive(result)

      %HTTPoison.AsyncStatus{code: c} ->
        {:error, c}

      %HTTPoison.AsyncHeaders{} ->
        loop_receive(result)

      %HTTPoison.AsyncChunk{chunk: c} ->
        loop_receive(Jason.decode(c))

      %HTTPoison.AsyncEnd{} ->
        result

      %HTTPoison.Error{reason: r} ->
        :io.fwrite('error ~p\n', [r])
        {connection, list_configmaps} = result
        {:ok, _} = K8s.Client.watch(connection, list_configmaps, stream_to: Kernel.self())
        loop_receive(result)

      other ->
        :io.fwrite('loop_next ~p\n', [other])
        loop_receive(result)
    end
  end

Watch requests do time out. This is expected and should be handled by your loop_receive function, which should resume the watch.

You can see an example in the implementation of watch_and_stream
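For illustration, a minimal sketch of resuming from a receive loop like the one above. The state tuple and the error shape are assumptions based on the code earlier in this thread; note that restarting without a resource version can miss events that occur during the gap, which is exactly what the rest of this discussion is about:

```elixir
# Sketch: restart the watch when the chunked response ends or times out.
# Caveat: without passing a resource version, events that occur between
# the timeout and the restart can be missed.
defp loop_receive({connection, list_op} = state) do
  receive do
    %HTTPoison.AsyncChunk{chunk: chunk} ->
      IO.inspect(Jason.decode!(chunk))
      loop_receive(state)

    %HTTPoison.AsyncEnd{} ->
      {:ok, _} = K8s.Client.watch(connection, list_op, stream_to: self())
      loop_receive(state)

    %HTTPoison.Error{reason: {:closed, :timeout}} ->
      {:ok, _} = K8s.Client.watch(connection, list_op, stream_to: self())
      loop_receive(state)

    _other ->
      loop_receive(state)
  end
end
```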

Thank you. Please let me expand my question to:
What is wrong with the way my code tries to handle timeout? I repeat the call to watch/3.
watch_and_stream adds a resource version to call watch/4
When I do that I get: too old resource version: 154703519 (154705821)
It is the resource version of the config map that I add.

You want to keep track of the resource version in order to not miss events that occur between a timeout and a restart of the watch. Example:

  • start watching, resource version (rv) is 1
  • resource is updated => rv is 2
  • watch times out
  • resource is updated => rv is 3
  • watch resumes (gets current rv which is now 3. The update 2 => 3 was missed)
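Putting the steps above into code, a hedged sketch of a loop that carries the last seen resource version and resumes via watch/4. The argument order for watch/4 and the event shape are assumptions inferred from this thread, not verified against the library:

```elixir
# Sketch: track the resourceVersion carried by each watch event and pass
# it to watch/4 when resuming, so updates are not lost across timeouts.
# Assumes watch/4 takes the resource version as its third argument and
# that each decoded chunk looks like
# %{"type" => _, "object" => %{"metadata" => %{"resourceVersion" => _}}}.
defp loop_receive({connection, list_op, rv} = state) do
  receive do
    %HTTPoison.AsyncChunk{chunk: chunk} ->
      event = Jason.decode!(chunk)
      new_rv = get_in(event, ["object", "metadata", "resourceVersion"]) || rv
      loop_receive({connection, list_op, new_rv})

    %HTTPoison.AsyncEnd{} ->
      # resume from the last seen resource version instead of "now"
      {:ok, _} = K8s.Client.watch(connection, list_op, rv, stream_to: self())
      loop_receive(state)

    _other ->
      loop_receive(state)
  end
end
```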

One more thing: The ConfigMap has a resource version. But what you're watching is not the ConfigMap but the ConfigMapList. So you need to use the resource version of the list as reference. You get that with every event and it is easiest to just keep track of it. Or you can get it using K8s.Client.list/N:

# Resource Version of ConfigMapList
list_cm_op = K8s.Client.list("v1", "ConfigMap", namespace: "x")
{:ok, cm_list} = K8s.Client.run(connection, list_cm_op, params: [fieldSelector: "metadata.name=x-config"])
IO.puts(cm_list["metadata"]["resourceVersion"])
# Resource Version of ConfigMap
get_cm_op = K8s.Client.get("v1", "ConfigMap", namespace: "x", name: "x-config")
{:ok, cm} = K8s.Client.run(connection, get_cm_op)
IO.puts(cm["metadata"]["resourceVersion"])

Closing this issue. Let me know if it's still unclear. Also if you don't mind me asking @ebengt: Why don't you just use the watch_and_stream functionality?

Hello, sorry to comment on a closed issue but I've been struggling for days to watch job resources reliably and I was hoping to get some light here :)

As I understand it K8s.Client.watch does not and is not meant to handle any kind of disconnection. So when getting an AsyncEnd event on a job watch, I close the hackney connection and initiate a new request. Then I sometimes get the error too old resource version. I don't understand why because I don't manipulate any version number myself. For now I'm just trying to initiate a new watch when I get disconnected. I suspected that the k8s library might be maintaining some internal state, but I get this error even if I restart the beam. The documentation states that Current resource version will be looked up automatically., which made me think that I probably shouldn't have to worry about it, but clearly I'm missing something.

I have read the Kubernetes API concepts as well as this library's documentation, but I have to admit that things still aren't very clear in my mind. For example, I still wonder what the two different numbers refer to in the error message, i.e. in too old resource version: 154703519 (154705821): which one is the new version and which one is the version I'm requesting?

I understand that I should probably use watch_and_stream instead as this is meant to handle connection timeouts and 410 errors but I'm not sure how to use it. From the examples that were linked from #152, it seems that I have to run it in a separate process and somehow send the events back to the handling process, similarly to what watch does?

Greetings,

Yes, I changed to watch_and_stream. Apart from getting error logs when the process exits normally, it works fine.
#153

@ebengt Thank you for your reply. As I understand it, that requires an extra process that manages the stream and sends messages back to the parent process as side effects?

Yes, that is correct. I also link to the extra process so that I know when it exits. That way I can restart it when needed.
You will see error logs when it exits, but it is easier to have the program do the restart.
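For readers landing here, one possible way to wire this up is a linked process that runs the stream and forwards events to its parent, which restarts it on exit. Everything below (module and function names, the {:watch_event, _} message shape, the restart policy) is an assumption for illustration, not part of the library's API:

```elixir
# Sketch: run watch_and_stream in a linked Task that forwards events to
# the parent; the parent traps exits and restarts the watcher when it
# dies or the stream ends.
defmodule WatcherSketch do
  def run(connection, list_op) do
    # trap exits so the linked Task's termination arrives as a message
    Process.flag(:trap_exit, true)
    {:ok, _pid} = start_watcher(connection, list_op)
    loop(connection, list_op)
  end

  defp start_watcher(connection, list_op) do
    parent = self()

    Task.start_link(fn ->
      {:ok, stream} = K8s.Client.watch_and_stream(connection, list_op)

      stream
      |> Stream.each(&send(parent, {:watch_event, &1}))
      |> Stream.run()
    end)
  end

  defp loop(connection, list_op) do
    receive do
      {:watch_event, event} ->
        IO.inspect(event)
        loop(connection, list_op)

      {:EXIT, _pid, reason} ->
        IO.puts("watcher exited: #{inspect(reason)}; restarting")
        {:ok, _pid} = start_watcher(connection, list_op)
        loop(connection, list_op)
    end
  end
end
```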

@ebengt Thank you so much for sharing your experience! This is very valuable!

@mruoss Once I figure out how to use the k8s library properly, would you be open to a documentation patch bringing together bits of advice from issues?

I've spent more time reading the code and there's something that bothers me. From what I can see, and as @mruoss mentioned, get operations are converted to list operations, but the resource version is extracted from the get operation before that conversion happens. So if I'm not mistaken, we provide to the list request a resource version which is in fact the resource version of an object. As I understand it, in Kubernetes objects and collections have distinct resource versions, but with the current code it seems we're mixing them up. If my interpretation is correct, that would explain why we get the version mismatch error reported by this issue.
In this branch I've tried converting the get operation to a list operation higher up in the call stack, before we get the resource version, but this is causing issues in our app so I'm probably missing something.

Okay, I think I have an idea of what could be the issue at hand. Indeed, resource collections have different resource versions than resources. I may still have been confused about this when I wrote the watch_and_stream functionality. I had a very quick look at the code of the stream-resource functionality, where I keep track of the collection's resource version (initially received from the list operation). It looks like I swap out the resource version of the collection with the one of the resource here. This might lead to the "too old resource version" error later on in the process.

The problem is, k8s watch only returns events for resources together with their resource version. It does not return the current resource version of the collection.

I'm currently very short on time so it might take me some time to look into this more closely. Not sure if watch bookmarks could help out here...

@mruoss Thank you so much for taking the time to reply despite being very busy! Now that you've confirmed that there might indeed be an issue, we'll work on it with more confidence on our end and hopefully come up with a PR.

I took a go at this in #159. Does somebody care to test?

Greetings,

This correction is for watch_and_stream/3, right?
I have no code that triggers an error there.
My problem was with watch/3.

@mruoss Sorry it seems there was a bit of misunderstanding, similarly to @ebengt I had the issue with watch. Following advice given here we've now switched to using watch_and_stream and it doesn't seem to trigger the error so far. I'm letting it run over the weekend to see how it goes. Even if I don't get the error with the current release then I can test your changes at least to validate that it still works. We also still have a branch using watch that we can test on a dev server if you'd like to make changes to that function as well.

Oh I see. Now I understand.

#160 should address the problem with watch/3. Note that you can work around this problem by getting the resource version of the resource collection yourself and passing it to watch/4.

I've tested our code based on watch/3 against the develop branch of this library and I can confirm that it solves the issue, no more "too old resource version"!

Our branch based on watch_and_stream has been in production with the latest release of this library (so without your latest changes) for a week and it's working well so far.

Thanks for testing. I have seen "too old resource version" on watch_and_stream on my end. Hence the patch.
Will make a release as soon as I get to it.