SeldonIO/seldon-core

Seldon timeout even with annotation set to high value

samuel-sujith opened this issue · 7 comments

Describe the bug

I have set the below values in my Seldon deployment annotations:

spec:
  annotations:
    seldon.io/rest-timeout: "100000"
    seldon.io/grpc-timeout: "100000"

The Seldon deployment generates Istio virtual services for the model. How can I add timeout details to these virtual services, or to the Istio settings? If I try to manually add timeouts to the virtual services created by Seldon, the operator overwrites them back to the defaults.
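Roughly, the field I am trying to set by hand is the route-level timeout on the generated VirtualService; a sketch only, with placeholder names rather than the actual generated ones:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-model-http
spec:
  hosts:
    - "*"
  http:
    - route:
        - destination:
            host: my-model-default
      timeout: 100s  # the field the operator keeps reverting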

Expected behaviour

I have a Seldon model API which typically returns a response after 70-80 seconds. I need the response to come back without the timeout error below:
samuel@samuel-dev-vm001:~/repos/dbg$ python inference-llm.py
Elapsed time: 60.62758827209473
<Response [504]>

Environment

k8s 1.23 with Seldon 1.16.0

      value: docker.io/seldonio/seldon-core-executor:1.16.0-dev
    image: docker.io/seldonio/seldon-core-operator:1.16.0-dev

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"4c9411232e10168d7b050c49a1b59f6df9d7ea4b", GitTreeState:"clean", BuildDate:"2023-04-14T13:21:19Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.13", GitCommit:"49433308be5b958856b6949df02b716e0a7cf0a3", GitTreeState:"clean", BuildDate:"2023-04-12T12:08:36Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.24) exceeds the supported minor version skew of +/-1
samuel@samuel-dev-vm001:~$

Model Details

No issue with the model itself; the timeout occurs only while the response is coming back from the model API.

agrski commented

Hi @samuel-sujith,

According to these docs, the timeout field in Istio HTTP routes is disabled by default. See also this example for further confirmation.

The SDep controller in Core v1 does not seem to set a timeout for these routes; the function is defined here for reference. It does set an idle timeout of 60 seconds, but according to the docs, this only applies when there are no active connections, which should not be the case here.

Similarly, there is a default timeout of 1 second per retry, but if that were the thing causing issues, then it'd be much shorter than the 60 seconds you're seeing.

Are you able to determine more precisely where this timeout is coming from, as it doesn't seem likely to be Istio, unless you have some additional config active?

samuel-sujith commented

I have tried the below also:

spec:
  annotations:
    seldon.io/rest-timeout: "40000"
    seldon.io/grpc-timeout: "40000"

and the timeout then comes after 40 seconds:

samuel@samuel-dev-vm001:~/repos/dbg$ python inference-llm.py
Elapsed time: 40.64111614227295
<Response [500]>

Hence I am pretty sure it's the Seldon timeout settings that are the problem: setting them to 100000 makes the timeout occur at 60 seconds.

For more clarity on my situation, I have wrapped an LLM inside an MLServer custom runtime.

Below is the MLServer code (model_name and access_token are defined elsewhere in my script):

import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

from mlserver import MLModel
from mlserver.codecs import decode_args


class MyCustomRuntime(MLModel):
    """
    Model template. You can load your model parameters in load() from a
    location accessible at runtime.
    """

    async def load(self) -> bool:
        """
        Add any initialization parameters. These will be passed at runtime
        from the graph definition parameters defined in your
        seldondeployment kubernetes resource manifest.
        """
        print("Initializing.................")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, device_map="auto", load_in_4bit=True, token=access_token
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, use_fast=True, token=access_token
        )
        return True

    @decode_args
    async def predict(self, chat: np.ndarray) -> np.ndarray:
        print(chat[0])
        X = self.tokenizer(chat[0], return_tensors="pt").to("cuda:0")
        y = self.model.generate(**X)
        y = self.tokenizer.decode(
            y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        print("answer ", y)
        return np.asarray([y])
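For reference, this runtime is wired up through MLServer's model-settings.json; a minimal sketch, assuming the class above lives in a file called models.py (the names here are illustrative):

{
    "name": "my-llm",
    "implementation": "models.MyCustomRuntime"
}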

I have done the same with the 7B model and it returns fine within 24 seconds without any issue. Since the 70B model is bigger and needs more than 70 seconds to respond, I get this timeout.

agrski commented

There could be multiple timeouts interacting, such that one being large enough means another comes into force.

I can't find anything that looks relevant in the Core v1 operator or executor - mostly no timeouts are given, i.e. components will wait indefinitely.

What happens if you run that MLServer model directly, outside Core? If that succeeds, could you try running inference against the SDep without going via Istio?
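As a rough sketch of what I mean by bypassing Istio (the names and port below are placeholders, and the exact path depends on your configured protocol):

# Port-forward straight to the SDep's service, skipping the Istio gateway
# (check the real service name and port with `kubectl get svc -n <namespace>`)
kubectl port-forward -n <namespace> svc/<sdep-name>-<predictor-name> 8000:8000

# Then hit the inference endpoint directly
curl -s http://localhost:8000/v2/models/<model-name>/infer \
  -H 'Content-Type: application/json' \
  -d '{"inputs": [{"name": "chat", "shape": [1], "datatype": "BYTES", "data": ["hello"]}]}'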

Without being able to pin this down, it won't be possible to resolve the issue.

samuel-sujith commented

Another thing to note here is that when I try to set the timeout variable in my Python requests module, it behaves correctly for 1 sec, 5 secs and 10 secs. But when I set it to 100, the 504 comes after 60 secs.
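Roughly what my client looks like (the URL and payload here are placeholders, not my actual script):

import time

import requests

# Placeholder endpoint and payload; substitute your own
url = "http://<ingress-host>/seldon/<namespace>/<sdep-name>/v2/models/<model-name>/infer"
payload = {
    "inputs": [
        {"name": "chat", "shape": [1], "datatype": "BYTES", "data": ["some prompt"]}
    ]
}

start = time.time()
# timeout=100 is only the client-side limit; the 504 at ~60 s
# comes from the server side long before this expires
response = requests.post(url, json=payload, timeout=100)
print("Elapsed time:", time.time() - start)
print(response)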

Now, I built my own Seldon image, hard-coding the timeout in the Istio virtual service to 100s, and the model now times out after 100 seconds.

Next I am trying to expose MLServer directly, without Seldon or Istio. I will post the results of that here too.

Running MLServer directly also gives the same timeout at 60 seconds:
samuel@samuel-dev-vm001:~/repos/dbg$ python inference-llm.py
Elapsed time: 60.90878462791443
<Response [504]>

Is there any way to increase this setting in MLServer?

I think I got the culprit:

@click.option(
    "--timeout",
    default=60,
    envvar="MLSERVER_INFER_CONNECTION_TIMEOUT",
    help="Connection timeout to be passed to tritonclient.",
)

This piece of code in MLServer sets the default to 60.

MLServer/mlserver/cli, line number 181

I will try setting this env var to 200 and see; hopefully this should resolve it.

mlserver start does not have a --timeout option, so I am back to square one.