temporalio/sdk-python

[Bug] Operation was canceled when start_workflow

Opened this issue · 7 comments

What are you really trying to do?

  • Hi team, I am having an issue when calling start_workflow and signal_workflow.

Describe the bug

  • It happens when I call start_workflow. The client may be failing to reach Temporal to create the workflow; it raises temporal_sdk_bridge.RPCError: (1, 'operation was canceled', b'')
  • I started 10 workflows: 6 succeeded and 4 failed with the canceled error.
  • I want to know why this happens. Is it due to the network or something else? How can I fix it, e.g. by adding a retry policy to start_workflow?
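
On the retry question, a minimal sketch of a client-side retry wrapper with jittered exponential backoff is shown here. It is not part of the Temporal SDK: `call_with_retry` and all of its parameters are hypothetical names chosen for illustration, and it treats whatever exception types are passed in `retry_on` as transient. It only reduces how often the error surfaces; it does not explain the cancellation.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def call_with_retry(
    make_call: Callable[[], Awaitable[T]],
    *,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    initial_delay: float = 0.1,
    backoff: float = 2.0,
    max_delay: float = 5.0,
) -> T:
    """Await make_call(), retrying with jittered exponential backoff.

    make_call must build a fresh awaitable on every invocation, because a
    coroutine object cannot be awaited twice.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Sleep between 50% and 100% of the current delay, then grow it.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
            delay = min(delay * backoff, max_delay)
    raise AssertionError("unreachable")
```

With the client from this issue, the call would presumably look like `await call_with_retry(lambda: temporal_client.start_workflow(...), retry_on=(RPCError,))`, assuming the failure surfaces as `temporalio.service.RPCError`; that has not been verified against this setup. If a failed attempt did reach the server, a retry with the same workflow id may fail with an "already started" error instead, so that case likely needs separate handling.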

Environment/Versions

  • OS and processor: Mac M2
  • Temporal Version: ^1.6.0
  • Are you using Docker or Kubernetes or building Temporal from source: Using Docker

Additional context

I checked my logs again; this error also happens when I signal a workflow.

After tracing the issue, I saw that it happens at this line, possibly while making an RPC call to Temporal:
[Image: Screenshot 2024-09-10 at 11 25 47]

cretz commented

Can you replicate this reliably? If so, can you alter a sample to show how to replicate? And is it against Temporal cloud or self-hosted server? We are releasing a fix in the next couple of days for a similar error at temporalio/sdk-core#807, but we believe that only affected 1.7.0.

Hi @cretz, thanks for your reply. I am using a self-hosted Temporal server and I can't always replicate it; sometimes it happens and sometimes it doesn't. I investigated and believe it is caused at the point shown in the image above. I have since added a retry around the start_workflow call; the error still happens, but less often than before. My code is simple, roughly like this:

  • Create a client with connect
    temporal_client = await Client.connect(target_host=..., namespace=...)
  • Call start_workflow (possibly many calls at the same time)
    handler = await temporal_client.start_workflow(workflow, args=[arg], id="workflow_id", task_queue="task_queue")
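
For turning the steps above into a reproduction, one possible approach is to launch the starts concurrently and record which ones fail rather than stopping at the first exception. The sketch below is only an illustration: `start_many` and `start_one` are hypothetical names, and `start_one` stands in for a wrapper around `temporal_client.start_workflow` that uses a distinct workflow id per call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def start_many(
    start_one: Callable[[str], Awaitable[object]],
    workflow_ids: List[str],
) -> Tuple[Dict[str, object], Dict[str, BaseException]]:
    """Run start_one(id) for every id concurrently; split results from errors."""
    outcomes = await asyncio.gather(
        *(start_one(wf_id) for wf_id in workflow_ids),
        return_exceptions=True,  # collect failures instead of aborting the batch
    )
    started: Dict[str, object] = {}
    failed: Dict[str, BaseException] = {}
    # gather() preserves input order, so outcomes line up with workflow_ids.
    for wf_id, outcome in zip(workflow_ids, outcomes):
        if isinstance(outcome, BaseException):
            failed[wf_id] = outcome
        else:
            started[wf_id] = outcome
    return started, failed
```

Logging `len(started)`, `len(failed)`, and the `repr` of each failure per run would show whether the "6 succeeded, 4 canceled" split is stable or varies between runs.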

I am using version 1.6.0, so it may not be the same issue as temporalio/sdk-core#807.

cretz commented

I am using a self-hosted Temporal server and I can't always replicate it; sometimes it happens and sometimes it doesn't

Even if it takes a minute to replicate, any replication would help us debug.

I am afraid there's not much to go on here. We have many samples/users starting hundreds/thousands of workflows without any issues on self-hosted servers. Can you make sure you're not doing something like accidentally blocking the thread in an async def call thereby causing asyncio to stop working properly?
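
A minimal sketch of what "blocking the thread in an async def" looks like, using only the standard library. All function names here are hypothetical. A background task counts how often the event loop gets to run it while a handler is in progress: with a synchronous `time.sleep` inside `async def`, the loop is starved; with the same call moved to a worker thread, it keeps running. Whether this is what causes the canceled RPCs in this issue is not established.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def count_ticks(stop: asyncio.Event, interval: float = 0.01) -> int:
    """Count how many times the event loop gets to run this task."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(interval)
    return ticks


async def blocking_handler(seconds: float) -> None:
    # Synchronous sleep inside async def: blocks the whole event loop thread.
    time.sleep(seconds)


async def offloaded_handler(seconds: float) -> None:
    # Same blocking call, run in a worker thread so the loop stays responsive.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, seconds)


async def ticks_during(
    handler: Callable[[float], Awaitable[None]], seconds: float = 0.2
) -> int:
    """Return how many ticks the background task managed while handler ran."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # yield once so the ticker can start
    await handler(seconds)
    stop.set()
    return await ticker


if __name__ == "__main__":
    print("blocking :", asyncio.run(ticks_during(blocking_handler)))
    print("offloaded:", asyncio.run(ticks_during(offloaded_handler)))
```

Running the application with asyncio debug mode enabled (for example `asyncio.run(main(), debug=True)`) also logs callbacks that hold the loop longer than the slow-callback threshold, which can help locate such calls in real code.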

OK @cretz, thank you for your response. I will continue to monitor it 😄