[Bug] Operation was canceled when start_workflow

Question

Opened this issue 23 days ago · 7 comments

duy-nguyen-ts commented 23 days ago

What are you really trying to do?

Hi team, I am having an issue when calling start_workflow and signal_workflow.

Describe the bug

It happens when I call the start_workflow method. Maybe it cannot connect to Temporal to create the workflow, and it returns temporal_sdk_bridge.RPCError: (1, 'operation was canceled', b'')
I started 10 workflows; 6 succeeded and 4 failed with the canceled error.
I want to know why this happens. Is it due to the network or something else? How can I fix it? For example, by adding a retry policy when calling start_workflow?

Environment/Versions

OS and processor: Mac M2
Temporal Version: ^1.6.0
Are you using Docker or Kubernetes or building Temporal from source: Using Docker

Additional context

Answer 1 · 2024-09-10T03:57:25.000Z

I checked my logs again, and this error also happens when I signal a workflow.

Answer 2 · 2024-09-10T04:27:33.000Z

After tracing this issue, I saw that it happens at this line, possibly when the SDK makes an RPC call to Temporal.

Answer 3 · 2024-09-10T12:11:19.000Z

Can you replicate this reliably? If so, can you alter a sample to show how to replicate? And is it against Temporal Cloud or a self-hosted server? We are releasing a fix in the next couple of days for a similar error at temporalio/sdk-core#807, but we believe that only affected 1.7.0.

Answer 4 · 2024-09-11T02:43:03.000Z

Hi @cretz, thanks for your reply. I am using a self-hosted Temporal server, and I can't replicate it reliably: sometimes it happens and sometimes it doesn't. I investigated and I assume it is caused at the point shown in the image above. I have now added a retry around the start_workflow call; the error still happens, but less often than before. My code is simple, something like this:

from temporalio.client import Client

# Create a client with connect
temporal_client = await Client.connect(target_host=..., namespace=...)
# Call start_workflow (possibly many calls at the same time)
handler = await temporal_client.start_workflow(workflow, args=[arg], id="workflow_id", task_queue="task_queue")
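
For readers wondering what "added retry when call start_workflow" might look like, here is a minimal sketch using only asyncio. It is not the code from this issue and not an SDK feature: retry_async, TransientRPCError, and make_flaky_start are hypothetical names made up for illustration, and the fake start call merely simulates the "operation was canceled" failure.

```python
import asyncio
import random


class TransientRPCError(Exception):
    """Hypothetical stand-in for a transient RPC failure (not an SDK class)."""


async def retry_async(make_call, *, attempts=5, base_delay=0.1, retry_on=(Exception,)):
    """Await make_call() up to `attempts` times with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Double the delay on each attempt and add jitter to spread out retries.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))


def make_flaky_start(failures):
    """Build a fake start call that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    async def fake_start():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientRPCError("operation was canceled")
        return "workflow_id"

    return fake_start, state


if __name__ == "__main__":
    fake_start, state = make_flaky_start(failures=2)
    result = asyncio.run(
        retry_async(fake_start, base_delay=0.01, retry_on=(TransientRPCError,))
    )
    print(result, "after", state["calls"], "calls")
```

In real code, make_call would presumably be something like `lambda: temporal_client.start_workflow(...)`, and retry_on would be whatever exception type the call raises in your environment. One caveat: if an earlier attempt actually reached the server, a retried start with the same workflow ID may be rejected as already started, so the caller should be ready to handle that case.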

Answer 5 · 2024-09-11T02:45:11.000Z

I am using version 1.6.0, so it may not be related to temporalio/sdk-core#807.

Answer 6 · 2024-09-11T13:10:05.000Z

> I am using Temporal as self-hosted server and I can't always replicate it, sometime it happened and not

Even if it takes a minute to replicate, any replication would help us debug.

I am afraid there's not much to go on here. We have many samples/users starting hundreds/thousands of workflows without any issues on self-hosted servers. Can you make sure you're not doing something like accidentally blocking the thread in an async def call, thereby causing asyncio to stop working properly?
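
To illustrate what "blocking the thread in an async def call" means, here is a small self-contained sketch. It is an illustration only, unrelated to the reporter's actual code; heartbeat, blocking_handler, non_blocking_handler, and count_ticks are hypothetical names. It counts how often a background task gets to run while a handler is in progress.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01):
    """Record a timestamp every `interval` seconds while the loop is free to run."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_handler():
    # Anti-pattern: time.sleep blocks the event loop thread, so no other task runs.
    time.sleep(0.2)


async def non_blocking_handler():
    # Offload the blocking call to a worker thread so the loop keeps running.
    await asyncio.to_thread(time.sleep, 0.2)


async def count_ticks(handler):
    """Run `handler` alongside a heartbeat task and count ticks recorded meanwhile."""
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # yield once so the heartbeat records its first tick
    start = len(ticks)
    await handler()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(ticks) - start


if __name__ == "__main__":
    print("ticks while blocked:", asyncio.run(count_ticks(blocking_handler)))
    print("ticks while offloaded:", asyncio.run(count_ticks(non_blocking_handler)))
```

With the blocking handler the heartbeat never gets a turn, which is the kind of starvation that can make in-flight calls on the same loop stall or time out. Running the program with `asyncio.run(main(), debug=True)` makes asyncio log callbacks that take longer than 100 ms by default, which can help locate such calls.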

Answer 7 · 2024-09-13T02:49:53.000Z

OK @cretz, thank you for your response. I will continue to monitor it 😄