argoproj-labs/old-argo-dataflow

Step retrying even after updating to not retry

Closed this issue · 5 comments

Using 0.1.0

I deployed a new pipeline with the default retry strategy and hit some errors inside my container, which caused Dataflow to keep retrying those particular messages.
I updated my spec to the snippet below and expected the retries to stop, but they did not stop for that message.
Is this expected behaviour?

Step spec:

spec:
  steps:
  - container:
      image: abc
    name: enrich-query
    scale:
      desiredReplicas: limit(currentReplicas + pendingDelta / (60 * 250) + pending
        / (10 * 60 * 250), 1, 3, 1)
      peekDelay: |-
        "20m"
      scalingDelay: |-
        "1m"
    sources:
    - kafka:
        name: TOPIC
        topic: TOPIC
        brokers:
          - 'kafka-headless.kafka.svc.cluster.local:9092'
        startOffset: Last
      retry:
        steps: 0
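
For clarity, the only part of the spec that should be relevant here is the per-source retry block; the excerpt below just restates the lines above (field names may differ in other versions):

sources:
- kafka:
    name: TOPIC
    topic: TOPIC
  retry:
    steps: 0   # intent: give up immediately, no retries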

Referenced: https://raw.githubusercontent.com/argoproj-labs/argo-dataflow/main/examples/301-erroring-pipeline.yaml

Even after creating a new pipeline with the above spec, retries still happen.

Logs from the init container inside the Step:

time="2021-10-21T12:40:30Z" level=info msg=cpu numCPU=8
time="2021-10-21T12:40:30Z" level=info msg=process pid=1
time="2021-10-21T12:40:30Z" level=info msg=version version=v0.1.0
time="2021-10-21T12:40:30Z" level=info msg="not enabling pprof debug endpoints"
time="2021-10-21T12:40:30Z" level=info msg="copying binary" name=/var/run/argo-dataflow/kill
time="2021-10-21T12:40:30Z" level=info msg="copying binary" name=/var/run/argo-dataflow/prestop
time="2021-10-21T12:40:30Z" level=info msg="creating authorization file"
time="2021-10-21T12:40:30Z" level=info msg="creating out fifo"
time="2021-10-21T12:40:30Z" level=info msg=cloning checkout=/var/run/argo-dataflow/checkout url="git@github.com:atlanhq/marketplace-scripts"
time="2021-10-21T12:40:30Z" level=info msg="getting secret for auth" SSHPrivateKeySecret="{\"name\":\"git-ssh\",\"key\":\"private-key\"}"
Enumerating objects: 117, done.
Counting objects: 100% (117/117), done.
Compressing objects: 100% (100/100), done.
Total 117 (delta 6), reused 68 (delta 0), pack-reused 0
time="2021-10-21T12:40:34Z" level=info msg="moving checked out code" path=/var/run/argo-dataflow/checkout/marketplace_scripts/query_enrichment wd=/var/run/argo-dataflow/wd

Logs from Sidecar:

time="2021-10-21T12:58:10Z" level=info msg=retry backoff="{\"Duration\":819200000000,\"Factor\":2,\"Jitter\":0.1,\"Steps\":7,\"Cap\":0}" source=default
time="2021-10-21T12:58:10Z" level=info msg="failed to send process message" backoffSteps=7 err="HTTP request failed: \"500 Internal Server Error\" \"Got an unexpected exception:
 KeyError, 'queryMetadata'\"" giveUp=false source=default
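
For reference, decoding the backoff JSON in that retry line (assuming it maps to the Kubernetes wait.Backoff fields, with durations serialized in nanoseconds):

  Duration: 819200000000 ns  (~819 s, ~13.7 min delay at this point)
  Factor:   2                (delay multiplied by 2 each attempt)
  Jitter:   0.1              (up to 10% random jitter added to the delay)
  Steps:    7                (attempts remaining before giveUp=true is logged)
  Cap:      0                (no upper bound on the delay)

So despite retry.steps: 0 in the spec above, the sidecar still appears to be running a 7-step exponential backoff for this message.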

So giveUp=true is logged when we give up retrying. It seems that your step has a retry with 7 steps?
