thoth-station/integration-tests

RuntimeError: provenance-checker took too much time to finish on stage

mayaCostantini opened this issue · 15 comments

Describe the bug
Provenance-checker is taking too much time to finish for both scenarios tested (When thamos provenance-check is run for provenance_flask / provenance_flask_error asynchronously).

To Reproduce
See latest integration tests report for stage.

Expected behavior
Provenance-checks finish on time.
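For context, the failing step waits on the workflow with a bounded timeout and raises the RuntimeError above when it expires. A minimal sketch of such a polling step (the helper name and defaults here are hypothetical, not the actual features/steps/provenance_check.py code):

import time

def wait_for_workflow(check_finished, timeout_seconds=1800, poll_interval=10):
    # check_finished is a hypothetical callable that queries the workflow status.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if check_finished():
            return
        time.sleep(poll_interval)
    raise RuntimeError("provenance-checker took too much time to finish on stage")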

Could the pod issues in the cluster (pods not spawned) be the reason behind this?

Yes, that's probably the reason, as this error seems to have appeared at the same time as the cluster issue.

Let's keep this open for now and I will close it when the corresponding tests are green again 👍🏻 Thanks @fridex!
/kind bug

The staging cluster runs data ingestion workloads such as solver, security-indicator, and rev-solver runs. The whole middle-tier namespace is occupied by these workloads most of the time.
The provenance-check test would also execute in this namespace; since the namespace is occupied, the Kafka message is kept in the pending state before getting consumed (a quick way to check this is sketched below).
We should turn off the provenance-check test in the stage cluster and proceed with its testing in the prod cluster.
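One way to confirm that messages are piling up unconsumed is to compare the partition's end offset with the consumer group's committed offset. A rough sketch with kafka-python; the broker address, topic name, and group id are placeholders, not the deployment's actual values:

from kafka import KafkaConsumer, TopicPartition

# Placeholder broker/topic/group; the real values live in the deployment config.
consumer = KafkaConsumer(
    bootstrap_servers="kafka.example.com:9092",
    group_id="investigator",
    enable_auto_commit=False,
)
tp = TopicPartition("provenance-checker", 0)
consumer.assign([tp])

end = consumer.end_offsets([tp])[tp]     # newest offset in the partition
committed = consumer.committed(tp) or 0  # last offset the group has consumed
print(f"pending messages: {end - committed}")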

Provenance checks are executed in the backend namespace, similarly to adviser runs.

Oh, my bad. I will investigate what is hindering its execution, then.

Investigating: triggering a provenance run via the CLI.
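As a reference point, the trigger can be reproduced by hand through the thamos CLI. A sketch, assuming a configured .thoth.yaml and a Pipfile/Pipfile.lock in the working directory; the exact flags the integration tests pass may differ:

import subprocess

# Kick off the provenance check; --no-wait is assumed here so the call
# returns without blocking on results.
result = subprocess.run(
    ["thamos", "provenance-check", "--no-wait"],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)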

Checking the whole pipeline:

  1. Provenance triggered from integration-tests:
    (screenshot: provenance-triggered)

  2. user-api receives the request successfully and passes the message to Kafka:
    (screenshot: post-provenance)

  3. The message is available in Kafka and ready for consumption:
    (screenshot: kafka-provenance-message)

  4. Investigator logs don't show anything regarding this, so checking the investigator metrics:
    (screenshot: metric-na)
    Metrics are not showing up; investigating the root cause.

  5. No workflow triggered yet for provenance:

oc get wf -n thoth-backend-stage
NAME                                                   STATUS      AGE
kebechet-administrator-220413103926-26b72118299451de   Succeeded   3m54s
kebechet-administrator-220413104044-d156acf22dd72a70   Running     2m36s
kebechet-administrator-220413104144-207555c22e7706c1   Running     96s
kebechet-administrator-220413104243-55498c868494c205   Running     37s
kebechet-administrator-220413104303-91ecc6ae9958dc29   Running     17s

Metrics are back after syncing the investigator deployment:
http://investigator-faust-web-thoth-infra-stage.apps.ocp4.prod.psi.redhat.com/metrics

All metrics look good; judging from them, the consumption of the provenance message is not halted.
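To sanity-check the counters directly, the Prometheus endpoint above can be scraped and filtered. A quick sketch; the filtering is ad hoc, and the actual metric names are whatever investigator exports:

import requests

# Fetch the raw Prometheus exposition text and keep provenance-related series.
resp = requests.get(
    "http://investigator-faust-web-thoth-infra-stage.apps.ocp4.prod.psi.redhat.com/metrics",
    timeout=10,
)
resp.raise_for_status()
for line in resp.text.splitlines():
    if "provenance" in line and not line.startswith("#"):
        print(line)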

I have tested this in the prod setup; it works without any issue:

Feature: Provenance checks of Python software stacks # features/thamos_provenance_check.feature:1

  @seizes_backend_namespace @provenance
  Scenario Outline: Run provenance check for a Python software stack -- @1.1 Provenance  # features/thamos_provenance_check.feature:14
    Given deployment is accessible using HTTPS                                           # features/steps/common.py:26 1.981s
    When thamos provenance-check is run for provenance_flask asynchronously              # features/steps/provenance_check.py:32 2.582s
    Then wait for provenance-checker to finish successfully                              # features/steps/provenance_check.py:51 8.684s
    Then I should be able to retrieve provenance-checker results                         # features/steps/provenance_check.py:84 1.177s
    Then I should be able to see successful provenance check results                     # features/steps/provenance_check.py:97 0.001s

  @seizes_backend_namespace @provenance
  Scenario Outline: Run provenance check for a Python software stack -- @1.2 Provenance  # features/thamos_provenance_check.feature:15
    Given deployment is accessible using HTTPS                                           # features/steps/common.py:26 1.565s
    When thamos provenance-check is run for provenance_flask_error asynchronously        # features/steps/provenance_check.py:32 2.505s
    Then wait for provenance-checker to finish successfully                              # features/steps/provenance_check.py:51 8.581s
    Then I should be able to retrieve provenance-checker results                         # features/steps/provenance_check.py:84 1.424s
    Then I should be able to see failed provenance check results  
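For reference, these two scenarios can be re-run in isolation with behave's tag filtering. A sketch, as the repository's actual test entrypoint may wrap this differently:

import subprocess

# Run only the @provenance-tagged scenarios from the feature file above.
subprocess.run(
    ["behave", "--tags=@provenance", "features/thamos_provenance_check.feature"],
    check=True,
)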

Still investigating on stage.

The Kafka topics in stage were outdated; I updated them by recreating them.
This allowed the investigator to provision the provenance-check.
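Recreating a topic can be scripted with kafka-python's admin client. A sketch under the assumption of a reachable broker and a known topic name, both placeholders here; the partition and replication numbers are illustrative, not the deployment's settings:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="kafka.example.com:9092")

# Drop the stale topic, then recreate it; deletion is asynchronous on the
# broker, so a real script should wait before recreating.
admin.delete_topics(["provenance-checker"])
admin.create_topics(
    [NewTopic(name="provenance-checker", num_partitions=1, replication_factor=1)]
)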

oc get wf | grep provenance-checker
provenance-checker-220418165642-aa82ea3ac80f787e                   

(screenshot: provenance)

This is fixed in stage.
Please take your time in verifying this.

The thoth-backend-stage namespace gets busy with the kebechet-administrator runs.
Please check the Argo workflows scheduled; you can use this command: oc get wf -n thoth-backend-stage
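Since Argo workflows are Kubernetes custom resources, the same listing can be done programmatically. A sketch with the official Kubernetes Python client, assuming a kubeconfig (or in-cluster) context with access to the namespace:

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

workflows = api.list_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="thoth-backend-stage",
    plural="workflows",
)
for wf in workflows["items"]:
    name = wf["metadata"]["name"]
    phase = wf.get("status", {}).get("phase", "Unknown")
    print(f"{name}\t{phase}")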

I scheduled a provenance-check in the stage cluster and it completed successfully. The workflow is also present in the thoth-backend-stage namespace. Thanks a lot Harshad for the investigation 💯 👍

@fridex would you like to verify that everything is working again before closing the issue?

I can confirm the provenance-checks now work properly. Hence closing this issue. Fixed. Thanks! 👍🏻

/close

@fridex: Closing this issue.
