RuntimeError: provenance-checker took too much time to finish on stage
mayaCostantini opened this issue · 15 comments
Describe the bug
Provenance-checker is taking too long to finish for both scenarios tested (when thamos provenance-check is run for provenance_flask / provenance_flask_error asynchronously).
To Reproduce
See latest integration tests report for stage.
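For reference, a manual reproduction of what the integration test exercises might look like the sketch below; the local provenance_flask project directory (containing the Pipfile and Pipfile.lock the scenario uses) is an assumption:
# Run a provenance check the same way the test scenario does,
# assuming a checked-out directory with the provenance_flask Pipfile/Pipfile.lock:
cd provenance_flask
thamos provenance-check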
Expected behavior
Provenance-checks finish on time.
Could the pod issues in the cluster (pods not spawned) be the reason behind this?
Yes, that's probably the reason as this error seems to have appeared at the same time as the cluster issue.
Let's keep this open for now and I will close it when the corresponding tests are green again 👍🏻 Thanks @fridex!
/kind bug
The staging cluster runs data ingestion workloads such as solver, security-indicator, and rev-solver runs. The whole middle-tier namespace is occupied by these workloads most of the time.
The provenance check test would also execute in this namespace; because the namespace is occupied, the Kafka message stays in the pending state before getting consumed.
We should turn off the provenance check test in the stage cluster and proceed with its testing in the prod cluster.
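For context, a hypothetical way to gauge how occupied the middle-tier namespace is; the namespace name below follows the thoth-*-stage naming pattern seen elsewhere in this thread and is an assumption:
# Count running pods in the middle-tier namespace (namespace name is an assumption):
oc get pods -n thoth-middletier-stage --field-selector=status.phase=Running | wc -l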
Provenance checks are executed in the backend namespace, similarly to adviser runs.
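A quick way to confirm where these runs land is the same workflow listing used later in this thread:
# List provenance-checker workflows in the backend namespace (stage):
oc get wf -n thoth-backend-stage | grep provenance-checker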
Oh, my bad, I will then investigate what is hindering its execution.
Investigating triggering a provenance run via the CLI.
Checking the whole pipeline:
- user-api receives the request successfully and passes the message to Kafka.
- The Kafka message is available in Kafka and ready for consumption.
- Investigator logs don't show anything regarding this, so checking the investigator metrics.
- Metrics are not showing up; investigating the root cause.
- No workflow triggered yet for provenance:
oc get wf -n thoth-backend-stage
NAME                                                   STATUS      AGE
kebechet-administrator-220413103926-26b72118299451de   Succeeded   3m54s
kebechet-administrator-220413104044-d156acf22dd72a70   Running     2m36s
kebechet-administrator-220413104144-207555c22e7706c1   Running     96s
kebechet-administrator-220413104243-55498c868494c205   Running     37s
kebechet-administrator-220413104303-91ecc6ae9958dc29   Running     17s
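Since no provenance workflow shows up while the message sits in Kafka, one hedged next step is to check whether the investigator's consumer group is lagging on the topic; the broker address and group name below are assumptions:
# Describe the consumer group to see per-topic offsets and lag
# (broker address and group name are assumptions):
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group investigator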
Metrics are back after syncing the investigator deployment.
http://investigator-faust-web-thoth-infra-stage.apps.ocp4.prod.psi.redhat.com/metrics
All metrics look good; judging from the metrics, consumption of the provenance message is not halted.
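To spot-check the relevant counters, the metrics endpoint quoted above can also be scraped directly; the grep pattern is an assumption about the metric names:
# Fetch the investigator's metrics and filter for provenance-related series
# (the grep pattern is an assumption about metric naming):
curl -s http://investigator-faust-web-thoth-infra-stage.apps.ocp4.prod.psi.redhat.com/metrics | grep -i provenance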
Investigator v0.16.0 is not able to consume provenance Kafka messages and schedule the workload.
http://investigator-faust-web-thoth-infra-stage.apps.ocp4.prod.psi.redhat.com/metrics
Tested with: https://stage.thoth-station.ninja/api/v1/provenance/python/provenance-checker-220413103246-afc8d63e2ac31a4f
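The result document for that analysis id can be fetched directly from the user API by querying the URL quoted above:
# Retrieve the provenance-checker result document for the analysis id above:
curl -s https://stage.thoth-station.ninja/api/v1/provenance/python/provenance-checker-220413103246-afc8d63e2ac31a4f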
I have tested this in the prod setup; it works without any issue:
Feature: Provenance checks of Python software stacks # features/thamos_provenance_check.feature:1
@seizes_backend_namespace @provenance
Scenario Outline: Run provenance check for a Python software stack -- @1.1 Provenance # features/thamos_provenance_check.feature:14
Given deployment is accessible using HTTPS # features/steps/common.py:26 1.981s
When thamos provenance-check is run for provenance_flask asynchronously # features/steps/provenance_check.py:32 2.582s
Then wait for provenance-checker to finish successfully # features/steps/provenance_check.py:51 8.684s
Then I should be able to retrieve provenance-checker results # features/steps/provenance_check.py:84 1.177s
Then I should be able to see successful provenance check results # features/steps/provenance_check.py:97 0.001s
@seizes_backend_namespace @provenance
Scenario Outline: Run provenance check for a Python software stack -- @1.2 Provenance # features/thamos_provenance_check.feature:15
Given deployment is accessible using HTTPS # features/steps/common.py:26 1.565s
When thamos provenance-check is run for provenance_flask_error asynchronously # features/steps/provenance_check.py:32 2.505s
Then wait for provenance-checker to finish successfully # features/steps/provenance_check.py:51 8.581s
Then I should be able to retrieve provenance-checker results # features/steps/provenance_check.py:84 1.424s
Then I should be able to see failed provenance check results
Still investigating on stage.
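For reference, the prod scenarios above can be re-run in isolation via their tag; the tag and feature file path are taken from the behave output above:
# Run only the provenance scenarios from the integration test suite:
behave --tags=@provenance features/thamos_provenance_check.feature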
The Kafka topics in stage were outdated; I updated them by recreating them (a sketch of that step follows the output below).
This allowed the investigator to provision the provenance check.
oc get wf | grep provenance-checker
provenance-checker-220418165642-aa82ea3ac80f787e
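The recreation mentioned above might look like the following sketch; the topic name, broker address, partition count, and replication factor are all assumptions:
# Delete and recreate an outdated topic (all names and numbers below are assumptions):
kafka-topics.sh --bootstrap-server kafka:9092 --delete --topic thoth.provenance-checker
kafka-topics.sh --bootstrap-server kafka:9092 --create --topic thoth.provenance-checker --partitions 1 --replication-factor 1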
I scheduled a provenance-check in the stage cluster and it completed successfully. The workflow is also present in the thoth-backend-stage namespace. Thanks a lot Harshad for the investigation!
@fridex would you like to verify that everything is working again before closing the issue?
I can confirm the provenance-checks now work properly. Hence closing this issue. Fixed. Thanks! 👍🏻
/close
@fridex: Closing this issue.
In response to this:
I can confirm the provenance-checks now work properly. Hence closing this issue. Fixed. Thanks! 👍🏻
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.