thoth-station/integration-tests

RuntimeError: provenance-checker took too much time to finish on stage

mayaCostantini opened this issue · 15 comments

Describe the bug
Provenance-checker is taking too much time to finish for both scenarios tested (When thamos provenance-check is run for provenance_flask / provenance_flask_error asynchronously).

To Reproduce
See latest integration tests report for stage.

Expected behavior
Provenance-checks finish on time.
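For context, the failing step waits on the workflow with a bounded timeout and raises the RuntimeError above when it expires. A minimal sketch of such a polling step (the helper name and defaults here are hypothetical, not the actual features/steps/provenance_check.py code):

import time

def wait_for_workflow(check_finished, timeout_seconds=1800, poll_interval=10):
    # check_finished is a hypothetical callable that queries the workflow status.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if check_finished():
            return
        time.sleep(poll_interval)
    raise RuntimeError("provenance-checker took too much time to finish on stage")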

Could the pod issues in the cluster (pods not spawned) be the reason behind this?

Yes, that's probably the reason, as this error seems to have appeared at the same time as the cluster issue.

Let's keep this open for now and I will close it when the corresponding tests are green again 👍🏻 Thanks @fridex!
/kind bug

The staging cluster runs data ingestion workloads such as solver, security-indicator, and rev-solver runs. The whole middle-tier namespace is occupied by these workloads most of the time.
The provenance-check test would also execute in this namespace; since the namespace is occupied, the Kafka message is kept in the pending state before getting consumed (a quick way to check this is sketched below).
We should turn off the provenance-check test in the stage cluster and proceed with its testing in the prod cluster.
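One way to confirm that messages are piling up unconsumed is to compare the partition's end offset with the consumer group's committed offset. A rough sketch with kafka-python; the broker address, topic name, and group id are placeholders, not the deployment's actual values:

from kafka import KafkaConsumer, TopicPartition

# Placeholder broker/topic/group; the real values live in the deployment config.
consumer = KafkaConsumer(
    bootstrap_servers="kafka.example.com:9092",
    group_id="investigator",
    enable_auto_commit=False,
)
tp = TopicPartition("provenance-checker", 0)
consumer.assign([tp])

end = consumer.end_offsets([tp])[tp]     # newest offset in the partition
committed = consumer.committed(tp) or 0  # last offset the group has consumed
print(f"pending messages: {end - committed}")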

Provenance checks are executed in the backend namespace, similarly to adviser runs.

Oh, my bad. I will investigate what is hindering its execution, then.

Investigating: triggering a provenance run via the CLI.
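As a reference point, the trigger can be reproduced by hand through the thamos CLI. A sketch, assuming a configured .thoth.yaml and a Pipfile/Pipfile.lock in the working directory; the exact flags the integration tests pass may differ:

import subprocess

# Kick off the provenance check; --no-wait is assumed here so the call
# returns without blocking on results.
result = subprocess.run(
    ["thamos", "provenance-check", "--no-wait"],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)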

Checking the whole pipeline:

  1. Provenance triggered from integration-tests:
    (screenshot: provenance-triggered)

  2. user-api receives the request successfully and passes the message to Kafka:
    (screenshot: post-provenance)

  3. The message is available in Kafka and ready for consumption:
    (screenshot: kafka-provenance-message)

  4. Investigator logs don't show anything regarding this, so checking the investigator metrics:
    (screenshot: metric-na)
    Metrics are not showing up; investigating the root cause.

  5. No workflow triggered yet for provenance:

oc get wf -n thoth-backend-stage
NAME                                                   STATUS      AGE
kebechet-administrator-220413103926-26b72118299451de   Succeeded   3m54s
kebechet-administrator-220413104044-d156acf22dd72a70   Running     2m36s
kebechet-administrator-220413104144-207555c22e7706c1   Running     96s
kebechet-administrator-220413104243-55498c868494c205   Running     37s
kebechet-administrator-220413104303-91ecc6ae9958dc29   Running     17s

Metrics are back after syncing the investigator deployment:
http://investigator-faust-web-thoth-infra-stage.apps.ocp4.prod.psi.redhat.com/metrics

All metrics look good; judging from them, the consumption of the provenance message is not halted.
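To sanity-check the counters directly, the Prometheus endpoint above can be scraped and filtered. A quick sketch; the filtering is ad hoc, and the actual metric names are whatever investigator exports:

import requests

# Fetch the raw Prometheus exposition text and keep provenance-related series.
resp = requests.get(
    "http://investigator-faust-web-thoth-infra-stage.apps.ocp4.prod.psi.redhat.com/metrics",
    timeout=10,
)
resp.raise_for_status()
for line in resp.text.splitlines():
    if "provenance" in line and not line.startswith("#"):
        print(line)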

I have tested this in the prod setup; it works without any issue:

Feature: Provenance checks of Python software stacks # features/thamos_provenance_check.feature:1

  @seizes_backend_namespace @provenance
  Scenario Outline: Run provenance check for a Python software stack -- @1.1 Provenance  # features/thamos_provenance_check.feature:14
    Given deployment is accessible using HTTPS                                           # features/steps/common.py:26 1.981s
    When thamos provenance-check is run for provenance_flask asynchronously              # features/steps/provenance_check.py:32 2.582s
    Then wait for provenance-checker to finish successfully                              # features/steps/provenance_check.py:51 8.684s
    Then I should be able to retrieve provenance-checker results                         # features/steps/provenance_check.py:84 1.177s
    Then I should be able to see successful provenance check results                     # features/steps/provenance_check.py:97 0.001s

  @seizes_backend_namespace @provenance
  Scenario Outline: Run provenance check for a Python software stack -- @1.2 Provenance  # features/thamos_provenance_check.feature:15
    Given deployment is accessible using HTTPS                                           # features/steps/common.py:26 1.565s
    When thamos provenance-check is run for provenance_flask_error asynchronously        # features/steps/provenance_check.py:32 2.505s
    Then wait for provenance-checker to finish successfully                              # features/steps/provenance_check.py:51 8.581s
    Then I should be able to retrieve provenance-checker results                         # features/steps/provenance_check.py:84 1.424s
    Then I should be able to see failed provenance check results  
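For reference, these two scenarios can be re-run in isolation with behave's tag filtering. A sketch, as the repository's actual test entrypoint may wrap this differently:

import subprocess

# Run only the @provenance-tagged scenarios from the feature file above.
subprocess.run(
    ["behave", "--tags=@provenance", "features/thamos_provenance_check.feature"],
    check=True,
)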

Still investigating on stage.

The Kafka topics in stage were outdated; I updated them by recreating them.
This allowed the investigator to provision the provenance-check.
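Recreating a topic can be scripted with kafka-python's admin client. A sketch under the assumption of a reachable broker and a known topic name, both placeholders here; the partition and replication numbers are illustrative, not the deployment's settings:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="kafka.example.com:9092")

# Drop the stale topic, then recreate it; deletion is asynchronous on the
# broker, so a real script should wait before recreating.
admin.delete_topics(["provenance-checker"])
admin.create_topics(
    [NewTopic(name="provenance-checker", num_partitions=1, replication_factor=1)]
)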

oc get wf | grep provenance-checker
provenance-checker-220418165642-aa82ea3ac80f787e                   

(screenshot: provenance)

This is fixed in stage.
Please take your time in verifying this.

The thoth-backend-stage namespace gets busy with the kebechet-administrator runs.
Please check the Argo workflows scheduled; you can use this command: oc get wf -n thoth-backend-stage
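Since Argo workflows are Kubernetes custom resources, the same listing can be done programmatically. A sketch with the official Kubernetes Python client, assuming a kubeconfig (or in-cluster) context with access to the namespace:

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

workflows = api.list_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="thoth-backend-stage",
    plural="workflows",
)
for wf in workflows["items"]:
    name = wf["metadata"]["name"]
    phase = wf.get("status", {}).get("phase", "Unknown")
    print(f"{name}\t{phase}")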

I scheduled a provenance-check in the stage cluster and it completed successfully. The workflow is also present in the thoth-backend-stage namespace. Thanks a lot Harshad for the investigation 💯 👍

@fridex would you like to verify that everything is working again before closing the issue?

I can confirm the provenance-checks now work properly. Hence closing this issue. Fixed. Thanks! 👍🏻

/close

@fridex: Closing this issue.
