Dependency Monkey tests are taking too long to complete

Question

Closed this issue 3 years ago · 8 comments

Describe the bug

Integration tests for the dependency_monkey feature are failing because Adviser analysis took more than 15 minutes to complete:

@seizes_inspection_namespace
  Scenario Outline: Run Dependency Monkey -- @1.1 Dependency Monkey                                                                           # features/dependency_monkey.feature:13
    Given deployment is accessible using HTTPS                                                                                                # features/steps/common.py:26 0.839s
    When I schedule Dependency Monkey 1 times for simple_tensorflow example with dry run set to True with predictor AUTO and configuration {} # features/steps/dependency_monkey.py:33 0.773s
    Then wait for Dependency Monkey to finish successfully                                                                                    # features/steps/dependency_monkey.py:83 1283.094s
      Traceback (most recent call last):
        File "/home/mcostant/.local/lib/python3.8/site-packages/behave/model.py", line 1329, in run
          match.run(runner.context)
        File "/home/mcostant/.local/lib/python3.8/site-packages/behave/matchers.py", line 98, in run
          self.func(context, *args, **kwargs)
        File "features/steps/dependency_monkey.py", line 93, in wait_for_dependency_monkey_to_finish
          raise RuntimeError("Adviser analysis took too much time to finish")
      RuntimeError: Adviser analysis took too much time to finish
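For reference, the failing step appears to poll for completion and give up after a fixed deadline. Below is a minimal sketch of such a wait loop, not the actual code from features/steps/dependency_monkey.py. The function name, its parameters, and the 15-minute default are assumptions made for illustration; only the error message is taken from the traceback above.

```python
import time
from typing import Callable


def wait_until_finished(
    is_finished: Callable[[], bool],
    timeout: float = 15 * 60,          # assumed default, in seconds
    poll_interval: float = 10.0,       # assumed polling period, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll is_finished() until it returns True or the timeout elapses.

    Returns the elapsed time in seconds and raises RuntimeError on timeout.
    The clock and sleep arguments are injectable so the loop can be tested
    without actually waiting.
    """
    start = clock()
    while True:
        if is_finished():
            return clock() - start
        elapsed = clock() - start
        if elapsed >= timeout:
            raise RuntimeError(
                "Adviser analysis took too much time to finish "
                f"(waited {elapsed:.0f}s)"
            )
        # Never sleep past the deadline.
        sleep(min(poll_interval, timeout - elapsed))
```

Making the timeout a parameter means it can be raised from the test configuration without touching the step code. A longer timeout only helps if the workflow is actually running, though.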

To Reproduce

Run dependency_monkey tests on stage and see the error.

Expected behavior

Adviser analysis is completed on time.

Answer 1 · 2022-02-22T13:33:11.000Z

/priority critical-urgent

Answer 2 · 2022-02-25T15:43:06.000Z

@fridex it looks like the tests are still failing with the increased timeout. I think we need to either try a higher timeout or investigate whether there is another cause for this failure.

Answer 3 · 2022-03-07T11:13:32.000Z

/kind bug
/assign

Answer 4 · 2022-03-07T18:54:16.000Z

The first problem here was that we had scaled down Argo's workflow controller because we run data aggregation using selinon. I've scaled up the Argo deployment (and scaled down the selinon deployment), so workflows are now being scheduled. However, there is another issue:

ImagePullBackOff: Back-off pulling image "image-registry.openshift-image-registry.svc:5000/thoth-backend-stage/adviser:latest"

Answer 5 · 2022-03-07T20:01:39.000Z

It looks like inspections finish, but there is another deployment issue:

Unable to attach or mount volumes: unmounted volumes=[kafka-secrets], unattached volumes=[argo-artifact-repository-secrets input-artifacts kafka-secrets kube-api-access-vdvjm podmetadata]: timed out waiting for the condition

MountVolume.SetUp failed for volume "kafka-secrets" : references non-existent secret key: kafka_user.crt
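This kind of mount failure happens when a secret volume lists a key under `items` that the referenced Secret does not contain. A rough sketch of how one might check this offline, assuming the Secret and the volume definition have already been loaded into plain dicts (for example from exported manifests). The helper name and the sample data are hypothetical; only the volume name `kafka-secrets` and the key `kafka_user.crt` come from the message above.

```python
from typing import Dict, List


def missing_secret_keys(secret: Dict, volume: Dict) -> List[str]:
    """Return keys listed under volume.secret.items that the Secret lacks."""
    available = set(secret.get("data") or {}) | set(secret.get("stringData") or {})
    items = (volume.get("secret") or {}).get("items") or []
    return [item["key"] for item in items if item["key"] not in available]


# Hypothetical example data: the Secret only carries a CA certificate,
# while the volume also asks for the user certificate.
secret = {"metadata": {"name": "kafka-secrets"}, "data": {"ca.crt": "..."}}
volume = {
    "name": "kafka-secrets",
    "secret": {
        "secretName": "kafka-secrets",
        "items": [
            {"key": "ca.crt", "path": "ca.crt"},
            {"key": "kafka_user.crt", "path": "kafka_user.crt"},
        ],
    },
}
missing = missing_secret_keys(secret, volume)
```

With the sample data above, `missing` contains only `kafka_user.crt`, matching the key named in the event.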

Answer 6 · 2022-03-08T10:31:04.000Z

Looks like you are working on it, @mayaCostantini?
/lifecycle active

Answer 7 · 2022-03-08T10:38:24.000Z

@goern I think @fridex and @harshad16 are working on this, as this is a deployment issue.

Answer 8 · 2022-03-08T19:48:37.000Z

PR #287 also fixes this issue:

thoth.adviser.exceptions.PipelineConfigurationError: Filed to initialize pipeline unit configuration for 'PlatformBoot' with configuration {'default_platform': 'linux-x86_64'}: extra keys not allowed @ data['default_platform']

dependency-monkey-220308141455-d2c59fcd85f210-3260034838-main.log
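The error reads as a strict schema check rejecting a configuration key that the pipeline unit does not declare. A small stdlib-only sketch of that behaviour follows. It is not thoth-adviser's implementation: the exception class is a stand-in for the one named in the log, and the function and the allowed key set are hypothetical.

```python
from typing import Any, Dict, Iterable


class PipelineConfigurationError(Exception):
    """Stand-in for the exception class named in the log above."""


def check_unit_configuration(
    unit_name: str,
    configuration: Dict[str, Any],
    allowed_keys: Iterable[str],
) -> None:
    """Raise if configuration contains keys outside allowed_keys."""
    extra = sorted(set(configuration) - set(allowed_keys))
    if extra:
        raise PipelineConfigurationError(
            f"Failed to initialize pipeline unit configuration for {unit_name!r} "
            f"with configuration {configuration!r}: "
            f"extra keys not allowed @ data[{extra[0]!r}]"
        )


# Hypothetical schema: the unit accepts only 'some_known_key', so the
# 'default_platform' key from the log is rejected.
try:
    check_unit_configuration(
        "PlatformBoot", {"default_platform": "linux-x86_64"}, {"some_known_key"}
    )
except PipelineConfigurationError as exc:
    message = str(exc)
```

Under this reading, the fix is either to stop passing the key or to have the unit's schema accept it.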