thoth-station/integration-tests

Adviser tests are failing because of allocated CPU time exceeded

Opened this issue · 18 comments

Describe the bug
Tests for the thamos_advise feature are producing the following error in stage:

ERROR    thoth.adviser.run:155: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

To Reproduce
Steps to reproduce the behavior:
See last integration tests report for stage environment.

Expected behavior
Tests complete successfully.

/priority critical-urgent

/kind bug

To test the resolver in such cases, I create a lock file using Pipenv and submit an advise with the lock file as generated by Pipenv. In that case, the resolver reports why it removes packages that Pipenv resolved:

it might be a good idea to experiment with requirements (and possibly constraints as well) to narrow down to the issue one wants to debug. An example can be a failure when adviser was not able to find a resolution that would satisfy requirements. In such a case, it might be good to generate a lock file with expected pinned set of packages using other tools (e.g. Pipenv, pip-tools) and submit the lock file to the recommender system. The logs produced during the resolution and stack level justifications might give hints why the given resolution was rejected.

See docs.
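
The debugging workflow quoted above can be sketched roughly as follows. This is a minimal sketch, not the project's tooling: the helper name `lock_and_advise` and the injectable `run` parameter are hypothetical, and it assumes that both `pipenv` and `thamos` are on `PATH`, that the project directory already has a Thamos configuration, and that `thamos advise` submits the `Pipfile.lock` it finds unless told otherwise.

```python
import subprocess
from typing import Callable, List, Tuple

# Assumption: `pipenv lock` writes Pipfile.lock and `thamos advise`
# submits that lock file as the user stack by default.
COMMANDS: Tuple[Tuple[str, ...], ...] = (
    ("pipenv", "lock"),
    ("thamos", "advise"),
)


def lock_and_advise(
    project_dir: str,
    run: Callable[..., object] = subprocess.run,
) -> List[Tuple[str, ...]]:
    """Lock with Pipenv, then ask the adviser to score the pinned stack.

    `run` is injectable so the sequence can be checked without
    touching the network or the local environment.
    """
    executed = []
    for command in COMMANDS:
        # check=True makes subprocess.run raise if a step fails.
        run(list(command), cwd=project_dir, check=True)
        executed.append(command)
    return executed
```

The adviser logs for the resulting analysis should then show which pinned packages were removed and why.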

/sig stack-guidance
/priority critical-urgent

Failing tests:

  • runtime environment ps-cv-pytorch , without user stack supplied and without static analysis

Failure:

2022-03-07 15:24:20,728  23 INFO     thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:20,733  23 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:21,467  23 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:24:21,468  23 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
  • runtime environment ps-nlp-tensorflow , without user stack supplied and without static analysis

Failure:

2022-03-07 15:20:19,411  22 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:20:20,288  22 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:20:20,288  22 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
  • runtime environment ps-nlp-pytorch , without user stack supplied and without static analysis

Failure:

2022-03-07 15:22:28,279  22 INFO     thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:22:28,284  22 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:22:28,979  22 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:22:28,979  22 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
  • runtime environment ps-nlp-tensorflow-gpu , without user stack supplied and without static analysis

Failure:

2022-03-07 15:24:20,728  23 INFO     thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:20,733  23 INFO     thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:21,467  23 WARNING  thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:24:21,468  23 INFO     thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack

Based on the lock file we use in the repos, it looks like thoth-solver was not able to solve jupyter-tensorboard==0.2.0 in the given runtime environment.
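
To check quickly which packages the sieves removed across several adviser logs, a small script along these lines might help. It is only a sketch: the regular expression is derived from the WARNING lines pasted above, not from the adviser source, and the names `RemovedPackage` and `removed_packages` are hypothetical.

```python
import re
from typing import List, NamedTuple


class RemovedPackage(NamedTuple):
    name: str
    version: str
    index_url: str
    reason: str


# Pattern inferred from the log lines in this issue (an assumption,
# the message format may differ between adviser versions).
_REMOVED_RE = re.compile(
    r"Removing package \('([^']+)', '([^']+)', '([^']+)'\) "
    r"due to (.+?) - see \S+"
)


def removed_packages(log_text: str) -> List[RemovedPackage]:
    """Return every package a sieve reported as removed."""
    found = []
    for line in log_text.splitlines():
        match = _REMOVED_RE.search(line)
        if match:
            found.append(RemovedPackage(*match.groups()))
    return found
```

Run against the four excerpts above, this would report the same jupyter-tensorboard 0.2.0 removal in each of them.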

However, for some scenarios adviser was able to resolve application dependencies when triggered manually. I've created a new integration-tests job to confirm whether these tests are still failing. Nevertheless, it would be great to check why thoth-solver did not solve jupyter-tensorboard in the given runtime environment.

thoth-solver fails to install jupyter-tensorboard with the following error:

Command exited with non-zero status code (1):     ERROR: Command errored out with exit status 1:
     command: /opt/app-root/src/solver-venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"'; __file__='"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-vgv1i21f/install-record.txt --single-version-externally-managed --compile --install-headers /opt/app-root/src/solver-venv/include/site/python3.8/jupyter-tensorboard
         cwd: /tmp/pip-install-0i94_y48/jupyter-tensorboard/
    Complete output (71 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib
    creating build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/application.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/tensorboard_manager.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/api_handlers.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/__init__.py -> build/lib/jupyter_tensorboard
    copying jupyter_tensorboard/handlers.py -> build/lib/jupyter_tensorboard
    creating build/lib/jupyter_tensorboard/static
    copying jupyter_tensorboard/static/tensorboardlist.js -> build/lib/jupyter_tensorboard/static
    copying jupyter_tensorboard/static/style.css -> build/lib/jupyter_tensorboard/static
    copying jupyter_tensorboard/static/tree.js -> build/lib/jupyter_tensorboard/static
    running build_scripts
    creating build/scripts-3.8
    copying scripts/jupyter-tensorboard -> build/scripts-3.8
    changing mode of build/scripts-3.8/jupyter-tensorboard from 644 to 755
    running install_lib
    creating /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/application.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/tensorboard_manager.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/api_handlers.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/__init__.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    copying build/lib/jupyter_tensorboard/handlers.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
    creating /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    copying build/lib/jupyter_tensorboard/static/tensorboardlist.js -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    copying build/lib/jupyter_tensorboard/static/style.css -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    copying build/lib/jupyter_tensorboard/static/tree.js -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/application.py to application.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/tensorboard_manager.py to tensorboard_manager.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/api_handlers.py to api_handlers.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/__init__.py to __init__.cpython-38.pyc
    byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/handlers.py to handlers.cpython-38.pyc
    running install_egg_info
    running egg_info
    writing jupyter_tensorboard.egg-info/PKG-INFO
    writing dependency_links to jupyter_tensorboard.egg-info/dependency_links.txt
    writing entry points to jupyter_tensorboard.egg-info/entry_points.txt
    writing requirements to jupyter_tensorboard.egg-info/requires.txt
    writing top-level names to jupyter_tensorboard.egg-info/top_level.txt
    reading manifest file 'jupyter_tensorboard.egg-info/SOURCES.txt'
    writing manifest file 'jupyter_tensorboard.egg-info/SOURCES.txt'
    Copying jupyter_tensorboard.egg-info to /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard-0.2.0-py3.8.egg-info
    running install_scripts
    copying build/scripts-3.8/jupyter-tensorboard -> /opt/app-root/src/solver-venv/bin
    changing mode of /opt/app-root/src/solver-venv/bin/jupyter-tensorboard to 755
    Installing jupyter-tensorboard script to /opt/app-root/src/solver-venv/bin
    writing list of installed files to '/tmp/pip-record-vgv1i21f/install-record.txt'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 52, in <module>
        setup(
      File "/opt/app-root/src/solver-venv/lib64/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 47, in run
        enable_extension_after_install()
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 30, in enable_extension_after_install
        from jupyter_tensorboard.application import (
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/jupyter_tensorboard/__init__.py", line 3, in <module>
        from .handlers import load_jupyter_server_extension   # noqa
      File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/jupyter_tensorboard/handlers.py", line 3, in <module>
        from tornado import web
    ModuleNotFoundError: No module named 'tornado'
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/app-root/src/solver-venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"'; __file__='"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-vgv1i21f/install-record.txt --single-version-externally-managed --compile --install-headers /opt/app-root/src/solver-venv/include/site/python3.8/jupyter-tensorboard Check the logs for full command output.

The issue here is that jupyter-tensorboard runs code after installation that expects tornado to be present in the environment. Because we install jupyter-tensorboard without its dependencies, the post-install procedure that registers the extension fails.
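
For the upstream report, one possible way to avoid this failure mode is for a post-install hook to check that its imports are available before running. The following is a hypothetical sketch of such a guard, not the jupyter-tensorboard code: `missing_modules` and `can_run_post_install` are illustrative names, and the required module list is an assumption based only on the traceback above.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported.

    find_spec() returns None for a top-level module that is not
    installed; nothing is actually imported by this check.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


def can_run_post_install(required: Iterable[str] = ("tornado",)) -> bool:
    """Report whether a post-install step has the modules it needs.

    Assumption: "tornado" is the only import the step needs; the
    traceback shows it as the first one to fail, others may follow.
    """
    return not missing_modules(required)
```

With a check like this, an install without dependencies could skip the extension registration instead of failing the whole installation.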

In the recent report, the adviser was able to find a resolution to this issue - that is using an older version of jupyter-tensorboard that does not perform any post-install procedure.

Closing this as integration tests are green. Nevertheless, we should report this upstream and see what their opinion is on this one.

/close

@fridex: Closing this issue.

/reopen

@fridex: Reopened this issue.

Today's aws-prod tests show green.
It is worth ensuring that the stage tests are also green.

/assign @fridex
/lifecycle active

Scheduled integration-tests for stage, we should receive an email report after the integration tests finish.

Right now, integration tests in stage are not running (thoth-station/thoth-application#2599)
/remove-lifecycle active
until this is addressed

In yesterday's run of the integration tests in aws-prod, one of the adviser tests failed (ps-cv-pytorch):

... Then I ask for an advise for the cloned application for runtime environment ps-cv-pytorch , without user stack supplied and without static analysis (965.794s) 
...
2022-06-28 03:17:46,572 thoth.adviser.run           ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

Captured logging:
INFO:thamos.lib:Using 'latest' recommendation type - see https://thoth-station.ninja/recommendation-types/
WARNING:thamos.lib:The user stack found in the lock file will not be supplied as requested
INFO:thamos.lib:Successfully submitted advise analysis 'adviser-220628030145-f174942db191749e' to 'https://api.prod.thoth-station.ninja/api/v1'

Another anecdotal update: yesterday's aws-prod integration test runs have 2 tests failing due to allocated CPU time exceeded: ps-cv-pytorch and ps-cv-tensorflow

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale
/lifecycle frozen