Adviser tests are failing because of allocated CPU time exceeded
Opened this issue · 18 comments
Describe the bug
Tests for the thamos_advise feature are producing the following error in stage:
ERROR thoth.adviser.run:155: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded
To Reproduce
Steps to reproduce the behavior:
See last integration tests report for stage environment.
Expected behavior
Tests complete successfully.
/priority critical-urgent
/kind bug
To test the resolver in such cases, I create a lock file using Pipenv and submit an advise with the lock file as generated by Pipenv. In that case, the resolver reports why it removes packages that Pipenv resolved:
It might be a good idea to experiment with requirements (and possibly constraints as well) to narrow down to the issue one wants to debug. An example can be a failure when adviser was not able to find a resolution that would satisfy requirements. In such a case, it might be good to generate a lock file with the expected pinned set of packages using other tools (e.g. Pipenv, pip-tools) and submit the lock file to the recommender system. The logs produced during the resolution and stack-level justifications might give hints why the given resolution was rejected.
See docs.
/sig stack-guidance
/priority critical-urgent
Failing tests:
- runtime environment ps-cv-pytorch, without user stack supplied and without static analysis
Failure:
2022-03-07 15:24:20,728 23 INFO thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:20,733 23 INFO thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:21,467 23 WARNING thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:24:21,468 23 INFO thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
- runtime environment ps-nlp-tensorflow, without user stack supplied and without static analysis
Failure:
2022-03-07 15:20:19,411 22 INFO thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:20:20,288 22 WARNING thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:20:20,288 22 INFO thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
- runtime environment ps-nlp-pytorch, without user stack supplied and without static analysis
Failure:
2022-03-07 15:22:28,279 22 INFO thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:22:28,284 22 INFO thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:22:28,979 22 WARNING thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:22:28,979 22 INFO thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
- runtime environment ps-nlp-tensorflow-gpu, without user stack supplied and without static analysis
Failure:
2022-03-07 15:24:20,728 23 INFO thoth.adviser.resolver:1175: Scoring user's stack - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:20,733 23 INFO thoth.adviser.resolver:612: Scoring user's stack based on the lock file submitted - see https://thoth-station.ninja/j/user_stack
2022-03-07 15:24:21,467 23 WARNING thoth.adviser.sieves.solved:127: Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') due to installation time error in the software environment - see https://thoth-station.ninja/j/install_error
2022-03-07 15:24:21,468 23 INFO thoth.adviser.resolver:624: User's stack was removed based on sieves - see https://thoth-station.ninja/j/rm_user_stack
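All four failures share the same sieve warning. A minimal sketch for summarising which packages the sieve removed, assuming the adviser log is available as plain text; the regular expression is derived only from the WARNING lines quoted above, and `removed_packages` is a hypothetical helper, not part of thoth-adviser:

```python
import re
from collections import Counter

# Pattern derived from the thoth.adviser.sieves.solved WARNING lines quoted above.
REMOVED_RE = re.compile(
    r"Removing package \('(?P<name>[^']+)', '(?P<version>[^']+)', '(?P<index>[^']+)'\) "
    r"due to installation time error"
)


def removed_packages(log_text: str) -> Counter:
    """Count (name, version, index) tuples the sieve reported as removed."""
    return Counter(
        (m["name"], m["version"], m["index"])
        for m in REMOVED_RE.finditer(log_text)
    )


# Two WARNING lines and one INFO line copied from the logs in this issue.
sample = "\n".join([
    "2022-03-07 15:20:20,288 22 WARNING thoth.adviser.sieves.solved:127: "
    "Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') "
    "due to installation time error in the software environment",
    "2022-03-07 15:20:20,288 22 INFO thoth.adviser.resolver:624: "
    "User's stack was removed based on sieves",
    "2022-03-07 15:22:28,979 22 WARNING thoth.adviser.sieves.solved:127: "
    "Removing package ('jupyter-tensorboard', '0.2.0', 'https://pypi.org/simple') "
    "due to installation time error in the software environment",
])

for (name, version, index), count in removed_packages(sample).items():
    print(f"{name}=={version} from {index}: removed {count} time(s)")
```

Running this over the full logs of every failing test should show whether jupyter-tensorboard 0.2.0 is the only package being removed.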
Based on the lock file we use in repos, it looks like thoth-solver was not able to solve jupyter-tensorboard==0.2.0 in the given runtime environment.
However, for some scenarios adviser was able to resolve application dependencies when triggered manually. I've created a new integration-tests job to confirm if these tests are still failing. Nevertheless, it would be great to check why thoth-solver did not solve jupyter-tensorboard in the given runtime environment.
thoth-solver fails to install jupyter-tensorboard with the following error:
Command exited with non-zero status code (1): ERROR: Command errored out with exit status 1:
command: /opt/app-root/src/solver-venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"'; __file__='"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-vgv1i21f/install-record.txt --single-version-externally-managed --compile --install-headers /opt/app-root/src/solver-venv/include/site/python3.8/jupyter-tensorboard
cwd: /tmp/pip-install-0i94_y48/jupyter-tensorboard/
Complete output (71 lines):
running install
running build
running build_py
creating build
creating build/lib
creating build/lib/jupyter_tensorboard
copying jupyter_tensorboard/application.py -> build/lib/jupyter_tensorboard
copying jupyter_tensorboard/tensorboard_manager.py -> build/lib/jupyter_tensorboard
copying jupyter_tensorboard/api_handlers.py -> build/lib/jupyter_tensorboard
copying jupyter_tensorboard/__init__.py -> build/lib/jupyter_tensorboard
copying jupyter_tensorboard/handlers.py -> build/lib/jupyter_tensorboard
creating build/lib/jupyter_tensorboard/static
copying jupyter_tensorboard/static/tensorboardlist.js -> build/lib/jupyter_tensorboard/static
copying jupyter_tensorboard/static/style.css -> build/lib/jupyter_tensorboard/static
copying jupyter_tensorboard/static/tree.js -> build/lib/jupyter_tensorboard/static
running build_scripts
creating build/scripts-3.8
copying scripts/jupyter-tensorboard -> build/scripts-3.8
changing mode of build/scripts-3.8/jupyter-tensorboard from 644 to 755
running install_lib
creating /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
copying build/lib/jupyter_tensorboard/application.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
copying build/lib/jupyter_tensorboard/tensorboard_manager.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
copying build/lib/jupyter_tensorboard/api_handlers.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
copying build/lib/jupyter_tensorboard/__init__.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
copying build/lib/jupyter_tensorboard/handlers.py -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard
creating /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
copying build/lib/jupyter_tensorboard/static/tensorboardlist.js -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
copying build/lib/jupyter_tensorboard/static/style.css -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
copying build/lib/jupyter_tensorboard/static/tree.js -> /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/static
byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/application.py to application.cpython-38.pyc
byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/tensorboard_manager.py to tensorboard_manager.cpython-38.pyc
byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/api_handlers.py to api_handlers.cpython-38.pyc
byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/__init__.py to __init__.cpython-38.pyc
byte-compiling /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard/handlers.py to handlers.cpython-38.pyc
running install_egg_info
running egg_info
writing jupyter_tensorboard.egg-info/PKG-INFO
writing dependency_links to jupyter_tensorboard.egg-info/dependency_links.txt
writing entry points to jupyter_tensorboard.egg-info/entry_points.txt
writing requirements to jupyter_tensorboard.egg-info/requires.txt
writing top-level names to jupyter_tensorboard.egg-info/top_level.txt
reading manifest file 'jupyter_tensorboard.egg-info/SOURCES.txt'
writing manifest file 'jupyter_tensorboard.egg-info/SOURCES.txt'
Copying jupyter_tensorboard.egg-info to /opt/app-root/src/solver-venv/lib/python3.8/site-packages/jupyter_tensorboard-0.2.0-py3.8.egg-info
running install_scripts
copying build/scripts-3.8/jupyter-tensorboard -> /opt/app-root/src/solver-venv/bin
changing mode of /opt/app-root/src/solver-venv/bin/jupyter-tensorboard to 755
Installing jupyter-tensorboard script to /opt/app-root/src/solver-venv/bin
writing list of installed files to '/tmp/pip-record-vgv1i21f/install-record.txt'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 52, in <module>
setup(
File "/opt/app-root/src/solver-venv/lib64/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 47, in run
enable_extension_after_install()
File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py", line 30, in enable_extension_after_install
from jupyter_tensorboard.application import (
File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/jupyter_tensorboard/__init__.py", line 3, in <module>
from .handlers import load_jupyter_server_extension # noqa
File "/tmp/pip-install-0i94_y48/jupyter-tensorboard/jupyter_tensorboard/handlers.py", line 3, in <module>
from tornado import web
ModuleNotFoundError: No module named 'tornado'
----------------------------------------
ERROR: Command errored out with exit status 1: /opt/app-root/src/solver-venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"'; __file__='"'"'/tmp/pip-install-0i94_y48/jupyter-tensorboard/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-vgv1i21f/install-record.txt --single-version-externally-managed --compile --install-headers /opt/app-root/src/solver-venv/include/site/python3.8/jupyter-tensorboard Check the logs for full command output.
The issue here is that jupyter-tensorboard executes code after installation that expects tornado to be present in the environment. As we install jupyter-tensorboard without dependencies, the post-install procedure that registers the extension fails.
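The failure mode can be reproduced in isolation. The sketch below is not the jupyter-tensorboard setup.py; it builds a throwaway package (the hypothetical `demo_ext`) whose `__init__.py` imports a dependency, then imports that package the way a post-install hook would. `simulate_post_install` and `hypothetical_missing_dep` are made-up names for illustration:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def simulate_post_install(dep: str = "hypothetical_missing_dep") -> str:
    """Import a throwaway package that needs `dep`, as a post-install hook would."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_ext"
        pkg.mkdir()
        # The package imports its dependency at import time, as
        # jupyter_tensorboard/handlers.py does with tornado in the traceback above.
        (pkg / "__init__.py").write_text(f"import {dep}\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("demo_ext")
        except ModuleNotFoundError as exc:
            return f"post-install hook failed: {exc}"
        finally:
            # Clean up so repeated calls start from a fresh state.
            sys.path.remove(tmp)
            sys.modules.pop("demo_ext", None)
    return "post-install hook succeeded"


print(simulate_post_install())        # dependency absent, as in the solver run
print(simulate_post_install("json"))  # dependency present (stdlib stand-in)
```

The hook only succeeds when the dependency is already installed, which is why an install without dependencies trips over it.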
In the recent report, the adviser was able to find a resolution to this issue - that is using an older version of jupyter-tensorboard that does not perform any post-install procedure.
Closing this as integration tests are green. Nevertheless, we should report this upstream and see what their opinion is on this one.
/close
@fridex: Closing this issue.
/reopen
@fridex: Reopened this issue.
Today's aws-prod tests show green.
Worth ensuring stage tests are also green
Scheduled integration-tests for stage, we should receive an email report after the integration tests finish.
Right now, integration tests in stage are not running (thoth-station/thoth-application#2599)
/remove-lifecycle active
until this is addressed
In yesterday's run of the integration tests in aws-prod, one of the adviser tests failed (ps-cv-pytorch):
... Then I ask for an advise for the cloned application for runtime environment ps-cv-pytorch , without user stack supplied and without static analysis (965.794s)
...
2022-06-28 03:17:46,572 thoth.adviser.run ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded
Captured logging:
INFO:thamos.lib:Using 'latest' recommendation type - see https://thoth-station.ninja/recommendation-types/
WARNING:thamos.lib:The user stack found in the lock file will not be supplied as requested
INFO:thamos.lib:Successfully submitted advise analysis 'adviser-220628030145-f174942db191749e' to 'https://api.prod.thoth-station.ninja/api/v1'
Another anecdotal update: yesterday's aws-prod integration test runs have 2 tests failing due to allocated CPU time exceeded: ps-cv-pytorch and ps-cv-tensorflow.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen