[QUESTION] Why are our Windows distribution tests failing?
Swiddis opened this issue · 4 comments
I think this is meant to be a bug report (either in our plugin or in this build repo) but I'm not 100% sure what the bug is or where it is. I'd like help to root cause so I can convert this to a proper bug report.
Problem
In dashboards-observability we've been getting an autocut issue for a failing distribution since February. The pipeline in question seems to fluctuate a lot between passing/unstable/failing. I've been digging through the logs to figure out what the issue is and am coming up blank: usually there's an integ-test section that says which tests are failing, but this hasn't been present in any of the recent runs I've checked. It says "observabilityDashboards" is in the failing plugins list, but I can't locate the error.
Possibly-Related Info
One hint I can find is that there are bootstrap issues in some of the logs. One suspicion I have is that we may need to tweak the pipeline to run with --single-version=loose or --single-version=ignore:
ERROR [single_version_dependencies] Multiple version ranges for the same dependency
were found declared across different package.json files. Please consolidate
those to match across all package.json files. Different versions for the
same dependency is not supported.
If you have questions about this please reach out to the operations team.
The conflicting dependencies are:
cypress
9.5.4 => opensearch-dashboards
^13.6.0 => observability-dashboards
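For anyone trying to reproduce this check locally, here is a minimal Python sketch of the idea behind that error: collect each dependency's declared range from several package.json files and report the ones declared with more than one range. This is not the OSD bootstrap implementation; the function name `find_conflicts` and the shape of its return value are made up for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_conflicts(package_json_paths):
    """Return {dependency: {version_range: [package names]}} for every
    dependency that is declared with more than one distinct range."""
    declared = defaultdict(lambda: defaultdict(list))
    for path in package_json_paths:
        manifest = json.loads(Path(path).read_text(encoding="utf-8"))
        # Fall back to the file path if the manifest has no "name" field.
        name = manifest.get("name", str(path))
        for section in ("dependencies", "devDependencies"):
            for dep, version_range in manifest.get(section, {}).items():
                declared[dep][version_range].append(name)
    # Keep only dependencies with at least two different declared ranges.
    return {
        dep: dict(ranges)
        for dep, ranges in declared.items()
        if len(ranges) > 1
    }
```

Pointing it at the core package.json and a plugin's package.json should list `cypress` if the two files pin it differently, which is the situation the error above describes.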
I have also asked some coworkers about the logs and was directed to this dashboards issue about failing Windows tests, which also seems related. It cites log messages about permission issues when deleting test files (or perhaps something about running a shell on Windows):
Traceback (most recent call last):
File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\run_build.py", line 113, in <module>
sys.exit(main())
File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\run_build.py", line 93, in main
builder.build(build_recorder)
File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\build_workflow\builder_from_source.py", line 56, in build
self.git_repo.execute(build_command)
File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\git\git_repository.py", line 85, in execute
subprocess.check_call(command, cwd=cwd, shell=True)
File "C:\Users\ContainerAdministrator\scoop\apps\python39\3.9.13\lib\subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'bash C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\scripts\components\OpenSearch-Dashboards\build.sh -v 3.0.0 -p windows -a x64 -d zip -s false -o builds' returned non-zero exit status 1.
script returned exit code 1
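A note on reading this traceback: `subprocess.check_call` does not capture the command's output, so the exception only carries the exit status, and the actual failure from build.sh has to be found earlier in the log. As a hedged sketch of how the tail of a failing command's output could be surfaced next to the exit code (this is not how run_build.py works, and `run_and_report` is a hypothetical helper):

```python
import subprocess
import sys


def run_and_report(command, tail_lines=20):
    """Run a command and return its exit code plus the last lines of its
    combined stdout/stderr, so the likely root cause sits next to the code."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    )
    tail = result.stdout.splitlines()[-tail_lines:]
    return result.returncode, tail


if __name__ == "__main__":
    # Stand-in for the real build command: prints a message and exits with 1.
    code, tail = run_and_report(
        [sys.executable, "-c", "import sys; print('bootstrap failed'); sys.exit(1)"]
    )
    print(f"exit status {code}")
    print("\n".join(tail))
```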
Question
Why is the distribution failing? Is it a problem with our plugin, or is it an issue in the build pipeline? Is pipeline.log even the right place to look to debug this?
Context
I'm working on understanding and resolving these issues for our goal to fix all the flaky distribution tests by 2.14: opensearch-project/dashboards-observability#1670.
@peterzhuamazon can you take a look and help out here? I think all/most plugins are getting this autocut on main
Tagging @rishabh6788 to help here.
[Triage]
As the 2.14.0 release is moving forward, I assume this issue is fixed. @rishabh6788, can you please let us know?
Thanks
One hint I can find is that there are bootstrap issues in some of the logs, one suspicion I have is that we may need to tweak the pipeline to run with --single-version=loose or --single-version=ignore
We should be using --single-version=loose wherever OSD is being bootstrapped. The ignore option should only be used for debugging and never in production or test builds.
... got directed to opensearch-project/OpenSearch-Dashboards#5688 about failing Windows tests
The "Cleaning all empty folders recursively" failure and other folder-deletion failures are caused by different parallel processes racing to delete a folder and its parent. I am working on rewriting this functionality in OSD. If this is a widespread pain, I can prioritize it.
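To illustrate the race described above, here is a minimal Python sketch of a deletion helper that tolerates another process removing the same folder (or part of it) first. It only illustrates the idea and is not the planned OSD rewrite; `remove_tree_tolerant` and its retry parameters are hypothetical.

```python
import shutil
import time
from pathlib import Path


def remove_tree_tolerant(path, retries=5, delay=0.2):
    """Delete a directory tree, treating 'already gone' as success and
    retrying briefly when a concurrent process interferes."""
    target = Path(path)
    for attempt in range(retries):
        try:
            shutil.rmtree(target)
        except FileNotFoundError:
            # Another process removed this folder, or part of it, first.
            pass
        except OSError:
            # e.g. PermissionError or "directory not empty" while another
            # process is still working inside the tree.
            if attempt == retries - 1:
                raise
        if not target.exists():
            return True
        time.sleep(delay)
    return False
```

The key design choice in this sketch is to judge success by whether the folder still exists after each attempt, not by whether this particular process was the one that deleted it.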