opensearch-project/opensearch-build

[QUESTION] Why are our Windows distribution tests failing?

Swiddis opened this issue · 4 comments

I think this is meant to be a bug report (either in our plugin or in this build repo) but I'm not 100% sure what the bug is or where it is. I'd like help to root cause so I can convert this to a proper bug report.

Problem

In dashboards-observability we've been getting an autocut issue for a failing distribution since February. The pipeline in question fluctuates a lot between passing, unstable, and failing. I've been digging through the logs to figure out what the issue is and am coming up blank: usually there's an integ-test section that lists the failing tests, but it hasn't been present in any of the recent runs I've checked. The log says "observabilityDashboards" is in the failing plugins list, but I can't locate the actual error.

Possibly-Related Info

One hint I can find is that there are bootstrap issues in some of the logs. My suspicion is that we may need to tweak the pipeline to run with --single-version=loose or --single-version=ignore:

ERROR [single_version_dependencies] Multiple version ranges for the same dependency
      were found declared across different package.json files. Please consolidate
      those to match across all package.json files. Different versions for the
      same dependency is not supported.

      If you have questions about this please reach out to the operations team.

      The conflicting dependencies are:

        cypress
          9.5.4 => opensearch-dashboards
          ^13.6.0 => observability-dashboards
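
To make the check behind this error concrete, here is a minimal Python sketch of how conflicting ranges like the two above could be detected. It is not the actual single_version_dependencies implementation in OSD; the function name find_version_conflicts, the in-memory manifests dictionary, and the "example-lib" entry are hypothetical and only for illustration. Only the two cypress ranges come from the log.

```python
from collections import defaultdict


def find_version_conflicts(manifests):
    """Return {dependency: {package_name: version_range}} for every dependency
    that is declared with more than one distinct version range.

    `manifests` maps a package name to its parsed package.json content.
    """
    declared = defaultdict(dict)
    for package_name, manifest in manifests.items():
        for section in ("dependencies", "devDependencies"):
            for dep, version_range in manifest.get(section, {}).items():
                declared[dep][package_name] = version_range
    # A dependency conflicts when the set of distinct range strings has more than one entry.
    return {
        dep: ranges
        for dep, ranges in declared.items()
        if len(set(ranges.values())) > 1
    }


# Hypothetical manifests; only the cypress ranges are taken from the log above.
manifests = {
    "opensearch-dashboards": {
        "devDependencies": {"cypress": "9.5.4", "example-lib": "1.0.0"},
    },
    "observability-dashboards": {
        "devDependencies": {"cypress": "^13.6.0", "example-lib": "1.0.0"},
    },
}

conflicts = find_version_conflicts(manifests)
for dep, ranges in conflicts.items():
    print(dep)
    for package_name, version_range in ranges.items():
        print(f"  {version_range} => {package_name}")
```

The sketch compares range strings literally, so two ranges that happen to resolve to the same version would still be reported as a conflict.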

I also asked some coworkers about the logs and got directed to a dashboards issue about failing Windows tests (opensearch-project/OpenSearch-Dashboards#5688), which seems related. It notes that the logs mention permission issues when deleting test files (or perhaps something about running a shell on Windows):

Traceback (most recent call last):
  File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\run_build.py", line 113, in <module>
    sys.exit(main())
  File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\run_build.py", line 93, in main
    builder.build(build_recorder)
  File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\build_workflow\builder_from_source.py", line 56, in build
    self.git_repo.execute(build_command)
  File "C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\src\git\git_repository.py", line 85, in execute
    subprocess.check_call(command, cwd=cwd, shell=True)
  File "C:\Users\ContainerAdministrator\scoop\apps\python39\3.9.13\lib\subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)

subprocess.CalledProcessError: Command 'bash C:\Users\Administrator\jenkins\workspace\distribution-build-opensearch-dashboards\scripts\components\OpenSearch-Dashboards\build.sh -v 3.0.0 -p windows -a x64 -d zip -s false -o builds' returned non-zero exit status 1.

script returned exit code 1
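
The traceback above only reports the exit status, because subprocess.check_call does not attach the child's output to the CalledProcessError; any error text from build.sh would have to be found elsewhere in the log. As a hedged illustration (not a proposed change to run_build.py or git_repository.py), the sketch below shows one way to capture stderr so the failure reason travels with the error. The helper name run_and_report and the simulated failing command are hypothetical.

```python
import subprocess
import sys


def run_and_report(command, tail_lines=20):
    """Run `command` (a list of arguments), capturing its output.

    On a non-zero exit status, print the last `tail_lines` lines of stderr
    so the underlying error is visible next to the exit code.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        tail = "\n".join(result.stderr.splitlines()[-tail_lines:])
        print(f"command failed with exit status {result.returncode}; last stderr lines:")
        print(tail)
    return result


# Simulated failing build step: writes an error to stderr and exits with status 1.
failing_command = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('simulated build error\\n'); sys.exit(1)",
]
result = run_and_report(failing_command)
```

Capturing output this way hides it from the live console, so a real pipeline would likely need to both stream and record it; that trade-off is outside this sketch.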

Question

Why is the distribution failing? Is it a problem with our plugin, or is it an issue in the build pipeline? Is pipeline.log even the right place to look to debug this?

Context

I'm working on understanding and resolving these issues for our goal to fix all the flaky distribution tests by 2.14: opensearch-project/dashboards-observability#1670.

@peterzhuamazon can you take a look and help out here? I think all or most plugins are getting this autocut on main.

Tagging @rishabh6788 to help here.

[Triage]
As the 2.14.0 release is moving forward, I assume this issue is fixed. @rishabh6788, can you please let us know?
Thanks

One hint I can find is that there are bootstrap issues in some of the logs, one suspicion I have is that we may need to tweak the pipeline to run with --single-version=loose or --single-version=ignore

We should be using --single-version=loose wherever OSD is being bootstrapped. The ignore option should only be used for debugging and never in production or test builds.

... got directed to opensearch-project/OpenSearch-Dashboards#5688 about failing Windows tests

The "Cleaning all empty folders recursively" failures and the other folder-deletion failures are caused by a race: different parallel processes try to delete a folder and its parent at the same time. I am working on rewriting this functionality in OSD. If this is a widespread pain, I can prioritize it.
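
For readers unfamiliar with this kind of race, here is a minimal Python sketch of a cleanup routine that tolerates another process deleting the same folders concurrently. It is an illustration only, not the OSD rewrite mentioned above (OSD is not written in Python), and the function name remove_empty_dirs is hypothetical.

```python
import os
import tempfile


def remove_empty_dirs(root):
    """Remove empty directories under `root`, children before parents.

    Errors that a concurrent deleter can cause are swallowed instead of
    failing the whole cleanup.
    """
    removed = []
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        try:
            os.rmdir(dirpath)  # succeeds only if the directory is empty
            removed.append(dirpath)
        except FileNotFoundError:
            pass  # another process already deleted it
        except OSError:
            pass  # not empty, or locked by another process (common on Windows)
    return removed


# Usage: an empty nested tree is removed, a folder containing a file is kept.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b", "c"))
    os.makedirs(os.path.join(tmp, "keep"))
    with open(os.path.join(tmp, "keep", "file.txt"), "w") as fh:
        fh.write("data")
    remove_empty_dirs(os.path.join(tmp, "a"))
    remaining = sorted(os.listdir(tmp))
```

The key design choice is treating "already gone" as success, since the goal is only that the folder no longer exists, regardless of which process removed it.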