ros2/ci

Transition Windows/Mac CI farms to cloud virtualization

Closed this issue · 9 comments

This issue tracks the progress of transitioning the Windows CI and Mac CI machines to cloud virtualization. The goal of the project is to create a more reproducible, deployable and scalable Jenkins CI farm. Because Microsoft has embraced containerization over the last couple of years, it is now feasible to run Windows CI instances inside Docker containers on VMs from a cloud provider. On the Mac side, MacStadium offers virtualized Mac instances that could serve the same purpose.

Windows cloud virtualization tasks:

  • Demonstrate a CI build by running run_ros2_batch.py from a Docker container
  • Deploy a Jenkins build job to an EC2 instance
    https://citest.ros2.org/job/ci_windows/87/
  • Merge windows_docker_resources PR (#361) (1/31/2019)
  • Cloud Jenkins agents running side-by-side with current CI farm (1/31/2019)
  • Windows configuration management of EC2 VMs with Chef (1/31/2019)
  • Retire bare metal CI Windows servers (3/1/2019)

Mac cloud virtualization tasks (TBD):

  • Investigate MacStadium's cloud virtualization offering
  • Demonstrate a CI build by running run_ros2_batch.py from a virtual Mac instance
  • Deploy a Jenkins build job to a MacStadium instance
  • Create and Merge Mac CI PR
  • Cloud Jenkins agents running side-by-side with current CI farm
  • Mac configuration management of MacStadium VMs with Orka
  • Retire bare metal CI Mac servers

I've run into an issue when building in a directory mounted with docker run -v. During configuration, CMake runs a compiler check on a simple test program in Debug mode with the /Zi flag, which writes debug symbols to a PDB file. Access to that PDB goes through mspdbsrv.exe, and the link step fails. I suspect this is caused by file handle access across the mounted directory, between the container OS and the host OS. Building in a directory outside the mount, then copying the build/install results back to the mounted directory, seems to work.

Example of the failure output:

cl /c /Zi /W3 /WX- /diagnostics:classic /Od /Ob0 /D WIN32 /D _WINDOWS /D "CMAKE_INTDIR=\"Debug\"" /D _MBCS /Gm- /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /Fo"cmTC_c31aa.dir\Debug\\" /Fd"cmTC_c31aa.dir\Debug\vc141.pdb" /Gd /TC /errorReport:queue C:\TEMP\workdir\ws\build\poco_vendor\CMakeFiles\CMakeTmp\testCCompiler.c
[19.406s]
[19.406s]       testCCompiler.c
[19.406s]     LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006E7)' [C:\TEMP\workdir\ws\build\poco_vendor\CMakeFiles\CMakeTmp\cmTC_c31aa.vcxproj]
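A minimal sketch of the workaround described above, assuming the build itself (run_ros2_batch.py or similar) has already run in a container-local workspace outside the mount: afterwards the build and install trees are copied back into the bind-mounted directory. The helper name copy_results_to_mount, the default subdirectory names, and both paths are hypothetical, not part of the CI scripts.

```python
import shutil
from pathlib import Path


def copy_results_to_mount(local_ws, mounted_ws, subdirs=("build", "install")):
    """Copy build results from a container-local workspace into a mounted one.

    Hypothetical helper: the workspace paths and subdirectory names are
    assumptions for illustration. Returns the names that were copied.
    """
    local_ws = Path(local_ws)
    mounted_ws = Path(mounted_ws)
    copied = []
    for name in subdirs:
        src = local_ws / name
        if not src.is_dir():
            # Nothing was produced for this subdirectory; skip it.
            continue
        # dirs_exist_ok (Python 3.8+) merges into an existing destination tree.
        shutil.copytree(src, mounted_ws / name, dirs_exist_ok=True)
        copied.append(name)
    return copied
```

Only the copy happens on the mount, so the linker's PDB access stays on the container's own filesystem.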

To run a containerized Windows OS on Windows, the containerized OS has compatibility requirements with the host OS. See https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/version-compatibility?tabs=windows-server-1909%2Cwindows-10-1909

This means that without Hyper-V isolation, the container OS must match the host OS, because both share the same kernel. With Hyper-V isolation, the Release Id of the container OS generally must be the same as or older than that of the host OS. I've added logic in my PR to pass the Release Id into the Docker image build when it runs in a Jenkins job.

However, this also means that anyone who downloads the image from the cloud instance to run on their own machine must be running the same Release Id as the cloud instance, or one compatible through Hyper-V isolation.
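The compatibility rule above can be summarized as a small decision function. This is a simplified sketch that assumes purely numeric Release Ids (such as 1809 or 1909) and ignores finer build-number matching; choose_isolation, docker_run_args and the image name are hypothetical, while process and hyperv are the values docker run accepts for --isolation on Windows.

```python
from typing import List, Optional


def choose_isolation(container_release: str, host_release: str) -> Optional[str]:
    """Pick a docker --isolation mode under the simplified rule above.

    - same Release Id: process isolation (container and host share the kernel)
    - older container: Hyper-V isolation is required
    - newer container: not runnable on this host, so return None
    """
    container, host = int(container_release), int(host_release)
    if container == host:
        return "process"
    if container < host:
        return "hyperv"
    return None


def docker_run_args(image: str, container_release: str, host_release: str) -> List[str]:
    """Build a docker run command line for the given Release Ids (hypothetical helper)."""
    mode = choose_isolation(container_release, host_release)
    if mode is None:
        raise ValueError(
            f"container Release Id {container_release} is newer than host {host_release}"
        )
    return ["docker", "run", "--rm", f"--isolation={mode}", image]
```

Process isolation is preferred when the Release Ids match because it avoids the overhead of a Hyper-V utility VM.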

To find the Release Id on a Windows machine, run:

powershell $(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion').ReleaseId
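For scripts that need the same value (for example, to pass it into a Docker image build), a rough Python equivalent reads the same registry value through the standard-library winreg module. The function name get_release_id is hypothetical; it returns None when not running on Windows or when the ReleaseId value is missing.

```python
import sys
from typing import Optional


def get_release_id() -> Optional[str]:
    """Return the Windows ReleaseId (e.g. "1809"), or None if unavailable."""
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows

    key_path = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "ReleaseId")
    except OSError:
        # Key or value not present on this Windows version.
        return None
    return str(value)
```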

Currently, it is not possible to install RTI Connext from the command line on Windows. The installer provides a headless mode and a text-based mode. Unfortunately, the headless mode is not available for the evaluation installer, and the text-based mode is not available on Windows.

https://community.rti.com/static/documentation/connext-dds/5.2.0/doc/manuals/connext_dds/html_files/RTI_ConnextDDS_CoreLibraries_GettingStarted/Content/GettingStarted/Installing_ConnextDDS.htm

@brawner I assigned you in order to avoid the issue appearing again in our triaging process.
Feel free to unassign yourself if you think that the assignment is wrong.

Thanks to @cottsay (brawner#1), PR #361 now has Connext functionality. He added a git submodule pointing to a private OSRF repo, from which a professional Connext installer is downloaded. Unlike the evaluation installer, the professional installer can run headless in a Docker container.

ci_windows and ci_windows-container build differences

Because it uses more up-to-date installs, ci_windows-container catches build errors that ci_windows does not

Turtlesim incompatible with Qt 5.12.7 (addressed by moving to 5.14.1 in #383)

  • ci_windows 9203, ci_windows-container 88
  • ci_windows 9157, ci_windows-container 39
  • ci_windows 9133, ci_windows-container 14

Caught MSBuild warnings from newer versions of Visual Studio (already fixed in ros2/rclcpp#963)

  • ci_windows 9172, ci_windows-container 50
  • ci_windows 9156, ci_windows-container 38

cppcheck 1.90 issues: Addressed by:

  • ci_windows 9400, ci_windows-container 238
  • ci_windows 9397, ci_windows-container 235
  • ci_windows 9395, ci_windows-container 234
  • ci_windows 9394, ci_windows-container 233
  • ci_windows 9391, ci_windows-container 231
  • ci_windows 9390, ci_windows-container 230
  • ci_windows 9366, ci_windows-container 211
  • ci_windows 9331, ci_windows-container 188
  • ci_windows 9330, ci_windows-container 187
  • ci_windows 9316, ci_windows-container 177
  • ci_windows 9300, ci_windows-container 160
  • ci_windows 9305, ci_windows-container 169
  • ci_windows 9269, ci_windows-container 135
  • ci_windows 9255, ci_windows-container 127

Open issues for ci_windows-container

Timing-related flakiness

  • ci_windows 9397, ci_windows-container 235
  • ci_windows 9394, ci_windows-container 233
  • ci_windows 9391, ci_windows-container 231
  • ci_windows 9390, ci_windows-container 230
  • ci_windows 9346, ci_windows-container 200
  • ci_windows 9337, ci_windows-container 194
  • ci_windows 9331, ci_windows-container 188
  • ci_windows 9328, ci_windows-container 185
  • ci_windows 9316, ci_windows-container 177
  • ci_windows 9261, ci_windows-container 130
  • ci_windows 9255, ci_windows-container 127
  • ci_windows 9272, ci_windows-container 138
  • ci_windows 9275, ci_windows-container 142
  • ci_windows 9277, ci_windows-container 144
  • ci_windows 9280, ci_windows-container 147
  • ci_windows 9283, ci_windows-container 150
  • ci_windows 9282, ci_windows-container 149
  • ci_windows 9205, ci_windows-container 88 (this one failed for a Java-related reason)
  • ci_windows 9184, ci_windows-container 58 (this one failed for a Java-related reason)
  • ci_windows 9174, ci_windows-container 51
  • ci_windows 9176, ci_windows-container 52
  • ci_windows 9166, ci_windows-container 45
  • ci_windows 9165, ci_windows-container 44
  • ci_windows 9150, ci_windows-container 32
  • ci_windows 9146, ci_windows-container 26
  • ci_windows 9132, ci_windows-container 13

Many failed tests; hard to tell which ones differed between the two jobs

  • ci_windows 9130, ci_windows-container 12

Issues with ci_windows

ci_windows failed tests that ci_windows-container passed

  • ci_windows 9123, ci_windows-container 9

Test failure caught by ci_windows-container, but not ci_windows. Isolating tests fixed this issue:

  • ci_windows 9185, ci_windows-container 67

Flake8 on ci_windows didn't catch the failure (ci_windows-container matches non-Windows builds)

  • ci_windows 9268, ci_windows-container 133

ci_windows-container matches other build types (ci_windows is the odd one out)

  • ci_windows 9272, ci_windows-container 138

Java agent heap size

4 GB heap space (node was just restarted)

  • ci_windows 9389, ci_windows-container 229
  • ci_windows 9383, ci_windows-container 225
  • ci_windows 9380, ci_windows-container 222
  • ci_windows 9376, ci_windows-container 218

Default heap size (1 GB)

  • ci_windows 9192, ci_windows-container 75
  • ci_windows 9190, ci_windows-container 73
  • ci_windows 9181, ci_windows-container 55
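The 4 GB runs above used a larger maximum heap for the Jenkins agent JVM. One common way to set this is the standard JVM -Xmx option on the agent's java command line; the sketch below only builds that command. How these agents are actually launched, the agent jar path, and any extra agent arguments are assumptions, and agent_command is a hypothetical helper.

```python
from typing import List, Sequence


def agent_command(agent_jar: str, heap: str = "4g", extra_args: Sequence[str] = ()) -> List[str]:
    """Build a java command line that starts a Jenkins agent with a larger max heap.

    -Xmx sets the JVM's maximum heap size. It must come before -jar so that the
    JVM, not the agent program, receives it. agent_jar and extra_args are
    placeholders for however the agent is really launched.
    """
    return ["java", f"-Xmx{heap}", "-jar", agent_jar, *extra_args]
```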

ci_windows build failures

  • ci_windows 9360, ci_windows-container 208
  • ci_windows 9359, ci_windows-container 207
  • ci_windows 9335, ci_windows-container 192
  • ci_windows 9334, ci_windows-container 191
  • ci_windows 9333, ci_windows-container 190
  • ci_windows 9332, ci_windows-container 189
  • ci_windows 9319, ci_windows-container 179
  • ci_windows 9318, ci_windows-container 178
  • ci_windows 9294, ci_windows-container 157
  • ci_windows 9288, ci_windows-container 156
  • ci_windows 9286, ci_windows-container 153
  • ci_windows 9285, ci_windows-container 152
  • ci_windows 9284, ci_windows-container 151
  • ci_windows 9283, ci_windows-container 150
  • ci_windows 9282, ci_windows-container 149
  • ci_windows 9277, ci_windows-container 144
  • ci_windows 9275, ci_windows-container 142
  • ci_windows 9261, ci_windows-container 130
  • ci_windows 9255, ci_windows-container 127
  • ci_windows 9249, ci_windows-container 123
  • ci_windows 9248, ci_windows-container 122
  • ci_windows 9247, ci_windows-container 121
  • ci_windows 9198, ci_windows-container 84
  • ci_windows 9112, ci_windows-container 4

ci_windows-container build failures

Qt installer stalled (I killed the Docker container)

  • ci_windows 9219, ci_windows-container 103

Java agent heap size was too small

4 GB heap space (node was just restarted)

  • ci_windows 9389, ci_windows-container 229
  • ci_windows 9383, ci_windows-container 225
  • ci_windows 9380, ci_windows-container 222
  • ci_windows 9376, ci_windows-container 218

Default heap size (1 GB)

  • ci_windows 9216, ci_windows-container 100
  • ci_windows 9190, ci_windows-container 73
  • ci_windows 9181, ci_windows-container 55

Qt 5.12.7 incompatibility with turtlesim

  • ci_windows 9203, ci_windows-container 88
  • ci_windows 9157, ci_windows-container 39
  • ci_windows 9133, ci_windows-container 14

hudson.remoting.RequestAbortedException

  • ci_windows 9192, ci_windows-container 75
  • ci_windows 9184, ci_windows-container 58

@nuclearsandwich Can this be closed in favor of new macOS investigations?