Transition Windows/Mac CI farms to cloud virtualization
Closed this issue · 9 comments
This issue tracks the progress of transitioning the Windows and Mac CI machines to cloud virtualization. The goal of the project is to create a more reproducible, deployable, and scalable Jenkins CI farm. Because Microsoft has embraced containerization over the last couple of years, it is feasible to run Windows CI instances inside Docker containers on VMs from a cloud provider. On the Mac side, MacStadium offers potential for virtualization with Mac instances.
Windows cloud virtualization tasks:
- Demonstrate a CI build by running run_ros2_batch.py from a docker container
- Deploy a Jenkins build job to an EC2 instance: https://citest.ros2.org/job/ci_windows/87/
- Merge windows_docker_resources PR (#361) (1/31/2019)
- Cloud Jenkins agents running side-by-side with current CI farm (1/31/2019)
- Windows configuration management of EC2 VMs with Chef (1/31/2019)
- Retire bare metal CI Windows servers (3/1/2019)
Mac Cloud Virtualization tasks (TBD)
- Investigate MacStadium's cloud virtualization offering
- Demonstrate a CI build by running run_ros2_batch.py from a virtual Mac instance
- Deploy a Jenkins build job to a MacStadium instance
- Create and Merge Mac CI PR
- Cloud Jenkins agents running side-by-side with current CI farm
- Mac configuration management of MacStadium VMs with Orka
- Retire bare metal CI Mac servers
I've run into an issue when attempting to build in a directory mounted with docker run -v. When building, cmake runs a compiler check on a simple program in debug mode with the /Zi build flag. This creates a pdb file with debug symbols. Any references to the debug symbols are requested through mspdbsrv.exe. I suspect the failure has to do with file handle access across the mounted directory, between the containerized OS and the host OS. Building in a directory that is not in a mounted location, and then copying the build/install results back to the mounted directory, seems to work.
Example of the failing output:
cl /c /Zi /W3 /WX- /diagnostics:classic /Od /Ob0 /D WIN32 /D _WINDOWS /D "CMAKE_INTDIR=\"Debug\"" /D _MBCS /Gm- /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /Fo"cmTC_c31aa.dir\Debug\\" /Fd"cmTC_c31aa.dir\Debug\vc141.pdb" /Gd /TC /errorReport:queue C:\TEMP\workdir\ws\build\poco_vendor\CMakeFiles\CMakeTmp\testCCompiler.c
[19.406s]
[19.406s] testCCompiler.c
[19.406s] LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006E7)' [C:\TEMP\workdir\ws\build\poco_vendor\CMakeFiles\CMakeTmp\cmTC_c31aa.vcxproj]
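The workaround described above (build outside the mount, then copy the results back) could be sketched roughly as follows. This is only an illustration, not the code from the PR: the function name `build_outside_mount`, its parameters, and the `build`/`install` directory names are assumptions made for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_outside_mount(build_cmd, mounted_dir, subdirs=("build", "install")):
    """Run build_cmd in a container-local scratch directory, then copy results.

    build_cmd   -- command list, run with the scratch directory as its cwd
    mounted_dir -- directory shared with the host via `docker run -v`
    subdirs     -- result directories to copy back (assumed names)
    """
    mounted_dir = Path(mounted_dir)
    with tempfile.TemporaryDirectory() as scratch:
        scratch = Path(scratch)
        # Build on the container's own filesystem so that pdb files are
        # never written through the mounted directory.
        subprocess.run(build_cmd, cwd=scratch, check=True)
        copied = []
        for name in subdirs:
            src = scratch / name
            if src.is_dir():
                # dirs_exist_ok requires Python 3.8 or newer.
                shutil.copytree(src, mounted_dir / name, dirs_exist_ok=True)
                copied.append(name)
    return copied
```

The scratch directory is deleted when the function returns, so only the copied results persist in the mounted location.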
To run a containerized Windows OS on Windows, the containerized OS has compatibility requirements with the host OS. See https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/version-compatibility?tabs=windows-server-1909%2Cwindows-10-1909
This means that without Hyper-V enabled, the containerized OS must match the host OS, because both use the same kernel. With Hyper-V, the Release Id of the container OS must generally be the same as or older than that of the host OS. I've added logic in my PR to pass the Release Id into the Docker image when building from a Jenkins job.
However, this also means that anyone who downloads the image from the cloud instance to run on their own machine will need a host whose Release Id matches that of the cloud instance, or one that is compatible through Hyper-V.
To find the Release Id on a Windows machine, run:
powershell $(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion').ReleaseId
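The compatibility rule described above can be written down as a small check. This is a minimal sketch that assumes Release Ids are numeric strings such as 1809 or 1909 and that numeric order matches release order; the function name `container_can_run` is hypothetical, and the Microsoft compatibility page linked above remains the authority for specific versions.

```python
def container_can_run(container_release_id, host_release_id, hyperv_isolation):
    """Return True if the container OS is expected to run on the host.

    Without Hyper-V the container shares the host kernel, so the Release Ids
    must match. With Hyper-V the container may be the same release or older.
    """
    # Assumption: Release Ids are numeric strings, e.g. "1809" or "1909".
    container = int(container_release_id)
    host = int(host_release_id)
    if hyperv_isolation:
        return container <= host
    return container == host
```

For example, `container_can_run("1809", "1909", hyperv_isolation=False)` is false, while the same pair with `hyperv_isolation=True` is true.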
Currently, it is not possible to install RTI Connext through the command line on Windows. The installer provides a headless mode and a text-based mode. Unfortunately, the headless mode is not available for the evaluation installer, and the text-based mode is not available on Windows.
@brawner I assigned you to keep this issue from showing up again in our triaging process.
Feel free to unassign yourself if you think the assignment is wrong.
Thanks to @cottsay (brawner#1), PR #361 now has Connext functionality. He added a git submodule pointing to a private OSRF repo to download a professional Connext installer. The professional installer can be installed headless in a docker container, which was not possible with the evaluation installer.
ci_windows and ci_windows-container build differences
ci_windows-container catches build errors with more up-to-date installs
Turtlesim incompatible with Qt 5.12.7. (Addressed by moving to 5.14.1 in #383)
- ci_windows 9203 ci_windows-container 88
- ci_windows 9157 ci_windows-container 39
- ci_windows 9133 ci_windows-container 14
Caught MSBuild warnings in newer versions of Visual Studio. This was already fixed in ros2/rclcpp#963
cppcheck 1.90 issues, addressed by:
- #387
- ros2/rclcpp#1000
- ros2/system_tests#400
- https://gitlab.com/micro-ROS/ros_tracing/ros2_tracing/-/merge_requests/140
- Tracetools may still fail though. (https://gitlab.com/micro-ROS/ros_tracing/ros2_tracing/issues/69)
- ci_windows 9400 ci_windows-container 238
- ci_windows 9397 ci_windows-container 235
- ci_windows 9395 ci_windows-container 234
- ci_windows 9394 ci_windows-container 233
- ci_windows 9391 ci_windows-container 231
- ci_windows 9390 ci_windows-container 230
- ci_windows 9366 ci_windows-container 211
- ci_windows 9331 ci_windows-container 188
- ci_windows 9330 ci_windows-container 187
- ci_windows 9316 ci_windows-container 177
- ci_windows 9300 ci_windows-container 160
- ci_windows 9305 ci_windows-container 169
- ci_windows 9269 ci_windows-container 135
- ci_windows 9255 ci_windows-container 127
Open issues for ci_windows-container
Timing related flakiness
- ci_windows 9397 ci_windows-container 235
- ci_windows 9394 ci_windows-container 233
- ci_windows 9391 ci_windows-container 231
- ci_windows 9390 ci_windows-container 230
- ci_windows 9346 ci_windows-container 200
- ci_windows 9337 ci_windows-container 194
- ci_windows 9331 ci_windows-container 188
- ci_windows 9328 ci_windows-container 185
- ci_windows 9316 ci_windows-container 177
- ci_windows 9261 ci_windows-container 130
- ci_windows 9255 ci_windows-container 127
- ci_windows 9272 ci_windows-container 138
- ci_windows 9275 ci_windows-container 142
- ci_windows 9277 ci_windows-container 144
- ci_windows 9280 ci_windows-container 147
- ci_windows 9283 ci_windows-container 150
- ci_windows 9282 ci_windows-container 149
- ci_windows 9205 ci_windows-container 88 (This one failed for a java related reason)
- ci_windows 9184 ci_windows-container 58 (This one failed for a java related reason)
- ci_windows 9174 ci_windows-container 51
- ci_windows 9176 ci_windows-container 52
- ci_windows 9166 ci_windows-container 45
- ci_windows 9165 ci_windows-container 44
- ci_windows 9150 ci_windows-container 32
- ci_windows 9146 ci_windows-container 26
- ci_windows 9132 ci_windows-container 13
A lot of failed tests, hard to tell which ones were different
Issues with ci_windows
ci_windows failed tests that ci_windows-container succeeded at
Test failure caught by ci_windows-container, but not ci_windows. Isolating tests fixed this issue:
Flake8 on ci_windows didn't catch failure (ci_windows-container matches non-windows builds)
ci_windows-container matches other build types (ci_windows is the odd one out)
Java agent heap size
4GB Heap space (Node was just restarted)
- ci_windows 9389 ci_windows-container 229
- ci_windows 9383 ci_windows-container 225
- ci_windows 9380 ci_windows-container 222
- ci_windows 9376 ci_windows-container 218
Default Heap size (1GB)
ci_windows build failures
- ci_windows 9360 ci_windows-container 208
- ci_windows 9359 ci_windows-container 207
- ci_windows 9335 ci_windows-container 192
- ci_windows 9334 ci_windows-container 191
- ci_windows 9333 ci_windows-container 190
- ci_windows 9332 ci_windows-container 189
- ci_windows 9319 ci_windows-container 179
- ci_windows 9318 ci_windows-container 178
- ci_windows 9294 ci_windows-container 157
- ci_windows 9288 ci_windows-container 156
- ci_windows 9286 ci_windows-container 153
- ci_windows 9285 ci_windows-container 152
- ci_windows 9284 ci_windows-container 151
- ci_windows 9283 ci_windows-container 150
- ci_windows 9282 ci_windows-container 149
- ci_windows 9277 ci_windows-container 144
- ci_windows 9275 ci_windows-container 142
- ci_windows 9261 ci_windows-container 130
- ci_windows 9255 ci_windows-container 127
- ci_windows 9249 ci_windows-container 123
- ci_windows 9248 ci_windows-container 122
- ci_windows 9247 ci_windows-container 121
- ci_windows 9198 ci_windows-container 84
- ci_windows 9112 ci_windows-container 4
ci_windows-container build failures
Qt Installer stalled (I killed the docker container)
Java agent heap size was too small
4GB Heap space (Node was just restarted)
- ci_windows 9389 ci_windows-container 229
- ci_windows 9383 ci_windows-container 225
- ci_windows 9380 ci_windows-container 222
- ci_windows 9376 ci_windows-container 218
Default heap size (1GB)
- ci_windows 9216 ci_windows-container 100
- ci_windows 9190 ci_windows-container 73
- ci_windows 9181 ci_windows-container 55
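For reference, raising the agent's heap as described above amounts to passing a larger `-Xmx` value to the JVM that runs the Jenkins agent. The sketch below only assembles such a command line. `-Xmx` is the standard JVM maximum-heap option, but the helper name `agent_command`, the `agent.jar` file name, and the extra arguments are assumptions, since this thread does not show how the agents are actually launched.

```python
def agent_command(heap_gb, agent_jar="agent.jar", extra_args=()):
    """Build a java command line for an agent with the given maximum heap.

    heap_gb    -- maximum heap size in gigabytes (whole number, at least 1)
    agent_jar  -- path to the agent jar (assumed name)
    extra_args -- connection arguments for the agent, passed through as-is
    """
    if heap_gb < 1:
        raise ValueError("heap_gb must be at least 1")
    # -Xmx sets the maximum Java heap size, e.g. -Xmx4g for 4 GB.
    return ["java", f"-Xmx{heap_gb}g", "-jar", agent_jar, *extra_args]
```

The returned list can be passed directly to `subprocess.run` on the agent machine.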
Qt 5.12.7 incompatibility with turtlesim
- ci_windows 9203 ci_windows-container 88
- ci_windows 9157 ci_windows-container 39
- ci_windows 9133 ci_windows-container 14
hudson.remoting.RequestAbortedException
@nuclearsandwich Can this be closed in favor of the new macOS investigations?