dd010101/vyos-jenkins

Need help testing installer

Closed this issue · 17 comments

Hey,

I have finished my initial setup for the installer.
I would like to get someone else to test it - @dd010101 can you help with that?

The steps are easy:

Once done, you should end up with a Jenkins setup with all the packages, an apt repo with all the packages and the needed docker images.

Then it is just a matter of running the build-iso.sh script to build the ISO 😊

Looking forward to hearing whether it works for someone other than me.
The scripts still need a bit of polishing, and I need to write some documentation.

Step 1 - Good.

Step 2 - Exception thrown at the end:

Download Jenkins CLI...
Installing plugins...

Installing Job DSL plugin...
Installing Copy Artifact plugin...
Installing SSH Agent plugin...
Installing Docker plugin...
Installing Docker Pipeline plugin...
Installing Pipeline Utility Steps plugin...

Stopping jenkins...
Setting executors to 128...
Configuring labels...
Configuring Environment variables...
Configuring global libraries...
Configuring declarative pipeline...
Restarting Jenkins...
Creating SSH key credential...
io.jenkins.cli.shaded.org.glassfish.tyrus.client.exception.DeploymentHandshakeException: Handshake error.
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.exception.Exceptions.deploymentException(Exceptions.java:38)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:622)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:657)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:810)
        at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:460)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.lambda$connectToServer$2(ClientManager.java:313)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.tryCatchInterruptedExecutionEx(ClientManager.java:324)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:313)
        at hudson.cli.CLI.webSocketConnection(CLI.java:360)
        at hudson.cli.CLI._main(CLI.java:313)
        at hudson.cli.CLI.main(CLI.java:101)
Caused by: io.jenkins.cli.shaded.org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 404.
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:308)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.ClientFilter.processRead(ClientFilter.java:167)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:111)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:295)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:279)
        at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:129)
        at java.base/sun.nio.ch.Invoker.invokeDirect(Invoker.java:160)
        at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.implRead(UnixAsynchronousSocketChannelImpl.java:573)
        at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.read(AsynchronousSocketChannelImpl.java:276)
        at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.read(AsynchronousSocketChannelImpl.java:297)
        at java.base/java.nio.channels.AsynchronousSocketChannel.read(AsynchronousSocketChannel.java:425)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter._read(TransportFilter.java:279)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$3.completed(TransportFilter.java:191)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$3.completed(TransportFilter.java:185)
        at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:129)
        at java.base/sun.nio.ch.Invoker$2.run(Invoker.java:221)
        at java.base/sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:113)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)

Part 2 of the installer is now done.
Please run part three to set up the reprepro repositories.

The page Dashboard > Administration > Credentials > System > Global credentials (unrestricted) was empty:

This credential domain is empty. How about adding some credentials?

So the SSH credentials step failed? I guess because it tried to communicate with Jenkins too soon after the restart? There should be error handling so it doesn't say part 2 is done if an error occurred.
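If the "too soon after restart" guess were the cause, one way to guard against it would be a small retry helper. This is only a sketch under that assumption; the `retry` function and its parameters are hypothetical and not part of the installer.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "Giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

A step could then wait for Jenkins before calling the CLI, for example `retry 30 2 curl -fsS -o /dev/null "http://172.17.17.17:8080/login"` (the URL is an assumption), and abort instead of printing "Part 2 is done" when the helper returns non-zero.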

I created the SSH credentials by hand as a workaround; the key had already been generated.

Also, step 2 assigned 128 executors? I have 8 logical processors...

root@bookworm:~# nproc --all
128
root@bookworm:~# nproc
8
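Given the output above, a sketch of deriving the executor count from `nproc` (processors available to the process) rather than `nproc --all` (processors installed) might look like this. The `executor_count` function and its optional cap are hypothetical.

```shell
# Print the number of executors to configure.
# Uses the processors available to this process, optionally capped.
# Usage: executor_count [max]
executor_count() {
    max="${1:-0}"          # 0 means "no cap"
    count="$(nproc)"
    if [ "$max" -gt 0 ] && [ "$count" -gt "$max" ]; then
        count="$max"
    fi
    echo "$count"
}
```

On the VM shown above this would yield 8 rather than 128.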

Step 3 - Good.

Step 4 - Good.

Step 5 - Asks for a username, but it is unclear which username, then asks for the Jenkins token. I think it would be good to store the username and token from the previous step in /tmp so it doesn't ask multiple times; if the /tmp data is missing, then ask, but clearly say that the Jenkins username is expected.
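A minimal sketch of that caching idea follows. The file name, the variable names, and the `load_credentials` function are all assumptions made for illustration; the installer may store things differently.

```shell
# Hypothetical cache file; the suggestion above is somewhere under /tmp.
CREDENTIALS_FILE="${CREDENTIALS_FILE:-/tmp/vyos-installer-credentials}"

# Load the Jenkins username and token from the cache, or prompt once and save them.
load_credentials() {
    if [ -f "$CREDENTIALS_FILE" ]; then
        # Line 1 holds the username, line 2 the token.
        { IFS= read -r JENKINS_USER; IFS= read -r JENKINS_TOKEN; } < "$CREDENTIALS_FILE"
    else
        printf 'Jenkins username (your own account, not the temporary admin): '
        IFS= read -r JENKINS_USER
        printf 'Jenkins API token for %s: ' "$JENKINS_USER"
        IFS= read -r JENKINS_TOKEN
        # umask 077 in a subshell keeps the new file readable by the current user only.
        ( umask 077; printf '%s\n%s\n' "$JENKINS_USER" "$JENKINS_TOKEN" > "$CREDENTIALS_FILE" )
    fi
}
```

Each step that needs the token could call `load_credentials` first, so the prompt appears at most once and always names what is expected.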

Step 6 - All indexing failed with:

/var/lib/jenkins/jobs/vyos-strongswan/branches/equuleus/builds/1/libs/2dd8beab4a27c62b0096ad2bd29295c12725721c1677e23217c3eca61fc685b3/vars/buildPackage.groovy: 76: Invalid agent type "docker" specified. Must be one of [any, label, none] @ line 76, column 29.

The Jenkins plugins are missing. Something is wrong, I guess because of the partial failure of step 2? That's where I stopped.


I'm not quite sure where I am and what I should run next.

  1. It would be nice to include the step name in the header, such as the name of the script, so I can see what I'm running right now.
  2. It would also be good to include the script names at the end, like "step 1 (1-prereqs.sh) is done, now continue with the second step (2-jenkins.sh)"...
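The two suggestions above could be sketched with a pair of small helpers. The function names and the exact wording of the messages are hypothetical.

```shell
# Print a header that names the step and the script being run.
# Usage: step_header <step-number> <script-name>
step_header() {
    printf '=== Step %s (%s) ===\n' "$1" "$2"
}

# Print a footer that names the finished script and the next one to run.
# Usage: step_footer <step-number> <script-name> <next-number> <next-script>
step_footer() {
    printf 'Step %s (%s) is done, continue with step %s (%s).\n' "$1" "$2" "$3" "$4"
}
```

A script would call, for example, `step_header 1 "$(basename "$0")"` at the top and `step_footer 1 1-prereqs.sh 2 2-jenkins.sh` at the end. A plain `echo` before a slow stage, saying that it is expected to take a long time, covers the container-build case in the same spirit.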

I'm also not sure whether something is wrong at times - for example, step 5 is long because containers are being built. There should be a heads-up message saying that this is expected to take a long time, so people know nothing is wrong.

I'll take a look at those later 😊

Did you enter your username and the token you created? (and not the temp admin token)

Yes. It's possible I copied something wrong, but later steps didn't have an issue. If this is about wrong credentials, shouldn't it stop right away? I assume the credentials were correct, since the steps before that didn't throw an error.

I tried again and was careful to copy the right thing, and the result is the same - the script does okay, then restarts Jenkins and fails the same way. Also, it says it installs the plugins but it doesn't.

I think there should be better error handling so we can see what failed: all scripts should start with set -e so they terminate when a command returns a non-zero exit code. You need to check whether some commands fail intentionally, though. If a command is expected to fail, append || true, like bad_command || true.
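A small demonstration of that behaviour, as a sketch: the `run_demo` wrapper is hypothetical and exists only to run the snippet in a child shell so the parent keeps going.

```shell
# Run a tiny script in a child shell so its errexit behaviour can be observed.
run_demo() {
    sh -c '
        set -e                 # abort on the first unexpected failure
        echo "step 1"
        false || true          # expected failure, explicitly tolerated
        echo "step 2"
        false                  # unexpected failure: the script stops here
        echo "step 3"          # never reached
    '
}
```

Calling `run_demo` prints "step 1" and "step 2" and returns a non-zero status, so a closing "Part 2 is done" line would never be printed after a real failure.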

There are also a lot of "shut ups" like wget http://$IP_ADDRESS:8080/jnlpJars/jenkins-cli.jar > /dev/null 2>&1. That's not a good idea; it hides errors and makes debugging impossible.
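If the goal of the redirection is a tidy console, one alternative is to send output to a log file and point at it on failure. This is a sketch only; `run_logged` and the log path are hypothetical names.

```shell
# Hypothetical log location.
LOG_FILE="${LOG_FILE:-/tmp/installer.log}"

# Run a command quietly, keep its output in the log, and report failures.
run_logged() {
    if "$@" >> "$LOG_FILE" 2>&1; then
        return 0
    else
        run_logged_status=$?
        echo "Command failed (exit $run_logged_status): $*" >&2
        echo "See $LOG_FILE for the full output." >&2
        return "$run_logged_status"
    fi
}
```

For example, `run_logged wget "http://172.17.17.17:8080/jnlpJars/jenkins-cli.jar"` stays quiet on success but leaves the error in the log when something goes wrong.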

If I remove the "shut ups" then I get errors:

Installing plugins...

io.jenkins.cli.shaded.org.glassfish.tyrus.client.exception.DeploymentHandshakeException: Handshake error.
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.exception.Exceptions.deploymentException(Exceptions.java:38)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:622)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:657)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:810)
        at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:460)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.lambda$connectToServer$2(ClientManager.java:313)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.tryCatchInterruptedExecutionEx(ClientManager.java:324)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:313)
        at hudson.cli.CLI.webSocketConnection(CLI.java:360)
        at hudson.cli.CLI._main(CLI.java:313)
        at hudson.cli.CLI.main(CLI.java:101)
Caused by: io.jenkins.cli.shaded.org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 404.
        at io.jenkins.cli.shaded.org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:308)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.ClientFilter.processRead(ClientFilter.java:167)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:111)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:295)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:279)
        at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:129)
        at java.base/sun.nio.ch.Invoker.invokeDirect(Invoker.java:160)
        at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.implRead(UnixAsynchronousSocketChannelImpl.java:573)
        at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.read(AsynchronousSocketChannelImpl.java:276)
        at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.read(AsynchronousSocketChannelImpl.java:297)
        at java.base/java.nio.channels.AsynchronousSocketChannel.read(AsynchronousSocketChannel.java:425)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter._read(TransportFilter.java:279)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$3.completed(TransportFilter.java:191)
        at io.jenkins.cli.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$3.completed(TransportFilter.java:185)
        at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:129)
        at java.base/sun.nio.ch.Invoker$2.run(Invoker.java:221)
        at java.base/sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:113)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)

For everything.

It's bad practice to use "shut ups" - you should never silence a command unless it's required, like when a command writes to stderr when it shouldn't.

The error is due to a wrong $IP_ADDRESS: the script parses the IP but gets 2 IPs instead of one, so the resulting command is invalid. You can use 172.17.17.17 there to avoid such issues. The 2 IPs are the local LAN IP + the docker IP.
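If auto-detection is kept anywhere, a sketch of narrowing the result to one address and sanity-checking it could look like the following. It assumes the detected value is a whitespace-separated list; both function names and the sample addresses in the usage note are made up.

```shell
# Keep only the first whitespace-separated address.
first_ip() {
    printf '%s\n' "$1" | awk '{ print $1; exit }'
}

# Loose check: non-empty and made of digits and dots only.
# This rejects values with spaces, but it is not a full IPv4 validator.
looks_like_ipv4() {
    case "$1" in
        ""|*[!0-9.]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

With `IP_ADDRESS="$(first_ip "192.168.1.10 172.17.0.1")"`, the variable holds a single address, and `looks_like_ipv4 "$IP_ADDRESS" || exit 1` stops the script early when it does not.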

I also see a lot of issues with parameters - you need to quote variable parameters everywhere, otherwise things will break. Like:

wget http://$IP_ADDRESS:8080/jnlpJars/jenkins-cli.jar

should be

wget "http://$IP_ADDRESS:8080/jnlpJars/jenkins-cli.jar"

Since here two IPs were injected and thus the result was:

wget http://127.0.0.1 127.0.0.1:8080/jnlpJars/jenkins-cli.jar

And this didn't fail; with proper quoting it would have failed. This applies to all variable arguments everywhere: they all need double quotes.
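The effect of the missing quotes can be seen without touching the network. In this sketch, `count_args` is a hypothetical helper that only reports how many arguments it received; the variable holds the two-address value from the example above.

```shell
# Print how many arguments the command actually received.
count_args() {
    echo "$#"
}

# Two addresses in one variable, as in the failing run.
IP_ADDRESS="127.0.0.1 127.0.0.1"

count_args http://$IP_ADDRESS:8080/jnlpJars/jenkins-cli.jar     # unquoted: split into 2 arguments
count_args "http://$IP_ADDRESS:8080/jnlpJars/jenkins-cli.jar"   # quoted: 1 argument, an invalid URL
```

Unquoted, wget receives two separate arguments; quoted, it receives one malformed URL and can report it as such.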

If I use 172.17.17.17 instead of $IP_ADDRESS then it works.

The biggest issue I see is the error handling - as it stands, it will be very difficult to debug if something fails for someone, since all errors are suppressed. This should be resolved by removing the "shut ups" (2>&1) in nearly all places, with a few exceptions, combined with proper exit code handling via set -e.

The 128 executors are way too many; it tries to melt down my machine.

It also causes builds to fail: it bogs down the CPU, which results in other tasks failing:

Receiving objects:  79% (190377/240983), 115.73 MiB | 1.73 MiB/s
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
16:19:20  error: 1197 bytes of body are still expected
16:19:20  fetch-pack: unexpected disconnect while reading sideband packet
16:19:20  fatal: early EOF
16:19:20  fatal: fetch-pack: invalid index-pack output
16:19:20  
16:19:20  	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2846)
16:19:20  	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:2185)
16:19:20  	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:635)
16:19:20  	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:871)
16:19:20  	at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1222)
16:19:20  	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1305)
16:19:20  	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:136)
16:19:20  	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:101)
16:19:20  	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:88)
16:19:20  	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
16:19:20  	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
16:19:20  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
16:19:20  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
16:19:20  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
16:19:20  	at java.base/java.lang.Thread.run(Thread.java:840)
16:19:21  ERROR: Error cloning remote repo 'origin'
16:19:21  ERROR: Maximum checkout retry attempts reached, aborting

https://github.com/GurliGebis/vyos-jenkins/blob/4f230418223f62ffc3878b613d10a408b4271a85/build-iso.sh#L45
Should this not use a variable for the IP?

https://github.com/GurliGebis/vyos-jenkins/blob/4f230418223f62ffc3878b613d10a408b4271a85/build-iso.sh#L61
https://github.com/GurliGebis/vyos-jenkins/blob/4f230418223f62ffc3878b613d10a408b4271a85/build-iso.sh#L76
You hard code the IP here as well.

More importantly,
https://github.com/GurliGebis/vyos-jenkins/blob/4f230418223f62ffc3878b613d10a408b4271a85/build-iso.sh#L63
You've defined a release type image, but are then adding the vyos-1x-smoketest package.
To do this properly, you should simply define it as a development build:
https://github.com/vyos/vyos-build/blob/sagitta/data/build-types/development.toml
A release image shouldn't include smoketests.
I would also use their naming scheme, 1.3-stable-$date and 1.4-stable-$date, or give the user the ability to specify the complete version string, perhaps with replaceable variables (e.g. 1.3-stable-{{date}}).
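The replaceable-variable idea could be sketched as below. The `expand_version` function is hypothetical, and the YYYYMMDD date format is an assumption.

```shell
# Replace every {{date}} placeholder in a version template.
# Usage: expand_version <template> [date]   (date defaults to today, YYYYMMDD)
expand_version() {
    template=$1
    build_date=${2:-$(date +%Y%m%d)}
    printf '%s\n' "$template" | sed "s/{{date}}/$build_date/g"
}
```

For instance, `expand_version '1.3-stable-{{date}}' 20240617` prints `1.3-stable-20240617`.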

Granted, this is essentially entirely for internal consumption so the naming of things is less important as they mean something to you. I just think it would be good to not conflate terms and follow the project's accepted naming standards.

How does a user using this script specify building with additional packages? There's no variable for this, and the script (and seemingly the Jenkins build process without modifications/patches) doesn't allow a user to provide their own build flavor to handle this either.

Should this not use a variable for the IP?
You hard code the IP here as well.

That's correct for our setup, since we use 172.17.17.17 internally. If you wanted to make this a generic script that anyone can use anywhere, then sure, there could be an optional override via an environment variable, but that's not the target use case.

To do this properly, you should simply define it as a development build

That's not correct. See the official build script. You should have release + smoketest. We don't want a development build; we specifically want a release build we can use for production with the option to test the image - that's not a development build. The development build also generates a random image name, and that's annoying. We don't need development packages; the smoketest is enough.

You should have release + smoketest.

Only if

                    if (params.TEST_SMOKETESTS)

from

        booleanParam(name: 'TEST_SMOKETESTS', defaultValue: true, description: 'Run Smoketests after ISO build')

By default, Jenkins produces a smoketest image, which is not uploaded to the official S3 bucket.
The released, published official .iso would never contain the smoketest package.

You can even see in the nightly builds, which are handled entirely via GitHub Actions, that they do as you say, release + smoketests, but only for the smoketest .iso. They then turn around and produce an actual release image without the smoketest package.

It is completely a user preference whether to include additional packages that may not be used, but doing so deviates from the official releases, which I want to emulate as closely as possible without having an official package list.

That's correct for our setup, since we use 172.17.17.17 internally. If you wanted to make this a generic script that anyone can use anywhere, then sure

This comment was more about consistency. In one script, detection is done to find its IP. In another, the IP is hard coded. Which is it?

@Crushable1278

By default, Jenkins produces a smoketest image, which is not uploaded to the official S3 bucket.
The released, published official .iso would never contain the smoketest package.

That's flawed logic. You make one image, you test it, you delete it, you create another image, and you use the untested image? That doesn't make sense. Smoketest should be included in every build so you can test the ISO you will use, because testing some other ISO may test something other than what you will use... The chance of this happening isn't huge, but it certainly exists... The cost for you is a slightly larger ISO - less important than knowing that you will use the thing you tested...

I know that's how it's officially done - but it doesn't make sense for us to do it the way the team does. The team can have a policy - let's not push anything to git while we are building a release, or whatever - but we build releases independently, so there is a real chance someone will do a git push in the meantime. I don't like the idea that you don't test the thing you use anyway - it's just logically wrong, it rubs me the wrong way. If you spend 2 hours on testing then you may as well use the thing you tested...

This comment was more about consistency. In one script, detection is done to find its IP. In another, the IP is hard coded. Which is it?

The detection is used mainly to tell people where they can find Jenkins. We should use the internal IP for everything, since the other IP may be firewalled and there is no reason not to; auto-detection is fragile and should not be used for internal traffic.


@GurliGebis

With a corrected $IP_ADDRESS in step 2 I got to step 8. There was an issue with frr - it failed randomly, then worked when I triggered a manual build. The git download fails, seemingly because the CPU is overloaded?

Step 8 did report an error at the end:

#################################################
# Unofficial VyOS package mirror installer v1.0 #
#################################################

Configuring NGINX...
Removing default NGINX configuration...
Copying apt-mirror NGINX configuration file...
Linking apt-mirror NGINX configuration file...
Restarting NGINX

Part 8 of the installer is now done.
./8-nginx.sh: line 49: unexpected EOF while looking for matching `"'

But nginx works.
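An "unexpected EOF while looking for matching" message usually points at an unclosed quote, and `bash -n` can catch that class of error without running the script. A sketch, with `check_syntax` as a hypothetical helper name:

```shell
# Parse each script without executing it; return non-zero if any has a syntax error.
check_syntax() {
    failed=0
    for script in "$@"; do
        if ! bash -n "$script"; then
            echo "Syntax error in $script" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Running something like `check_syntax ./*.sh` before committing would flag 8-nginx.sh here.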

build-iso.sh doesn't have the executable flag set in git, but it does produce the ISO successfully:

###################################
# Unofficial VyOS ISO builder v1.0 #
####################################

Please enter which branch you want to build (equuleus or sagitta): sagitta
Please enter your email address: nope@nope.com

Cloning the VyOS build repository...
Checking out the sagitta branch...
Downloading apt signing key...
Building the ISO...

ISO build is complete.
The file is called: vyos-1.4-release-20240617-iso-amd64.iso.

Cleaning up...

$ stat vyos-1.4-release-20240617-iso-amd64.iso
  File: vyos-1.4-release-20240617-iso-amd64.iso
  Size: 465567744       Blocks: 909320     IO Block: 4096   regular file
Device: 8,1     Inode: 1707017     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-06-17 16:55:38.452393857 +0200
Modify: 2024-06-17 16:49:15.000000000 +0200
Change: 2024-06-17 16:55:39.152410060 +0200
 Birth: 2024-06-17 16:55:32.656259215 +0200

So it all works correctly apart from a few minor issues. One is the $IP_ADDRESS - we can simply use 172.17.17.17 instead for all operations. The CPU count is also detected incorrectly: nproc --all reports all installed CPUs, whatever that means, not the actual usable count - my VM has 1 CPU with 8 cores = 8 logical processors, but it reports 128 logical processors installed. There is a major issue with the error handling, though - it should be improved, otherwise people will report that something is broken and there will be no way to see why. That's not an issue if everything works, but it will be hell if something stops working.

Thanks, I hope to get some time to go through all the findings tomorrow 🙂

It's impressive you found a way to automate everything in such a short time - I see a lot of steps where you needed to invent a solution on your own. What was your experience with bash / jq / xmlstarlet before this project? Did you know most of these things, or did you need to learn them as you went?

I would also add the "Testing jenkins connection" check from step 6 to every step that works with the token - step 2 should certainly have it - to tell you if the username/token is wrong. Maybe also include a master run script that invokes all steps in sequence? It's good to have separate steps in case you need to repeat one, but the initial setup could chain one step after another.
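A master script along those lines could be sketched as follows. The `run_steps` function is hypothetical, and it assumes every step script exits non-zero on failure, which depends on the error handling discussed earlier.

```shell
# Run each step script in order and stop at the first one that fails.
# Usage: run_steps <script> [script...]
run_steps() {
    for step in "$@"; do
        echo "==> Running $step"
        if ! bash "$step"; then
            echo "Step $step failed, stopping." >&2
            return 1
        fi
    done
    echo "All steps completed."
}
```

It would be called with the step scripts in order, for example `run_steps ./1-prereqs.sh ./2-jenkins.sh` followed by the remaining steps, while each script stays runnable on its own when a single step needs repeating.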

jq and xmlstarlet: none.
Bash: only a little.

I do, however, have close to 20 years of professional experience as a developer and DevOps engineer 🙂, so that helps with figuring out how to automate stuff.

We could make build-iso.sh more universal, as @Crushable1278 suggested. This should be as easy as adding a few variables.

$APT_REPOSITORY (default: http://172.17.17.17) - for people who want to use a non-local repository.
$APT_KEY (default: http://172.17.17.17/apt.gpg.key) - again for a non-local repository; it would also be nice to have an if [ -f "$APT_KEY" ] condition to allow people to use either a local file or a URL.
$CUSTOM_PACKAGES (default: vyos-1x-smoketest) - if someone wants extra packages or doesn't want smoketest.
$RELEASE_NAME (default: current format) - if someone wants their own name.
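As a sketch of how those variables might be wired in, the defaults can be expressed with shell parameter expansion. The `fetch_apt_key` helper is hypothetical, and $RELEASE_NAME is left out because the current name format is not spelled out here.

```shell
# Environment overrides with the defaults listed above.
APT_REPOSITORY="${APT_REPOSITORY:-http://172.17.17.17}"
APT_KEY="${APT_KEY:-http://172.17.17.17/apt.gpg.key}"
# "-" instead of ":-" so an explicitly empty value means "no extra packages".
CUSTOM_PACKAGES="${CUSTOM_PACKAGES-vyos-1x-smoketest}"

# Copy the key when the source is a local file, otherwise treat it as a URL.
# Usage: fetch_apt_key <file-or-url> <destination>
fetch_apt_key() {
    source_key=$1
    destination=$2
    if [ -f "$source_key" ]; then
        cp "$source_key" "$destination"
    else
        wget -O "$destination" "$source_key"
    fi
}
```

Someone without a local mirror could then run `APT_REPOSITORY=https://example.org/vyos APT_KEY=./apt.gpg.key ./build-iso.sh` (placeholder values), and `CUSTOM_PACKAGES= ./build-iso.sh` would build without the smoketest package.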

That should be enough?

Perhaps also include an option to run make test and make testc for the QEMU-KVM smoketest? It's as easy as changing into the vyos-build directory and running make, but you need some apt install dependencies too.

I have created #27, which we can use for suggestions and reviewing.
I hope to get to the comments here later today and add more error handling.

I have gone through files 1-4; the rest will have to wait.

I'll close this ticket, since we have the review PR to work on it instead 🙂