Tribler/py-ipv8

Mutation tests failing

qstokkink opened this issue · 10 comments

It appears our mutation testing machine, zulu-ipv8-mutation-tester (ipv8-mutation-tester IPv8), is now missing a dependency:

00:11:18 + run_all_mutation_tests.py ./py-ipv8 .
00:11:18 /tmp/jenkins8324097251270560127.sh: 3: run_all_mutation_tests.py: not found

"The operation was a success, but the patient died":

12:01:23 java.nio.channels.ClosedChannelException
12:01:23 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
12:01:23 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
12:01:23 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
12:01:23 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
12:01:23 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
12:01:23 	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
12:01:23 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
12:01:23 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
12:01:23 	at java.base/java.lang.Thread.run(Thread.java:840)
12:01:23 Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from <Server Name>/<Server IP>:<Server Port>' is disconnected.
12:01:23 	at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:215)
12:01:23 	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
12:01:23 	at jdk.proxy2/jdk.proxy2.$Proxy123.isAlive(Unknown Source)
12:01:23 	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1212)
12:01:23 	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1204)
12:01:23 	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:195)
12:01:23 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:145)
12:01:23 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
12:01:23 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
12:01:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
12:01:23 	at hudson.model.Build$BuildExecution.build(Build.java:199)
12:01:23 	at hudson.model.Build$BuildExecution.doRun(Build.java:164)
12:01:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
12:01:23 	at hudson.model.Run.execute(Run.java:1895)
12:01:23 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
12:01:23 	at hudson.model.ResourceController.execute(ResourceController.java:101)
12:01:23 	at hudson.model.Executor.run(Executor.java:442)

The build executed correctly for several hours, but the build executor lost connection due to another issue.

Agent restarted. Hopefully this was just a one-time thing 🤞

Not a one-time thing. The builder disconnected again. 😢

Perhaps we need to raise the process priority of the Jenkins agent jar so that it takes precedence over everything else on the machine.

Switched to launching the agent as follows:

nohup bash -c 'java -jar agent.jar etc etc' > test.txt 2>&1 </dev/null &

Hopefully it stays online now. We'll see in a few hours.
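For comparison, a similar detachment can be sketched in Python. This is only an illustration, not what runs on the builder: the helper name launch_detached is hypothetical, and it starts the child in a new session (POSIX only) rather than ignoring SIGHUP the way nohup does.

```python
import subprocess
import sys


def launch_detached(cmd, log_path):
    """Start cmd in its own session, stdin closed, output appended to log_path.

    Hypothetical helper for illustration; POSIX only because of start_new_session.
    """
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # same idea as </dev/null
            stdout=log,                 # same idea as > test.txt
            stderr=subprocess.STDOUT,   # same idea as 2>&1
            start_new_session=True,     # detach from the launching terminal's session
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor


if __name__ == "__main__":
    # Placeholder command; the real agent arguments are not given in this thread.
    proc = launch_detached([sys.executable, "-c", "print('agent placeholder')"], "test.txt")
    print("started pid", proc.pid)
```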

🎉 The builder no longer disconnects. On to the next error:

12:40:54 Done! Minimizing output
12:40:54 Skipping /[...]/index.html, no index.html found!
12:40:54 Traceback (most recent call last):
12:40:54   File "/home/run_all_mutation_tests.py", line 116, in <module>
12:40:54     shutil.copy(os.path.join('/root', 'MutPy', 'mutpy', 'templates', 'include', 'jquery.js'), base_output_dir)
12:40:54   File "/usr/lib/python3.10/shutil.py", line 417, in copy
12:40:54     copyfile(src, dst, follow_symlinks=follow_symlinks)
12:40:54   File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
12:40:54     with open(src, 'rb') as fsrc:
12:40:54 FileNotFoundError: [Errno 2] No such file or directory: '/root/MutPy/mutpy/templates/include/jquery.js'
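
The traceback shows an unconditional shutil.copy of a report asset from a hard-coded path. A minimal sketch of a more forgiving copy, assuming the asset is optional for the report; copy_if_present is a hypothetical helper and not part of run_all_mutation_tests.py:

```python
import os
import shutil


def copy_if_present(src, dst_dir):
    """Copy src into dst_dir if src exists; return the new path, else None.

    Hypothetical helper: it skips a missing asset instead of raising.
    """
    if not os.path.isfile(src):
        print(f"Skipping {src}, file not found!")
        return None
    os.makedirs(dst_dir, exist_ok=True)
    return shutil.copy(src, dst_dir)
```

Whether skipping is acceptable depends on whether the generated HTML still renders without the asset; if it does not, failing loudly as the script does now is the safer behavior.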

Third error fixed. Second error is back: the builder is disconnecting again.

It did stay online while I had an active connection open to the container. Perhaps there is some sort of hibernation mode that triggers.

Based on https://community.jenkins.io/t/how-to-affect-ssh-parameters-on-ssh-agent-like-keep-alive/5954, we should probably try playing with the ~/.ssh/config file. The posted example in the link above is:

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 3

Our disconnecting job takes (just short of) 2 hours. Going on gut feeling alone, setting ServerAliveInterval to 5 minutes (300 seconds) and ServerAliveCountMax to 24 should suffice, since 5 minutes × 24 = 2 hours. I'll try this out once I'm on the (physical) premises again and I have access to the machine.
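
The arithmetic behind those numbers can be checked with a short sketch. It assumes the documented OpenSSH behavior that the client gives up on an unresponsive server after roughly ServerAliveInterval × ServerAliveCountMax seconds; both function names are hypothetical.

```python
def keepalive_tolerance_seconds(interval_s, count_max):
    """Approximate time ssh tolerates unanswered keepalives before disconnecting."""
    return interval_s * count_max


def render_ssh_config(interval_s, count_max, host="*"):
    """Render a ~/.ssh/config stanza in the same shape as the linked example."""
    return (
        f"Host {host}\n"
        f"    ServerAliveInterval {interval_s}\n"
        f"    ServerAliveCountMax {count_max}\n"
    )


if __name__ == "__main__":
    # 5 minutes between keepalives, up to 24 unanswered ones.
    print(keepalive_tolerance_seconds(300, 24) / 3600, "hours")  # 2.0 hours
    print(render_ssh_config(300, 24))
```

Note that this window only bounds how long an unresponsive peer is tolerated. Whether it also prevents the idle disconnects seen here is untested.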

To get a sense of perspective on Jenkins, I looked into GitHub Actions. At the time of writing, the maximum job execution time is 6 hours and a cron build trigger exists. This means it would be theoretically feasible to use GitHub Actions for our nightly build.

That said, we would still have to create the action (☹️), create a proper MutPy fork from my disgusting patches in the secret Tribler/py-ipv8-mutation-libraries repository (☹️), and rework the disgusting patches to be even more disgusting and output something compatible with GitHub job summaries, which use Markdown instead of HTML (😭). In short, two things I don't want to do and one thing I REALLY don't want to do.

Practically speaking, it's probably still best to stick with Jenkins.

I have updated the agent to connect via SSH. Hopefully, it will not disconnect anymore.
Here is a running job: https://jenkins.tribler.org/job/ipv8/job/mutation_test_daily/21/

Seems to be fixed now. Thanks @xoriole!