ipfs-inactive/jenkins

Parallelize aegir scripts

victorb opened this issue · 5 comments

I made a quick test to see whether it's viable to parallelize the different aegir scripts (branch here: https://github.com/ipfs/jenkins-libs/blob/parallize/vars/javascript.groovy). The reason is that ipfs/js-ipfs jobs take ~20 minutes from start to finish. The tests in ipfs/js-ipfs first run the nodejs tests, then browser, and after that webworkers. The experiment was to run the build once for each platform + nodejs version, then reuse that build for each combination of test + os + nodejs version.

Conclusion: doesn't actually speed up builds significantly

  • because we need one job per os + nodejs version + aegir command, we end up with 12 jobs for running tests
  • this seems to slow down the jenkins pipeline because stash/unstash (which we use to pass files around) does not work well with large directories, so each job spends ~2 minutes just transferring files first
  • since tests are not isolated, we can only run one job per worker, so this parallelization saturates the queue. We have 5 workers of each OS, while a parallel build of aegir scripts would require at least 6 of each OS (if we run one job). The queue fills up from just one test run
  • when running 12 jobs at the same time in one stage, reporting back to the master node is delayed, so a job that finishes in 2 minutes is not marked as finished until 5 minutes after it actually finished
  • specific to ipfs/js-ipfs, npm run test:node is still the slowest one and delays the complete report after the pipeline has finished.
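
To make the worker math above concrete, here is a minimal sketch of the job matrix. The OS names and Node.js versions are assumptions, chosen only so that the totals match the numbers above (12 jobs, 6 per OS); the issue does not list them.

```javascript
// Hypothetical matrix: the OS names and Node.js versions are assumptions.
const oses = ['linux', 'windows']
const nodeVersions = ['8', '9']
const commands = ['test:node', 'test:browser', 'test:webworker']
const workersPerOs = 5 // from the issue: 5 workers of each OS

// One job per OS + Node.js version + aegir command.
const jobs = []
for (const os of oses) {
  for (const version of nodeVersions) {
    for (const command of commands) {
      jobs.push({ os, version, command })
    }
  }
}

// Tests are not isolated, so each job needs a whole worker to itself.
const demandPerOs = {}
for (const job of jobs) {
  demandPerOs[job.os] = (demandPerOs[job.os] || 0) + 1
}

console.log(`total jobs: ${jobs.length}`)
for (const os of oses) {
  const queued = Math.max(0, demandPerOs[os] - workersPerOs)
  console.log(`${os}: needs ${demandPerOs[os]} workers, has ${workersPerOs}, ${queued} job(s) queued`)
}
```

With these assumed values, every pipeline run leaves at least one job per OS waiting in the queue, which is the saturation described above.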

I'm going to walk through some details of how tests are currently executed in AEgir, just to make sure no details are missed. A lot of this is probably common knowledge, so sorry if this is riddled with details.

When it comes to parallelization, AEgir does not have a concept of test suites. The only thing it can parallelize over is targets, but this parallelization is currently turned off by a hard-coded concurrent execution limit of 1. Increasing this value wouldn't do anything different from this groovy script, though: it simply allows multiple targets to run concurrently.
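
For readers unfamiliar with the idea, a concurrency limit of this kind can be sketched as below. This is not AEgir's actual code; `runWithLimit` and its parameters are hypothetical names used only for illustration. With a limit of 1 the targets run strictly one after another, which matches the current behaviour described above.

```javascript
// Run async tasks with at most `limit` of them in flight at once.
// `tasks` is an array of functions that each return a promise.
// With limit = 1 this degenerates to serial execution.
async function runWithLimit (tasks, limit) {
  const results = new Array(tasks.length)
  let next = 0

  // Each worker keeps pulling the next unstarted task until none are left.
  async function worker () {
    while (next < tasks.length) {
      const index = next++
      results[index] = await tasks[index]()
    }
  }

  const workers = []
  for (let i = 0; i < Math.min(limit, tasks.length); i++) {
    workers.push(worker())
  }
  await Promise.all(workers)
  return results
}
```

Raising `limit` to 3 would let three targets overlap, but it would not split a single target such as test:node into its suites.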

In this js-ipfs project there are three named test targets, test:node, test:browser, test:webworker.

When running test:node, each of the separate suites of tests defined in package.json (test:node:core, test:node:http, test:node:gateway, test:node:cli) is run serially. AEgir does not make a distinction between these as it's not aware of them.

I ran each suite in parallel with the others (using a simple shell script); each row is a single run.

Test Run   core     http      gateway   cli       Total
1          24.40s   113.33s   3.88s     491.06s   632.67s
2          23.52s   114.36s   5.13s     501.95s   644.96s
3          24.39s   114.07s   6.98s     491.91s   637.35s
4          24.98s   112.91s   5.07s     491.04s   634.00s
5          23.86s   113.06s   5.59s     490.48s   632.99s
6          23.86s   114.57s   5.29s     492.69s   636.41s
7          22.40s   113.37s   8.79s     493.52s   638.08s
8          23.83s   113.34s   9.97s     498.46s   645.60s
9          21.07s   114.43s   5.66s     482.90s   624.06s
10         22.11s   114.49s   4.60s     506.10s   647.30s
Avg        23.44s   113.80s   6.10s     494.01s   637.34s
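
As a rough check on the table above: the Total column is the sum of the four suite times, which is what a serial run would cost. If the suites fully overlapped on one worker, the best possible wall-clock time would be bounded by the slowest suite. A minimal sketch of that arithmetic using the Avg row (it assumes no contention between suites, which real runs will not achieve):

```javascript
// Average duration per suite in seconds, copied from the "Avg" row.
const avgSeconds = { core: 23.44, http: 113.80, gateway: 6.10, cli: 494.01 }

const durations = Object.values(avgSeconds)

// Serial cost: one suite after another.
const serial = durations.reduce((sum, d) => sum + d, 0)

// Best case when all suites fully overlap: the slowest suite decides.
const parallel = Math.max(...durations)

const savedPercent = (1 - parallel / serial) * 100

console.log(`serial:   ${serial.toFixed(2)}s`)
console.log(`parallel: ${parallel.toFixed(2)}s (lower bound)`)
console.log(`saved:    ${savedPercent.toFixed(1)}%`)
```

On these averages the best case saves roughly 22% of the serial time, and everything beyond that is limited by the cli suite.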

The test:node:cli suite takes the longest. This is probably in part because many of its tests run in both online and offline modes, which means the suite runs for ~ 247s in either mode on average (half of the 494.01s cli average above).

So overall, there isn't a huge advantage to breaking these up and running them concurrently on the same worker. The cli tests currently dominate the time.

The rest of this post goes into some depth as to why the cli tests are so slow.

The longest cli tests are the following three, which together take ~ 195s, or almost 40% of the 494.01s cli average:

  • do not crash if Addresses.Swarm is empty (66827ms)
  • should handle SIGINT gracefully (65188ms)
  • should handle SIGTERM gracefully (63033ms)

If we remove these tests, the cli tests run in ~ 299s, or ~ 150s in a single mode on average.

Many of the cli tests (even once the daemon is running) appear to take upwards of 800ms on average. This appears to be mostly due to the start-up time of the cli.

I ran a quick test, and a full run of a command takes ~ 850ms (matching the cli test speed). The time from the start of code execution to process exit averaged ~ 250ms, which means that ~ 600ms is spent just parsing and loading modules.

I was able to measure this by simply wrapping the main require statements of the cli entry point, src/cli/bin.js.

diff --git a/src/cli/bin.js b/src/cli/bin.js
index 1d53444..72a6878 100755
--- a/src/cli/bin.js
+++ b/src/cli/bin.js
@@ -2,11 +2,13 @@

 'use strict'

+const st = (new Date).getTime()
 const yargs = require('yargs')
 const updateNotifier = require('update-notifier')
 const readPkgUp = require('read-pkg-up')
 const utils = require('./utils')
 const print = utils.print
+console.log(((new Date).getTime() - st) / 1000)

 const pkg = readPkgUp.sync({cwd: __dirname}).pkg
 updateNotifier({
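
The same kind of measurement can also be taken per module. The helper below is a hypothetical sketch, not part of js-ipfs; it assumes Node.js 10.7 or later for `process.hrtime.bigint()`, and `timeRequire` is a made-up name.

```javascript
// Measure how long it takes to load a module, in milliseconds.
// Only the first call per module is meaningful: later require() calls
// are served from the module cache and return almost immediately.
function timeRequire (name) {
  const start = process.hrtime.bigint()
  require(name)
  const end = process.hrtime.bigint()
  return Number(end - start) / 1e6 // nanoseconds to milliseconds
}

// Example with a built-in module; in bin.js the interesting candidates
// would be the modules required at the top of the file.
console.log(`os: ${timeRequire('os').toFixed(3)}ms`)
```

Timing modules one at a time like this would show which of the top-level requires contributes most to the load time.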

The test:node:cli tests spawn the cli 201 times. This results in an overhead of ~ 120s for the full test run.
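
The ~ 120s figure follows from the measurements above. A small worked version of the arithmetic, assuming every spawn pays the same ~ 600ms load cost:

```javascript
const fullRunMs = 850   // measured: full run of one cli command
const executionMs = 250 // measured: start of code execution to process exit
const spawns = 201      // cli invocations in test:node:cli

// Time spent parsing and loading modules on each spawn.
const loadMs = fullRunMs - executionMs

// Total load overhead across the whole suite, in seconds.
const overheadSeconds = (spawns * loadMs) / 1000

console.log(`${loadMs}ms per spawn, ${overheadSeconds}s in total`)
```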

Currently working on this, will add a new npm run test:ci script that will run all tests in parallel.
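
A minimal sketch of what such a parallel runner could look like is below. It is not the actual test:ci implementation; `runScript`, `runAll` and `exitCodeFor` are hypothetical names, and it assumes the npm scripts can safely run at the same time on one machine.

```javascript
const { spawn } = require('child_process')

// Run one npm script. Resolves with its exit code instead of rejecting,
// so one failing target does not hide the results of the others.
function runScript (name) {
  return new Promise((resolve) => {
    const child = spawn(`npm run ${name}`, { stdio: 'inherit', shell: true })
    child.on('error', () => resolve({ name, code: 1 }))
    child.on('close', (code) => resolve({ name, code: code === null ? 1 : code }))
  })
}

// Start every script at once and wait for all of them to finish.
// `run` is injectable so the function can be exercised without npm.
function runAll (names, run = runScript) {
  return Promise.all(names.map((name) => run(name)))
}

// Overall exit code: 0 only if every script succeeded.
function exitCodeFor (results) {
  return results.every((result) => result.code === 0) ? 0 : 1
}
```

It could be used as `runAll(['test:node:core', 'test:node:http']).then((results) => { process.exitCode = exitCodeFor(results) })`. Script names are interpolated into a shell command, so they should only come from a trusted, hard-coded list.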

Todo:

  • Make it possible to run test:browser and test:webworker simultaneously, requires fix in aegir to have dynamic ports in Karma, current issue is port collision
  • Make junit test reports have a timestamp or something unique, so we can have many test reports for the same area of tests
  • Add test:ci script to js-ipfs and make sure it's working properly and faster than current stuff
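
For the junit item in the list above, one possible way to keep reports from overwriting each other is to put the target name, a timestamp and the process id into the file name. This is only a sketch; the `junit-<target>-<timestamp>-<pid>.xml` pattern and the `reportPath` helper are assumptions, not an existing aegir convention.

```javascript
const path = require('path')

// Build a report path that is unique per target and per run.
// The file name pattern is an assumption made for this example.
function reportPath (dir, target, now = new Date(), pid = process.pid) {
  // "test:node:cli" -> "test-node-cli", safe on every file system.
  const safeTarget = target.replace(/[^a-z0-9]+/gi, '-')
  // "2018-01-01T12:00:00.000Z" -> "2018-01-01T12-00-00-000Z"
  const stamp = now.toISOString().replace(/[:.]/g, '-')
  return path.join(dir, `junit-${safeTarget}-${stamp}-${pid}.xml`)
}

console.log(reportPath('reports', 'test:node:cli'))
```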

Make it possible to run test:browser and test:webworker simultaneously, requires fix in aegir to have dynamic ports in Karma, current issue is port collision

I don't believe this is an issue with Karma itself; I believe Karma can handle a port that is already in use. When I was looking into some of this parallel work, I found that the ipfsd-ctl server was the issue. Both the browser and webworker tests of aegir use the same browser hooks, which causes two ipfsd-ctl servers to be started.

Aegir should possibly have two hooks, one for the browser and another for webworker. For js-ipfs itself we can either start two different ipfsd-ctl servers, or share a single instance and keep a ref count.
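
The ref-count option could look roughly like the sketch below. `createSharedServer` is a hypothetical helper, and `start` and `stop` are placeholders for whatever actually launches and shuts down the ipfsd-ctl server; none of this is existing aegir or js-ipfs code.

```javascript
// Share one server between several test targets and stop it only
// when the last user has released it.
function createSharedServer (start, stop) {
  let refs = 0
  let starting = null // promise for the running server, or null

  return {
    // Called from each target's "pre" hook.
    async acquire () {
      refs++
      if (!starting) starting = Promise.resolve().then(start)
      return starting
    },
    // Called from each target's "post" hook.
    async release () {
      if (refs === 0) return
      refs--
      if (refs === 0 && starting) {
        // Clear the reference first so a later acquire() starts fresh.
        const pending = starting
        starting = null
        await stop(await pending)
      }
    }
  }
}
```

The browser and webworker hooks would then both call `acquire()` before their tests and `release()` afterwards, and only one server would ever be listening.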

Aegir should possibly have two hooks, one for the browser and another for webworker. For js-ipfs itself we can either start two different ipfsd-ctl servers, or share a single instance and keep a ref count.

Agree, we need to start two ipfsd-ctl servers if we want parallel browser and webworker runs.
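
If two servers are started, each needs its own port. One common way to avoid hard-coding them is to let the OS pick a free port by listening on port 0. The sketch below uses only Node's built-in `net` module; `getFreePort` is a hypothetical helper, and how the chosen port is then handed to ipfsd-ctl is not shown and would need checking against its API.

```javascript
const net = require('net')

// Ask the OS for a currently unused TCP port by listening on port 0.
// There is a small race: another process could take the port between
// close() and the moment the real server binds to it.
function getFreePort () {
  return new Promise((resolve, reject) => {
    const server = net.createServer()
    server.once('error', reject)
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address()
      server.close(() => resolve(port))
    })
  })
}
```

Calling `getFreePort()` once for the browser run and once for the webworker run would give each server a port of its own.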