microsoft/rushstack

[rush] Design: Customize Concurrency By Project (and support device assignment)

elliot-nelson opened this issue · 3 comments

Summary

In a Rush monorepo with a variety of different projects, I would like to be able to customize concurrency by project. For example, I'd like to turn on parallelism for a phased command, but this particular group of 10 projects uses a shared resource during tests, and so I need them to be barred from running in parallel with each other.

In a related requirement, I may have projects that consume a resource out of a limited pool, such as "test device" (i.e. a simulator target or physical hardware device). If I have a pool of 3 devices, and 30 projects to test on them, then I need to not only set a concurrency limit of 3 for those 30 projects, but also make sure I am "assigning" a device to each project to test on.

In either case, other projects not in a limited pool should be able to keep churning away while Rush waits for limited resources.

Details

This WIP design is light on details. But we can lay out some possibilities.

  • In a GitHub Actions workflow, "customized concurrency" is accomplished via the concurrency: setting. Whatever value this setting evaluates to acts, essentially, as a lock. For example, setting concurrency: abcd (a static value) forces all jobs to run sequentially, because they all share the same value; whereas setting concurrency: ${{ matrix.xcode-version }} in the context of a matrix allows multiple jobs to run in parallel as long as they have different "xcode-version" values.

  • We need a way to set such a value for a project -- and more usefully, a range of projects. In particular we would want to support using tags to identify projects in various limited resource groups. Ideally, we could configure this using JSON selector expressions linked to concurrency values.

  • The topic of "assigning devices" (or some other limited resource like license keys, browser PIDs, remote IPs) is thorny and very company-specific. Passing them to the test runner is equally thorny: for example, what if there's a pool of 12 devices and we want to select 3 and pass them to each runner? What if we need logic to determine whether a device is still alive and ready to receive tests? What if we need a way to fail the tests if no devices are available after X minutes? It seems like we need to offload as much of this logic as possible to the monorepo maintainer.

  • If this feature is so advanced that we can't support it just with config files and script hooks, then I think requiring a Rush plugin is reasonable -- the question is how to integrate hooks for such a plugin into the task scheduler so that the maintainer can write one.
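As a rough sketch of the GitHub Actions mechanism described in the first bullet above (the matrix values and script names here are illustrative, not taken from any real workflow):

```yaml
jobs:
  test:
    strategy:
      matrix:
        xcode-version: ["14.3", "15.0"]
    # Jobs sharing a concurrency group value are serialized; jobs with
    # distinct group values are free to run in parallel.
    concurrency:
      group: xcode-${{ matrix.xcode-version }}
    runs-on: macos-latest
    steps:
      - run: ./run-tests.sh
```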

Thanks for considering!

Use Cases

This section highlights concrete use cases we can use to vet potential solutions.

Use Case 1: "Attached Roku Device"

In this example, I've got a Rush monorepo where most of the projects are Roku (BrightScript) projects, but about 25% of the projects are build tooling (TypeScript) projects. The runners that run PR and CI builds are not GitHub runners, but rather self-hosted runners in a lab, where each machine has a dedicated Roku device (the IP is assigned globally on the machine, e.g. ROKU_IP_ADDRESS).

I would like to be able to run rush test at the command line on these runners, and automatically have the TypeScript projects parallelize with each other, but guarantee that only 1 BrightScript test phase will run at a time.

Ideally, I'd like to configure this at run time (not in configuration), because this really is a run-time specific thing. I could imagine a command line like this:

rush test --concurrency-config=./temp-config.json

Where temp-config.json looks like this:

{
  "concurrencyOverrides": [
    {
      "selector": { "tag": "roku" },
      "phases": ["_phase:test"],
      "concurrency": 1
    }
  ]
}

In English: "For this particular run of Rush, whenever you run a 'test' phase for projects tagged 'roku', there can only be 1 running at a time."

Use Case 2: "Multiple Attached Xbox Devices"

This use case is similar to the first one, but now I'm on a Windows self-hosted runner, and each Windows box has some number of Xbox devices attached to it. In fact, I might not know how many -- some have 2, some have 3, etc. (Assume they are provided to me in some env var e.g. XBOX_DEVICES=["192.168.1.113", "192.168.1.172"].)

You could really over-engineer this use case within Rush, but one really simple option would be this configuration file (riffing on the solution above):

{
  "concurrencyOverrides": [
    {
      "selector": { "tag": "xbox" },
      "phases": ["_phase:test"],
      "concurrency": 2,
      "concurrentEnvVar": "XBOX_TEST_INSTANCE"
    }
  ]
}

The only addition here is a specification of an environment variable. Rush will ensure up to 2 projects tagged "xbox" are running "test" phases, and when it runs those phases, it will insert into their environment XBOX_TEST_INSTANCE with the value 0 or 1 (the guarantee here should be that there will never be 2 processes running in parallel with the same XBOX_TEST_INSTANCE value).

In this version of the feature, Rush doesn't know or care that I'm thinking of a pool of devices, but I can build my own wrapper script that selects the appropriate IP address using the XBOX_TEST_INSTANCE environment variable that Rush passes to me.
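As a sketch of such a wrapper script (every environment variable name here comes from this hypothetical design, not from an existing Rush feature):

```typescript
// Hypothetical wrapper invoked by a project's "_phase:test" script.
// XBOX_DEVICES is machine-provided; XBOX_TEST_INSTANCE is the slot index
// that Rush, under this proposal, would guarantee is unique among
// concurrently running "xbox" test phases.

function selectDeviceIp(devicesJson: string, instanceVar: string): string {
  const devices: string[] = JSON.parse(devicesJson);
  const index: number = Number(instanceVar);
  if (!Number.isInteger(index) || index < 0 || index >= devices.length) {
    throw new Error(
      `No device for XBOX_TEST_INSTANCE=${instanceVar} (pool size ${devices.length})`
    );
  }
  return devices[index];
}

// Example using the pool from this use case:
const ip: string = selectDeviceIp('["192.168.1.113", "192.168.1.172"]', '1');
console.log(`Running Xbox tests against ${ip}`); // → 192.168.1.172
```

In real use, the wrapper would read both values from process.env and then launch the test runner with the selected IP.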

Use Case 3: "Device Farm Service"

In this use case, the Rush build is running on a self-hosted node in a device lab. There's various devices available, but they aren't statically configured; instead, there's a device service where you'll send a REST API call for "I need to use an Xbox", and it will return "here's an Xbox IP you can use for 60 minutes" (or: "none are available, and the estimated wait time is X minutes").

In this use case, the process for deciding what kind of device you need, seeing if it's available, and then using it for one of your test phases is a lot more complex. We can't statically configure a concurrency limit because we don't know ahead of time how many devices will be available.

This use case probably has no reasonable approach except a Rush Plugin that can hook into the task scheduler. In this case the plugin would want to atomically try to check out a device; if it got one, run the task and release the device; if it didn't, tell the scheduler it can't start this task right now (or, it won't be able to at all, and to mark the task as failed). Rather than separate "hooks" like "pickTask", "startTask", "stopTask", etc., some kind of wrap-around "runTask" that would allow us to modify the process argv and environment would be ideal.
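A sketch of what such a wrap-around hook might look like (this interface is purely hypothetical -- Rush has no such plugin API today, and the device farm calls are stand-ins for company-specific logic):

```typescript
// A hypothetical wrap-around "runTask" hook. All names are illustrative.
type TaskVerdict = 'completed' | 'retry-later' | 'failed';

interface ITaskContext {
  projectName: string;
  phaseName: string;
  env: Record<string, string>; // the hook may inject variables before the task runs
}

// Stand-ins for the company-specific device farm REST calls.
async function checkOutDevice(): Promise<string | undefined> {
  return '192.168.1.50'; // or undefined when the farm has nothing available
}
async function releaseDevice(_ip: string): Promise<void> {
  /* return the lease to the farm */
}

async function runTaskWithDevice(
  context: ITaskContext,
  innerRun: (context: ITaskContext) => Promise<void>
): Promise<TaskVerdict> {
  const deviceIp = await checkOutDevice();
  if (deviceIp === undefined) {
    return 'retry-later'; // scheduler should run other tasks and come back
  }
  try {
    context.env['DEVICE_IP'] = deviceIp;
    await innerRun(context);
    return 'completed';
  } catch {
    return 'failed';
  } finally {
    await releaseDevice(deviceIp);
  }
}
```

The key design point is that check-out, environment injection, task execution, and release all happen inside one hook invocation, so the plugin never has to coordinate state across separate "pickTask"/"startTask"/"stopTask" callbacks.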

Possible Designs

Design 1: "Special Exit Code"

In an early discussion about this feature years ago, @octogonz proposed that we could give Rush a tiny change: you could configure a special exit code (for example, exit code 37) to mean not "I have failed" but rather "I am not ready yet".

With this feature, you could push all the logic described above into, say, a custom Node script wrapper in your project's _phase:test script. It's the job of your script to decide, somehow, whether it's safe to run, and if it is, to run. If it's not safe to run, you would exit with exit code 37, and Rush would bump you to the back of its queue of runnable tasks and wait some interval (e.g. 10 seconds) before attempting to run you again.
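A minimal sketch of such a wrapper under this design, using an exclusive lockfile as the mutex (the exit code, lockfile path, and overall approach are all illustrative -- this is exactly the kind of hand-rolled locking the CONS below warn about):

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// The hypothetical configured "I am not ready yet" exit code.
const NOT_READY_EXIT_CODE: number = 37;

function tryAcquireLock(lockPath: string): boolean {
  try {
    // Flag 'wx' fails if the file already exists, giving an atomic test-and-set.
    fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
    return true;
  } catch {
    return false;
  }
}

function releaseLock(lockPath: string): void {
  fs.unlinkSync(lockPath);
}

// In real use this path must be shared by every competing test phase
// (one fixed name per device).
const lockPath: string = path.join(os.tmpdir(), 'roku-device.lock');
if (!tryAcquireLock(lockPath)) {
  // Another test phase holds the device: tell Rush to retry us later.
  process.exit(NOT_READY_EXIT_CODE);
}
try {
  // ... run the real device tests here ...
} finally {
  releaseLock(lockPath);
}
```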

PROS:

  • Elegant and minimal from Rush's perspective, no huge design discussions necessary.
  • Relatively easy API -- no "plugins" to build, etc.

CONS:

  • How to lock/mutex, etc. is left entirely to the user, potentially leading to badly designed test scripts.
  • Keeping track of which devices have been assigned, etc. is all up to individual runs of the test scripts to manage, requiring the monorepo maintainer to build some kind of external lock, rely on a series of file system lockfiles, etc.
  • No good way to customize how many times Rush will try or how long it will wait between tries.
  • Once Rush runs out of tasks that are ready, it is just in a loop of starting up a bunch of tasks that return exit code 37 and waiting for one of them to run... what does the console log output look like in these scenarios?

Design 2: "Concurrency Config File"

This design is outlined in some of the Example Use Cases above; essentially Rush would support a new file that describes concurrency options for various projects and phases, that can be provided at each invocation of Rush.
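A minimal sketch (not Rush code) of the semantics a "concurrencyOverrides" entry implies: a counting semaphore that also hands out a stable slot index, which is what the scheduler would need in order to populate something like "concurrentEnvVar":

```typescript
// One bucket per concurrency override; acquire() resolves with a slot index
// in [0, concurrency), and no two in-flight tasks ever share a slot.
class ConcurrencyBucket {
  private readonly freeSlots: number[];
  private readonly waiters: Array<(slot: number) => void> = [];

  public constructor(concurrency: number) {
    this.freeSlots = Array.from({ length: concurrency }, (_, i) => i);
  }

  public async acquire(): Promise<number> {
    const slot = this.freeSlots.pop();
    if (slot !== undefined) {
      return slot;
    }
    // No free slot: park until release() hands us one.
    return new Promise<number>((resolve) => this.waiters.push(resolve));
  }

  public release(slot: number): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(slot); // hand the slot directly to the next waiting task
    } else {
      this.freeSlots.push(slot);
    }
  }
}
```

The scheduler could acquire a slot from the matching bucket before starting a phase, export it as (say) XBOX_TEST_INSTANCE, and release it when the phase exits; all other projects keep scheduling normally while tasks wait in a bucket.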

Standard questions

Please answer these questions to help us investigate your issue more quickly:

Question: Would you consider contributing a PR?
Answer: Yes

Could you add some concrete examples of how this feature might be used, along with hypothetical config snippets? Rush managing everything sounds great in theory, but the problem of lab resources probably has lots of little details that will be different for every scenario, so I question whether it is really possible to do better than a bunch of custom scripts for each setup.

@octogonz Good idea, I have added a Use Cases section with my two use cases, and tried to make them as concrete as possible.

The Rush Cobuilds feature introduced the same concept of "try to get a lock, and if you fail, defer the Operation", so that logic is something we should consolidate and generalize. As a stress-test edge case, I could easily imagine a build graph where each individual Operation is in several separate concurrency buckets (at a minimum there will be 2: the global one and the specific one).

This is a good opportunity for Rush to converge onto @rushstack/operation-graph, since this requires scheduler rework, and that will allow Heft to benefit from the same logic. Since the work queue in @rushstack/operation-graph already supports asynchronous scheduling, it is less complex to extend with resource locking.

Edit: as an added bonus, introducing the capability in @rushstack/operation-graph grants the same feature to Heft, in case it needs it.

My main concern with this feature is that it introduces the possibility of deadlocking into the scheduler, which it never previously had to deal with.