flux-framework/flux-sched

sharness based scheduler performance study

grondo opened this issue · 12 comments

I was able to create a reproducer of some kind of fluxion performance problem via the following sharness test, which creates 16K fake resources split into 3 different sets via properties, drains 1000 of them, then submits various sets of mock execution jobs to them.

In my testing, Fluxion appears to take a long time to initialize from the configured fake resources. Then the first set of jobs gets 0.4 job/s throughput. After that, performance seems to improve. This may be nothing, or I may have made some kind of setup mistake, but I thought I'd share just in case.

I dropped this test into t/t8888-scheduling-test.t (results below)

#!/bin/sh
#

test_description='scheduler performance study with a large set of fake resources
'

. `dirname $0`/sharness.sh

export TEST_UNDER_FLUX_QUORUM=1
export TEST_UNDER_FLUX_START_MODE=leader
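# rpc TOPIC: send an RPC to TOPIC and print the raw response payload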
rpc() {
  flux python -c \
    "import flux, json; print(flux.Flux().rpc(\"$1\").get_str())"
}

test_under_flux 16384 system

test_expect_success 'unload sched-simple' '
	flux module remove -f sched-simple
'

test_expect_success 'update configuration' '
	flux config load <<-'EOF'
	[[resource.config]]
	hosts = "fake[0-16383]"
	cores = "0-63"
	gpus = "0-3"

	[[resource.config]]
	hosts = "fake[0-9999]"
	properties = ["compute"]

	[[resource.config]]
	hosts = "fake[10000-16000]"
	properties = ["test"]
	
	[[resource.config]]
	hosts = "fake[16001-16383]"
	properties = ["debug"]

	[sched-fluxion-qmanager]
	queue-policy = "easy"

	[sched-fluxion-resource]
	match-policy = "firstnodex"
	prune-filters = "ALL:core,ALL:gpu,cluster:node,rack:node"
	match-format = "rv1_nosched"
	EOF
'

test_expect_success 'reload resource with monitor-force-up' '
	flux module reload -f resource noverify monitor-force-up
'
test_expect_success 'drain a few nodes' '
	flux resource drain 1-1000 test with drained nodes
'
test_expect_success 'load fluxion modules' '
	flux module load sched-fluxion-resource &&
	flux module load sched-fluxion-qmanager
'
test_expect_success 'wait for fluxion to be ready' '
	time rpc sched.resource-status 2>/dev/null
'
test_expect_success 'create a set of 100 inactive jobs' '
	flux submit --cc=1-100 --quiet --wait \
		--requires="compute" \
		--progress --jps \
		--setattr=exec.test.run_duration=0.01s \
		hostname
'
test_expect_success 'create a set of 64 running jobs' '
	flux submit --progress --jps --quiet --cc=1-64 --wait-event=start -N1 \
		--requires=compute \
		--setattr=exec.test.run_duration=5m hostname
'
test_expect_success 'submit a job that requires a drained node' '
	flux submit --quiet --wait-event=depend -N10 \
		--requires="host:fake[10-19] and compute" \
		hostname
'
test_expect_success 'run a simple job' '
	flux run -vvv --setattr=exec.test.run_duration=1s -N16 hostname
'
test_expect_success 'get match stats' '
	rpc sched-fluxion-resource.stat | jq
'
test_expect_success 'unload fluxion' '
	flux module remove sched-fluxion-qmanager &&
	flux module remove sched-fluxion-resource &&
	flux module load sched-simple
'
test_done
$ ./t8888-scheduling-test.t -d -v
sharness: loading extensions from /g/g0/grondo/git/flux-sched/t/sharness.d/flux-sharness.sh
sharness: loading extensions from /g/g0/grondo/git/flux-sched/t/sharness.d/sched-sharness.sh
sharness: loading extensions from /g/g0/grondo/git/flux-sched/t/sharness.d/flux-sharness.sh
sharness: loading extensions from /g/g0/grondo/git/flux-sched/t/sharness.d/sched-sharness.sh
expecting success: 
	flux module remove -f sched-simple

ok 1 - unload sched-simple

expecting success: 
	flux config load <<-EOF
	[[resource.config]]
	hosts = "fake[0-16383]"
	cores = "0-63"
	gpus = "0-3"

	[[resource.config]]
	hosts = "fake[0-9999]"
	properties = ["compute"]

	[[resource.config]]
	hosts = "fake[10000-16000]"
	properties = ["test"]
	
	[[resource.config]]
	hosts = "fake[16001-16383]"
	properties = ["debug"]

	[sched-fluxion-qmanager]
	queue-policy = "easy"

	[sched-fluxion-resource]
	match-policy = "firstnodex"
	prune-filters = "ALL:core,ALL:gpu,cluster:node,rack:node"
	match-format = "rv1_nosched"
	EOF

ok 2 - update configuration

expecting success: 
	flux module reload -f resource noverify monitor-force-up

ok 3 - reload resource with monitor-force-up

expecting success: 
	flux resource drain 1-1000 test with drained nodes

ok 4 - drain a few nodes

expecting success: 
	flux module load sched-fluxion-resource &&
	flux module load sched-fluxion-qmanager

ok 5 - load fluxion modules

expecting success: 
	time rpc sched.resource-status 2>/dev/null

{"all":{"version":1,"execution":{"R_lite":[{"rank":"0-16383","children":{"core":"0-63","gpu":"0-3"}}],"nodelist":["fake[0-16383]"],"properties":{"compute":"0-9999","debug":"16001-16383","test":"10000-16000"},"starttime":0,"expiration":0}},"down":{"version":1,"execution":{"R_lite":[{"rank":"1-1000","children":{"core":"0-63","gpu":"0-3"}}],"nodelist":["fake[1-1000]"],"properties":{"compute":"1-1000"},"starttime":0,"expiration":0}},"allocated":null}

real	2m22.924s
user	0m0.142s
sys	0m0.027s
ok 6 - wait for fluxion to be ready

expecting success: 
	flux submit --cc=1-100 --quiet --wait \
		--requires="compute" \
		--progress --jps \
		--setattr=exec.test.run_duration=0.01s \
		hostname

PD:0   R:0   CD:100 F:0   │███████████████████████████████████│100.0%  0.4 job/s
ok 7 - create a set of 100 inactive jobs

expecting success: 
	flux submit --progress --jps --quiet --cc=1-64 --wait-event=start -N1 \
		--requires=compute \
		--setattr=exec.test.run_duration=5m hostname

PD:0   R:64  CD:0   F:0   │███████████████████████████████████│100.0%  6.1 job/s
ok 8 - create a set of 64 running jobs

expecting success: 
	flux submit --quiet --wait-event=depend -N10 \
		--requires="host:fake[10-19] and compute" \
		hostname

ok 9 - submit a job that requires a drained node

expecting success: 
	flux run -vvv --setattr=exec.test.run_duration=1s -N16 hostname

jobid: f3zuRJpBy
0.000s: job.submit {"userid":6885,"urgency":16,"flags":0,"version":1}
0.012s: job.validate
0.023s: job.depend
0.023s: job.priority {"priority":16}
1.590s: job.alloc
1.590s: job.prolog-start {"description":"cray-pals-port-distributor"}
1.591s: job.cray_port_distribution {"ports":[11999,11998],"random_integer":1935304580792628789}
1.591s: job.prolog-finish {"description":"cray-pals-port-distributor","status":0}
1.593s: job.start
1.592s: exec.init
1.593s: exec.starting
2.594s: exec.complete {"status":0}
2.594s: exec.done
flux-job: No job output found
2.594s: job.finish {"status":0}
ok 10 - run a simple job

expecting success: 
	rpc sched-fluxion-resource.stat | jq

{
  "V": 1130497,
  "E": 2260992,
  "by_rank": {
    "[0-16383]": 69
  },
  "load-time": 44.955565206000003,
  "njobs": 165,
  "min-match": 0.098841462000000005,
  "max-match": 2.72908858,
  "avg-match": 1.4591527333454548
}
ok 11 - get match stats

expecting success: 
	flux module remove sched-fluxion-qmanager &&
	flux module remove sched-fluxion-resource &&
	flux module load sched-simple

ok 12 - unload fluxion

# passed all 12 test(s)
1..12
trws commented

Decided to revisit this today. Running this exact test on my laptop with 0.33.1 (the release that was current when this issue was opened), I got 2.1 jobs/s for the inactive jobs and 24.9 jobs/s for the running jobs. The current master (0.36.1-24) gets 10/s and 115/s respectively.

I'm perplexed by the first set being that slow, to be honest. If I swap --wait for --flags=waitable plus flux job wait -a, the submission rate jumps up to 466/s, but it then takes ~10 seconds to actually retire the 100 jobs. The match times are trivial (the max is 0.099s), and I would expect at least 40/s, if not more like 100/s, to actually get scheduled, like we see with the second set. So I wonder why they take so long to move through the rest of the system.

Either way, short of possibly digging into this further, probably good to close this?

If I swap --wait for --flags=waitable plus flux job wait -a, the submission rate jumps up to 466/s, but it then takes ~10 seconds to actually retire the 100 jobs

This is because, without --wait, flux submit reports only the rate of submission, not the total throughput; it doesn't include the actual scheduling of the jobs.
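
For example, the two measurements differ roughly like this (a minimal sketch reusing the same flags that appear elsewhere in this thread):

# --wait blocks until every job completes, so the --progress/--jps rate
# reflects end-to-end throughput through scheduling and execution.
flux submit --cc=1-100 --quiet --wait --progress --jps \
	--requires="compute" --setattr=exec.test.run_duration=0.01s \
	hostname

# --flags=waitable returns as soon as the jobs are ingested, so the reported
# rate covers submission only; retiring the jobs is timed separately.
flux submit --cc=1-100 --quiet --flags=waitable --progress --jps \
	--requires="compute" --setattr=exec.test.run_duration=0.01s \
	hostname &&
	time flux job wait -a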

trws commented

Right, understood. The confusion is why we only get 10 jobs/s to the free state when we get 100 jobs/s to the started state. The --wait-event=start one goes 100 jobs/s, so clearly they get scheduled at that rate, and normally that's the limiting factor if there are enough resources (which there are, 9000 logical nodes' worth). So why do we only get 10% of that through the rest of the system?

Ah, I missed that you were using --wait-event=start. This does seem really strange. I'll try to take a look when I'm next at a computer. We may be able to get a bit more info if we look at the eventlogs of some of the jobs.
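
For example, something like the following would show where the time goes between the alloc, start, and clean events for each job (just a sketch; it assumes flux job last and the flux jobs {id} format field are available in the installed flux-core):

# eventlog of the most recently submitted job
flux job eventlog $(flux job last)

# or scan the scheduling/execution milestones across all jobs, including inactive ones
for id in $(flux jobs -a -n -o "{id}"); do
	flux job eventlog $id | grep -E "alloc|start|finish|clean"
done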

trws commented

@grondo did you ever have a chance to poke at this? I'm still a bit curious what's holding things back here but forgot about it for a bit.

No, sorry this dropped off my radar 😔

I poked at this real quick and couldn't reproduce the results. If I add --wait-event=start to flux submit, then the observed throughput goes down to 0.4 job/s and flux job wait -a returns immediately:

expecting success: 
	flux submit --cc=1-100 --quiet --flags=waitable \
		--wait-event=start \
		--requires="compute" \
		--progress --jps \
		--setattr=exec.test.run_duration=0.01s \
		hostname &&
	time flux job wait -a

PD:0   R:100 CD:0   F:0   │███████████████████████████████████│100.0%  0.4 job/s

real	0m0.114s
user	0m0.068s
sys	0m0.035s

Without --wait-event=start, I see 140+ job/s, but flux job wait -a takes more than 4 minutes to return:

expecting success: 
	flux submit --cc=1-100 --quiet --flags=waitable \
		--requires="compute" \
		--progress --jps \
		--setattr=exec.test.run_duration=0.01s \
		hostname &&
	time flux job wait -a

Submitting 100 jobs: │███████████████████████████████████████│100.0% 140.2 job/s

real	4m5.613s
user	0m0.082s
sys	0m0.031s

Any idea what I'm doing wrong? This is with current flux-sched master.

I thought to gather the stats at the end of the test. The average match time is ~1.85s, which accounts for the observed throughput: 100 matches at ~1.85s each works out to roughly 185s, in line with the ~3 minutes flux job wait -a takes below.

expecting success: 
	flux submit --cc=1-100 --quiet --flags=waitable \
		--requires="compute" \
		--progress --jps \
		--setattr=exec.test.run_duration=0.01s \
		hostname &&
	time flux job wait -a &&
	flux module stats sched-fluxion-resource

Submitting 100 jobs: │███████████████████████████████████████│100.0% 167.2 job/s

real	3m6.121s
user	0m0.070s
sys	0m0.010s
{
 "V": 1654785,
 "E": 1654784,
 "by_rank": {
  "[0-16383]": 101
 },
 "load-time": 70.579723353999995,
 "graph-uptime": 213,
 "time-since-reset": 283,
 "match": {
  "succeeded": {
   "njobs": 100,
   "njobs-reset": 100,
   "max-match-jobid": 1722365771776,
   "max-match-iters": 1,
   "stats": {
    "min": 1.775564986,
    "max": 1.9923653589999999,
    "avg": 1.8548949831600001,
    "variance": 0.0011645350766247222
   }
  },
  "failed": {
   "njobs": 0,
   "njobs-reset": 0,
   "max-match-jobid": 0,
   "max-match-iters": 0,
   "stats": {
    "min": 0.0,
    "max": 0.0,
    "avg": 0.0,
    "variance": 0.0
   }
  }
 }
}
ok 7 - create a set of 100 inactive jobs
trws commented

I have an idea why this might be happening. Normally first is quick because it considers resources in order, sorted by most available components under them first. I'm guessing that we consider down nodes to have all their resources available, so even though we quickly skip them, we consider them when we shouldn't. A quick fix for this might be to mark all the resources unavailable when a resource is marked down, or to add up/down as part of the key for that ordered map. It shouldn't normally make a huge difference, but treating the resources as "allocated" until the end of time in the planner might let us simplify some of the logic too. @milroy, you know the planners a whole lot better than I do; can you think of a reason that would cause problems?

trws commented

Found it: there was an extra traversal we don't need, and with a bit of adjustment this test now runs at a more expected rate, actually much better than I expected:

expecting success:
	flux submit --cc=1-100 --quiet --flags=waitable \
		--wait-event=start \
		--requires="compute" \
		--progress --jps \
		--setattr=exec.test.run_duration=0.01s \
		hostname &&
	flux job wait -a

PD:0   R:100 CD:0   F:0   │█████████████████████████████████████│100.0% 135.3 job/s
ok 7 - create a set of 100 inactive jobs


real	0m0.838s
user	0m0.303s
sys	0m0.032s
expecting success:
	flux submit --progress --jps --quiet --cc=1-64 --wait-event=start -N1 \
		--requires=compute \
		--setattr=exec.test.run_duration=5m hostname

PD:0   R:64  CD:0   F:0   │█████████████████████████████████████│100.0% 206.2 job/s
ok 8 - create a set of 64 running jobs

I'm still cleaning up the PR, and something about this triggered a bug in the progress bar in flux-core, so there's a fix coming for that too.

Oh, nice! Thanks for that!