strange behaviour in a benchmark on mammouth
Closed issue · 1 comment
So I just ran a benchmark on mammouth using PSOM_TEST_SLEEP.
Each job lasts 3 sec plus a random delay of less than 1 sec. There were 100 chains of 4 jobs, run with 100 workers. Since each chain runs its 4 jobs one after the other and all 100 chains can run at once, the ideal parallel completion time should have been between 12 and 16 sec. Instead it took several minutes. I have attached plots of the number of jobs running as a function of time. The manager (and garbage collector) seemed unable to detect job completion, even though news_feed was being updated appropriately, at least on the head node. At this stage I am left scratching my head as to why the workers' news_feed is not being read properly. I'll debug more.
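For reference, the 12 to 16 sec estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch, not part of PSOM: the function name and parameters are made up for illustration, it ignores all scheduling and file-system overhead, and it assumes that when there are more chains than workers the chains simply run in successive full waves.

```python
import math


def ideal_makespan_bounds(jobs_per_chain, base_sec, max_jitter_sec,
                          n_chains, n_workers):
    """Best-case and worst-case ideal completion time, in seconds.

    Hypothetical helper: jobs in a chain run sequentially, chains run in
    parallel, and overhead is assumed to be zero.
    """
    # Number of rounds needed if each worker handles one chain at a time.
    waves = math.ceil(n_chains / n_workers)
    lower = waves * jobs_per_chain * base_sec
    upper = waves * jobs_per_chain * (base_sec + max_jitter_sec)
    return lower, upper


# Benchmark from this issue: 100 chains of 4 jobs, 3 sec + <1 sec each,
# 100 workers.
print(ideal_makespan_bounds(4, 3, 1, 100, 100))
```

Anything far beyond the upper bound points at overhead in job submission or completion detection rather than at the jobs themselves.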
OK, so I have redesigned the manager/worker communication mechanism. The workers now generate a .failed or .finished tag file for each job. The manager tests for the existence of these files for all running jobs. The garbage collector deletes the tags when it collects the profiles and logs. This new system behaves better. I am attaching two runs, one on mammouth and one on guillimin. The performance is still bad on mammouth, but at least I no longer get errors and the system is pretty stable. This benchmark really is on the fringe of the target use cases of the library. I'll consider this closed.
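To illustrate the tag-file idea described above, here is a minimal Python sketch. It is not the actual PSOM implementation: the function names, the flat tag directory, and the `<job>.finished` / `<job>.failed` naming are assumptions made for the example.

```python
import os


def poll_running_jobs(tag_dir, running):
    """Return {job_name: status} for every running job that has a tag file.

    Hypothetical manager-side check: a worker is assumed to create
    '<job>.finished' or '<job>.failed' in tag_dir when a job ends.
    """
    done = {}
    for job in sorted(running):
        for status in ("finished", "failed"):
            if os.path.exists(os.path.join(tag_dir, f"{job}.{status}")):
                done[job] = status
                break
    return done


def collect_tags(tag_dir, done):
    """Hypothetical garbage-collector step: delete tags already processed."""
    for job, status in done.items():
        os.remove(os.path.join(tag_dir, f"{job}.{status}"))
```

The appeal of this design is that the manager only asks whether a file exists, which is a single cheap test per running job, instead of parsing a feed file that a worker may be appending to at the same time.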