SIMEXP/psom

strange behaviour in a benchmark on mammouth

Closed · 1 comment

So I just ran a benchmark on mammouth using PSOM_TEST_SLEEP.

Each job lasts 3 sec plus a random delay of less than 1 sec. The pipeline has 100 chains of 4 jobs and ran with 100 workers. Since every chain can have its own worker, the ideal parallel completion time is between 12 and 16 sec (4 jobs in sequence, 3 to 4 sec each). Instead it took several minutes. I have attached a plot of the number of jobs running as a function of time.

The manager (and garbage collector) seemed unable to detect job completion, even though news_feed was being updated correctly, at least as seen from the head node. At this stage I do not understand why the workers' news_feed is not being read properly. I'll debug more.
[Attached plot: nb_jobs_running_test_sleep_100x4_max100_mam]
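
For reference, the 12 to 16 sec estimate can be reproduced with a few lines of Python. This is only a sketch: the function name and parameters are made up for illustration, and it assumes at least one worker per chain and zero scheduling or file-system overhead, which is exactly what the benchmark shows not to hold on mammouth.

```python
def ideal_completion_bounds(n_chains, jobs_per_chain, n_workers,
                            base_sec, max_jitter_sec):
    """Best-case wall-clock bounds for independent chains of sequential jobs.

    Assumes (hypothetically) zero manager/worker overhead and one worker
    per chain, so every chain runs in parallel from start to end.
    """
    if n_workers < n_chains:
        raise ValueError("this sketch only covers n_workers >= n_chains")
    # Jobs within a chain run one after the other, so durations add up.
    lower = jobs_per_chain * base_sec                      # no jitter at all
    upper = jobs_per_chain * (base_sec + max_jitter_sec)   # maximal jitter
    return lower, upper


# Benchmark from this issue: 100 chains of 4 jobs, 100 workers,
# 3 sec per job plus up to 1 sec of random delay.
low, high = ideal_completion_bounds(100, 4, 100, 3.0, 1.0)
print(low, high)
```

Anything much above the upper bound is time spent by the manager and workers on communication rather than on the jobs themselves.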

OK, so I have redesigned the manager/worker communication mechanism. The workers now generate a .failed or .finished tag file for each job. The manager tests for the existence of these files for all running jobs, and the garbage collector deletes the tags when it collects the profiles and logs. This new system behaves better. I am attaching two runs, one on mammouth and one on guillimin. Performance is still bad on mammouth, but at least I no longer get errors and the system is pretty stable. This benchmark really is on the fringe of the target use cases of the library, so I'll consider this closed.
[Attached plot: nb_jobs_running_test_sleep_100x4_max100_gui]
[Attached plot: nb_jobs_running_test_sleep_100x4_max100_mam2]
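
To make the tag-file mechanism from the comment above concrete, here is a minimal Python sketch of the idea. It is not PSOM's code: the function names, the single flat tag directory, and the `<job>.finished` / `<job>.failed` naming are assumptions made for illustration. Only the general scheme (workers write tags, the manager tests for them, the garbage collector deletes them) comes from the description above.

```python
from pathlib import Path
import tempfile


def write_tag(tag_dir, job, success):
    """Worker side: drop an empty tag file once a job has ended."""
    suffix = ".finished" if success else ".failed"
    (Path(tag_dir) / (job + suffix)).touch()


def poll_running(tag_dir, running):
    """Manager side: test tag existence for every job still marked running."""
    status = {}
    for job in running:
        if (Path(tag_dir) / (job + ".failed")).exists():
            status[job] = "failed"
        elif (Path(tag_dir) / (job + ".finished")).exists():
            status[job] = "finished"
        # No tag yet: the job is still running and is left out of the result.
    return status


def collect_tags(tag_dir, jobs):
    """Garbage collector side: delete the tags of jobs already processed."""
    for job in jobs:
        for suffix in (".finished", ".failed"):
            # missing_ok requires Python 3.8 or later.
            (Path(tag_dir) / (job + suffix)).unlink(missing_ok=True)


def demo():
    """Run one polling round in a throwaway directory (hypothetical job names)."""
    with tempfile.TemporaryDirectory() as tag_dir:
        write_tag(tag_dir, "chain1_job1", success=True)
        write_tag(tag_dir, "chain2_job1", success=False)
        running = ["chain1_job1", "chain2_job1", "chain3_job1"]
        status = poll_running(tag_dir, running)
        collect_tags(tag_dir, status)
        leftover = sorted(p.name for p in Path(tag_dir).iterdir())
        return status, leftover
```

The appeal of this design is that the manager only needs a file-existence test per running job instead of parsing a shared feed file, which should depend less on how quickly appended content becomes visible across nodes of a network file system.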