Model training issue with the makespan-optimizing reward function
Opened this issue · 14 comments
Hi Hongzi,
I noticed your code supports the makespan-optimized policy by setting args.learn_obj to 'makespan'. However, when trained with the recommended small-scale setting (200 streaming jobs, 8 agents) for 3000 episodes, the model doesn't seem to converge the way it normally does with the average-JCT objective. The following figures show the actor_loss and average_reward_per_second collected during training. The average_reward_per_second is always around -1, because the reward equals the negative makespan (i.e., the same total time it is divided by). Could you suggest a setting I may have missed to ensure convergence?
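A toy illustration of why that ratio sits at -1 (hypothetical numbers, not the repo's code): if each step's reward is the negative time elapsed since the previous step, the summed reward is the negative total elapsed time, so dividing by the total time always gives about -1 regardless of the policy.

```python
# Hypothetical per-step durations (seconds) between scheduling events.
step_durations = [2.0, 5.5, 1.2, 3.3]

total_time = sum(step_durations)                  # 12.0 s of simulated time
total_reward = sum(-d for d in step_durations)    # negative elapsed time

print(total_reward / total_time)                  # -1.0, independent of policy
```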
OK, we need to debug this - it's been a while since I trained with the makespan reward. The reward calculation is here:
decima-sim/spark_env/reward_calculator.py, lines 32 to 34 (commit c010dd7)
The learning curve you show is helpful - it shows the agent doesn't get any learning signal: the actor loss is essentially 0 (on the order of 1e-11). It's likely that the reward the agent gets is all zero or constant. Somewhere the reward assignment to the actions is off.
I will try to squeeze some time to run the code myself too - but could you run it and print out the reward to start debugging? Thanks!
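A minimal logging helper one could use for this; the call site and variable names below are assumptions based on the discussion, not the repo's actual code.

```python
# Hypothetical helper to drop into spark_env/reward_calculator.py (or wherever
# the per-step reward is produced) so the rewards can be inspected offline.
def log_reward(curr_time, reward, path='reward_debug.log'):
    """Append one (wall-clock time, reward) pair per scheduling step."""
    with open(path, 'a') as f:
        f.write('{:.6f}\t{:.6f}\n'.format(curr_time, reward))

# Example call at the point where the makespan reward is computed
# (argument names are guesses):
# log_reward(self.env.wall_time.curr_time, reward)
```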
Hi there,
I have tried a couple of times with careful settings on the args; however, the problem persists. I believe the reward (shown in line 33 above) is computed over a fixed time interval, i.e., from the last scheduling step to the current one. The long-term return is then the sum of those intervals, which is really just the time of the final scheduling step. That is not the makespan of all jobs, since some may still be running after the last scheduling decision. My guess is that the reward does not reflect the actual makespan at all. So the question seems to be: what should a proper reward function that reflects the makespan metric look like?
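To illustrate the concern with a toy example (hypothetical numbers, not the repo's code): summing the per-interval rewards telescopes to the time of the last scheduling event, which can be earlier than the true makespan.

```python
# Hypothetical scheduling-event timestamps (seconds); the last job is assumed
# to finish at t = 15.0, after the final scheduling decision at t = 12.0.
event_times = [0.0, 3.0, 7.5, 12.0]
true_makespan = 15.0

# per-step reward = negative elapsed time since the previous scheduling event
rewards = [-(t - prev) for prev, t in zip(event_times, event_times[1:])]

undiscounted_return = sum(rewards)   # telescopes to -(12.0 - 0.0)
print(undiscounted_return)           # -12.0, not -15.0 == -true_makespan
```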
For makespan, it only makes sense to run a fixed batch of jobs (i.e., no new job arrivals). In your settings, did you set --num_stream_* to 0 and only use --num_init_dags?
Hi Hongzi,
That might be the problem, as I was not aware of how streaming jobs in the system affect the makespan. I actually kept the number of streaming jobs at 200 for each episode. I will quickly change it and see the result.
Hi Zhang!
It seems that you have set up the environment successfully. May I ask which software versions (e.g., TensorFlow and Python versions) you used to set it up? I tried it but found some libraries were missing. Thanks in advance!
Hi there,
Setting up the environment requires nothing more than cloning the whole repository. For reference, my TensorFlow version is 1.13 and my Python version is 3.6.
Hi Hongzi,
Over the past few days I retrained the model with the suggested settings, i.e., num_init_dags > 0 and num_stream_dags = 0. The exact command is as follows.
nohup python3 train.py --exec_cap 25 --num_init_dags 100 --learn_obj 'makespan' --num_stream_dags 0 --reset_prob 5e-7 --reset_prob_min 5e-8 --reset_prob_decay 4e-10 --diff_reward_enabled 1 --num_agents 4 --model_save_interval 100 --num_ep 3005 --model_folder ./models/batch_100_job_diff_reward_reset_5e-7_5e-8_makespan_ep3000/ > out.log 2>&1 &
However, the average reward collected by the agent is still -1 during training. I suspect the function (lines 33-34) used by the reward calculator may just give a constant signal over time. Any suggestions?
We may have to print the reward values and examine them. Start from the bare minimum: try num_init_dags = 1 and num_stream_dags = 0, and log all the reward values for the actions taken to finish this single job. Could you check whether the reward you get corresponds to this job's completion time (see the sketch below)? After checking this simple scenario, we can move on to two jobs, and then multiple jobs. Based on what you showed, there may be a bug in the current code for the makespan reward. Thanks!
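A rough sketch of that sanity check, assuming the per-step rewards have been dumped to a file (e.g., with the logging helper sketched earlier; the file name and format are hypothetical):

```python
# Read (wall-clock time, reward) pairs logged during one episode with
# num_init_dags = 1 and num_stream_dags = 0, then compare the return with
# the single job's completion time.
times, rewards = [], []
with open('reward_debug.log') as f:
    for line in f:
        t, r = line.split()
        times.append(float(t))
        rewards.append(float(r))

episode_return = sum(rewards)
completion_time = times[-1] - times[0]   # single job: last event minus start

print('episode return   :', episode_return)
print('-completion time :', -completion_time)
# For the makespan objective these should roughly match; a return of ~0 or a
# constant value would point to the reward assignment as the culprit.
```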
Hi there, I have a question regarding the number of agents. What is the reason for having multiple agents, e.g., args.num_agents = 16 by default?
When the program halts, is there an error message?
Multiple agents are just for speeding up the training. Parallel agents (threads on CPUs) generate experience concurrently. You can set args.num_agents based on the number of CPUs you have on your machine.
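For example, one simple way to pick the value on a given machine (just a convenience sketch, not something the repo requires):

```python
# Choose the number of parallel agents from the machine's CPU count and pass
# it to train.py as --num_agents.
import multiprocessing

num_agents = multiprocessing.cpu_count()
print('--num_agents', num_agents)
```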
Hi Hongzi, thanks! There is no error message at all, only some warnings (related to some Python library functions) that do not seem critical. Since I am running the CPU version, I suspect training simply takes so long that the program appears to have stopped. Could you share a bit about the training time you observed before?
@zhangsj0608 @Nannnnnn Hi, would you mind sharing the code you used to plot those figures? I need help plotting the figures as they appear in the Decima paper; I have not been able to generate any figures so far. Could you please share that plotting code? Thank you.