hongzimao/deeprm

Modifications on Reward Signal


Following the recommendation to post an e-mail conversation (adapted) on the issues page so that others can also learn from it and discuss. This concerns studies of how inaccuracies in a job's runtime could affect the RL agent and the overall scheduler performance:

I'm very interested in investigating the ability of the reinforcement learning agent to perform well under different job models, in particular, at first, jobs with uncertain length, i.e., experimenting with some of the partial-observability discussion in section 5 of the paper. Because a set of conditional observation probabilities would be needed to cast the problem as a POMDP, I thought of a preliminary methodology that, at first, randomly chooses the reward as either the original one, which uses the original job length, or a modified one, which uses the original job length + 1 in its calculations. With this I plan to test the robustness of the RL agent to some uncertainty in the job length.
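For concreteness, a minimal sketch of this preliminary scheme (the names are illustrative and not part of the deeprm code) could look like the following, assuming the paper's slowdown-style reward in which each job in the system contributes -1/T_j per timestep:

```python
import numpy as np

# Illustrative sketch only (not the actual deeprm code): with probability
# p_perturb the reward term for a job is computed with its true length + 1
# instead of the true length, mimicking an over-estimated runtime.

def perturbed_length(true_len, p_perturb=0.5, rng=np.random):
    """Return either the true job length or the true length + 1."""
    return true_len + 1 if rng.rand() < p_perturb else true_len

def reward_for_jobs(job_lengths, p_perturb=0.5, rng=np.random):
    """Slowdown-style reward: each job contributes -1 / (possibly perturbed length)."""
    return sum(-1.0 / perturbed_length(l, p_perturb, rng) for l in job_lengths)
```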
I saw that you were very helpful in answering questions in the GitHub issues section (reading your answers helped me a lot in understanding the code), so I decided to write this e-mail to ask whether, if you have the time, you could point out any immediate methodological flaws you see in my approach. I really appreciate any thoughts you can provide.
Sincerely,
Vinícius [...]

Hi Vinicius,

I see what you are trying to do. The high-level goal of training a robust agent makes pretty decent sense. I wonder if consistently adding +1 to the reward will create enough disturbance. You might want to perturb the reward signal with noise sampled from some distribution (which can have some bias, as in your +1 case). You can vary the distribution and see how it affects the system.

It would be nice if you could post this on the GitHub issues page so that others can also learn from it.

Thanks,
Hongzi

Since then I've had some very interesting results creating disturbance in the reward using normal distributions, i.e., computing the reward with the job length replaced by a noisy estimate drawn from a normal distribution centred on the true length. My intention is to also check uniform and half-normal distributions, since it's known that user runtime estimates are almost always overestimates. However, some very interesting concerns and issues are appearing:

  1. Depending on the distribution and its parameters, the performance could be heavily influenced by the workload model; e.g., N(1, 1²) represents a much higher uncertainty in percentage terms than N(15, 1²). (Some carefully chosen methods for introducing estimation errors can be found in Section II.A of this paper.)
  2. The original workload has a job-length distribution in which many jobs have duration 1 (80% of the jobs have a duration chosen uniformly between 1t and 3t); carelessly setting up the disturbances could make the estimated runtime zero or negative (see the sketch below, which floors the estimate at 1 timestep).
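
A minimal sketch of these noise models (illustrative names only, not the actual deeprm code), drawing the runtime estimate from a normal, uniform, or half-normal disturbance around the true length and flooring the result at 1 timestep so it can never become non-positive:

```python
import numpy as np

# Illustrative sketch only: draw a runtime estimate around the true length
# using one of the discussed noise models, then floor it at 1 timestep so
# it can never be zero or negative (concern 2 above).

def noisy_length(true_len, kind="normal", mu=0.0, sigma=1.0, rng=np.random):
    if kind == "normal":        # estimate ~ N(true_len + mu, sigma^2); mu adds bias
        noise = rng.normal(mu, sigma)
    elif kind == "uniform":     # symmetric uniform noise of width 2*sigma around mu
        noise = mu + rng.uniform(-sigma, sigma)
    elif kind == "halfnormal":  # one-sided noise, i.e., runtimes are only over-estimated
        noise = abs(rng.normal(0.0, sigma))
    else:
        raise ValueError("unknown noise kind: %s" % kind)
    return max(1, int(round(true_len + noise)))  # floor at 1 timestep

# Note on concern 1: the same sigma is a ~100% relative error for a job of
# length 1 but only ~7% for a job of length 15, so results depend strongly
# on the workload's job-length distribution.
```

One way to keep the relative uncertainty comparable across the workload would be to scale sigma with the job length (e.g., sigma proportional to true_len) instead of using a fixed absolute value.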

Thanks for posting your exploration! Nice work!