alex-petrenko/sample-factory

Evaluation during training


Hi @alex-petrenko!

Thank you so much for the great project!

I am wondering whether there is a plan to add an evaluator component to the framework that can periodically run the policy in other environments. For example, to test the generalization capabilities of a model, or to use it in a continual or curriculum learning setting, it is necessary to periodically evaluate the policy on environments it is not actively being trained on.

The enjoy.py script is very useful, but it is rather inconvenient and costly to store numerous checkpoints throughout training and then run the evaluation separately on each of them afterward. It would be very handy to have evaluation incorporated into the training run, with the results aggregated under a single run.

I found an empty default_evaluator.py, which might have been created for that very purpose a while back.

I reckon the appropriate way is to integrate the evaluator into connect_components() of the Runner so it can pick up a signal from the learner_worker. For instance, after every n policy updates, the evaluator could obtain the most recent policy, run it in a set of environments, and report the results back to some other component. Perhaps you could give some high-level pointers on how to implement this properly, so that it follows the architecture and design paradigms of the project and avoids a hacky solution.

Cheers,
Tristan Tomilin

Hi Tristan!

Great question! Your intuition is pretty much on point!

I suppose the most straightforward way to implement the evaluator would be to add an "AlgoObserver". There's an example in train.py:

    runner = runner_cls(cfg)

    # observers get callbacks from the runner during training; PBT is one example
    if cfg.with_pbt:
        runner.register_observer(PopulationBasedTraining(cfg, runner))

    return cfg, runner

If you can formulate your evaluator as an algo observer (e.g. just copy the most recent checkpoint and spawn an evaluation process once every N training iterations), that'd be the easiest way to go.
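
Something like this could work as a starting point. This is a minimal, untested sketch: I'm assuming the on_training_step hook as defined on the AlgoObserver base class in runner.py (double-check the exact hook names and signatures in your version), and eval_policy.py is a stand-in for your own enjoy-style evaluation script:

    import subprocess
    import sys

    from sample_factory.algo.runners.runner import AlgoObserver, Runner


    class PeriodicEvaluator(AlgoObserver):
        def __init__(self, cfg, eval_every: int = 1000):
            self.cfg = cfg
            self.eval_every = eval_every
            self.proc = None  # handle to the most recent evaluation subprocess

        def on_training_step(self, runner: Runner, training_iteration_since_resume: int) -> None:
            if training_iteration_since_resume % self.eval_every != 0:
                return
            # don't stack evaluations if the previous one is still running
            if self.proc is not None and self.proc.poll() is None:
                return
            # fire off a separate evaluation process; it can load the latest
            # checkpoint from train_dir and log results under the same experiment
            self.proc = subprocess.Popen([
                sys.executable, "eval_policy.py",
                "--experiment", self.cfg.experiment,
                "--train_dir", self.cfg.train_dir,
            ])

Then register it the same way as PBT in the snippet above, e.g. runner.register_observer(PeriodicEvaluator(cfg)).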

If you need something a bit more sophisticated, e.g. a continuously running parallel process that provides some kind of feedback to the main process, you might indeed want to consider a combination of EventLoopProcess and EventLoopObject. Basically, you spawn a process, create an evaluator object that lives on this process's event loop, and connect some signals and slots so you can exchange messages between this process and the runner process. There's a bit of a learning curve to this, but it's totally doable!
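
A rough skeleton of that route might look like this. The signal/slot primitives are used the same way as elsewhere in sample-factory, but I'm writing this from memory, so verify the constructor signatures against the signal_slot library; EvaluatorWorker and the signal names are made up for the example, so connect them to whatever the learner actually emits:

    from signal_slot.signal_slot import EventLoopObject, EventLoopProcess, signal


    class EvaluatorWorker(EventLoopObject):
        @signal
        def eval_report(self): ...  # emitted with evaluation results for the runner

        def on_new_checkpoint(self, policy_id, checkpoint_path):
            # slot: load the checkpoint, roll out the policy in the eval envs,
            # then send the aggregated results back to the runner process
            results = dict(policy_id=policy_id, checkpoint=checkpoint_path)
            self.eval_report.emit(results)


    # wiring, e.g. alongside Runner.connect_components():
    proc = EventLoopProcess(unique_process_name="evaluator_proc")
    evaluator = EvaluatorWorker(proc.event_loop, "evaluator")
    # hypothetical connections: whatever signal fires on checkpoint saves
    # goes to the evaluator, and its reports go back to the runner
    # learner.saved_model.connect(evaluator.on_new_checkpoint)
    # evaluator.eval_report.connect(runner.on_eval_report)
    proc.start()
    # ... and on shutdown:
    proc.stop()
    proc.join()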

Thanks a lot for the suggestions! The AlgoObserver indeed seems suitable for what I was after.