web-arena-x/webarena

Could I do multi-thread evaluation?

Opened this issue · 4 comments

To speed up evaluation, I would like to evaluate, say, 64 examples in parallel using multiple threads. Would this affect the correctness of the evaluation? Thanks a lot!

That may affect the results. We deliberately designed the order of the examples so that earlier examples won't affect later ones, and running them in parallel no longer follows that order.

This is the script for 4 parallel runs.
You can also reset the environment more frequently to avoid the inter-example influence.

Thanks a lot for the reply!

  1. As I understand it, if the environment is reset, the evaluation of each example is correct. So I could set up two AWS instances and evaluate, say, examples 1-406 on instance 1 and examples 407-812 on instance 2. Would such an evaluation be correct?
  2. Errors sometimes happen partway through a run. For example, if the evaluation of the 10th example fails, could I just continue with the 11th example, without re-evaluating the first 10 examples and without resetting the environments?

Your kind suggestions are highly appreciated!
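
For reference, the split described in question 1 can be written down as a small helper. This is only a sketch: `split_into_shards` is a hypothetical name and is not part of the WebArena codebase. It assumes the examples are numbered contiguously and that each instance starts from a freshly reset environment. It does not settle whether the split changes the results.

```python
def split_into_shards(first_id, last_id, num_shards):
    """Split the inclusive ID range [first_id, last_id] into contiguous shards.

    Hypothetical helper for illustration; each shard keeps the original
    example order, and earlier shards absorb any remainder.
    """
    total = last_id - first_id + 1
    base, extra = divmod(total, num_shards)
    shards = []
    start = first_id
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more shards than examples: skip empty shards
        shards.append((start, start + size - 1))
        start += size
    return shards


# Two instances over examples 1-812, as in the question above.
print(split_into_shards(1, 812, 2))  # [(1, 406), (407, 812)]
```

Each tuple is an inclusive (start, end) range that one instance would evaluate in order.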

Hello! Do you mind elaborating on how later tasks depend on earlier ones? Is there any way to launch separate sites for each new task we're evaluating, so that we can run multiple agents at the same time? How often should the environment be reset? Thanks for your help :)

Hello, do you have any advice on how to set up multiple Docker containers for the same website? For example, we could set up 10 shopping websites on different ports so that we can evaluate in parallel. Thank you!
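
To make the idea in the last question concrete, here is a rough sketch that only builds the `docker run` command strings for several copies of one site, each published on its own host port. The image name, container names, and port numbers are placeholders, not values confirmed in this thread, and `docker_run_commands` is a hypothetical helper. Each copy would probably also need its own base URL configured inside the site, which this sketch does not cover.

```python
import shlex


def docker_run_commands(image, base_name, base_port, container_port, count):
    """Build one `docker run` command per site copy, each on its own host port.

    All arguments are placeholders supplied by the caller; nothing here
    is taken from the WebArena setup scripts.
    """
    commands = []
    for i in range(count):
        host_port = base_port + i
        args = [
            "docker", "run",
            "--name", f"{base_name}_{i}",             # unique container name
            "-p", f"{host_port}:{container_port}",    # host port -> container port
            "-d",                                     # run detached
            image,
        ]
        commands.append(shlex.join(args))
    return commands


# Example: 10 copies of a placeholder shopping image on host ports 7770-7779.
for cmd in docker_run_commands("shopping_image", "shopping", 7770, 80, 10):
    print(cmd)
```

The commands are printed rather than executed, so they can be reviewed before anything is started.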