microsoft/WindowsAgentArena

Running the base benchmark tasks on local: How to run specific tasks/category

asc0216 opened this issue · 5 comments

I followed the readme to run the tasks using

./run-local.sh --start-client true

I see no change in the terminal for at least ~1 hour:
[screenshot of the terminal]

Questions:

  • what is the expected output when running the benchmark?
  • how can I run a subset of the tasks?

Could you share the content of the ps_script_log.txt? You can find the log file in the Setup folder on the desktop of the running VM. You can access the VM at localhost:8006 if using a browser or localhost:3390 if using RDP. More information on how to troubleshoot the preparation step can be found here.

When run locally, the benchmark logs or prints a success score to the terminal after each task: 1 if successful, 0 if not, or a value between 0 and 1 if the task relies on a similarity measure. It also reports an overall success rate after each group of tasks or after all tasks are done. The scores/rewards are appended to a list as the benchmark loops through the tasks, so it should be straightforward to save them out if you need them for your own purposes.
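As a rough illustration of the scoring described above, here is a minimal sketch of how per-task scores could be averaged and saved. It does not reproduce the benchmark's own code: the function names `summarize` and `save_scores`, the output file name, and the category names are all assumptions made for the example.

```python
import json
from pathlib import Path
from statistics import mean


def summarize(scores_by_category):
    """Return (per-category success rate, overall success rate).

    `scores_by_category` maps a category name to a list of task scores,
    each a number between 0 and 1. Empty categories are skipped.
    """
    per_category = {
        category: mean(scores)
        for category, scores in scores_by_category.items()
        if scores
    }
    all_scores = [s for scores in scores_by_category.values() for s in scores]
    overall = mean(all_scores) if all_scores else 0.0
    return per_category, overall


def save_scores(scores_by_category, path):
    """Write the raw scores and their summary to a JSON file (hypothetical layout)."""
    per_category, overall = summarize(scores_by_category)
    payload = {
        "scores": scores_by_category,
        "per_category": per_category,
        "overall": overall,
    }
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


if __name__ == "__main__":
    # Hypothetical categories and scores, for illustration only.
    example = {"chrome": [1, 0], "notepad": [0.5]}
    print(summarize(example))
```

If you collect the scores yourself, calling `save_scores(scores, "scores.json")` after the run would keep both the raw values and the averages in one file.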

To run a subset of the tasks, create a new JSON file similar to those under /win-arena-container/client/evaluation_examples_windows. Its keys should be the task categories/applications, and each value should be the list of task IDs to run for that key. For instance, you could copy test_all.json and trim it down to the programs and IDs of the tasks you want.
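The copy-and-trim step can also be scripted. The sketch below assumes only the layout described above (category names as keys, lists of task IDs as values). The helper names `select_tasks` and `write_subset`, the output file name, and the categories and IDs in the demo are hypothetical.

```python
import json
from pathlib import Path


def select_tasks(all_tasks, wanted):
    """Return only the categories and task IDs named in `wanted`.

    Both arguments map a category/application name to a list of task IDs.
    Unknown categories or IDs raise an error so that typos fail loudly.
    """
    subset = {}
    for category, ids in wanted.items():
        if category not in all_tasks:
            raise KeyError(f"unknown category: {category}")
        available = set(all_tasks[category])
        missing = [task_id for task_id in ids if task_id not in available]
        if missing:
            raise ValueError(f"unknown task IDs in {category}: {missing}")
        subset[category] = list(ids)
    return subset


def write_subset(source_path, target_path, wanted):
    """Read the full task file, filter it, and write the subset as JSON."""
    all_tasks = json.loads(Path(source_path).read_text(encoding="utf-8"))
    subset = select_tasks(all_tasks, wanted)
    Path(target_path).write_text(json.dumps(subset, indent=2), encoding="utf-8")
    return subset


if __name__ == "__main__":
    # Hypothetical categories and task IDs, for illustration only.
    all_tasks = {"chrome": ["id-1", "id-2"], "notepad": ["id-3"]}
    print(select_tasks(all_tasks, {"chrome": ["id-2"]}))
```

For a real run you would call something like `write_subset("test_all.json", "test_subset.json", {...})` from inside the evaluation_examples_windows folder, using IDs copied from test_all.json.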

The ps_script_log.txt is shown below. I see failures when installing LibreOffice and pip:
[four screenshots of ps_script_log.txt]

This should have been addressed in #24. To pick up the fix:

  1. Pull the latest changes from the main branch.
  2. Run docker pull windowsarena/winarena:latest.
  3. Remove any existing content in src/win-arena-container/vm/storage.
  4. Run ./run-local.sh --prepare-image true again.

Closing for now. Let me know if you're still experiencing any issues at the preparation step.