llm_judge_{task}.py
: the script that runs inference. I call it llm_judge because the evaluation prompt follows the LLM-as-a-judge setup (except for MMLU), but technically it can run any inference task.

run*.sh
: examples of the actual commands used to call the Python scripts