[AGE-291] [Bug] Code Evaluation is not working
shhhhhh2001 opened this issue · 3 comments
Describe the bug
The default code evaluation function in the "automatic evaluation" group is not working properly.
To Reproduce
Steps to reproduce the behavior:
- Randomly create a variant in the playground.
- Go to 'Automatic Evaluation' -- 'Evaluators'
- Click 'New Evaluator' -- 'Code Evaluation' -- 'Save'
- Go to 'Automatic Evaluation' -- 'Results'
- Click 'New Evaluation'
- Choose a random dataset and the previously created variant, and select the 'code evaluation' evaluator
- Click 'Create'
- See error
Expected behavior
A float score should be shown in the corresponding column on the same page.
Desktop (please complete the following information):
- OS: Win10
- Browser: Firefox
- Version: 126.0.1
Important Context
The 'code evaluation' function worked properly last week, but it does not work today...
The issue with the 'code evaluation' function in the 'automatic evaluation' group is due to the direct_use attribute being set to False. To fix this, you need to change the direct_use attribute to True for the 'Code Evaluation' evaluator. Here is the corrected code snippet:
{
    "name": "Code Evaluation",
    "key": "auto_custom_code_run",
    "direct_use": True,  # Change this to True
    "settings_template": {
        "code": {
            "label": "Evaluation Code",
            "type": "code",
            "default": "from typing import Dict\n\ndef evaluate(\n    app_params: Dict[str, str],\n    inputs: Dict[str, str],\n    output: str,\n    correct_answer: str\n) -> float:\n    # ...\n    return 0.75  # Replace with your calculated score",
            "description": "Code for evaluating submissions",
            "required": True,
        }
    },
    "description": "Code Evaluation allows you to write your own evaluator in Python. You need to provide the Python code for the evaluator.",
}
Additionally, ensure that the evaluation scenarios are correctly created and updated in the backend. The agenta-backend service provides functions to handle evaluation scenarios, such as create_evaluation_scenario and update_human_evaluation_scenario. Make sure these functions are properly integrated and invoked in your workflow to manage evaluation scenarios effectively [1][2].
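For orientation only, here is a minimal sketch of wiring a scenario-creation call into a workflow. The create_evaluation_scenario function below is a hypothetical stand-in: the real function lives in the agenta-backend service, and its signature and payload shape may differ.

import asyncio
from typing import Any, Dict

# Hypothetical stand-in for the agenta-backend service function; the real
# create_evaluation_scenario may take different arguments.
async def create_evaluation_scenario(payload: Dict[str, Any]) -> None:
    print("created scenario:", payload)

async def run_one_scenario(evaluation_id: str, inputs: Dict[str, Any], output: str) -> None:
    # Assumed payload shape, for illustration only.
    await create_evaluation_scenario(
        {
            "evaluation_id": evaluation_id,
            "inputs": inputs,
            "outputs": [output],
        }
    )

asyncio.run(run_one_scenario("eval-123", {"question": "2+2?"}, "4"))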
Moreover, ensure that the frontend components are correctly fetching and displaying the evaluation results. The fetchEvaluations function in AutomaticEvaluationResult.tsx is responsible for loading evaluations and their results. Verify that this function is correctly mapping and transforming the evaluation data, and that the custom_code_eval_id is properly set and used in the evaluation process [3].
The error message "execute_code_script() got an unexpected keyword argument 'data_point'" suggests that there might be a mismatch between the arguments being passed and the parameters accepted by the execute_code_script function. Ensure that the function definition matches the arguments being provided. The auto_custom_code_run function should handle errors using a try-except block, as shown below:
from typing import Any, Dict

# Result, Error, and sandbox come from the agenta backend's evaluation code
# (their imports are omitted in this snippet).
def auto_custom_code_run(
    inputs: Dict[str, Any],
    output: str,
    data_point: Dict[str, Any],
    app_params: Dict[str, Any],
    settings_values: Dict[str, Any],
    lm_providers_keys: Dict[str, Any],  # pylint: disable=unused-argument
) -> Result:
    try:
        result = sandbox.execute_code_safely(
            app_params=app_params,
            inputs=inputs,
            output=output,
            data_point=data_point,
            code=settings_values["code"],
        )
        return Result(type="number", value=result)
    except Exception as e:  # pylint: disable=broad-except
        return Result(
            type="error",
            value=None,
            error=Error(
                message="Error during Auto Custom Code Evaluation", stacktrace=str(e)
            ),
        )
Ensure that the execute_code_script function is defined to accept the data_point argument [4].
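To make the mismatch concrete, here is a minimal, self-contained sketch. The execute_code_script function below is a hypothetical stand-in (the real one lives in the agenta backend and its signature may differ); it only illustrates that the caller's keyword arguments must match the callee's parameters, otherwise Python raises exactly this TypeError.

from typing import Any, Dict

# Hypothetical stand-in for the backend's script-execution helper.
def execute_code_script(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    data_point: Dict[str, Any],  # without this parameter, passing data_point=... raises TypeError
    code: str,
) -> float:
    scope: Dict[str, Any] = {}
    exec(code, scope)  # run the user-supplied evaluator code
    # Assumes the user code defines evaluate(app_params, inputs, output, correct_answer).
    return scope["evaluate"](app_params, inputs, output, data_point.get("correct_answer", ""))

# Caller side: the keyword arguments here must match the parameters above.
score = execute_code_script(
    app_params={},
    inputs={"question": "2+2?"},
    output="4",
    data_point={"correct_answer": "4"},
    code="def evaluate(app_params, inputs, output, correct_answer):\n    return 1.0 if output == correct_answer else 0.0",
)
print(score)  # 1.0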
After checking your code, I think the variable data_point shouldn't be passed to sandbox.execute_code_safely, since that function doesn't have a parameter called data_point.
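To illustrate what I mean with a stand-in (not the real agenta function, just a signature without data_point), calling it with that extra keyword reproduces the reported error:

from typing import Any, Dict

# Stand-in mirroring a signature that has no data_point parameter.
def execute_code_safely(app_params: Dict[str, Any], inputs: Dict[str, Any], output: str, code: str) -> float:
    return 0.75

# Uncommenting data_point=... below raises:
# TypeError: execute_code_safely() got an unexpected keyword argument 'data_point'
result = execute_code_safely(
    app_params={},
    inputs={},
    output="",
    # data_point={},
    code="...",
)
print(result)  # 0.75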
Thank you @shhhhhh2001, I am looking into this now.