Agenta-AI/agenta

[AGE-291] [Bug] Code Evaluation is not working

shhhhhh2001 opened this issue · 3 comments

Describe the bug
The default code evaluation function in the "automatic evaluation" group is not working properly.

To Reproduce
Steps to reproduce the behavior:

  1. Randomly create a variant in the playground.
  2. Go to 'Automatic Evaluation' -- 'Evaluators'
  3. Click 'New Evaluator' -- 'Code Evaluation' -- 'Save'
  4. Go to 'Automatic Evaluation' -- 'Results'
  5. Click 'New Evaluation'
  6. Choose a random dataset and the previously created variant, and select the 'code evaluation' evaluator
  7. Click 'Create'
  8. See error

Expected behavior
A float score should appear in the corresponding column on the same page.

Screenshots
(two screenshots of the error attached)

Desktop (please complete the following information):

  • OS: Win10
  • Browser: Firefox
  • Version: 126.0.1

Important Context
The 'code evaluation' function worked properly last week, but not today...

AGE-291

The issue with the 'code evaluation' function in the 'automatic evaluation' group is due to the direct_use attribute being set to False. To fix this, you need to change the direct_use attribute to True for the 'Code Evaluation' evaluator. Here is the corrected code snippet:

{
    "name": "Code Evaluation",
    "key": "auto_custom_code_run",
    "direct_use": True,  # Change this to True
    "settings_template": {
        "code": {
            "label": "Evaluation Code",
            "type": "code",
            "default": "from typing import Dict\n\ndef evaluate(\n    app_params: Dict[str, str],\n    inputs: Dict[str, str],\n    output: str,\n    correct_answer: str\n) -> float:\n    # ...\n    return 0.75  # Replace with your calculated score",
            "description": "Code for evaluating submissions",
            "required": True,
        }
    },
    "description": "Code Evaluation allows you to write your own evaluator in Python. You need to provide the Python code for the evaluator.",
}
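For reference, a minimal custom evaluator matching the signature of the default template above might look like the following. The exact-match scoring is only an illustration; any logic that returns a float works:

```python
from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str,
) -> float:
    # Example scoring: 1.0 for an exact (whitespace-insensitive) match, else 0.0
    return 1.0 if output.strip() == correct_answer.strip() else 0.0
```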

Additionally, ensure that the evaluation scenarios are correctly created and updated in the backend. The agenta-backend service provides functions to handle evaluation scenarios, such as create_evaluation_scenario and update_human_evaluation_scenario. Make sure these functions are properly integrated and invoked in your workflow to manage evaluation scenarios effectively [1][2].

Moreover, ensure that the frontend components are correctly fetching and displaying the evaluation results. The fetchEvaluations function in AutomaticEvaluationResult.tsx is responsible for loading evaluations and their results. Verify that this function is correctly mapping and transforming the evaluation data, and that the custom_code_eval_id is properly set and used in the evaluation process [3].

The error message "execute_code_script() got an unexpected keyword argument 'data_point'" suggests that there might be a mismatch in the arguments being passed to the execute_code_script function. Ensure that the function definition matches the arguments being provided. The auto_custom_code_run function should handle errors using a try-except block, as shown below:

from typing import Any, Dict

# Result, Error, and sandbox are provided by the agenta-backend evaluators module
def auto_custom_code_run(
    inputs: Dict[str, Any],
    output: str,
    data_point: Dict[str, Any],
    app_params: Dict[str, Any],
    settings_values: Dict[str, Any],
    lm_providers_keys: Dict[str, Any],  # pylint: disable=unused-argument
) -> Result:
    try:
        result = sandbox.execute_code_safely(
            app_params=app_params,
            inputs=inputs,
            output=output,
            data_point=data_point,
            code=settings_values["code"],
        )
        return Result(type="number", value=result)
    except Exception as e:  # pylint: disable=broad-except
        return Result(
            type="error",
            value=None,
            error=Error(
                message="Error during Auto Custom Code Evaluation", stacktrace=str(e)
            ),
        )

Ensure that the execute_code_script function is defined to accept the data_point argument [4].
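One defensive pattern (a sketch with a stand-in function, not the actual backend code) is to have the sandbox helper accept and ignore unknown keywords, so an extra `data_point` no longer raises a `TypeError`:

```python
from typing import Any, Dict

def execute_code_script(
    app_params: Dict[str, Any],
    inputs: Dict[str, Any],
    output: str,
    code: str,
    **extra: Any,  # tolerate extra keywords such as data_point
) -> float:
    # Stand-in body: the real helper would run `code` in a sandbox
    # and return the computed score.
    return 0.75

# Passing data_point no longer raises TypeError:
score = execute_code_script(app_params={}, inputs={}, output="", code="", data_point={})
print(score)  # 0.75
```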


After checking your code, I think the variable "data_point" shouldn't be passed to "sandbox.execute_code_safely", as that function doesn't have a parameter called "data_point".

Thank you @shhhhhh2001, I am looking into this now.