Research Prompt Comparison Tool

This tool compares the effectiveness of a modified prompt against the original prompt used in a research paper. It counts the correct answers generated by each prompt on a given task, allowing researchers to quantitatively assess the impact of their modifications.

Blog post: Improving GPT-4’s Visual Reasoning with Prompting

Features

  • Prompt Evaluation: Compares the original research paper prompt with a modified version.
  • Correct Answer Counting: Counts the number of correct answers generated by each prompt.
  • JSONL Support: Reads and processes results stored in JSONL format.

How It Works

The tool reads JSONL files that contain the results of running both the original and the modified prompt. Each entry in these files records whether the response was correct. The tool counts and compares the number of correct responses for each prompt.
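For example, a single entry might look like the line below. Only the "correct" key matters to the counting script; the other field names are illustrative, not the actual schema:

    {"question_id": 17, "model_answer": "B", "correct": "true"}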

Usage

  1. Ensure you have Python installed on your system.
  2. Place your JSONL result files where the scripts expect them. By default, these are named modified_prompt_results.jsonl and paper_prompt_results.jsonl.
  3. Add your OpenAI API key to a .env file as OPENAI_API_KEY (see the example after this list).
  4. Run the script using Python:
python vision_test.py
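The .env file from step 3 needs only one line; replace the placeholder value with your own key:

    OPENAI_API_KEY=your-api-key-here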

Metrics

The metrics.py script counts the number of correct answers from results stored in JSONL files. It processes two files: modified_prompt_results.jsonl and paper_prompt_results.jsonl. Each line in these files is a result in JSON format; the script checks the "correct" key of each entry and counts the entry as correct when the value is "true".

Here's a brief overview of its functionality:

  1. Import Required Modules: It imports the json module for parsing JSON data.
  2. Define count_correct_answers Function: This function takes a file path as an argument, reads the file line by line, parses each line as JSON, and counts the occurrences where the "correct" key has the value "true".
  3. Set File Paths: It sets the paths to the JSONL files to be processed.
  4. Count Correct Answers: It calls count_correct_answers for each file and stores the results.
  5. Print Results: Finally, it prints the number of correct answers found in each file.

This makes it easy to evaluate the effectiveness of different prompts by comparing the number of correct responses generated by each. Run the script with:
python metrics.py
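For reference, the behavior described above corresponds roughly to the sketch below. The exact implementation in metrics.py may differ; accepting a boolean True in addition to the string "true" is an added assumption.

    import json

    def count_correct_answers(file_path):
        """Count entries in a JSONL file whose "correct" field indicates a correct answer."""
        correct = 0
        with open(file_path, "r") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                result = json.loads(line)
                # The value is described as the string "true";
                # a boolean True is also accepted here as a precaution.
                if result.get("correct") in ("true", True):
                    correct += 1
        return correct

    modified_path = "modified_prompt_results.jsonl"
    paper_path = "paper_prompt_results.jsonl"

    print("Modified prompt correct answers:", count_correct_answers(modified_path))
    print("Paper prompt correct answers:", count_correct_answers(paper_path))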

Output

The script prints the count of correct answers for both the modified and original prompts, allowing for a direct comparison.

Requirements

  • Python 3.x
  • JSONL files with the results of the prompt evaluations