In this study, we investigated the effects of self-reflection in large language models (LLMs) on problem-solving performance. We first instructed nine popular LLMs to answer a series of multiple-choice questions to establish a performance baseline. Next, for each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Using this guidance, each self-reflecting agent then attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection. In addition, we compared the various types of self-reflection to determine their individual contributions to performance.
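As a rough illustration of the procedure described above (baseline answer, reflection on mistakes, re-answer with guidance), the loop might look like the minimal sketch below. It is not the repository's actual code: the function `solve_with_reflection`, the `ask` callable (standing in for any LLM call), the question dictionary keys, and the prompt wording are all hypothetical, and the answer-redaction step is omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple


def solve_with_reflection(
    questions: List[Dict[str, str]],
    ask: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (baseline accuracy, accuracy after one round of self-reflection).

    `questions` holds dicts with hypothetical keys "prompt" and "answer";
    `ask` is any function that sends a prompt to a model and returns its reply.
    """
    if not questions:
        raise ValueError("at least one question is required")

    baseline_right = 0
    final_right = 0
    for q in questions:
        # Step 1: baseline attempt.
        first = ask(q["prompt"]).strip()
        if first == q["answer"]:
            baseline_right += 1
            final_right += 1
            continue

        # Step 2: reflect on the mistake, given the correct answer.
        reflection = ask(
            f"{q['prompt']}\n"
            f"Your answer {first} was wrong; the correct answer is {q['answer']}.\n"
            "Explain the mistake and give general advice, without restating the answer."
        )

        # Step 3: re-answer the same question using the reflection as guidance.
        second = ask(f"{q['prompt']}\nAdvice from an earlier attempt:\n{reflection}").strip()
        if second == q["answer"]:
            final_right += 1

    total = len(questions)
    return baseline_right / total, final_right / total
```

Passing the model call in as `ask` keeps the sketch independent of any particular LLM provider and makes it easy to exercise with a stub.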
- Solve with Baseline - answers all questions using the baseline agent
- Reflect on Solution - self-reflects on incorrectly answered problems given the correct answer
- Save Reflections - separates reflections by type, redacts answers, and saves reflection text
- Solve with Reflection - re-answers all incorrectly answered questions using the reflections
- Plot Accuracy - plots the accuracy for each agent
- Plot Accuracy by Model and Agent - plots the accuracy by model and agent
- Plot Accuracy by Exam and Agent - plots the accuracy for each model by exam and agent
- Analyze Details - performs the McNemar test and creates a table of the results
- Analyze Keywords - analyzes the error keywords produced by the self-reflections
- Details - the low-level details for each question answered in CSV format
- Dialogs - the dialog for each question answered in JSON format
- Exams - the exams containing MCQA problems in JSONL format
- Logs - the log files for each question answered and self-reflection in plain-text format
- Plots - the data visualizations of the results in PDF format
- Reflections - the text generated during the self-reflection process stored as plain-text files
- Results - the results from the experiment in CSV format
- Tables - the tabular results of the analysis stored as CSV files
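
For readers unfamiliar with the McNemar test mentioned under Analyze Details, the sketch below shows one way to compute an exact two-sided McNemar p-value from paired per-question outcomes. It is an illustration only and assumes nothing about how the repository implements the test; the function name `mcnemar_exact` and its boolean-list inputs are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    baseline_correct: Sequence[bool],
    reflection_correct: Sequence[bool],
) -> Tuple[int, int, float]:
    """Exact (binomial) McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p_value), where b counts questions only the baseline got
    right and c counts questions only the reflecting agent got right.
    """
    if len(baseline_correct) != len(reflection_correct):
        raise ValueError("both outcome lists must cover the same questions")

    pairs = list(zip(baseline_correct, reflection_correct))
    b = sum(1 for base, refl in pairs if base and not refl)
    c = sum(1 for base, refl in pairs if not base and refl)

    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0

    # Under the null hypothesis, each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A small p-value suggests that the two agents differ in accuracy on the same set of questions by more than chance would explain.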