This repository contains code and data for the paper: Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD). CRAFT-MD is a multi-agent conversational framework consisting of doctor-AI, patient-AI, and grader-AI agents along with medical experts, designed to assess the capabilities of LLMs in medical diagnosis. Within the CRAFT-MD framework, the doctor-AI and patient-AI agents emulate the intricate dynamics of a doctor-patient interaction, while the grader-AI agent and medical experts evaluate the resulting conversations. We used CRAFT-MD to evaluate the clinical conversational reasoning abilities of GPT-4 and GPT-3.5.
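To make the agent roles concrete, below is a minimal sketch of how a doctor-AI / patient-AI exchange could be simulated with the openai 0.27.x chat API listed under Dependencies. The prompts, helper function, vignette text, and turn count are illustrative assumptions, not the code actually used in `src/`.

```python
# Illustrative sketch only (not the repository's implementation): simulate a
# doctor-AI / patient-AI exchange with the openai 0.27.x ChatCompletion API.
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # replace with your own key

def agent_reply(system_prompt, history, model="gpt-4"):
    """Query one agent, conditioning it on its role and the conversation so far."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return response["choices"][0]["message"]["content"]

# Hypothetical example vignette and role prompts.
vignette = "A 35-year-old presents with an itchy, scaly rash on both elbows..."
doctor_system = "You are a doctor. Ask the patient one question at a time to reach a diagnosis."
patient_system = f"You are a patient. Answer the doctor's questions using only this case: {vignette}"

history = []  # transcript stored from the doctor's point of view
for _ in range(4):  # fixed number of turns, for illustration
    doctor_msg = agent_reply(doctor_system, history)
    history.append({"role": "assistant", "content": doctor_msg})
    # Flip roles so the patient agent sees the doctor's messages as "user" turns.
    patient_view = [{"role": "user" if m["role"] == "assistant" else "assistant",
                     "content": m["content"]} for m in history]
    patient_msg = agent_reply(patient_system, patient_view)
    history.append({"role": "user", "content": patient_msg})
```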
data/
- Contains the dataset, consisting of 140 case vignettes spanning a diverse range of dermatologic diseases. Each case vignette is accompanied by 4 answer choices (columns `Choice 1`, `Choice 2`, `Choice 3`, and `Choice 4`). The correct answer is indicated in the `Answer` column. The `Category` column details the dermatologist's assessment of each case vignette as follows: single diagnosis (category 1), one most likely diagnosis (category 2), or many possible diagnoses (category 3).
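As a quick sanity check of this layout, the dataset can be loaded with pandas; this snippet is illustrative only (it is not a script shipped in the repository) and assumes the file path `./data/dataset_final.tsv` referenced below.

```python
# Illustrative only: load the case-vignette dataset and inspect the documented columns.
import pandas as pd

df = pd.read_csv("./data/dataset_final.tsv", sep="\t")
print(len(df))  # expected: 140 case vignettes
print(df[["Choice 1", "Choice 2", "Choice 3", "Choice 4", "Answer", "Category"]].head())
```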
results/
- Contains intermediate results and model output.

conversations_raw/
- contains the raw outputs from GPT-4 and GPT-3.5 conversations, generated using CRAFT-MD.

expert_annotations/
- contains dermatologist assessments of the doctor-AI, patient-AI, and grader-AI agents, based on 120 conversations (60 generated by GPT-4 and 60 by GPT-3.5).

statistics/
- contains statistics for all experiments in the manuscript.

results_vignette.json
- contains the results for the vignette + 4-choice MCQ, vignette + many-choice MCQ, and vignette + FRQ experiments.

results_conversation.json
- contains the results for multi-turn, single-turn, and summarized conversations, each followed by 4-choice MCQ, many-choice MCQ, and FRQ.

results_conversation_withoutPE.json
- contains the results for conversations without physical exam, followed by 4-choice MCQ, many-choice MCQ, and FRQ.
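The stored results can be inspected directly; the exact JSON schema is not documented here, so the snippet below only prints the top-level structure. The file path is an assumption based on the listing above and may need adjusting.

```python
# Illustrative only: peek at the top-level structure of a results file.
import json

# Path assumed from the directory listing above; adjust if the file lives elsewhere.
with open("./results/statistics/results_vignette.json") as f:
    results = json.load(f)

print(type(results))
print(list(results)[:5])  # top-level keys (dict) or first entries (list)
```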
src/
- contains re-usable code for implementing the CRAFT-MD framework and conducting the analyses presented in the manuscript, along with Jupyter notebooks that can be run to replicate the CRAFT-MD framework and the other results presented in the manuscript.
Note: You will need an OpenAI API account, and you should replace the OpenAI API key placeholder in the scripts below.
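With the openai 0.27.x package listed under Dependencies, the key is typically supplied in one of the following ways; the exact variable or location expected by the scripts in `src/` may differ.

```python
# Two common ways to supply the key with openai 0.27.x; check the scripts in src/
# for the exact place where the key is expected.
import os
import openai

openai.api_key = "sk-..."                      # hard-coded (replace with your key), or
openai.api_key = os.environ["OPENAI_API_KEY"]  # read from an environment variable
```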
CRAFT-MD can be used to simulate a conversation with physical exam from a case vignette as follows (gpt-3.5 or gpt-4 should be specified as a parameter):
python run_conversation_withPE.py gpt-4
CRAFT-MD can be used to simulate a conversation without physical exam from a case vignette as follows (gpt-3.5 or gpt-4 should be specified as a parameter):
python run_conversation_withoutPE.py gpt-4
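For the grading step, in which a model's final diagnosis from each conversation is compared against the ground-truth Answer, a grader-AI call could in principle look like the sketch below. This is a hypothetical illustration, not the prompt or code used in the paper, and it assumes `openai.api_key` has already been set as described above.

```python
# Hypothetical grader-AI sketch: ask an LLM whether a predicted diagnosis
# matches the ground-truth answer. The prompt wording is illustrative only.
import openai

def grade_diagnosis(predicted, ground_truth, model="gpt-4"):
    prompt = (
        f"Predicted diagnosis: {predicted}\n"
        f"Correct diagnosis: {ground_truth}\n"
        "Do these refer to the same condition? Answer yes or no."
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"].strip().lower().startswith("yes")
```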
To use CRAFT-MD on a different dataset, replace ./data/dataset_final.tsv with the new dataset.
All code contained in the repository was tested on Python v3.9.17. The conda environment used can be re-created using conda_env.yml. All code can be run on a normal desktop computer, with a run time of less than 1 minute per case vignette for all experiments.
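For example, the environment can be re-created with the standard conda command (the environment name is defined inside conda_env.yml; activate it afterwards with conda activate <env_name>):
conda env create -f conda_env.yml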
Dependencies:
- statsmodels >= 0.14.0
- scipy >= 1.10.1
- openai >= 0.27.8