This repository contains code and data for the paper: Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD). CRAFT-MD is a multi-agent conversational framework consisting of doctor-AI, patient-AI, and grader-AI agents along with medical experts, designed to assess the capabilities of LLMs in medical diagnosis. Within the CRAFT-MD framework, the doctor-AI and patient-AI agents emulate the intricate dynamics of a doctor-patient interaction, while the grader-AI agent and medical experts evaluate the resulting conversations. We used CRAFT-MD to evaluate the clinical conversational reasoning abilities of GPT-4 and GPT-3.5.
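To make the agent roles concrete, below is a minimal sketch of how a doctor-AI / patient-AI exchange could be simulated with the openai 0.27.x chat API listed under Dependencies. The prompts, helper function, vignette text, and turn count are illustrative assumptions, not the code actually used in `src/`.

```python
# Illustrative sketch only (not the repository's implementation): simulate a
# doctor-AI / patient-AI exchange with the openai 0.27.x ChatCompletion API.
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # replace with your own key

def agent_reply(system_prompt, history, model="gpt-4"):
    """Query one agent, conditioning it on its role and the conversation so far."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return response["choices"][0]["message"]["content"]

# Hypothetical example vignette and role prompts.
vignette = "A 35-year-old presents with an itchy, scaly rash on both elbows..."
doctor_system = "You are a doctor. Ask the patient one question at a time to reach a diagnosis."
patient_system = f"You are a patient. Answer the doctor's questions using only this case: {vignette}"

history = []  # transcript stored from the doctor's point of view
for _ in range(4):  # fixed number of turns, for illustration
    doctor_msg = agent_reply(doctor_system, history)
    history.append({"role": "assistant", "content": doctor_msg})
    # Flip roles so the patient agent sees the doctor's messages as "user" turns.
    patient_view = [{"role": "user" if m["role"] == "assistant" else "assistant",
                     "content": m["content"]} for m in history]
    patient_msg = agent_reply(patient_system, patient_view)
    history.append({"role": "user", "content": patient_msg})
```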
data/
- Contains the dataset, consisting of 140 case vignettes spanning a diverse range of dermatologic diseases. Each case vignette is accompanied by 4 answer choices (columns `Choice 1`, `Choice 2`, `Choice 3`, and `Choice 4`). The correct answer is indicated in the `Answer` column. The `Category` column details the dermatologist's assessment of each case vignette as follows: single diagnosis (category 1), one most likely diagnosis (category 2), or many possible diagnoses (category 3).
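As a quick sanity check of this layout, the dataset can be loaded with pandas; this snippet is illustrative only (it is not a script shipped in the repository) and assumes the file path `./data/dataset_final.tsv` referenced below.

```python
# Illustrative only: load the case-vignette dataset and inspect the documented columns.
import pandas as pd

df = pd.read_csv("./data/dataset_final.tsv", sep="\t")
print(len(df))  # expected: 140 case vignettes
print(df[["Choice 1", "Choice 2", "Choice 3", "Choice 4", "Answer", "Category"]].head())
```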
results/
- Contains intermediate results and model output.

conversations_raw/
- contains the raw outputs from GPT-4 and GPT-3.5 conversations, generated using CRAFT-MD.

expert_annotations/
- contains dermatologist assessments of the doctor-AI, patient-AI, and grader-AI agents, based on 120 conversations (60 generated by GPT-4 and 60 by GPT-3.5).

statistics/
- contains statistics for all experiments in the manuscript.

results_vignette.json
- contains the results for the vignette + 4-choice MCQ, vignette + many-choice MCQ, and vignette + FRQ experiments.

results_conversation.json
- contains the results for multi-turn, single-turn, and summarized conversations, each followed by 4-choice MCQ, many-choice MCQ, and FRQ.

results_conversation_withoutPE.json
- contains the results for conversations without physical exam, followed by 4-choice MCQ, many-choice MCQ, and FRQ.
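The stored results can be inspected directly; the exact JSON schema is not documented here, so the snippet below only prints the top-level structure. The file path is an assumption based on the listing above and may need adjusting.

```python
# Illustrative only: peek at the top-level structure of a results file.
import json

# Path assumed from the directory listing above; adjust if the file lives elsewhere.
with open("./results/statistics/results_vignette.json") as f:
    results = json.load(f)

print(type(results))
print(list(results)[:5])  # top-level keys (dict) or first entries (list)
```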
src/
- contains re-usable code for implementing the CRAFT-MD framework and conducting the analyses presented in the manuscript, along with Jupyter notebooks that can be run to replicate the CRAFT-MD framework and the other results presented in the manuscript.
Note: You will need an OpenAI API account, and you should replace the OpenAI API key placeholder in the scripts below.
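With the openai 0.27.x package listed under Dependencies, the key is typically supplied in one of the following ways; the exact variable or location expected by the scripts in `src/` may differ.

```python
# Two common ways to supply the key with openai 0.27.x; check the scripts in src/
# for the exact place where the key is expected.
import os
import openai

openai.api_key = "sk-..."                      # hard-coded (replace with your key), or
openai.api_key = os.environ["OPENAI_API_KEY"]  # read from an environment variable
```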
CRAFT-MD can be used to simulate a conversation with physical exam from a case vignette as follows (gpt-3.5 or gpt-4 should be specified as a parameter):
python run_conversation_withPE.py gpt-4
CRAFT-MD can be used to simulate a conversation without physical exam from a case vignette as follows (gpt-3.5 or gpt-4 should be specified as a parameter):
python run_conversation_withoutPE.py gpt-4
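For the grading step, in which a model's final diagnosis from each conversation is compared against the ground-truth Answer, a grader-AI call could in principle look like the sketch below. This is a hypothetical illustration, not the prompt or code used in the paper, and it assumes `openai.api_key` has already been set as described above.

```python
# Hypothetical grader-AI sketch: ask an LLM whether a predicted diagnosis
# matches the ground-truth answer. The prompt wording is illustrative only.
import openai

def grade_diagnosis(predicted, ground_truth, model="gpt-4"):
    prompt = (
        f"Predicted diagnosis: {predicted}\n"
        f"Correct diagnosis: {ground_truth}\n"
        "Do these refer to the same condition? Answer yes or no."
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"].strip().lower().startswith("yes")
```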
To use CRAFT-MD on a different dataset, replace ./data/dataset_final.tsv with the new dataset.
All code contained in the repository was tested on Python v3.9.17. The conda environment used can be re-created using conda_env.yml. All code can be run on a normal desktop computer, with a run time of less than 1 minute per case vignette for all experiments.
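For example, the environment can be re-created with the standard conda command (the environment name is defined inside conda_env.yml; activate it afterwards with conda activate <env_name>):
conda env create -f conda_env.yml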
Dependencies:
- statsmodels >= 0.14.0
- scipy >= 1.10.1
- openai >= 0.27.8