/socratic-llm

Training pipeline for fine-tuning Phi-3-mini-instruct to follow the Socratic method

Primary language: Python. License: MIT.

Socratic LLM

Using Large Language Models (LLMs) in education presents unique challenges. Typically, LLMs are designed to provide direct answers to questions, which can hinder students' critical thinking and self-discovery skills. To address this, we focus on fine-tuning LLMs to facilitate Socratic interactions. Instead of giving straightforward answers, these models guide students to explore and find the answers themselves. We achieve this through Direct Preference Optimization (DPO). We test our approach with diverse datasets, including various educational materials and Socratic dialogues. Using advanced models like GPT-4o for evaluation, our results show that DPO successfully fine-tunes LLMs for Socratic dialogue, enhancing their educational value.

This repository contains the source material for the paper "Fine Tuning a Large Language Model for Socratic Interactions" (KDD-2024, AI4EDU Workshop).

Model inference

HuggingFace

It's possible to download and execute the model using HuggingFace's transformers library with:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "eurecom-ds/Phi-3-mini-4k-socratic",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
)

tokenizer = AutoTokenizer.from_pretrained("eurecom-ds/Phi-3-mini-4k-socratic", trust_remote_code=True)

For more details, see Phi-3-mini-4k-socratic.

Ollama

The model is also available on OllamaHub: eurecom-ds/phi-3-mini-4k-socratic. Quantized versions are provided as well for memory-constrained environments. With Ollama you can quickly serve the model behind a web service or simply run it locally. For example,

# Ollama installation
curl -fsSL https://ollama.com/install.sh | sh

# Launching ollama service
ollama serve &

# Running the quantized model locally
ollama run eurecom-ds/phi-3-mini-4k-socratic:Q4_0

Check out more about Ollama here.
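
Once `ollama serve` is running, the model can also be queried from Python through Ollama's REST API. The snippet below is a minimal sketch, not part of this repository: it assumes the service listens on Ollama's default local address (`http://localhost:11434`) and that the quantized tag from the command above has been pulled. The helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Assumption: `ollama serve` is listening on Ollama's default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Model tag taken from the `ollama run` command above.
MODEL = "eurecom-ds/phi-3-mini-4k-socratic:Q4_0"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama service."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running service)."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        return json.loads(response.read())["response"]
```

With the service up, calling `ask("Student: What can stop a fire?")` should return the model's reply as a string.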

Chatbot

Running a chatbot

You can interact with the model through a chatbot application powered by Gradio by running a Docker container.

docker run --rm --gpus all -p 2121:2121 -v /home/<user>/huggingface/:/huggingface -e HF_HOME=/huggingface -it eurecomds/phi-3-mini-4k-socratic

You can set the port the chatbot application listens on with --server-port <port number> (default 2121), or load the model with 4-bit quantization by adding --load-in-4bit to the end of the above command line.

Building your own chatbot

Our model was trained to follow the Socratic method over multiple interactions. To do so, it needs the chat history as part of its input. We therefore advise prefixing each student and professor turn with its role and presenting the turns to the model in linear order. For example, the chat history below can be used as the model's input.

Student: What can stop a fire?
Professor: Can you think of different methods or substances that might be effective in stopping a fire?
Student: I could use water
Professor: Water extinguishes fire due to its cooling effect and its ability to remove heat. Can you think about how the heat absorption by water might affect the fire triangle, which consists of heat, fuel, and oxygen? And considering your answer, what other methods could be effective in different scenarios?
Student: Maybe using a carbon dioxide for removing oxygen?

For more details, check out how we built our chatbot at socratic_ui.py.
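
The role-prefixed history shown above can be assembled with a small helper. This is only an illustrative sketch and not the code in socratic_ui.py; the function name `format_chat_history` and the choice to collapse line breaks inside a message are assumptions.

```python
from typing import List, Tuple

# Role labels taken from the example chat history above.
STUDENT = "Student"
PROFESSOR = "Professor"


def format_chat_history(turns: List[Tuple[str, str]]) -> str:
    """Flatten (role, message) turns into one role-prefixed line per turn."""
    lines = []
    for role, message in turns:
        if role not in (STUDENT, PROFESSOR):
            raise ValueError(f"unknown role: {role!r}")
        # Collapse internal whitespace so each turn stays on a single line.
        lines.append(f"{role}: {' '.join(message.split())}")
    return "\n".join(lines)


history = [
    (STUDENT, "What can stop a fire?"),
    (PROFESSOR, "Can you think of different methods or substances that "
                "might be effective in stopping a fire?"),
    (STUDENT, "I could use water"),
]
prompt = format_chat_history(history)
print(prompt)
```

The resulting string ends with the latest student turn, so the model's next output is the professor's reply.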

Scripts

We also provide the training and evaluation scripts.

  • self_eval.py: Evaluates an LLM combined with prompt engineering (e.g., GPT-4o or Llama3:70b)
  • eval_model.py: Evaluates the finetuned model, or the base model with prompt engineering only
  • gen_train_dataset.py: Generates the dataset for DPO finetuning using another LLM as a judge (i.e., GPT-4o)
  • train.py: Runs DPO on the base model
  • human_vs_gpt.py: Uses the judge model to evaluate the human-scored examples (validation of the judge LLM)
  • pipeline.py: Executes the training pipeline end-to-end (DPO dataset generation + finetuning + evaluation)

For each script, check --help for more details.
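
For readers unfamiliar with the objective that train.py optimizes, the standard DPO loss for a single chosen/rejected pair can be written in a few lines. This is a sketch of the textbook formula only, not the repository's training code; the function name `dpo_loss` and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not taken from the repository
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy favours each answer than the frozen reference does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), split by sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and decreases as the policy raises the likelihood of the chosen answer relative to the rejected one.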

Pipeline artifacts

When running the complete pipeline, the script generates a set of training and evaluation artifacts following the given structure:

├── training_root                                  # name to be specified by the user
│   ├── dpo                                        # DPO related files
│   │   ├── {dataset}                              # seed dataset {mathdial,tutorchat,debugging}
│   │       ├── train_dataset.json                 # Examples generated by the base model + prompt engineering, then classified as chosen/rejected by the judge model
│   │       ├── weights                            # Finetuned model weights
│   │       ├── checkpoints                        # Training checkpoints
│   ├── evaluation                                 # Performance assessment related files
│   │   ├── {dataset}                              # seed dataset {mathdial,tutorchat,debugging}
│   │   │   ├── from_finetuned_with_tutorchat.json # GPT-4o evaluation using model finetuned with tutorchat data 
│   │   │   ├── from_finetuned_with_mathdial.json  #    "        "       "     "       "      "   mathdial data
│   │   │   ├── from_finetuned_with_debugging.json #    "        "       "     "       "      "   debugging data
│   │   │   ├── base.json                          #    "        "       "     base model + prompt-engineering
│   │   │   ├── gpt4o.json                         #    "        "       "     GPT-4o + prompt-engineering
│   │   ├── human_vs_gpt.json                      # Comparison between human assessment and judge LLM
│   ├── figures                                    # report evaluation figures
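
To locate these artifacts programmatically, the layout above can be mirrored with pathlib. The helper below is a hypothetical convenience, not part of the repository; it only encodes the directory and file names listed in the tree, and the dictionary keys are made up for illustration.

```python
from pathlib import Path
from typing import Dict

# Seed dataset names as listed in the tree above.
DATASETS = ("mathdial", "tutorchat", "debugging")


def artifact_paths(training_root: str, dataset: str) -> Dict[str, Path]:
    """Return the expected artifact locations for one seed dataset."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    root = Path(training_root)
    dpo_dir = root / "dpo" / dataset
    eval_dir = root / "evaluation" / dataset
    paths = {
        "train_dataset": dpo_dir / "train_dataset.json",
        "weights": dpo_dir / "weights",
        "checkpoints": dpo_dir / "checkpoints",
        "eval_base": eval_dir / "base.json",
        "eval_gpt4o": eval_dir / "gpt4o.json",
        "human_vs_gpt": root / "evaluation" / "human_vs_gpt.json",
        "figures": root / "figures",
    }
    # One evaluation file per finetuned model, named after its training dataset.
    for trained_on in DATASETS:
        paths[f"eval_finetuned_{trained_on}"] = (
            eval_dir / f"from_finetuned_with_{trained_on}.json"
        )
    return paths
```

For example, `artifact_paths("training_root", "mathdial")["train_dataset"]` points at `training_root/dpo/mathdial/train_dataset.json`.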

Running in Docker container

It's possible to run any of the project's scripts in a Docker container. To do so, first build the image with

$ docker build -t socratic-llm .

Then run it, remembering to expose the GPU and to mount the script's input/output directories. For example,

$ docker run --rm --gpus all -v socratic-llm/:/socractic-llm -v /home/<user>/huggingface:/huggingface -e HF_HOME=/huggingface -it socratic-llm -m pipeline --judge-llm openai <open-ai-key> gpt-4o --output-dir /socractic-llm --instruct-model microsoft/Phi-3-mini-4k-instruct

Cite this work

@inproceedings{bonino2024socratic,
  title        = {Fine Tuning a Large Language Model for Socratic Interactions},
  author       = {Giulia Bonino and Gabriele Sanmartino and Giovanni Gatti Pinheiro and Paolo Papotti and Raphael Troncy and Pietro Michiardi},
  year         = 2024,
  month        = {August},
  booktitle    = {Proceedings of the Workshop On AI For Education (AI4EDU), in conjunction with the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)},
  publisher    = {ACM Press},
  address      = {Barcelona},
  organization = {ACM}
}