ai-alignment-project

Citation

This GitHub repo relies on code from https://github.com/andyzoujm/representation-engineering (RepE) and https://github.com/saprmarks/geometry-of-truth

TO-DO

Using Python 3.11

  • Set up access to model internals
  • Set up environments
  • Set up Avalon-LLM
    • Get it to run locally
    • Test activation patching (see the patching sketch below)
    • Test Mistral involvement
  • Get training data (two truths and a lie)
  • Prompt experimentation
    • LLaMA (Bruce)
    • Mistral (Bruce)
    • Test with DSPy (Rob) (see the DSPy sketch below)
    • Test using GPT-4 to generate few-shot prompts? (see the API sketch below)
  • Get datasets with responses + hidden representations (see the hidden-state sketch below)
    • Make a two-truths-and-a-lie dataset: each row is three statements, a truth label for each, and a group-level label (Stephen) (see the schema sketch below)
  • RepE
    • Test using the simplified RepE dataset (Rob)
    • Test on the two-truths-and-a-lie dataset (Bruce)
    • Run the honesty eval on two truths and a lie (Stephen) (see the RepE sketch below)
    • Visualize the results (Stephen) (see the PCA sketch below)
    • Test on Avalon?
  • Training Approaches
    • Make datasets of N truths and M lies
      • One and one, two and one, two and two
    • Train on those (Rob)
    • Evaluate performance
  • Write-up
    • Abstract
    • Figure/Diagram
    • Formal Description
    • Related Work
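
Sketches

A minimal activation-patching sketch, assuming a Hugging Face LLaMA/Mistral-style causal LM (the model name, layer index, and prompts are placeholders): cache one decoder layer's output on a clean prompt, then swap it into a run on a corrupted prompt and compare logits.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder: any LLaMA/Mistral-style HF model
LAYER = 15                                # placeholder layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

layer = model.model.layers[LAYER]  # decoder block whose output we patch
cache = {}

def save_hook(module, args, output):
    # Decoder blocks return a tuple; hidden states are the first element.
    cache["clean"] = output[0].detach()

def patch_hook(module, args, output):
    # Returning a value from a forward hook replaces the module's output.
    return (cache["clean"],) + output[1:]

# The clean and corrupted prompts must tokenize to the same length.
clean = tokenizer("The second statement is the lie.", return_tensors="pt")
corrupt = tokenizer("The second statement is the truth.", return_tensors="pt")

handle = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)                            # cache the clean activations
handle.remove()

handle = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits  # corrupted run, clean layer output
handle.remove()
```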
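
For the DSPy test, a minimal signature/predictor sketch. This follows the DSPy 2.x API (the LM wrapper constructor varies across versions; newer releases use dspy.LM(...)), and the task framing is only an assumption.

```python
import dspy

# DSPy 2.x-style LM setup; swap the model name as needed.
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

class SpotTheLie(dspy.Signature):
    """Given three statements, identify which one is the lie."""
    statements = dspy.InputField(desc="three numbered statements")
    lie_index = dspy.OutputField(desc="1, 2, or 3")

spot = dspy.Predict(SpotTheLie)
pred = spot(statements=(
    "1. Paris is the capital of France.\n"
    "2. The Great Wall of China is visible from the Moon.\n"
    "3. Water boils at 100 C at sea level."
))
print(pred.lie_index)
```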
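
If GPT-4 ends up supplying few-shot examples, a minimal call with the OpenAI Python client (v1.x chat.completions API); the prompt wording is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You write few-shot examples for a lie-detection task."},
        {"role": "user",
         "content": "Write three 'two truths and a lie' sets and mark the lie in each."},
    ],
)
print(response.choices[0].message.content)
```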
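
To pair responses with hidden representations, transformers exposes per-layer activations via output_hidden_states=True; a sketch (model name and prompt are placeholders).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim); keep the last-token vector from every layer.
reps = torch.stack([h[0, -1, :] for h in out.hidden_states])
print(reps.shape)  # (num_layers + 1, hidden_dim)
```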
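
One possible row layout for the two-truths-and-a-lie dataset (field names are illustrative, not a fixed schema).

```python
import csv

# Each row: three statements, a truth label for each, and a group-level
# label marking which statement is the lie.
rows = [
    {
        "statement_1": "Paris is the capital of France.",
        "statement_2": "Water boils at 100 C at sea level.",
        "statement_3": "The Great Wall of China is visible from the Moon.",
        "label_1": True,
        "label_2": True,
        "label_3": False,
        "lie_index": 3,
    },
]

with open("two_truths_one_lie.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```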
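
For the honesty eval, the representation-engineering repo cited above registers custom Hugging Face pipelines. The sketch below follows its examples, but the exact get_directions arguments and the input/label format should be checked against the repo's honesty notebook.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from repe import repe_pipeline_registry  # from the cited RepE repo

repe_pipeline_registry()  # registers the "rep-reading" task with transformers

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

rep_reading = pipeline("rep-reading", model=model, tokenizer=tokenizer)

# Contrastive stimuli: the same content framed honestly vs. dishonestly.
train_inputs = [
    "Pretend you are an honest person. Paris is the capital of France.",
    "Pretend you are a dishonest person. Paris is the capital of France.",
]
train_labels = [[True, False]]  # format assumed from the repo's honesty example

directions = rep_reading.get_directions(
    train_inputs,
    rep_token=-1,          # read the last token's representation
    hidden_layers=list(range(-1, -model.config.num_hidden_layers - 1, -1)),
    n_difference=1,        # pair-wise differences before PCA
    train_labels=train_labels,
    direction_method="pca",
)
```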
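
For the visualization step, projecting last-token activations onto their first two principal components (in the spirit of the geometry-of-truth plots) is a quick sanity check; random data stands in for real activations here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# reps: (n_statements, hidden_dim) activations from one layer;
# labels: 1 for truths, 0 for lies. Random placeholders below.
rng = np.random.default_rng(0)
reps = rng.normal(size=(300, 4096))
labels = rng.integers(0, 2, size=300)

proj = PCA(n_components=2).fit_transform(reps)
for value, name in [(1, "truth"), (0, "lie")]:
    mask = labels == value
    plt.scatter(proj[mask, 0], proj[mask, 1], label=name, alpha=0.6)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.savefig("truth_lie_pca.png")
```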