ai-alignment-project

Citation

This GitHub repo relies on code from https://github.com/andyzoujm/representation-engineering (RepE) and https://github.com/saprmarks/geometry-of-truth

TO-DO

Using Python 3.11

  • Set up access to model internals
  • Set up environments
  • Set up Avalon-LLM
    • Get it to run locally
    • Test activation patching (see the patching sketch below)
    • Test Mistral involvement
  • Get training data (two truths and a lie)
  • Prompt experimentation
    • LLaMA (Bruce)
    • Mistral (Bruce)
    • Test with DSPy (Rob) (see the DSPy sketch below)
    • Test using GPT-4 to generate few-shot prompts? (see the API sketch below)
  • Get datasets with responses + hidden representations (see the hidden-state sketch below)
    • Make a two-truths-and-a-lie dataset: each row is three statements, a truth label for each, and a group-level label (Stephen) (see the schema sketch below)
  • RepE
    • Test using the simplified RepE dataset (Rob)
    • Test on the two-truths-and-a-lie dataset (Bruce)
    • Run the honesty eval on two truths and a lie (Stephen) (see the RepE sketch below)
    • Visualize the results (Stephen) (see the PCA sketch below)
    • Test on Avalon?
  • Training Approaches
    • Make datasets of N truths and M lies
      • One and one, two and one, two and two
    • Train on those (Rob)
    • Evaluate performance
  • Write-up
    • Abstract
    • Figure/Diagram
    • Formal Description
    • Related Work
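
Sketches

A minimal activation-patching sketch, assuming a Hugging Face LLaMA/Mistral-style causal LM (the model name, layer index, and prompts are placeholders): cache one decoder layer's output on a clean prompt, then swap it into a run on a corrupted prompt and compare logits.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder: any LLaMA/Mistral-style HF model
LAYER = 15                                # placeholder layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

layer = model.model.layers[LAYER]  # decoder block whose output we patch
cache = {}

def save_hook(module, args, output):
    # Decoder blocks return a tuple; hidden states are the first element.
    cache["clean"] = output[0].detach()

def patch_hook(module, args, output):
    # Returning a value from a forward hook replaces the module's output.
    return (cache["clean"],) + output[1:]

# The clean and corrupted prompts must tokenize to the same length.
clean = tokenizer("The second statement is the lie.", return_tensors="pt")
corrupt = tokenizer("The second statement is the truth.", return_tensors="pt")

handle = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)                            # cache the clean activations
handle.remove()

handle = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits  # corrupted run, clean layer output
handle.remove()
```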
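
For the DSPy test, a minimal signature/predictor sketch. This follows the DSPy 2.x API (the LM wrapper constructor varies across versions; newer releases use dspy.LM(...)), and the task framing is only an assumption.

```python
import dspy

# DSPy 2.x-style LM setup; swap the model name as needed.
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

class SpotTheLie(dspy.Signature):
    """Given three statements, identify which one is the lie."""
    statements = dspy.InputField(desc="three numbered statements")
    lie_index = dspy.OutputField(desc="1, 2, or 3")

spot = dspy.Predict(SpotTheLie)
pred = spot(statements=(
    "1. Paris is the capital of France.\n"
    "2. The Great Wall of China is visible from the Moon.\n"
    "3. Water boils at 100 C at sea level."
))
print(pred.lie_index)
```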
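
If GPT-4 ends up supplying few-shot examples, a minimal call with the OpenAI Python client (v1.x chat.completions API); the prompt wording is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You write few-shot examples for a lie-detection task."},
        {"role": "user",
         "content": "Write three 'two truths and a lie' sets and mark the lie in each."},
    ],
)
print(response.choices[0].message.content)
```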
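
To pair responses with hidden representations, transformers exposes per-layer activations via output_hidden_states=True; a sketch (model name and prompt are placeholders).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim); keep the last-token vector from every layer.
reps = torch.stack([h[0, -1, :] for h in out.hidden_states])
print(reps.shape)  # (num_layers + 1, hidden_dim)
```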
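
One possible row layout for the two-truths-and-a-lie dataset (field names are illustrative, not a fixed schema).

```python
import csv

# Each row: three statements, a truth label for each, and a group-level
# label marking which statement is the lie.
rows = [
    {
        "statement_1": "Paris is the capital of France.",
        "statement_2": "Water boils at 100 C at sea level.",
        "statement_3": "The Great Wall of China is visible from the Moon.",
        "label_1": True,
        "label_2": True,
        "label_3": False,
        "lie_index": 3,
    },
]

with open("two_truths_one_lie.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```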
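
For the honesty eval, the representation-engineering repo cited above registers custom Hugging Face pipelines. The sketch below follows its examples, but the exact get_directions arguments and the input/label format should be checked against the repo's honesty notebook.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from repe import repe_pipeline_registry  # from the cited RepE repo

repe_pipeline_registry()  # registers the "rep-reading" task with transformers

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

rep_reading = pipeline("rep-reading", model=model, tokenizer=tokenizer)

# Contrastive stimuli: the same content framed honestly vs. dishonestly.
train_inputs = [
    "Pretend you are an honest person. Paris is the capital of France.",
    "Pretend you are a dishonest person. Paris is the capital of France.",
]
train_labels = [[True, False]]  # format assumed from the repo's honesty example

directions = rep_reading.get_directions(
    train_inputs,
    rep_token=-1,          # read the last token's representation
    hidden_layers=list(range(-1, -model.config.num_hidden_layers - 1, -1)),
    n_difference=1,        # pair-wise differences before PCA
    train_labels=train_labels,
    direction_method="pca",
)
```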
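
For the visualization step, projecting last-token activations onto their first two principal components (in the spirit of the geometry-of-truth plots) is a quick sanity check; random data stands in for real activations here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# reps: (n_statements, hidden_dim) activations from one layer;
# labels: 1 for truths, 0 for lies. Random placeholders below.
rng = np.random.default_rng(0)
reps = rng.normal(size=(300, 4096))
labels = rng.integers(0, 2, size=300)

proj = PCA(n_components=2).fit_transform(reps)
for value, name in [(1, "truth"), (0, "lie")]:
    mask = labels == value
    plt.scatter(proj[mask, 0], proj[mask, 1], label=name, alpha=0.6)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.savefig("truth_lie_pca.png")
```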