This GitHub repo relies on code from https://github.com/andyzoujm/representation-engineering and https://github.com/saprmarks/geometry-of-truth.
It uses Python 3.11.
- Set up access to model internals
- Set up environments
- Set up Avalon-LLM
  - Get it to run locally
- Test activation patching (see the sketch after this list)
- Test Mistral involvement
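The activation-patching test could look something like the sketch below: cache one layer's output on a clean prompt with a forward hook, then overwrite the same layer's output while running a second prompt. This is a minimal sketch, assuming a Hugging Face causal LM with a `model.model.layers` decoder stack; the model name, layer index, and prompts are placeholders, not the project's actual setup.

```python
# Minimal activation-patching sketch. Assumptions: a Hugging Face causal LM with a
# model.model.layers decoder stack; model name, layer index, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder
LAYER = 15                                # which decoder layer to patch (placeholder)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def _hidden(output):
    # Decoder layers may return a tuple (hidden_states, ...) or a bare tensor.
    return output[0] if isinstance(output, tuple) else output

def cache_layer_output(prompt):
    """Run once on `prompt` and cache the chosen layer's output activations."""
    cache = {}
    def hook(module, inputs, output):
        cache["acts"] = _hidden(output).detach()
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return cache["acts"]

def run_patched(prompt, patch_acts):
    """Re-run on `prompt`, overwriting the layer's output with `patch_acts`."""
    def hook(module, inputs, output):
        hidden = _hidden(output).clone()
        n = min(hidden.shape[1], patch_acts.shape[1])
        hidden[:, :n] = patch_acts[:, :n]  # overwrite the overlapping token positions
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return out.logits

clean_acts = cache_layer_output("The capital of France is")
patched_logits = run_patched("The capital of Germany is", clean_acts)
```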
- Get training data (2 truths, 1 lie)
  - Prompt experimentation
    - LLaMA (Bruce)
    - Mistral (Bruce)
    - Test with DSPy (Rob)
    - Test having GPT-4 provide few-shot prompts?
  - Get datasets with responses + hidden representations
  - Make a dataset of 2 truths and a lie: each row is 3 sentences, a label for each sentence, and a label for the whole group (Stephen); see the sketch after this list
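A minimal sketch of the intended row format (three statements, a per-statement label, and a group label) and of capturing a response plus a hidden representation for each statement. The model name, layer index, example statements, and output path are placeholders, not the project's actual data or pipeline.

```python
# Sketch of the 2-truths-and-a-lie row format, plus capturing a response and a hidden
# representation per statement. Model name, statements, layer, and paths are placeholders.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Each row: three statements, a truth label per statement, and a label for the group.
rows = [
    {
        "statements": [
            "The Eiffel Tower is in Paris.",
            "Water boils at 100 C at sea level.",
            "The Great Wall of China is visible from the Moon.",
        ],
        "statement_labels": [True, True, False],
        "group_label": "two_truths_one_lie",
    },
]

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder
LAYER = 15                                # placeholder layer to read from

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

records = []
for row in rows:
    for statement, label in zip(row["statements"], row["statement_labels"]):
        inputs = tok(statement, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
            gen = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        # hidden_states is a tuple of (num_layers + 1) tensors of shape (batch, seq, hidden);
        # keep the last-token vector at the chosen layer.
        hidden_vec = out.hidden_states[LAYER][0, -1].float().cpu().tolist()
        response = tok.decode(gen[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        records.append({
            "statement": statement,
            "label": label,
            "group_label": row["group_label"],
            "response": response,
            "hidden": hidden_vec,
        })

with open("two_truths_one_lie_hidden.json", "w") as f:
    json.dump(records, f)
```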
- RepE
  - Test using the RepE dataset (simplified) (Rob)
  - Test on the two-truths-and-a-lie dataset (Bruce)
    - Run the honesty eval on 2 truths and a lie (Stephen); see the probe sketch after this list
    - Visualize (Stephen)
  - Test on Avalon?
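For the honesty eval and visualization, a generic stand-in (not the RepE library's actual API) could fit a linear probe on the cached hidden vectors and plot a 2-D projection of truths vs. lies, in the spirit of the representation-engineering and geometry-of-truth code. The input path and probe choice are assumptions carried over from the dataset sketch above.

```python
# Generic stand-in for the honesty eval + visualization (not the RepE library's API):
# fit a linear probe on cached hidden vectors and plot a 2-D PCA of truths vs. lies.
# Assumes records like those written by the dataset sketch above (placeholder path).
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

with open("two_truths_one_lie_hidden.json") as f:
    records = json.load(f)

X = np.array([r["hidden"] for r in records])
y = np.array([r["label"] for r in records])

# Linear probe accuracy on held-out statements as a simple honesty score.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))

# Visualization: project all activations to 2-D and color by truth label.
proj = PCA(n_components=2).fit_transform(X)
plt.scatter(proj[y, 0], proj[y, 1], label="truths")
plt.scatter(proj[~y, 0], proj[~y, 1], label="lies")
plt.legend()
plt.title("2 truths and a lie: hidden-state PCA")
plt.savefig("two_truths_one_lie_pca.png")
```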
- Training Approaches
  - Make datasets of N truths and M lies
    - One and one, two and one, two and two
  - Train on those (Rob); see the sketch after this list
  - Evaluate performance
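A sketch of assembling N-truths / M-lies groups from a pool of labeled statements and training a probe per configuration. The statement pool, path, group counts, and the logistic-regression probe are assumptions, not the project's actual training setup.

```python
# Sketch of building N-truths / M-lies groups from a pool of labeled statements and
# training a probe per configuration. The pool path, group counts, and the
# logistic-regression probe are assumptions, not the project's actual setup.
import json
import random
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

with open("two_truths_one_lie_hidden.json") as f:  # placeholder statement pool
    records = json.load(f)
truths = [r for r in records if r["label"]]
lies = [r for r in records if not r["label"]]

CONFIGS = [(1, 1), (2, 1), (2, 2)]  # (N truths, M lies), per the list above

def make_groups(n_truths, n_lies, n_groups=100, seed=0):
    """Sample n_groups groups, each with n_truths true and n_lies false statements."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        members = rng.sample(truths, n_truths) + rng.sample(lies, n_lies)
        rng.shuffle(members)
        groups.append(members)
    return groups

for n, m in CONFIGS:
    groups = make_groups(n, m)
    # Flatten groups back into statement-level examples for the probe.
    # Note: a real eval should split at the statement level to avoid leakage across groups.
    X = np.array([r["hidden"] for g in groups for r in g])
    y = np.array([r["label"] for g in groups for r in g])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"{n} truths / {m} lies: held-out accuracy {probe.score(X_te, y_te):.2f}")
```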
- Write-up
  - Abstract
  - Figure/Diagram
  - Formal Description
  - Related Work