Note 2021-07-31: I decided to leave AI alignment research last year. See https://www.lesswrong.com/posts/HDXLTFnSndhpLj2XZ/i-m-leaving-ai-alignment-you-better-stay. This repo shows the state of my research at the time I left.
BLUF: Read Overseer failures in SupAmp and ReAmp, then the interesting parts of Training a tiny SupAmp model on easy tasks.
Project definition (CoR):
I'm studying the impact of overseer failure on RL-based IDA,
because I want to know under what conditions amplification and distillation
increase or decrease the failure rate,
in order to help my reader understand whether explicit reliability
amplification is necessary for IDA to work in practice.
In this project I will:
- Take the implementation of iterated distillation and amplification from Christiano et al.'s ‘Supervising strong learners by amplifying weak experts’, introduce overseer failures, and see how they influence the overall failure rate. (A toy sketch of this follows the list.)
- Adapt the system to reinforcement learning. (It uses supervised learning now.)
- Introduce overseer failures in the RL setting and see how they influence the overall failure rate.
- Write a paper about the results.
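To make the question concrete, here is a minimal Monte Carlo sketch of how a per-call overseer failure rate could compound under amplification. It is not taken from the Farlamp or CSASupAmp code; it assumes, pessimistically and purely for illustration, that one wrong sub-answer from the overseer corrupts the whole amplified answer. All names, the failure probability `epsilon`, and the sub-question count are made up for this sketch.

```python
import random

def faulty_overseer(correct_answer, epsilon):
    """Return the correct (boolean) answer, except with probability
    epsilon return the flipped answer instead. This models a single
    overseer failure; epsilon is illustrative."""
    if random.random() < epsilon:
        return not correct_answer
    return correct_answer

def amplified_answer(epsilon, n_subquestions):
    """Pessimistic toy amplification step: the composed answer is
    correct only if every sub-answer from the overseer is correct."""
    return all(faulty_overseer(True, epsilon)
               for _ in range(n_subquestions))

def failure_rate(epsilon, n_subquestions, n_trials=100_000):
    """Estimate the overall failure rate by Monte Carlo."""
    failures = sum(not amplified_answer(epsilon, n_subquestions)
                   for _ in range(n_trials))
    return failures / n_trials

if __name__ == "__main__":
    for eps in (0.01, 0.05, 0.10):
        # Under these assumptions the analytic failure rate is
        # 1 - (1 - eps)**8, so per-call failures compound quickly.
        print(f"eps={eps:.2f}  "
              f"empirical={failure_rate(eps, 8):.3f}  "
              f"analytic={1 - (1 - eps)**8:.3f}")
```

Whether distillation and further rounds of amplification dampen or magnify such compounded failures is exactly what the experiments are meant to find out.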
Overseer failures in SupAmp and ReAmp contains a more extensive introduction, as well as an explanation of the relevant terms, concepts, etc.
For the code see rmoehn/amplification, which is a fork of paulfchristiano/amplification.
- Overseer failures in SupAmp and ReAmp – Start here. This is the most polished document so far.
- Training a tiny SupAmp model on easy tasks – Report on first experiments.
- How to turn SupAmp into ReAmp? – Less detailed and less polished analysis of how to approach the first project phase.
- What I need for planning the Farlamp draft – Will contain all the information I need for planning a draft of the paper.
- Literature overview – Work in progress: searching, skimming, filtering, and summarizing the literature for this project.
- Current project outline – Overview of cases and estimates. Most up to date, but doesn't contain upcoming milestones.
- Old project outline – Outdated, but does contain some upcoming milestones.
There are more files, but they are only useful to me. The code won't be published here, because it will be based on the code from CSASupAmp, which is subject to a strict publication policy.
Term | Definition |
---|---|
CoR | Booth et al.: The Craft of Research |
CSASupAmp | Christiano et al.: Supervising strong learners by amplifying weak experts |
Est. 5 % | 5th percentile of my estimated duration distribution, i.e. the leftmost point of the triangular distribution (see the sketch after this table)
Est. mode | mode of my estimated duration distribution
Est. 95 % | 95th percentile of my estimated duration distribution, i.e. the rightmost point of the triangular distribution
Farlamp | Failures in RL-based amplification (I just had to come up with a short project name.) |
Draft Basis | A template derived from CoR, p. 175, which, when filled in completely, provides all the information necessary for planning a draft. Includes the structure of the argument. |
LW | LessWrong |
MxD | MIRIxDiscord |
RL | reinforcement learning |
ReAmp | SupAmp adapted to RL |
SL | supervised learning |
SupAmp | The system from CSASupAmp for iterated distillation and amplification using supervised learning |
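As an aside on the three ‘Est.’ entries: together they specify a triangular distribution over task durations. A hypothetical example with made-up numbers, using only Python's standard library:

```python
import random
import statistics

# Hypothetical three-point estimate for one task (numbers made up):
# Est. 5 % = 2 days, Est. mode = 5 days, Est. 95 % = 14 days.
low, mode, high = 2.0, 5.0, 14.0

# random.triangular(low, high, mode) samples from the triangle whose
# leftmost point is `low` and rightmost point is `high`.
samples = [random.triangular(low, high, mode) for _ in range(100_000)]

print("sampled mean: ", round(statistics.mean(samples), 2))
print("analytic mean:", (low + mode + high) / 3)  # mean of a triangular distribution
```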
For detailed bibliographical information see references.bib.
Thanks to Paul Christiano for funding this project and giving me advice. Thanks also to William Saunders for providing his version of the CSASupAmp code.
To the extent possible under law, Richard Möhn has waived all copyright and related or neighboring rights to Farlamp documentation. This work is published from Japan.