Keeping language models honest by directly eliciting knowledge encoded in their activations, building on "Discovering Latent Knowledge in Language Models Without Supervision" (Burns et al., 2022).
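The Burns et al. paper introduces contrast-consistent search (CCS): a small probe is trained on a model's hidden activations for contrast pairs (the same statement framed as true and as false), using only an unsupervised consistency-plus-confidence objective. Below is a minimal PyTorch sketch of that objective, not this repository's actual implementation; the probe architecture, tensor shapes, and training loop are illustrative assumptions.

```python
# Minimal sketch of the CCS objective (Burns et al., 2022).
# All names and shapes here are illustrative, not this repo's API.
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of truth."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: the probabilities assigned to a statement and its
    # negation should sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate p = 0.5 everywhere solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Usage sketch: acts_pos / acts_neg stand in for hidden states of
# contrast pairs extracted from a language model.
hidden_dim = 768  # assumed hidden size
probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
acts_pos = torch.randn(64, hidden_dim)  # placeholder activations
acts_neg = torch.randn(64, hidden_dim)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos), probe(acts_neg))
    loss.backward()
    opt.step()
```

Because the objective never looks at labels, the probe's direction is found purely from the logical structure of the contrast pairs; in practice activations are typically normalized per-template before training.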
Primary language: Python · License: MIT