elk

Keeping language models honest by directly eliciting knowledge encoded in their activations. Builds on "Discovering Latent Knowledge in Language Models Without Supervision" (Burns et al., 2022).

Primary language: Python · License: MIT
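
The approach in Burns et al. (2022) is Contrast-Consistent Search (CCS): a small probe is trained on hidden states from paired "true"/"false" prompt variants so that its probabilities are consistent under negation, without using any labels. Below is a minimal sketch of that idea, assuming `pos_acts` and `neg_acts` are activation tensors of shape `(n_examples, hidden_dim)`; the names and hyperparameters are illustrative and not this repository's actual API.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of 'true'."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should have probabilities summing to 1.
    consistency = (p_pos + p_neg - 1.0).pow(2).mean()
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg).pow(2).mean()
    return consistency + confidence

def train_ccs(pos_acts: torch.Tensor, neg_acts: torch.Tensor, epochs: int = 1000) -> CCSProbe:
    # Normalize each class of activations independently before probing.
    pos = (pos_acts - pos_acts.mean(0)) / (pos_acts.std(0) + 1e-8)
    neg = (neg_acts - neg_acts.mean(0)) / (neg_acts.std(0) + 1e-8)
    probe = CCSProbe(pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(pos), probe(neg))
        loss.backward()
        opt.step()
    return probe
```

At inference, averaging `probe(pos)` and `1 - probe(neg)` for a contrast pair gives an unsupervised estimate of whether the underlying statement is true.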
