/alignment-attribution-code

Official code for the paper "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications"

Primary language: Python
License: MIT
