Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts.
Primary LanguagePython