/prompt-hacking-classifier

A flexible and portable solution that uses a single robust prompt and customized hyperparameters to classify user messages as either malicious or safe, helping to prevent jailbreaking and manipulation of chatbots and other LLM-based solutions.

Primary LanguageJupyter NotebookMIT LicenseMIT

Watchers