tsaoni's Stars
LLM-Tuning-Safety/LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI’s APIs.
asahi417/lmppl
Calculate the perplexity of a text with pre-trained language models. Supports masked LMs (e.g., DeBERTa), causal LMs (e.g., GPT-3), and encoder-decoder LMs (e.g., Flan-T5).
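The quantity this library computes can be sketched directly from its definition: perplexity is the exponential of the average negative log-likelihood per token, PPL = exp(-(1/N) Σ log p_i). A minimal stdlib-only illustration (the function name `perplexity` and the example log-probabilities are illustrative, not the library's API):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p_i))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# A model that assigns uniform probability 1/4 to each of four tokens
# is exactly as "surprised" as a fair 4-way choice: perplexity ≈ 4.
print(perplexity([math.log(0.25)] * 4))
```

Lower perplexity means the model finds the text more predictable; the library wraps this computation over the three model families listed above.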
zhxieml/remiss-jailbreak
XuandongZhao/weak-to-strong
Weak-to-Strong Jailbreaking on Large Language Models
vinusankars/BEAST
Implementation of BEAST adversarial attack for language models (ICML 2024)
montemac/activation_additions
Algebraic value editing in pretrained language models
facebookresearch/advprompter
Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)
0xk1h0/ChatGPT_DAN
ChatGPT DAN and other jailbreak prompts
RICommunity/TAP
TAP: An automated jailbreaking method for black-box LLMs
huizhang-L/CodeChameleon
centerforaisafety/tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
chawins/pal
PAL: Proxy-Guided Black-Box Attack on Large Language Models
chawins/llm-sp
Papers and resources related to the security and privacy of LLMs 🤖
google-research/google-research
Google Research
Soulter/hugging-chat-api
HuggingChat Python API 🤗
keirp/automatic_prompt_engineer
FreedomIntelligence/ReasoningNLP
Paper list on reasoning in NLP
subeeshvasu/Awesome-Learning-with-Label-Noise
A curated list of resources for Learning with Noisy Labels
claws-lab/few-shot-adversarial-robustness
Code for ACL'23 Findings paper on 'Adversarial Robustness of Prompt-based Few-Shot Learning for Natural Language Understanding'
mingkaid/rl-prompt
Accompanying repo for the RLPrompt paper
allenai/RL4LMs
A modular RL library to fine-tune language models to human preferences
llm-attacks/llm-attacks
Universal and Transferable Attacks on Aligned Language Models
google-research/prompt-tuning
Original Implementation of Prompt Tuning from Lester, et al, 2021
facebookresearch/text-adversarial-attack
Repo for arXiv preprint "Gradient-based Adversarial Attacks against Text Transformers"
XiangLi1999/Diffusion-LM
Diffusion-LM
Phantivia/T-PGD
[Findings of ACL 2023] Bridge the Gap Between CV and NLP! An Optimization-based Textual Adversarial Attack Framework.
uber-research/PPLM
Plug and Play Language Model (PPLM) implementation. Allows steering the topic and attributes of GPT-2 generations.
TobeyYang/StyleDGPT
The code for "StyleDGPT: Stylized Response Generation with Pre-trained Language Models" (Findings of EMNLP 2020)
facebookresearch/segment-anything
The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.