SPY Lab's Repositories
ethz-spylab/rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
ethz-spylab/agentdojo
A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.
ethz-spylab/rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
ethz-spylab/diffusion_denoised_smoothing
Certified robustness "for free" using off-the-shelf diffusion models and classifiers
ethz-spylab/robust-style-mimicry
ethz-spylab/superhuman-ai-consistency
ethz-spylab/satml-llm-ctf
Code used to run the platform for the LLM CTF colocated with SaTML 2024
ethz-spylab/realistic-adv-examples
Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024]
ethz-spylab/unlearning-vs-safety
ethz-spylab/misleading-privacy-evals
Official code for "Evaluations of Machine Learning Privacy Defenses are Misleading" (https://arxiv.org/abs/2404.17399)
ethz-spylab/lm_memorization_data
Data for "Quantifying Memorization Across Neural Language Models"
ethz-spylab/lm-extraction-benchmark-data
Datasets for the SaTML 2023 competition on training data extraction
ethz-spylab/non-adversarial-reproduction
Official code for "Measuring Non-Adversarial Reproduction of Training Data in Large Language Models" (https://arxiv.org/abs/2411.10242)
ethz-spylab/infoseclab_23
ethz-spylab/vmi-retreat-workshop-2024
Repository for the VMI Summer Retreat Workshop on Hacking AI Agents
ethz-spylab/data-decay
Exploratory experiments with the CC3M dataset
ethz-spylab/llm_lab
ethz-spylab/privacy
Library for training machine learning models with privacy guarantees for their training data
ethz-spylab/Blind-MIA
Official code for "Blind Baselines Beat Membership Inference Attacks for Foundation Models"
ethz-spylab/ctf-satml24-data-analysis