ML-roots-you: A repository from mtrazzi

Like Root Me, but the environment exploit is Machine Learning based

Why

Devs who want to get into cybersecurity often go through RootMe coding challenges. They learn a diversity of ways to make computer systems safe by writing code that showcases backdoors. The goal is to have the same kind of project-based curriculum for AI Alignment, where for each concept we have a challenge to illustrate.

What

Challenges for Reinforcement Learning (RL) exploits, and the corresponding environments. Environments could have multiple sub-challenges with different rules. For instance:

Full RL problem where the agent needs to exploit a flaw in the env. design.
An open problem to show what a safer solution could look like, with both a reward and a safety metric.

For each non-open sub-challenge there could be hints/relevant literature and a benchmark indicated in the description.

Example

The treacherous turn gym environment. Sub-challenges could include maximizing reward:

without restriction (Q-learning)
killing your supervisor once (DynaQ)
using screen as input (DQN)
killing supervisor once, no planning (meta-learning)

Check the challenges folder for a rough draft.

More ideas

The key-chest mesa misalignment env (code)). AI Safety gridworlds has plenty of envs with both reward and safety metrics. Sub-challenges could be first to make a pure reward maximizer (that exploits the environment) then an open problem where the goal is to maximize a combination of reward & safety performance. Viktoria’s list many similar examples.

TODO

Decide on a few environments where it’s interesting to show a failure.
Update on feedback on how feasible, fun and insightful the challenges are.
Launch website with leaderboards + publish on Twitter/LW/github.
Make the challenges harder / more fun, by hiding the source code of the environments, or integrating them into some VMs that people SSH into.
Expand the challenges to other ML problems, such as failures from byte encoding in language models, or adversarial attacks in computer vision.

mtrazzi/ML-roots-you

Why

What

Example

More ideas

TODO