Solution report: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1
Use mechanistic interpretability tools to reverse engineer an MNIST CNN and send me a program for the labeling function it was trained on.
Hint 1: The labels are binary.
Hint 2: The network gets 95.58% accuracy on the test set.
Hint 3: The labeling function can be described in words in one sentence.
Hint 4: This image may be helpful.
*This challenge was solved not by finding the labeling function, but by showing that finding it is very unlikely to be tractable. The report linked below quotes some of my thoughts on this.
Solution report: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2
Use mechanistic interpretability tools to reverse engineer a transformer and send me a program for the labeling function it was trained on.
Hint 1: The labels are binary.
Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half.
Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...
If you send me code for one of the two labeling functions along with a justified mechanistic interpretability explanation for it (e.g. in the form of a colab notebook), the prize is a $750 donation to a high-impact charity of your choice. So the total prize pool is $1,500 for both challenges. Thanks to Neel Nanda for contributing $500!
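To make "a program for the labeling function" concrete, here is a minimal sketch of how a candidate could be sanity-checked by measuring how often it agrees with a set of reference labels. Everything in it is an assumption for illustration: the `agreement` helper, the `toy_candidate` function, and the toy inputs and labels are hypothetical and have nothing to do with the actual labeling functions or datasets in either challenge.

```python
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")


def agreement(
    candidate: Callable[[X], int],
    inputs: Sequence[X],
    reference_labels: Sequence[int],
) -> float:
    """Return the fraction of inputs where candidate(x) equals the reference label."""
    if len(inputs) != len(reference_labels):
        raise ValueError("inputs and reference_labels must have the same length")
    if len(inputs) == 0:
        raise ValueError("need at least one example")
    matches = sum(int(candidate(x) == y) for x, y in zip(inputs, reference_labels))
    return matches / len(inputs)


# Hypothetical candidate: a binary label that can be described in one sentence
# ("label is 1 when the first number is larger than the second").
def toy_candidate(pair):
    a, b = pair
    return int(a > b)


# Made-up data, NOT taken from either challenge.
toy_inputs = [(3, 1), (0, 5), (2, 2), (7, 4)]
toy_labels = [1, 0, 1, 1]

score = agreement(toy_candidate, toy_inputs, toy_labels)
print(f"agreement with reference labels: {score:.2%}")
```

High agreement alone would not be enough here, since the challenge also asks for a justified mechanistic explanation of why the network computes that function.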