Pinned Repositories
TransformerLens
Deep-Reinforcement-Learning-Algorithms
steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
connectome
ARENA_2.0
DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
DiscreteSAC
DRL-from-Human-Preferences
A reproduction of the paper "Deep Reinforcement Learning from Human Preferences"
MLAB-Transformers-From-Scratch
Reimplementing transformers from scratch (from Redwood Research's Machine Learning for Alignment Bootcamp).
sandbagging
Felhof's Repositories
Felhof/sandbagging
Felhof/sandbagging-elicitation
Felhof/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
Felhof/From-Sycophancy-To-Sandbagging
Felhof/TransformerLens
Felhof/PersonaInvestigations
Felhof/Activation-Engineering-Investigations
Felhof/Comparing-Measures-of-LLM-Truthfulness
Felhof/LLM-Classification-Faithfulness
Felhof/Exhibiting-Deception-in-LLMs
Felhof/connectome
Felhof/ARENA_2.0
Felhof/swap-graphs
An implementation of input swap graphs. A tool to discover the role of neural network components with causal interventions.
Felhof/DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
Felhof/Deep-Reinforcement-Learning-Algorithms
Felhof/pytest
The pytest framework makes it easy to write small tests, yet scales to support complex functional testing
Felhof/MLAB-Transformers-From-Scratch
Reimplementing transformers from scratch (from Redwood Research's Machine Learning for Alignment Bootcamp).
Felhof/DRL-from-Human-Preferences
A reproduction of the paper "Deep Reinforcement Learning from Human Preferences"
Felhof/DiscreteSAC
Felhof/MEng_Project
Felhof/Wacc-Compiler
Compiler for the WACC language specified in the Imperial College 2nd Year Compilers course
Felhof/Kaggle_Houseprices
Felhof/kaggle_titanic
Felhof/Pintos
Implementation of scheduler and user programs for the Pintos Operating System - Imperial College 2nd Year Lab