Felhof

Pinned Repositories

TransformerLens
Language:Python 10
Deep-Reinforcement-Learning-Algorithms
Language:Python 20
steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
Language:Python 535
connectome
Language:Jupyter Notebook 31
ARENA_2.0
Language:HTML 10
DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
Language:Jupyter Notebook 10
DiscreteSAC
Language:Python 394
DRL-from-Human-Preferences
A reproduction of the paper "Deep Reinforcement Learning from Human Preferences"
Language:Python 20
MLAB-Transformers-From-Scratch
Reimplementing transformers from scratch (from Redwood Research's Machine Learning for Alignment Bootcamp).
Language:Python 10
sandbagging
Language:Jupyter Notebook 11

Felhof's Repositories

Felhof/sandbagging
Language:Jupyter Notebook 11
Felhof/sandbagging-elicitation
Language:Jupyter Notebook
Felhof/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
Felhof/From-Sycophancy-To-Sandbagging
Felhof/TransformerLens
Language:Python 1
Felhof/PersonaInvestigations
Felhof/Activation-Engineering-Investigations
Language:Jupyter Notebook
Felhof/Comparing-Measures-of-LLM-Truthfulness
Language:Jupyter Notebook
Felhof/LLM-Classification-Faithfulness
Language:Jupyter Notebook
Felhof/Exhibiting-Deception-in-LLMs
Language:Jupyter Notebook
Felhof/connectome
Language:Jupyter Notebook 31
Felhof/ARENA_2.0
Language:HTML 1
Felhof/swap-graphs
An implementation of input swap graphs. A tool to discover the role of neural network components with causal interventions.
1
Felhof/DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
Language:Jupyter Notebook 1
Felhof/Deep-Reinforcement-Learning-Algorithms
Language:Python 2
Felhof/pytest
The pytest framework makes it easy to write small tests, yet scales to support complex functional testing
Felhof/MLAB-Transformers-From-Scratch
Reimplementing transformers from scratch (from Redwood Research's Machine Learning for Alignment Bootcamp).
1
Felhof/DRL-from-Human-Preferences
A reproduction of the paper "Deep Reinforcement Learning from Human Preferences"
Language:Python 2
Felhof/DiscreteSAC
Language:Python 394
Felhof/MEng_Project
Felhof/Wacc-Compiler
Compiler for the WACC language specified in Imperial College 2nd Year Compilers course
Language:Java
Felhof/Kaggle_Houseprices
Language:Python
Felhof/kaggle_titanic
Language:Python
Felhof/Pintos
Implementation of scheduler and user programs for the Pintos Operating System - Imperial College 2nd Year Lab
Language:HTML