Pinned Repositories
safetywashing
Measuring correlations between safety benchmarks and general AI capabilities benchmarks.
llama-lying
Code for our paper "Localizing Lying in Llama"
iti_capstone
Analyzing truth representations in LLMs across different kinds of truth, and intervening on their hidden states to make the models more truthful
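The hidden-state intervention described above can be sketched in a few lines: shift an activation along a "truthful" direction, as in Inference-Time Intervention. This is a minimal illustration, not the repo's implementation; the direction would normally come from a linear probe trained on true/false statements, and the function name, `alpha` scale, and toy vectors here are all assumptions.

```python
import numpy as np

def intervene(hidden_state: np.ndarray,
              truth_direction: np.ndarray,
              alpha: float = 5.0) -> np.ndarray:
    """Shift a hidden state along a normalized 'truthful' direction.

    In practice truth_direction would be learned, e.g. as the weight
    vector of a probe separating true from false statements.
    """
    unit = truth_direction / np.linalg.norm(truth_direction)
    return hidden_state + alpha * unit

# Toy example: a 4-dim hidden state nudged along a hypothetical direction.
h = np.array([1.0, 0.0, 0.0, 0.0])
d = np.array([0.0, 3.0, 0.0, 0.0])   # unnormalized probe direction
h_new = intervene(h, d, alpha=2.0)
```

In the actual setting this shift is applied at generation time, per layer and per attention head, via forward hooks on the model.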
ADecodingTrustYouCanTrust
A from-scratch codebase for DecodingTrust evaluations that actually works.
AI-job-exposure
Using NLP to construct an automation exposure metric using semantic overlap between patent text and occupational task descriptions.
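The overlap metric described above can be illustrated with a crude lexical proxy: cosine similarity between bag-of-words vectors for a patent text and an occupational task description. This is a sketch only; the repo's actual metric is presumably embedding-based, and the function name and example strings here are invented for illustration.

```python
from collections import Counter
import math

def cosine_overlap(text_a: str, text_b: str) -> float:
    """Cosine similarity of bag-of-words vectors: a crude lexical
    stand-in for semantic overlap (a real pipeline would likely use
    sentence embeddings instead)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

patent = "automated system for sorting packages"
task = "sorting packages by destination"
score = cosine_overlap(patent, task)   # higher = task more exposed to this patent
```

Aggregating such scores over all patents matched to an occupation's tasks yields an exposure score per occupation.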
arena-curriculum
Exercises on mechanistic interpretability, RL, and training models at scale
arena_curriculum_trlxRLHF
Completed ARENA2.0 RLHF exercises.
code-repository
How to run Llama-70B inference with HuggingFace Transformers, parallelized across multiple GPUs
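Multi-GPU HuggingFace inference of the kind described above usually comes down to the right `from_pretrained` kwargs, letting Accelerate shard the checkpoint across visible GPUs. A minimal sketch, assuming the standard Transformers/Accelerate API; the helper name and the commented model ID are illustrative, not taken from the repo, and the actual load is left as a comment since it pulls ~140 GB of weights.

```python
def sharding_kwargs(dtype: str = "float16") -> dict:
    """kwargs for from_pretrained() that shard a large checkpoint
    across every visible GPU via Accelerate."""
    return {
        "device_map": "auto",       # let Accelerate place layers across GPUs
        "torch_dtype": dtype,       # half precision halves memory vs float32
        "low_cpu_mem_usage": True,  # stream weights instead of materializing all at once
    }

# Usage (requires GPUs and the model weights; model ID is illustrative):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   name = "meta-llama/Llama-2-70b-hf"
#   tok = AutoTokenizer.from_pretrained(name)
#   model = AutoModelForCausalLM.from_pretrained(name, **sharding_kwargs())
```

With `device_map="auto"`, layers that land on different GPUs are handled transparently during `generate`; inputs just need `.to(model.device)`.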
representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
wolfram-toolformer-tests
An exploratory project to test out GPT's math ability when fine-tuned and augmented with the Wolfram Alpha API.
notrichardren's Repositories
notrichardren/code-repository
How to run Llama-70B inference with HuggingFace Transformers, parallelized across multiple GPUs
notrichardren/wolfram-toolformer-tests
An exploratory project to test out GPT's math ability when fine-tuned and augmented with the Wolfram Alpha API.
notrichardren/AI-job-exposure
Using NLP to construct an automation exposure metric using semantic overlap between patent text and occupational task descriptions.
notrichardren/ADecodingTrustYouCanTrust
A from-scratch codebase for DecodingTrust evaluations that actually works.
notrichardren/arena-curriculum
Exercises on mechanistic interpretability, RL, and training models at scale
notrichardren/arena_curriculum_trlxRLHF
Completed ARENA2.0 RLHF exercises.
notrichardren/cis522-course-fork-ec-1
Let's grind those extra-credit points
notrichardren/CIS522-homework
notrichardren/discovering_latent_knowledge
notrichardren/ENM5310
notrichardren/fastbook
The fastai book, published as Jupyter Notebooks
notrichardren/representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
notrichardren/cluster-docs
Center for AI Safety Cluster Documentation
notrichardren/DecodingTrust
Trying to get DecodingTrust evaluations to work
notrichardren/evaluation-robust-control
A framework for few-shot evaluation of language models.
notrichardren/harmbench_static
notrichardren/iti
Fork of Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
notrichardren/notrichardren
Config files for Github profile.
notrichardren/notrichardren.github.io
notrichardren/PurpleLlama
Set of tools to assess and improve LLM security.
notrichardren/segment-edit
Image editing with segmentation.
notrichardren/STEER-evaluation
notrichardren/Testing-AidanBench
Aidan Bench attempts to measure <big_model_smell> in LLMs.
notrichardren/toolformer-data-cleaning
LLM that can (generate code to) clean your data for you
notrichardren/truthfulness_high_quality
load_from_disk("truthfulness_high_quality")