Pinned Repositories
benchbench
BenchBench is a Python package to evaluate multi-task benchmarks.
causal-features
Code to reproduce the paper "Do causal predictors generalize better to new domains?"
error-parity
Achieve error-rate fairness between societal groups for any score-based classifier.
folktables
Datasets derived from US census data
folktexts
Get classification risk scores on tabular tasks using LLMs
lawma
Lawma: A lightly fine-tuned Llama model for legal classification tasks.
surveying-language-models
Code to reproduce the paper "Questioning the Survey Responses of Large Language Models"
training-on-the-test-task
Code to reproduce the experiments in the paper "Training on the Test Task Confounds Evaluation and Emergence".
tttlm
Test-time training on nearest neighbors for large language models
whynot
A Python sandbox for decision making in dynamics
Social Foundations of Computation's Repositories
socialfoundations/whynot
A Python sandbox for decision making in dynamics
socialfoundations/folktables
Datasets derived from US census data
socialfoundations/tttlm
Test-time training on nearest neighbors for large language models
socialfoundations/error-parity
Achieve error-rate fairness between societal groups for any score-based classifier.
socialfoundations/folktexts
Get classification risk scores on tabular tasks using LLMs
socialfoundations/lawma
Lawma: A lightly fine-tuned Llama model for legal classification tasks.
socialfoundations/benchbench
BenchBench is a Python package to evaluate multi-task benchmarks.
socialfoundations/surveying-language-models
Code to reproduce the paper "Questioning the Survey Responses of Large Language Models"
socialfoundations/training-on-the-test-task
Code to reproduce the experiments in the paper "Training on the Test Task Confounds Evaluation and Emergence".
socialfoundations/causal-features
Code to reproduce the paper "Do causal predictors generalize better to new domains?"
socialfoundations/backward_baselines
Code for "Is your model predicting the past?"
socialfoundations/lm-evaluation-harness
A framework for few-shot evaluation of language models.
socialfoundations/twitter-predictability