Repo for the EMNLP 2023 Findings paper "Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution".
Our pipeline consists of a mix of Java and Python code: grammar induction is done with the original code of Petrov et al. (2006) in Java, and language model training is done in Python using transformers.
Grammar induction
java -Xmx32g -cp CustomBerkeley.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path $path -out $save_dir/stage -treebank SINGLEFILE -mergingPercentage 0.5 -filter 1.0e-8 -SMcycles 5
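The $path and $save_dir variables are not set by the command itself. A minimal setup, with hypothetical file names that should be adapted to your own layout, could look like:

# hypothetical paths
path=corpora/treebank_singlefile.txt   # treebank in the single-file format expected by -treebank SINGLEFILE
save_dir=grammars/run1                 # directory receiving the grammar files written with the -out $save_dir/stage prefix
mkdir -p $save_dir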
Masked token PCFG probabilities
$grammarfile should point to the grammar archive file that can be found in the Google Drive resources (500k).
java -cp CustomBerkeley.jar edu.berkeley.nlp.PCFGLA.BerkeleyParser -gr $grammarfile -inputFile $inputfile
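Both variables are placeholders. Assuming the grammar archive has been downloaded from the Google Drive resources, a hypothetical setup might be:

grammarfile=grammars/500k.gr     # downloaded grammar archive (file name is hypothetical)
inputfile=data/masked_eval.txt   # sentences with masked tokens to score (hypothetical)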
EarleyX causal PCFG probabilities
java -Xms32768M -classpath "earleyx_fast.jar:lib/*" parser.Main -in data/eval_subset_100.txt -grammar grammars/earleyx.grammar -out results -verbose 1 -thread 1
Language model training
python3 main_multi.py \
--model.model_type microsoft/deberta-base \
--model.is_mlm \
--tokenizer.path tokenizers/added_tokens.json \
--data.data_dir corpora \
--data.train_file train.txt \
--trainer.output_dir $save_dir
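The command above trains a masked language model (--model.is_mlm) on top of the microsoft/deberta-base architecture. Assuming the same script also handles causal language models (whose reference PCFG probabilities come from the EarleyX step above), a causal run would presumably drop the --model.is_mlm flag and swap the model type; the sketch below uses gpt2 as a hypothetical choice and is not a confirmed configuration.

python3 main_multi.py \
--model.model_type gpt2 \
--tokenizer.path tokenizers/added_tokens.json \
--data.data_dir corpora \
--data.train_file train.txt \
--trainer.output_dir $save_dir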