
Evaluating Neural Language Models for Linguistic Knowledge

Primary language: Python. License: Apache License 2.0.



  This is actively being developed, and I anticipate major restructuring
  in the next year. Experiments and methods are being added within the
  context of my courses. If you have suggestions, please either
  post an Issue or write to me at forrestd@mit.edu.

Shared containers for experiments that I use in my work. The aim is to have a unified API for running models on typical experiments in neural model interpretability. For LSTMs/RNNs, this is done using a version of neural-complexity, developed by my advisor Marten van Schijndel. Transformers are accessed via HuggingFace's API. Note that this code is for inference only, so it assumes pretrained models. Another useful library for this kind of work is minicons, developed by Kanishka Misra. That package is more fully developed, so it is likely a better option if you want more assurances at this time.


I will assume you have conda and can use it to create a virtual environment. I've included a yml file to facilitate this.

conda env create -n mapi --file ModelsAPI.yml

Running the above will create an environment named mapi, which should work for this code. If you run into errors on a Mac with M1, see this blog.

Quick run

To run the code, simply enter:

python main.py

This will use the default config file, run_config.yml (elaborated on below). You can pass in a different config file as follows:

python main.py new_config.yml

Config files

An experiment is run by specifying a config file. An example is copied below:

exp: TSE

models:
    bert:
        - bert-base-uncased
    gpt2:
        - gpt2

return_type: prob

stimuli:
    - stimuli/tiny_IC_mismatch_BERT.tsv
    - stimuli/tiny_IC_mismatch.tsv

include_punct: False

lower: True

I describe each parameter below.


The exp parameter has four options: TSE, Incremental, Cumulative, and Interactive. TSE does targeted syntactic evaluations: you provide some context and a target, and the model's measure for that target is computed conditioned on the context. Incremental calculates the by-word measures. Cumulative returns the log probability of a whole string (optionally conditioned on a context). Interactive allows you to test out sentences on the command line and see the incremental surprisal and probability values for each word.
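To make the relation between the Incremental and Cumulative options concrete, here is a minimal sketch that sums by-word log probabilities into one string-level log probability. The by-word probabilities are made up for illustration, and both the function name and the use of base-2 logs are my assumptions here, not necessarily what the code does internally.

```python
import math


def cumulative_logprob(word_probs):
    """Sum by-word log2 probabilities into a string-level log probability.

    word_probs: conditional probabilities P(word | preceding words),
    one per word (hypothetical values, not real model output).
    """
    return sum(math.log2(p) for p in word_probs)


# Hypothetical by-word probabilities for a three-word string.
word_probs = [0.5, 0.25, 0.125]

total = cumulative_logprob(word_probs)
print(total)  # -6.0, i.e. log2(0.5) + log2(0.25) + log2(0.125)
```

By the chain rule, the sum of the by-word log probabilities is the log probability of the whole string, which is why the two options report related quantities.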


Models require a model type (bert|roberta|gpt2|lstm|tfxl|gptneo|gptj), which tells the API which model architecture the model is drawn from. The name of the model (or of several models of the same type) is then provided under the model type using -. I've implemented this lazily, so you can run a few models at once, but very large ones will probably cause problems because all models are loaded into memory. The above sample config file will run the pretrained bert-base-uncased model and the smallest gpt2 model provided by HuggingFace. This is a general property of all transformer models in this pipeline: passing in a name triggers a lookup on HuggingFace, and that model is loaded. You can also specify a path to a local copy of a model (e.g., /data/gpt2).


The return_type parameter is either prob (for probability) or surp (for surprisal) and determines the measure returned from a model.
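For reference, the two measures are related by a log transform. The sketch below assumes surprisal is measured in bits (base-2 logs), which is a common convention but an assumption on my part here; the function names are hypothetical.

```python
import math


def prob_to_surp(prob):
    """Surprisal in bits: the negative base-2 log of the probability."""
    return -math.log2(prob)


def surp_to_prob(surp):
    """Invert the transform to recover the probability from surprisal."""
    return 2.0 ** (-surp)


surp = prob_to_surp(0.25)
print(surp)                # 2.0
print(surp_to_prob(surp))  # 0.25
```

Lower probability means higher surprisal, so comparisons flip direction depending on which return type you choose.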


Name of the stimuli file to use for the experiment. Stimuli files are associated with models sequentially, one file per model, so the above will try to run bert-base-uncased on stimuli/tiny_IC_mismatch_BERT.tsv and gpt2 on stimuli/tiny_IC_mismatch.tsv. Example files are provided for Incremental and TSE experiments. TSE needs at least two columns: one called context, which gives the context sentence (it can be bidirectional, more on this in a moment), and another called target, which gives the target. The target should be one word (i.e., not split by the model's tokenizer). Some variables are also available as targets; I turn to these at the bottom. Incremental experiments expect one column called sent, which gives the sentence.
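As a rough sketch of the pairing and file layout described above (not the code's actual loader), the snippet below pairs the models and stimuli files from the sample config in order and reads a tiny in-memory TSE file. The stimulus row is a made-up example.

```python
import csv
import io

# Model names and stimuli files from the sample config, in order.
models = ["bert-base-uncased", "gpt2"]
stimuli = [
    "stimuli/tiny_IC_mismatch_BERT.tsv",
    "stimuli/tiny_IC_mismatch.tsv",
]

# The i-th model is paired with the i-th stimuli file.
pairs = list(zip(models, stimuli))
for model_name, stim_file in pairs:
    print(model_name, "->", stim_file)

# A hypothetical TSE stimuli file with the two required columns.
tsv_text = "context\ttarget\nThe keys to the cabinet\tare\n"
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[0]["context"], "|", rows[0]["target"])
```

An Incremental file would instead have a single sent column holding the full sentence.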

For bidirectional models, you can pass in a full context and target a medial word. Use MASKTOKEN as the placeholder, and the model-specific mask token will be inserted at that location.
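The substitution can be pictured with the sketch below. The lookup table and function name are hypothetical; "[MASK]" and "<mask>" are the mask tokens of the standard BERT and RoBERTa tokenizers, and with HuggingFace you would normally read this from the tokenizer's mask_token attribute rather than hard-coding it.

```python
# Hypothetical lookup of mask tokens by model type.
MASK_TOKENS = {"bert": "[MASK]", "roberta": "<mask>"}


def fill_mask_placeholder(context, model_type):
    """Swap the MASKTOKEN placeholder for the model-specific mask token."""
    return context.replace("MASKTOKEN", MASK_TOKENS[model_type])


context = "The keys to the cabinet MASKTOKEN on the table."
print(fill_mask_placeholder(context, "bert"))
# The keys to the cabinet [MASK] on the table.
print(fill_mask_placeholder(context, "roberta"))
# The keys to the cabinet <mask> on the table.
```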


The include_punct parameter sets whether to include punctuation in the surprisal/probability calculation. This is only used by Incremental and Interactive. Notice that in these use cases, a given tokenizer might split a word into subwords. Right now I flag this and treat the probability of the whole word as the joint probability of its subparts. So if the word 'human' is mapped to 'hu' + 'man', then the probability of 'human' will be the probability of 'hu' times the probability of 'man' (each conditioned on its respective context). This is trickier for bidirectional models, so we can discuss this if you want.
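The subword treatment amounts to a product of conditional probabilities, or equivalently a sum of surprisals. Here is a minimal sketch with made-up subword probabilities; the function name is hypothetical, and base-2 surprisal is an assumption.

```python
import math


def word_probability(subword_probs):
    """Joint probability of a word as the product of its subword probabilities."""
    return math.prod(subword_probs)


# Hypothetical values: P('hu' | context) and P('man' | context + 'hu').
subword_probs = [0.2, 0.5]

word_prob = word_probability(subword_probs)
word_surp = -math.log2(word_prob)

# Summing the subword surprisals gives the same word-level surprisal.
summed_surp = sum(-math.log2(p) for p in subword_probs)
print(word_prob, word_surp, summed_surp)
```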


The lower parameter sets whether to lowercase the first word in the sentence. We might want to lowercase more, but I didn't code that, for a reason I've since forgotten.


I've added two variables which may be useful. These can be inserted as targets and a larger set of items will be checked. The wildcards are:

$SG maps to English third person singular verbs

$PL maps to English third person plural verbs

The values for all singular/plural verbs will be summed and one value returned. Thus, this only makes sense if the return type is prob, but I don't check this.
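A minimal sketch of how a wildcard target could be scored, assuming a next-word distribution is already available as a dictionary. The tiny verb lists, the probabilities, and the function name are all hypothetical and only illustrate the summing.

```python
# Hypothetical, tiny verb lists; the real lists cover many more verbs.
WILDCARDS = {
    "$SG": ["is", "runs"],
    "$PL": ["are", "run"],
}


def target_probability(target, next_word_probs):
    """Sum probability over every word a target stands for.

    A plain target stands for itself; $SG and $PL expand to verb lists.
    """
    words = WILDCARDS.get(target, [target])
    return sum(next_word_probs.get(word, 0.0) for word in words)


# Made-up next-word probabilities after some context.
next_word_probs = {"is": 0.25, "runs": 0.125, "are": 0.5, "run": 0.0625}

print(target_probability("$SG", next_word_probs))  # 0.375
print(target_probability("$PL", next_word_probs))  # 0.5625
print(target_probability("are", next_word_probs))  # 0.5
```

Summing is meaningful for probabilities because the verbs are mutually exclusive continuations; summing surprisals would not give the surprisal of the set, which is why prob is the sensible return type here.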


The results of an experiment are compiled in tsv files under the results directory. The resulting file name is hard-coded as the following:


The organization of the file should be straightforwardly interpretable by looking at the output.


This code can be run on Colab, where you can access (free) GPUs. I've included a small document in the colab folder which outlines how to link GitHub and Google Drive. Once that's in place, the included scripts/colab.ipynb notebook can be run.