DirectProbe

Codebase for probing representation spaces without training classifiers.

See the blog post on DirectProbe for a brief introduction.

Installing

Clone the repository

    git clone https://github.com/utahnlp/DirectProbe/
    cd DirectProbe

[Optional] Create a virtual environment for this project. Only Python 3 is supported.
```
    pyenv virtualenv 3.8.0 DirectProbe
    pyenv local DirectProbe
```
More details about creating a Python virtual environment with pyenv can be found here.

Install the required packages. gurobipy is installed separately because it comes from a private PyPI server. Note that gurobipy is only a Python interface for Gurobi. You still need to install Gurobi itself by following the instructions in the Gurobi Installation Guide and obtain a license from Gurobi. If you cannot install Gurobi or obtain a license, the probe detects this automatically and falls back to a linear SVM from scikit-learn. However, using a linear SVM instead of Gurobi makes the results unstable: runs may end up with different clusters and be hard to reproduce.
```
    pip install -r requirements.txt
    pip install -i https://pypi.gurobi.com gurobipy
```
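Before running the probe, it can help to know which path the code is likely to take. The sketch below only tests whether the gurobipy package is importable in the current environment. It does not check for a Gurobi license, and it is not DirectProbe's own detection logic; the function name `gurobi_interface_available` is made up for this example.

```python
import importlib.util


def gurobi_interface_available() -> bool:
    """Return True if the gurobipy package can be found in this environment.

    This only looks for the Python interface. It does not verify that
    Gurobi itself is installed correctly or that a license is present.
    """
    return importlib.util.find_spec("gurobipy") is not None


if __name__ == "__main__":
    if gurobi_interface_available():
        print("gurobipy found")
    else:
        print("gurobipy not found; expect the linear SVM fallback")
```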

Getting Started

Download datasets and run examples

Download the pre-packaged data from here and unzip it. Inside each dataset, there are three directories:
- 'embeddings': contains all the embeddings from different representation models.
  - 'embeddings/layers': contains the embeddings from each layer of BERT-base-cased model.
- 'entities': contains the (example, label) pairs for the training and test sets. Each line is one example, with the example and its label separated by a tab.
- 'labels': contains the set of possible labels for each task.
Assuming all the pre-packaged data is placed in the directory data, you can run an experiment with the configuration from config.ini:
```
    python main.py
```
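To inspect a dataset before probing, the tab-separated files in the 'entities' directory can be read with a few lines of Python. This is a minimal sketch based only on the format described above (one example per line, example and label separated by a tab). The helper names `read_entities` and `label_counts` are hypothetical, and the actual file names inside 'entities' are not assumed here.

```python
from collections import Counter
from typing import List, Tuple


def read_entities(path: str) -> List[Tuple[str, str]]:
    """Read (example, label) pairs from a tab-separated entities file."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if "\t" not in line:
                raise ValueError(f"line {line_number} has no tab separator")
            # Split on the last tab so that the label is always the final field.
            example, label = line.rsplit("\t", 1)
            pairs.append((example, label))
    return pairs


def label_counts(pairs: List[Tuple[str, str]]) -> Counter:
    """Count how many examples carry each label."""
    return Counter(label for _, label in pairs)
```

Calling `label_counts(read_entities(path))` on a training file gives a quick view of how balanced the labels are.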

Results

After probing, you will find the results in the directory results/SS/. (We use the supersense role task as the example.) This directory contains four files:

'clusters.txt': The clustering results. Each line contains the cluster number for the corresponding training example.
'dis.txt': The distances between clusters. Each line represents a pair of clusters. The format is:
```
 (i-A,j-B): d
```
where i and j are the cluster numbers, A and B are their corresponding labels, and d is the distance between the two clusters.
'log.txt': The probing log file.
'prediction.txt': The prediction results using the clusters. Each line corresponds to one example in the test set, in the following format:
```
  gold_label\t i-A,d_i\t j-B,d_j ...
```
where gold_label is the gold label for the test example, i and j are cluster numbers, 'A' and 'B' are the labels of the corresponding clusters, and 'd_i' is the distance between the test point and cluster 'i'. The clusters are sorted by increasing distance.
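The two line formats above are simple enough to parse by hand. The following sketch reads 'dis.txt' and 'prediction.txt' lines and computes accuracy by assigning each test point the label of its nearest cluster. It is written from the format description only, not from the repository code. All function and type names are hypothetical, and it assumes that cluster numbers are integers and that labels contain no tab characters.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple

# Matches a dis.txt line of the form "(i-A,j-B): d".
_DIS_LINE = re.compile(r"^\s*\((\d+)-(.+?),(\d+)-(.+)\):\s*(\S+)\s*$")


class ClusterDistance(NamedTuple):
    cluster_i: int
    label_i: str
    cluster_j: int
    label_j: str
    distance: float


def parse_distance_line(line: str) -> ClusterDistance:
    """Parse one line of dis.txt."""
    match = _DIS_LINE.match(line)
    if match is None:
        raise ValueError(f"unexpected dis.txt line: {line!r}")
    i, label_i, j, label_j, dist = match.groups()
    return ClusterDistance(int(i), label_i, int(j), label_j, float(dist))


# One candidate cluster for a test point: (cluster number, label, distance).
Candidate = Tuple[int, str, float]


def parse_prediction_line(line: str) -> Tuple[str, List[Candidate]]:
    """Parse one line of prediction.txt into (gold_label, candidates)."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    gold_label = fields[0]
    candidates = []
    for field in fields[1:]:
        if not field:
            continue
        # The distance follows the last comma; the cluster number
        # precedes the first hyphen.
        head, dist = field.rsplit(",", 1)
        cluster_id, label = head.split("-", 1)
        candidates.append((int(cluster_id), label, float(dist)))
    return gold_label, candidates


def nearest_label(candidates: List[Candidate]) -> str:
    """Return the label of the candidate cluster with the smallest distance."""
    return min(candidates, key=lambda c: c[2])[1]


def accuracy(lines: Iterable[str]) -> float:
    """Fraction of test points whose nearest-cluster label equals the gold label."""
    correct = total = 0
    for line in lines:
        if not line.strip():
            continue
        gold_label, candidates = parse_prediction_line(line)
        if not candidates:
            continue
        total += 1
        correct += nearest_label(candidates) == gold_label
    return correct / total if total else 0.0
```

With the results directory in place, `accuracy(open("results/SS/prediction.txt", encoding="utf-8"))` would give the nearest-cluster accuracy. `nearest_label` takes the minimum explicitly, so it does not depend on the order of the clusters within a line.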

Configuration

DirectProbe probes the representations according to the configuration in the config.ini file. Please see config.ini for more details.
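To see which options a given config.ini sets without opening it by hand, the file can be loaded with Python's standard configparser module. This sketch assumes config.ini is a standard INI file. It makes no assumption about the section or option names DirectProbe uses, and the helper name `load_config` is hypothetical.

```python
import configparser
from typing import Dict


def load_config(path: str = "config.ini") -> Dict[str, Dict[str, str]]:
    """Load an INI file into a nested dict of {section: {option: value}}."""
    # interpolation=None returns values exactly as written in the file.
    parser = configparser.ConfigParser(interpolation=None)
    files_read = parser.read(path, encoding="utf-8")
    if not files_read:
        raise FileNotFoundError(path)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for section, options in load_config().items():
        print(f"[{section}]")
        for name, value in options.items():
            print(f"  {name} = {value}")
```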

Citations

@inproceedings{zhou-2021-directprobe,
    title = "DirectProbe: Studying Representations without Classifiers",
    author = "Zhou, Yichu and Srikumar, Vivek",
    booktitle = "Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics",
    month = june,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}