/provis

Official code repository of "BERTology Meets Biology: Interpreting Attention in Protein Language Models."

Primary LanguagePythonOtherNOASSERTION

BERTology Meets Biology: Interpreting Attention in Protein Language Models

This repository is the official implementation of BERTology Meets Biology: Interpreting Attention in Protein Language Models.

Table of Contents

ProVis Attention Visualizer

This section provides instructions for generating visualizations of attention projected onto 3D protein structure.

Image Image

Installation

General requirements:

  • Python >= 3.7
pip install biopython==1.77
pip install tape-proteins==0.5
pip install jupyterlab==3.0.14
pip install nglview
jupyter-nbextension enable nglview --py --sys-prefix

If you run into problems installing nglview, please refer to their installation instructions for additional installation details and options.

Execution

cd <project_root>/notebooks
jupyter notebook provis.ipynb

If you get an error running the notebook, you may need to execute the notebook as follows:

jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000

See nglview installation instructions for more details.

You may edit the notebook to choose other proteins, attention heads, etc. The visualization tool is based on the excellent nglview library.


Experiments

This section describes how to reproduce the experiments in the paper.

Installation

cd <project_root>
python setup.py develop

To download additional required datasets from TAPE:

cd <project_root>/data
wget http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/secondary_structure.tar.gz
tar -xvf secondary_structure.tar.gz && rm secondary_structure.tar.gz
wget http://s3.amazonaws.com/songlabdata/proteindata/data_pytorch/proteinnet.tar.gz
tar -xvf proteinnet.tar.gz && rm proteinnet.tar.gz

Attention Analysis

The following steps will reproduce the attention analysis experiments and generate the reports currently found in <project_root>/reports/attention_analysis. This includes all experiments besides the probing experiments (see Probing Analysis).

Before performing steps, navigate to appropriate directory:

cd <project_root>/protein_attention/attention_analysis

Tape BERT Model

The following executes the attention analysis (may run for several hours):

sh scripts/compute_all_features_tape_bert.sh

The above script create a set of extract files in <project_root>/data/cache corresponding to various properties being analyzed. You may edit the script files to remove properties that you are not interested in. If you wish to run the analysis without a GPU, you must specify the --no_cuda flag.

The following generate reports based on the files created in previous step:

sh scripts/report_all_features_tape_bert.sh

If you removed steps from the analysis script above, you will need to update the reporting script accordingly.

ProtTrans Models

In order to generate reports for the ProtTrans models, follow the instructions as for the TapeBert model above, but substitute the following commands:

ProtBert:

sh scripts/compute_all_features_prot_bert.sh
sh scripts/report_all_features_prot_bert.sh

ProtBertBFD:

sh scripts/compute_all_features_prot_bert_bfd.sh
sh scripts/report_all_features_prot_bert_bfd.sh

ProtAlbert:

sh scripts/compute_all_features_prot_albert.sh
sh scripts/report_all_features_prot_albert.sh

ProtXLNet:

sh scripts/compute_all_features_prot_xlnet.sh
sh scripts/report_all_features_prot_xlnet.sh

Probing Analysis

The following steps will recreate the figures from the probing analysis, currently found in <project_root>/reports/probing

Navigate to directory:

cd <project_root>/protein_attention/probing

Training

Train diagnostic classifiers. Each script will write out an extract file with evaluation results. Note: each of these scripts may run for several hours.

sh scripts/probe_ss4_0_all
sh scripts/probe_ss4_1_all
sh scripts/probe_ss4_2_all
sh scripts/probe_sites.sh
sh scripts/probe_contacts.sh

Reports

python report.py

License

This project is licensed under BSD3 License - see the LICENSE file for details

Acknowledgments

This project incorporates code from the following repo:

Citation

When referencing this repository, please cite this paper.

@misc{vig2020bertology,
    title={BERTology Meets Biology: Interpreting Attention in Protein Language Models},
    author={Jesse Vig and Ali Madani and Lav R. Varshney and Caiming Xiong and Richard Socher and Nazneen Fatema Rajani},
    year={2020},
    eprint={2006.15222},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2006.15222}
}