Unsupervised Parsing via Constituency Tests

This repository contains the code for the paper:
Unsupervised Parsing via Constituency Tests
Steven Cao, Nikita Kitaev, Dan Klein
EMNLP 2020

Dependencies

This code was tested with Python 3.6, PyTorch 1.1, and pytorch-transformers 1.2.

conda create -n ConTest python=3.6
conda activate ConTest
pip install -r requirements.txt

Data

The Penn Treebank (PTB) and CoLA data are in the data folder. The folder also contains a few sentences from Gigaword to illustrate the expected format; the full Gigaword data must be downloaded from the LDC.

The PTB data is split into test (23.auto.clean), dev (22.auto.clean), and train (02-21.10way.clean). The ptb-test.txt file contains the same sentences as 23.auto.clean, but with punctuation and unary chains removed and the sentences in a different order. We use ptb-test.txt during evaluation to stay consistent with past work.

Running the code

To run the main experiment in the paper, see run_full.sh. To reduce memory usage, lower both --subbatch-size and --num-grad while keeping the ratio between them fixed (num-grad divided by subbatch-size should be 16).
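
As a small worked example of the ratio rule above, the following Python sketch prints flag pairs that satisfy num-grad / subbatch-size = 16. The helper names (num_grad_for, flags_for) and the candidate sub-batch sizes are hypothetical and are not part of this repository; only the two flag names and the ratio come from this README.

```python
# Sketch only: these helpers are illustrative and not part of the repository.
RATIO = 16  # num-grad divided by subbatch-size, as stated in this README


def num_grad_for(subbatch_size: int, ratio: int = RATIO) -> int:
    """Return the --num-grad value that keeps num-grad / subbatch-size == ratio."""
    if subbatch_size < 1:
        raise ValueError("subbatch_size must be a positive integer")
    return subbatch_size * ratio


def flags_for(subbatch_size: int) -> str:
    """Format the two flags for a given sub-batch size."""
    return f"--subbatch-size {subbatch_size} --num-grad {num_grad_for(subbatch_size)}"


if __name__ == "__main__":
    # Candidate sizes are arbitrary examples; pick whichever fits in memory.
    for size in (8, 4, 2, 1):
        print(flags_for(size))
```

The printed flags can then be substituted for the corresponding values in run_full.sh.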

Note regarding evaluation

The code provides two ways of computing parser F1: evalb, which is standard in supervised parsing evaluation, and a custom script used in past grammar induction work (see eval_for_comparison.py, taken from the Compound PCFG GitHub repo). The latter ignores punctuation (among other differences; see the paper for details) and typically yields higher F1 numbers.
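
To make the idea of punctuation-insensitive bracketing F1 concrete, here is a minimal Python sketch. It is not evalb and it is not eval_for_comparison.py; it is a simplified illustration under stated assumptions. The function names, the PUNCT_TAGS set, and the choice to keep the whole-sentence span are all assumptions made for this example, and the actual scripts differ in details such as which spans are counted and how scores are averaged.

```python
import re
from typing import Set, Tuple

Span = Tuple[int, int]

# Assumption: a small set of PTB punctuation tags, chosen for illustration.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}


def tree_spans(tree: str, punct_tags=PUNCT_TAGS) -> Set[Span]:
    """Collect unlabeled spans (start, end) over non-punctuation words.

    Expects a bracketed tree such as "(S (NP (DT the) (NN cat)) (. .))".
    Single-word spans are dropped, and because spans are stored in a set,
    a unary chain over the same words contributes only one span.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", tree)
    spans: Set[Span] = set()
    stack = []  # start position of each open constituent
    pos = 0     # number of non-punctuation words seen so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            label = tokens[i + 1]
            is_preterminal = (
                i + 3 < len(tokens)
                and tokens[i + 2] not in ("(", ")")
                and tokens[i + 3] == ")"
            )
            if is_preterminal:
                # "(TAG word)": count the word unless it is punctuation.
                if label not in punct_tags:
                    pos += 1
                i += 4
                continue
            stack.append(pos)
            i += 2  # skip "(" and the constituent label
        elif tok == ")":
            start = stack.pop()
            if pos - start > 1:
                spans.add((start, pos))
            i += 1
        else:
            i += 1
    return spans


def span_f1(pred: Set[Span], gold: Set[Span]) -> float:
    """Unlabeled F1 between two span sets for a single sentence."""
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = tree_spans("(S (NP (DT the) (NN cat)) (VP (VBD sat)) (. .))")
    pred = tree_spans("(S (DT the) (VP (NN cat) (VBD sat)) (. .))")
    print(sorted(gold), sorted(pred), span_f1(pred, gold))
```

In this toy pair the two trees share only the whole-sentence span, so precision and recall are both one half. Since the final period is skipped when positions are counted, adding or removing it from either tree does not change the score.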

Citation

@inproceedings{cao-etal-2020-unsupervised-parsing,
    title = "Unsupervised Parsing via Constituency Tests",
    author = "Cao, Steven  and
      Kitaev, Nikita  and
      Klein, Dan",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.389",
    doi = "10.18653/v1/2020.emnlp-main.389",
    pages = "4798--4808",
}