This is code for the paper:
Unsupervised Parsing via Constituency Tests
Steven Cao, Nikita Kitaev, Dan Klein
EMNLP 2020
This code was tested with `python 3.6`, `pytorch 1.1`, and `pytorch-transformers 1.2`.
```
conda create -n ConTest python=3.6
conda activate ConTest
pip install -r requirements.txt
```
The Penn Treebank and CoLA data are contained in the `data` folder. The folder also contains a few sentences from Gigaword to show the formatting; for the full data, please download it from the LDC.
The PTB data is split into test (`23.auto.clean`), dev (`22.auto.clean`), and train (`02-21.10way.clean`). The `ptb-test.txt` file is the same as `23.auto.clean` except without punctuation or unary chains, and the sentences are in a different order. We use `ptb-test.txt` during evaluation to stay consistent with past work.
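If you want to inspect these files, a minimal sketch (not part of the repo) is below; it assumes the usual one-bracketed-tree-per-line format of `*.auto.clean` files and uses `nltk`, which is not among this repo's stated dependencies:

```python
# Minimal sketch for inspecting the tree files (assumption: one
# PTB-style bracketed tree per line; nltk is used only for illustration).
from nltk import Tree

with open("data/22.auto.clean") as f:
    trees = [Tree.fromstring(line) for line in f if line.strip()]

print(len(trees))                    # number of dev trees
print(" ".join(trees[0].leaves()))   # tokens of the first sentence
```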
To run the main experiment in the paper, see `run_full.sh`. To reduce memory usage, reduce both `--subbatch-size` and `--num-grad` while keeping their ratio fixed (`num-grad` divided by `subbatch-size` should be 16).
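The two flags presumably implement gradient accumulation: an effective batch of `num-grad` examples is processed in chunks of `subbatch-size`, so peak memory scales with the chunk size while the update itself is unchanged. A hedged sketch of that pattern (all names are placeholders, not the repo's actual code):

```python
# Sketch of gradient accumulation with subbatching (placeholder names).
# Peak memory scales with subbatch_size; the parameter update matches a
# single batch of num_grad examples.
def accumulated_update(model, loss_fn, optimizer, batch,
                       subbatch_size=2, num_grad=32):  # ratio of 16
    optimizer.zero_grad()
    num_chunks = num_grad // subbatch_size
    for i in range(0, num_grad, subbatch_size):
        loss = loss_fn(model, batch[i:i + subbatch_size]) / num_chunks
        loss.backward()  # gradients sum across subbatches
    optimizer.step()
```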
The code contains two ways of computing parser F1: `evalb`, which is standard in supervised parsing evaluation, and a custom script used in past grammar induction work (see `eval_for_comparison.py`, taken from the Compound PCFG GitHub repo). The latter ignores punctuation (among other differences; see the paper for details) and typically produces higher F1 numbers.
If you use this code, please cite:

```
@inproceedings{cao-etal-2020-unsupervised-parsing,
    title = "Unsupervised Parsing via Constituency Tests",
    author = "Cao, Steven  and
      Kitaev, Nikita  and
      Klein, Dan",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.389",
    doi = "10.18653/v1/2020.emnlp-main.389",
    pages = "4798--4808",
}
```