Authors: Ruiqi Zhong, Peter Zhang, Steve Li, JinWoo Ahn, Dan Klein, Jacob Steinhardt
This repository hosts OpenD5, a benchmark for discovering natural language facts from pairs of corpora. Our paper focuses on the setting of comparing two distributions of text through a natural language description.
The benchmark spans a wide array of disciplines and problem types. A sibling repository that contains the code for running our system on these problems is available here.
To create the full benchmark, you should 1) download these folders and 2) run the build_benchmark.sh script from the main repo.
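If you prefer to drive the build from Python, the second step can be wrapped as in the minimal sketch below. It assumes that the downloaded folders have already been placed where the script expects them and that build_benchmark.sh takes no arguments; neither is confirmed by this README. The helper names build_command and run_build are hypothetical.

```python
import subprocess
from pathlib import Path

SCRIPT_NAME = "build_benchmark.sh"  # script named in this README


def build_command():
    """Return the command used to launch the build script.

    Assumption: the script needs no arguments.
    """
    return ["bash", SCRIPT_NAME]


def run_build(repo_root):
    """Run the build script from the root of the main repo.

    Fails early with FileNotFoundError if the script is not present.
    """
    root = Path(repo_root)
    if not (root / SCRIPT_NAME).is_file():
        raise FileNotFoundError(f"{SCRIPT_NAME} not found in {root}")
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(build_command(), cwd=root, check=True)
```

Calling run_build(".") from a checkout of the main repo would then run the script in place.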
For more details, please refer to the instructions for using the scripts and the explanations of the relevant schema.
- The 675 problems in the original paper are available here.
- An extension with 37 additional problems is available here.
- A reproduction package for the entire dataset is available here. It includes additional source data that is required to assemble the full dataset.
If you'd like to contribute additional problems to the benchmark, please:
- Create a script for constructing various splits for the dataset (see pull_data.py).
- Add the dataset's relevant metadata to the datasets.yaml and pairs.yaml schema.
- Create a pull request and list the relevant citation.
- Email petez@berkeley.edu with any questions.
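As a rough illustration of the first contribution step, the sketch below groups text samples into named splits by a metadata field, so that two of the splits can later be paired as corpora. It is not taken from pull_data.py: the function build_splits, the "text" and "year" fields, and the JSON layout are all hypothetical, and the conventions actually used in the repository may differ.

```python
import json
from collections import defaultdict


def build_splits(rows, key):
    """Group the text of each row into a split named after row[key].

    Hypothetical helper: `rows` is a list of dicts with a "text" field
    plus the metadata field `key` that defines the splits.
    """
    splits = defaultdict(list)
    for row in rows:
        splits[str(row[key])].append(row["text"])
    return dict(splits)


# Toy data for illustration only.
rows = [
    {"text": "The battery lasts all day.", "year": 2019},
    {"text": "The screen cracked within a week.", "year": 2020},
    {"text": "Setup took five minutes.", "year": 2019},
]

splits = build_splits(rows, "year")

# Serialize the splits; a real script would write this to a file.
serialized = json.dumps(splits, indent=2)
```

Any two splits, for example "2019" and "2020" here, could then serve as the pair of corpora that a problem compares.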
@article{zhong2023goal,
title={Goal Driven Discovery of Distributional Differences via Language Descriptions},
author={Zhong, Ruiqi and Zhang, Peter and Li, Steve and Ahn, Jinwoo and Klein, Dan and Steinhardt, Jacob},
journal={arXiv preprint arXiv:2302.14233},
year={2023}
}