r-three/common-pile

OpenReview


Papers on OpenReview can be marked as having a particular license (see, e.g., https://openreview.net/forum?id=frA0NNBS1n). We could crawl OpenReview for PDFs of appropriately licensed papers and extract text from them.

There's likely a large overlap with the arXiv papers, and not all venues support the license field (e.g., ICLR doesn't appear to have a license field on its submission form). However, all reviews and comments on OpenReview are CC-BY (see the "Comment and Configuration Record License" section on this page). This could make for a pretty interesting dataset where each record is a sequence of comments/reviews and, optionally, the associated paper (if the paper is permissively licensed). This would be useful for teaching models to critique, engage in dialogue, summarize, etc.
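To make the proposed record shape concrete, here is a minimal sketch of what one record could look like. Everything in it is an assumption for illustration: the class and field names (`Comment`, `ThreadRecord`, `build_record`), the `PERMISSIVE_LICENSES` allow-list, and the license strings are hypothetical and not taken from OpenReview's API or from this repo. The actual set of acceptable licenses would be a project decision.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical allow-list of paper licenses; the real list is a project decision.
PERMISSIVE_LICENSES = {"CC BY 4.0", "CC0 1.0"}


@dataclass
class Comment:
    comment_id: str
    reply_to: Optional[str]  # parent comment id, or None for a top-level reply
    kind: str                # e.g. "review", "comment", "decision"
    text: str
    created: int             # creation timestamp, used only for ordering


@dataclass
class ThreadRecord:
    forum_id: str
    paper_license: Optional[str]
    paper_text: Optional[str]  # None unless the paper is permissively licensed
    comments: List[Comment] = field(default_factory=list)


def build_record(forum_id: str,
                 paper_license: Optional[str],
                 paper_text: Optional[str],
                 comments: List[Comment]) -> ThreadRecord:
    """Assemble one record: always keep the comments, keep the paper text
    only when its license is on the allow-list."""
    keep_paper = paper_license in PERMISSIVE_LICENSES
    ordered = sorted(comments, key=lambda c: c.created)
    return ThreadRecord(
        forum_id=forum_id,
        paper_license=paper_license,
        paper_text=paper_text if keep_paper else None,
        comments=ordered,
    )
```

With this shape, a forum whose paper has no license field (as with ICLR) still yields a usable record: the comment thread is kept and `paper_text` is simply `None`.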

From ICLR and TMLR alone there are roughly 200K reviews for about 20K papers (about 2K of which are permissively licensed). There are probably other venues that have public comments as well.