This is a collection of pointers to datasets/tasks/benchmarks pertaining to the intersection of machine learning and law.
This page is continually being updated. If I missed something, please contact me at nguha@stanford.edu and I'll add it!
Neel Guha
These datasets can be used for pretraining larger models. Alternatively, you can use them to construct artificial tasks.
- Caselaw Access Project: all official, book-published United States case law.
- Legifrance: a French legal publisher providing access to law codes and legal decisions. Requires scraping (Paper).
- US Supreme Court Database: information about every case decided by the US Supreme Court between 1791 and today.
- European Parliament Proceedings: parallel text of the proceedings of the European Parliament, collected in 11 languages.
- US Code: downloadable version of the US Code in XML format
- Patent Litigation Docket Reports: detailed patent litigation data on over 80k unique district court cases
- Pile of Law: a 256GB dataset of legal, administrative, and contractual texts.
- Open Australian Legal Corpus: The first and only multijurisdictional open corpus of Australian legislative and judicial documents.
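As a rough illustration of what "constructing an artificial task" from raw text like the corpora above could look like, the sketch below turns plain text into fill-in-the-blank (cloze) examples. Everything in it is an assumption for illustration: the function name `make_cloze_examples`, the `[MASK]` token, and the naive regex sentence splitter are hypothetical and are not part of any dataset listed here.

```python
import random
import re


def make_cloze_examples(text, mask_token="[MASK]", seed=0):
    """Turn raw text into (masked_sentence, answer) pairs, one per sentence.

    Hypothetical helper for illustration only; not tied to any listed dataset.
    """
    rng = random.Random(seed)
    # Naive split on ., ! or ? followed by whitespace. Real legal text
    # (citations, abbreviations such as "v.") needs a proper segmenter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    examples = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 4:
            continue  # too short to make a meaningful cloze item
        idx = rng.randrange(len(words))
        answer = words[idx]
        masked = words[:idx] + [mask_token] + words[idx + 1:]
        examples.append((" ".join(masked), answer))
    return examples


if __name__ == "__main__":
    sample = "The court granted the motion to dismiss. The plaintiff appealed the ruling."
    for masked, answer in make_cloze_examples(sample):
        print(masked, "->", answer)
```

Each returned pair holds a sentence with one word hidden and the hidden word as the label. Masking a single random word is only one of many possible choices; masking citations or party names would be a different design with different difficulty.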
- LexGLUE: a GLUE-inspired set of legal tasks
- LegalBench: a large language model benchmark for legal reasoning
Training a model to predict the outcome of a case from various case specific features.
- European Court of Human Rights: 11.5k cases from ECHR's public database. Paper.
Training a model to annotate sentences/clauses/sections in a contract (or other document) according to various criteria (e.g. unfairness, argument structure, etc).
- Detecting unfair clauses from online terms-of-service: ~12k sentences from 50 terms-of-service agreements. Paper.
- Usable Privacy Project Data: a collection of datasets for privacy policies, including OPP-115, APP-350, MAPS, and the ACL/COLING 2014 Dataset.
- Contract extraction dataset: 3,500 English contracts manually annotated with 11 different contract elements. Paper.
- EURLEX with EUROVOC annotations: 57k legislative documents from the EU's public document database, annotated with concepts from EUROVOC. Paper.
- Cornell eRulemaking Corpus: Collection of 731 user comments on the Consumer Debt Collection Practices rule by the CFPB, with annotations containing information about argument structure. Paper.
- German rental agreements (in English): ~913 sentences from German rental agreements annotated by semantic type. Paper.
- Segmenting US court decision opinions into issue parts: 316 court decisions on cyber crime and trade secrets, manually segmented into 6 content-based "types" (encompassing categories like "Introduction", "Dissent", or "Background"). Paper.
- ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
Training a model to summarize complex contractual jargon or legal analysis.
- Summarizing contracts into plain English: 446 contracts with parallel plain-text section-level English summaries. Paper.
- Cookie policies from 151 companies: User agreements for 151 services with sections annotated by TOS;DR. Paper.
- Australian case citation summarization: 4000 cases from the Federal Court of Australia with citation-based summaries.
- Board of Veterans' Appeals Case Summarization: Summarizing BVA cases concerning PTSD. Paper.
- Multi-LexSum: Summarizing civil rights opinions at different granularities!
- EUR-Lex-Sum: Dataset for cross-lingual summarization based on manually curated document summaries of legal acts from the European Union law platform.
Training a model to answer questions or to identify passages from a target document that are relevant to a specified query.
- Linking Supreme Court Opinions to the US Constitution: 36k paragraphs from US Supreme Court opinions with 41k links to the US Constitution. Paper.
- StAtutory Reasoning Assessment (SARA): Collection of rules extracted from US Internal Revenue Code and natural language questions requiring application of those rules. Paper.
- PrivacyQA: 1750 questions on mobile application privacy policies and 3500 relevant expert annotations. Paper
- CaseHOLD: 53,000+ multiple-choice questions that require identifying the correct holding for a case citation from the preceding context. Paper.
- LegalSupport: inferring BlueBook support signals from legal texts
Training a model to classify a (typically lengthy) legal filing or document.
- EDGAR: Online public database for US Securities and Exchange Commission. Filings can be classified by filing type. Paper.
Datasets which don't fit into the above categories:
- Segmenting sentences in US cases: ~26k sentences from 80 cases. Paper.
- Demosthenes Corpus for argument mining in legal documents.