Datasets for Machine Learning in Law

This is a collection of pointers to datasets/tasks/benchmarks pertaining to the intersection of machine learning and law.

This page is continually being updated. If I missed something, please contact me at nguha@stanford.edu and I'll add it!

Neel Guha

Task agnostic datasets

These datasets can be used for pretraining larger models. Alternatively, you can use them to construct artificial tasks.

Caselaw Access Project: all official, book-published United States case law.
Legifrance: a French legal publisher providing access to law codes and legal decisions. Requires scraping (Paper).
US Supreme Court Database: information about every case decided by the US Supreme Court between 1791 and today.
European Parliament Proceedings: Parallel text of the proceedings of the European Parliament, collected in 11 languages.
US Code: downloadable version of the US Code in XML format.
Patent Litigation Docket Reports: detailed patent litigation data on over 80k unique district court cases.
Pile of Law: a 256GB dataset of legal, administrative, and contractual texts.
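As one illustration of constructing an artificial task from raw text, the sketch below builds fill-in-the-blank (cloze) examples by masking one word per sentence. It is a minimal sketch, not tied to any dataset listed above: the function name `make_cloze_examples`, the `[MASK]` token, and the naive sentence splitter are all assumptions made for illustration.

```python
import random
import re


def make_cloze_examples(text, mask_token="[MASK]", seed=0):
    """Turn raw text into (masked_sentence, answer) pairs, at most one per sentence."""
    rng = random.Random(seed)
    # Naive split on terminal punctuation. Real legal text (citations,
    # abbreviations such as "v." or "U.S.") needs a proper sentence segmenter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    examples = []
    for sentence in sentences:
        words = sentence.split()
        # Mask candidates: purely alphabetic words longer than 3 characters.
        candidates = [i for i, w in enumerate(words) if w.isalpha() and len(w) > 3]
        if not candidates:
            continue  # nothing suitable to mask in this sentence
        i = rng.choice(candidates)
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), words[i]))
    return examples


if __name__ == "__main__":
    # Hypothetical toy text, not drawn from any of the corpora above.
    sample = "The court granted the motion. The tenant shall pay rent monthly."
    for masked, answer in make_cloze_examples(sample):
        print(masked, "->", answer)
```

The same pattern extends to other self-supervised setups, such as predicting the next sentence or the section heading, by changing what is hidden and what is kept as the label.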

Benchmarks which combine multiple types of tasks

LexGLUE: a GLUE-inspired set of legal tasks
LegalBench: a large language model benchmark for legal reasoning

Judgement prediction

Training a model to predict the outcome of a case from various case-specific features.

European Court of Human Rights: 11.5k cases from ECHR's public database. Paper.

Document/contract annotation

Training a model to annotate sentences/clauses/sections in a contract (or other document) according to various criteria (e.g. unfairness, argument structure, etc).

Detecting unfair clauses from online terms-of-service: ~12k sentences from 50 terms-of-service agreements. Paper.
Usable Privacy Project Data: a collection of datasets for privacy policies, including OPP-115, APP-350, MAPS, and the ACL/COLING 2014 Dataset.
Contract extraction dataset: 3,500 English contracts manually annotated with 11 different contract elements. Paper.
EURLEX with EUROVOC annotations: 57k legislative documents from the EU's public document database, annotated with concepts from EUROVOC. Paper.
Cornell eRulemaking Corpus: Collection of 731 user comments on the Consumer Debt Collection Practices rule by the CFPB, with annotations containing information about argument structure. Paper.
German rental agreements (in English): ~913 sentences from German rental agreements annotated by semantic type. Paper.
Segmenting US court decision opinions into issue parts: 316 court decisions on cyber crime and trade secrets, manually segmented into 6 content-based "types" (encompassing categories like "Introduction", "Dissent", or "Background"). Paper.
ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts

Summarization

Training a model to summarize complex contractual jargon or legal analysis.

Summarizing contracts into plain English: 446 contracts with parallel plain-text section-level English summaries. Paper.
Cookie policies from 151 companies: User agreements for 151 services with sections annotated by TOS;DR. Paper.
Australian case citation summarization: 4000 cases from the Federal Court of Australia with citation-based summaries.
Board of Veterans' Appeals Case Summarization: Summarizing BVA cases concerning PTSD. Paper.
Multi-LexSum: Summarizing civil rights opinions at different granularities!

Linking / question answering

Training a model to answer questions or to identify passages from a target document that are relevant to a specified query.

Linking Supreme Court Opinions to the US Constitution: 36k paragraphs from US Supreme Court opinions with 41k links to the US Constitution. Paper.
StAtutory Reasoning Assessment (SARA): Collection of rules extracted from US Internal Revenue Code and natural language questions requiring application of those rules. Paper.
PrivacyQA: 1750 questions on mobile application privacy policies and 3500 relevant expert annotations. Paper.
CaseHOLD: 53,000+ multiple-choice questions that require identifying the correct holding for a case citation from the preceding context. Paper.
LegalSupport: inferring BlueBook support signals from legal texts
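For the passage-identification side of these tasks, a simple lexical baseline is a useful point of comparison before training a model. The sketch below ranks candidate passages by how many tokens they share with the query. It is only an assumed baseline for illustration: the names `tokenize` and `rank_passages` are hypothetical, and none of the datasets above prescribes this method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_passages(query, passages):
    """Return passage indices ordered by token overlap with the query, best first."""
    query_counts = Counter(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        passage_counts = Counter(tokenize(passage))
        # Multiset intersection: each shared token counts up to its minimum frequency.
        overlap = sum((query_counts & passage_counts).values())
        scored.append((overlap, idx))
    # Highest overlap first; ties keep the original passage order.
    return [idx for overlap, idx in sorted(scored, key=lambda t: (-t[0], t[1]))]


if __name__ == "__main__":
    # Hypothetical toy passages, not taken from any listed dataset.
    passages = [
        "The tenant shall pay rent on the first day of each month.",
        "The landlord may enter the premises after giving notice.",
        "Either party may terminate this agreement in writing.",
    ]
    print(rank_passages("When must the tenant pay rent?", passages))
```

Raw overlap favors long passages and common words, so stronger baselines typically add term weighting such as TF-IDF or BM25.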

Document classification

Training a model to classify a (typically lengthy) legal filing or document.

EDGAR: Online public database for US Securities and Exchange Commission. Filings can be classified by filing type. Paper.

Misc

Datasets which don't fit into the above categories:

Segmenting sentences in US cases: ~26k sentences from 80 cases. Paper.
