/legal-ml-datasets

A collection of datasets and tasks for legal machine learning

Datasets for Machine Learning in Law

This is a collection of pointers to datasets/tasks/benchmarks pertaining to the intersection of machine learning and law.

This page is continually being updated. If there's a dataset/resource you think should be included here, please make a pull request adding it, or contact me at nguha@stanford.edu and I'll add it!

Neel Guha

Task agnostic datasets

These datasets can be used for pretraining larger models. Alternatively, you can use them to construct artificial tasks.

  • Caselaw Access Project: all official, book-published United States case law.
  • Legifrance: a French legal publisher providing access to law codes and legal decisions. Requires scraping (Paper).
  • US Supreme Court Database: information about every case decided by the US Supreme Court between 1791 and today.
  • European Parliament Proceedings: parallel text of the proceedings of the European Parliament, collected in 11 languages.
  • US Code: downloadable version of the US Code in XML format.
  • Patent Litigation Docket Reports: detailed patent litigation data on over 80k unique district court cases.
  • Pile of Law: a 256GB dataset of legal, administrative, and contractual texts.
  • Open Australian Legal Corpus: The first and only multijurisdictional open corpus of Australian legislative and judicial documents.
  • Ontario Laws and Regs: A dataset comprising the most recent version of all current and revoked laws and regulations from Ontario, Canada, totalling around 5,000 documents.
  • The Cambridge Law Corpus: A dataset consisting of raw text and metadata for 250,000+ court cases from the UK, dating back to the 16th century. Additional expert annotations are provided for a sample of 638 cases.
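As a rough illustration of what "constructing an artificial task" from raw text could look like, the sketch below builds a sentence-order prediction task: adjacent sentence pairs are kept in order (label 1) or swapped (label 0). The function names, the toy document, and the naive sentence splitter are all assumptions made for this example and are not part of any dataset listed above. Real legal text contains abbreviations such as "v." or "U.S." that this splitter would handle badly.

```python
import random
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # Assumption: good enough for a toy example, not for real case law.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def make_order_task(documents, seed=0):
    """Return (first, second, label) tuples.

    label 1 means the pair appears in its original order,
    label 0 means the two sentences were swapped.
    """
    rng = random.Random(seed)
    examples = []
    for doc in documents:
        sentences = split_sentences(doc)
        for a, b in zip(sentences, sentences[1:]):
            if rng.random() < 0.5:
                examples.append((a, b, 1))
            else:
                examples.append((b, a, 0))
    return examples


# Hypothetical toy document standing in for text from one of the corpora.
toy_docs = [
    "The plaintiff filed a complaint. The defendant moved to dismiss. "
    "The court denied the motion.",
]
examples = make_order_task(toy_docs)
```

The same pattern (derive labels from the structure of unlabeled text) could be adapted to other self-supervised tasks, such as predicting which court or year a passage comes from when that metadata is available.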

Benchmarks which combine multiple types of tasks

  • LexGLUE: a GLUE-inspired set of legal tasks
  • LegalBench: a large language model benchmark for legal reasoning

Judgement prediction

Training a model to predict the outcome of a case from various case-specific features.

Document/contract annotation

Training a model to annotate sentences/clauses/sections in a contract (or other document) according to various criteria (e.g., unfairness or argument structure).

Summarization

Training a model to summarize complex contractual jargon or legal analysis.

Linking / question answering

Training a model to answer questions or to identify passages from a target document that are relevant to a specified query.

Document classification

Training a model to classify a (typically lengthy) legal filing or document.

  • EDGAR: Online public database for the US Securities and Exchange Commission. Filings can be classified by filing type. Paper.
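To make the "classify by filing type" idea concrete, here is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier written in plain Python. The training snippets are invented for illustration and are not taken from EDGAR. The labels "10-K" and "8-K" are real SEC form types, but the helper names (`tokenize`, `train`, `predict`) and the whitespace tokenizer are assumptions of this example. Real filings are long and would need proper preprocessing and a stronger model.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy demo.
    return text.lower().split()


def train(examples):
    """examples: list of (text, label) pairs. Returns a simple count-based model."""
    word_counts = defaultdict(Counter)  # label -> token counts
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under Naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus add-one (Laplace) smoothed log likelihoods.
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-written snippets; real filings are far longer.
toy_filings = [
    ("annual report for the fiscal year ended december", "10-K"),
    ("annual report fiscal year financial statements", "10-K"),
    ("current report announcing departure of officer", "8-K"),
    ("current report entry into material agreement", "8-K"),
]
model = train(toy_filings)
annual_guess = predict(model, "annual report fiscal year")
current_guess = predict(model, "current report material agreement")
```

In practice the filing type is usually available as metadata, so a setup like this is mainly useful as a sanity-check baseline before moving to models that can handle lengthy documents.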

Misc

Datasets which don't fit into the above categories: