
A collection of datasets and tasks for legal machine learning

Datasets for Machine Learning in Law

This is a collection of pointers to datasets/tasks/benchmarks pertaining to the intersection of machine learning and law.

This page is continually being updated. If I missed something, please contact me at nguha@stanford.edu and I'll add it!

Neel Guha

Task agnostic datasets

These datasets can be used for pretraining larger models. Alternatively, you can use them to construct artificial tasks.
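As a rough illustration of what "constructing an artificial task" could look like, here is a minimal sketch that turns unlabeled sentences into fill-in-the-blank (cloze) examples. The function name `make_cloze_examples`, the `[MASK]` token, and the two sample sentences are all made up for illustration; they don't come from any of the datasets listed here.

```python
import random

def make_cloze_examples(sentences, mask_token="[MASK]", seed=0):
    """Turn raw sentences into (masked_sentence, target_word) pairs.

    Hypothetical helper: one whitespace-delimited word per sentence is
    replaced by mask_token, and the removed word becomes the label.
    """
    rng = random.Random(seed)  # seeded so the examples are reproducible
    examples = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 2:
            continue  # too short to mask meaningfully
        idx = rng.randrange(len(words))
        target = words[idx]
        masked = words[:idx] + [mask_token] + words[idx + 1:]
        examples.append((" ".join(masked), target))
    return examples

# Toy, invented sentences standing in for an unlabeled legal corpus.
corpus = [
    "The lessee shall pay rent on the first day of each month.",
    "This agreement is governed by the laws of the State of New York.",
]

examples = make_cloze_examples(corpus)
for masked, target in examples:
    print(masked, "->", target)
```

Real pretraining objectives usually mask subword tokens rather than whole words, so treat this only as a starting point.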

Benchmarks which combine multiple types of tasks

  • LexGLUE: a GLUE-inspired set of legal tasks
  • LegalBench: a large language model benchmark for legal reasoning

Judgement prediction

Training a model to predict the outcome of a case from various case-specific features.

Document/contract annotation

Training a model to annotate sentences/clauses/sections in a contract (or other document) according to various criteria (e.g. unfairness or argument structure).

Summarization

Training a model to summarize complex contractual jargon or legal analysis.

Linking / question answering

Training a model to answer questions or to identify passages from a target document that are relevant to a specified query.
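To make the "identify relevant passages" framing concrete, here is a minimal sketch of a word-overlap baseline that ranks passages against a query. It is not the method used by any dataset on this page; `tokenize`, `rank_passages`, and the sample passages are hypothetical names and data chosen for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_passages(query, passages):
    """Return (score, passage) pairs, highest score first.

    The score is the number of passage tokens that also appear in the
    query. Ties keep the original passage order.
    """
    query_tokens = set(tokenize(query))
    scored = []
    for position, passage in enumerate(passages):
        counts = Counter(tokenize(passage))
        score = sum(counts[token] for token in query_tokens)
        scored.append((score, position, passage))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(score, passage) for score, _, passage in scored]

# Toy, invented passages standing in for sections of a target document.
passages = [
    "The tenant must give thirty days notice before terminating the lease.",
    "Rent is due on the first day of each month.",
    "The landlord is responsible for structural repairs.",
]

ranked = rank_passages("When is rent due?", passages)
for score, passage in ranked:
    print(score, passage)
```

A trained model would replace the overlap score with a learned relevance score, but the input/output shape of the task stays the same.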

Document classification

Training a model to classify a (typically lengthy) legal filing or document.

  • EDGAR: the online public database of the US Securities and Exchange Commission. Filings can be classified by filing type. Paper.
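As a sketch of how filing types could serve as classification labels, the snippet below maps each distinct form type to an integer id and pairs it with the filing text. The record layout (`form_type`, `text`), the helper names, and the placeholder texts are assumptions for illustration; this is not EDGAR's actual data format or API.

```python
def build_label_index(records):
    """Map each distinct filing type to an integer id, in sorted order."""
    form_types = sorted({record["form_type"] for record in records})
    return {form_type: i for i, form_type in enumerate(form_types)}

def to_examples(records, label_index):
    """Convert records into (text, label_id) pairs for a classifier."""
    return [(r["text"], label_index[r["form_type"]]) for r in records]

# Hypothetical records; the texts are placeholders, not real filings.
records = [
    {"form_type": "10-K", "text": "placeholder annual report text"},
    {"form_type": "8-K", "text": "placeholder current report text"},
    {"form_type": "10-K", "text": "another placeholder annual report"},
]

label_index = build_label_index(records)
examples = to_examples(records, label_index)
print(label_index)
print(examples)
```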

Misc

Datasets which don't fit into the above categories: