scdb

Exports of the Supreme Court Database into Python-friendly formats. (Mirror of https://dagshub.com/drmrd/scdb. Head there for datasets and a more complete experience!)


The Supreme Court Database, now in Python-friendly formats!

This repository contains Feather and Parquet files derived from the most recent versions of the legacy and modern Supreme Court Database datasets. As discussed on the SCDB website, the SCDB is released annually in a variety of formats that differ from one another along several axes (time period, unit of analysis, database record granularity, and file format). This repository contains a minimally altered version of each of these datasets.

Comparison to Official Datasets

I've made an active effort to ensure that, apart from the datasets in the data/preprocessed directory, the Feather and Parquet files in this repository are faithful reproductions of those found in the official releases. They should differ from the official datasets only in the following ways:

  1. Human-readable strings are used instead of numeric codes for variable values. These strings match the ones found in the SPSS release.
  2. In string-valued and categorical columns, np.nan values are replaced with the sentinel string 'MISSING_VALUE'.
  3. Variable data types are converted to accurate and more-or-less storage-optimal types, including the experimental pd.StringDtype from pandas. Thanks to this and, mostly, to the general efficiency of these file formats, the largest Feather and Parquet files created here are 6.5 MB and 3.4 MB, respectively, roughly 1.7% and 6.5% of the size of the largest .sav file from which we imported.
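The three conversions above can be sketched in a few lines of pandas. This is a miniature, not the repository's actual pipeline: the column names and the code-to-label mapping below are illustrative stand-ins, not real SCDB variables or codings.

```python
import pandas as pd

# Illustrative stand-in for an SCDB export; columns and codes are hypothetical.
df = pd.DataFrame({
    "decisionDirection": [1, 2, None],
    "chief": ["Warren", None, "Burger"],
})

# 1. Replace numeric codes with human-readable strings (as in the SPSS release).
direction_labels = {1: "conservative", 2: "liberal"}
df["decisionDirection"] = df["decisionDirection"].map(direction_labels)

# 2. Fill missing values in string-valued columns with the 'MISSING_VALUE' sentinel.
# 3. Convert to pandas' (still experimental) string dtype for compact storage.
for column in ["decisionDirection", "chief"]:
    df[column] = df[column].fillna("MISSING_VALUE").astype(pd.StringDtype())

print(df.dtypes)
```

The resulting frame carries `string` dtypes throughout, which is what keeps the Feather and Parquet exports small relative to the original .sav files.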

Available Files

  • data/raw contains the officially released SPSS files from which I've derived the datasets.
  • data/feather contains all of the generated Feather files.
  • data/parquet contains (yep, you guessed it) the Parquet files.
  • data/preprocessed contains a more refined version of the case-centric, citation-level dataset. This combines the legacy and modern datasets and includes some mild error correction and imputation work. All changes are documented in the repository's dvc.yaml file, in the data_pipeline package and, in more prose-heavy form, on my blog, beginning with this post. If you'd like to get involved, contributions are welcome, as are feature requests and issues!

Disclaimer

I'm not affiliated with the Supreme Court Database, and this project is not officially endorsed by members of the Supreme Court Database.