/jedi-datasets

Contains the experimental data of the JEDI paper at SIGMOD 2022.

Primary LanguagePythonMIT LicenseMIT

DOI

jedi-datasets

This repository contains the experimental data for JSON similarity lookups that are published in the following publication:

@inproceedings{huetter2021jedi,
  title        = {{JEDI}: These aren't the {JSON} documents you're looking for...},
  author       = {H{\"u}tter, Thomas and Augsten, Nikolaus and Christoph M, Kirsch and Michael J, Carey and Li, Chen},
  year         = 2022,
  booktitle    = {Proceedings of the 2022 International Conference on Management of Data}
}

Fetch, preprocess, and analyze data

The raw JSON datasets are fetched from repositories (see directory raw-data) and converted into bracket notation which serves as the input data for the algorithms (see directory input-data). Further, the characteristics of the JSON datasets are analyzed. All steps are performed by the following script:

sh scripts/download-prepare.sh

This process takes approximately one hour for all datasets. One might consider to exclude datasets that are not needed.

Datasets

The following datasets are included:

  • arxiv
  • cards
  • clothing
  • dblp
  • denf
  • device
  • face
  • fenf
  • movies
  • nasa
  • nba
  • reads
  • recipes
  • reddit
  • schema
  • smsen
  • smszh
  • spotify
  • standev
  • stantrain
  • trees
  • twitter
  • twitter2
  • virus