An Investigation of OpenRefine Provenance

Overview

This repo records the history and current state of a project to model the data-cleaning history management features of OpenRefine. The project has three principal aims:

  1. Investigate and clearly document the actual data-cleaning history capabilities of OpenRefine.
  2. Demonstrate queries of the historical information captured by OpenRefine, focusing primarily on queries that reveal critical elements of the provenance of a cleaned data set.
  3. Publish a paper reporting the results of the project, and complement this paper with the means to reproduce those results.

This repo is meant not only to let others reproduce the results of this project, but also to make the work as easy as possible to review. It will include the notes taken while doing the research; definitions of key aspects of the computing environments used to run the software that produced the reported results; and the source code for any custom software developed to carry out the project.

The notes folder contains the daily records kept throughout the project.

Project planning is done using issues. Please feel free to comment, make suggestions, or contribute ideas.

Miscellaneous updates and thoughts about the project are posted on Twitter by @tmcphillips.

Poster presentation at TaPP 2019

We presented an early report of our progress at TaPP 2019 (11th International Workshop on Theory and Practice of Provenance), on June 3, 2019.

The poster and the accompanying one-page paper (with LaTeX sources) are included in the tapp2019 directory.

A working demonstration corresponding to the examples on the poster is in demos/03_poster_demo.