Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Data Linter

Summary

This code accompanies the NIPS 2017 ML Systems Workshop paper/poster, "The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets."

The Data Linter identifies potential issues (lints) in your ML training data.

Using the Data Linter

Prerequisites

You'll need the following installed to use the Data Linter:

  1. Python
  2. Apache Beam
  3. TensorFlow
  4. Facets

Data Linter Demo

The easiest way to learn how to use the Data Linter is to follow the demo instructions in demo/README.md.

Running the Data Linter

Running the Data Linter requires the following steps:

  1. Encoding your data in TFRecord format.
  2. Generating summary statistics for those data, using Facets.
  3. Running the Data Linter.
  4. Using the Lint Explorer to view the lint results.

Creating Data in the TFRecord Format

To see how to convert CSV files to the TFRecord format, look at the example code in demo/convert_to_tfrecord.py.
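
As a rough illustration of what such a conversion involves, here is a minimal sketch. It is not taken from demo/convert_to_tfrecord.py, which may work differently. It assumes a well-formed CSV file with a header row, guesses each value's type with a simple "int, then float, else string" rule, and writes one tf.train.Example per row. The names parse_value, row_to_features, and csv_to_tfrecord are hypothetical. TensorFlow is imported only inside the function that writes the file. tf.io.TFRecordWriter is the name in recent TensorFlow releases; older releases expose the same class as tf.python_io.TFRecordWriter.

```python
import csv


def parse_value(text):
    """Return text as an int or float when it parses as one, else as a stripped string."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def row_to_features(row):
    """Map a CSV row (column name -> raw string) to typed Python values."""
    return {name: parse_value(raw) for name, raw in row.items()}


def csv_to_tfrecord(csv_path, tfrecord_path):
    """Write every row of csv_path to tfrecord_path as a tf.train.Example.

    Assumption: every row has exactly as many fields as the header.
    """
    import tensorflow as tf  # Imported lazily; only this function needs it.

    def to_feature(value):
        # Pick the tf.train.Feature list type that matches the Python type.
        if isinstance(value, int):
            return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
        if isinstance(value, float):
            return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[value.encode("utf-8")]))

    with open(csv_path, newline="") as csv_file, \
            tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in csv.DictReader(csv_file):
            features = {k: to_feature(v) for k, v in row_to_features(row).items()}
            example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(example.SerializeToString())
```

Because the type is guessed per value, a column that mixes values such as 5 and 5.5 would end up with different feature types in different rows. For real data, a fixed per-column schema is the safer choice.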

Summarizing Your Data Using Facets

To see how to generate summary statistics for your data, see the example code in demo/summarize_data.py.

Executing the Data Linter

Once you have both the data and the summary statistics, you can run the Data Linter as follows:

python data_linter_main.py --dataset_path PATH_TO_TFRECORDS \
  --stats_path PATH_TO_FACETS_SUMMARIES --results_path PATH_FOR_SAVING_RESULTS

For example, if you follow the instructions in the demo folder, you'll invoke the Data Linter like this:

python data_linter_main.py --dataset_path /tmp/adult.tfrecords \
  --stats_path /tmp/adult_summary.bin \
  --results_path /tmp/datalinter/results/lint_results.bin

Viewing Results with the Lint Explorer

After the Data Linter is done examining your data, you can view the results using this command:

python lint_explorer_main.py --results_path PATH_TO_RESULTS

For example:

python lint_explorer_main.py --results_path \
  /tmp/datalinter/results/lint_results.bin
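
If you run both steps often, they can be chained from a small script. The sketch below is one possible wrapper, not part of this distribution. The function names linter_command, explorer_command, and lint_and_explore are hypothetical; the script names and flags are the ones shown above. It assumes it is run from the directory that contains data_linter_main.py and lint_explorer_main.py, and it uses sys.executable so that both steps run under the same Python interpreter as the wrapper.

```python
import subprocess
import sys


def linter_command(dataset_path, stats_path, results_path):
    """Build the argument list that invokes data_linter_main.py."""
    return [
        sys.executable, "data_linter_main.py",
        "--dataset_path", dataset_path,
        "--stats_path", stats_path,
        "--results_path", results_path,
    ]


def explorer_command(results_path):
    """Build the argument list that invokes lint_explorer_main.py."""
    return [sys.executable, "lint_explorer_main.py", "--results_path", results_path]


def lint_and_explore(dataset_path, stats_path, results_path):
    """Run the Data Linter, then the Lint Explorer on its results.

    check=True raises subprocess.CalledProcessError if either step exits
    with a non-zero status, so the explorer never runs on a failed lint.
    """
    subprocess.run(linter_command(dataset_path, stats_path, results_path), check=True)
    subprocess.run(explorer_command(results_path), check=True)
```

With the demo paths, the call would be lint_and_explore("/tmp/adult.tfrecords", "/tmp/adult_summary.bin", "/tmp/datalinter/results/lint_results.bin").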

Notes

The code uses Google's Protocol Buffers (protobuf) format. The message definitions are in protos/.

To make it easier to run the code, we include protobuf definitions from TensorFlow and Facets in this distribution.

Support

This is not an official Google project. This project will not be supported or maintained, and we will not accept any pull requests.

Authors

The Data Linter was created by Nick Hynes (nhynes@berkeley.edu) during an internship at Google with Michael Terry (michaelterry@google.com).