Schema Generator

This repository contains a simple "schema generator" that uses the data from an input CSV to infer its schema. It is primarily intended for teams participating in the NIST 2020 differential privacy challenge, as the schema generated by this tool mimics the schemae used for the challenge datasets. Teams (or others!) are welcome to include this module in their final projects or link to it on GitHub.

Assumptions

This tool assumes that the schema is learnable from the input dataset, and specifically that all categorical values are included in the input (and therefore can be inferred by this tool).

Additionally, by default this tool does not automatically pad or otherwise adjust the min/max values for numeric ranges (it uses the actual min/max values present in the data), so in order to fully protect privacy these should be reviewed and adjusted as needed directly in the resulting JSON file.

Note that this approach of taking the schema from the input data is generally considered to be "cheating" when it comes to differential privacy! However, in a real-world context, and when attempting to make something that is as easy for researchers to use as possible, learning the schema from the input data is often justified.

When Not to Use This Tool

Do not use this tool in situations where the input data is sparse enough that values that should be in the schema don’t actually occur in the data, or if there are hundreds of possible categorical values and some values aren’t included in the data. Additionally, don't use this tool if reading values from the input data itself could be considered a breach of privacy, or if generating a schema from those values would reveal information about the data that should be kept private.

Input File

The input file for this tool must meet the following requirements:

  • It must be a CSV
  • The first row must be a header row
  • Each row should consist of the same number of columns as the header row (although rows may have blank or missing values for a given column)

Output Produced

The output will be two files (and/or Python dicts, if using as a module):

  • parameters.json, which contains information such as datatype (dtype), plus a list of acceptable values (for categorical variables) or a min/max and optional binning information (for numeric variables)
  • column_datatypes.json, which contains a list of all the columns and their datatypes

These files will be formatted according to the JSON schemae provided in this tool's json_schemae directory, and can be validated using the validation tool provided (see below).

Example Output:

These examples were created from an input file that looked something like this:

ID,Age,Date,State
A-123,14,2012-01-01,CA
B-526,22,2014-06-09,CT
C-9725,87,2020-12-31,TX
...

parameters.json

{
  "schema": {
    "ID": {
      "dtype": "str",
      "kind": "text"
    },
    "Age": {
      "dtype": "uint8",
      "kind": "numeric",
      "min": 14,
      "max": 87,
      "bins": 10
    },
    "Date": {
      "dtype": "date",
      "kind": "date",
      "min": "2012-01-01 00:00:00",
      "max": "2020-12-31 00:00:00"
    },
    "State": {
      "dtype": "str",
      "kind": "categorical",
      "values": [
        "CA",
        "CT",
        "FL",
        "TX"
      ],
      "codes": [
        1,
        2,
        3,
        4
      ]
    }
  }
}

column_datatypes.json

{
  "dtype": {
    "ID": "str",
    "Age": "uint8",
    "Date": "date",
    "State": "str"
  }
}

Using the Tool

Dependencies

This tool requires Python version 3.9 or above.

The requirements.txt file in the root directory contains information about the modules that this tool depends on.

Virtual Environment

It is strongly recommended that you use a Python virtual environment. If you are not familiar with Python virtual environments, [this](https://docs.python.org/3/library/venv.html) is a good place to start. (If you do not wish to use a virtual environment, you can jump straight to step 3 below.)

  1. To create a virtual environment (this only needs to be done once), run the following from the root directory of this project: python3 -m venv .env

  2. Then, any time you want to work on the project, start by activating the virtual environment. Full instructions for different platforms can be found in the Python documentation, but in general it will be some variation (depending on your platform) of source .env/bin/activate. (Do this whenever you open a new command prompt and want to work with this project.)

  3. Once you have created and activated a virtual environment, you can install the requirements: pip install -r requirements.txt (do this only once)

Generating a Schema

Approach

The schema generator acts on the assumption that the provided datafile contains at least one record for every possible value for categorical variables. It will determine the datatype and, if relevant, build a list of acceptable values by looking at this set of data. When run from a command line, it can also prompt the user to review the generated schema and correct any situations where its inferences were incorrect.
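For illustration only, here is a minimal sketch of the kind of inference described above — not the tool's actual implementation — using pandas and the default threshold of 40 values:

# Illustrative sketch of the inference approach described above; this is NOT
# the tool's actual code. The 40-value threshold mirrors --max_categorical.
import pandas as pd

MAX_CATEGORICAL = 40

def guess_kind(series):
  """Roughly guess a column's 'kind' from its values."""
  non_null = series.dropna()
  if non_null.nunique() <= MAX_CATEGORICAL:
    return "categorical"
  if pd.api.types.is_numeric_dtype(non_null):
    return "numeric"
  try:
    pd.to_datetime(non_null)
    return "date"
  except (ValueError, TypeError):
    return "text"

df = pd.read_csv("/path/to/input.csv")
print({column: guess_kind(df[column]) for column in df.columns})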

Running From the Command Line

To run the tool from a command line, do the following:

  1. (if using a virtual environment) Activate your venv (if you haven't already): source .env/bin/activate (or a variation depending on your platform, see above)

  2. Run the tool using its defaults: python main.py /path/to/input.csv

  3. After the tool has processed your input CSV file, two files will be written out to the current working directory:

    • parameters.json
    • column_datatypes.json

Command-Line Arguments

The script provides a variety of command-line options, which can be viewed by running python main.py -h. They are:

-o, --output_dir <DIR>
  The output directory into which the output files should be written. Default: current directory.

-m, --max_categorical <INT>
  The maximum number of values a column can have to be considered categorical. Columns with more than this number of values will be treated as ranges or text. Default: 40.

-b, --num_bins <INT>
  The default "number of bins" for numeric variables. This number will be included in the output schema for each numeric variable, for informational purposes only. If needed, values for individual variables should be adjusted directly in the output file. Set this to 0 to suppress output of binning information entirely. Default: 10.

-c, --categorical <LIST>
  A list of column names that the generator should always consider categorical, regardless of the number of values. Specify in quotes, as a comma-separated list (e.g. "Hospital Name, Gender"). Case-sensitive. Default: None.

-g, --geographical <LIST>
  A list of column names that the generator should always consider as the special "geographical" kind. Geographical columns are treated as categorical, but are labeled specially in order to facilitate geographical processing. Specify in quotes, as a comma-separated list (e.g. "State, Country"). Case-sensitive. Default: None.

-s, --skip_columns <LIST>
  A list of column names that should be skipped entirely. Specify in quotes, as a comma-separated list (e.g. "ColA, Column B"). Default: None.

-i, --include_na
  Include "NaN" as one of the categorical values when some rows don't have a value for that column. By default, this is omitted. Default: False.

-t, --include_text_columns
  Include in the schema the names of columns that are non-categorical strings (kind "text"). By default, these are omitted. Default: False.

Example Usage:

To parse a file called foo.csv, using an output directory called output_dir, and treating the column ID as categorical regardless of the number of values:

python main.py -o output_dir -c "ID" foo.csv

To parse a file called input_files/foobar.csv, specifying that any column with more than 10 values should not be considered categorical, and including "NaN" as a value if empty values are present in that column:

python main.py -m 10 -i input_files/foobar.csv

Using as a Module

The tool can also be used as a module within a larger program. For details, please refer to the module documentation.

In brief:

import schemagen

# Create a generator, parse the input CSV, and retrieve the schema as a dict
schema_generator = schemagen.SchemaGenerator()
input_file = "/path/to/file.csv"
schema_generator.read_and_parse_csv(input_file)
parameters_json = schema_generator.get_parameters_json()
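The returned dict has the same shape as the parameters.json example shown above, so it can be inspected directly. For example (illustrative only, assuming that structure):

# Illustrative: list the categorical/geographical columns and how many values
# each one has, using the dict returned by get_parameters_json() above.
for column, info in parameters_json["schema"].items():
  if info["kind"] in ("categorical", "geographical"):
    print(column, len(info["values"]))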

Schema Description

The schema that is produced is a JSON file, which conforms to the JSON schema file included in this tool's json_schemae directory. The schema contains one root key (schema) that itself contains a JSON object with one key per column. Only those columns that should be processed are included in the schema (ref. the --skip_columns and --include_text_columns command-line arguments).

The value for each column key is a JSON object (a Python dict), which itself contains the following keys:

  • dtype (required): A string representing the native datatype for this column. Available values are: int8, int16, int32, int64, uint8, uint16, uint32, uint64, str, float, or date
  • kind (required): A string representing what type of data this column contains. Available values are:
    • categorical: This column represents a variable that can take on one of a limited number of possible values. For categorical columns, the schema will include a values array, containing the actual values that are present in the CSV file, as well as a codes array, containing an integer mapping for each value, which can be used during processing. Categorical columns are determined based on the number of values included in the source CSV (ref. also the --max_categorical and --categorical command-line arguments). By default, they do not contain NaN as one of the available values, even if some rows of the source data are missing a value (ref. the --include_na command-line argument).
    • geographical: A special type of categorical variable that contains information that is semantically geographic in nature. To use geographical columns, they must be specified on the command line when running the tool (ref. the --geographical command-line argument).
    • numeric: This column is strictly numeric in nature. Numeric columns will have a min, max, and (optionally) bins number specified (ref. the --num_bins command-line argument).
    • date: This column represents a date. The tool attempts to parse the values from the column as a datetime, and if successful, assumes that the column contains date information. A min and max date will be included. Date columns should be double-checked for accuracy, as the parsing attempt is inexact.
    • text: By default, text columns are not included in the schema. A column is considered to be text if it contains more than max_categorical possible values and the values are non-numeric and not parsable as dates (ref. the --include_text_columns command-line argument).
    • id: The schema generator will never mark anything as an id kind of column, but it is valid to give a particular column a kind of id manually.
  • values and codes (only when kind is categorical or geographical): Two arrays containing the actual values present in the input CSV, along with an integer code for each value that can be used when processing the data (see the sketch after this list). Note that when the values themselves are integers, they may or may not match the integer codes. The dtype of this column will represent the actual datatype of the values. For instance, if the values present in the CSV are M, F, O, the dtype will be str, the values array will be [ "M", "F", "O" ], and the codes array will be [ 1, 2, 3 ].
  • min, max, and bins (only when kind is numeric or date): The min value and the max value represent the range of the values. (TODO: Optionally, these can be slightly smaller and larger (respectively) than the range specified in the source CSV.) The bins number is a recommendation for the number of bins into which to divide the values during processing, and is informational only. The number of bins can be specified on the command line and is a constant for all columns (it is not inferred by the tool) (ref. the --num_bins command-line argument). bins can be suppressed entirely by specifying a --num_bins less than or equal to 0. (bins is never included for date columns.)
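As a rough illustration of how a downstream program might use these fields, the sketch below encodes a categorical column using values/codes and bins a numeric column using min, max, and bins. Column names are taken from the example output above; the input file path is an assumption.

# Illustrative only: applying a generated schema to a dataset with pandas/numpy.
import json
import numpy as np
import pandas as pd

with open("parameters.json") as f:
  schema = json.load(f)["schema"]

df = pd.read_csv("input.csv")  # assumed to match the example input shown earlier

# Categorical: map each value to its integer code
state = schema["State"]
code_map = dict(zip(state["values"], state["codes"]))
df["State_code"] = df["State"].map(code_map)

# Numeric: split values into 'bins' equal-width buckets over [min, max]
age = schema["Age"]
edges = np.linspace(age["min"], age["max"], age["bins"] + 1)
df["Age_bin"] = pd.cut(df["Age"], bins=edges, include_lowest=True)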

Validating an Existing Schema

This tool can also be used to validate an existing schema file that was produced either by this tool or by hand. When running the tool from the command line, the produced schema files will be validated for you, but you can also run with the following (after activating your venv, if using one):

python validate.py -p output/parameters.json

or

python validate.py -d output/column_datatypes.json

The validator will check the format of the file against the associated JSON schema file and report any issues. It can be used to check the validity of schema files produced by this tool or produced by hand.
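If you would rather validate programmatically, something along these lines should work with the jsonschema package (the schema file path below is an assumption; see the json_schemae directory for the actual file names):

# Illustrative: validating a generated parameters.json against its JSON schema.
# The path to the JSON schema file is an assumption.
import json
import jsonschema

with open("json_schemae/parameters.json") as f:
  json_schema = json.load(f)
with open("output/parameters.json") as f:
  generated = json.load(f)

# Raises jsonschema.ValidationError if the file does not conform
jsonschema.validate(instance=generated, schema=json_schema)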

Running Tests

To run tests, from the main project directory do the following (after activating your venv, if using one):

To run all tests:

python -m unittest

To run a single test:

python -m unittest -k test__build_schema (or other test name)

Running PyLint

This code attempts to conform as closely as possible to the [Google Style Guide](https://google.github.io/styleguide/pyguide.html). A pylint file is included in this project's root directory, taken from that style guide. If you make changes to the codebase and want to run pylint against them, you can do so by running the following (from the root directory, with an appropriate environment activated):

pylint <name of thing to lint>

For instance, to run pylint against the full SchemaGenerator module itself, run:

pylint schemagen

Or, to run against the main.py script:

pylint main.py