This repository contains a simple "schema generator" that uses the data from an input CSV to infer its schema. This is primarily intended for use by the teams participating in the NIST 2020 differential privacy challenge, as the schema generated by this tool mimics the schemae that were used for their challenge datasets. Teams (or others!) are welcome to include this module in their final projects or link to it on GitHub.
This tool assumes that the schema is learnable from the input dataset, and specifically that all categorical values are included in the input (and therefore can be inferred by this tool).
Additionally, by default this tool does not automatically pad or otherwise adjust min/max values for numeric and date ranges (it uses the actual min/max values that are in the data), so in order to fully protect privacy those values should be reviewed and adjusted as needed directly in the resultant JSON file.
Note that this approach of taking the schema from the input data is generally considered to be "cheating" when it comes to differential privacy! However, in a real world context and when attempting to make something that's as easy for researchers to use as possible, learning the schema from the input data is often justified.
Do not use this tool in situations where the input data is sparse enough that values that should be in the schema don’t actually occur in the data, or if there are hundreds of possible categorical values and some values aren’t included in the data. Additionally, don't use this tool if reading values from the input data itself could be considered a breach of privacy, or if generating a schema from those values would reveal information about the data that should be kept private.
The input file for this tool must meet the following requirements:
- It must be a CSV
- The first row must be a header row
- Each row should consist of the same number of columns as the header row (although rows may have blank or missing values for a given column)
The output will be two files (and/or Python dicts, if using as a module):
- `parameters.json`, which contains information such as the datatype (`dtype`), plus a list of acceptable values (for categorical variables) or a min/max and optional binning information (for numeric variables)
- `column_datatypes.json`, which contains a list of all the columns and their datatypes

These files will be formatted according to the JSON schemae provided in this tool's `json_schemae` directory, and can be validated using the validation tool provided (see below).
These examples were created from an input file that looked something like this:
```
ID,Age,Date,State
A-123,14,2012-01-01,CA
B-526,22,2014-06-09,CT
C-9725,87,2020-12-31,TX
...
```
The resulting `parameters.json`:

```json
{
  "schema": {
    "ID": {
      "dtype": "str",
      "kind": "text"
    },
    "Age": {
      "dtype": "uint8",
      "kind": "numeric",
      "min": 14,
      "max": 87,
      "bins": 10
    },
    "Date": {
      "dtype": "date",
      "kind": "date",
      "min": "2012-01-01 00:00:00",
      "max": "2020-12-31 00:00:00"
    },
    "State": {
      "dtype": "str",
      "kind": "categorical",
      "values": [
        "CA",
        "CT",
        "FL",
        "TX"
      ],
      "codes": [
        1,
        2,
        3,
        4
      ]
    }
  }
}
```
And the resulting `column_datatypes.json`:

```json
{
  "dtype": {
    "ID": "str",
    "Age": "uint8",
    "Date": "date",
    "State": "str"
  }
}
```
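As an aside, a mapping like this can be consumed directly by other code. The following is a sketch (not part of the tool) showing one way to use it to reload the original CSV with pandas; the file names `column_datatypes.json` and `input.csv` are just the example names used here:

```python
import json

import pandas as pd

# Load the generated column -> dtype mapping (example output shown above).
with open("column_datatypes.json") as f:
    dtypes = json.load(f)["dtype"]

# "date" is not a pandas dtype, so date columns are parsed separately.
date_cols = [col for col, dt in dtypes.items() if dt == "date"]
other_dtypes = {col: dt for col, dt in dtypes.items() if dt != "date"}

# "input.csv" is a placeholder for the original input file.
df = pd.read_csv("input.csv", dtype=other_dtypes, parse_dates=date_cols)
```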
This tool requires Python version 3.9 or above.
The `requirements.txt` file in the root directory contains information about the modules that this tool depends on.
It is strongly recommended that you use a Python virtual environment. If you are not familiar with Python virtual environments, [this](https://docs.python.org/3/library/venv.html) is a good place to start. (If you do not wish to use a virtual environment, you can jump straight to step 3 below.)
1. To create a virtual environment (this only needs to be done once), run the following from the root directory of this project:

   `python3 -m venv .env`

2. Then, any time you want to work on the project, start by activating the virtual environment. Full instructions for different platforms can be found in the Python documentation, but in general it will be some variation (depending on your platform) on:

   `source .env/bin/activate`

   (do this whenever you open a new command prompt and want to work with this project)

3. Once you have created and activated a virtual environment, you can install the requirements:

   `pip install -r requirements.txt`

   (do this only once)
The schema generator acts on the assumption that the provided datafile contains at least one record for every possible value for categorical variables. It will determine the datatype and, if relevant, build a list of acceptable values by looking at this set of data. When run from a command line, it can also prompt the user to review the generated schema and correct any situations where its inferences were incorrect.
To run the tool from a command line, do the following:
- (If using a virtual environment) Activate your venv (if you haven't already): `source .env/bin/activate` (or a variation depending on your platform, see above)
- Run the tool using its defaults: `python main.py /path/to/input.csv`
- After the tool has processed your input CSV file, two files will be written out to the current working directory: `parameters.json` and `column_datatypes.json`
The script provides a variety of command-line options, which can be viewed by running `python main.py -h`. They are:
Flag | Description | Default |
---|---|---|
`-o, --output_dir <DIR>` | An output directory into which the output files should be written. | current directory |
`-m, --max_categorical <INT>` | The maximum number of values a column can have to be considered categorical. Columns with more than this number of values will be treated as ranges or text. | 40 |
`-b, --num_bins <INT>` | Provide default "number of bins" information for numeric variables. This number will be included in the output schema for each numeric variable, for informational purposes only. If needed, values for individual variables should be adjusted directly in the output file. Set this to 0 to suppress output of bucketing information entirely. | 10 |
`-c, --categorical <LIST>` | A list of column names that the generator should always consider categorical, regardless of the number of values. Specify in quotes, as a comma-separated list (e.g. "Hospital Name, Gender"). Case-sensitive. | None |
`-g, --geographical <LIST>` | A list of column names that the generator should always consider as the special "geographical" kind. Geographical columns are considered categorical, but are labeled specially in order to facilitate geographical processing. Specify in quotes, as a comma-separated list (e.g. "State, Country"). Case-sensitive. | None |
`-s, --skip_columns <LIST>` | A list of column names that should be skipped entirely. Specify in quotes, as a comma-separated list (e.g. "ColA, Column B"). | None |
`-i, --include_na` | Include "NaN" as one of the categorical values when some rows don't have a value for that column. By default, it is omitted. | False |
`-t, --include_text_columns` | Include in the schema the names of columns that are non-categorical strings (kind "text"). By default, these are omitted. | False |
To parse a file called `foo.csv`, using an output directory called `output_dir`, and treating the column `ID` as categorical regardless of the number of values:

```
python main.py -o output_dir -c "ID" foo.csv
```
To parse a file called `input_files/foobar.csv`, specifying that any column with more than 10 values should not be considered categorical, and including "NaN" as a value if empty values are present in that column:

```
python main.py -m 10 -i input_files/foobar.csv
```
The tool can also be used as a module within a larger program. For details, please refer to the documentation available here.
In brief:
```python
import schemagen

# Create a generator and parse the input CSV
schema_generator = schemagen.SchemaGenerator()
input_file = "/path/to/file.csv"
schema_generator.read_and_parse_csv(input_file)

# Retrieve the generated parameters (the contents of parameters.json)
parameters_json = schema_generator.get_parameters_json()
```
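From there, assuming `get_parameters_json()` returns a plain Python dict (per the "and/or Python dicts" note in the output description above), you could write the result to disk yourself with the standard library, something like:

```python
import json

# Write the generated parameters out to a file. This assumes parameters_json
# is a plain Python dict, as the output description above suggests.
with open("parameters.json", "w") as f:
    json.dump(parameters_json, f, indent=2)
```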
The schema that is produced is a JSON file, which conforms to the JSON schema file included here. The schema contains one root key (`schema`) that itself contains a JSON object with one key per column. Only those columns that should be processed are included in the schema (ref. the `--skip_columns` and `--include_text_columns` command-line arguments).
The value for each column key is a JSON object (a Python dict), which itself contains the following keys:

- `dtype` (required): A string representing the native datatype for this column. Available values are: `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`, `str`, `float`, or `date`.
- `kind` (required): A string representing what type of data this column contains. Available values are:
  - `categorical`: This column represents a variable that can take on one of a limited number of possible values. For `categorical` columns, the schema will include a `values` array, containing the actual values that are present in the CSV file, as well as a `codes` array, containing an integer mapping for each value, which can be used during processing. Categorical columns are determined based on the number of values included in the source CSV (ref. also the `--max_categorical` and `--categorical` command-line arguments). By default, they do not contain `NaN` as one of the available values, even if some rows of the source data are missing a value (ref. the `--include_na` command-line argument).
  - `geographical`: A special type of `categorical` variable that contains information that is semantically geographic in nature. To use geographical columns, they must be specified on the command line when running the tool (ref. the `--geographical` command-line argument).
  - `numeric`: This column is strictly numeric in nature. Numeric columns will have a `min`, `max`, and (optionally) `bins` number specified (ref. the `--num_bins` command-line argument).
  - `date`: This column represents a date. The tool attempts to parse the values from the column as a datetime, and if successful, assumes that the column contains date information. A `min` and `max` date will be included. Date columns should be double-checked for accuracy, as the parsing attempt is inexact.
  - `text`: By default, text columns are not included in the schema. A column is considered to be `text` if it contains more than `max_categorical` possible values and the values are non-numeric and not parsable as dates (ref. the `--include_text_columns` command-line argument).
  - `id`: The schema generator will never mark anything as an `id` kind of column, but it is valid to give a particular column a kind of `id` manually.
- `values` and `codes` (only when `kind` is `categorical` or `geographical`): Two arrays containing the actual values present in the input CSV, as well as an integer code that can be used when processing the data. Note that for situations where the `values` themselves are integers, they may be the same as or different from the integer `codes`. The `dtype` of this column will represent the actual datatype of the `values`. For instance, if the values present in the CSV are `M`, `F`, `O`, the `dtype` will be `str`, the `values` array will be `["M", "F", "O"]`, and the `codes` array will be `[1, 2, 3]`.
- `min`, `max`, and `bins` (only when `kind` is `numeric` or `date`): The `min` value and the `max` value represent the range of the values. (TODO: Optionally, these can be slightly smaller and larger (respectively) than the range specified in the source CSV.) The `bins` number is a recommendation for the number of bins into which to divide the values during processing, and is informational only. The number of bins can be specified on the command line and is a constant for all columns (it is not inferred by the tool) (ref. the `--num_bins` command-line argument). `bins` can be suppressed entirely by specifying a `--num_bins` less than or equal to `0`. (`bins` is never included for `date` columns.) Both `values`/`codes` and `min`/`max`/`bins` are illustrated in the sketch after this list.
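To make the relationship between these keys concrete, here is a small sketch (not part of the tool itself) that reads the example `parameters.json` shown earlier, uses `values`/`codes` to decode a categorical column, and uses `min`/`max`/`bins` to compute equal-width bucket edges for a numeric column; the column names come from that example:

```python
import json

with open("parameters.json") as f:
    schema = json.load(f)["schema"]

# Map integer codes back to their categorical values for the "State" column.
state = schema["State"]
code_to_value = dict(zip(state["codes"], state["values"]))
print(code_to_value[3])  # -> "FL"

# Use min/max/bins to compute equal-width bucket edges for the "Age" column.
age = schema["Age"]
width = (age["max"] - age["min"]) / age["bins"]
edges = [age["min"] + i * width for i in range(age["bins"] + 1)]
print(edges[0], edges[-1])  # -> 14.0 87.0
```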
This tool can also be used to validate an existing schema file that was produced either by this tool or by hand. When running the tool from the command line, the produced schema files will be validated for you, but you can also run with the following (after activating your venv, if using one):
python validate.py -p output/parameters.json
or
python validate.py -d output/column_datatypes.json
The validator will check the format of the file against the associated JSON schema file and report any issues. It can be used to check the validity of schema files produced by this tool or produced by hand.
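If you would rather perform the same check from your own code, a rough equivalent using the third-party `jsonschema` package looks like the following sketch (the schema file name used here is hypothetical; substitute the actual file from the `json_schemae` directory):

```python
import json

import jsonschema

# Hypothetical schema file name; use the actual file from json_schemae.
with open("json_schemae/parameters.schema.json") as f:
    schema = json.load(f)

with open("output/parameters.json") as f:
    instance = json.load(f)

# Raises jsonschema.exceptions.ValidationError if the file does not conform.
jsonschema.validate(instance=instance, schema=schema)
print("output/parameters.json is valid")
```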
To run tests, from the main project directory do the following (after activating your venv, if using one):
To run all tests:
python -m unittest
To run a single test:
python -m unittest -k test__build_schema
(or other test name)
This code attempts to conform as closely as possible to the [Google Style Guide](https://google.github.io/styleguide/pyguide.html). A pylint configuration file, taken from that style guide, is included in this project's root directory. If you make changes to the codebase and want to run pylint against them, you can do so by running (from the root directory, with an appropriate environment activated):
pylint <name of thing to lint>
For instance, to run pylint against the full SchemaGenerator module itself, run:
pylint schemagen
Or, to run against the `main.py` script:
pylint main.py