This is a Python package for validating Overlays Capture Architecture (OCA) data sets. It includes three classes: OCADataSet
, OCADataSetErr
, and OCABundle
. For more information about OCA, please check OCA Specification v1.0.0.
-
OCADataSet
represents an OCA data set to be validated, and can be loaded from a pandas DataFrame, an OCA Excel Data Entry File, or a CSV file. -
OCADataSetErr
represents the result set of an OCA data set validation. This class is generated by the data set validation, contains all the error information, and also provides three methods for a quick overview:overview()
,first_error_col()
, andget_error_col(attr_name)
. -
OCABundle
represents schema overlays from a loaded.json
OCA bundle used to validate the data set.
- pandas
- pathlib
Install the package by typing pip install oca-ds-validator==0.0.2
to the console. Then you could import the classes from any Python scripts.
- The package can be found here: oca_ds_validator
- Import the OCA Bundle using
OCABundle(path)
. - Import the OCA Data Set using
OCADataSet(pandas_dataframe)
orOCADataSet.from_path(path)
. - Generate the validation result using
validate()
method for classOCABundle
.
from oca_ds_validator import OCADataSet, OCADataSetErr, OCABundle
test_bundle = OCABundle("/path/to/oca/bundle.json")
test_data = OCADataSet(data_set_dataframe)
# test_data = OCADataSet.from_path("/path/to/oca/data_entry_file.xlsx")
# test_data = OCADataSet.from_path("/path/to/oca/data_set_file.csv")
test_rslt = test_bundle.validate(test_data)
#########################################################################################
# Example of a possible test_rslt:
# attr_err:
# [('missing_attribute',
# 'Missing attribute (attribute not found in the data set).'),
# ('unmatched_attribute',
# 'Unmatched attribute (attribute not found in the OCA Bundle).')]
# format_err:
# {'attribute_with_format_error_on_row_0': {0: 'Format mismatch.'},
# 'array_attribute_without_array_data_on_row_0': {0: 'Valid array required.'},
# 'attribute_with_format_error_on_row_42': {42: 'Format mismatch.'},
# 'attribute_with_errors_on_row_0_and_1': {0: 'Format mismatch.',
# 1: 'Valid array required.'},
# 'mandatory_attribute_with_missing_data': {0: 'Missing mandatory attribute.'},
# 'attribute_without_error': {}}
# ecode_err: # Not matching any of the entry codes
# {'attribute_with_entry_codes': {0: 'One of the entry codes required.'}}
#########################################################################################
There are three optional boolean arguments to control the message printed.
argument | default value | usage |
---|---|---|
show_data_preview |
False |
If enabled, prints a pandas preview of the data set before validation. |
enable_flagged_alarm |
True |
If enabled, prints a warning message for the existence of flagged attributes. |
enable_version_alarm |
True |
If enabled, prints a warning message for each overlay that contains an OCA version number different from the development version of this script (1.0). |
The errors of the data set is stored in the generated OCADataSetErr
class.
# Prints a brief summary of errors.
test_rslt.overview()
#########################################################################################
# Attribute error.
# {'missing_attribute'} found in the OCA Bundle but not in the data set;
# {'unmatched_attribute'} found in the data set but not in the OCA Bundle.
# Found 3 problematic row(s) in the following attribute(s):
# {'attribute_with_format_error_on_row_0',
# 'array_attribute_without_array_data_on_row_0',
# 'attribute_with_format_error_on_row_42',
# 'attribute_with_errors_on_row_0_and_1',
# 'mandatory_attribute_with_missing_data',
# 'attribute_with_entry_codes'}
#########################################################################################
# Prints the information of the first problematic column.
test_rslt.first_err_col()
#########################################################################################
# The first problematic column is: attribute_with_format_error_on_row_0
# Format error(s) would occur in the following rows:
# row 0 : Format mismatch.
# No entry code error found in the column.
#########################################################################################
# Prints the information of some certain column.
test_rslt.get_err_col("attribute_with_format_error_on_row_42")
#########################################################################################
# Format error(s) would occur in the following rows of column
# attribute_with_format_error_on_row_42:
# row 42 : Format mismatch.
# No entry code error found in the column.
#########################################################################################
# Get objects of full error details.
# You may find it useful for data visualization or further analysis.
test_rslt.get_attr_err()
test_rslt.get_format_err()
test_rslt.get_ecode_err()
test_rslt.get_char_encode_err()
This script is created with support by Agri-food Data Canada, funded by CFREF through the Food from Thought grant held at the University of Guelph. Currently, we do not provide any warranty of any kind regarding the accuracy, security, completeness or reliability of this script or any of its parts.
At the moment, this script is developed for the validation of the following OCA attribute types:
- Text (with regular expressions)
- DateTime (with ISO 8601 formats)
- Array[Type]; for any Types that are not mentioned above, only the validness of the array will be checked.
Also, besides the format overlay, the data set will be validated with the following overlays:
- Conformance Overlay, for any missing mandatory data
- Entry Code Overlay, for any data that mismatches entry codes
- Character Encoding, for any data that mismatches the specified character encoding
JSON data types are NOT validated due to the type coercion while importing Excel or CSV files. We also recommend that you import the data set as Pandas DataFrame to prevent unexpected DateTime formatting by software such as Microsoft Excel.
Any validation errors other than the above are NOT guaranteed to be filtered by this script. Please feel free to contact us with any suggestions for future development.
You could also find a well-developed OCA Validator by The Human Colossus Lab (Rust required).
EUPL (European Union Public License), version 1.2
We have distilled the most crucial license specifics to make your adoption seamless: see here for details.