Error validating manifest
andrewelamb opened this issue · 8 comments
Describe the bug
schematic model -c ../schematic/config.yml validate -mp synapse_storage_manifest.csv -dt Patients
Causes error below
schema
Expected behavior
Either the manifest validates, or the output clearly describes what is wrong with the manifest
Priority (select one)
- Minor
- Major
- Critical
(schematicpy-gKAjXOOq-py3.9) (base) alamb@ALamb:~/repos/iatlasManifests$ schematic model -c ../schematic/config.yml validate -mp synapse_storage_manifest.csv -dt Patients
Starting schematic...
The (model > input > location) argument with value '../iAtlasSchema/iatlas_schema.jsonld' is being read from the config file.
The (model > input > file_type) argument with value 'local' is being read from the config file.
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as ../iAtlasSchema/iatlas_schema.Patients.schema.json
FileDataContext loading zep config
GxConfig.parse_yaml() failed with errors - [{'loc': ('xdatasources',), 'msg': 'field required', 'type': 'value_error.missing'}]
GxConfig.parse_yaml() returning empty `xdatasources`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
EphemeralDataContext has not implemented `_load_zep_config()` returning empty `GxConfig`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
warning: /home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/jinja2/environment.py:1088: DeprecationWarning: 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
warning: return concat(self.root_render_func(self.new_context(vars)))
10 expectation(s) included in expectation_suite.
Calculating Metrics: 78%|████████  | 52/67 [00:00<00:00, 131.28it/s]
warning: /home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/jinja2/environment.py:1088: DeprecationWarning: 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
warning: return concat(self.root_render_func(self.new_context(vars)))
warning: /home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/jinja2/environment.py:1088: DeprecationWarning: 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
warning: return concat(self.root_render_func(self.new_context(vars)))
Traceback (most recent call last):
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/bin/schematic", line 6, in <module>
sys.exit(main())
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "/home/alamb/repos/schematic/schematic/models/commands.py", line 232, in validate_manifest
errors, warnings = metadata_model.validateModelManifest(
File "/home/alamb/repos/schematic/schematic/models/metadata.py", line 254, in validateModelManifest
errors, warnings, manifest = validate_all(self, errors, warnings, manifest, manifestPath, self.sg, jsonSchema, restrict_rules, project_scope)
File "/home/alamb/repos/schematic/schematic/models/validate_manifest.py", line 253, in validate_all
manifest, vmr_errors, vmr_warnings = vm.validate_manifest_rules(manifest, sg, restrict_rules, project_scope)
File "/home/alamb/repos/schematic/schematic/models/validate_manifest.py", line 158, in validate_manifest_rules
errors, warnings = ge_helpers.generate_errors(
File "/home/alamb/repos/schematic/schematic/models/GE_Helpers.py", line 405, in generate_errors
observed_type=result_dict['result']['observed_value']
KeyError: 'observed_value'
manifest:
synapse_storage_manifest.csv
Related to inRange rule.
Upon investigation, the observed_value key was missing from the result dictionary in the variable result_dict, but there was a dictionary value for the exception_info key, indicating that an exception was raised while the expectation suite was running.
I've added functionality to GE_Helpers.generate_errors to parse and raise any exceptions raised during GE validation. In this case the trace is as displayed below:
Exception has occurred: GreatExpectationsError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Traceback (most recent call last):
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 650, in _process_direct_and_bundled_metric_computation_configurations
] = metric_computation_configuration.metric_fn(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\metric_provider.py", line 90, in inner_func
return metric_fn(*args, **kwargs)
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\map_metric_provider.py", line 371, in inner_func
meets_expectation_series = metric_fn(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 205, in _pandas
return temp_column.map(is_between)
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\series.py", line 4539, in map
new_values = self._map_values(arg, na_action=na_action)
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\base.py", line 890, in _map_values
new_values = map_f(values, mapper)
File "pandas\_libs\lib.pyx", line 2924, in pandas._libs.lib.map_infer
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 141, in is_between
raise TypeError(
TypeError: Column values, min_value, and max_value must either be None or of the same type.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\validator\validation_graph.py", line 272, in _resolve
self._execution_engine.resolve_metrics(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 375, in resolve_metrics
return self._process_direct_and_bundled_metric_computation_configurations(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 654, in _process_direct_and_bundled_metric_computation_configurations
raise gx_exceptions.MetricResolutionError(
great_expectations.exceptions.exceptions.MetricResolutionError: Column values, min_value, and max_value must either be None or of the same type.
File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\GE_Helpers.py", line 416, in generate_errors
raise GreatExpectationsError(result_dict['exception_info']['exception_traceback'])
File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\validate_manifest.py", line 158, in validate_manifest_rules
errors, warnings = ge_helpers.generate_errors(
File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\validate_manifest.py", line 253, in validate_all
manifest, vmr_errors, vmr_warnings = vm.validate_manifest_rules(manifest, sg, restrict_rules, project_scope)
File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\metadata.py", line 254, in validateModelManifest
errors, warnings, manifest = validate_all(self, errors, warnings, manifest, manifestPath, self.sg, jsonSchema, restrict_rules, project_scope)
File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\commands.py", line 232, in validate_manifest
errors, warnings = metadata_model.validateModelManifest(
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\decorators.py", line 38, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\__main__.py", line 45, in <module>
main()
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\runpy.py", line 196, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
great_expectations.exceptions.exceptions.GreatExpectationsError: Traceback (most recent call last):
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 650, in _process_direct_and_bundled_metric_computation_configurations
] = metric_computation_configuration.metric_fn(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\metric_provider.py", line 90, in inner_func
return metric_fn(*args, **kwargs)
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\map_metric_provider.py", line 371, in inner_func
meets_expectation_series = metric_fn(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 205, in _pandas
return temp_column.map(is_between)
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\series.py", line 4539, in map
new_values = self._map_values(arg, na_action=na_action)
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\base.py", line 890, in _map_values
new_values = map_f(values, mapper)
File "pandas\_libs\lib.pyx", line 2924, in pandas._libs.lib.map_infer
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 141, in is_between
raise TypeError(
TypeError: Column values, min_value, and max_value must either be None or of the same type.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\validator\validation_graph.py", line 272, in _resolve
self._execution_engine.resolve_metrics(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 375, in resolve_metrics
return self._process_direct_and_bundled_metric_computation_configurations(
File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 654, in _process_direct_and_bundled_metric_computation_configurations
raise gx_exceptions.MetricResolutionError(
great_expectations.exceptions.exceptions.MetricResolutionError: Column values, min_value, and max_value must either be None or of the same type.
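A minimal sketch of the kind of guard described above, not the code from the PR. The function name extract_observed_value and the ValidationSuiteError class are hypothetical stand-ins, and the result dictionary only mirrors the keys mentioned in this thread (result, observed_value, exception_info, exception_traceback):

```python
class ValidationSuiteError(Exception):
    """Hypothetical stand-in for the GreatExpectationsError raised in the trace."""


def extract_observed_value(result_dict):
    """Return the observed value, or raise if the expectation itself errored.

    Checking exception_info first avoids the bare KeyError on
    'observed_value' and surfaces the underlying traceback instead.
    """
    exception_info = result_dict.get("exception_info") or {}
    traceback_text = exception_info.get("exception_traceback")
    if traceback_text:
        raise ValidationSuiteError(traceback_text)
    return result_dict["result"]["observed_value"]


# Illustrative result dictionaries (made-up values, not real GE output).
ok_result = {
    "result": {"observed_value": 42},
    "exception_info": {"exception_traceback": None},
}
failed_result = {
    "result": {},
    "exception_info": {
        "exception_traceback": "TypeError: Column values, min_value, and "
        "max_value must either be None or of the same type."
    },
}
```

With this shape, the failing case reports the TypeError from the expectation rather than a KeyError from the error-reporting code.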
The issue appears to be related to the types of entries in the manifest. In the manifest provided, there are NA values entered that get converted to empty strings during import. I believe the error arises because the column then holds both string and numerical values, and these are compared against the numerical min and max values.
As part of the PR I've allowed cross-type comparisons so that this error will not be raised, but the NA values will still be counted as "out of range" and display an error or warning.
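The type mismatch can be reproduced in plain Python, independent of Great Expectations. The sketch below is an illustration only: in_range_strict and in_range_lenient are hypothetical helpers, the column values are made up, and the lenient variant is just one way to get the "tolerate cross-type comparisons, but count them as out of range" behaviour described above.

```python
def in_range_strict(value, min_value, max_value):
    # Raises TypeError when value is a str and the bounds are numbers.
    return min_value <= value <= max_value


def in_range_lenient(value, min_value, max_value):
    # Treat values that cannot be compared to the bounds as out of range
    # instead of letting the TypeError abort the whole validation run.
    try:
        return min_value <= value <= max_value
    except TypeError:
        return False


# A numeric column where an NA entry became an empty string on import.
column = [34, 61, "", 47]

flags = [in_range_lenient(v, 0, 120) for v in column]
# flags == [True, True, False, True]

# 0-based positions in the list, not manifest row numbers.
out_of_range_rows = [i for i, ok in enumerate(flags) if not ok]
# out_of_range_rows == [2]
```

Calling in_range_strict("", 0, 120) raises a TypeError, which is the same class of failure as the one in the trace.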
@andrewelamb's manifest seems to be another use case for #980
cc'ing @MiekoHash @milen-sage to prioritize
@GiaJordan Could you elaborate on 'NA' values? Should they be stored as something else in the CSV?
In your manifest, you have some values specified as NA for an attribute with the inRange rule. They're converted to empty strings "" when imported. Ideally they wouldn't be strings; they'd be numbers too, but we can add support for that with #980
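One possible way to keep such a column free of strings is to map missing entries to None while parsing, since the Great Expectations error message above says values may be None or of the same type as the bounds. This is only a sketch under that assumption: parse_cell is a hypothetical helper, the sample cells are made up, and nothing here is taken from #980.

```python
def parse_cell(cell):
    """Parse a raw CSV cell into an int, mapping NA or blank cells to None."""
    text = cell.strip()
    if text == "" or text.upper() == "NA":
        return None
    return int(text)


raw_cells = ["34", "NA", "", "61"]
parsed = [parse_cell(cell) for cell in raw_cells]
# parsed == [34, None, None, 61]
```

Whether missing values should then pass or fail the inRange rule is a separate decision for the validation logic.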
@GiaJordan I'm now seeing the error below. This is what you were expecting for NAs in columns with the inRange rule until #980 is addressed, correct?
schematic model -c config.yml validate -mp synapse_storage_manifest.csv -dt Patients
WARNING: [2023-02-16 08:06:19] py.warnings - /home/alamb/miniconda3/lib/python3.9/inspect.py:351: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
value = getattr(object, key)
WARNING: [2023-02-16 08:06:19] py.warnings - /home/alamb/miniconda3/lib/python3.9/inspect.py:351: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
value = getattr(object, key)
WARNING: [2023-02-16 08:06:19] py.warnings - /home/alamb/miniconda3/lib/python3.9/inspect.py:351: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
value = getattr(object, key)
Starting schematic...
The (model > input > location) argument with value '../iAtlasSchema/iatlas_schema.jsonld' is being read from the config file.
The (model > input > file_type) argument with value 'local' is being read from the config file.
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as ../iAtlasSchema/iatlas_schema.Patients.schema.json
FileDataContext loading zep config
GxConfig.parse_yaml() failed with errors - [{'loc': ('xdatasources',), 'msg': 'field required', 'type': 'value_error.missing'}]
GxConfig.parse_yaml() returning empty `xdatasources`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
EphemeralDataContext has not implemented `_load_zep_config()` returning empty `GxConfig`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
5 expectation(s) included in expectation_suite.
Calculating Metrics: 100%|██████████| 36/36 [00:00<00:00, 630.39it/s]
warning: On row 95 the attribute age_at_diagnosis does not contain the proper value type int.
error: age_at_diagnosis values in rows [95] are out of the specified range.
[[[95], 'age_at_diagnosis', 'age_at_diagnosis values in rows [95] are out of the specified range.', {''}]]
@andrewelamb yes, the "error: age_at_diagnosis values in rows [95] are out of the specified range." error is expected. The other warning should be addressed as well