Sage-Bionetworks/schematic

Error validating manifest

andrewelamb opened this issue · 8 comments

Describe the bug
Running schematic model -c ../schematic/config.yml validate -mp synapse_storage_manifest.csv -dt Patients causes the error below.
schema

Expected behavior
Either the manifest validates, or the output clearly describes what is wrong with the manifest.

Priority (select one)

  • Minor ⬇️
  • Major 📢
  • Critical 🆘
(schematicpy-gKAjXOOq-py3.9) (base) alamb@ALamb:~/repos/iatlasManifests$ schematic model -c ../schematic/config.yml validate -mp synapse_storage_manifest.csv -dt Patients
Starting schematic...
The (model > input > location) argument with value '../iAtlasSchema/iatlas_schema.jsonld' is being read from the config file.
The (model > input > file_type) argument with value 'local' is being read from the config file.
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as ../iAtlasSchema/iatlas_schema.Patients.schema.json
FileDataContext loading zep config
GxConfig.parse_yaml() failed with errors - [{'loc': ('xdatasources',), 'msg': 'field required', 'type': 'value_error.missing'}]
GxConfig.parse_yaml() returning empty `xdatasources`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
EphemeralDataContext has not implemented `_load_zep_config()` returning empty `GxConfig`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
warning: /home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/jinja2/environment.py:1088: DeprecationWarning: 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
warning:   return concat(self.root_render_func(self.new_context(vars)))
	10 expectation(s) included in expectation_suite.
Calculating Metrics:  78%|███████▊  | 52/67 [00:00<00:00, 131.28it/s]
warning: /home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/jinja2/environment.py:1088: DeprecationWarning: 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
warning:   return concat(self.root_render_func(self.new_context(vars)))
warning: /home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/jinja2/environment.py:1088: DeprecationWarning: 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
warning:   return concat(self.root_render_func(self.new_context(vars)))
Traceback (most recent call last):
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/bin/schematic", line 6, in <module>
    sys.exit(main())
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/alamb/.cache/pypoetry/virtualenvs/schematicpy-gKAjXOOq-py3.9/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/home/alamb/repos/schematic/schematic/models/commands.py", line 232, in validate_manifest
    errors, warnings = metadata_model.validateModelManifest(
  File "/home/alamb/repos/schematic/schematic/models/metadata.py", line 254, in validateModelManifest
    errors, warnings, manifest = validate_all(self, errors, warnings, manifest, manifestPath, self.sg, jsonSchema, restrict_rules, project_scope)
  File "/home/alamb/repos/schematic/schematic/models/validate_manifest.py", line 253, in validate_all
    manifest, vmr_errors, vmr_warnings = vm.validate_manifest_rules(manifest, sg, restrict_rules, project_scope)
  File "/home/alamb/repos/schematic/schematic/models/validate_manifest.py", line 158, in validate_manifest_rules
    errors, warnings = ge_helpers.generate_errors(
  File "/home/alamb/repos/schematic/schematic/models/GE_Helpers.py", line 405, in generate_errors
    observed_type=result_dict['result']['observed_value']
KeyError: 'observed_value'

Related to inRange rule.

Upon investigation, the observed_value was missing from the result dictionary in the variable result_dict, but there was a dictionary value for the exception_info key indicating that an exception was raised during the running of the expectation suite.
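The failure mode described above can be sketched with plain dictionaries. This is only an illustration, not the code in GE_Helpers.py: `ValidationRunError` and `observed_value_or_raise` are hypothetical names, and the `raised_exception` and `exception_message` keys are assumed to sit alongside the `exception_traceback` key seen in the trace.

```python
class ValidationRunError(Exception):
    """Hypothetical error type standing in for the library's own exception."""


def observed_value_or_raise(result_dict):
    """Return the observed value, or surface the exception recorded for the expectation."""
    # Assumed layout: exception details live under 'exception_info'.
    exception_info = result_dict.get("exception_info") or {}
    if exception_info.get("raised_exception"):
        # Prefer the stored traceback; fall back to the message if it is absent.
        detail = (
            exception_info.get("exception_traceback")
            or exception_info.get("exception_message")
        )
        raise ValidationRunError(detail)
    return result_dict["result"]["observed_value"]


# A result shaped like the failing case: no 'observed_value', but exception details present.
failed = {
    "success": False,
    "result": {},
    "exception_info": {
        "raised_exception": True,
        "exception_message": (
            "Column values, min_value, and max_value must either be None "
            "or of the same type."
        ),
        "exception_traceback": "Traceback (most recent call last): ...",
    },
}

# A result shaped like a normal case, where the lookup succeeds.
ok = {
    "success": True,
    "result": {"observed_value": "int64"},
    "exception_info": {"raised_exception": False},
}
```

Checking for a recorded exception before reading `observed_value` turns the opaque `KeyError` into an error that carries the underlying cause.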

I've added functionality to GE_Helpers.generate_errors to parse and raise any exceptions raised during GE validation. In this case the trace is as displayed below:

Exception has occurred: GreatExpectationsError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Traceback (most recent call last):
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 650, in _process_direct_and_bundled_metric_computation_configurations
    ] = metric_computation_configuration.metric_fn(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\metric_provider.py", line 90, in inner_func
    return metric_fn(*args, **kwargs)
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\map_metric_provider.py", line 371, in inner_func
    meets_expectation_series = metric_fn(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 205, in _pandas
    return temp_column.map(is_between)
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\series.py", line 4539, in map
    new_values = self._map_values(arg, na_action=na_action)
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\base.py", line 890, in _map_values
    new_values = map_f(values, mapper)
  File "pandas\_libs\lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 141, in is_between
    raise TypeError(
TypeError: Column values, min_value, and max_value must either be None or of the same type.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\validator\validation_graph.py", line 272, in _resolve
    self._execution_engine.resolve_metrics(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 375, in resolve_metrics
    return self._process_direct_and_bundled_metric_computation_configurations(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 654, in _process_direct_and_bundled_metric_computation_configurations
    raise gx_exceptions.MetricResolutionError(
great_expectations.exceptions.exceptions.MetricResolutionError: Column values, min_value, and max_value must either be None or of the same type.
  File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\GE_Helpers.py", line 416, in generate_errors
    raise GreatExpectationsError(result_dict['exception_info']['exception_traceback'])
  File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\validate_manifest.py", line 158, in validate_manifest_rules
    errors, warnings = ge_helpers.generate_errors(
  File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\validate_manifest.py", line 253, in validate_all
    manifest, vmr_errors, vmr_warnings = vm.validate_manifest_rules(manifest, sg, restrict_rules, project_scope)
  File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\metadata.py", line 254, in validateModelManifest
    errors, warnings, manifest = validate_all(self, errors, warnings, manifest, manifestPath, self.sg, jsonSchema, restrict_rules, project_scope)
  File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\models\commands.py", line 232, in validate_manifest
    errors, warnings = metadata_model.validateModelManifest(
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\gjordan\Documents\GitHub\schematic\schematic\__main__.py", line 45, in <module>
    main()
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\gjordan\anaconda3\envs\schematic\Lib\runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
great_expectations.exceptions.exceptions.GreatExpectationsError: Traceback (most recent call last):
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 650, in _process_direct_and_bundled_metric_computation_configurations
    ] = metric_computation_configuration.metric_fn(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\metric_provider.py", line 90, in inner_func
    return metric_fn(*args, **kwargs)
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\map_metric_provider.py", line 371, in inner_func
    meets_expectation_series = metric_fn(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 205, in _pandas
    return temp_column.map(is_between)
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\series.py", line 4539, in map
    new_values = self._map_values(arg, na_action=na_action)
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\pandas\core\base.py", line 890, in _map_values
    new_values = map_f(values, mapper)
  File "pandas\_libs\lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\expectations\metrics\column_map_metrics\column_values_between.py", line 141, in is_between
    raise TypeError(
TypeError: Column values, min_value, and max_value must either be None or of the same type.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\validator\validation_graph.py", line 272, in _resolve
    self._execution_engine.resolve_metrics(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 375, in resolve_metrics
    return self._process_direct_and_bundled_metric_computation_configurations(
  File "C:\Users\gjordan\anaconda3\envs\schematic\lib\site-packages\great_expectations\execution_engine\execution_engine.py", line 654, in _process_direct_and_bundled_metric_computation_configurations
    raise gx_exceptions.MetricResolutionError(
great_expectations.exceptions.exceptions.MetricResolutionError: Column values, min_value, and max_value must either be None or of the same type.

The issue appears to be related to the types of entries in the manifest. In the manifest provided, NA values were entered, and these are converted to empty strings during import. I believe the error arises because the column then holds both string and numerical values, which are compared against the numerical range bounds.

As part of the PR I've allowed cross-type comparisons so that this error will not be raised, but the NA values will still be counted as "out of range" and display an error or warning.
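As a rough illustration of the two behaviours (not the Great Expectations implementation, and not the code in the PR), the sketch below contrasts a same-type check with a tolerant cross-type comparison. Both function names and the 0 to 120 range are assumptions made for the example.

```python
def strict_is_between(value, min_value, max_value):
    """Mimic a same-type check: mixing strings and numbers raises TypeError."""
    if isinstance(value, str) != isinstance(min_value, str):
        raise TypeError(
            "Column values, min_value, and max_value must either be None "
            "or of the same type."
        )
    return min_value <= value <= max_value


def tolerant_is_between(value, min_value, max_value):
    """Cross-type comparison: anything that is not a number counts as out of range."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return min_value <= value <= max_value


# "" stands for what an NA entry becomes after import.
column = [34, 61, "", 47]

# Hypothetical range bounds for an age column.
in_range = [tolerant_is_between(v, 0, 120) for v in column]
```

With the tolerant version the empty string no longer aborts the run; it is reported as out of range like any other failing value.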

@andrewelamb's manifest seems to be another use case for #980
cc'ing @MiekoHash @milen-sage to prioritize

@GiaJordan Could you elaborate on 'NA' values? Should they be stored as something else in the CSV?

In your manifest, some values are specified as NA for an attribute with the inRange rule. They're converted to empty strings "" when imported. Ideally they wouldn't be strings; they'd be treated as numbers too, but we can add support for that with #980.
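Until that support exists, one option is to look for such entries before running validation. The sketch below is a standalone check using only the standard library, not part of schematic; the function name, the `patient_id` column, and the sample data are made up for illustration, and the row numbers are file line numbers, which may not match the numbering schematic reports.

```python
import csv
import io


def non_integer_rows(csv_text, column):
    """Return (row_number, raw_value) pairs whose value does not parse as an int."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Start at 2 because the header occupies line 1 of the file.
    for row_number, row in enumerate(reader, start=2):
        raw = row[column].strip()
        try:
            int(raw)
        except ValueError:
            bad.append((row_number, raw))
    return bad


# Hypothetical manifest excerpt with one NA entry.
sample = "patient_id,age_at_diagnosis\np1,54\np2,NA\np3,67\n"
flagged = non_integer_rows(sample, "age_at_diagnosis")
```

Here `flagged` lists the NA entry with its line number, which makes it easier to decide whether to leave the cell as is or replace it.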

@GiaJordan I'm now seeing the error below. This is what you were expecting for NAs in columns with the inRange rule until #980 is addressed, correct?

schematic model -c config.yml validate -mp synapse_storage_manifest.csv -dt Patients
WARNING: [2023-02-16 08:06:19] py.warnings - /home/alamb/miniconda3/lib/python3.9/inspect.py:351: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  value = getattr(object, key)

WARNING: [2023-02-16 08:06:19] py.warnings - /home/alamb/miniconda3/lib/python3.9/inspect.py:351: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  value = getattr(object, key)

WARNING: [2023-02-16 08:06:19] py.warnings - /home/alamb/miniconda3/lib/python3.9/inspect.py:351: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  value = getattr(object, key)

Starting schematic...
The (model > input > location) argument with value '../iAtlasSchema/iatlas_schema.jsonld' is being read from the config file.
The (model > input > file_type) argument with value 'local' is being read from the config file.
JSON schema successfully generated from schema.org schema!
JSON schema file log stored as ../iAtlasSchema/iatlas_schema.Patients.schema.json
FileDataContext loading zep config
GxConfig.parse_yaml() failed with errors - [{'loc': ('xdatasources',), 'msg': 'field required', 'type': 'value_error.missing'}]
GxConfig.parse_yaml() returning empty `xdatasources`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
EphemeralDataContext has not implemented `_load_zep_config()` returning empty `GxConfig`
Loading 'datasources' ->
{}
Loaded 'datasources' ->
{}
	5 expectation(s) included in expectation_suite.
Calculating Metrics: 100%|██████████| 36/36 [00:00<00:00, 630.39it/s]
warning: On row 95 the attribute age_at_diagnosis does not contain the proper value type int.
error: age_at_diagnosis values in rows [95] are out of the specified range.
[[[95], 'age_at_diagnosis', 'age_at_diagnosis values in rows [95] are out of the specified range.', {''}]]

@andrewelamb yes, the "error: age_at_diagnosis values in rows [95] are out of the specified range." message is expected. The other warning should be addressed as well.