equinor/gordo

Error in mlflow reporting Enum Error code ...

flikka opened this issue · 2 comments

Seems to be something with the mlflow logging.

ioc-1901 doesn't have models built. In the last workflow, ioc-1901-1581661405269-zw245, the first model builder (ioc-1901-1581661405269-zw245-4084157029) has this stack trace:
Gordo version 0.50.0

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gordo/cli/cli.py", line 150, in build
    machine_out.report()
  File "/usr/local/lib/python3.7/site-packages/gordo/machine/machine.py", line 137, in report
    reporter.report(self)
  File "/usr/local/lib/python3.7/site-packages/gordo/reporters/mlflow.py", line 441, in report
    log_machine(mlflow_client, run_id, machine)
  File "/usr/local/lib/python3.7/site-packages/gordo/reporters/mlflow.py", line 413, in log_machine
    mlflow_client.log_batch(run_id, **get_batch_kwargs(machine))
  File "/usr/local/lib/python3.7/site-packages/mlflow/tracking/client.py", line 242, in log_batch
    self._tracking_client.log_batch(run_id, metrics, params, tags)
  File "/usr/local/lib/python3.7/site-packages/mlflow/tracking/_tracking_service/client.py", line 231, in log_batch
    self.store.log_batch(run_id=run_id, metrics=metrics, params=params, tags=tags)
  File "/usr/local/lib/python3.7/site-packages/mlflow/store/tracking/rest_store.py", line 240, in log_batch
    self._call_endpoint(LogBatch, req_body)
  File "/usr/local/lib/python3.7/site-packages/azureml/mlflow/_internal/store.py", line 88, in _call_endpoint
    return super(AzureMLRestStore, self)._call_endpoint(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/mlflow/store/tracking/rest_store.py", line 32, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.7/site-packages/mlflow/utils/rest_utils.py", line 137, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.7/site-packages/mlflow/utils/rest_utils.py", line 103, in verify_rest_response
    raise RestException(json.loads(response.text))
  File "/usr/local/lib/python3.7/site-packages/mlflow/exceptions.py", line 62, in __init__
    super(RestException, self).__init__(message, error_code=ErrorCode.Value(error_code))
  File "/usr/local/lib/python3.7/site-packages/google/protobuf/internal/enum_type_wrapper.py", line 71, in Value
    self._enum_type.name, name))
ValueError: Enum ErrorCode has no value defined for name 1

Looking at one of the builders on this gordo, it appears to be failing in the postgres reporter, because a machine with the same key has already been inserted.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gordo/cli/cli.py", line 150, in build
    machine_out.report()
  File "/usr/local/lib/python3.7/site-packages/gordo/machine/machine.py", line 137, in report
    reporter.report(self)
  File "/usr/local/lib/python3.7/site-packages/gordo/reporters/postgres.py", line 81, in report
    raise PostgresReporterException(exc)
gordo.reporters.postgres.PostgresReporterException: duplicate key value violates unique constraint "machine_name"
DETAIL:  Key (name)=(c5f17844-2913-4a96-b34a-6e05248da252-9999) already exists.

Perhaps some logic is missing to handle duplicate insertions.
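
One way such logic could look is an upsert, so that reporting the same machine twice updates the existing row instead of violating the unique constraint. This is only a rough sketch, not what gordo's postgres reporter does: it uses the standard-library `sqlite3` module as a stand-in for Postgres, and the `machine` table, its `metadata` column, and the `upsert_machine` helper are hypothetical. The `ON CONFLICT ... DO UPDATE` clause is accepted by both SQLite (3.24+) and PostgreSQL (9.5+).

```python
import sqlite3


def upsert_machine(conn: sqlite3.Connection, name: str, metadata: str) -> None:
    """Insert a machine row, or update it if the name already exists.

    Hypothetical helper: table and column names are assumptions for illustration.
    """
    conn.execute(
        "INSERT INTO machine (name, metadata) VALUES (?, ?) "
        "ON CONFLICT (name) DO UPDATE SET metadata = excluded.metadata",
        (name, metadata),
    )
    conn.commit()


# In-memory database standing in for the real Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machine (name TEXT UNIQUE, metadata TEXT)")

# Reporting the same machine name twice no longer raises a duplicate key error.
upsert_machine(conn, "machine-a", "first build")
upsert_machine(conn, "machine-a", "second build")

rows = conn.execute("SELECT name, metadata FROM machine").fetchall()
```

Whether overwriting is the right behaviour depends on whether a rebuilt machine should replace the earlier record; skipping the insert with `DO NOTHING` would be the alternative.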

It turned out this was caused by the (undocumented) Azure limit of at most 200 metrics per call, which you fixed in #934, right @ryanjdillon?
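
For anyone hitting the same limit, the general workaround is to split the metrics into chunks of at most 200 and send one call per chunk. The following is a minimal sketch of that idea, not the actual change in #934; `chunked`, `log_metrics_in_batches`, and `MAX_METRICS_PER_CALL` are hypothetical names, and the sender is passed in as a plain callable so the example runs without mlflow installed.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Limit reported in this thread for the Azure backend (undocumented, so treat as an assumption).
MAX_METRICS_PER_CALL = 200


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def log_metrics_in_batches(
    log_batch: Callable[[List[T]], None],
    metrics: Sequence[T],
    batch_size: int = MAX_METRICS_PER_CALL,
) -> int:
    """Call `log_batch` once per chunk and return the number of calls made."""
    calls = 0
    for batch in chunked(metrics, batch_size):
        log_batch(batch)
        calls += 1
    return calls


# Stand-in sender that just records what it was given.
sent: List[List[int]] = []
calls = log_metrics_in_batches(sent.append, list(range(450)))
batch_sizes = [len(batch) for batch in sent]
```

With a real client, the callable could be something like `lambda batch: mlflow_client.log_batch(run_id, metrics=batch)`, with params and tags sent in a separate call.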