keras-team/keras-tuner

Python typing hint causes gRPC error when Tuner search executed

Closed this issue · 2 comments

hmf commented

Describe the bug

I don't know if this is a bug or simply a valid restriction on use. While running a search with a tuner I got the following error:

Traceback (most recent call last):
  File "/workspaces/Unsupervised-Anomaly-Detection-with-SSIM-AE/AE_tune.py", line 604, in <module>
    tuner.search(
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 220, in search
    trial = self.oracle.create_trial(self.tuner_id)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/distribute/oracle_client.py", line 69, in create_trial
    response = self.stub.CreateTrial(
  File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: too many indices for array: array is 0-dimensional, but 1 were indexed"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2023-07-11T11:55:07.886445067+00:00", grpc_status:2, grpc_message:"Exception calling application: too many indices for array: array is 0-dimensional, but 1 were indexed"}"
>
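For context, the details string looks like a plain NumPy IndexError that the chief process has forwarded back over gRPC. I am not sure where in the Oracle it is raised, but the message itself is what NumPy produces when a 0-dimensional (scalar) array is indexed:

import numpy as np

a = np.array(3.0)  # 0-dimensional (scalar) array
a[0]               # IndexError: too many indices for array: array is
                   # 0-dimensional, but 1 were indexed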

The error points to the call:

      tuner.search(
          x=data.train_ds.batch(props.batch_size_train),
          epochs=args.max_epochs, # TODO: place in Properties
          steps_per_epoch=props.steps_per_epoch,
          validation_data=data.val_ds.batch(props.batch_size_validate), # .take(validation_size),
          validation_steps=props.validation_steps,
          verbose=1, # 0,
          use_multiprocessing=True,
          workers=args.n_workers,
          shuffle=True,
          batch_size=args.batch_size,
          # Use the TensorBoard callback.
          callbacks=[early_stopping_callback, tensorboard_callback],
      )
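Note that the traceback goes through keras_tuner/distribute/oracle_client.py, so the search is running in KerasTuner's distributed (chief/worker) mode and the failure happens in the gRPC call from a worker to the chief's Oracle. For reference, that mode is enabled through environment variables; the values below are placeholders, not my actual setup:

import os

# Worker process; the chief process instead uses KERASTUNER_TUNER_ID="chief".
os.environ["KERASTUNER_TUNER_ID"] = "tuner0"
os.environ["KERASTUNER_ORACLE_IP"] = "127.0.0.1"  # address of the chief
os.environ["KERASTUNER_ORACLE_PORT"] = "8000"     # port the chief's Oracle listens on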

After several days of analyzing this, I found that the problem seems to stem from the construction of the tuner. Here is the code I used:

  tuner = kt.BayesianOptimization(
      hypermodel=build_model_(props),  # None
      objective=kt.Objective('val_loss', direction="min"),  # None
      max_trials=props.max_trials,  # 50, #10 #15 #220
      num_initial_points=None,
      alpha=0.0001,  # 1e-4.
      beta=2.6,
      seed=None,
      hyperparameters=None,
      tune_new_entries=True,
      allow_new_entries=True,
      max_retries_per_trial=0,
      max_consecutive_failed_trials=3,
      # **kwargs
      # https://keras.io/api/keras_tuner/tuners/base_tuner/#tuner-class
      executions_per_trial=props.executions_per_trial,  # 1
      overwrite=False,
      directory=tune_dir,
      project_name=project_name
  )

The problem is in the following function:

def build_model_(props: dp.Properties) -> Callable[[kt.HyperParameters], keras.Model]:
    def f(hp: kt.HyperParameters):
        return build_model(hp, props)
    return f

I changed the build_model_ function back to its original, untyped version, as shown below (the only difference is that the return type annotation has been removed):

def build_model_(props: dp.Properties):
    def f(hp: kt.HyperParameters):
        return build_model(hp, props)
    return f

The code now runs.
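For completeness, build_model itself is not included above. A minimal hypothetical sketch of what such a function looks like (the layer sizes and hyperparameters below are placeholders, not my actual model):

import keras
import keras_tuner as kt

def build_model(hp: kt.HyperParameters, props):
    # Placeholder architecture; the real model is built from props.
    units = hp.Int("units", min_value=32, max_value=256, step=32)
    model = keras.Sequential([
        keras.layers.Dense(units, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model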

To Reproduce

I cannot provide a minimal reproducible example, but I have repeated the experiment of adding and removing the Python typing hints, and the gRPC call invariably fails when the typing hint is used.

Expected behavior

I expect the code to work with or without Python typing hints. I am assuming that typing hints do not affect run-time behavior. My assumption may be wrong, as I have no knowledge of what is happening under the hood.
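As far as I can tell (my own check, not something from the KerasTuner code), a return type annotation on the factory only adds metadata to the function object; the closure it returns is the same either way:

import typing

def typed() -> typing.Callable[[int], int]:
    def f(x: int) -> int:
        return x
    return f

def untyped():
    def f(x):
        return x
    return f

print(typed.__annotations__)    # {'return': typing.Callable[[int], int]}
print(untyped.__annotations__)  # {}
print(typed()(3), untyped()(3)) # 3 3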

Additional context

Tests were executed on a Linux VM. The library versions used are shown at the end.

Would you like to help us fix it?

I can try. My objective with this report is to record the issue in case anyone else runs into it, and to save them time.

Successfully installed MarkupSafe-2.1.3 PyWavelets-1.4.1 absl-py-1.4.0 astunparse-1.6.3 cachetools-5.3.1 certifi-2023.5.7 charset-normalizer-3.2.0 contourpy-1.1.0 cycler-0.11.0 flatbuffers-23.5.26 fonttools-4.40.0 gast-0.4.0 google-auth-2.22.0 google-auth-oauthlib-1.0.0 google-pasta-0.2.0 grpcio-1.56.0 gviz-api-1.10.0 h5py-3.9.0 idna-3.4 imageio-2.31.1 joblib-1.3.1 keras-2.13.1 keras-tuner-1.3.5 kiwisolver-1.4.4 kt-legacy-1.0.5 lazy_loader-0.3 libclang-16.0.0 markdown-3.4.3 matplotlib-3.7.2 networkx-3.1 numpy-1.24.3 oauthlib-3.2.2 opencv-contrib-python-headless-4.8.0.74 opencv-python-headless-4.8.0.74 opt-einsum-3.3.0 packaging-23.1 pillow-10.0.0 protobuf-4.23.4 pyasn1-0.5.0 pyasn1-modules-0.3.0 pyparsing-3.0.9 python-dateutil-2.8.2 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 scikit-image-0.21.0 scikit-learn-1.3.0 scipy-1.11.1 six-1.16.0 tensorboard-2.13.0 tensorboard-data-server-0.7.1 tensorboard_plugin_profile-2.13.0 tensorflow-2.13.0 tensorflow-estimator-2.13.0 tensorflow-io-gcs-filesystem-0.32.0 termcolor-2.3.0 threadpoolctl-3.1.0 tifffile-2023.7.10 tqdm-4.65.0 typing-extensions-4.5.0 urllib3-1.26.16 werkzeug-2.3.6 wrapt-1.15.0
hmf commented

I still have the same issue, so it seems this was not the cause. The tests no longer fail on the first termination of a worker, but they still fail later.

hmf commented

Trying to create a reproducible example.