secondmind-labs/trieste

GaussianProcessRegression() optimize does not work in a subprocess

ioananikova opened this issue · 5 comments

Describe the bug
When a pool of processes is used for executing calls (e.g. with concurrent.futures.ProcessPoolExecutor), the optimize() method of GaussianProcessRegression() hangs and never finishes. More specifically, the hang occurs in evaluate_loss_of_model_parameters().

To reproduce
Steps to reproduce the behaviour:

  1. Create a pool of processes
  2. Make sure a GPR model is created in a process
  3. Update the model
  4. Then try to optimize the model (it hangs here)

A minimal reproducible code example is attached to illustrate the problem (rename the .txt attachment to .py to run it).
test_concurrent_trieste.txt

Expected behaviour
The optimize() function should behave as it does when run in the main process (not a subprocess); this step usually takes less than a second to finish.

System information

  • OS: Ubuntu-20.04 (in WSL), on Windows 10
  • Python version: 3.9.9
  • Trieste version: 0.13.0 (installed from pip)
  • TensorFlow version: 2.10.0
  • GPflow version: 2.6.3

Additional context
Even if the import statements are in the subprocess, it fails.

(Confirmed that this is still broken with the latest version, possibly hitting some sort of deadlock.)

This is somehow connected to the use of tf.function compilation. Disabling tracing with tf.config.run_functions_eagerly(True) allows the code example to run (though at the obvious expense of executing everything eagerly each time). Will investigate further.
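For reference, this is what the eager-mode workaround looks like in isolation. A minimal sketch with a stand-in tf.function, not trieste's actual evaluate_loss_of_model_parameters:

```python
import tensorflow as tf

# Call this before the first traced call; it disables tf.function
# compilation process-wide, so everything below runs eagerly.
tf.config.run_functions_eagerly(True)

# Stand-in for the kind of traced function that hangs in the subprocess.
@tf.function
def loss(x):
    return tf.reduce_sum(x * x)

print(float(loss(tf.constant([1.0, 2.0]))))  # prints 5.0
```

In the subprocess repro, the tf.config.run_functions_eagerly(True) call has to happen inside the worker process, since each process has its own TensorFlow state.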

It's also somehow connected to something trieste or one of its dependent libraries does at import time:

# COMMENTING OUT EITHER import trieste OR @tf.function MAKES THIS PASS!
import concurrent.futures
import tensorflow as tf
import trieste

@tf.function
def say_hi():
    tf.print("hi")

def concurrency_test(n):
    print("I'm going to say hi!")
    say_hi()

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        executor.map(concurrency_test, [10])

OK, so it looks like this is due to some global state initialisation performed by TensorFlow the first time it is called. (On Linux, ProcessPoolExecutor forks its workers by default, so a child started after TensorFlow has initialised inherits that state.) Replacing import trieste with tf.constant(42) or similar in the example above also hangs.
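Since the hang comes from forked TensorFlow state, another workaround worth trying (an assumption on my part, not something trieste documents) is to start workers with the "spawn" method, so each worker is a fresh interpreter that initialises TensorFlow from scratch. A stdlib-only sketch of the pattern, with a trivial stand-in worker:

```python
import concurrent.futures
import multiprocessing

def worker(n):
    # In the real repro, `import trieste` and the model code would go here;
    # a spawned worker starts with a fresh interpreter, so it inherits no
    # forked TensorFlow state to deadlock on.
    return n * n

def main():
    # "spawn" starts each worker as a brand-new Python process instead of
    # forking the (TensorFlow-initialised) parent.
    ctx = multiprocessing.get_context("spawn")
    with concurrent.futures.ProcessPoolExecutor(max_workers=1, mp_context=ctx) as ex:
        print(list(ex.map(worker, [10])))  # prints [100]

if __name__ == "__main__":
    main()
```

Note that spawn requires everything passed to the worker to be picklable, and re-imports the main module in the child, which is why the pool setup sits under the `if __name__ == "__main__":` guard.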

The solution is to avoid importing trieste until you're inside the subprocess:

import concurrent.futures

WORKERS = 1

def test_concurrent(num_initial_points):
    from trieste.objectives.single_objectives import Branin
    import trieste
    from trieste.models.gpflow import GaussianProcessRegression, build_gpr
    print(f'num_initial_points: {num_initial_points}')
    branin_obj = Branin.objective
    search_space = Branin.search_space
    observer = trieste.objectives.utils.mk_observer(branin_obj)

    initial_query_points = search_space.sample_halton(num_initial_points)
    initial_data = observer(initial_query_points)
    print('initial data created')

    gpflow_model = build_gpr(initial_data, search_space, likelihood_variance=1e-7)
    model = GaussianProcessRegression(gpflow_model)
    print('model created')

    model.update(initial_data)
    print('model updated')
    model.optimize(initial_data)
    print('model optimized')


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=WORKERS) as executor:
        executor.map(test_concurrent, [10])

I'll see whether we can document this anywhere. Does this solve your issue? (if you can remember back to October 2022!)