GaussianProcessRegression() optimize does not work in a subprocess
ioananikova opened this issue · 5 comments
Describe the bug
When a pool of processes is used for executing calls (like with `concurrent.futures.ProcessPoolExecutor`), the `optimize()` method of `GaussianProcessRegression()` will take forever and never finish. More specifically, this happens in `evaluate_loss_of_model_parameters()`.
To reproduce
Steps to reproduce the behaviour:
- Create a pool of processes
- Make sure a GPR model is created in a process
- Update the model
- Then try to optimize the model (it will hang at this step)
A minimal reproducible code example is included to illustrate the problem (rename the .txt attachment to .py to run it):
test_concurrent_trieste.txt
Expected behaviour
The expected behaviour is that `optimize()` behaves as it would in a normal process (not a subprocess); this step usually takes less than a second to finish.
System information
- OS: Ubuntu-20.04 (in WSL), on Windows 10
- Python version: 3.9.9
- Trieste version: 0.13.0 (the pip version)
- TensorFlow version: 2.10.0
- GPflow version: 2.6.3
Additional context
Even if the import statements are inside the subprocess, it still hangs.
(Confirmed that this is still broken with the latest version; possibly hitting some sort of deadlock.)
This is somehow connected to the use of `tf.function` compilation. Disabling tracing with `tf.config.run_functions_eagerly(True)` allows the code example to run (though at the obvious expense of executing everything eagerly each time). Will investigate further.
It's also somehow connected to something trieste or one of its dependent libraries does:
```python
# COMMENTING OUT EITHER import trieste OR @tf.function MAKES THIS PASS!
import concurrent.futures

import tensorflow as tf
import trieste


@tf.function
def say_hi():
    tf.print("hi")


def concurrency_test(n):
    print("I'm going to say hi!")
    say_hi()


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        executor.map(concurrency_test, [10])
```
Ok, so it looks like this is due to some state initialisation performed by tensorflow when you call it for the first time. Replacing `import trieste` with `tf.constant(42)` or similar in the example above also hangs.
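This matches the classic fork-plus-threads hazard (an assumption about the mechanism, not verified against TensorFlow internals): on Linux, `ProcessPoolExecutor` forks the parent, so a lock held by a parent thread at fork time is copied into the child in the "held" state, with no thread left to release it. A minimal stdlib sketch, with a timeout so the demo reports the failure instead of hanging:

```python
import multiprocessing
import threading
import time

lock = threading.Lock()


def child_task():
    # The forked child inherits a copy of the lock in the "held" state, but the
    # thread that held it was not copied across the fork, so nothing will ever
    # release it. A plain acquire() would deadlock; the timeout lets the demo
    # finish and report the failure instead.
    return lock.acquire(timeout=1)


def demo():
    # A helper thread grabs the lock and keeps holding it while we fork.
    holder = threading.Thread(
        target=lambda: (lock.acquire(), time.sleep(5)), daemon=True
    )
    holder.start()
    time.sleep(0.2)  # make sure the helper thread holds the lock before forking
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(1) as pool:
        return pool.apply(child_task)


if __name__ == "__main__":
    print(demo())  # False: the child times out instead of acquiring the lock
```

If TensorFlow's first-call initialisation leaves similar state behind, any forked worker that touches it afterwards would block the same way.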
The solution is to avoid importing trieste until you're inside the subprocess:
```python
import concurrent.futures

WORKERS = 1


def test_concurrent(num_initial_points):
    # import trieste inside the subprocess, not at module level
    from trieste.objectives.single_objectives import Branin
    import trieste
    from trieste.models.gpflow import GaussianProcessRegression, build_gpr

    print(f'num_initial_points: {num_initial_points}')
    branin_obj = Branin.objective
    search_space = Branin.search_space
    observer = trieste.objectives.utils.mk_observer(branin_obj)
    initial_query_points = search_space.sample_halton(num_initial_points)
    initial_data = observer(initial_query_points)
    print('initial data created')
    gpflow_model = build_gpr(initial_data, search_space, likelihood_variance=1e-7)
    model = GaussianProcessRegression(gpflow_model)
    print('model created')
    model.update(initial_data)
    print('model updated')
    model.optimize(initial_data)
    print('model optimized')


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=WORKERS) as executor:
        executor.map(test_concurrent, [10])
```
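An alternative worth trying (an assumption, not verified against trieste): on Linux, `ProcessPoolExecutor` defaults to the fork start method, which copies the parent's initialised TensorFlow state into each worker. Passing a spawn context starts each worker as a fresh interpreter instead, so top-level imports should become safe again. Here `work` is a stand-in for the real workload:

```python
import concurrent.futures
import multiprocessing


def work(n):
    # stand-in for the real trieste/TensorFlow workload
    return n * 2


def run_pool():
    # "spawn" starts each worker as a fresh interpreter rather than forking the
    # parent, so no initialised TensorFlow state is inherited.
    ctx = multiprocessing.get_context("spawn")
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=1, mp_context=ctx
    ) as executor:
        return list(executor.map(work, [10]))


if __name__ == "__main__":
    print(run_pool())
```

The `mp_context` parameter has been accepted by `ProcessPoolExecutor` since Python 3.7; spawn does add per-worker startup cost, since each worker re-imports everything.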
I'll see whether we can document this anywhere. Does this solve your issue? (if you can remember back to October 2022!)