google-deepmind/graph_nets

Performance issue

ping-dong-tm opened this issue · 9 comments

I am using the graph module to solve an ETA estimation problem. Could anybody tell me how to make the training faster by using all the available cores on my computer? I tried many things in TensorFlow 2.5, such as setting the number of threads via tf.config.threading.set_inter_op_parallelism_threads and tf.config.threading.set_intra_op_parallelism_threads, but nothing works. Training the MLP networks took a very long time.

Could anybody tell me how to make the training faster by using all the available cores on my computer?

Thank you for your message. Could you confirm that you are using a tf.function and passing the input_signature to it to indicate the variable-size shapes:
compiled_update_step = tf.function(update_step, input_signature=input_signature)
such as in the TF2 example?

If this is not the case, then it may be that the tf.function is tracing your computation at every training step, and that would be very slow. If you are not using tf.function at all, then I would expect the program also to be very slow.
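A quick way to confirm whether retracing is happening is to put a Python-side print inside the function; the toy update_step_toy below is just a stand-in for the real update_step, not code from this thread:

import tensorflow as tf

def update_step_toy(x, y):
  # A Python-side print only runs while tf.function is tracing, not on
  # every graph-mode call, so repeated prints mean repeated retracing.
  print("Tracing update_step_toy with shapes", x.shape, y.shape)
  return tf.reduce_sum(x) + tf.reduce_sum(y)

compiled = tf.function(update_step_toy)
compiled(tf.ones([3, 2]), tf.ones([3, 4]))  # Prints: first trace.
compiled(tf.ones([3, 2]), tf.ones([3, 4]))  # Silent: reuses the trace.
compiled(tf.ones([5, 2]), tf.ones([5, 4]))  # Prints again: new shapes force a retrace.

If the message appears on every training step, the function is being retraced each time, typically because the input shapes change and no input_signature was given.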

Training the MLP networks took a very long time.

Are you talking about just training a regular MLP, or training a GraphNetwork with MLPs inside? Also, could you confirm that you are talking about using your CPU for training rather than GPU?

Beyond the suggestion above, I cannot think of any reason for this to be unusually slow (beyond what is expected of CPU vs GPU) that is specific to the usage of the graph_nets library, and I would recommend following up with any general TF recommendations for running code as fast as possible on CPUs.

Thank you for your comments. I do use compiled_update_step for training, so the code runs in graph mode. Currently, I am training the model on CPU only.

I found that training runs on only one core, even though my computer has 16 cores. I am not sure why the code is not executed in parallel. I followed some suggestions from Intel, such as:

import os
import tensorflow as tf

num_threads = 8
os.environ["OMP_NUM_THREADS"] = str(num_threads)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_threads)
os.environ["TF_NUM_INTEROP_THREADS"] = str(num_threads)
os.environ["openmp"] = "True"
os.environ["KMP_SETTINGS"] = "True"

# Number of sockets.
tf.config.threading.set_inter_op_parallelism_threads(1)
# Number of physical cores.
tf.config.threading.set_intra_op_parallelism_threads(num_threads)
tf.config.set_soft_device_placement(True)

This does not work.
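As a side note, these threading setters only take effect if they run before TensorFlow executes its first op (recent versions may raise a RuntimeError otherwise), and the applied values can be read back with the matching getters. A minimal sketch, assuming it sits at the very top of the script:

import tensorflow as tf

# Must run before any TensorFlow op executes, otherwise the runtime has
# already been initialised with its default thread pools.
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(8)

# Read the values back to confirm the configuration was picked up.
print("inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())
print("intra-op threads:", tf.config.threading.get_intra_op_parallelism_threads())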

Thanks for your reply, could you confirm if:

  1. You are referring to a Graph Neural Network or just a single MLP.
  2. In case it is a Graph Neural Network, whether you are passing a signature argument to the tf.function as indicated above.

Thanks!

I am referring to the graph neural network. My code is based on the example code in demo-tf2. I made some modifications to the EncodeProcessDecode class, where I used MLPs as the update functions for the node, edge, and global embeddings.

I also passed the signature argument to tf.function.

from graph_nets import utils_tf
import tensorflow as tf

input_signature = [
    # Spec describing the variable-size GraphsTuple inputs.
    utils_tf.specs_from_graphs_tuple(inputs_tr),
    # Spec for the per-example targets.
    tf.TensorSpec(shape=(batch_size, 4), dtype=tf.float32),
]

compiled_update_step = tf.function(update_step, input_signature=input_signature)

Do you have any suggestions on how to make the code run on multiple CPU cores?

Thanks for the clarification, could you confirm if you observe the same low CPU utilisation when running a single MLP in a similar setting, rather than a full Graph Network?
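For reference, a minimal sketch of such a single-MLP baseline, using Sonnet's snt.nets.MLP with arbitrary sizes (none of these numbers or names come from the original code); timing it while watching a CPU monitor shows whether plain dense layers already spread across cores:

import time

import sonnet as snt
import tensorflow as tf

# A plain MLP on random data, compiled once via input_signature, as a
# baseline for CPU utilisation independent of any GraphsTuple handling.
mlp = snt.nets.MLP([256, 256, 1])
optimizer = snt.optimizers.Adam(1e-3)

x = tf.random.normal([1024, 64])
y = tf.random.normal([1024, 1])
mlp(x)  # Build the variables outside the compiled function.

@tf.function(input_signature=[
    tf.TensorSpec(shape=(1024, 64), dtype=tf.float32),
    tf.TensorSpec(shape=(1024, 1), dtype=tf.float32)])
def mlp_step(inputs, targets):
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean((mlp(inputs) - targets) ** 2)
  grads = tape.gradient(loss, mlp.trainable_variables)
  optimizer.apply(grads, mlp.trainable_variables)
  return loss

start = time.time()
for _ in range(100):
  mlp_step(x, y)
print("100 MLP steps took", time.time() - start, "seconds")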

I guess the reason I cannot run the GNN on multiple CPU cores is that we cannot distribute the input data (the GraphsTuple structure) to different cores. Did you ever run the demo examples on multiple cores or on multiple GPUs?

You mean that you have specific code for distributing the data for an MLP on CPU, but otherwise, if you don't distribute it explicitly, it does not use multiple cores? Could you share an example of MLP code which successfully runs on multiple CPU cores on your machine?

I was assuming TensorFlow should be able to use multiple cores for things like matrix multiplication without explicitly distributing the input data across CPU cores, but we never run on CPU for any serious training, so it is hard to say for sure.
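One way to check that assumption independently of graph_nets is to time a large matrix multiplication and watch CPU utilisation while it runs; the sketch below assumes nothing about the original model:

import time

import tensorflow as tf

# A single large matmul: TensorFlow's intra-op thread pool should already
# split this across cores without any explicit data distribution.
a = tf.random.normal([4000, 4000])
b = tf.random.normal([4000, 4000])

start = time.time()
for _ in range(10):
  c = tf.matmul(a, b)
  c.numpy()  # Force the result to materialise before timing the next one.
print("10 matmuls took", time.time() - start, "seconds")

If this saturates several cores but the GraphNetwork training does not, the bottleneck is more likely in the model code or data pipeline than in TensorFlow's CPU kernels.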

Here is an example of how to run a CNN on a multi-core CPU (the original example is for multi-GPU training; I made some minor changes). This works because it creates a distributed dataset.

import sonnet as snt
import tensorflow as tf
import tensorflow_datasets as tfds

strategy = snt.distribute.Replicator(
    ["/device:CPU:{}".format(i) for i in range(1)],
    tf.distribute.ReductionToOneDevice("CPU:0"))

# NOTE: This is the batch size across all GPUs.
batch_size = 100 * 4

def process_batch(images, labels):
  images = tf.cast(images, dtype=tf.float32)
  images = ((images / 255.) - .5) * 2.
  return images, labels

def cifar10(split):
  dataset = tfds.load("cifar10", split=split, as_supervised=True)
  dataset = dataset.map(process_batch)
  dataset = dataset.batch(batch_size)
  dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
  dataset = dataset.cache()
  return dataset

cifar10_train = cifar10("train").shuffle(10)
cifar10_test = cifar10("test")

learning_rate = 0.1

with strategy.scope():
  model = snt.nets.Cifar10ConvNet()
  optimizer = snt.optimizers.Momentum(learning_rate, 0.9)

# Training the model.
def step(images, labels):
  """Performs a single training step, returning the cross-entropy loss."""
  with tf.GradientTape() as tape:
    logits = model(images, is_training=True)["logits"]
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))

  grads = tape.gradient(loss, model.trainable_variables)

  # Aggregate the gradients from the full batch.
  replica_ctx = tf.distribute.get_replica_context()
  grads = replica_ctx.all_reduce("mean", grads)

  optimizer.apply(grads, model.trainable_variables)
  return loss

@tf.function
def train_step(images, labels):
  per_replica_loss = strategy.run(step, args=(images, labels))
  return strategy.reduce("sum", per_replica_loss, axis=None)

def train_epoch(dataset):
  """Performs one epoch of training, returning the mean cross-entropy loss."""
  total_loss = 0.0
  num_batches = 0

  # Loop over the entire training set.
  for images, labels in dataset:
    total_loss += train_step(images, labels).numpy()
    num_batches += 1

  return total_loss / num_batches

cifar10_train_dist = strategy.experimental_distribute_dataset(cifar10_train)

for epoch in range(20):
  print("Training epoch", epoch, "...", end=" ")
  print("loss :=", train_epoch(cifar10_train_dist))

Thanks for the reply. Yes, this is indeed a way to run on multiple devices using the Replicator, but in principle it should not be necessary just to make use of multiple CPU cores.

Unfortunately, it is not immediately obvious how to do this with the library (you would need to build batches of batches, but because each batch has a different dimension, the second level of batching would need to be built with something like tf.data.experimental.dense_to_ragged_batch).
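For illustration only, and not as a graph_nets recipe: tf.data.experimental.dense_to_ragged_batch batches elements whose leading dimension varies from example to example (the toy dataset below is a stand-in for per-graph node features) into a single ragged batch:

import numpy as np
import tensorflow as tf

# Toy dataset standing in for per-graph node features: each element has a
# different number of "nodes".
dataset = tf.data.Dataset.from_generator(
    lambda: (np.ones([n, 3], dtype=np.float32) for n in (2, 5, 3, 4)),
    output_signature=tf.TensorSpec(shape=(None, 3), dtype=tf.float32))

# dense_to_ragged_batch combines elements with different leading dimensions
# into one tf.RaggedTensor per batch, instead of requiring equal shapes.
dataset = dataset.apply(
    tf.data.experimental.dense_to_ragged_batch(batch_size=2))

for batch in dataset:
  print(batch.shape)  # (2, None, 3): a ragged batch of two elements.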

I would recommend trying to follow up with TensorFlow directly if the solutions from this thread don't help.