[Feature Request] Add parallel_iterations and experimental_use_pfor parameters in `_compute_inv_hessian` (ExactIHVP)
lucashervier opened this issue · 0 comments
Is your feature request related to a problem? Please describe.
When using the first-order-influence-koh-liang branch, I run into trouble when I try to compute the exact inverse-Hessian-vector product on a semantic segmentation model. Here is a minimal example and the corresponding output logs:
import tensorflow as tf
from influenciae.common.model_wrappers import InfluenceModel
from influenciae.influence.inverse_hessian_vector_product import ExactIHVP

IMG_SIZE = 768
NUM_CLASSES = 20

inp = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
# A conv block
x = tf.keras.layers.Conv2D(filters=32, kernel_size=1, strides=(1, 1))(inp)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Activation('relu')(x)
# FCN block
x = tf.keras.layers.UpSampling2D(
    size=(IMG_SIZE // x.shape[1], IMG_SIZE // x.shape[2]),
    interpolation="bilinear",
)(x)
model_output = tf.keras.layers.Conv2D(NUM_CLASSES, kernel_size=(1, 1), padding="same")(x)
# define model
model = tf.keras.Model(inputs=inp, outputs=model_output)
# freeze all layers except last one
for layer in model.layers:
    layer.trainable = False
for layer in model.layers[-1:]:
    layer.trainable = True
print(model.summary())


# define a loss for semantic segmentation with reduction set to NONE
class CustomLoss2(tf.keras.losses.Loss):
    def __init__(self, num_classes, ignore_label):
        super(CustomLoss2, self).__init__(name='CustomLoss2', reduction=tf.keras.losses.Reduction.NONE)
        self.num_classes = num_classes
        self.ignore_label = ignore_label

    def call(self, y_true, y_pred):
        sample_weights = tf.cast(tf.not_equal(y_true, self.ignore_label), dtype=tf.float32)
        one_hot_gt = tf.stop_gradient(tf.one_hot(y_true, self.num_classes))
        loss = tf.nn.softmax_cross_entropy_with_logits(one_hot_gt, y_pred)
        weighted_loss = tf.multiply(loss, tf.squeeze(sample_weights))
        # Compute mean loss over spatial dimension.
        num_non_zero = tf.reduce_sum(
            tf.cast(tf.not_equal(weighted_loss, 0.0), tf.float32), 1)
        loss_sum_per_sample = tf.reduce_sum(weighted_loss, 1)
        return tf.reduce_sum(tf.math.divide_no_nan(loss_sum_per_sample, num_non_zero), 1)


if __name__ == "__main__":
    random_input = tf.random.normal(shape=(4, IMG_SIZE, IMG_SIZE, 3))
    random_target = tf.random.uniform(shape=(4, IMG_SIZE, IMG_SIZE), minval=0, maxval=NUM_CLASSES-1, dtype=tf.int32)
    random_dataset = tf.data.Dataset.from_tensor_slices((random_input, random_target))
    # define InfluenceModel
    influence_model = InfluenceModel(model, target_layer=-1, loss_function=CustomLoss2(NUM_CLASSES, ignore_label=255))
    # freeze all layers except last one
    for layer in influence_model.layers:
        layer.trainable = False
    for layer in influence_model.layers[-1:]:
        layer.trainable = True
    ihvp_calculator = ExactIHVP(influence_model, random_dataset.take(1).batch(1))
Logs:
(bdd_env) (base) lucas.hervier@soda01:~/bdd100$ python issue_minimal.py
2022-02-11 10:59:15.926556: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-02-11 10:59:17.602358: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-02-11 10:59:17.658599: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.659380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:21:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.86GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-02-11 10:59:17.659421: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.660154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
pciBusID: 0000:4a:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.86GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-02-11 10:59:17.660173: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-02-11 10:59:17.661903: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-02-11 10:59:17.661930: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-02-11 10:59:17.662492: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-02-11 10:59:17.662623: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-02-11 10:59:17.663131: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-02-11 10:59:17.663545: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-02-11 10:59:17.663616: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-02-11 10:59:17.663665: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.664449: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.665198: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.665944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.666664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-02-11 10:59:17.666925: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-11 10:59:17.786724: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.787447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:21:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.86GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-02-11 10:59:17.787484: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.788149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
pciBusID: 0000:4a:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.86GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-02-11 10:59:17.788187: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.788895: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.789599: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.790300: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:17.790980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-02-11 10:59:17.791016: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-02-11 10:59:18.257978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-11 10:59:18.258015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
2022-02-11 10:59:18.258021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N N
2022-02-11 10:59:18.258024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: N N
2022-02-11 10:59:18.258195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:18.258947: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:18.259658: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:18.260379: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:18.261077: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:18.261784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22302 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:21:00.0, compute capability: 8.6)
2022-02-11 10:59:18.262082: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-11 10:59:18.262783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22312 MB memory) -> physical GPU (device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:4a:00.0, compute capability: 8.6)
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 768, 768, 3)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 768, 768, 32) 128
_________________________________________________________________
dropout (Dropout) (None, 768, 768, 32) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 768, 768, 32) 128
_________________________________________________________________
activation (Activation) (None, 768, 768, 32) 0
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 768, 768, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 768, 768, 20) 660
=================================================================
Total params: 916
Trainable params: 660
Non-trainable params: 256
_________________________________________________________________
None
2022-02-11 10:59:18.626040: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2022-02-11 10:59:18.644320: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3700110000 Hz
2022-02-11 10:59:18.672693: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-02-11 10:59:19.060262: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2022-02-11 10:59:19.548581: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-02-11 10:59:19.925576: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
WARNING:tensorflow:Using a while_loop for converting Conv2D
WARNING:tensorflow:Using a while_loop for converting Conv2DBackpropInput
WARNING:tensorflow:Using a while_loop for converting ResizeBilinearGrad
2022-02-11 11:04:57.055515: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (GPU_0_bfc) ran out of memory trying to allocate 45.00GiB (rounded to 48318382080)requested by op loop_body/PartitionedCall/pfor/PartitionedCall/gradients/gradient_tape/model/conv2d_1/Conv2D/Conv2DBackpropFilter_grad/Conv2D/pfor/Tile
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2022-02-11 11:04:57.055561: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for GPU_0_bfc
2022-02-11 11:04:57.055569: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256): Total Chunks: 24, Chunks in use: 24. 6.0KiB allocated for chunks. 6.0KiB in use in bin. 1.3KiB client-requested in use in bin.
2022-02-11 11:04:57.055575: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512): Total Chunks: 1, Chunks in use: 1. 512B allocated for chunks. 512B in use in bin. 384B client-requested in use in bin.
2022-02-11 11:04:57.055581: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2022-02-11 11:04:57.055587: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048): Total Chunks: 5, Chunks in use: 4. 12.5KiB allocated for chunks. 10.5KiB in use in bin. 10.5KiB client-requested in use in bin.
2022-02-11 11:04:57.055593: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055598: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055605: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055613: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055621: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055631: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055636: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055641: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288): Total Chunks: 1, Chunks in use: 0. 571.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055647: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055656: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152): Total Chunks: 3, Chunks in use: 3. 6.75MiB allocated for chunks. 6.75MiB in use in bin. 6.06MiB client-requested in use in bin.
2022-02-11 11:04:57.055664: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055671: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608): Total Chunks: 1, Chunks in use: 1. 9.00MiB allocated for chunks. 9.00MiB in use in bin. 9.00MiB client-requested in use in bin.
2022-02-11 11:04:57.055679: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216): Total Chunks: 1, Chunks in use: 1. 27.00MiB allocated for chunks. 27.00MiB in use in bin. 27.00MiB client-requested in use in bin.
2022-02-11 11:04:57.055687: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432): Total Chunks: 4, Chunks in use: 3. 172.68MiB allocated for chunks. 135.00MiB in use in bin. 135.00MiB client-requested in use in bin.
2022-02-11 11:04:57.055695: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864): Total Chunks: 5, Chunks in use: 5. 360.00MiB allocated for chunks. 360.00MiB in use in bin. 333.00MiB client-requested in use in bin.
2022-02-11 11:04:57.055702: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055709: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456): Total Chunks: 1, Chunks in use: 0. 21.22GiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-02-11 11:04:57.055718: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 45.00GiB was 256.00MiB, Chunk State:
2022-02-11 11:04:57.055732: I tensorflow/core/common_runtime/bfc_allocator.cc:1020] Size: 21.22GiB | Requested Size: 45.00MiB | in_use: 0 | bin_num: 20, prev: Size: 72.00MiB | Requested Size: 72.00MiB | in_use: 1 | bin_num: -1, for: loop_body/PartitionedCall/pfor/PartitionedCall/gradients/model/conv2d_1/Conv2D_grad/Conv2DBackpropFilter/pfor/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, stepid: 15496386686427765080, last_action: 4278547630, for: UNUSED, stepid: 15496386686427765080, last_action: 4278547628
2022-02-11 11:04:57.055739: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 23385669632
2022-02-11 11:04:57.055747: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000000 of size 1280 by op ScratchBuffer action_count 4278547493 step 0 next 1
2022-02-11 11:04:57.055753: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000500 of size 256 by op Fill action_count 4278547503 step 0 next 5
2022-02-11 11:04:57.055758: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000600 of size 256 by op Fill action_count 4278547504 step 0 next 2
2022-02-11 11:04:57.055764: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000700 of size 256 by op Sub action_count 4278547495 step 0 next 3
2022-02-11 11:04:57.055768: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000800 of size 256 by op Sub action_count 4278547496 step 0 next 4
2022-02-11 11:04:57.055774: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000900 of size 256 by op Fill action_count 4278547505 step 0 next 8
2022-02-11 11:04:57.055778: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000a00 of size 256 by op Fill action_count 4278547506 step 0 next 9
2022-02-11 11:04:57.055784: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000b00 of size 256 by op Fill action_count 4278547507 step 0 next 6
2022-02-11 11:04:57.055790: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000c00 of size 512 by op Add action_count 4278547500 step 0 next 7
2022-02-11 11:04:57.055795: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000e00 of size 256 by op Fill action_count 4278547508 step 0 next 10
2022-02-11 11:04:57.055801: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6000f00 of size 256 by op Fill action_count 4278547509 step 0 next 11
2022-02-11 11:04:57.055806: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001000 of size 256 by op Fill action_count 4278547519 step 0 next 15
2022-02-11 11:04:57.055812: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001100 of size 256 by op AssignVariableOp action_count 4278547520 step 0 next 18
2022-02-11 11:04:57.055818: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001200 of size 256 by op Mul action_count 4278547522 step 0 next 20
2022-02-11 11:04:57.055823: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001300 of size 256 by op Add action_count 4278547524 step 0 next 22
2022-02-11 11:04:57.055829: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001400 of size 256 by op Equal action_count 4278547529 step 0 next 24
2022-02-11 11:04:57.055835: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001500 of size 256 by op CustomLoss2/weighted_loss/Const action_count 4278547533 step 0 next 26
2022-02-11 11:04:57.055841: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001600 of size 256 by op CustomLoss2/NotEqual_1/y action_count 4278547534 step 0 next 27
2022-02-11 11:04:57.055847: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001700 of size 256 by op model/batch_normalization/FusedBatchNormV3 action_count 4278547560 step 13684086849625510338 next 34
2022-02-11 11:04:57.055852: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001800 of size 256 by op model/batch_normalization/FusedBatchNormV3 action_count 4278547561 step 13684086849625510338 next 35
2022-02-11 11:04:57.055858: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001900 of size 256 by op model/batch_normalization/FusedBatchNormV3 action_count 4278547562 step 13684086849625510338 next 12
2022-02-11 11:04:57.055864: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001a00 of size 256 by op Sub action_count 4278547511 step 0 next 13
2022-02-11 11:04:57.055869: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001b00 of size 256 by op Sub action_count 4278547512 step 0 next 14
2022-02-11 11:04:57.055875: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001c00 of size 256 by op model/batch_normalization/FusedBatchNormV3 action_count 4278547563 step 13684086849625510338 next 36
2022-02-11 11:04:57.055880: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001d00 of size 256 by op model/batch_normalization/FusedBatchNormV3 action_count 4278547564 step 13684086849625510338 next 37
2022-02-11 11:04:57.055886: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6001e00 of size 256 by op gradient_tape/UnsortedSegmentSum/pfor/mul_1 action_count 4278547624 step 0 next 45
2022-02-11 11:04:57.055892: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free at 7fdcc6001f00 of size 2048 by op UNUSED action_count 0 step 0 next 16
2022-02-11 11:04:57.055898: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6002700 of size 2560 by op Add action_count 4278547516 step 0 next 17
2022-02-11 11:04:57.055903: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6003100 of size 9437184 by op RandomUniformInt action_count 4278547528 step 0 next 19
2022-02-11 11:04:57.055909: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6903100 of size 3072 by op gradient_tape/CustomLoss2/Tile action_count 4278547532 step 0 next 25
2022-02-11 11:04:57.055915: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6903d00 of size 2560 by op gradient_tape/model/conv2d_1/Conv2D/Conv2DBackpropFilter action_count 4278547612 step 13684086849625510338 next 44
2022-02-11 11:04:57.055921: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6904700 of size 2560 by op gradient_tape/UnsortedSegmentSum/pfor/Tile action_count 4278547623 step 0 next 43
2022-02-11 11:04:57.055927: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free at 7fdcc6905100 of size 584704 by op UNUSED action_count 4278547633 step 0 next 28
2022-02-11 11:04:57.055934: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6993d00 of size 2359296 by op CustomLoss2/ArithmeticOptimizer/ReorderCastLikeAndValuePreserving_float_Cast action_count 4278547544 step 13684086849625510338 next 33
2022-02-11 11:04:57.055940: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6bd3d00 of size 2359296 by op gradient_tape/UnsortedSegmentSum/pfor/UnsortedSegmentSum action_count 4278547632 step 15496386686427765080 next 41
2022-02-11 11:04:57.055946: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc6e13d00 of size 2359296 by op CustomLoss2/softmax_cross_entropy_with_logits action_count 4278547592 step 13684086849625510338 next 42
2022-02-11 11:04:57.055952: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free at 7fdcc7053d00 of size 39515136 by op UNUSED action_count 4278547619 step 13684086849625510338 next 21
2022-02-11 11:04:57.055957: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcc9603100 of size 28311552 by op Add action_count 4278547525 step 0 next 23
2022-02-11 11:04:57.055963: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdccb103100 of size 75497472 by op model/activation/Relu-0-1-TransposeNCHWToNHWC-LayoutOptimizer action_count 4278547566 step 13684086849625510338 next 31
2022-02-11 11:04:57.055969: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdccf903100 of size 47185920 by op gradient_tape/CustomLoss2/softmax_cross_entropy_with_logits/mul action_count 4278547610 step 13684086849625510338 next 32
2022-02-11 11:04:57.055975: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcd2603100 of size 75497472 by op model/conv2d/BiasAdd-0-1-TransposeNCHWToNHWC-LayoutOptimizer action_count 4278547558 step 13684086849625510338 next 29
2022-02-11 11:04:57.055981: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcd6e03100 of size 75497472 by op model/up_sampling2d/resize/ResizeBilinear action_count 4278547568 step 13684086849625510338 next 30
2022-02-11 11:04:57.055988: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcdb603100 of size 75497472 by op loop_body/PartitionedCall/pfor/PartitionedCall/gradients/CustomLoss2/softmax_cross_entropy_with_logits_grad/Softmax/pfor/Softmax action_count 4278547625 step 15496386686427765080 next 38
2022-02-11 11:04:57.055994: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdcdfe03100 of size 47185920 by op CustomLoss2/softmax_cross_entropy_with_logits action_count 4278547593 step 13684086849625510338 next 39
2022-02-11 11:04:57.056000: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdce2b03100 of size 47185920 by op model/conv2d_1/BiasAdd-0-0-TransposeNCHWToNHWC-LayoutOptimizer action_count 4278547589 step 13684086849625510338 next 40
2022-02-11 11:04:57.056006: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7fdce5803100 of size 75497472 by op loop_body/PartitionedCall/pfor/PartitionedCall/gradients/model/conv2d_1/Conv2D_grad/Conv2DBackpropFilter/pfor/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer action_count 4278547630 step 15496386686427765080 next 46
2022-02-11 11:04:57.056012: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free at 7fdcea003100 of size 22781677312 by op UNUSED action_count 4278547628 step 15496386686427765080 next 18446744073709551615
2022-02-11 11:04:57.056017: I tensorflow/core/common_runtime/bfc_allocator.cc:1051] Summary of in-use Chunks by size:
2022-02-11 11:04:57.056024: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 24 Chunks of size 256 totalling 6.0KiB
2022-02-11 11:04:57.056030: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 512 totalling 512B
2022-02-11 11:04:57.056038: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 1280 totalling 1.2KiB
2022-02-11 11:04:57.056048: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 3 Chunks of size 2560 totalling 7.5KiB
2022-02-11 11:04:57.056057: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 3072 totalling 3.0KiB
2022-02-11 11:04:57.056067: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 3 Chunks of size 2359296 totalling 6.75MiB
2022-02-11 11:04:57.056076: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 9437184 totalling 9.00MiB
2022-02-11 11:04:57.056087: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 28311552 totalling 27.00MiB
2022-02-11 11:04:57.056096: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 3 Chunks of size 47185920 totalling 135.00MiB
2022-02-11 11:04:57.056104: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 5 Chunks of size 75497472 totalling 360.00MiB
2022-02-11 11:04:57.056113: I tensorflow/core/common_runtime/bfc_allocator.cc:1058] Sum Total of in-use chunks: 537.77MiB
2022-02-11 11:04:57.056122: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] total_region_allocated_bytes_: 23385669632 memory_limit_: 23385669632 available bytes: 0 curr_region_allocation_bytes_: 46771339264
2022-02-11 11:04:57.056135: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats:
Limit: 23385669632
InUse: 563890432
MaxInUse: 600314368
NumAllocs: 92
MaxAllocSize: 99865600
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-02-11 11:04:57.056161: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ***_________________________________________________________________________________________________
2022-02-11 11:04:57.056221: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:198 : Resource exhausted: OOM when allocating tensor with shape[640,1,768,768,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "issue_minimal.py", line 67, in <module>
ihvp_calculator = ExactIHVP(influence_model, random_dataset.take(1).batch(1))
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/Influenciae-0.0.1-py3.8.egg/influenciae/influence/inverse_hessian_vector_product.py", line 59, in __init__
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/Influenciae-0.0.1-py3.8.egg/influenciae/influence/inverse_hessian_vector_product.py", line 83, in _compute_inv_hessian
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/tensorflow/python/eager/backprop.py", line 1175, in jacobian
output = pfor_ops.pfor(loop_fn, target_size,
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/tensorflow/python/ops/parallel_for/control_flow_ops.py", line 206, in pfor
outputs = f()
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
result = self._call(*args, **kwds)
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 956, in _call
return self._concrete_stateful_fn._call_flat(
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 591, in call
outputs = execute.execute(
File "/home/lucas.hervier/bdd100/bdd_env/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[640,1,768,768,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node loop_body/PartitionedCall/pfor/PartitionedCall/gradients/gradient_tape/model/conv2d_1/Conv2D/Conv2DBackpropFilter_grad/Conv2D/pfor/Tile}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_f_1291]
Function call stack:
f
As you can see, I hit an OOM error when TensorFlow tries to allocate a tensor of shape [640, 1, 768, 768, 32]. Here 640 is the number of weights (i.e. the size of the gradient vector), 1 is the number of inputs, and [768, 768, 32] is the shape of the input once it has gone through all the layers except the last one. As the traceback shows, this tensor is allocated when we call:
hess = tf.squeeze(tape_hess.jacobian(grads, weights))
in the _compute_inv_hessian function of the inverse_hessian_vector_product.py file.
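To put the failing allocation in perspective, here is a quick back-of-the-envelope check. The shape, dtype and allocator limit are taken from the log above; everything else is plain arithmetic.

```python
# Shape and dtype come from the OOM message; the limit is the "Limit" value
# reported in the BFC allocator stats.
shape = (640, 1, 768, 768, 32)
bytes_per_float32 = 4
gpu_limit_bytes = 23385669632

num_elements = 1
for dim in shape:
    num_elements *= dim

tensor_bytes = num_elements * bytes_per_float32
gib = 2 ** 30

print(f"requested tensor: {tensor_bytes / gib:.2f} GiB")
print(f"allocator limit:  {gpu_limit_bytes / gib:.2f} GiB")
```

The single tiled tensor needs 45 GiB, roughly twice what the allocator is allowed to use, so the vectorized Jacobian cannot fit no matter what else is resident on the GPU.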
Describe the solution you'd like
I know this tensor is needed to compute the Hessian, but I was wondering whether its computation could be split along the gradient dimension. My colleague @dv-ai found a workaround that only requires a small change to the _compute_inv_hessian function:
Old:
def _compute_inv_hessian(self, dataset: tf.data.Dataset) -> tf.Tensor:
    """
    Compute the (pseudo)-inverse of the hessian matrix wrt to the model's parameters using backward-mode AD.
    Disclaimer: this implementation trades memory usage for speed, so it can be quite memory intensive, especially
    when dealing with big models.
    Args:
        dataset: tf.data.Dataset
            A TF dataset containing the whole or part of the training dataset for the computation of the inverse
            of the mean hessian matrix.
    Returns:
        A tf.Tensor with the resulting inverse hessian matrix
    """
    weights = self.model.weights
    with tf.GradientTape(persistent=False, watch_accessed_variables=False) as tape_hess:
        tape_hess.watch(weights)
        grads = self.model.batch_gradient(dataset) if dataset._batch_size == 1 \
            else self.model.batch_jacobian(dataset)
    hess = tf.squeeze(tape_hess.jacobian(grads, weights))
    hessian = tf.reduce_mean(tf.reshape(hess, (-1, int(tf.reduce_prod(weights.shape)), int(tf.reduce_prod(weights.shape)))), axis=0)
    return tf.linalg.pinv(hessian)
Alternative:
def _compute_inv_hessian(self, dataset: tf.data.Dataset) -> tf.Tensor:
    """
    Compute the (pseudo)-inverse of the hessian matrix wrt to the model's parameters using
    backward-mode AD.
    Disclaimer: this implementation trades memory usage for speed, so it can be quite
    memory intensive, especially when dealing with big models.
    Parameters
    ----------
    dataset
        A TF dataset containing the whole or part of the training dataset for the
        computation of the inverse of the mean hessian matrix.
    Returns
    ----------
    inv_hessian
        A tf.Tensor with the resulting inverse hessian matrix
    """
    weights = self.model.weights
    with tf.GradientTape(persistent=True, watch_accessed_variables=False) as tape_hess:
        tape_hess.watch(weights)
        grads = self.model.batch_gradient(dataset) if dataset._batch_size == 1 \
            else self.model.batch_jacobian(dataset) # pylint: disable=W0212
    hess = tf.squeeze(tape_hess.jacobian(grads, weights, parallel_iterations=10, experimental_use_pfor=False))
    hessian = tf.reduce_mean(tf.reshape(hess,
                                        (-1, int(tf.reduce_prod(weights.shape)),
                                         int(tf.reduce_prod(weights.shape)))), axis=0)
    return tf.linalg.pinv(hessian)
By changing persistent to True and by passing parallel_iterations=10 and experimental_use_pfor=False to the .jacobian call, the computation completes without running out of memory.
N.B.: the value 10 itself is not important, as long as it is a divisor of the length of the gradient vector (unfortunate for prime lengths, though).
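For reference, here is a minimal sketch of the same two tf.GradientTape.jacobian settings on a toy problem, independent of Influenciae. The variable w, the cubic loss and the helper toy_hessian are hypothetical and only serve the illustration; the point is that the loop-based path (experimental_use_pfor=False) needs a persistent tape in eager mode and yields the same Hessian as the vectorized path.

```python
import tensorflow as tf

# Hypothetical stand-in for the model weights.
w = tf.Variable([1.0, 2.0, 3.0])


def toy_hessian(use_pfor, parallel_iterations=None):
    """Hessian of sum(w^3) w.r.t. w, computed as the Jacobian of the gradient."""
    # In eager mode, experimental_use_pfor=False requires persistent=True,
    # because the tape is queried once per row of the Jacobian.
    with tf.GradientTape(persistent=True) as outer:
        with tf.GradientTape() as inner:
            loss = tf.reduce_sum(w * w * w)
        grads = inner.gradient(loss, w)
    return outer.jacobian(
        grads,
        w,
        parallel_iterations=parallel_iterations,
        experimental_use_pfor=use_pfor,
    )


# Vectorized path: all rows at once (fast, but memory hungry).
hess_pfor = toy_hessian(use_pfor=True)
# Loop-based path: rows are computed one after another (slower, lower peak memory).
hess_loop = toy_hessian(use_pfor=False, parallel_iterations=2)

# Analytically, the Hessian of sum(w^3) is diag(6 * w).
print(hess_pfor.numpy())
print(hess_loop.numpy())
```

With pfor disabled, each row is obtained by a separate backward pass, which is why the large tiled tensor from the traceback is never materialized.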
Indeed, if I add the following to my script:
print(ihvp_calculator.inv_hessian)
I get:
tf.Tensor(
[[ 2.3457441 -0.07872738 -0.11368337 ... 0.02131678 0.02238739
0.04105094]
[-0.07837234 2.576137 -0.12778574 ... 0.02324321 0.02715976
0.03761083]
[-0.11375846 -0.12770845 2.8135462 ... 0.02000072 0.0255051
0.03220554]
...
[ 0.02132007 0.02319163 0.01998054 ... 0.7005969 -0.01072854
-0.0270703 ]
[ 0.02241289 0.02717561 0.02547131 ... -0.0106203 0.87094194
-0.03467852]
[ 0.04103031 0.03757853 0.03215647 ... -0.02701988 -0.0346096
0.77158403]], shape=(640, 640), dtype=float32)
The computation still takes some time, but that makes sense given the number of parameters. Is there any way to set those parameters in the constructor, or at least when calling _compute_inv_hessian? Otherwise, could the computation be split automatically over the different gradients?
Additional remarks
While running these experiments I also noticed a few things:
- In compute_hvp you do:
if self.hessian is None:
    self.hessian = tf.linalg.pinv(self.inv_hessian)
But this branch is necessarily taken, since self.hessian is never set anywhere else. Why not set self.hessian = hessian in _compute_inv_hessian? Since the Hessian is needed to compute its inverse anyway, why call tf.linalg.pinv again, which is very costly?
- In common/model_wrappers.py: in both _gradient and _jacobian, I would change the following line:
with tf.GradientTape() as tape:
To:
with tf.GradientTape(watch_accessed_variables=False) as tape:
But maybe there is a good reason not to do it?
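As a small illustration of what watch_accessed_variables=False changes, here is a sketch on two hypothetical variables a and b (not taken from model_wrappers.py). The tape then records only what is explicitly passed to tape.watch, so the suggested change assumes the wrappers watch the weights themselves.

```python
import tensorflow as tf

# Hypothetical variables, for illustration only.
a = tf.Variable(2.0)
b = tf.Variable(3.0)

with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(a)  # only `a` is tracked; `b` is deliberately left unwatched
    y = a * b

grad_a, grad_b = tape.gradient(y, [a, b])

print(grad_a)  # gradient w.r.t. the watched variable: dy/da = b
print(grad_b)  # None, because `b` was never watched
```

Limiting the tape to the weights of interest avoids recording operations for variables whose gradients are never requested.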
Otherwise, this is really nice work. I know my issue relates to a fairly advanced use case, and I apologize for that!