Lazy loading of large numpy data files
tallamjr opened this issue · 0 comments
tallamjr commented
It would be better to move towards a memory-mapped data flow model, such that not all of the data is materialised when training or running tests locally.
This would allow full-dataset training locally on a laptop, and faster turnaround for quick iterations.
By combining `tf.data.Dataset` input pipelines with `numpy.memmap`, this can be done using Python generators (see `astronet.datasets` for an example), such that data is only materialised batch by batch rather than as the "full" dataset at runtime. A nice consequence of the input pipeline workflow is better GPU utilisation, now up to 97% for most runs (locally on `arm64`).
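For context, the effect of memory mapping can be exercised in isolation. The sketch below is illustrative only (the file path and array shape are made up, not the real PLAsTiCC arrays): `np.load(..., mmap_mode="r")` returns a `numpy.memmap`, and only the slices that are actually indexed get materialised in memory.

```python
import numpy as np

# Write an example array to disk, then reopen it memory-mapped.
# Path and shape are hypothetical, purely for illustration.
np.save("/tmp/X_example.npy", np.random.rand(10_000, 100, 6).astype(np.float32))

X = np.load("/tmp/X_example.npy", mmap_mode="r")
print(type(X))   # <class 'numpy.memmap'>
print(X.shape)   # (10000, 100, 6)

# Only this slice is read from disk and materialised as a regular ndarray.
batch = np.asarray(X[:32])
print(batch.shape)  # (32, 100, 6)
```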
Example lazy load workflow:
```python
import numpy as np
import tensorflow as tf

from astronet.datasets import lazy_load_plasticc_wZ

# Lazy load data: mmap_mode="r" keeps the arrays on disk until sliced
X_train = np.load(f"{asnwd}/data/plasticc/processed/X_train.npy", mmap_mode="r")
Z_train = np.load(f"{asnwd}/data/plasticc/processed/Z_train.npy", mmap_mode="r")
y_train = np.load(f"{asnwd}/data/plasticc/processed/y_train.npy", mmap_mode="r")

train_ds = (
    lazy_load_plasticc_wZ(X_train, Z_train, y_train)
    .shuffle(1000, seed=RANDOM_SEED)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)

test_ds = (
    lazy_load_plasticc_wZ(X_test, Z_test, y_test)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
```
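As a quick sanity check (this continues from the snippet above and assumes `train_ds` has been built), one can inspect `element_spec` and pull a single batch without materialising the full arrays:

```python
# Structure should match the output_signature defined in lazy_load_plasticc_wZ,
# i.e. ({"input_1": ..., "input_2": ...}, label), batched to BATCH_SIZE.
print(train_ds.element_spec)

for inputs, labels in train_ds.take(1):
    print(inputs["input_1"].shape, inputs["input_2"].shape, labels.shape)
```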
In `astronet.datasets`:
```python
import tensorflow as tf


def lazy_load_plasticc_wZ(X, Z, y):
    # Generator function: yields one (inputs, label) example at a time,
    # so only the rows being read are pulled from the memory-mapped arrays
    def generator():
        for x, z, label in zip(X, Z, y):
            yield ({"input_1": x, "input_2": z}, label)

    # Create a tf.data.Dataset from the generator, with the element
    # structure and dtypes inferred from the first example of each array
    dataset = tf.data.Dataset.from_generator(
        generator=generator,
        output_signature=(
            {
                "input_1": tf.type_spec_from_value(X[0]),
                "input_2": tf.type_spec_from_value(Z[0]),
            },
            tf.type_spec_from_value(y[0]),
        ),
    )

    return dataset
```
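Consuming the lazily loaded datasets is then the usual Keras workflow. A minimal sketch, assuming a compiled `model` whose inputs are named `"input_1"` and `"input_2"` (to match the dicts yielded above) and an `EPOCHS` constant, neither of which is defined in this issue:

```python
# Batches are pulled on demand from the memory-mapped arrays as training runs,
# so the full dataset never needs to fit in memory at once.
history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=test_ds,
)
results = model.evaluate(test_ds)
```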