tallamjr/astronet

Lazy loading of large numpy data files


It would be better to move towards a memory-mapped data-flow model so that, when training or running tests locally, not all of the data is materialised in memory.

This would allow full-dataset training locally on a laptop, and faster turnaround for quick iterations.

By using tf.data.Dataset input pipelines and numpy.memmap, this can be done with Python generators (see astronet.datasets for an example), so that data is only materialised in batches rather than as the "full" dataset at runtime. A nice consequence of the input-pipeline workflow is better GPU utilisation, now up to 97% for most runs (locally on arm64).
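
To see the numpy side in isolation, here is a minimal sketch, using a toy array and a made-up /tmp path rather than the real processed files, of what mmap_mode="r" buys:

    import numpy as np

    # Toy stand-in for one of the processed PLAsTiCC arrays (path and shape
    # are made up for the illustration).
    arr = np.random.rand(10_000, 100, 6).astype(np.float32)
    np.save("/tmp/toy_X.npy", arr)

    # mmap_mode="r" returns a numpy.memmap backed by the file on disk;
    # nothing is bulk-read into RAM at this point.
    X = np.load("/tmp/toy_X.npy", mmap_mode="r")
    print(type(X))      # <class 'numpy.memmap'>

    # Only the pages backing these 32 examples are read from disk.
    batch = np.array(X[:32])
    print(batch.shape)  # (32, 100, 6)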

Example lazy load workflow:

    import numpy as np
    import tensorflow as tf

    from astronet.datasets import lazy_load_plasticc_wZ

    # Lazy load data: mmap_mode="r" keeps the arrays on disk, so nothing is
    # read into RAM until individual examples are accessed. asnwd, RANDOM_SEED
    # and BATCH_SIZE are defined elsewhere in the training script.
    X_train = np.load(f"{asnwd}/data/plasticc/processed/X_train.npy", mmap_mode="r")
    Z_train = np.load(f"{asnwd}/data/plasticc/processed/Z_train.npy", mmap_mode="r")
    y_train = np.load(f"{asnwd}/data/plasticc/processed/y_train.npy", mmap_mode="r")
    # X_test, Z_test and y_test are loaded in the same way

    # Build batched, prefetched tf.data input pipelines over the memmapped arrays
    train_ds = (
        lazy_load_plasticc_wZ(X_train, Z_train, y_train)
        .shuffle(1000, seed=RANDOM_SEED)
        .batch(BATCH_SIZE, drop_remainder=True)
        .prefetch(tf.data.AUTOTUNE)
    )
    test_ds = (
        lazy_load_plasticc_wZ(X_test, Z_test, y_test)
        .batch(BATCH_SIZE, drop_remainder=True)
        .prefetch(tf.data.AUTOTUNE)
    )
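
These datasets then drop straight into Keras. As a hedged sketch only: the toy model below (and EPOCHS) are placeholders that assume X_train holds 3-D light-curve windows, Z_train a flat feature vector per example, and y_train one-hot labels; they are not taken from the repository:

    # Placeholder two-input model; the layer choices and shapes are
    # illustrative only, not the architecture used in astronet.
    input_1 = tf.keras.Input(shape=X_train.shape[1:], name="input_1")
    input_2 = tf.keras.Input(shape=Z_train.shape[1:], name="input_2")
    h = tf.keras.layers.GlobalAveragePooling1D()(input_1)
    h = tf.keras.layers.Concatenate()([h, input_2])
    outputs = tf.keras.layers.Dense(y_train.shape[-1], activation="softmax")(h)

    model = tf.keras.Model(inputs=[input_1, input_2], outputs=outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")

    # Only BATCH_SIZE examples are materialised per training step; prefetch
    # overlaps host-side reads with device-side compute, which is where the
    # improved GPU utilisation comes from.
    model.fit(train_ds, validation_data=test_ds, epochs=EPOCHS)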

The generator-based helper in astronet.datasets:

    import tensorflow as tf


    def lazy_load_plasticc_wZ(X, Z, y):

        # Generator that walks the (memmapped) arrays one example at a time,
        # yielding the two named model inputs and the corresponding label.
        def generator():
            for x, z, label in zip(X, Z, y):
                yield ({"input_1": x, "input_2": z}, label)

        # Create a tf.data.Dataset from the generator; the output signature is
        # inferred from a single example, so only one row of each array is
        # touched up front.
        dataset = tf.data.Dataset.from_generator(
            generator=generator,
            output_signature=(
                {
                    "input_1": tf.type_spec_from_value(X[0]),
                    "input_2": tf.type_spec_from_value(Z[0]),
                },
                tf.type_spec_from_value(y[0]),
            ),
        )

        return dataset
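
A quick, self-contained way to sanity-check the helper defined above, using tiny synthetic arrays (arbitrary shapes) in place of the memmapped PLAsTiCC files:

    import numpy as np

    # Tiny synthetic stand-ins for the real arrays: 100 light curves of
    # 50 timesteps x 6 passbands, 2 additional features and 14 one-hot
    # classes (all shapes are arbitrary here).
    X = np.random.rand(100, 50, 6).astype(np.float32)
    Z = np.random.rand(100, 2).astype(np.float32)
    y = np.eye(14, dtype=np.float32)[np.random.randint(0, 14, size=100)]

    ds = lazy_load_plasticc_wZ(X, Z, y).batch(32, drop_remainder=True)

    inputs, labels = next(iter(ds))
    print(inputs["input_1"].shape)  # (32, 50, 6)
    print(inputs["input_2"].shape)  # (32, 2)
    print(labels.shape)             # (32, 14)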