Rewrite augmentation pipeline
ebgoldstein opened this issue · 14 comments
In the TF docs from 2.9 on, tf.keras.preprocessing
has a deprecation warning:
https://www.tensorflow.org/versions/r2.10/api_docs/python/tf/keras/preprocessing
This will impact the make_data script, which relies on this suite of tools (i.e., `tf.keras.preprocessing.image.ImageDataGenerator`) to make the augmented imagery. See here:
segmentation_gym/make_nd_dataset.py
Lines 578 to 800 in c1669a0
In light of this, it seems wise to think/plan/prepare for the moment when we need to convert the augmentation routines to the recommended workflow using `tf.keras.utils`. The relevant links in the TF documentation can be found at the link above.
note that this has been discussed: #60
https://albumentations.ai/docs/api_reference/augmentations/ seems best, especially because we are concerned with environmental imagery, and the functional augs include sun glint, snow, and fog https://albumentations.ai/docs/api_reference/augmentations/functional/
2024 and this is still a christmas wish
I think I could take this on this year and would base it around
```python
dataset = tf.keras.utils.image_dataset_from_directory(
    folder,
    labels='inferred',
    label_mode='int',
    class_names=None,
    batch_size=32,
    image_size=TARGET_SIZE,
    shuffle=False,
    seed=None,
    validation_split=None,
    subset=None,
    interpolation="bilinear",
)
```
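For what it's worth, a numpy-based augmenter (such as an albumentations pipeline) could be attached to a `tf.data` dataset like the one above via `dataset.map()` and `tf.numpy_function`. The sketch below is a hypothetical pattern, not Gym code; a trivial flip stands in for the real augmentation, and the toy tensors stand in for `image_dataset_from_directory` output.

```python
import numpy as np
import tensorflow as tf

def augment_numpy(images):
    # stand-in for an albumentations call; here just a horizontal flip
    return np.ascontiguousarray(images[:, :, ::-1, :])

def augment_batch(images, labels):
    # wrap the numpy augmenter so it can run inside dataset.map()
    images = tf.numpy_function(augment_numpy, [images], tf.float32)
    return images, labels

# toy stand-in for image_dataset_from_directory output: batched (image, label)
images = tf.random.uniform((8, 64, 64, 3))
labels = tf.zeros((8,), dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)
dataset = dataset.map(augment_batch)

for x, y in dataset.take(1):
    print(x.shape)  # (4, 64, 64, 3)
```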
Question: so I am guessing these augmentations get done at the time of training, and new images are not actually saved?
I think it would be easier (at least for me) to integrate albumentations by actually saving the augmented images with the rest of the dataset.
Correct. Gym works by preparing your dataset for you and making batched tensors of augmented data. This is deliberately done so you always know what data is used for training and what for validation. Importantly only the training data is augmented.
I would recommend we eventually modify the make_dataset.py function with an albumentations-based workflow. But yes, for now you could trial model training by augmenting the imagery first. Note that this would be suboptimal in the long term because it needlessly duplicates image files. So let's put a basic workflow together and then ideally wrap that into the existing Gym workflow.
Just so we are all on the same page: make_datasets actually creates the augmented images, which are saved as npz files. Then train_model uses those (augmented) npz images to train the model. So images are not augmented 'on the fly' as in many workflows (i.e., preprocessing layers in the model, data generators, etc.), but rather pre-augmented. I recall the biggest reason we did this was efficiency (GPU utilization is always near 100% for me, compared with many 'on the fly' augmentation strategies where GPU utilization is lower, at the expense of more CPU)
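To illustrate the pre-augmentation design described above, here is a minimal numpy-only sketch of augmenting an image/label pair once and saving it as npz for later training. The file name, keys, and the flip augmentation are all illustrative, not Gym's actual format.

```python
import numpy as np

def augment(image, label, rng):
    # stand-in augmentation: random horizontal flip applied to image and
    # label together, so the segmentation mask stays aligned
    if rng.random() < 0.5:
        return image[:, ::-1], label[:, ::-1]
    return image, label

rng = np.random.default_rng(42)  # fixed seed => reproducible augmentations
image = rng.random((128, 128, 3)).astype(np.float32)
label = (rng.random((128, 128)) > 0.5).astype(np.uint8)

# pre-augment once and save as npz; training code then just loads these
aug_image, aug_label = augment(image, label, rng)
np.savez_compressed("sample_aug.npz", image=aug_image, label=aug_label)

loaded = np.load("sample_aug.npz")
print(loaded["image"].shape, loaded["label"].shape)  # (128, 128, 3) (128, 128)
```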
@mlundine - i agree that albumentations is the correct way to go.
@dbuscombe-usgs - i agree that we don;t want to duplicate/save augmented images
Yes, that's a good summary. Pre-augmentation (as opposed to on-the-fly) has reproducibility benefits too, in the sense that the augmented data are saved in the "GPU-ready" npz format, so it would in theory be possible to assess the distributions of augmented data post hoc, rather than in a non-reproducible, ad-hoc way.
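As a sketch of that post-hoc assessment: because the augmented data sit on disk as npz, their statistics can be computed at any time. The file and key names below are illustrative only.

```python
import numpy as np

# write a toy "augmented batch" to npz, standing in for Gym's saved files
np.savez_compressed("aug_batch.npz",
                    image=np.random.default_rng(0).random((4, 64, 64, 3)))

# post-hoc: reload and summarise the augmented distribution per channel
arr = np.load("aug_batch.npz")["image"]
channel_means = arr.mean(axis=(0, 1, 2))   # per-channel mean over the batch
channel_stds = arr.std(axis=(0, 1, 2))     # per-channel spread
print(channel_means.shape)  # (3,)
```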
I think we're all interested in albumentations and I'm keen to get it at least as an option in the gym workflow
@mlundine - just looping back to getting Albumentations working w/o rewriting the augmentation pipeline:
Since we use the deprecated/old-style keras generators, the easiest method is to add a preprocessing function (https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) in 3 easy steps:
- of course, adding an import:

```python
# import albumentations
import albumentations as A
```
- defining a preprocessing function with your chosen albumentation augs:
```python
# preprocessing function with albumentations... example with channel shuffle
def albumentize(image):
    aug = A.Compose([
        A.ChannelShuffle(),
    ])
    AugI = aug(image=image)['image']
    return AugI
```
- add a call to the preprocessing function on line 719-739 of
make_dataset.py
so add `preprocessing_function=albumentize,` under `fill_mode='reflect',` for both generators
hope this helps as a quick way to get Albumentations working!
segmentation_gym/make_dataset.py
Lines 719 to 749 in cb13c70
Clarifying that more: we don't want duplicates (original image and
augmented) in the training set? Or do we want a big training set with all
original images plus each augmentation?
The way we wrote it, the training split is all augmentations; the validation split is all non-augmented images. That being said, all the augmentations are random, so there is a possibility of getting non-augmented (or weakly augmented) images in the training set.
note also that in the config, `AUG_COPIES` will oversample your training split, so you can give it a bunch of different augmented copies of the training data...
I suggest, if you want an albumentations version of Gym, that you feel free to create a branch (locally or on GH)... you could hard-code it all in for your personal needs, but it would be awesome if you added variables to the config so that they can be turned on/off globally for everyone eventually
I agree with Evan. It seems the change he is suggesting here #81 (comment) is simple enough that it could be incorporated in the existing workflow easily (on a new branch). Doodleverse is definitely designed with a broad range of users and use-cases in mind. Perhaps it could be passed a list of albumentations-style augmentations you'd like. And if the list is empty (the default), it just defaults to the status quo.
And yes, I have noticed that models tend to train better when presented with original plus augmented training data. There is no data leakage because the validation files are stored in a separate folder and are not augmented. If you wish to test this yourself,
- run make_datasets.py, then train_model.py to train a model
- delete all the non-augmented data (the files say 'noaug' in the name), then train_model.py again
- compare the 2 models
If you wish, you could add a config file parameter that suppresses the use of original imagery in training, but I recommend keeping original+augmentation by default