ignore one class subfolder while using image_dataset_from_directory() function
maitra opened this issue · 10 comments
I am looking into the R package keras and the function image_dataset_from_directory(). According to the help page,
If your directory structure is:
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
Then calling ‘image_dataset_from_directory(main_directory, labels='inferred')’ will return a ‘tf.data.Dataset’ that yields batches of images from the subdirectories class_a and class_b, together with labels 0 and 1 (0 corresponding to class_a and 1 corresponding to class_b).
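So, to make the documented behaviour concrete (this is only my own sketch, not from the help page):
library(keras)
# labels = "inferred" is the default; the class indices follow the
# alphanumerically sorted subdirectory names (class_a -> 0, class_b -> 1)
ds <- image_dataset_from_directory("main_directory", labels = "inferred",
                                   image_size = c(180, 180), batch_size = 32)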
However, I have three folders:
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...
...class_b/
......b_image_1.jpg
......b_image_2.jpg
...
...class_c/
......c_image_1.jpg
......c_image_2.jpg
...
I want to read only two of these classes (and ignore the third). Is there a way to do this using image_dataset_from_directory() or some other function?
This is not directly supported by the convenience function image_dataset_from_directory(), but it should be straightforward to hack around the limitations of the function to achieve what you want.
The simplest fix is probably to use {tfdatasets}, with either dataset_map() or dataset_filter(), to drop the labels you're uninterested in. This is an expedient path if data loading is not the bottleneck in your pipeline.
# first get the labels, sorted in the same order that keras
# sorts the directories in main_directory
library(reticulate)
os <- reticulate::import("os")
sorted_labels <- os$walk(main_directory) |> iter_next() |> _[[2]]
labels <- seq(0, along = sorted_labels)
names(labels) <- sorted_labels
my_unwanted_labels <- labels %>% .[names(.) %in% c("class_c")] %>% unname()
library(keras)
library(tfdatasets)
ds <- image_dataset_from_directory(....) %>%
dataset_map(\(images, labels) {
keep <- my_unwanted_labels |>
lapply(\(bad_label) labels != bad_label) |>
purrr::reduce(`&`)
tuple(images[keep], labels[keep])
})
Or, using dataset_filter():
my_unwanted_labels %<>% as_tensor()
ds <- image_dataset_from_directory(....) %>%
dataset_unbatch() %>%
dataset_filter(\(image, label) !k_any(label == my_unwanted_labels)) %>%
dataset_batch(batch_size = 32)
Alternatively, instead of fixing up the output of image_dataset_from_directory(), you can fix up the input by creating a directory with a curated set of symlinks. Something like:
library(fs)
library(keras)
curated_dataset <- fs::path("curated_dataset") |> path_abs()
dir_create(curated_dataset)
class_dirs <- dir_ls(main_directory, recurse = FALSE) %>%
.[!basename(.) %in% c("class_c")] %>%
path_abs()
link_create(class_dirs, # link target
path(curated_dataset, basename(class_dirs)) # link location
ds <- image_dataset_from_directory(curated_dataset, follow_links = FALSE)
(All the code snippets above are untested, but I trust you can figure out the rest.)
Thanks for this! Very helpful.
The fixing-up-the-input solution works (you do have a parenthesis missing, but that can be easily fixed).
library(fs)
library(keras)
curated_dataset <- fs::path("curated_dataset") |> path_abs()
dir_create(curated_dataset)
class_dirs <- dir_ls(main_directory, recurse = FALSE) %>%
.[!basename(.) %in% c("class_c")] %>%
path_abs()
link_create(class_dirs, # link target
path(curated_dataset, basename(class_dirs))) # link location
ds <- image_dataset_from_directory(curated_dataset, follow_links = FALSE)
However, I think that the fix-the-output solution is more desirable; for one, it does not create all these curated symlinks. But I cannot see whether the other arguments of image_dataset_from_directory can be made to work.
# first get the labels, sorted in the same order that keras
# sorts the directories in main_directory
library(reticulate)
os <- reticulate::import("os")
sorted_labels <- os$walk(main_directory) |> iter_next() |> _[[2]]
labels <- seq(0, along = sorted_labels)
names(labels) <- sorted_labels
my_unwanted_labels <- labels %>% .[names(.) %in% c("class_c")] %>% unname()
library(keras)
library(tfdatasets)
ds <- image_dataset_from_directory(....) %>%
dataset_map(\(images, labels) {
keep <- my_unwanted_labels |>
lapply(\(bad_label) labels != bad_label) |>
purrr::reduce(`&`)
tuple(images[keep], labels[keep])
})
First, I wanted to note that I get:
Warning messages:
1: In `[.tensorflow.tensor`(images, keep) :
Incorrect number of dimensions supplied. The number of supplied arguments, (not counting any NULL, tf$newaxis or np$newaxis) must match the number of dimensions in the tensor, unless an all_dims() was supplied (this will produce an error in the future)
2: In force(if_any_TRUE) :
Indexing tensors are passed as-is to python, no index offsetting or R to python translation is performed. Selected options for one_based and inclusive_stop are ignored and treated as FALSE. To silence this warning, set options(tensorflow.extract.warn_tensors_passed_asis = FALSE)
But I would also like to pass some arguments to the image_dataset_from_directory() function, like:
image_size = c(180, 180),
validation_split = 0.6,
subset = "training",
seed = random.seed,
batch_size = 32
How do I do this? Thanks again for this wonderful resource that allows me to use R with keras!
To get rid of this warning:
In `[.tensorflow.tensor`(images, keep) :
Incorrect number of dimensions supplied....
You can change the call images[keep] to images[keep, all_dims()] (same for labels[keep] and any other calls to [ where you are implicitly slicing along the first dim of a multidimensional tensor). (tensorflow::all_dims(), reticulate::py_ellipsis(), and reticulate::py_eval("...") all return the same thing.)
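Concretely, the map step from before would then look something like this (still an untested sketch, with my_unwanted_labels defined as above):
ds <- image_dataset_from_directory(....) %>%
  dataset_map(\(images, labels) {
    keep <- my_unwanted_labels |>
      lapply(\(bad_label) labels != bad_label) |>
      purrr::reduce(`&`)
    # subset along the first (batch) dim; all_dims() covers any remaining dims
    tuple(images[keep, all_dims()], labels[keep, all_dims()])
  })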
The second note is issued when you are subsetting a tensor with another tensor. It's a one-time warning per R session, to help remind you that x[1] is not the same as x[as_tensor(1L)]. You can silence it globally by calling options(tensorflow.extract.warn_tensors_passed_asis = FALSE).
I think that all the other arguments should still work. The one thing that might change is the exact output shape of the tfdataset that is returned, and you'd have to adjust the formals of the function passed to dataset_map() or dataset_filter() to match.
When in doubt about what exact signature is needed, and to avoid a guessing game, you can quickly test by passing a function with ..., something like this:
image_dataset_from_directory(<many args>) %>%
dataset_map(function(...) {
str(list(...))
# you can also do "browser-driven development", and write the body of the
# function with live references to the symbolic "graph-mode" tensors available
# for interactive, line-by-line testing, by dropping into a browser() context here:
browser()
# just be sure to exit the browser() by "(c)ontinuing" and not by "(q)uitting".
# If you quit, tensorflow keeps the tracing context open, leaving the session in a
# broken state that requires an R session restart to fix.
})
Then when you are done experimenting/writing, you can update the function signature for future readability:
image_dataset_from_directory(...., validation_split = .... ) %>%
dataset_map(function(train, val) {
names(train) <- names(val) <- c("images", "labels")
for(nm in c("images", "labels")) {
train[[nm]] %<>% .[keep, all_dims()]
val[[nm]] %<>% .[keep, all_dims()]
}
tuple(lapply(list(train, val), unname))
})
My apologies: I have been trying several things for a while, but I am still confused about this. Let us use the AFHQ dataset (from your co-authored book). I want to focus only on the cats and dogs, to roughly match what you are doing there in Chapter 8, but as a learning experience, I do not want to create a new folder of images as you have done there.
base_dir <- fs::path("afhq")
library(fs)
library(keras)
random.seed <- 415588819
library(reticulate)
os <- reticulate::import("os")
sorted_labels <- os$walk(base_dir / "train") |> iter_next() |> _[[2]]
labels <- seq(0, along = sorted_labels)
names(labels) <- sorted_labels
my_unwanted_labels <- labels %>% .[names(.) %in% c("wild")] %>% unname()
library(keras)
library(tfdatasets)
ds <- image_dataset_from_directory(base_dir / "train", validation_split = 0.8, image_size = c(180, 180), batch_size = 32, subset = "both", seed = random.seed) |>
dataset_map(\(images, labels) {
keep <- my_unwanted_labels |>
lapply(\(bad_label) labels != bad_label) |>
purrr::reduce(`&`)
tuple(images[keep, all_dims()], labels[keep, all_dims()])
})
However, I get:
Found 14630 files belonging to 3 classes.
Using 2926 files for training.
Using 11704 files for validation.
Error in dataset$map(map_func = as_py_function(map_func), num_parallel_calls = as_integer_tensor(num_parallel_calls, :
attempt to apply non-function
I feel like I am almost there; however, I am still stuck.
Thanks again for all your help! And thanks also for the book, and the resource!
Here is a working example using a mnist dataset (most convenient for me right now)
library(purrr)
library(fs)
library(keras)
library(tfdatasets)
class_names <- xfun::n2w(0:9)
unwanted_class_names <- xfun::n2w(c(6, 9))
class_labels <- seq.int(from = 0, along.with = class_names)
names(class_labels) <- class_names
unwanted_labels <- local({
class_labels %>% .[names(.) %in% unwanted_class_names]
})
dir <- tempfile("mnist-")
dir_create(dir, class_names)
mnist <- dataset_mnist()
walk(seq_len(nrow(mnist$train$x)), \(i) {
img <- mnist$train$x[i,,]/255
lbl <- mnist$train$y[i]
jpeg::writeJPEG(image = img,
target = path(dir, xfun::n2w(lbl), i, ext = "jpeg"))
})
ds <- image_dataset_from_directory(dir, class_names = class_names)
ds <- ds %>%
dataset_unbatch() %>%
dataset_filter(\(img, lbl) k_all(lbl != unwanted_lbls)) %>%
dataset_batch(32)
# confirm the unwanted labels aren't there
seen_labels <- ds %>%
dataset_take(10) %>%
as_array_iterator() %>%
reticulate::iterate(\(x) {
c(images, labels) %<-% x
unique(labels)
}) %>%
unlist() %>% unique() %>% sort()
# 0 1 2 3 4 5 7 8
stopifnot(!unwanted_labels %in% seen_labels)
# Note, in the upcoming keras 3 / keras_core, passing a subset of names to `class_names` will work:
ds <- image_dataset_from_directory(dir, class_names = class_names[1:3])
And thanks also for the book, and the resource!
Thank you! I'm glad to hear you find it helpful.
Here is a working example using a mnist dataset (most convenient for me right now)
Thank you! There is a typo there, for anyone looking at this for future reference. It is obvious, but unwanted_lbls should be unwanted_labels in the reduction part of the code.
Btw, after the reduction, length(ds) no longer works; I get an NA. We need to use this later in the coding for pretraining. How do we refer to this? Many thanks again!
Making length() of a TF Dataset non-NA after applying a dataset_filter() requires manually injecting the length information into the pipeline. There isn't a non-experimental way to do this yet, but this works in TF 2.14:
n_images <- list.files(dir, full.names = TRUE) %>%
.[!basename(.) %in% unwanted_class_names] %>%
list.files(pattern = "\\.jpe?g$") %>%
length()
ds <- image_dataset_from_directory(dir, class_names = class_names)
ds <- ds %>%
dataset_unbatch() %>%
dataset_filter(\(img, lbl) k_all(lbl != unwanted_labels)) %>%
{ .$apply(tf$data$experimental$assert_cardinality(n_images)) } %>%
dataset_batch(32)
length(ds) # 1505
Odd. I have a problem with the AFHQ dataset (sorry):
base_dir <- fs::path("afhq")
library(tfdatasets)
library(keras)
class_names <- c("cat", "dog", "wild")
unwanted_class_names <- c("wild")
class_labels <- seq.int(from = 0, along.with = class_names)
names(class_labels) <- class_names
unwanted_labels <- local({
class_labels %>% .[names(.) %in% unwanted_class_names]
})
ds <- image_dataset_from_directory(base_dir / "train",
class_names = class_names)
n_images <- list.files(base_dir, full.names = TRUE) %>%
.[!basename(.) %in% unwanted_class_names] |>
list.files(pattern = "\\.jpe?g$") %>%
length()
ds <- ds |>
dataset_unbatch() |>
dataset_filter(\(img, lbl) k_all(lbl != unwanted_labels)) %>%
{ .$apply(tf$data$experimental$assert_cardinality(n_images)) } |>
dataset_batch(32)
I get:
Found 14630 files belonging to 3 classes.
Then length(ds) gives 0.
I don't quite understand what is going wrong here. Thanks!
I think rather than working around the current TF Dataset cardinality limitations, it's simpler to create temporary links:
library(fs)
library(keras)
image_dataset_from_directory_subset <- function(directory, ..., class_names) {
directory2 <- dir_create(path_temp(file_temp(), path_file(directory)))
stopifnot(class_names %in% list.files(directory))
link_create(path_abs(path(directory, class_names)), # link target (absolute, so the symlink resolves)
            path(directory2, class_names))           # link location
keras::image_dataset_from_directory(directory2, ..., class_names = class_names)
}
ds <- image_dataset_from_directory_subset(dir, class_names = class_names[1:5])
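Applied to the AFHQ case from earlier in the thread, the call would presumably look something like this (untested, with base_dir and random.seed as defined in your earlier snippets, and any other arguments forwarded via ...):
train_ds <- image_dataset_from_directory_subset(
  base_dir / "train",
  class_names = c("cat", "dog"),  # drop "wild"
  image_size = c(180, 180),
  validation_split = 0.6,
  subset = "training",
  seed = random.seed,
  batch_size = 32
)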