Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences()
Voronov-Andrey opened this issue · 5 comments
I use keras to develop an application for sentiment analysis of Russian text using deep learning models. The vast majority of guides and examples use the methods text_tokenizer(), fit_text_tokenizer(), texts_to_sequences() and pad_sequences() to convert texts into numeric sequences, as well as to_categorical() to apply one-hot encoding to class labels. But I ran into the problem that the keras3 package for the R language does not contain the methods I listed. Using keras and keras3 at the same time leads to errors in the code, and I always need to restart the R session and reload the packages to fit the model or to use those methods again. As an alternative, I tried layer_text_vectorization(), but when I use it with the same data as with text_tokenizer() and the other methods, the model does not learn at all. Is there any solution to this problem?
Here is the model plot when using text_tokenizer(), fit_text_tokenizer() and texts_to_sequences():
And here is the plot when using layer_text_vectorization() as the alternative:
In {keras3}, much of the legacy text processing API has been removed. Almost everything is now possible with just layer_text_vectorization(). The layer can be used with the helpers get_vocabulary(), set_vocabulary() and adapt(). {keras3} also provides text_dataset_from_directory(), which may be useful.
Some helpful links:
- layer_text_vectorization() reference page with many examples and extended descriptions of features: https://keras.posit.co/reference/layer_text_vectorization.html
- An end-to-end example showing layer_text_vectorization() in use: https://keras.posit.co/articles/examples/nlp/text_classification_from_scratch.html
If you're running into a specific issue with layer_text_vectorization(), please provide a minimal reproducible example and I can help figure out what's going wrong.
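For readers coming from the legacy tokenizer workflow, the rough correspondence can be sketched as follows. This is a minimal illustration rather than an official migration recipe: the toy corpus, the variable names and the sizes (1000 tokens, length 6) are made up, and it assumes the layer accepts an R character vector directly.

```r
library(keras3)

# Hypothetical toy corpus, used only for illustration.
texts <- c("the film was great", "the film was terrible", "an average film")

# Roughly replaces text_tokenizer(num_words = ...): the vocabulary size and
# the padded output length are configured up front.
vectorizer <- layer_text_vectorization(
  max_tokens = 1000L,
  output_mode = "int",
  output_sequence_length = 6L
)

# Roughly replaces fit_text_tokenizer(): learn the vocabulary from the corpus.
vectorizer %>% adapt(texts)

# Inspect the learned vocabulary. In "int" mode the first two entries are
# the padding token "" and the out-of-vocabulary token "[UNK]".
vocab <- get_vocabulary(vectorizer)

# Roughly replaces texts_to_sequences() followed by pad_sequences():
# one integer row per text, padded or truncated to output_sequence_length.
sequences <- vectorizer(texts)
```

Because two vocabulary slots are reserved for padding and unknown tokens, integer ids from this layer are not expected to match ids produced by the legacy tokenizer for the same text.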
Thanks for the answer, below I have provided the MRE and dput() output file of the data I use. If you need any additional information, I am always ready to provide it.
library(reticulate)
library(keras3)
library(dplyr)
library(stringr)
library(caret)
data_tidy <- dget("dput_output.txt", keep.source = FALSE)
# Dividing the data into training and test samples
split_data <- function(df) {
  set.seed(123)
  trainIndex <- createDataPartition(df$Sentiment, p = 0.8, list = FALSE)
  train_data <- df[trainIndex, ]
  test_data <- df[-trainIndex, ]
  return(list(train_data = train_data, test_data = test_data))
}
splitted_data <- split_data(data_tidy)
train_data <- splitted_data$train_data
test_data <- splitted_data$test_data
train_data_x <- train_data$text_tidy
train_data_y <- to_categorical(train_data$Sentiment)
test_data_x <- test_data$text_tidy
test_data_y <- to_categorical(test_data$Sentiment)
#Setting model constants
max_features <- 50000L
embedding_dim <- 128L
sequence_length <- max(sapply(data_tidy$text_tidy, str_count, pattern = "\\w+")) + 1L
# Initialising vectorize layer (data is already standardized)
vectorize_layer <- layer_text_vectorization(
  standardize = NULL,
  max_tokens = max_features,
  output_mode = "int",
  output_sequence_length = sequence_length
)
#Adapting vectorize layer on text data
vectorize_layer %>% adapt(data_tidy$text_tidy)
# Building LSTM model
model <- keras_model_sequential() %>%
  vectorize_layer() %>%
  layer_embedding(input_dim = max_features, output_dim = embedding_dim) %>%
  layer_lstm(units = 128, dropout = 0.5) %>%
  layer_dense(3, activation = 'softmax')
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = c('accuracy'),
  loss = 'categorical_crossentropy'
)
lstm_model <- model %>% fit(
  train_data_x, train_data_y,
  batch_size = 64,
  epochs = 5,
  verbose = 1,
  validation_split = 0.2
)
Hi, did you resolve the issue?
Hello, not exactly. I restructured the model with bidirectional LSTM layers and split the data into training and validation samples manually, so that I could use validation_data instead of validation_split. Under these conditions the model fits successfully. Using a structure with regular LSTM layers leads to the same problem.
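The manual split described above might look something like the sketch below. The helper split_train_val() is hypothetical and not part of keras3 or of the poster's actual code. One relevant detail is that Keras takes the validation_split portion from the last rows of the data, before any shuffling, whereas a manual split can shuffle first. Whether that difference explains the behaviour in this thread has not been established.

```r
# Hypothetical helper: shuffle row indices, then hold out a fraction of them
# for validation. Uses base R only.
split_train_val <- function(n, val_fraction = 0.2, seed = 123) {
  stopifnot(n >= 2, val_fraction > 0, val_fraction < 1)
  set.seed(seed)
  shuffled <- sample.int(n)                  # random permutation of 1..n
  n_val <- max(1L, floor(n * val_fraction))  # at least one validation row
  list(
    val = shuffled[seq_len(n_val)],
    train = shuffled[-seq_len(n_val)]
  )
}

idx <- split_train_val(n = 10)

# With the objects from the MRE above, the indices would be used roughly
# like this (left as comments because they depend on the poster's data):
# x_train <- train_data_x[idx$train]
# y_train <- train_data_y[idx$train, , drop = FALSE]
# x_val   <- train_data_x[idx$val]
# y_val   <- train_data_y[idx$val, , drop = FALSE]
# model %>% fit(x_train, y_train, batch_size = 64, epochs = 5,
#               validation_data = list(x_val, y_val))
```

In practice n would be length(train_data_x), and the fixed seed keeps the split reproducible between runs.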
So the issue disappears when switching from validation_split to validation_data? That sounds like a bug. What is the version of keras (keras3:::keras_version())? If it's not 3.3.3, can you re-run keras3::install_keras() to get the latest?