Issue when running the Unsupervised Segmentation when fitting the NP Bayesian Models.
# Convert data to list by ID
dat.list <- dat.disc %>%
  df_to_list(., "id")
# Only retain id and discretized data streams
dat.list.sub <- map(dat.list, subset, select = c(id, SL, TA, Disp))
# Convert the id column to integer for each data frame in the list
dat.list.sub <- map(dat.list.sub, ~ .x %>% mutate(id = as.integer(id)))
# Check the structure to confirm the change
glimpse(dat.list.sub)
List of 3
$ 6009:'data.frame': 1174 obs. of 4 variables:
..$ id : int [1:1174] 6009 6009 6009 6009 6009 6009 6009 6009 6009 6009 ...
..$ SL : int [1:1174] 1 2 2 2 5 5 2 5 4 4 ...
..$ TA : int [1:1174] 3 3 3 2 2 2 4 3 3 3 ...
..$ Disp: int [1:1174] 1 1 1 1 1 2 2 2 2 1 ...
$ 6011:'data.frame': 1174 obs. of 4 variables:
..$ id : int [1:1174] 6011 6011 6011 6011 6011 6011 6011 6011 6011 6011 ...
..$ SL : int [1:1174] 2 1 3 3 5 2 1 2 2 1 ...
..$ TA : int [1:1174] 3 3 3 3 3 1 1 3 3 3 ...
..$ Disp: int [1:1174] 1 1 1 1 1 1 1 1 1 1 ...
$ 6018:'data.frame': 1175 obs. of 4 variables:
..$ id : int [1:1175] 6018 6018 6018 6018 6018 6018 6018 6018 6018 6018 ...
..$ SL : int [1:1175] 1 2 3 5 2 1 2 2 2 1 ...
..$ TA : int [1:1175] 3 3 3 3 3 1 4 3 4 3 ...
..$ Disp: int [1:1175] 1 1 1 1 1 1 1 1 1 1 ...
# Run the segmentation model (unsupervised)
set.seed(123)
alpha <- 1 # hyperparameter for prior (Dirichlet) distribution
ngibbs <- 10000 # number of iterations for Gibbs sampler
nbins <- c(3, 6, 8) # define number of bins per data stream (SL, TA, Disp)
progressr::handlers(progressr::handler_progress(clear = FALSE)) # to initialize progress bar
# Set the number of workers to a minimum of 1
n_workers <- max(1, availableCores() - 2)
future::plan(multisession, workers = n_workers)
dat.res.seg1 <- segment_behavior(data = dat.list.sub, ngibbs = ngibbs, nbins = nbins, alpha = alpha)
\ [............................................................................] 0% :message
Progress interrupted by FutureError condition: MultisessionFuture () failed to receive message results from cluster RichSOCKnode #1 (PID 14356 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 16 globals exported is 112.61 KiB. The three largest globals are ‘...furrr_fn’ (68.24 KiB of class ‘function’), ‘...furrr_chunk_args’ (18.62 KiB of class ‘list’) and ‘behav_gibbs_sampler’ (11.89 KiB of class ‘function’)
Error in unserialize(node$con) :
MultisessionFuture () failed to receive message results from cluster RichSOCKnode #1 (PID 14356 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 16 globals exported is 112.61 KiB. The three largest globals are ‘...furrr_fn’ (68.24 KiB of class ‘function’), ‘...furrr_chunk_args’ (18.62 KiB of class ‘list’) and ‘behav_gibbs_sampler’ (11.89 KiB of class ‘function’)
- [*******************************************************o............................] 67%
I tried the example code for running this model from the 04_Fit NP Bayesian models in bayesmove.R script and didn't receive an error. Since I'm not sure what dataset you're using and how you've discretized the data streams (since this doesn't appear to be the dataset from the example), I can't replicate your error.
Based on the error message you included, the only thing I can tell is that there's a problem when trying to run this model in parallel using the {furrr} and {future} packages. I would suggest running the model not in parallel (i.e., sequentially) to see if there's a specific issue you're having with the segment_behavior() function from the {bayesmove} package. It also may be worth checking that the exact example (linked above) from the repo works as expected on your computer.
It's also worth mentioning that none of the bins created when discretizing can be empty. So if you run apply(dat.disc[,c('SL','TA','Disp')], 2, table), make sure that none of these bins have 0 observations assigned. If any bin is empty, you'll need to reconfigure how you discretize that data stream.
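One caveat with the check above: table() on an integer column lists only the values that actually occur, so an empty bin shows up as a missing entry rather than a 0. A minimal sketch of a check that reports zero counts explicitly is below. The data frame `dat.toy` and the vector `nbins.toy` are made-up stand-ins for dat.disc and nbins, not values from this issue.

```r
# Toy stand-in for dat.disc; the values are invented for illustration
dat.toy <- data.frame(
  SL   = c(1, 2, 2, 3, 3, 3),
  TA   = c(1, 1, 2, 4, 4, 2),  # bin 3 is never used
  Disp = c(1, 2, 1, 2, 1, 2)
)

# Hypothetical number of bins per data stream, named by column
nbins.toy <- c(SL = 3, TA = 4, Disp = 2)

# tabulate() counts every bin from 1 to nbins, so unused bins appear as 0
# (values larger than nbins are silently ignored by tabulate())
bin.counts <- Map(
  function(x, k) tabulate(x, nbins = k),
  dat.toy[names(nbins.toy)],
  nbins.toy
)

# Names of the data streams that have at least one empty bin
empty.streams <- names(Filter(function(cnt) any(cnt == 0), bin.counts))
empty.streams
```

Any stream named in `empty.streams` would need its bin limits reconsidered before fitting the model.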
It turns out this is a memory allocation issue on my Mac: it runs low on memory because I am using a large dataset. Do you have any recommendations on how to run the code with a large dataset?
If it works on your computer, I would suggest just running the segmentation model on a single core (instead of in parallel). So, just running dat.res.seg1 <- segment_behavior(data = dat.list.sub, ngibbs = ngibbs, nbins = nbins, alpha = alpha) without future::plan(multisession, workers = n_workers) before it.
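If future::plan(multisession, ...) was already called earlier in the same R session, that plan stays active until it is changed, so simply skipping the line may not be enough. A minimal sketch of resetting to sequential processing is below; the requireNamespace() guard and the `workers.now` variable are only there so the snippet runs on its own.

```r
# Reset to sequential processing if the {future} package is installed
if (requireNamespace("future", quietly = TRUE)) {
  future::plan(future::sequential)
  # Under a sequential plan there is a single worker
  workers.now <- future::nbrOfWorkers()
} else {
  workers.now <- 1L
}
workers.now

# The model call itself stays unchanged (not run here):
# dat.res.seg1 <- segment_behavior(data = dat.list.sub, ngibbs = ngibbs,
#                                  nbins = nbins, alpha = alpha)
```

Restarting the R session would have the same effect as resetting the plan.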
If that still poses a problem, I would try splitting your dataset (dat.list.sub) up into chunks. So maybe start by defining one R object that contains the first half of that list and another that contains the second half. Each ID (or list element) is evaluated independently, so this won't affect the results.
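The chunking idea can be sketched in base R as follows. `toy.list` is an invented stand-in for dat.list.sub, and `chunk.size` is a hypothetical setting; the call to segment_behavior() is left as a comment because how the per-chunk results are best recombined is not covered in this thread.

```r
# Toy stand-in for dat.list.sub: a named list with one data frame per ID
toy.list <- list(
  `6009` = data.frame(id = 6009L, SL = c(1L, 2L), TA = c(3L, 3L), Disp = c(1L, 1L)),
  `6011` = data.frame(id = 6011L, SL = c(2L, 1L), TA = c(3L, 3L), Disp = c(1L, 1L)),
  `6018` = data.frame(id = 6018L, SL = c(1L, 2L), TA = c(3L, 3L), Disp = c(1L, 1L))
)

# Hypothetical number of IDs per chunk; 1 runs each ID on its own
chunk.size <- 2

# Assign each list element to a chunk, then split the list accordingly
chunk.id <- ceiling(seq_along(toy.list) / chunk.size)
chunks <- split(toy.list, chunk.id)

# Each chunk is itself a named list of data frames
lapply(chunks, names)

# Each chunk could then be passed to the model separately (not run here):
# res.by.chunk <- lapply(chunks, function(ch) {
#   segment_behavior(data = ch, ngibbs = ngibbs, nbins = nbins, alpha = alpha)
# })
```

Setting `chunk.size` to 1 also makes it easier to see which ID triggers an error.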
Hope that helps.
Thank you so much. I will give you feedback on whether it works.
I found the solution to the problem. It was not a memory issue but rather a mismatch in the number of bins that was causing it. Thank you for the help; I was only able to identify it when I reduced the chunk size to 1 ID.
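For anyone hitting the same error, a rough pre-flight check for this kind of mismatch is sketched below. In the output posted at the top, SL takes values up to 5 while nbins assigns 3 bins to the first data stream, which may be the sort of mismatch meant here. The helper `check_bins()` and the toy objects are hypothetical, and the sketch assumes nbins is named by data stream.

```r
# Toy stand-in for dat.list.sub; the values are invented
toy.list <- list(
  `6009` = data.frame(id = 6009L, SL = c(1L, 2L, 5L), TA = c(3L, 2L, 4L), Disp = c(1L, 2L, 1L)),
  `6011` = data.frame(id = 6011L, SL = c(2L, 1L, 3L), TA = c(3L, 1L, 3L), Disp = c(1L, 1L, 1L))
)

# Declared number of bins per data stream (named here for clarity)
nbins.toy <- c(SL = 3, TA = 6, Disp = 8)

# Hypothetical helper: return the streams whose largest observed bin
# index exceeds the declared number of bins
check_bins <- function(df, nbins) {
  obs.max <- vapply(
    df[names(nbins)],
    function(x) as.numeric(max(x, na.rm = TRUE)),
    numeric(1)
  )
  names(nbins)[obs.max > nbins]
}

# Run the check for every ID
bad.streams <- lapply(toy.list, check_bins, nbins = nbins.toy)
bad.streams
```

An ID with a non-empty entry in `bad.streams` has bin values that the declared nbins cannot accommodate.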
Glad you were able to figure out the problem and that it was simple to fix.