SugiharaLab/rEDM

Different sampling between legacy and current version?

Closed this issue · 4 comments

Maybe this was addressed in NEWS, but I came across different behavior in cross-mapping between v0.7.5 (found in ha0ye's repository) and the latest version (v1.9.x). From what I can understand in the code, the two versions seem to use different sampling algorithms when doing random sampling without replacement. This causes high variability in the confidence interval of the cross-mapping skill.

I'm using paramecium_didinium as an example:

library(rEDM)
library(dplyr)
library(ggplot2)

data <- paramecium_didinium
data <- data %>% 
  mutate(paramecium = (paramecium - mean(paramecium)) / sd(paramecium),
         didinium = (didinium - mean(didinium)) / sd(didinium))

Behavior with v1.9.x

ccm_1.9 <- CCM(dataFrame = data, E = 5, 
               columns = "didinium", target = "paramecium", 
               libSizes = "12 67 5", sample = 100,
               seed = 123, includeData = TRUE)

# median and confidence intervals
ccm_1.9 <- ccm_1.9$CCM1_PredictStat %>% 
    group_by(LibSize) %>%
    summarise(rho_0.025 = quantile(rho, probs = 0.025),
              rho_0.500 = quantile(rho, probs = 0.500),
              rho_0.975 = quantile(rho, probs = 0.975))

# Plot
ggplot(ccm_1.9, aes(x = LibSize)) +
    geom_ribbon(aes(ymin = rho_0.025, ymax = rho_0.975), alpha = 0.5) + 
    geom_line(aes(y = rho_0.500)) +
    coord_cartesian(ylim = c(0.3, 0.9))

Behavior with v0.7.5

ccm_0.7 <- ccm(block = data, E = 5, 
               lib_column = "didinium", target_column = "paramecium", 
               lib = c(1, 67), pred = c(1, 67),
               lib_sizes = seq(12, 67, by = 5), num_samples = 100,
               RNGseed = 123, silent = TRUE)

# median and confidence intervals
ccm_0.7 <- ccm_0.7 %>%
  rename(LibSize = lib_size) %>%
  group_by(LibSize) %>%
  summarise(rho_0.025 = quantile(rho, probs = 0.025),
            rho_0.500 = quantile(rho, probs = 0.500),
            rho_0.975 = quantile(rho, probs = 0.975))

# plot
ggplot(ccm_0.7, aes(x = LibSize)) +
    geom_ribbon(aes(ymin = rho_0.025, ymax = rho_0.975), alpha = 0.5) + 
    geom_line(aes(y = rho_0.500)) +
    coord_cartesian(ylim = c(0.3, 0.9))

I'm trying to apply this to a delayed CCM, but there the differences between the two versions are more pronounced. Do you have any suggestions/comments about the sampling method? A rough sketch of the delayed case I have in mind is below.
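(This is only a sketch of one way to scan the cross-map lag; it assumes the Tp argument of CCM() sets the lag and that the returned data frame holds the mean rho for didinium xmap paramecium in a "didinium:paramecium" column, so names may differ.)

library(rEDM)

data <- paramecium_didinium
data[, 2:3] <- apply(data[, 2:3], 2, scale)

# Scan the cross-map lag (Tp) and record the mean cross-map skill for
# didinium xmap paramecium at the largest sampled library size.
lags <- -4:4
rho_by_lag <- sapply(lags, function(tp) {
  out <- CCM(dataFrame = data, E = 5, Tp = tp,
             columns = "didinium", target = "paramecium",
             libSizes = "60 65 5", sample = 100, seed = 123)
  tail(out[, "didinium:paramecium"], 1)
})

plot(lags, rho_by_lag, type = "b",
     xlab = "cross-map lag (Tp)", ylab = "rho")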

Thanks!
Juan

Hi Juan,

Thank you for the note and example. Much appreciated.

Please help me understand your example. A few preliminary notes to get on the same page... ;)

  1. I don't use ggplot, so I can't recreate the plots. Not a big deal. (I use base-R for graphics... dinosaur style.)
  2. Just a comment that tidyverse is handy, but it presents problems in production code: it has a huge web of dependencies, and it changes significantly (in syntax and results) from version to version, partly because of that huge package/version dependency base. I therefore suggest base-R for simple examples, as it removes a complex variable from debugging.
  3. The example comments say "confidence intervals", but it isn't clear to me why a quantile can be considered a confidence bound.

Please see if this pedantic, base-R code illustrates your concern:

library(rEDM)

if ( is.null( dev.list() ) ) { dev.new() }
par( mfrow = c( 3, 4 ) )

data       = paramecium_didinium
data[,2:3] = apply( data[,2:3], 2, scale )

Issue54 = function( libSizes = "12 67 5" ) {
  x = CCM( dataFrame = data, E = 5, columns = "didinium", target = "paramecium",
           libSizes = libSizes, sample = 100, seed = 123, includeData = TRUE )
  
  libSizes = x $ LibMeans $ LibSize
  stats    = x $ CCM1_PredictStat

  L = list()
  for ( lib in libSizes ) {
    L[[as.character(lib)]] = stats[ which( stats $ LibSize == lib ), ]
  }
  
  for ( i in 1:length( L ) ) {
    plot( ecdf( L[[i]] $ rho ), xlim = c( 0.4, 0.9 ), xlab = 'rho',
          main = paste('libSize', libSizes[i]) )
  }
}

Attached are two plots from Issue54() and Issue54( libSizes = "10 65 5" )

These results look reasonable to me. Note that LibSize = 67 has a constant rho = 0.797. With replacement = FALSE (the default) and the full library "sampled", the predictions will be the same no matter how many times the library is sampled. At libSizes smaller than the full library one is generating a (presumably) less complete representation of the dynamics, with different sampled libraries providing a spectrum of predictions.
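A quick base-R illustration of that point, using sample() as a stand-in for the library draw (cppEDM's actual sampler is different, but the logic is the same):

# Sampling without replacement: once the requested library size equals the
# full library (67 rows), every "random" draw is the entire library, so the
# predictions -- and rho -- are identical across samples.
set.seed( 123 )
lib_full_1 = sort( sample( 1:67, 67, replace = FALSE ) )
lib_full_2 = sort( sample( 1:67, 67, replace = FALSE ) )
identical( lib_full_1, lib_full_2 )   # TRUE

# At smaller library sizes the draws differ, so each sampled library gives a
# slightly different reconstruction and a spectrum of predictions.
lib_20_1 = sort( sample( 1:67, 20, replace = FALSE ) )
lib_20_2 = sort( sample( 1:67, 20, replace = FALSE ) )
identical( lib_20_1, lib_20_2 )       # almost surely FALSE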

[Attached plots: rEDM-Issue54-LibSize65, rEDM-Issue54-libSIze-67]

Thank you for your reply. I noticed there was a change in the default value of the replacement argument between the two versions, so I was doing CCM with replacement in the old version, which gave me a different result 😅

However, I still see some small differences in the results, though maybe they are not that important (sorry if this turns out to be a not-so-relevant issue!). I agree that at LibSize = 67 there should be no variance, and in both versions I obtain the same rho in that case. For other values of LibSize, however, the small difference is there. Note that I'm using the same seed, so the sampling should be taking the same instances.

Here is the code (in base R):

library(rEDM)

data <- paramecium_didinium
data[, 2:3] <- apply(data[, 2:3], 2, scale)


# Run at version 1.2.3 or older
ccm_123 <- ccm(
  block = data, E = 5, lib_sizes = seq(12, 67, by = 5),
  num_samples = 100, replace = FALSE,
  lib_column = "didinium", target_column = "paramecium",
  RNGseed = 123
)

# Run at version 1.9.1 or newer
ccm_191 <- CCM(
  dataFrame = data, E = 5, libSizes = "12 67 5",
  sample = 100,
  columns = "didinium", target = "paramecium",
  seed = 123, includeData = TRUE
)
ccm_191 <- ccm_191$CCM1_PredictStat

The plots and quantiles are here:

par(mfrow = c(3, 2))
# testing some lib sizes
for (lib in c(22, 47, 62)) {
  stats_123 <- ccm_123[which(ccm_123$lib_size == lib), ]
  stats_191 <- ccm_191[which(ccm_191$LibSize == lib), ]

  plot(ecdf(stats_123$rho), xlim = c(0.5, 0.9), xlab = "rho", main = paste0("package <1.2.3\nlibsize ", lib))
  plot(ecdf(stats_191$rho), xlim = c(0.5, 0.9), xlab = "rho", main = paste0("package 1.9.1\nlibsize ", lib))

  cat("package <1.2.3\tlibsize ", lib, "\n")
  print(quantile(stats_123$rho, probs = c(0.025, 0.5, 0.975)))
  cat("package 1.9.1\tlibsize ", lib, "\n")
  print(quantile(stats_191$rho, probs = c(0.025, 0.5, 0.975)))
}

[Attached plot: issue54]

One extra question: what would be the advantage of sampling with replacement in CCM?

Thank you!
Juan

Hi Juan! To respond to the methodological question, "what would be the advantage of having sampling with replacement in CCM?": I recommend against using sampling with replacement for new CCM studies. However, the feature is still included so that previous results can be reproduced, as it was used in at least one early CCM paper. I think it was a case of logic-by-analogy to bootstrap resampling for estimating statistical distributions, but it does not have a mathematical purpose in nearest-neighbor prediction.
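If you want to see the effect directly, here is a minimal sketch (it assumes the replacement argument of the current CCM() and that LibSize = 42 is among the sampled sizes):

library(rEDM)

data       = paramecium_didinium
data[,2:3] = apply( data[,2:3], 2, scale )

RhoSpread = function( with_replacement ) {
  x = CCM( dataFrame = data, E = 5, columns = "didinium", target = "paramecium",
           libSizes = "12 67 5", sample = 100, seed = 123,
           replacement = with_replacement, includeData = TRUE )
  stats = x $ CCM1_PredictStat
  # distribution of rho over the 100 sampled libraries at LibSize = 42
  quantile( stats[ which( stats $ LibSize == 42 ), 'rho' ],
            probs = c( 0.025, 0.5, 0.975 ) )
}

RhoSpread( FALSE )  # sampling without replacement (the default)
RhoSpread( TRUE )   # bootstrap-style sampling; typically a wider rho spread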

Gentlemen: thank you for the clarifications!

One can still expect small (?) differences even if the seed = x is identical. The reason is that a static seed only guarantees the same starting point for the pseudo-random generator; the sequence it produces also depends on the generator itself. The current release implements the sampling in the cppEDM code using the C++ default_random_engine and a C++ set object, which is different from the implementation in the rEDM 0.7 code, which uses an old algorithm proposed by Knuth.
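A small R-only illustration of the idea (using R's own generators rather than the package's C++ code): the same seed fed to two different generator kinds yields different draws.

# Same seed, different generator kind: the sampled indices differ.
set.seed( 123, kind = "Mersenne-Twister" )
sample( 1:67, 12 )

set.seed( 123, kind = "Knuth-TAOCP-2002" )
sample( 1:67, 12 )

RNGkind( "default" )   # restore R's default generator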

One can also note that even if the same code is compiled with different compilers, the generated sequences can differ. Each compiler only has to adhere to the C++ standard, which does not specify how the generators are implemented, only minimal requirements for their behavior. As R releases evolve, so do the compilers that R and its packages are built with, and they are certainly different across OS's and versions.

Thanks again for your clear and careful examination of this issue +1.