Some questions about large sample sizes

Question

Closed this issue a year ago · 2 comments

Greetings,
My data consist of several parts with sample sizes ranging from 2,000 to more than 10,000, and each part has 50+ nodes. My goal is to estimate a separate network for each part and then compare the differences in their network structures.

After considering Mr Epskamp's valuable earlier advice and reading the papers (in this issue), I have refined my method. I now pass specific arguments to the 'estimateNetwork' function instead of calling 'ggmModSelect' directly. I have also moved away from thresholding and switched from 'cor_auto' to Pearson correlations (which seem to be the default in 'estimateNetwork').

Below is the code implementation:

Stepwise_part1 <- estimateNetwork(data1, default = "ggmModSelect", stepwise = TRUE)
Stepwise_part2 <- estimateNetwork(data2, default = "ggmModSelect", stepwise = TRUE)
Stepwise_part3 <- estimateNetwork(data3, default = "ggmModSelect", stepwise = TRUE)

NCT(Stepwise_part1$graph,Stepwise_part2$graph,it=10,test.centrality=TRUE)
NCT(Stepwise_part1$graph,Stepwise_part3$graph,it=10,test.centrality=TRUE)
NCT(Stepwise_part2$graph,Stepwise_part3$graph,it=10,test.centrality=TRUE)

Please help me with these questions:

Is this code appropriate? The results from 'ggmModSelect' appear to remain stable even for sample sizes above 2,000.
Is it advisable to use a method such as CEM to reduce the sample sizes so that all parts are equal before running the Network Comparison Test (NCT)? Or is it reasonable to directly compare network structures estimated from parts with different sample sizes?
partX <- estimateNetwork(dataX, default = "ggmModSelect", stepwise = FALSE)
Given the large number of nodes and samples, and the long computation time of the NCT for stepwise networks, is the stepwise procedure necessary?

Profoundly grateful for your kind consideration of these questions!

Answer 1 · 2023-08-22T02:04:41.000Z

Hi! The choice of estimator depends on your research question. In general, if you are interested in individual edges, a non-regularized estimator like ggmModSelect is a good choice, but given the large number of nodes here you could still opt for EBICglasso. The NCT code is not correct: you have to pass the estimateNetwork objects directly to NCT, e.g., NCT(Stepwise_part1, Stepwise_part2). Note that this will be quite slow with stepwise estimation and this many nodes.

Yes ggmModSelect could be a good choice here, but it will be very slow.
I do not recommend reducing sample size. The NCT can handle non-equal sample sizes.
Stepwise greatly helps, but it isn't necessary per se. You can also opt to use default = "pcor" with threshold = TRUE for significance thresholding here.
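
To make the NCT correction concrete, here is a minimal sketch rather than the actual analysis. It assumes the bootnet and NetworkComparisonTest packages are installed. The simulated data frames and the helper sim_data are hypothetical stand-ins for data1 and data2, and EBICglasso with it = 50 is used only to keep the run short; a real analysis would use the chosen estimator and many more permutations.

```r
library("bootnet")
library("NetworkComparisonTest")

set.seed(1)

# Hypothetical helper: simulate n cases on p variables, with one induced
# dependency between the first two variables.
sim_data <- function(n, p = 5) {
  x <- matrix(rnorm(n * p), n, p)
  x[, 2] <- x[, 2] + 0.5 * x[, 1]
  as.data.frame(x)
}

# Stand-ins for two parts with unequal sample sizes (assumption, not real data)
data1 <- sim_data(300)
data2 <- sim_data(500)

# Estimate one network per part
net1 <- estimateNetwork(data1, default = "EBICglasso")
net2 <- estimateNetwork(data2, default = "EBICglasso")

# Pass the estimateNetwork objects themselves, not net1$graph / net2$graph,
# so that NCT can re-estimate the networks in every permutation
res <- NCT(net1, net2, it = 50, test.centrality = TRUE, progressbar = FALSE)
summary(res)
```

The two data sets deliberately differ in size, in line with the point that the NCT can handle unequal sample sizes.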

Another option is to use psychonetrics for a homogeneity test. See below for some example code.


# Load psychonetrics and dplyr:
library("psychonetrics")
library("dplyr")

# Extract the variable names:
vars <- names(data1)

# Add part variable:
data1$part <- 1
data2$part <- 2
data3$part <- 3

# Combine datasets:
data_combined <- bind_rows(data1, data2, data3)

# Form multi-group model:
mod <- ggm(data_combined, vars = vars, groups = "part")

# Estimate saturated model (all edges included):
mod_saturated <- mod %>% runmodel

# Estimate model with saturated networks set equal (equivalent to testing if correlations are equal):
mod_saturated_equal <- mod_saturated %>% groupequal("omega") %>% runmodel

# Compare the models:
compare(
    saturated_not_equal = mod_saturated,
    saturated_equal = mod_saturated_equal
)
# The model with lower AIC/BIC is preferred. 

# You can also compare sparse models. With this number of nodes and sample size, I'd recommend to only use pruning. See Network Psychometrics with R chapter 7:
mod_sparse <- mod_saturated %>% prune(alpha = 0.01) %>% runmodel

# Now to test for equality, we need to compute a "union" model that includes all edges that are present in at least one sample:
mod_sparse_union <- mod_sparse %>% unionmodel %>% runmodel

# Now we can set the edges equal:
mod_sparse_union_equal <- mod_sparse_union %>% groupequal("omega") %>% runmodel

# and compare the models:
compare(
    sparse_union_not_equal = mod_sparse_union,
    sparse_union_equal = mod_sparse_union_equal
)
# If the equal model is not preferred, then select the mod_sparse model (not the union model, as it includes some non-significant edges).

# Extract the networks:
getmatrix(mod_sparse, "omega") # <- replace 'mod_sparse' with preferred model

Answer 2 · 2023-08-22T02:23:55.000Z

Oh, I forgot to update the code; in the latest version I pass only the estimateNetwork objects to NCT.
I want to include as many edges as possible to see the connections between nodes, so ggm might be a good approach.
Thank you very much for your answers on NCT sample sizes and the homogeneity test. I will study these methods carefully.

Thanks again for all your help!!!