uqrmaie1/admixtools

running many replicates of find_graphs?

laufran opened this issue · 1 comments

Hi there,

In the older version of find_graphs, find_graphs_old, I see there's an argument numrep that allows for multiple runs of the same parameter arguments through one function call. Just to confirm, there's no such option for the current version of find_graphs / qpgraph? If so, how would you all recommend running many replicates? And how many replicates would you recommend, given the findings in Maier et al. 2023 that "models fitting the data as well as or better than the true one are common, and their topological diversity is in most cases so high that it precludes consensus inference of topology by analysis of multiple topologies"?

Best,
Lauren

Yes, that is correct, there is no equivalent of numrep in find_graphs(). It seems a bit clearer to me to explicitly call the function multiple times than to use an argument for that, and it's not difficult to do. Here are two ways to do that:

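numrep = 3
reslist = vector('list', numrep)  # one slot per replicate
for(i in seq_len(numrep)) {
  # '...' stands for whatever other arguments you pass to find_graphs()
  reslist[[i]] = find_graphs(f2_blocks, ...)
}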

The code above uses only standard R syntax and functions, and it stores the results in reslist as a list of data frames, one per replicate. Alternatively, with the purrr and dplyr packages, you could do it like this:

library(purrr)  # map()
library(dplyr)  # bind_rows(), %>%, slice_min()
numrep = 3
res = map(seq_len(numrep), ~find_graphs(f2_blocks, ...)) %>% bind_rows(.id='rep')

This will give you a single data frame where the replicate number is indicated by the column rep.
You could then get the graph with the lowest score in each replicate like this:

res %>% slice_min(score, by = rep)

If you model a complex graph or let it run for many generations, each replicate could take a while to run. In that case it might be better to parallelize across replicates, for example by submitting one job per replicate on a compute cluster, or using the furrr or doParallel R packages.
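As a rough illustration of the furrr route, here is a minimal sketch. It assumes furrr and dplyr are installed, and it uses a hypothetical stand-in function, run_replicate(), so that it runs without admixtools or any data. In a real analysis the body of that function would be your own find_graphs(f2_blocks, ...) call. The worker count of 2 is arbitrary.

```r
library(furrr)   # also attaches the future package, which provides plan()
library(dplyr)

# Hypothetical stand-in (not part of admixtools) so the sketch is runnable.
# In practice this would call find_graphs(f2_blocks, ...) and return its result.
run_replicate = function(i) {
  data.frame(generation = 1:5, score = runif(5, min = 0, max = 100))
}

numrep = 4
plan(multisession, workers = 2)  # run replicates in two background R sessions

res_par = future_map(
  seq_len(numrep),
  run_replicate,
  .options = furrr_options(seed = TRUE)  # parallel-safe random numbers
) %>%
  bind_rows(.id = 'rep')

plan(sequential)  # release the background sessions
```

As in the purrr version, res_par is a single data frame with the replicate number in the column rep, so the slice_min() call shown earlier should work on it unchanged.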

And how many replicates would you recommend?

It depends on a few factors. Initially I would start with a small handful of replicates that you can inspect manually to get a feel for the results. Later on, you might want to increase the number of replicates, depending on what you see:

  • One possible outcome is that almost all replicates converge on the same best graph. In that case, there is little point in running many more replicates.
  • Another possible outcome is that in every replicate you get a very different best graph, and all these best graphs have similarly low scores. In that case it also doesn't make sense to run it many times, since the results suggest that there isn't enough data to get anything meaningful, and you might want to reduce the complexity of the fitted model (reduce the number of admixture events).
  • One outcome where it might make sense to run more replicates is if some of the replicates converge on a very good graph, but most replicates get stuck in a local optimum where the best graphs have poor scores. In that case, you might only find the global optimum graph after running many replicates.
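One way to tell these three outcomes apart is to tabulate the best graph of each replicate. The sketch below uses a small made-up data frame, res_toy, in place of real output. It assumes the combined results have a score column and some column that identifies the graph topology, called hash here. Both the toy values and the hash column name are assumptions for illustration, so check the column names of your own results.

```r
library(dplyr)

# Made-up stand-in for the combined output of four replicates.
# 'hash' is assumed to be an identifier of the graph topology.
res_toy = tibble(
  rep   = c('1', '1', '2', '2', '3', '3', '4', '4'),
  score = c(12.1, 3.4, 3.4, 40.2, 55.0, 61.3, 3.4, 9.9),
  hash  = c('a', 'b', 'b', 'c', 'd', 'e', 'b', 'f')
)

# Best-scoring graph of each replicate (one row per replicate)
best = res_toy %>% slice_min(score, by = rep, with_ties = FALSE)

# How many replicates ended on each best graph, and with what score
best_counts = best %>% count(hash, score, sort = TRUE)
print(best_counts)

# Number of distinct best graphs and the spread of their scores
n_best = n_distinct(best$hash)
score_range = range(best$score)
```

In this toy example, three of the four replicates end on the same low-scoring graph and one ends on a much worse graph, which resembles the third outcome. A single hash across nearly all replicates would correspond to the first outcome, and many different hashes with similar scores to the second.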

Hope this helps!