dsilvestro/PyRate

Using -plot on PyRateMBD without having to run the MBD analysis?

bmchorse opened this issue · 6 comments

I'd like to run PyRateMBD in a variety of places (e.g. doing linear and exponential model, for a handful of different subclades, etc) and then come back later to run the -plot flag.

However, when you run the command python PyRateMBD.py -plot path_to_MBD_logfile.log, you get a ValueError from the attempted np.loadtxt() call, because PyRateMBD tries to read in a dataset on line 64. If you provide the dataset, you must also provide an MBD predictors folder, and if you provide both, it will simply run a fresh MBD analysis.

Am I missing a way to come back later and run the plot flag on PyRateMBD output? If not, I have some ideas. It seems that PyRate.py gets around the plot vs. analysis issue by 1) having almost everything abstracted out into functions and 2) checking whether any arguments that require an input file are non-empty and then using a try/except.

Here are my suggestions for a fairly simple way to do this:

  1. Move the MBD modeling functionality to a function, run_MBD() or similar.
  2. If neither -d nor -plot is provided, warn the user and exit.
  3. If -d is provided, run the modeling function as normal. If not, we should be able to skip that section without penalty.
  4. If -plot is provided, run the plotting code. If not, no plot is attempted. (already implemented in code as written with an if-statement)
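To make the four steps above concrete, here is a minimal sketch of the control flow I have in mind. It is not PyRateMBD's actual code: run_MBD() and plot_MBD() are hypothetical placeholders for the existing modeling and plotting sections, and only the -d and -plot flags are taken from the script as I understand it.

```python
import argparse


def run_MBD(data_file):
    # Hypothetical stand-in for the existing modeling section of the script.
    return "model:" + data_file


def plot_MBD(log_file):
    # Hypothetical stand-in for the existing plotting section of the script.
    return "plot:" + log_file


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", default="", help="input data file")
    parser.add_argument("-plot", default="", help="MBD log file to plot")
    args = parser.parse_args(argv)

    # Step 2: with neither flag there is nothing to do, so warn and exit.
    if not args.d and not args.plot:
        parser.error("provide -d (input data), -plot (log file), or both")

    done = []
    # Step 3: run the model only when input data is given.
    if args.d:
        done.append(run_MBD(args.d))
    # Step 4: plot only when a log file is given.
    if args.plot:
        done.append(plot_MBD(args.plot))
    return done
```

In this sketch, main(["-plot", "mbd.log"]) would reach only the plotting branch, which is the behavior I am after. Whether the plotting code can actually run without the variables produced by the modeling step is a separate question.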

If you're interested I am happy to draft this out and make a PR.

Hmm, I see that the plotting stuff is reading from local variables that come from modeling. Are these things we could read in easily from the logfile, or no?

The plotting function does require you to provide the input data (ts/te table) and the path to all the variables. That is because the information in the logfile is not sufficient to plot the marginal rates, which are a function of all curves including (by default) the clade's own diversity trajectory. Note that you also need to specify the model (exp or linear) in order to get correct rates through time. I'm not sure this answers your questions... (Apologies for the lack of detailed documentation for this model - hopefully it'll come soon)

I should add that computing the marginal rates can be quite computationally intense and take several minutes (even hours for large data sets).

Ah, that makes sense. I think that answers my questions! So the correct steps then would be:

  • Run MBD with input data, specify model, etc
  • After this, run MBD with -plot but again providing input data and model etc.

Is that correct? If so, do you have a general suggestion for number of iterations for each one? I see the default is 1 million. So you would run the first step for 1 million and then the second for 1 million?

For the actual analysis, 1 M iterations is probably too few (10 M or 100 M are more likely to be necessary, but that depends on the data set as well). For plotting, the number of iterations does not matter because only the input data and the logfile are used and the program quits after doing the plots (i.e. it doesn't actually run an MCMC).

Got it, thanks!