21cmfast/21CMMC

MultiNest Scalability Test Results


Here are some scaling results for the MultiNest sampler, obtained using a suite of very small simulations.

HPC: Bridges-2 RM (PSC, XSEDE)

21cmfast version: 21cmfast/21cmFAST@07b99e5

21cmmc version: fbb18e4

Shared materials: sample script, full results (including MultiNest outputs), notebooks, and posteriors

Hyper Parameters

user_params = {'HII_DIM':50, 
        'BOX_LEN':100.0, 
        "USE_FFTW_WISDOM": True,
        "HMF": 1,
        "USE_RELATIVE_VELOCITIES": False,
        "POWER_SPECTRUM": 0,
        "N_THREADS": 1,
        "PERTURB_ON_HIGH_RES": False,
        "NO_RNG": False,
        "USE_INTERPOLATION_TABLES": True,
        "FAST_FCOLL_TABLES": False}


cosmo_params = {'SIGMA_8': 0.8118, 
        'hlittle': 0.6688, 
        'OMm': 0.321, 
        'OMb':0.04952, 
        'POWER_INDEX':0.9626}

flag_options = {"USE_HALO_FIELD": False,
        "USE_MINI_HALOS": False,
        "USE_MASS_DEPENDENT_ZETA": True,
        "SUBCELL_RSD": False,
        "INHOMO_RECO": False,
        "USE_TS_FLUCT": False,
        "M_MIN_in_Mass": False,
        "PHOTON_CONS": False,
    }

global_params = {'Z_HEAT_MAX': 15.0, 'ZPRIME_STEP_FACTOR': 1.2}

Likelihoods

 LikelihoodPlanck()
 LikelihoodNeutralFraction()
 LikelihoodLuminosityFunction(name='lfz6', simulate = False)
 LikelihoodLuminosityFunction(name='lfz7', simulate = False)
 LikelihoodLuminosityFunction(name='lfz8', simulate = False)
 LikelihoodLuminosityFunction(name='lfz10', simulate = False)

Model Parameters

F_STAR10
F_ESC10
ALPHA_ESC           (3param runs include the parameters above this line)
ALPHA_STAR          (4param runs include the parameters above this line)
M_TURN              (5param runs include the parameters above this line)
t_STAR              (6param runs include the parameters above this line)
L_X                 (7param runs include the parameters above this line)
NU_X_THRESH         (8param runs include the parameters above this line)
X_RAY_SPEC_INDEX    (9param runs include the parameters above this line)
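
For reference, here is a minimal sketch of how the MultiNest settings varied below (n_live_points, sampling_efficiency, evidence_tolerance, importance_nested_sampling, multimodal) enter a pymultinest.run call. The prior_transform and log_likelihood functions are placeholders standing in for the 21CMMC priors and forward model; the actual run script is in the shared materials.

import pymultinest

n_params = 9  # F_STAR10 ... X_RAY_SPEC_INDEX for the 9param runs

def prior_transform(cube, ndim, nparams):
    # Placeholder: map the MultiNest unit hypercube onto the flat prior
    # ranges in place, e.g. cube[i] = lo + cube[i] * (hi - lo).
    pass

def log_likelihood(cube, ndim, nparams):
    # Placeholder: run the 21cmFAST forward model and sum the likelihoods.
    return 0.0

pymultinest.run(
    log_likelihood, prior_transform, n_params,
    n_live_points=1000,               # varied: 100 vs 1000
    sampling_efficiency=0.3,          # varied: 0.3 vs 0.8
    evidence_tolerance=0.5,           # varied in the evidence_tolerance runs
    importance_nested_sampling=True,  # toggled in the INS runs
    multimodal=False,                 # toggled in the multimodal runs
    outputfiles_basename="chains/9param_1000_1_0.3_0.5_0_128-",
    resume=True,
    verbose=True,
)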

Results

scaling with the number of CPUs

P.S. <=128-core jobs run on a single node, while 256-core runs are distributed over 2 nodes with 128 cores per node.
[figure: scaling with the number of cores; x-axis (label missing in the plot): number of cores (ncpu)]

  • wall-clock ("human") hours decrease with an increasing number of cores (ncpu)
  • however, the number of models finished per CPU hour decreases with ncpu (mostly due to waiting time, I believe); see the sketch after this list
  • the number of models in total and in the posterior remains (almost) constant with ncpu
  • the evidence and its uncertainty also remain unchanged with ncpu
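
For clarity, a rough sketch of how these plotted quantities relate to each other (made-up example numbers; the exact bookkeeping in the plotting scripts may differ):

# "human hours" = wall-clock time of the job; #model = Total Likelihood
# Evaluations reported by MultiNest. Example values only.
n_models = 40000
wall_hours = 5.0
ncpu = 128

cpu_hours = wall_hours * ncpu
models_per_cpu_hour = n_models / cpu_hours
print(cpu_hours, models_per_cpu_hour)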

sampling_efficiency

[figure: scaling with sampling_efficiency]

  • increasing sampling_efficiency leads to a larger fraction of models being accepted into the posterior
  • because of that, convergence seems to be reached faster, so fewer models are run; I don't know why the number of models per CPU hour decreases, though...
  • the evidence becomes less accurate (compared to the other models shown below with sampling_efficiency=0.3) when increasing sampling_efficiency; that is also why the MultiNest developers suggest a lower sampling_efficiency (0.3) for evaluating the evidence and a higher one (0.8) for simply getting a good posterior

evidence_tolerance

[figure: scaling with evidence_tolerance]

  • reducing evidence_tolerance leads to a few more models being run (so the posterior can be better sampled)
  • the number of models per CPU hour remains very nearly constant
  • the fraction of models accepted into the posterior decreases a bit
  • for these models, the evidence actually remains unchanged with reduced evidence_tolerance

n_live_points

[figure: scaling with n_live_points]

  • with more live points, more models are performed -> smoother posterior
  • more CPU hours are needed, but the number of models per CPU hour also seems to increase, and the scalability against ncpu improves with more live points
  • the evidence uncertainty is massively reduced (though not in direct proportion to the increased number of live points); see the note after this list
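
For context on the last point: in nested sampling, the estimated evidence uncertainty scales roughly as

\sigma_{\ln Z} \approx \sqrt{H / n_{\rm live}}

where H is the information (prior-to-posterior KL divergence), so going from 100 to 1000 live points should shrink the reported uncertainty by roughly a factor of sqrt(10) ≈ 3, not by a factor of 10.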

scaling with the number of model parameters

[figure: scaling with the number of model parameters]

  • in general, more complex models require more models to be run and take longer, but the timing difference between 3param and 6param is less than a factor of 2
  • the number of models per CPU hour remains mostly unchanged
  • while the number of models in total and in the posterior both increase, the fraction of models accepted decreases
  • the evidence also decreases (since the prior volume increases)
  • when the extra parameters are not constrained, adding more parameters does not lead to more models being performed

importance_nested_sampling and multimodal

P.S. the models above, and the 9param models in this section, use importance_nested_sampling=True (insON) and multimodal=False (multimodalNO).
[figures: comparisons toggling importance_nested_sampling and multimodal]

  • these models mostly have only one mode, so enabling multimodal leads to no change
  • with no INS, it seems that the number of models in the posterior only increases when sampling_efficiency=0.8, while for 0.3 it remains the same...
  • but in most cases, disabling INS leads to a decreased fraction of models in the posterior and therefore requires more models to be run; by definition, INS re-uses more models from previous iterations

posteriors

[figure: parameter posteriors for 9param_1000_1_0.3_0.5_0_128]
I'm only showing the parameter posteriors for 9param_1000_1_0.3_0.5_0_128 here; other posteriors can be found in the Google Drive.
The numbers in the filename correspond to the number of parameters (9), n_live_points (1000), importance_nested_sampling (1 = True), sampling_efficiency (0.3), evidence_tolerance (0.5), multimodal (0 = False), and ncpu (128).
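
For anyone re-making the corner plots from the shared MultiNest outputs, something along these lines should work with pymultinest (a sketch; the basename below just mirrors the run-name convention and is not the exact path in the Google Drive):

import pymultinest

n_params = 9
a = pymultinest.Analyzer(n_params=n_params,
                         outputfiles_basename="chains/9param_1000_1_0.3_0.5_0_128-")

# Global evidence and its uncertainty, as reported by MultiNest.
stats = a.get_stats()
print(stats["global evidence"], stats["global evidence error"])

# Equal-weighted posterior samples: n_params columns of parameters
# plus the log-likelihood in the last column.
samples = a.get_equal_weighted_posterior()[:, :n_params]
print(samples.shape)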

not sure where to put this...

Thanks @qyx268, this is awesome. It's a lot to digest, but it will be a good reference for the future. I noticed that the plot for the n_live_points section might not be the right plot?

It is the right plot; I just didn't use n_live_points as the horizontal axis since I only ran 100 and 1000. A fair comparison is between the first and last columns of that plot, where sampling_efficiency and evidence_tolerance are the same.

But there is still budget left, so I guess I can do a suite with n_live_points = 300.

NOTE: the parameter posteriors have a WRONG log_10(Mturn/Msol) range. It should be 8 to 10 rather than 8 to 12!

We found a discrepancy in the number of performed models when counting it with different methods.

First, MultiNest itself reports the number of Total Likelihood Evaluations, which is the value I quoted above for #model. Second, I wrote output scripts into the likelihood functions to write the forward-modeled properties of each model to disk, so we can also count the number of files and use that as #model (see the counting sketch below). When using the with-C version of 21CMMC (many times), I did see that counting files typically gave a 20%-50% higher #model than what MultiNest reported. I assumed this was due to waste when restarting, and therefore expected the scalability test shown above to still be accurate.

However, a student I'm working with, Changxiang Mao, has been re-doing the scalability test on CHPC as requested by the HPC support. This is the first time we have a complete run of this Python-only version of 21CMMC together with MultiNest. He found that counting files can give an orders-of-magnitude higher #model than what MultiNest reports, which makes the scalability test results above less credible.
[figures: #model from file counting vs. the MultiNest-reported Total Likelihood Evaluations]
P.S. CHPC has 24 cores on each node
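
For reference, the cross-check described above boils down to something like the sketch below; the directory layout and the log-file parsing are assumptions for illustration, not the actual scripts.

import glob
import re

# Assumed: the likelihood's output hook writes one file per evaluated model.
n_files = len(glob.glob("output/model_*.npz"))

# Assumed: MultiNest's stdout (which ends with a "Total Likelihood
# Evaluations: N" line) was captured to a log file.
with open("multinest_stdout.log") as f:
    match = re.search(r"Total Likelihood Evaluations:\s*(\d+)", f.read())
n_reported = int(match.group(1)) if match else None

print("files on disk:", n_files, "| MultiNest reported:", n_reported)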

NEW SCALABILITY TEST RESULTS (BETTER RESULTS)

We suspected the poor scaling found above was due to the choice of a very simple model, so that the balance between communication (for MultiNest sampling) and computation (21cmFAST + likelihood evaluation) was poor. Note the ~400 models per CPU shown in the 32-core runs, whereas a typical model we use for a real inference costs 10-100x more CPU hours. However, as shown above, it could also be because I was counting #model incorrectly. I read https://github.com/JohannesBuchner/MultiNest/blob/master/src/nested.F90 a bit, but couldn't find any reason for the discrepancy...

Changxiang's scalability test uses a model of similar complexity to the one above: his has more snapshots (65 vs 4) but poorer resolution (HII_DIM=10 vs 50). However, he found much improved (strong) scaling, suggesting the poor scaling shown above was likely due to incorrect model counting rather than model complexity.


HPC: NCSIS CHPC (large computing queue)
21cmfast version: py21cmfast.version=3.1.3
21cmmc version: py21cmmc.version=1.0.0dev3
posteriors: https://drive.google.com/file/d/186cIpIyf-nV-Zx89-IJwskZ65lF6tlob/view?usp=sharing

Hyper parameters:

user_params = {'HII_DIM':10, 
        'BOX_LEN':100.0, 
        "USE_FFTW_WISDOM": False,
        "HMF": 1,
        "USE_RELATIVE_VELOCITIES": False,
        "POWER_SPECTRUM": 0,
        "N_THREADS": 1,
        "PERTURB_ON_HIGH_RES": False,
        "NO_RNG": False,
        "USE_INTERPOLATION_TABLES": True,
        "FAST_FCOLL_TABLES": False}

cosmo_params = {'SIGMA_8': 0.8118, 
        'hlittle': 0.6688, 
        'OMm': 0.321, 
        'OMb':0.04952, 
        'POWER_INDEX':0.9626}

flag_options = {"USE_HALO_FIELD": False,
        "USE_MINI_HALOS": False,
        "USE_MASS_DEPENDENT_ZETA": True,
        "SUBCELL_RSD": False,
        "INHOMO_RECO": True,
        "USE_TS_FLUCT": False,
        "M_MIN_in_Mass": False,
        "PHOTON_CONS": False,
    }

global_params = {'Z_HEAT_MAX': 20.0, 'ZPRIME_STEP_FACTOR': 1.02,'Pop3_ion':1.}

Likelihood:

LikelihoodNeutralFraction(redshift=5.2,xHI=0.01,xHI_sigma=0.01),
LikelihoodLuminosityFunction(name='lfz6', simulate = False, ),
LikelihoodLuminosityFunction(name='lfz7', simulate = False, ),
LikelihoodLuminosityFunction(name='lfz8', simulate = False, ),
LikelihoodLuminosityFunction(name='lfz10', simulate = False, ),
LikelihoodPlanckPowerSpectra(name_lkl = 'Planck_lowl_EE')

Model parameters:

F_STAR10 = [-1.3, -3., 0., 1.0],
ALPHA_STAR = [0.5, -0.5, 1.0, 1.0],
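
If these four-element lists follow the usual 21CMMC convention of [fiducial, min, max, width] (my reading; worth double-checking), the corresponding flat-prior transform for MultiNest would look roughly like:

# Sketch, assuming flat priors on [min, max] taken from the lists above.
PRIORS = {
    "F_STAR10":   (-3.0, 0.0),
    "ALPHA_STAR": (-0.5, 1.0),
}

def prior_transform(cube, ndim, nparams):
    # Map the MultiNest unit hypercube onto the prior ranges, in place.
    for i, (lo, hi) in enumerate(PRIORS.values()):
        cube[i] = lo + cube[i] * (hi - lo)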

Scaling

We see that, with an increasing number of nodes, #model per CPU hour remains nearly constant and #model increases linearly, suggesting no computational waste even when using >2k cores (a throughput-efficiency sketch follows the figure below).
[figure: scaling with the number of nodes]
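
As a quick way of reading "no computational waste" off the figure, the throughput efficiency relative to the smallest run can be computed as below (illustrative numbers only, not the measured values):

# Hypothetical {n_nodes: (cpu_hours, n_models)} pairs.
runs = {1: (24, 9000), 2: (48, 18200), 4: (96, 36500)}

rates = {n: m / h for n, (h, m) in runs.items()}
base = rates[min(rates)]
for n_nodes, rate in sorted(rates.items()):
    print(f"{n_nodes} nodes: {rate:.0f} models per CPU hour "
          f"({100 * rate / base:.0f}% of the smallest run)")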

New results, with each node number tested 3 times; the black points and error bars show the min, mid, and max of the 3 runs.
[figure: scaling with the number of nodes, 3 runs per node count]

I'm closing the issue and am going to put this figure on the front page.