CobayaSampler/cobaya

*ERROR* Error when loading samples: The sum of logpriors in the sample is not consistent.

Closed this issue · 17 comments

I get this error when I try to resume a job. I was able to resume it at least once, but this second time it fails with the error above. I tried several times and got the same message each time. This is the content of the job.out file:

[0 : output] Found existing info files with the requested output prefix: 'results/ow0waCDM_all'
[0 : output] Let's try to resume/load.
[2 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[2 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[2 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[2 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[0 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[0 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[0 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[0 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[1 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[1 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[1 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[1 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[3 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[3 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[3 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[3 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[0 : output] Found an old sample. Resuming.
[0 : prior] *WARNING* External prior 'SZ' loaded. Mind that it might not be normalized!
[0 : camb] `camb` module loaded successfully from /global/cfs/cdirs/desicollab/users/adematti/perlmutter/cosmodesiconda/20221205-1.0.0/conda/lib/python3.10/site-packages/camb
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['fsigmar', 'DV_over_rd'].
[0 : planck_2018_highl_plik.ttteee] `clik` module loaded successfully from /global/cfs/cdirs/desicollab/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/code/planck/code/plc_3.0/plc-3.1/lib/python/site-packages/clik
[0 : planck_2018_lensing.clik] `clik` module loaded successfully from /global/cfs/cdirs/desicollab/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/code/planck/code/plc_3.0/plc-3.1/lib/python/site-packages/clik
[0 : mcmc] Resuming from previous sample!
[0 : prior] *WARNING* There are unbounded parameters (['A_planck', 'calib_100T', 'calib_217T', 'gal545_A_100', 'gal545_A_143', 'gal545_A_143_217', 'gal545_A_217', 'galf_TE_A_100', 'galf_TE_A_100_143', 'galf_TE_A_100_217', 'galf_TE_A_143', 'galf_TE_A_143_217', 'galf_TE_A_217', 'DES_DzL1', 'DES_DzL2', 'DES_DzL3', 'DES_DzL4', 'DES_DzL5', 'DES_DzS1', 'DES_DzS2', 'DES_DzS3', 'DES_DzS4', 'DES_m1', 'DES_m2', 'DES_m3', 'DES_m4']). Prior bounds are given at 0.9999995 confidence level. Beware of likelihood modes at the edge of the prior
[1 : samplecollection] Loaded 990 sample points from 'results/ow0waCDM_all.2.txt'
[2 : samplecollection] Loaded 1011 sample points from 'results/ow0waCDM_all.3.txt'
[0 : samplecollection] Loaded 1079 sample points from 'results/ow0waCDM_all.1.txt'
[3 : samplecollection] Loaded 1084 sample points from 'results/ow0waCDM_all.4.txt'
[0 : samplecollection] *ERROR* The sum of logpriors in the sample is not consistent.
[0 : samplecollection] *ERROR* Error when loading samples: The sum of logpriors in the sample is not consistent.
[1 : mcmc] Initial point: ombh2:0.02261121, omch2:0.1181356, H0:69.84661, logA:3.045463, ns:0.971925, omk:-0.0009965226, w:-0.9430333, wa:-0.4804295, tau:0.05782186, A_planck:1.001588, calib_100T:0.9993421, calib_217T:0.9989519, A_cib_217:51.14609, xi_sz_cib:0.3915068, A_sz:4.471394, ksz_norm:3.81948, gal545_A_100:7.050906, gal545_A_143:13.35773, gal545_A_143_217:18.55076, gal545_A_217:94.86781, ps_A_100_100:319.5084, ps_A_143_143:37.68866, ps_A_143_217:35.88573, ps_A_217_217:105.5367, galf_TE_A_100:0.128669, galf_TE_A_100_143:0.1368194, galf_TE_A_100_217:0.4279111, galf_TE_A_143:0.2070875, galf_TE_A_143_217:0.6186202, galf_TE_A_217:1.842039, DES_DzL1:0.004783368, DES_DzL2:-0.003013851, DES_DzL3:0.0008851392, DES_DzL4:0.004369828, DES_DzL5:0.003481381, DES_b1:1.477709, DES_b2:1.738489, DES_b3:1.620947, DES_b4:1.962905, DES_b5:2.061378, DES_DzS1:0.003615505, DES_DzS2:-0.02467024, DES_DzS3:0.02731843, DES_DzS4:-0.05860599, DES_m1:0.04670242, DES_m2:0.01681293, DES_m3:-0.003576742, DES_m4:0.01273669, DES_AIA:0.6885432, DES_alphaIA:-0.008803587
[2 : mcmc] Initial point: ombh2:0.02244094, omch2:0.1181104, H0:67.736, logA:3.040276, ns:0.9709289, omk:-0.0006007525, w:-0.7968705, wa:-0.7655362, tau:0.05525625, A_planck:0.9993413, calib_100T:0.9996387, calib_217T:0.9981446, A_cib_217:44.63784, xi_sz_cib:0.3801157, A_sz:6.045817, ksz_norm:5.675437, gal545_A_100:6.166859, gal545_A_143:10.48994, gal545_A_143_217:10.14862, gal545_A_217:76.86048, ps_A_100_100:239.4902, ps_A_143_143:31.59264, ps_A_143_217:40.76925, ps_A_217_217:121.607, galf_TE_A_100:0.1155986, galf_TE_A_100_143:0.1540269, galf_TE_A_100_217:0.544674, galf_TE_A_143:0.2837667, galf_TE_A_143_217:0.7849412, galf_TE_A_217:2.363021, DES_DzL1:0.003117714, DES_DzL2:0.002392154, DES_DzL3:0.002103641, DES_DzL4:-0.00591887, DES_DzL5:-0.008232313, DES_b1:1.440227, DES_b2:1.685149, DES_b3:1.630987, DES_b4:1.979471, DES_b5:2.105889, DES_DzS1:-0.004751653, DES_DzS2:-0.0317832, DES_DzS3:-0.0001454839, DES_DzS4:-0.03830876, DES_m1:0.003314337, DES_m2:-0.005635238, DES_m3:-0.02677006, DES_m4:0.02435357, DES_AIA:0.521304, DES_alphaIA:-1.325487
[3 : mcmc] Initial point: ombh2:0.02253404, omch2:0.1177752, H0:66.68511, logA:3.058207, ns:0.9679595, omk:-0.001886929, w:-0.8575862, wa:-0.4354515, tau:0.06036522, A_planck:1.003104, calib_100T:0.9999663, calib_217T:0.9988349, A_cib_217:51.09076, xi_sz_cib:0.3083462, A_sz:3.599204, ksz_norm:7.452705, gal545_A_100:7.437093, gal545_A_143:12.5047, gal545_A_143_217:16.44311, gal545_A_217:88.90734, ps_A_100_100:245.5505, ps_A_143_143:31.21603, ps_A_143_217:24.33498, ps_A_217_217:100.7772, galf_TE_A_100:0.0861488, galf_TE_A_100_143:0.1955448, galf_TE_A_100_217:0.509976, galf_TE_A_143:0.3648059, galf_TE_A_143_217:0.7208691, galf_TE_A_217:1.722586, DES_DzL1:0.00756885, DES_DzL2:-0.01112213, DES_DzL3:-0.002029036, DES_DzL4:-0.0009077926, DES_DzL5:-0.008044257, DES_b1:1.473136, DES_b2:1.710627, DES_b3:1.674257, DES_b4:1.994127, DES_b5:2.184813, DES_DzS1:-0.0211984, DES_DzS2:-0.008531034, DES_DzS3:-0.003726577, DES_DzS4:-0.0205448, DES_m1:-0.02914757, DES_m2:-0.02931022, DES_m3:-0.005824037, DES_m4:-0.01277812, DES_AIA:0.3168567, DES_alphaIA:2.935892
[0 : run] Aborting MPI due to error
----
clik version plc_3.1
  smica
Checking likelihood '/global/cfs/cdirs/desi/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/data/planck_2018/baseline/plc_3.0/hi_l/plik/plik_rd12_HM_v22b_TTTEEE.clik' on test data. got -1172.47 expected -1172.47 (diff -4.34054e-07)
----
Checking lensing likelihood '/global/cfs/cdirs/desi/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/data/planck_2018/baseline/plc_3.0/lensing/smicadx12_Dec5_ftl_mv2_ndclpp_p_teb_consext8.clik_lensing' on test data. got -4.42102
cmbant commented

Looks similar to the temperature checking issue that was fixed; it probably comes from the recent temperature-related changes. One for @JesusTorrado to check when back.

As a workaround, you can just comment out these checks.

@cmbant, sure. Do you know where I could find and comment that out? Thanks.

cmbant commented

Just search for the error message (The sum of logpriors in the sample is not consist)
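For anyone unsure how to run that search against an installed copy, here is a minimal sketch in plain Python. `find_message` is a hypothetical helper written for this thread, not part of Cobaya, and the fragment "sum of logpriors" is only a guess at a piece of the message that sits on a single source line (the full message may be split across string literals in the code).

```python
import importlib.util
from pathlib import Path


def find_message(root, message):
    """Yield (path, line_number, line) for each .py line under root containing message."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if message in line:
                yield path, number, line.strip()


if __name__ == "__main__":
    # Locate the installed package without importing it; spec is None if it is absent.
    spec = importlib.util.find_spec("cobaya")
    if spec is not None and spec.origin:
        for path, number, line in find_message(Path(spec.origin).parent, "sum of logpriors"):
            print(f"{path}:{number}: {line}")
```

The traceback further down this thread points at `cobaya/collection.py`, so that is a reasonable first place to look if the search returns several hits.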

Hi @cmbant,

I have a similar issue and I would like to confirm if it is safe to deactivate the following check as well:

    self.collection = SampleCollection(
  File "/global/common/software/desi/users/adematti/perlmutter/cosmodesiconda/20230725-1.0.0/conda/lib/python3.10/site-packages/cobaya/collection.py", line 289, in __init__
    raise LoggedError(
cobaya.log.LoggedError: Error when loading samples: The sample seems to have an inconsistent temperature.
cmbant commented

The temperature error should be fixed/worked around in latest Cobaya master - were you using that?

@JesusTorrado, have you had a chance to look at a fix for all these new read accuracy errors?

Not yet. I was doing some I/O experiments. I'll get to it very soon!

@mishakb could you please check if the new branch fix_post_prior_test fixes your issue?

The easiest way is to install with pip from that branch with

pip install git+https://github.com/CobayaSampler/cobaya.git@fix_post_prior_test

Probably fixed by #322. Please reopen if it can still be reproduced.

Hello,
I am getting the following error, related to inconsistent temperature and tolerance, in one of my Cobaya runs.

2024-08-07 14:24:51,806 [0 : samplecollection] ERROR The sample seems to have an inconsistent temperature.
2024-08-07 14:24:51,806 [0 : samplecollection] WARNING Needed to relax tolerances when checking consistency of log probabilities and temperature (if present).
2024-08-07 14:24:51,808 [0 : samplecollection] ERROR The sample seems to have an inconsistent temperature.
2024-08-07 14:24:51,808 [0 : samplecollection] ERROR Error when loading samples: The sample seems to have an inconsistent temperature.

Is it related to this issue? Can it also be solved by installing from the following branch?
pip install git+https://github.com/CobayaSampler/cobaya.git@fix_post_prior_test

I think that's already merged. Can you attach chains/code to reproduce the issue?

@SukanB can you share the files?

Hi @cmbant, thanks for your response. I only made changes in the file classy/source/background.c, to modify the existing scalar field potential for dark energy. I attach the modified code and the output file here. Also, please note that this happens after I resume a previous run that had stopped.
ftoutput.txt
backgroundft.txt

Thanks, but could you attach a zip of the actual offending chain files (FTPLDU/ftpdu*)?

@SukanB or email directly if you don't want it public

Thanks for emailing the file. OK, so the temperature thing is a bit of a red herring: the issue is that the last line of the chain files does not have a complete set of columns, and hence is filled with NaN when loaded into the collection (presumably from a walltime kill happening during a file write or before a flush).
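To check whether a chain file has this problem before resuming, something like the following sketch may help. It is not Cobaya code: `find_incomplete_rows` and `drop_truncated_tail` are hypothetical helpers, and they assume the chain file is plain text whose first line is a `#`-prefixed header naming the columns, followed by whitespace-separated rows.

```python
from pathlib import Path


def find_incomplete_rows(chain_path):
    """Return 1-based line numbers of rows whose column count differs from the header."""
    lines = Path(chain_path).read_text().splitlines()
    if not lines:
        return []
    # Assumption: the first line is a '#'-prefixed header listing the column names.
    n_cols = len(lines[0].lstrip("#").split())
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if line.strip() and len(line.split()) != n_cols
    ]


def drop_truncated_tail(chain_path):
    """Drop the last row if it is incomplete, keeping a .bak copy; return True if changed."""
    path = Path(chain_path)
    original = path.read_text()
    lines = original.splitlines()
    if len(lines) < 2:
        return False
    n_cols = len(lines[0].lstrip("#").split())
    if len(lines[-1].split()) == n_cols:
        return False
    # Keep the untouched file next to the repaired one.
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text("\n".join(lines[:-1]) + "\n")
    return True
```

A column count can only catch rows that lost whole columns. A row cut in the middle of its final number would still pass, so it is worth inspecting the last line by eye as well.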

@SukanB Can you try #378?