*ERROR* Error when loading samples: The sum of logpriors in the sample is not consistent.
I get this error when I try to resume a job. I was able to resume it at least once, but this second time it fails with this error. I tried several times and got the same message. This is the content of the job.out file:
[0 : output] Found existing info files with the requested output prefix: 'results/ow0waCDM_all'
[0 : output] Let's try to resume/load.
[2 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[2 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[2 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[2 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[0 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[0 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[0 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[0 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[1 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[1 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[1 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[1 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[3 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[3 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[3 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[3 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[0 : output] Found an old sample. Resuming.
[0 : prior] *WARNING* External prior 'SZ' loaded. Mind that it might not be normalized!
[0 : camb] `camb` module loaded successfully from /global/cfs/cdirs/desicollab/users/adematti/perlmutter/cosmodesiconda/20221205-1.0.0/conda/lib/python3.10/site-packages/camb
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['fsigmar', 'DV_over_rd'].
[0 : planck_2018_highl_plik.ttteee] `clik` module loaded successfully from /global/cfs/cdirs/desicollab/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/code/planck/code/plc_3.0/plc-3.1/lib/python/site-packages/clik
[0 : planck_2018_lensing.clik] `clik` module loaded successfully from /global/cfs/cdirs/desicollab/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/code/planck/code/plc_3.0/plc-3.1/lib/python/site-packages/clik
[0 : mcmc] Resuming from previous sample!
[0 : prior] *WARNING* There are unbounded parameters (['A_planck', 'calib_100T', 'calib_217T', 'gal545_A_100', 'gal545_A_143', 'gal545_A_143_217', 'gal545_A_217', 'galf_TE_A_100', 'galf_TE_A_100_143', 'galf_TE_A_100_217', 'galf_TE_A_143', 'galf_TE_A_143_217', 'galf_TE_A_217', 'DES_DzL1', 'DES_DzL2', 'DES_DzL3', 'DES_DzL4', 'DES_DzL5', 'DES_DzS1', 'DES_DzS2', 'DES_DzS3', 'DES_DzS4', 'DES_m1', 'DES_m2', 'DES_m3', 'DES_m4']). Prior bounds are given at 0.9999995 confidence level. Beware of likelihood modes at the edge of the prior
[1 : samplecollection] Loaded 990 sample points from 'results/ow0waCDM_all.2.txt'
[2 : samplecollection] Loaded 1011 sample points from 'results/ow0waCDM_all.3.txt'
[0 : samplecollection] Loaded 1079 sample points from 'results/ow0waCDM_all.1.txt'
[3 : samplecollection] Loaded 1084 sample points from 'results/ow0waCDM_all.4.txt'
[0 : samplecollection] *ERROR* The sum of logpriors in the sample is not consistent.
[0 : samplecollection] *ERROR* Error when loading samples: The sum of logpriors in the sample is not consistent.
[1 : mcmc] Initial point: ombh2:0.02261121, omch2:0.1181356, H0:69.84661, logA:3.045463, ns:0.971925, omk:-0.0009965226, w:-0.9430333, wa:-0.4804295, tau:0.05782186, A_planck:1.001588, calib_100T:0.9993421, calib_217T:0.9989519, A_cib_217:51.14609, xi_sz_cib:0.3915068, A_sz:4.471394, ksz_norm:3.81948, gal545_A_100:7.050906, gal545_A_143:13.35773, gal545_A_143_217:18.55076, gal545_A_217:94.86781, ps_A_100_100:319.5084, ps_A_143_143:37.68866, ps_A_143_217:35.88573, ps_A_217_217:105.5367, galf_TE_A_100:0.128669, galf_TE_A_100_143:0.1368194, galf_TE_A_100_217:0.4279111, galf_TE_A_143:0.2070875, galf_TE_A_143_217:0.6186202, galf_TE_A_217:1.842039, DES_DzL1:0.004783368, DES_DzL2:-0.003013851, DES_DzL3:0.0008851392, DES_DzL4:0.004369828, DES_DzL5:0.003481381, DES_b1:1.477709, DES_b2:1.738489, DES_b3:1.620947, DES_b4:1.962905, DES_b5:2.061378, DES_DzS1:0.003615505, DES_DzS2:-0.02467024, DES_DzS3:0.02731843, DES_DzS4:-0.05860599, DES_m1:0.04670242, DES_m2:0.01681293, DES_m3:-0.003576742, DES_m4:0.01273669, DES_AIA:0.6885432, DES_alphaIA:-0.008803587
[2 : mcmc] Initial point: ombh2:0.02244094, omch2:0.1181104, H0:67.736, logA:3.040276, ns:0.9709289, omk:-0.0006007525, w:-0.7968705, wa:-0.7655362, tau:0.05525625, A_planck:0.9993413, calib_100T:0.9996387, calib_217T:0.9981446, A_cib_217:44.63784, xi_sz_cib:0.3801157, A_sz:6.045817, ksz_norm:5.675437, gal545_A_100:6.166859, gal545_A_143:10.48994, gal545_A_143_217:10.14862, gal545_A_217:76.86048, ps_A_100_100:239.4902, ps_A_143_143:31.59264, ps_A_143_217:40.76925, ps_A_217_217:121.607, galf_TE_A_100:0.1155986, galf_TE_A_100_143:0.1540269, galf_TE_A_100_217:0.544674, galf_TE_A_143:0.2837667, galf_TE_A_143_217:0.7849412, galf_TE_A_217:2.363021, DES_DzL1:0.003117714, DES_DzL2:0.002392154, DES_DzL3:0.002103641, DES_DzL4:-0.00591887, DES_DzL5:-0.008232313, DES_b1:1.440227, DES_b2:1.685149, DES_b3:1.630987, DES_b4:1.979471, DES_b5:2.105889, DES_DzS1:-0.004751653, DES_DzS2:-0.0317832, DES_DzS3:-0.0001454839, DES_DzS4:-0.03830876, DES_m1:0.003314337, DES_m2:-0.005635238, DES_m3:-0.02677006, DES_m4:0.02435357, DES_AIA:0.521304, DES_alphaIA:-1.325487
[3 : mcmc] Initial point: ombh2:0.02253404, omch2:0.1177752, H0:66.68511, logA:3.058207, ns:0.9679595, omk:-0.001886929, w:-0.8575862, wa:-0.4354515, tau:0.06036522, A_planck:1.003104, calib_100T:0.9999663, calib_217T:0.9988349, A_cib_217:51.09076, xi_sz_cib:0.3083462, A_sz:3.599204, ksz_norm:7.452705, gal545_A_100:7.437093, gal545_A_143:12.5047, gal545_A_143_217:16.44311, gal545_A_217:88.90734, ps_A_100_100:245.5505, ps_A_143_143:31.21603, ps_A_143_217:24.33498, ps_A_217_217:100.7772, galf_TE_A_100:0.0861488, galf_TE_A_100_143:0.1955448, galf_TE_A_100_217:0.509976, galf_TE_A_143:0.3648059, galf_TE_A_143_217:0.7208691, galf_TE_A_217:1.722586, DES_DzL1:0.00756885, DES_DzL2:-0.01112213, DES_DzL3:-0.002029036, DES_DzL4:-0.0009077926, DES_DzL5:-0.008044257, DES_b1:1.473136, DES_b2:1.710627, DES_b3:1.674257, DES_b4:1.994127, DES_b5:2.184813, DES_DzS1:-0.0211984, DES_DzS2:-0.008531034, DES_DzS3:-0.003726577, DES_DzS4:-0.0205448, DES_m1:-0.02914757, DES_m2:-0.02931022, DES_m3:-0.005824037, DES_m4:-0.01277812, DES_AIA:0.3168567, DES_alphaIA:2.935892
[0 : run] Aborting MPI due to error
----
clik version plc_3.1
smica
Checking likelihood '/global/cfs/cdirs/desi/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/data/planck_2018/baseline/plc_3.0/hi_l/plik/plik_rd12_HM_v22b_TTTEEE.clik' on test data. got -1172.47 expected -1172.47 (diff -4.34054e-07)
----
Checking lensing likelihood '/global/cfs/cdirs/desi/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/data/planck_2018/baseline/plc_3.0/lensing/smicadx12_Dec5_ftl_mv2_ndclpp_p_teb_consext8.clik_lensing' on test data. got -4.42102
Looks similar to the temperature checking issue (from the recent temperature-related changes) that was fixed. For @JesusTorrado to check when back.
As a workaround, you can just comment out these checks.
Just search for the error message (The sum of logpriors in the sample is not consist)
Hi @cmbant,
I have a similar issue and I would like to confirm if it is safe to deactivate the following check as well:
self.collection = SampleCollection(
File "/global/common/software/desi/users/adematti/perlmutter/cosmodesiconda/20230725-1.0.0/conda/lib/python3.10/site-packages/cobaya/collection.py", line 289, in __init__
raise LoggedError(
cobaya.log.LoggedError: Error when loading samples: The sample seems to have an inconsistent temperature.
The temperature error should be fixed/worked around in the latest Cobaya master. Were you using that?
@JesusTorrado, have you had a chance to look at a fix for all these new read accuracy errors?
Not yet. I was doing some I/O experiments. I'll get to it very soon!
@mishakb could you please check whether the new branch fix_post_prior_test fixes your issue?
The easiest way is to install with pip from that branch with
pip install git+https://github.com/CobayaSampler/cobaya.git@fix_post_prior_test
Probably fixed by #322. Please reopen if it can still be reproduced.
Hello,
I am getting the following error, related to an inconsistent temperature and relaxed tolerances, in one of my Cobaya runs.
2024-08-07 14:24:51,806 [0 : samplecollection] ERROR The sample seems to have an inconsistent temperature.
2024-08-07 14:24:51,806 [0 : samplecollection] WARNING Needed to relax tolerances when checking consistency of log probabilities and temperature (if present).
2024-08-07 14:24:51,808 [0 : samplecollection] ERROR The sample seems to have an inconsistent temperature.
2024-08-07 14:24:51,808 [0 : samplecollection] ERROR Error when loading samples: The sample seems to have an inconsistent temperature.
Is it related to this issue? Can it be solved also by installing with the following?
pip install git+https://github.com/CobayaSampler/cobaya.git@fix_post_prior_test
I think that's already merged. Can you attach chains/code to reproduce the issue?
Hi @cmbant, thanks for your response. I only made changes in the file classy/source/background.c, to modify the existing scalar field potential for dark energy. I attach the modified code and the output file here. Also, please note that this happens when I resume a previous run that had stopped.
ftoutput.txt
backgroundft.txt
Thanks, but could you attach a zip of the actual offending chain files (FTPLDU/ftpdu*)?
Thanks for emailing the file. OK, so the temperature thing is a bit of a red herring. The issue is that the last line of the chain files does not have a complete set of columns, and so it is filled with NaN when loaded into the collection (presumably because a walltime kill happened during a file write or before a flush).
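One way to check for an incomplete last line before resuming is to compare the column count of every row in a chain file against its header. The following is only a minimal sketch and not part of Cobaya: the helper names `expected_columns`, `find_truncated_rows` and `drop_truncated_tail` are hypothetical, and it assumes a plain-text chain file whose first `#` line names the columns, followed by one whitespace-separated row of numbers per sample. It cannot detect a cut that falls inside the last column and still leaves a parseable number. Back up the chain files before modifying them.

```python
from pathlib import Path


def expected_columns(lines):
    """Return the number of columns named in the first '#' header line, or None."""
    for line in lines:
        if line.startswith("#"):
            return len(line.lstrip("#").split())
    return None


def find_truncated_rows(path):
    """Return 0-based line indices of data rows that look incomplete.

    A row is flagged if its number of fields differs from the header
    (assumed layout, see above) or if any field is not a valid float.
    """
    lines = Path(path).read_text().splitlines()
    ncols = expected_columns(lines)
    bad = []
    for i, line in enumerate(lines):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if ncols is None:
            ncols = len(fields)  # no header found: use the first data row
        ok = len(fields) == ncols
        if ok:
            try:
                for field in fields:
                    float(field)
            except ValueError:
                ok = False  # e.g. a number cut off in the middle of its exponent
        if not ok:
            bad.append(i)
    return bad


def drop_truncated_tail(path):
    """Rewrite the file without its last row if that row is incomplete.

    Only the final non-empty line is ever removed; returns True if it was.
    """
    p = Path(path)
    lines = p.read_text().splitlines()
    non_empty = [i for i, line in enumerate(lines) if line.strip()]
    if not non_empty:
        return False
    last = non_empty[-1]
    if last not in find_truncated_rows(p):
        return False
    p.write_text("\n".join(lines[:last]) + "\n")
    return True
```

Calling `find_truncated_rows` on each chain file shows whether only the final row is affected, which is what a kill during a write would produce. `drop_truncated_tail` then removes that one row so the file can be loaded again.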