fjosw/ADjson.jl

Missing measurements in json data are not compatible with ADerrors.jl

Bunniies opened this issue · 4 comments

Hi Fabian,

I noticed that the current stable release is not compatible with ADerrors.jl package when there are missing measurements in the data. In more details, I have to read json files with data measured every two configurations and load them through ADerrors, but I get the error message:
ERROR: Irregular Monte Carlo chains for multiple replica cannot be safely read into ADerrors.
However, by looking at the ADerrors.jl documentation (https://ific.uv.es/~alramos/docs/ADerrors/api/#ADerrors.uwreal) there are actually two extra parameters that can be passed when creating an uwreal object with missing measurements. These parameters are:

idm::Type Vector{Int64}. idm[n] labels the configuration where data[n]is measured.
nms::Type Int64. The total number of measurements in the ensemble

As an example with multiple replicas:
Observable measured on the even configurations
2, 4, 6, ..., 200 on an emsemble of length 200
with two replica of lengths 75, 125
a = uwreal(data_a[1:500], "Observable with gaps in an ensemble with replica", [75, 125], collect(2:2:200), 200)

where in this case sum(replica)== sum([75,125]) must be equal to nms=200

I would be very interested in solving this issue with the current ADjson version to make it compatible with missing measurements in multiple replicas and I am willing to discuss this with you.
Thank you in advance,
Alessandro

fjosw commented

Hi Alessandro, thanks for opening this issue. I think the reason why I added the explicit error message at the time was that ADerrors and pyerrors handle spaced data differently. In ADerrors the rescaled samples are stored in memory while in pyerrors the rescaling to the total number of configurations is only done when calculating the error. To be consistent with the data format one would have to undo this scaling when writing to a file from ADerrors.
As you pointed out this should be less of an issue when reading the data, one would just have to identify the correct spacing and pass it to the uwreal constructor. Maybe @s-kuberski also has something to say about this as he mainly designed the file format.

I always had the intention to add the missing features to the code when needed and I am happy that is seems useful for you. Unfortunately, I won't have much time this week to give it a closer look but if you want to give it a go I am happy to review and accept a pull request.

In case that is helpful here is a short example which creates a file for testing purposes using pyerrors.

import numpy as np
import pyerrors as pe

obs = pe.Obs([np.random.normal(0.3, 0.1, 75), np.random.normal(0.3, 0.1, 125)], ["ens|r0", "ens|r1"], [range(1, 150, 2), range(1, 250, 2)])
pe.input.json.dump_to_json([obs], "test_file")

Hi Alessandro,
I had a look at the import function. I think that most of the functionalities are already there. What remains to be done is:
1.) Instead of checking

for ch in 1:length(cnfg_numbers)
    if ch != cnfg_numbers[ch]

in https://github.com/fjosw/ADjson.jl/blob/a3592641596ba7e986fc60fdb5bc1a0339c867a4/src/ADjson.jl#LL104C25-L105C54, we would only want to throw an exception if there are irregularities in the MC chain. In pyerrors, we would check

gaplist = numpy.unique(nump.diff(cnfg_numbers))
if len(gaplist) != 1:
[...]
gap = gaplist[0]

If this does not evaluate to True, there are missing measurements an an exception would have to be thrown.

2.) In https://github.com/fjosw/ADjson.jl/blob/a3592641596ba7e986fc60fdb5bc1a0339c867a4/src/ADjson.jl#LL112C21-L112C63 we would want to do something like

append!(rep_lengths, cnfg_numbers[-1] - cnfg_numbers[0] + 1)

to compute the actual length of the replica (including the omitted configurations).

3.) When calling uwreal, the configuration numbers, as well as the length of the replica would have to be passed. This would amount to calling

uwreal(Vector{Float64}(conc_deltas[i]), element["id"], int_cnfg_numbers, rep_lengths)

or

uwreal(Vector{Float64}(conc_deltas[i]), element["id"], int_cnfg_numbers, rep_lengths[0])

when gap > 1 (and similarly for the case of lists in the lines below).

I'm not experienced in coding in julia. If you like, you could try to implement these changes. Otherwise, I can have a try.

Hi Fabian, Simon,

Thank you for your help. I had a closer look at the code and managed to solve the issue. The check condition in https://github.com/fjosw/ADjson.jl/blob/a3592641596ba7e986fc60fdb5bc1a0339c867a4/src/ADjson.jl#LL104C25-L105C54

if length(element["replica"]) > 1
    for ch in 1:length(cnfg_numbers)
         if ch != cnfg_numbers[ch]
              error("Irregular Monte Carlo chains for multiple replica cannot be safely read into ADerrors.")
         end
     end
end

was evaluating True even when missing measurements were absent, but the observables were not measured on all configurations. This was exactly my case, as I was trying to read data measured every 4 configurations and without missing measurements.

Therefore I simply changed the checking to

if length(element["replica"]) > 1
    gaplist = diff(cnfg_numbers)
     if  !all(gaplist[1] .== gaplist)
          error("Irregular Monte Carlo chains for multiple replica cannot be safely read into ADerrors.")
     end
 end

In such a way that if a missing measurement is detected then the error is triggered, while for equally spaced measurements the code continues normally. I also checked the subsequent calls to uwreal and they look consistent with the ADerrors.jl documentation in the case of spaced measurements. All the relevant information that need to be passed to uwreal were already stored in rep_lengths and int_cnfg_numbers and the code works without issues.

I will further check for possible bugs but this seems to work just fine.

Thanks again to both of you,
Alessandro

fjosw commented

Sounds good! Can you open a pull request with these changes once you have finished testing?