possible mistake in mfdb_interval documentation

Question

Closed this issue 7 years ago · 9 comments

This code from the help file of mfdb_interval indicates that, for every interval, the lower bound is inclusive and the upper bound is exclusive. In practice, though, the function appears to aggregate with the lower bound exclusive and the upper bound inclusive. Is this just a mistake in the help file?

"## Create groups len40: [40, 60), len60: [60, inf) (but will be described as [60, 80) in the GADGET model)
g1 <- mfdb_interval("len", c(40, 60, 80), open_ended = c("upper"))"

Thanks,
Pamela
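
To make the two readings concrete, here is a minimal base-R sketch. It does not call mfdb at all; `breaks` and `lengths` are made-up values, and `findInterval` simply stands in for whatever grouping the database query performs. It contrasts the documented convention (lower bound inclusive, upper bound exclusive) with the opposite one.

```r
# Hypothetical break points and sample lengths, chosen to sit on the boundaries
breaks  <- c(40, 60, 80)
lengths <- c(40, 59, 60, 79)

# Documented convention: [40, 60), [60, 80)
documented <- findInterval(lengths, breaks)

# Opposite convention: (40, 60], (60, 80]
# A result of 0 means the value falls below the first group.
reported <- findInterval(lengths, breaks, left.open = TRUE)

print(data.frame(length = lengths, documented = documented, reported = reported))
```

Under the documented convention a fish of length 60 belongs to the second group; under the opposite convention it belongs to the first, and a fish of exactly 40 is not grouped at all.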

Answer 1 · 2018-01-02T22:26:44.000Z

I have also noticed this issue. As a workaround I have simply added a desired maximum length onto the end of the argument to vect, but it probably should get corrected in the mfdb package. Pam, just to clarify: you're saying that in the above example the length aggfile for whichever component you are creating will only have length aggregations up to 80, but no length group for everything above that, correct? That is the issue I was having.

Answer 2 · 2018-01-03T08:34:08.000Z

Hiya Paul! Actually no, with Bjarki's help we have found this is likely a problem with using length data that are not integers. But I also just realized that this problem is weird enough that I need to just use the real example from my data:

Let's say these are my data:
Length Count

1.00 : 1125.9259
1.05 : 2848.2495
1.10 : 5933.5657
1.15 : 19033.4530
1.20 : 48470.2480
1.25 : 134606.1640
1.30 : 236715.6261
1.35 : 433520.1095
1.40 : 577904.8897
1.45 : 386259.3622
1.50 : 241126.0718
1.55 : 127828.4487

Defining ldist.ins1 using the command:

ldist.ins1 <-
  mfdb_sample_count(mdb, 
                    c('age', 'length'), 
                    c(list(
                      data_source = 'iceland-ldist',
                      sampling_type = sample_sources,
                      age = mfdb_interval("all",c(minage,maxage),
                                       open_ended = c("upper","lower")),
                      length = mfdb_interval("len", 
                                             seq(from=minlength, to=maxlength, by = 0.1),
                                             open_ended = c("upper","lower"))),
                      defaults))

results in this:

len1 : 3974.18 (sum of 1.0 & 1.05)
len1.1 : 24967.00 (sum of 1.1 and 1.15)
len1.2 : 419792.00 (sum of 1.2, 1.25 and 1.3)
len1.3 : 1011420.00 (sum of 1.35 and 1.4)
len1.4 : 386259.00 (1.45 only)
len1.5 : 368954.00 (sum of 1.50 and 1.55)

Answer 3 · 2018-01-03T09:12:26.000Z

I guess what @pfrater is referring to is the open_ended argument. If you want everything above 80 to be included in the query, you can set the query interval with:

mfdb_interval("len", c(10,20,80), open_ended = 'upper')

This results in a query where everything from 20 upward is aggregated, that is, the number of fish in [20, inf) is calculated, but the resulting aggregation file will report the length bin as [20,80]. Similarly, you can specify open_ended as "lower" or "both".
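
A rough base-R sketch of that behaviour, assuming only what is described above: the last break is kept for labelling, but it is treated as infinity when assigning fish to groups. This is not mfdb's implementation; `query_breaks`, `labels` and the sample `lengths` are hypothetical names and values used for illustration.

```r
# Breaks as passed to mfdb_interval("len", c(10, 20, 80), open_ended = "upper")
breaks  <- c(10, 20, 80)
lengths <- c(12, 25, 95)   # made-up fish lengths; 95 lies above the last break

# For grouping, replace the final break with Inf so the last group is [20, Inf)
query_breaks <- c(breaks[-length(breaks)], Inf)
group <- findInterval(lengths, query_breaks)

# Group names follow the lower bound of each interval
labels <- paste0("len", breaks[-length(breaks)])
print(data.frame(length = lengths, group = labels[group]))
```

Here the fish of length 95 ends up in len20 even though the reported bin stops at 80.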

Answer 4 · 2018-01-03T11:04:37.000Z

Hrm. This test case works on 6.x:

    # Import a survey for the data we are interested in
    # https://github.com/mareframe/mfdb/issues/52
    mfdb_import_survey(mdb, data_source = "mfdb_nonint_groups",
        table_string("
year    month   areacell        species length  count
1998    1       45G01           COD     1.0     2
1998    1       45G01           COD     1.05    8
1998    1       45G01           COD     1.1     10
1998    1       45G01           COD     1.15    4
        "))

    # Group by length without na_group, NA's shouldn't be visible
    agg_data <- mfdb_sample_count(mdb, c('length'), list(
        data_source = "mfdb_nonint_groups",
        length = mfdb_interval("len", c(1.0, 1.1, 1.2), open_ended="upper"),
        null = NULL))
    ok(cmp(unattr(agg_data[[1]]), data.frame(
        year = c("all"),
        step = c('all'),
        area = c('all'),
        length = c('len1', 'len1.1'),
        number = c(10, 14),
        stringsAsFactors = FALSE)), "We can group by non-integer groups")

Note that this isn't c(1.0, 1.1) as you suggest; I added 1.2 to get the second group. Am I getting something wrong in the above? It's also possible that this is an already-fixed bug: odd behaviour with non-integer intervals does ring vague bells.

Answer 5 · 2018-01-03T13:18:28.000Z

Aha, yep this is a bug, not a mistake in the documentation.

Answer 6 · 2018-01-03T13:34:12.000Z

Sorry - the example data I gave previously (and you used for your example) might not work because I'm not 100% sure that the problem is mfdb_interval, and the problem is stranger there than what I described originally. I just edited my previous comment to have data and code taken exactly as I'm implementing in 6.x (before realizing you already responded).

len1 : 3974.18 (sum of 1.0 & 1.05)
len1.1 : 24967.00 (sum of 1.1 and 1.15)
len1.2 : 419792.00 (sum of 1.2, 1.25 and 1.3)
len1.3 : 1011420.00 (sum of 1.35 and 1.4)
len1.4 : 386259.00 (1.45 only)
len1.5 : 368954.00 (sum of 1.50 and 1.55)
len1.6 : sum of 1.6 and 1.65
len1.7 : sum of 1.7, 1.75 and 1.8
len1.8 : sum of 1.85 and 1.9
len1.9 : 1.95 only
len2.0 : sum of 2.0, 2.05, 2.10
len2.1 : 2.15 only
len2.2 : sum of 2.2, 2.25, 2.3
len2.3 : 2.35 only

Answer 7 · 2018-01-03T13:58:57.000Z

There's definitely a problem in mfdb_interval, though it may not be the only thing going on here. It's telling PostgreSQL to compare floating-point and exact-precision numbers, which will only work some of the time. Once I've decided how to fix it, I'll push something to test.
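
The general hazard can be shown in plain R, without a database. The sketch below is only an illustration of why decimal boundaries are fragile in floating point, plus one common workaround (scaling everything to whole units before comparing). It is not necessarily how mfdb was fixed; `to_units` is a hypothetical helper, and the data are the four rows from the test case above.

```r
# Binary floating point cannot represent most decimal fractions exactly,
# so a boundary comparison can go either way:
print(0.1 + 0.2 == 0.3)   # FALSE in R

# Data from the test case above
lengths <- c(1.0, 1.05, 1.1, 1.15)
counts  <- c(2, 8, 10, 4)
breaks  <- c(1.0, 1.1, 1.2)

# Hypothetical workaround: compare in whole hundredths instead of decimals
to_units <- function(x, scale = 100) as.integer(round(x * scale))

group  <- findInterval(to_units(lengths), to_units(breaks))
totals <- tapply(counts, group, sum)
print(totals)
```

With the boundaries compared as whole numbers, the groups come out as 10 and 14, matching the expected result in the test case.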

Answer 8 · 2018-01-05T16:49:55.000Z

@pamelajwoods Can you try the latest 6.x and see if this sorts your problems?

Answer 9 · 2018-01-09T09:43:01.000Z

Sorry for the late reply - this works perfectly now and corresponds with the help file. Thanks!