Make mfdb/g3_data aggregation attributes more visible
MikkoVihtakari opened this issue · 3 comments
Copying the debate from a Teams chat here as it may be of an interest for @lentinj too. It started from us realising that missing data are handled as zeros in the model as long as they are within the aggregation bins. Also 0/0 appear to be handled as zeros. If this is how gadget has always been, and if changing this behaviour would break everything, we just need to learn to cope with it. Hence some suggestions for better visibility of the important aggregation information that are currently hidden in attributes()
[06:27] Vihtakari, Mikko
I have two improvement suggestions at least: 1) make g3data class which is a simple data frame but uses an explicit print method (print.g3data), which prints head() of the data frame together with summary information of central attributes. A bit like tibble printing combined with lm or other model printing.
[06:28] Vihtakari, Mikko
2) Add my data figures to gadgetplots and add the length attribute bins to those figures as background grid.
[06:28] Vihtakari, Mikko
With those two changes the attributes will be more visible for the user. One could also have tests for cases where data exceed the attribute limits, etc.
[06:31] Vihtakari, Mikko
I am fine with missing values being 0 if that's how gadget has always worked (although 0/0 is still confusing me), but I am not fine hiding all these important details in attributes instead of letting the user know about them very clearly.
[06:34] Vihtakari, Mikko
Also carrying over these data attributes to g3_fit object could be an idea. We could then add them to the plots to remind the user how they aggregated their data and to see inconsistencies in the aggregation.
Also, I am confused which aggregation parameters (in the attributes) assume zeros and which do not. Seems that it is fine to have missing years in the data even though aggregation parameters are defined for that year.
If this is how gadget has always been, and if changing this behaviour would break everything, we just need to learn to cope with it.
I'm pretty sure it is. At very least it was how gadget2 worked in @bthe 's head :)
However, if the data is a more accurate reflection of the aggregation than the attributes are, is it worth just removing them and letting g3l_distribution figure it out from the data?
make g3data class which is a simple data frame but uses an explicit print method (print.g3data),
I'm up for this. The main reason these data.frames are class-less is because back in the day there was no gadget-related packages, only MFDB.
Also, I am confused which aggregation parameters (in the attributes) assume zeros and which do not. Seems that it is fine to have missing years in the data even though aggregation parameters are defined for that year.
Basically, yes. Gaps in year/step mean don't do any comparison, otherwise fall back to whatever ifmissing
is set to: https://gadget-framework.github.io/gadget3/reference/likelihood_distribution.html#obs-data-and-data-aggregation
Related to this conversation is #112, which adds parsing of factor levels generated by cut()
to achieve the same thing as the MFDB-style attributes, so the following produces groups gadget3 will understand:
# Generate a bunch of nonsense that doesn't fill the full length range
ldist.lln.raw <- data.frame(
year = c(1999, 2000),
age = sample(5:9, 100, replace = TRUE),
length = sample(10:70, 100, replace = TRUE),
number = 1,
stringsAsFactors = FALSE)
ldist.lln.raw |>
group_by(year = year, age = age, length = cut(length, breaks = seq(10, 100, by = 10), right = FALSE)) |>
summarise(number = sum(number))
Arguably it's just shunting the hidden attributes somewhere else, but it's shunting them somewhere tibble understands.