`galah_group_by()` produces different results when order of grouping variables is changed
shandiya opened this issue · 1 comments
When galah_group_by()
is used with more than one variable, different numbers of rows are returned if the order of variables is changed.
galah version
1.5.1
To Reproduce
reg <- c("Gibson Desert",
"Little Sandy Desert",
"Southern Volcanic Plain",
"Flinders Lofty Block")
# IBRA then year
ibra_year <- galah_call() |>
galah_filter(cl1048 == reg,
year >= 1971,
year <= 2020) |>
galah_group_by(cl1048, year) |>
atlas_counts()
> ibra_year
# A tibble: 15 × 3
year cl1048 count
<chr> <chr> <int>
1 2020 Southern Volcanic Plain 316147
2 2020 Flinders Lofty Block 99965
3 2020 Little Sandy Desert 7
4 2019 Southern Volcanic Plain 231025
5 2019 Flinders Lofty Block 86448
6 2019 Little Sandy Desert 102
7 2019 Gibson Desert 71
8 2018 Southern Volcanic Plain 237771
9 2018 Flinders Lofty Block 77158
10 2018 Gibson Desert 1988
11 2018 Little Sandy Desert 447
12 2016 Southern Volcanic Plain 168460
13 2016 Flinders Lofty Block 99542
14 2016 Gibson Desert 301
15 2016 Little Sandy Desert 138
# year then IBRA
year_ibra <- galah_call() |>
galah_filter(cl1048 == reg,
year >= 1971,
year <= 2020) |>
galah_group_by(year, cl1048) |>
atlas_counts()
> year_ibra
# A tibble: 199 × 3
cl1048 year count
<chr> <chr> <int>
1 Southern Volcanic Plain 2020 316147
2 Southern Volcanic Plain 2018 237771
3 Southern Volcanic Plain 2019 231025
4 Southern Volcanic Plain 2017 181471
5 Southern Volcanic Plain 2015 179698
6 Southern Volcanic Plain 2016 168460
7 Southern Volcanic Plain 2014 120252
8 Southern Volcanic Plain 2011 102043
9 Southern Volcanic Plain 2013 86229
10 Southern Volcanic Plain 2012 77885
# ℹ 189 more rows
# ℹ Use `print(n = ...)` to see more rows
Expected behaviour
The same number of rows should be returned irrespective of grouping order, with the only difference being the order of columns in the returned tibble.
The good news is that galah 2.0.0 has fixed this issue (yay)
However, there is a row limit set internally to make slice_head()
and arrange()
functions work correctly in complex queries. This limit of 30 rows is (at the moment) opaque to the user.
What this means in this case is that running the first query without setting a higher limit using atlas_counts(limit = )
will return 120 rows. This is because each each region in reg
will be limited to only 30 rows but the full year range in the query is 50.
library(galah)
reg <- c("Gibson Desert",
"Little Sandy Desert",
"Southern Volcanic Plain",
"Flinders Lofty Block")
# IBRA then year (with no limit)
ibra_year <- galah_call() |>
galah_filter(cl1048 == reg,
year >= 1971,
year <= 2020) |>
galah_group_by(cl1048, year) |>
atlas_counts()
ibra_year
#> # A tibble: 120 × 3
#> cl1048 year count
#> <chr> <chr> <int>
#> 1 Southern Volcanic Plain 2020 319082
#> 2 Southern Volcanic Plain 2018 238959
#> 3 Southern Volcanic Plain 2019 232903
#> 4 Southern Volcanic Plain 2017 182618
#> 5 Southern Volcanic Plain 2015 180192
#> 6 Southern Volcanic Plain 2016 169479
#> 7 Southern Volcanic Plain 2014 120798
#> 8 Southern Volcanic Plain 2011 102496
#> 9 Southern Volcanic Plain 2013 86669
#> 10 Southern Volcanic Plain 2012 78345
#> # ℹ 110 more rows
# IBRA then year (with a high limit)
ibra_year <- galah_call() |>
galah_filter(cl1048 == reg,
year >= 1971,
year <= 2020) |>
galah_group_by(cl1048, year) |>
atlas_counts(limit = 1000)
ibra_year
#> # A tibble: 199 × 3
#> cl1048 year count
#> <chr> <chr> <int>
#> 1 Southern Volcanic Plain 2020 319082
#> 2 Southern Volcanic Plain 2018 238959
#> 3 Southern Volcanic Plain 2019 232903
#> 4 Southern Volcanic Plain 2017 182618
#> 5 Southern Volcanic Plain 2015 180192
#> 6 Southern Volcanic Plain 2016 169479
#> 7 Southern Volcanic Plain 2014 120798
#> 8 Southern Volcanic Plain 2011 102496
#> 9 Southern Volcanic Plain 2013 86669
#> 10 Southern Volcanic Plain 2012 78345
#> # ℹ 189 more rows
Created on 2023-12-22 with reprex v2.0.2
As a temporary fix to avoid this unexpected internal limit, the limit has been increased to 10,000 on the dev
branch and a message will now appear if you happen to hit that limit (which should be very rare).
A proper fix will involve figuring out how to avoid the need to set a limit internally for slice_head()
and arrange()
to work