AtlasOfLivingAustralia/galah-R

`galah_group_by()` produces different results when order of grouping variables is changed

shandiya opened this issue · 1 comments

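When galah_group_by() is used with more than one variable, the number of rows returned depends on the order of the variables. In the example below, grouping by cl1048 then year returns 15 rows, while grouping by year then cl1048 returns 199.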

galah version
1.5.1

To Reproduce

reg <- c("Gibson Desert", 
         "Little Sandy Desert", 
         "Southern Volcanic Plain",
         "Flinders Lofty Block")

# IBRA then year
ibra_year <- galah_call() |> 
  galah_filter(cl1048 == reg, 
               year >= 1971,
               year <= 2020) |> 
  galah_group_by(cl1048, year) |> 
  atlas_counts() 

> ibra_year
# A tibble: 15 × 3
   year  cl1048                   count
   <chr> <chr>                    <int>
 1 2020  Southern Volcanic Plain 316147
 2 2020  Flinders Lofty Block     99965
 3 2020  Little Sandy Desert          7
 4 2019  Southern Volcanic Plain 231025
 5 2019  Flinders Lofty Block     86448
 6 2019  Little Sandy Desert        102
 7 2019  Gibson Desert               71
 8 2018  Southern Volcanic Plain 237771
 9 2018  Flinders Lofty Block     77158
10 2018  Gibson Desert             1988
11 2018  Little Sandy Desert        447
12 2016  Southern Volcanic Plain 168460
13 2016  Flinders Lofty Block     99542
14 2016  Gibson Desert              301
15 2016  Little Sandy Desert        138

# year then IBRA
year_ibra <- galah_call() |> 
  galah_filter(cl1048 == reg, 
               year >= 1971,
               year <= 2020) |> 
  galah_group_by(year, cl1048) |> 
  atlas_counts()

> year_ibra
# A tibble: 199 × 3
   cl1048                  year   count
   <chr>                   <chr>  <int>
 1 Southern Volcanic Plain 2020  316147
 2 Southern Volcanic Plain 2018  237771
 3 Southern Volcanic Plain 2019  231025
 4 Southern Volcanic Plain 2017  181471
 5 Southern Volcanic Plain 2015  179698
 6 Southern Volcanic Plain 2016  168460
 7 Southern Volcanic Plain 2014  120252
 8 Southern Volcanic Plain 2011  102043
 9 Southern Volcanic Plain 2013   86229
10 Southern Volcanic Plain 2012   77885
# ℹ 189 more rows
# ℹ Use `print(n = ...)` to see more rows

Expected behaviour
The same number of rows should be returned irrespective of grouping order, with the only difference being the order of columns in the returned tibble.
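
One way to check this expectation is to compare the two results while ignoring column and row order. The sketch below uses base R only; `same_counts()` is a hypothetical helper written for illustration and is not part of galah, and the toy data frames stand in for real `atlas_counts()` output.

```r
# Hypothetical helper (not part of galah): TRUE when two count tables contain
# the same rows, ignoring the order of columns and the order of rows.
same_counts <- function(a, b) {
  a <- as.data.frame(a)
  b <- as.data.frame(b)
  if (!setequal(names(a), names(b)) || nrow(a) != nrow(b)) {
    return(FALSE)
  }
  # Put columns in a common order, then sort rows by every column.
  cols <- sort(names(a))
  a <- a[, cols, drop = FALSE]
  b <- b[, cols, drop = FALSE]
  a <- a[do.call(order, unname(as.list(a))), , drop = FALSE]
  b <- b[do.call(order, unname(as.list(b))), , drop = FALSE]
  rownames(a) <- NULL
  rownames(b) <- NULL
  identical(a, b)
}

# Toy data: same rows, different column and row order.
x <- data.frame(cl1048 = c("A", "B"),
                year   = c("2020", "2020"),
                count  = c(7L, 3L),
                stringsAsFactors = FALSE)
y <- x[c(2, 1), c("year", "cl1048", "count")]

same_counts(x, y)
```

Applied to the objects above, `same_counts(ibra_year, year_ibra)` would be expected to return `FALSE` under galah 1.5.1, since the row counts differ (15 vs 199).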

The good news is that galah 2.0.0 has fixed this issue (yay).

However, a row limit is set internally so that slice_head() and arrange() work correctly in complex queries. This limit of 30 rows is (at the moment) opaque to the user.

In this case, running the first query without setting a higher limit via atlas_counts(limit = ) returns 120 rows. This is because each region in reg is limited to 30 rows (4 × 30 = 120), even though the query spans 50 years.

library(galah)

reg <- c("Gibson Desert", 
         "Little Sandy Desert", 
         "Southern Volcanic Plain",
         "Flinders Lofty Block")

# IBRA then year (with no limit)
ibra_year <- galah_call() |> 
  galah_filter(cl1048 == reg, 
               year >= 1971,
               year <= 2020) |> 
  galah_group_by(cl1048, year) |> 
  atlas_counts() 

ibra_year
#> # A tibble: 120 × 3
#>    cl1048                  year   count
#>    <chr>                   <chr>  <int>
#>  1 Southern Volcanic Plain 2020  319082
#>  2 Southern Volcanic Plain 2018  238959
#>  3 Southern Volcanic Plain 2019  232903
#>  4 Southern Volcanic Plain 2017  182618
#>  5 Southern Volcanic Plain 2015  180192
#>  6 Southern Volcanic Plain 2016  169479
#>  7 Southern Volcanic Plain 2014  120798
#>  8 Southern Volcanic Plain 2011  102496
#>  9 Southern Volcanic Plain 2013   86669
#> 10 Southern Volcanic Plain 2012   78345
#> # ℹ 110 more rows

# IBRA then year (with a high limit)
ibra_year <- galah_call() |> 
  galah_filter(cl1048 == reg, 
               year >= 1971,
               year <= 2020) |> 
  galah_group_by(cl1048, year) |> 
  atlas_counts(limit = 1000) 

ibra_year
#> # A tibble: 199 × 3
#>    cl1048                  year   count
#>    <chr>                   <chr>  <int>
#>  1 Southern Volcanic Plain 2020  319082
#>  2 Southern Volcanic Plain 2018  238959
#>  3 Southern Volcanic Plain 2019  232903
#>  4 Southern Volcanic Plain 2017  182618
#>  5 Southern Volcanic Plain 2015  180192
#>  6 Southern Volcanic Plain 2016  169479
#>  7 Southern Volcanic Plain 2014  120798
#>  8 Southern Volcanic Plain 2011  102496
#>  9 Southern Volcanic Plain 2013   86669
#> 10 Southern Volcanic Plain 2012   78345
#> # ℹ 189 more rows

Created on 2023-12-22 with reprex v2.0.2
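
The 120-row figure can be reproduced without calling the API. The sketch below is a toy model only: it assumes the internal limit behaves like keeping the first 30 rows of each first-level group, which is an illustration of the arithmetic and not a description of galah's actual implementation.

```r
# Toy stand-in for the query result: 4 regions x 50 years (1971-2020).
reg <- c("Gibson Desert",
         "Little Sandy Desert",
         "Southern Volcanic Plain",
         "Flinders Lofty Block")
full <- expand.grid(year = 1971:2020, cl1048 = reg, stringsAsFactors = FALSE)

# Assumption: the limit acts like head(30) within each region.
internal_limit <- 30
capped <- do.call(
  rbind,
  lapply(split(full, full$cl1048), head, n = internal_limit)
)

nrow(full)    # every region-year combination: 4 * 50
nrow(capped)  # after the per-region cap: 4 * 30
```

The real query returns 199 rows instead of 200 with a high limit, presumably because one region-year combination has no records.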

As a temporary fix to avoid this unexpected internal limit, the limit has been increased to 10,000 on the dev branch and a message will now appear if you happen to hit that limit (which should be very rare).

A proper fix will involve figuring out how to avoid the need to set a limit internally for slice_head() and arrange() to work.