@slice_max only returning one row, despite no ties
Closed this issue ยท 7 comments
Here's some cycling-flavored data for a reprex ๐ I'm on Julia v.1.10.4.
using DataFrames
using Tidier
#> (@v1.10) pkg> status DataFrames, Tidier
#> Status `~/.julia/environments/v1.10/Project.toml`
#> [a93c6f00] DataFrames v1.6.1
#> [f0413319] Tidier v1.4.0
df = DataFrame(
:city => [
"New York city, New York",
"Chicago city, Illinois",
"Portland city, Oregon",
"Los Angeles city, California",
"San Francisco city, California",
"Washington city, District of Columbia",
"Seattle city, Washington",
"Philadelphia city, Pennsylvania",
"Boston city, Massachusetts",
"Minneapolis city, Minnesota",
"Austin city, Texas",
"Denver city, Colorado",
"San Diego city, California",
"Madison city, Wisconsin",
"Oakland city, California"
],
:total_commuters => [
4007171,
1318960,
347260,
1938625,
500469,
362204,
419233,
662522,
362198,
231240,
540911,
377214,
715730,
146715,
215534,
],
:bike_commuters => [
48601,
22449,
21982,
20495,
19429,
16647,
14801,
14397,
8873,
8465,
8266,
8181,
7188,
7186,
6540,
]
)
Let's say that I want to pull the top 10 cities with the most bike commuters. I'd expect that I could use the @slice_max
macro and do something simple like this:
@slice_max(df, bike_commuters, n = 10)
However, this returns:
1ร3 DataFrame
Row โ city total_commuters bike_commuters
โ String Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ New York city, New York 4007171 48601
Returning a single row in spite of the n
value is expected behavior if with_ties = false
according to the docs, but the default should be with_ties = true
and there are no obvious ties in this data. So I would actually expect the result to be this:
10ร3 DataFrame
Row โ city total_commuters bike_commuters
โ String Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ New York city, New York 4007171 48601
2 โ Chicago city, Illinois 1318960 22449
3 โ Portland city, Oregon 347260 21982
4 โ Los Angeles city, California 1938625 20495
5 โ San Francisco city, California 500469 19429
6 โ Washington city, District of Colโฆ 362204 16647
7 โ Seattle city, Washington 419233 14801
8 โ Philadelphia city, Pennsylvania 662522 14397
9 โ Boston city, Massachusetts 362198 8873
10 โ Minneapolis city, Minnesota 231240 8465
If I pop over to R with the same dataset, this is the result I get from df |> slice_max(bike_commuters, n = 10)
, even if I shuffle the dataset around such that the bike_commuters
column is not already ordered descending. I would expect @slice_max
in Julia to behave the same way in this case.
If I try other values of n
with the @slice_max
macro, I still get back the 1x3 DataFrame.
Thanks for catching this. I haven't looked at the slices in a while.
I think the issue might be due to this block of code, where n is not being captured. Although I'm surprised because the doc strings work.
for expr in exprs
if @capture(expr, lhs_ = rhs_)
expr_dict[lhs] = rhs
if lhs == :missing_rm
missing_rm = rhs
elseif lhs == :prop
arranged = true
end
else
column = expr
end
end
if haskey(expr_dict, :with_ties)
with_ties = expr_dict[:with_ties]
end
I'll play around with it and get back to you shortly
Curious, are the doc strings successfully building the only validation for the output of most macros?
I'm wondering if it's worth adding some additional tests beyond just the pivoting functions, potentially even just replicating some from the dplyr source code to catch some of the less straightforward edgecases/make sure the code is robust.
I'd be happy to help with that if that's something that'd be beneficial!
I will defer to @kdpsingh on the additional test as he's the Tidier Maven.
as far as the slice_max
bug, I have fixed it. I just need to make the corresponding changes for slice_min
julia> @slice_max(df, bike_commuters, n = 7)
7ร3 DataFrame
Row โ city total_commuters bike_commuters
โ String Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ New York city, New York 4007171 48601
2 โ Chicago city, Illinois 1318960 48601
3 โ Portland city, Oregon 347260 21982
4 โ Los Angeles city, California 1938625 20495
5 โ San Francisco city, California 500469 19429
6 โ Washington city, District of Colโฆ 362204 16647
7 โ Seattle city, Washington 419233 14801
julia> df = DataFrame(
a = ["a", "b","a", "b", "a","b" ,"a", "b"],
b = [0.3, 2, missing, 3, 6, 5, 7, 7],
c = [0.2, 0.2, 0.2, missing, 1, missing, 5, 6]);
julia> @chain df begin
@group_by a
@slice_max b n = 3
end
GroupedDataFrame with 2 groups based on key: a
First Group (3 rows): a = "a"
Row โ a b c
โ String Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ a 7.0 5.0
2 โ a 6.0 1.0
3 โ a 0.3 0.2
โฎ
Last Group (3 rows): a = "b"
Row โ a b c
โ String Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ b 7.0 6.0
2 โ b 5.0 missing
3 โ b 3.0 missing
julia> @chain df begin
@group_by a
@slice_max b
end
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = "a"
Row โ a b c
โ String Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ a 7.0 5.0
โฎ
Last Group (1 row): a = "b"
Row โ a b c
โ String Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ b 7.0 6.0
Thanks @mistermichaelll for reporting and @drizk1 for the fix! I will review and merge.
No problem @kdpsingh !
I was wondering if you saw my suggestion about adding additional unit tests that reflect the ones in the original dplyr
package? I think that making the test suite more robust could go a long way towards building trust in the Tidier
ecosystem as it starts seeing more adoption.
Like I said, I'm happy to take that on but just wanted to see if that's something you think would be beneficial!
We would love to have additional tests and would really appreciate your help with this.
In general, my preference has been to build all/most tests as doctests (in the docstrings) so that users can also get a clear view into should and shouldn't work.
But I'm open to additional tests in the tests/
folder depending on what you have in mind.
Feel free to open an issue and/or a PR!