Different group_by result due to doing joins with `merge_by_chunk_id`
SMousavi90 opened this issue · 3 comments
I have these lines of code, which produce different results with and without disk.frame.
a.df is a disk.frame with 2735110 rows.
The group_by code:
```r
result <- a.df %>%
  group_by(col1, col2, col3, col4) %>%
  summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
  chunk_ungroup()
```
After the execution the result has 2735110 rows, but the same code run on a data frame (i.e. when I collect(a.df) first) returns a different number of rows: 273511.
```r
result <- collect(a.df) %>%
  group_by(col1, col2, col3, col4) %>%
  summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
  ungroup()
```
I cannot (and should not) collect a.df here, because it will be too big in the future.
Any suggestions or advice on this?
Thanks in advance.
That's not good. I will see if I can reproduce it. Can you tell me the types of col1, col2, etc.? How many unique values are in each?
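In the meantime, the usual two-stage pattern may give consistent results without collecting the raw data: summarize within each chunk, then combine the partial results after collecting only the (much smaller) intermediate table. A sketch, assuming plain sums, which can safely be re-aggregated:

```r
library(disk.frame)
library(dplyr)

result <- a.df %>%
  # stage 1: per-chunk partial sums (at most one row per group per chunk)
  chunk_group_by(col1, col2, col3, col4) %>%
  chunk_summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
  # collect only the partial results, not the raw 2.7m rows
  collect() %>%
  # stage 2: combine partial sums for groups that span multiple chunks
  group_by(col1, col2, col3, col4) %>%
  summarize(tot4 = sum(tot4), tot5 = sum(tot5)) %>%
  ungroup()
```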
The problem was that in previous steps I had done some joins with the option merge_by_chunk_id = TRUE, and the results of those joins were different from the ones I get when using data frames. I still cannot distinguish between these two ways of joining the data: why would we set merge_by_chunk_id to TRUE when the generated data isn't what we expect? Anyway, I set merge_by_chunk_id to FALSE and the issue seemed to be resolved, but in the end I got an error about the stack size limit!
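For reference, a toy case along these lines seems to reproduce that kind of mismatch (hypothetical data; this assumes as.disk.frame fills chunks with consecutive rows):

```r
library(disk.frame)
library(dplyr)

# ids 1:4 split into chunks {1,2} and {3,4} on one side,
# and {4,3} and {2,1} on the other
x <- as.disk.frame(data.frame(id = 1:4, a = 1:4), nchunks = 2, overwrite = TRUE)
y <- as.disk.frame(data.frame(id = 4:1, b = 4:1), nchunks = 2, overwrite = TRUE)

# chunk-aligned join: chunk i of x meets only chunk i of y, so ids that
# land in different chunks on the two sides never match
collect(inner_join(x, y, by = "id", merge_by_chunk_id = TRUE))

# the in-memory join, by contrast, matches all four ids
inner_join(collect(x), collect(y), by = "id")
```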
I see. A proper article on how joins work is long overdue.
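Until then, a rough sketch of the distinction (hypothetical x.df and y.df joined on "id"; rechunk usage as I read the current docs):

```r
library(disk.frame)
library(dplyr)

# merge_by_chunk_id = TRUE joins chunk i of the left disk.frame to
# chunk i of the right one only, so it is only safe when both sides
# are sharded identically on the join keys, e.g. via rechunk():
x.df <- rechunk(x.df, nchunks = 16, shardby = "id")
y.df <- rechunk(y.df, nchunks = 16, shardby = "id")
fast <- left_join(x.df, y.df, by = "id", merge_by_chunk_id = TRUE)

# merge_by_chunk_id = FALSE makes disk.frame reshard by the join keys
# itself; correct for arbitrary chunk layouts, but slower and heavier
safe <- left_join(x.df, y.df, by = "id", merge_by_chunk_id = FALSE)
```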