DiskFrame/disk.frame

Different group_by result after doing joins with `merge_by_chunk_id = TRUE`

SMousavi90 opened this issue · 3 comments

I have these lines of code, which produce different results with and without disk.frame.

a.df -> the disk.frame, with 2,735,110 rows

The group_by code:

result <- a.df %>%
    group_by(col1, col2, col3, col4) %>%
    summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
    chunk_ungroup()

After execution, result has 2,735,110 rows, the same as the input.

But the same code on a plain data.frame (i.e. on collect(a.df)) returns a different number of rows: 273,511.

result <- collect(a.df) %>%
    group_by(col1, col2, col3, col4) %>%
    summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
    ungroup()

I cannot (and should not) collect a.df here, because it will grow too big in the future.
Any suggestions or advice?

Thanks in advance

That's not good. I will see if I can reproduce it. Can you tell me the types of col1, col2, etc.? How many unique values are in each?
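
In the meantime, one plausible explanation (an assumption until I can reproduce it): the chunk_* verbs work one chunk at a time, so a group whose rows are spread across several chunks produces one output row per chunk instead of one row overall. The usual workaround is the two-stage pattern, sketched below with your column names; only the per-chunk partial sums are collected, so it stays small even when a.df does not fit in memory:

library(disk.frame)
library(dplyr)

# stage 1: partial sums within each chunk (runs out-of-core)
# stage 2: collect only the small partial results and combine them
result <- a.df %>%
    chunk_group_by(col1, col2, col3, col4) %>%
    chunk_summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
    collect() %>%
    group_by(col1, col2, col3, col4) %>%
    summarize(tot4 = sum(tot4), tot5 = sum(tot5)) %>%
    ungroup()

Summing the per-chunk sums is exact for sum(), but the trick does not carry over to every aggregate (medians, for example), so treat it as a sketch for this particular query.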

The problem was that in previous steps I had done some joins with the option merge_by_chunk_id = TRUE, and the results of those joins were different from the ones I get with data.frames. I still cannot distinguish between these two ways of joining the data: why would we set merge_by_chunk_id to TRUE when the generated data isn't what we expect? Anyway, I set merge_by_chunk_id to FALSE and the issue seemed to be resolved, but at the end I got an error about the stack size limit!

I see. A proper article on how joins work is long overdue.
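
Until it exists, here is my rough mental model plus a sketch. Assumption: merge_by_chunk_id = TRUE joins chunk i of the left disk.frame only to chunk i of the right one, which is fast but only matches the data.frame result when both sides are sharded identically on the join keys; merge_by_chunk_id = FALSE makes disk.frame rearrange the chunks itself first, which is safer but heavier. In the sketch, b.df and the column "key" are hypothetical placeholders, and I'm assuming shard() can redistribute a disk.frame by key:

library(disk.frame)
library(dplyr)

# put rows with equal join keys into chunks with equal ids on both sides
a.sharded <- shard(a.df, shardby = "key", nchunks = 16)
b.sharded <- shard(b.df, shardby = "key", nchunks = 16)

# now a chunk-by-chunk join is safe and should match the data.frame result
res <- inner_join(a.sharded, b.sharded, by = "key",
    merge_by_chunk_id = TRUE)

If the two sides are not sharded the same way, chunk i of one table can hold keys that live in chunk j of the other, and those rows never meet, which would produce exactly the kind of wrong join results described above.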