DiskFrame/disk.frame

Different group_by result after doing joins with `merge_by_chunk_id = TRUE`

SMousavi90 opened this issue · 3 comments

I have these lines of code, which produce different results with and without disk.frame.

a.df -> the disk.frame, with 2,735,110 rows

The group_by code:

result <- a.df %>%
    group_by(col1, col2, col3, col4) %>%
    summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
    chunk_ungroup()

After execution, result has 2,735,110 rows, the same as the input.

But the same code on a plain data.frame (i.e. on collect(a.df)) returns a different number of rows: 273,511.

result <- collect(a.df) %>%
    group_by(col1, col2, col3, col4) %>%
    summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
    ungroup()

I cannot (and should not) collect a.df here, because it will grow too big in the future.
Any suggestions or advice?

Thanks in advance

That's not good. I will see if I can reproduce it. Can you tell me the types of col1, col2, etc.? How many unique values are in each?
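
In the meantime, one plausible explanation (an assumption until I can reproduce it): the chunk_* verbs work one chunk at a time, so a group whose rows are spread across several chunks produces one output row per chunk instead of one row overall. The usual workaround is the two-stage pattern, sketched below with your column names; only the per-chunk partial sums are collected, so it stays small even when a.df does not fit in memory:

library(disk.frame)
library(dplyr)

# stage 1: partial sums within each chunk (runs out-of-core)
# stage 2: collect only the small partial results and combine them
result <- a.df %>%
    chunk_group_by(col1, col2, col3, col4) %>%
    chunk_summarize(tot4 = sum(col4), tot5 = sum(col5)) %>%
    collect() %>%
    group_by(col1, col2, col3, col4) %>%
    summarize(tot4 = sum(tot4), tot5 = sum(tot5)) %>%
    ungroup()

Summing the per-chunk sums is exact for sum(), but the trick does not carry over to every aggregate (medians, for example), so treat it as a sketch for this particular query.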

The problem was that in previous steps I had done some joins with the option merge_by_chunk_id = TRUE, and the results of those joins were different from the ones I get with data.frames. I still cannot distinguish between these two ways of joining the data: why would we set merge_by_chunk_id to TRUE when the generated data isn't what we expect? Anyway, I set merge_by_chunk_id to FALSE and the issue seemed to be resolved, but at the end I got an error about the stack size limit!

I see. A proper article on how joins work is long overdue.
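
Until it exists, here is my rough mental model plus a sketch. Assumption: merge_by_chunk_id = TRUE joins chunk i of the left disk.frame only to chunk i of the right one, which is fast but only matches the data.frame result when both sides are sharded identically on the join keys; merge_by_chunk_id = FALSE makes disk.frame rearrange the chunks itself first, which is safer but heavier. In the sketch, b.df and the column "key" are hypothetical placeholders, and I'm assuming shard() can redistribute a disk.frame by key:

library(disk.frame)
library(dplyr)

# put rows with equal join keys into chunks with equal ids on both sides
a.sharded <- shard(a.df, shardby = "key", nchunks = 16)
b.sharded <- shard(b.df, shardby = "key", nchunks = 16)

# now a chunk-by-chunk join is safe and should match the data.frame result
res <- inner_join(a.sharded, b.sharded, by = "key",
    merge_by_chunk_id = TRUE)

If the two sides are not sharded the same way, chunk i of one table can hold keys that live in chunk j of the other, and those rows never meet, which would produce exactly the kind of wrong join results described above.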