traitecoevo/austraits

`pivot_wider` issues with methods table for Crous_2013

Closed this issue · 2 comments

Currently trying to work on trait_pivot_wider for the new version of AusTraits.

Following @ehwenk advice to group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) before pivot_wider(), I was able to successfully pivot wide the traits table for Crous_2013

Crous_2013$traits %>%
  group_by(dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
  pivot_wider(names_from = trait_name,
              values_from = value) 

After joining locations and contexts

Crous_2013 %>% 
  join_locations() %>% 
  purrr::pluck("traits") %>%  
  group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
  pivot_wider(names_from = trait_name,
              values_from = value)

Crous_2013 %>% 
  join_contexts() %>% 
  purrr::pluck("traits") %>%  
  group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
  pivot_wider(names_from = trait_name,
              values_from = value)

But failing to pivot after joining in methods

Crous_2013 %>% 
 join_methods() %>% 
 purrr::pluck("traits") %>%  
 group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
 pivot_wider(names_from = trait_name,
             values_from = value) 

I noticed there are duplicated rows in Crous_2013$methods

  join_methods() %>% 
  purrr::pluck("methods") %>%  
  duplicated()

 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Repeating the above steps for another study seems to be okay. e.g. austraits %>% extract_dataset("Crous_2019")
Is this unique duplication issue of Crous_2013?

NB: Need to use join_ on develop branch!

I considered adding distinct() to join_methods function e.g.

join_methods2 <- function(austraits, vars =  c("methods", "year_collected_start", "year_collected_end", "collection_type")) {
  austraits$methods %>% 
    dplyr::select(c("dataset_id", "trait_name"), tidyselect::any_of(vars))
    distinct() -> methods
  
  austraits$traits <- austraits$traits %>%
      dplyr::left_join(by=c("dataset_id", "trait_name"),
                       methods)
    
    austraits
  }

While this solution allows me to pivot_wider() successfully

Crous_2013 %>% 
join_methods2() %>% 
purrr::pluck("traits") %>%  
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
pivot_wider(names_from = trait_name,
            values_from = value) 

The pivot_wide is adding on more rows that the other joins above

Crous_2013 %>% 
  join_methods2() %>% 
  purrr::pluck("traits") %>%  
  group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
  pivot_wider(names_from = trait_name,
              values_from = value) %>% 
  nrow() # 325 This is the same as using join_methods() above but doesn't create list cols in the traits table

  
Crous_2013 %>% 
  join_contexts() %>% 
  purrr::pluck("traits") %>%  
  group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
  pivot_wider(names_from = trait_name,
              values_from = value) %>% 
  nrow() # 294

Not sure where to best begin debugging this? Suggestions would be great! @dfalster @ehwenk

@ehwenk and I have solved this.
Firstly, distinct in join_methods2 is a good idea otherwise left_join will add more rows for every dataset_id and trait_name combo. This has been implemented

Secondly,while this may pivot wider successfully, it may not be particularly useful for the user

Crous_2013 %>% 
  join_methods2() %>% 
  purrr::pluck("traits") %>%  
  group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>% 
  pivot_wider(names_from = trait_name,
              values_from = value)

It populates the dataframe with lots of NAs because methods is unique for every unique combo of trait_name, dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type. It essentially produces wider table that is very long and very wide. It also doesn't make logistical sense. So current recommendation to users to not join_methods if you want to pivot wider. this is a caveat of relational databases.