`pivot_wider` issues with methods table for Crous_2013
Closed this issue · 2 comments
Currently trying to work on trait_pivot_wider
for the new version of AusTraits.
Following @ehwenk advice to group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type)
before pivot_wider()
, I was able to successfully pivot wide the traits table for Crous_2013
Crous_2013$traits %>%
group_by(dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value)
After joining locations and contexts
Crous_2013 %>%
join_locations() %>%
purrr::pluck("traits") %>%
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value)
Crous_2013 %>%
join_contexts() %>%
purrr::pluck("traits") %>%
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value)
But failing to pivot after joining in methods
Crous_2013 %>%
join_methods() %>%
purrr::pluck("traits") %>%
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value)
I noticed there are duplicated rows in Crous_2013$methods
join_methods() %>%
purrr::pluck("methods") %>%
duplicated()
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Repeating the above steps for another study seems to be okay. e.g. austraits %>% extract_dataset("Crous_2019")
Is this unique duplication issue of Crous_2013?
NB: Need to use join_
on develop branch!
I considered adding distinct()
to join_methods
function e.g.
join_methods2 <- function(austraits, vars = c("methods", "year_collected_start", "year_collected_end", "collection_type")) {
austraits$methods %>%
dplyr::select(c("dataset_id", "trait_name"), tidyselect::any_of(vars))
distinct() -> methods
austraits$traits <- austraits$traits %>%
dplyr::left_join(by=c("dataset_id", "trait_name"),
methods)
austraits
}
While this solution allows me to pivot_wider()
successfully
Crous_2013 %>%
join_methods2() %>%
purrr::pluck("traits") %>%
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value)
The pivot_wide is adding on more rows that the other joins above
Crous_2013 %>%
join_methods2() %>%
purrr::pluck("traits") %>%
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value) %>%
nrow() # 325 This is the same as using join_methods() above but doesn't create list cols in the traits table
Crous_2013 %>%
join_contexts() %>%
purrr::pluck("traits") %>%
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value) %>%
nrow() # 294
Not sure where to best begin debugging this? Suggestions would be great! @dfalster @ehwenk
@ehwenk and I have solved this.
Firstly, distinct
in join_methods2
is a good idea otherwise left_join will add more rows for every dataset_id and trait_name combo. This has been implemented
Secondly,while this may pivot wider successfully, it may not be particularly useful for the user
Crous_2013 %>%
join_methods2() %>%
purrr::pluck("traits") %>%
group_by(., dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type) %>%
pivot_wider(names_from = trait_name,
values_from = value)
It populates the dataframe with lots of NAs because methods is unique for every unique combo of trait_name, dataset_id, taxon_name, observation_id, original_name, source_id, method_id, value_type. It essentially produces wider table that is very long and very wide. It also doesn't make logistical sense. So current recommendation to users to not join_methods if you want to pivot wider. this is a caveat of relational databases.