bioXgeo/neotropical_plants

trying to update occurrence data with cleaned species names

hazeljanderson opened this issue · 2 comments

GBIF_occ_subset_harmonized <- GBIF_occ_subset %>% mutate(species = replace(species, species %in% lookup_table$Name_submitted, lookup_table$Name_matched))

I created a lookup table with a Name_submitted column and a Name_matched column. I want to update the species names in the GBIF data by matching each species name against Name_submitted and replacing it with the corresponding Name_matched.

I tried the code above, but I'm not confident it worked. It also gave me this warning:

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `species = replace(species, species %in% lookup_table$Name_submitted,
  lookup_table$Name_matched)`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
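
For context on the warning: `replace(x, list, values)` does `x[list] <- values`, so the values are assigned by position to the selected rows, not looked up by name. Here `values` has one entry per lookup row, while the number of selected rows is the number of matching occurrences, so the replacement vector is recycled in lookup-table order and the names can end up on the wrong rows. A minimal sketch of a name-based lookup with base R `match()` follows. The toy tables and species names are made up for illustration and are not the real data.

```r
# Toy stand-ins for the real tables; the species names are made up.
lookup_table <- data.frame(
  Name_submitted = c("Genus speciesa", "Genus speciesc"),
  Name_matched   = c("Genus speciesb", "Genus speciesd"),
  stringsAsFactors = FALSE
)
GBIF_occ_subset <- data.frame(
  species = c("Genus speciesc", "Genus speciesa", "Genus speciesx", "Genus speciesc"),
  record  = 1:4,
  stringsAsFactors = FALSE
)

# For each occurrence row, find the position of its species in Name_submitted
# (NA when the species does not appear in the lookup table).
idx <- match(GBIF_occ_subset$species, lookup_table$Name_submitted)

GBIF_occ_subset_harmonized <- GBIF_occ_subset
# Use the matched name where one exists; otherwise keep the original name.
GBIF_occ_subset_harmonized$species <- ifelse(
  is.na(idx),
  GBIF_occ_subset$species,
  lookup_table$Name_matched[idx]
)
```

Note that `match()` only uses the first occurrence of each submitted name, so if Name_submitted contains duplicates with different matched names, the later ones are silently ignored.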

I've also tried a merge:
GBIF_occ_subset_harmonized <- merge(GBIF_occ_subset, lookup_table, by.x = "species", by.y = "Name_submitted")

GBIF_occ_subset has 1,975,499 rows, but GBIF_occ_subset_harmonized ends up with 4,403,679 rows, which is not right. The number of rows should stay the same, with just one column added.

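the extra rows are there because the join replicates each occurrence row once for every lookup row it matches, which points to repeated Name_submitted values in lookup_table. I need to see the data to find where things are duplicated.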
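
One way to check for that duplication, sketched with dplyr on a made-up lookup table (the example rows and the helper names `duplicated_keys` and `ambiguous_keys` are illustrative, not from the real data):

```r
library(dplyr)

# Toy lookup table with one repeated submitted name; the names are made up.
lookup_table <- data.frame(
  Name_submitted = c("Genus speciesa", "Genus speciesa", "Genus speciesc"),
  Name_matched   = c("Genus speciesb", "Genus speciesb", "Genus speciesd"),
  stringsAsFactors = FALSE
)

# Submitted names that appear on more than one row of the lookup table.
duplicated_keys <- lookup_table %>%
  count(Name_submitted, name = "n_rows") %>%
  filter(n_rows > 1)

# Submitted names that map to more than one distinct matched name.
# These would need a manual decision before any join.
ambiguous_keys <- lookup_table %>%
  distinct(Name_submitted, Name_matched) %>%
  count(Name_submitted, name = "n_matches") %>%
  filter(n_matches > 1)
```

If `duplicated_keys` has rows but `ambiguous_keys` is empty, the repeats are exact duplicates and can probably be dropped safely. If `ambiguous_keys` has rows, the same submitted name resolves to different matched names and needs a closer look.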

would you post the column names and about 5 rows of each table here, to make it really clear what you are starting with and what you are hoping to create? it doesn't have to be pretty. the other option is to put the CSVs for lookup_table and GBIF_occ_subset on Google Drive or something similar.

here is my example for what I think you are describing:

lookup_table:

Name_submitted, Name_matched
"Genus speciesa", "Genus speciesb"
etc

GBIF_occ_subset:

species, other columns, etc
"Genus speciesa", "other data", etc

Outcome: GBIF_occ_subset_harmonized:

species, ...
"Genus speciesb"

If so, I think you may want to look into mutating joins: https://dplyr.tidyverse.org/reference/mutate-joins.html. I think the x in this documentation should be the lookup_table, as that has the 'key' you want to keep, and y should be the occurrences.
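
A minimal sketch of a mutating join for this case, on made-up toy data. It assumes every occurrence row should be kept, so here the occurrences are the left-hand table of `left_join()`, and it assumes it is acceptable to keep only the first lookup row per submitted name. The `lookup_unique` and `species_harmonized` names are hypothetical.

```r
library(dplyr)

# Toy stand-ins for the real tables; the species names are made up.
lookup_table <- data.frame(
  Name_submitted = c("Genus speciesa", "Genus speciesa", "Genus speciesc"),
  Name_matched   = c("Genus speciesb", "Genus speciesb", "Genus speciesd"),
  stringsAsFactors = FALSE
)
GBIF_occ_subset <- data.frame(
  species = c("Genus speciesc", "Genus speciesa", "Genus speciesx", "Genus speciesc"),
  record  = 1:4,
  stringsAsFactors = FALSE
)

# Keep one row per submitted name so the join cannot multiply occurrence rows.
lookup_unique <- lookup_table %>%
  distinct(Name_submitted, .keep_all = TRUE)

GBIF_occ_subset_harmonized <- GBIF_occ_subset %>%
  # Adds Name_matched to each occurrence; unmatched species get NA.
  left_join(lookup_unique, by = c("species" = "Name_submitted")) %>%
  # Fall back to the original name when the lookup has no match.
  mutate(species_harmonized = coalesce(Name_matched, species))

# The row count should be unchanged by the join.
stopifnot(nrow(GBIF_occ_subset_harmonized) == nrow(GBIF_occ_subset))
```

Keeping the original `species` column next to `Name_matched` and `species_harmonized` makes it easy to spot-check which names changed before overwriting anything.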

Instead of replacing the contents of GBIF_occ_subset$species in your output, could you add a new column from the lookup table into your output, something like GBIF_occ_subset_harmonized$Name_submitted?