forestgeo/fgeo.misc

Using the `root_columns` results in duplicated data for each quadrat

Opened this issue · 2 comments

When running the code without the root_columns command to add “team” the data consolidates properly. However, when I try to add the team column the data is repeated for each quadrat – one with the team included and one has NA’s for team.
-- @jess-shue

@jess-shue,

Can you add as much information as you can to make this issue stand alone?

A reproducible example would be best. Or whatever you can share to show the problem. What are the dependencies, the functions, inputs and outputs you use?

I suspect the problem is in the internal method used to join tables. What you want seems to be the results of something like left_join() and what you get seems to be the result of something like full_join().

From https://dplyr.tidyverse.org/reference/join.html

Mutating joins combine variables from the two data.frames:

inner_join()
return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

left_join()
return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join()
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

full_join()
return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

Filtering joins keep cases from the left-hand data.frame:

semi_join()
return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

anti_join()
return all rows from x where there are not matching values in y, keeping just columns from x.

Nesting joins create a list column of data.frames:

nest_join()
return all rows and all columns from x. Adds a list column of tibbles. Each tibble contains all the rows from y that match that row of x. When there is no match, the list column is a 0-row tibble with the same column names and types as y. nest_join() is the most fundamental join since you can recreate the other joins from it. An inner_join() is a nest_join() plus an tidyr::unnest(), and left_join() is a nest_join() plus an unnest(.drop = FALSE). A semi_join() is a nest_join() plus a filter() where you check that every element of data has at least one row, and an anti_join() is a nest_join() plus a filter() where you check every element has zero rows.

4112019-07-30EK.xlsx
4122019-07-30EK.xlsx
Hi Mauro,
Attached are two files from our most recent hectare. When running them through the following code two copies of each are made as stated: one with a date, but no team, and one with a team but no date. I believe we would need to dive into the backside of the function 'xlff_to_list':

Run the devtools to connect to fgeo.misc from GitHub - install any packages needed if "there is no package called xxx" appears in the console

devtools::install_github("forestgeo/fgeo.misc")

install.packages("fgeo.tool")
install.packages("tidyverse")

call in necessary packages (if not installed, use install.packages then library for each package)

library(tidyverse)
library(purrr)
library(dplyr)
library(fgeo.tool)
library(fgeo.misc)
library(writexl)
library(here)

Connect to the file location in Dropbox. Change based on Hectare folders

files.directory <- here("raw/H4")

Connect to the consolidated files folder for saving the single sheet workbooks once running the code. Change for Hectare folders

files.export <- here("consolidated/H4")

Create a list of the files without writing them to a location

list_of_dataframes <- xlff_to_list(files.directory, first_census = FALSE, root_columns = "team")

Consolidate to 1 file

producing duplicate rows using the root_columns to add 'team' to the output - one set of quadrat data has the date, and the other copy of the quadrat has the team

single_dataframe <- reduce(list_of_dataframes, bind_rows)