DirtyHarry - a blunt instrument for data mapping (v 0.2 Beta)

I know what you’re thinking: 'Did he get all of the coding right?' Well, to tell you the truth, in all this excitement, I’ve lost track myself... you’ve got to ask yourself one question: 'Do I feel lucky?' Well, do you, punk?

Rationale

For the benefit of mankind we want data to be standardized to facilitate sharing; for our own benefit we want them to be documented and structured so that they can be reused. For biodiversity data, Darwin Core (DwC) and the DwC-Archive are commonly used standards, e.g. by GBIF. We also use these internally for projects aggregating large amounts of data, such as the INVAFISH project. However, data are rarely collected in this form, for many reasons. Some of them are actually quite good - such as the fact that quickly noting down results in a messy field/lab situation calls for codes and abbreviations.

What is this?

A quick and dirty tool to map tabular data (think spreadsheets) with arbitrary column names into a Darwin Core "event-core" structure. It takes as input a spreadsheet workbook (.xlsx) containing the original data in two sheets whose names must start with "event" and "occurrence", a sheet called "mapping_table" describing the translation using a foul, home-brewed syntax, and a sheet called "translation_table" containing translation values (this sheet may be empty if the mapping_table does not call for any translations). More explanation of the data input formats, as well as tech stuff, is given below or at DirtyHarry's GitHub site.
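As a minimal illustration (not part of the tool itself), the workbook can be inspected with the readxl package to verify that the expected sheets are present before running the mapping. The file name "raw_data.xlsx" below is an assumption; only the sheet-naming rules come from the description above.

```r
# Minimal sketch: check that the workbook contains the sheets DirtyHarry expects.
# "raw_data.xlsx" is a hypothetical input file name.
library(readxl)

input_file <- "raw_data.xlsx"
sheets <- excel_sheets(input_file)

has_event       <- any(grepl("^event",      sheets, ignore.case = TRUE))
has_occurrence  <- any(grepl("^occurrence", sheets, ignore.case = TRUE))
has_mapping     <- "mapping_table"    %in% sheets
has_translation <- "translation_table" %in% sheets

if (!(has_event && has_occurrence && has_mapping && has_translation)) {
  stop("Workbook is missing one or more of the required sheets")
}

# Read the mapping table (the raw data sheets are read the same way)
mapping_tab <- read_excel(input_file, sheet = "mapping_table")
```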

What it is not!

This is not a data cleaning tool. Field values are passed on raw or translated, but without any sort of quality check. The input data must be clean: line one must be the header line, fields that are supposed to be numeric or integer must contain numeric or integer values, and so on. If you are lucky, outright errors in the input data will throw back errors and warning messages, or will cause the tool to happily crash. If you're not, errors are passed on to the output data and will haunt mankind for eternity. Data quality check functionality may be included later if we decide that this tool remotely resembles something useful.
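If you want to test your luck before running the tool, a simple sanity check along these lines can catch the most obvious problems. This is not part of DirtyHarry; the sheet name and the column "sampleSizeValue" are made-up examples of a raw sheet and a column that is supposed to be numeric.

```r
# Pre-check sketch (not part of DirtyHarry): flag values that cannot be
# interpreted as numbers in a column that should be numeric.
library(readxl)

event_raw <- read_excel("raw_data.xlsx", sheet = "event_fish")  # hypothetical sheet name

bad_rows <- which(is.na(suppressWarnings(as.numeric(event_raw$sampleSizeValue))) &
                  !is.na(event_raw$sampleSizeValue))
if (length(bad_rows) > 0) {
  warning("Non-numeric values in 'sampleSizeValue' at rows: ",
          paste(bad_rows, collapse = ", "))
}
```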

Technical summary

The core is an R function defined in "function_DirtyHarry.R" that takes as input a user-defined .xlsx workbook with one or several sheets containing the raw unmapped data, a mapping table, and an optional translation table. See "text_shiny_mainpage_vignet.md" for a description and the requirements. In addition it requires an input table called "terms_and_tables", with the columns "term" and "table", describing which terms should end up in which output tables. At the moment this is read from the NOFA_db using the script "dataIO_nofa_db.R", which also returns the table "NOFA_controlled_vocabulary", a list of accepted values for each term. The controlled vocabulary is not used in the current version, but could become part of a primary quality check to ensure that all values are actually valid. The function_DirtyHarry.R script first flattens out the data, then maps and translates, gives some throwback from a rude quality check (e.g. whether the destination_term is part of the standard given in terms_and_tables), and returns an .xlsx file with the mapped values.
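A usage sketch of the workflow described above. The function name and all argument names here are assumptions and may differ from the actual signature in "function_DirtyHarry.R"; the illustrative rows of terms_and_tables are likewise made up (using real DwC terms).

```r
# Usage sketch. DirtyHarry() and its arguments are guesses at the interface
# defined in "function_DirtyHarry.R"; check the script for the real signature.
source("function_DirtyHarry.R")
source("dataIO_nofa_db.R")   # assumed to provide terms_and_tables (and NOFA_controlled_vocabulary)

# terms_and_tables is expected to have the columns "term" and "table",
# e.g. (illustrative rows only):
#   term            table
#   eventID         event
#   eventDate       event
#   scientificName  occurrence

mapped <- DirtyHarry(
  input_file       = "raw_data.xlsx",     # workbook with event*/occurrence* sheets,
                                           # mapping_table and translation_table
  terms_and_tables = terms_and_tables,     # term -> output table mapping from NOFA_db
  output_file      = "mapped_data.xlsx"    # destination for the mapped DwC data
)
```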