/tnrs_db

Scripts that build the TNRS database

Primary LanguagePHP

TNRS DATABASE LOADING SCRIPTS

Purpose: Creates, populates and indexes the TNRS database
Author: Brad Boyle (ojalaquellueva@gmail.com)
Dependent applications:

  • TNRS web application: http://tnrs.iplantcollaborative.org/
  • TNRS batch application: https://github.com/ojalaquellueva/TNRSbatch
  • TNRS api: https://github.com/ojalaquellueva/TNRSapi
  • TNRS R package: https://github.com/EnquistLab/RTNRS

IMPORTANT!

  • This repository is a fork of the now-deprecated sub-repository https://github.com/iPlantCollaborativeOpenSource/TNRS/tree/master/tnrs3_db_scripts, which it replaces

CONTENTS

INTRODUCTION

Files in this repository contains all scripts needed to build the current (v4.0) reference database for the Taxonomic Name Resolution Service or TNRS (http://tnrs.iplantcollaborative.org). You can also use these scripts to build a custom TNRS database using your own taxonomic sources, and to add new data sources to an existing TNRS database.

VERSION

Code version (this application): 4.1
Code version note: Based on v4.0, but PHP code updated to compatibility with PHP 7.2
Version date: 27 March 2019
Most recent TNRS database compatible with this version: TNRS DB 4.0
Most recent TNRS database public release date: 28 August 2015
Most recent TNRS application code compatible with this DB: TNRS 4.0

REQUIREMENTS

SOFTWARE

  • *nix OS
  • PHP 7.0+
  • MySQL 5.7+ (may work vith versions as early as 5.3)
  • Perl 5.22.0+ (for dbfdump and Tropicos api download)
  • Perl utility DBD-XBase-1.05 (for command dbfdump)

TNRS COMPATIBILITY

  • Compatible with TNRS version 4.0

SOURCE DATA

Extracts from the following data sources must be present in the data directories within their respective import directores (e.g., import_tropicos/data/). These data sources are not imported automatically; they must be obtained separately in advance of running this application. For details on requesting, extracting and preparing source data, see the individual readme in the respective import directories.

OVERVIEW OF TNRS DATABASE CONSTRUCTION

This section provides an over of the major steps performed by the TNRS DB loading files. Files performing a set of related operations are group into a single directory, where they are called by a single master scripts bearing the same base name as the containing directory. The master script (extension .php) calls all others (extensions .inc).

The main steps are as follows:

  1. create_tnrs_core.php (in create_tnrs_core/)
    • Creates empty tnrs database
  2. import.php (in import_[sourceName]/)
    • Imports raw data for an individual taxonomic source into MySQL and performs initial loading to the staging table (nameStaging)
    • Steps specific to an individual source are in this directory
  3. prepare_staging.php (in prepare_staging/)
    • Finishes structuring and populating staging table (nameStaging)
    • These operations are universal, not source-specific
  4. load_core_db.php (in load_core_db/)
    • Normalizes the contents of the staging table to the core database
  5. make_genus_family_lookups.php (in genus_family_lookups/)
    • Builds lookup tables of current and historic genus-in-family classifications, based on GRIN taxonomy website
  6. taxamatch_tables.php (in taxamatch_tables/)
    • Denormalizes names in core database into lookup tables used by TaxaMatch fuzzy matching application
  7. build_classifications.php (in build_classifications/)
    • Builds table 'higherClassification', which classifies all names from all sources according to any source for which isHigherClassification=1 (set in params.inc for that source)

For a build from multiple sources, step 1 is run ONCE. Steps 2-4 are run for EACH source. Finally, steps 5-7 are run ONCE.

The entire process is sequenced by the master script load_tnrs.php, which calls all others. Before running this script, you MUST set global parameters (in global_params.inc) and source-specific parameters (in params.inc in the import directory for each source). See instructions in load_tnrs.php, global_params.inc. For details on setting source-specific parameters, see the readme file in import_dwcExample/.

An individual source can be refreshed without rebuilding the entire database by loading source only and setting $replace_db=false (in global_params.inc). This will run steps 2-7 above, replacing only names linked uniquely to the source in question. For a faster replace, set $replace=false in params.inc for the source being refreshed. Only entirely new names from that source will be added. Existing names (and metadata such as source urls and date of access) will not be changed. WARNING: Reloading of individual sources not recommended until this option has been updated.

INSTRUCTIONS

The file "load_tnrs.php" runs a complete build of the TNRS database. This script calls all others. Prior to running this script:

  1. Check that all key parameters are correctly set in global_params.inc (in application root directory).
  2. Obtain data file(s) for each source and copy to the data folder within that source's import directory (e.g., import_tropicos/data/).
  3. Set source-specific parameters in the params.inc file for that source (also in import directory).

TEST DB

For a preliminary test build (recommended) of the TNRS database using small extracts of a subset of data sources, do the following.

  1. Edit global parameters (file "global_params.inc")
  • Set $src_array as to use example data source, as follows (just uncomment if already present):
$src_array=array(
	"dwcExample",
	"gccExample",
	"usdaExample"
);
  • Ensure that paths to directories "functions/" and "global_utilities/" are correct (parameters $local_utilities_path and $global_utilities_path in section "Include paths and filenames").
  • Set server and MySQL connection parameters in section "Db connection info"
  1. Run the master script:
php tnrs_db.php

Example database build should complete in under a minute.

CURRENT TNRS PRODUCTION DATABASE

For a complete build of the current TNRS database (ver. 4.0) using all existing data sources, do the following.

  1. Edit global parameters (file global_params.inc)
  • Set $src_array as follows:
$src_array=array(
	"tpl",
	"gcc",
	"ildis",
	"tropicos",
	"usda",
	"ncbi"
);
  • Ensure that paths to directories "functions/" and "global_utilities/" are correct (parameters $local_utilities_path and $global_utilities_path in section "Include paths and filenames").
  • Set server and MySQL connection parameters in section "Db connection info"
  1. Run the master script:
php tnrs_db.php

Database build should take several hours to a full day, depending on system resources. Each step, including any errors detected, will echo to the terminal screen. To avoid tying up a terminal or accidentally aborting due to a lost connection, I recommend running the process remotely using the unix "screen" utility.

REFRESH DATA FOR EXISTING SOURCES

The procedure to refresh one or more data sources, keeping all remaining data sources unchanged, is identical to the procedure above. The only difference is that you must first replace the old raw data file for the source to be refreshed with the new (refreshed) data file, and update any source-specific parameters as needed.

For example, if you are refreshing ONLY the data for Tropicos, do the following:

  1. Copy the new tropicos data file to the tropicos data directory: import_tropicos/data/. You can name the file whatever you want, but I recommend that the file name include the date of download.
  2. Edit the source parameters file ("params.inc") in the source directory (in this example, "import_tropicos/".
    • Change $dateAccessed to the download date of the data file
    • Change $namesfile to the name of the data file. This is critical.

Repeat for any other refreshed data source, then run the remaining steps from the preceding section (BUILD OF CURRENT TNRS PRODUCTION DATABASE).

The above procedure will work if and only if the schema of the refreshed data source is the same as the current version. If the schema has changed, the import for that source will almost certain fail. Because a complete build can take many hours, I recommend running a trial build using only the single source to be refreshed. If that source loads successfully, then the complete build should also run successfully. If the import fails, you will need to diagnose how the new data dump differs from the previous source, and either (a) obtain a new dump conforming to the old schema, or (b) modify the import scripts for that source.

Note that refreshing one or more sources requires rebuilding the entire database. Although the global parameters file (global_params.inc) contains a parameter that in theory allows you to refresh only an individual source within an existing database (this is done by setting "$replace_db = false;"), I do NOT recommend you use this option. This option has not been tested since several major changes were made to the database scripts, and may introduce anomalies if used.

ADD NEW SOURCE - TNRS SIMPLE DARWIN CORE

This simplest way to build a custom TNRS database which includes one or more new data sources this is to ensure that the data extract for each source conforms to the TNRS Simple Darwin Core schema. This schema is described in the readme found within the TNRS Simple Darwin Core (DWC) example import directory ("import_dwcExample/"). Please refer to this readme for detailed specifications of this schema.

Every source MUST have a unique short, alphanumeric code ([sourceName], above), without spaces or punctuation (e.g., "tropicos", "tpl", "gcc", etc.). I recommend using only lowercase alphabetic characters. This short name will be displayed in the TNRS application as the code for the source, so choose carefully.

Once you have restructured you new data extract to match the TNRS Simplified Darwin Core schema, do the following:

  1. Make a new copy of the entire DWC example directory
  2. Rename the the new directory, changing the suffix "dwcExample" to the code of the new source (e.g., "import_mynewsource").
  3. Place the data extract within the data subdirectory of the import directory.
  4. Edit the source parameters file ("params.inc") in the source import directory:
    • Set $sourceName to the code for this source. [critical]
    • Set $sourceNameFull to a longer, human-readable name of this source
    • Set $sourceUrl to the url where data was obtained [optional]
    • Set $dateAccessed to the download date of the data file
    • Set $namesfile to the name of the data file. [critical]
  5. Modify the parameter $src_array in the global parameters file ("global_params.inc") to include the code for the new source.

Now run a trial build using only the new source, to confirm that it loads successfully on its own (to do so, set $src_array equal only to the code for the new source). If the source file fails to import correctly or does not import at all, try adjusting the MySQL import parameters at the end of the source parameters file. Most commonly you will need to adjust "$lines_terminated_by" (either "\n", "\r" or "\r\n") and "$fields_terminated_by" (typically "," or "\t"). Also check that accented characters are loaded properly. If this is not the case, try setting "$fix_chars=true" in the source parameter file.

Once each new source has been test-loaded separately, you may then perform a complete build by modifying $src_array to include the codes for all data sources.

ADD NEW SOURCE - OTHER FORMATS

If for some reason you cannot restructure the new data to conform to the TNRS Simplified Darwin Core Schema, you will need to write a new set of custom import scripts. As a general approach to learn how to write a custom import script, try the following:

  1. Examine in detail the Simplified DWC import scripts (in "import_dwcExample"). These are the simplest.
  2. Examine more complex custom import scripts, in particular for Tropicos.
  3. Learn the structure of the staging table ("nameStaging"). Your goal is to restructure the new data into this table, performing all necessary validations. To produce an example of this table, run a partial import of a single existing source which you know loads successfully, for example Tropicos. Before doing so, set a breakpoint to stop the operation immediately after import. This is done by inserting an abort command ("die();") after the line in "load_tnrs.php" that runs the import for an single source ("include_once $src_dir."import.php";", currently line 151). Now, get into the new database and example the contents of the staging table. If you can import your data into this table and ensure that it satisfied all key valdations, then it will import successfully to the final database.

CHANGE LOG

Version 3.6.3.

  1. Added scripts in directory tropicos_fixes/. These are fixes specific to tropicos. Are not part of the pipeline, but should be run separately after completing build of the database. See readme in tropicos_fixes.

Version 3.6:

  1. Changed Tropicos import routine to include three additional fields: NomenclatureStatusID, NomenclatureStatusName, Symbol. These fields are returned by Tropicos API, and provide additional information regarding nomenclatural status. Values in the three fields are equivalent representations of the same value. For names where the Tropicos ComputedAcceptance algorithm does not provide a taxonomic opinion (as indicated by NULL value of acceptance, the value of NomenclatureStatusName is transfered to acceptance WHERE NomenclatureStatusName IN ('Illegitimate','Invalid'). NomenclatureStatusName='nom. rej.' is translated as "Rejected name' and transfered to acceptance. The goal of transferring these values is to alert the user that the name in question is problematic, even if Tropicos does not provide a link to the accepted name.

Version 4.0:

  1. Changes to support the import of new data sources TPL (The Plant List) and ILDIS (International Legume Database & Information Service).
  2. Automated tropicos_fixes/ (added to main TNRS pipeline)
  3. Standardized rank names to non-abbreviated form
    • Previous db had mix of abbreviated and full rank names
    • Standard botanical rank indicator abbreviations
      still used to form infraspecific taxon names

Version 4.1 (27 March 2019)

  1. Updated PHP code to php7.2
  2. Updated main README and README for dwcExample