AtlasOfLivingAustralia/data-management

NVA date update


Some eventDate values in the NVA dataset end with a stray "Z", which must be removed before the dates can be parsed by pipelines.

Ticket Update: September 26 2024

Issue: Fix data format for the Tasmanian Natural Values Atlas (NVA) dr710.

Solution: Removed the trailing "Z" from date entries (e.g., changed "01-06-2017Z" to "01-06-2017").
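A minimal sketch of this kind of clean-up, assuming the affected values follow the day-month-year pattern shown in the example. The function name `strip_trailing_z` is hypothetical and is not taken from the actual fix; full ISO timestamps that legitimately end in "Z" are left untouched.

```python
import re

# Assumption: affected values look like "01-06-2017Z" (dd-mm-yyyy plus a stray "Z").
_TRAILING_Z = re.compile(r"^(\d{2}-\d{2}-\d{4})Z$")


def strip_trailing_z(event_date: str) -> str:
    """Return event_date without a stray trailing 'Z'; leave other values unchanged."""
    match = _TRAILING_Z.match(event_date.strip())
    return match.group(1) if match else event_date


print(strip_trailing_z("01-06-2017Z"))
```

Anchoring the pattern to the whole value keeps the change narrow, so only dates of the reported shape are rewritten.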

Actions Taken:

  • Downloaded the data from DwCA-exports.
  • Conducted a thorough review of the data.
  • Corrected the date format.
  • Adjusted multiple columns, specifically individualCount, from float to integer data type.
  • Resolved issues with the DwCA format, including correcting headers and performing necessary manipulations.
  • Attempted to load data onto collectory-test; however, the data load failed during the DwCA to Verbatim step.
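The float-to-integer adjustment listed above could look roughly like the following sketch. It assumes the counts arrive as text such as "3.0" in a delimited file; the helper name `float_text_to_int_text` is hypothetical. Blank, non-numeric, and genuinely fractional values are passed through unchanged rather than guessed at.

```python
def float_text_to_int_text(value: str) -> str:
    """Convert text like '3.0' to '3'; leave blank, non-numeric, or fractional values alone."""
    text = value.strip()
    if not text:
        return value
    try:
        number = float(text)
    except ValueError:
        return value  # not a number, keep the original text
    if number.is_integer():
        return str(int(number))
    return value  # a real fraction such as 2.5 is not silently truncated


print(float_text_to_int_text("3.0"))
```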

Error Log:
INFO [2024-09-25 08:00:37,649+0000] [main] au.org.ala.pipelines.util.VersionInfo: git.remote.origin.url=https://github.com/gbif/pipelines
INFO [2024-09-25 08:00:38,776+0000] [main] au.org.ala.pipelines.beam.ALADwcaToVerbatimPipeline: Adding step 1: Options
INFO [2024-09-25 08:00:38,776+0000] [main] au.org.ala.pipelines.beam.ALADwcaToVerbatimPipeline: Non-HDFS Input path: /data/biocache-load/dr710
25-Sep 08:00:38 [LA-PIPELINES] [dr710] [ERROR] Unexpected error during DWCA-AVRO conversion dr710 step
25-Sep 08:00:38 [LA-PIPELINES] [dr710] [ERROR] Error 1 occurred on 1

Issues Encountered:

  • Identified multiple issues with the DwCA file, including:
  1. Unidentified columns.
  2. Duplicate columns that were empty.
  • Reworked the DwCA to create a new TSV file.
  • Created the DwCA locally and loaded the data onto collectory-test again.
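As an illustration of the rework described above, the sketch below drops columns that contain no values in any record when rebuilding a TSV. It is an assumption about one possible approach, not the script actually used; the function name `drop_empty_columns` and the sample column names are hypothetical.

```python
import csv
import io


def drop_empty_columns(tsv_text: str) -> str:
    """Return the TSV with every column removed whose data cells are all blank."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if len(rows) < 2:
        return tsv_text  # header only (or empty): nothing to judge columns by
    header, records = rows[0], rows[1:]
    # Keep a column if at least one record has a non-blank value in it.
    keep = [
        i for i in range(len(header))
        if any(i < len(r) and r[i].strip() for r in records)
    ]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header[i] for i in keep])
    for r in records:
        writer.writerow([r[i] if i < len(r) else "" for i in keep])
    return out.getvalue()


sample = "occurrenceID\teventDate\teventDate\n1\t01-06-2017\t\n2\t02-06-2017\t\n"
print(drop_empty_columns(sample))
```

In the sample, the second eventDate column is empty in every record, so only the first two columns survive.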

Successfully loaded the data onto Databox and production environments.

Loaded Data for Review:
Test: Collections Test - DR710
Production: Collections Production - DR710

Prod UUID count logs:
24/09/26 08:20:48 INFO SparkContext: Successfully stopped SparkContext
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: Checking the percentage change in new UUIDs:
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: newUuids: 0.0, preservedUuids: 1121933.0, orphanedUniqueKeys: 0.0
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: Percentage UUID change: 0, allowed percentage: 50, override percentage check: false
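For reviewing logs like these, a small sketch that pulls the three counts out of the `newUuids` line may help. The regular expression is written against the line format shown above only; the function name `parse_uuid_counts` is hypothetical, and nothing here reflects how the pipeline itself computes its percentage check.

```python
import re

# Pattern based solely on the log line format shown in this ticket.
_COUNTS = re.compile(
    r"newUuids: ([\d.]+), preservedUuids: ([\d.]+), orphanedUniqueKeys: ([\d.]+)"
)


def parse_uuid_counts(log_line: str) -> dict:
    """Extract the UUID counts from a single log line as floats."""
    match = _COUNTS.search(log_line)
    if match is None:
        raise ValueError("no UUID counts found in log line")
    new, preserved, orphaned = (float(g) for g in match.groups())
    return {
        "newUuids": new,
        "preservedUuids": preserved,
        "orphanedUniqueKeys": orphaned,
    }


line = (
    "24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: "
    "newUuids: 0.0, preservedUuids: 1121933.0, orphanedUniqueKeys: 0.0"
)
print(parse_uuid_counts(line))
```

With the line from this load, the parsed values show zero new UUIDs and 1121933 preserved, which matches the reported 0 percent change.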

  • Status: Awaiting indexing