NVA date update
Some NVA eventDate values have a "Z" at the end, which needs to be removed so that the dates can be parsed by pipelines.
Ticket Update: September 26 2024
Issue: Fix data format for the Tasmanian Natural Values Atlas (NVA) dr710.
Solution: Successfully removed the trailing "Z" from date entries (e.g., changed "01-06-2017Z" to "01-06-2017").
Actions Taken:
- Downloaded the data from DwCA-exports
- Conducted a thorough review of the data.
- Corrected the date format.
- Converted several columns, including individualCount, from float to integer data type.
- Resolved issues with the DwCA format, including correcting headers and performing necessary manipulations.
- Attempted to load data onto collectory-test; however, the data load failed during the DwCA to Verbatim step.
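The date and individualCount fixes listed above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for dr710: the helper names (strip_trailing_z, float_to_int_text, clean_row) are hypothetical, and it assumes the affected dates look like the example in this ticket (DD-MM-YYYY followed by "Z").

```python
import re

# Assumption: affected dates are date-only values (DD-MM-YYYY) with a stray "Z",
# as in the ticket's example "01-06-2017Z". Full timestamps are left alone.
_DATE_WITH_Z = re.compile(r"^(\d{2}-\d{2}-\d{4})Z$")


def strip_trailing_z(value: str) -> str:
    """Return the date without its trailing 'Z'; other values are returned unchanged."""
    match = _DATE_WITH_Z.match(value.strip())
    return match.group(1) if match else value


def float_to_int_text(value: str) -> str:
    """Rewrite whole-number floats such as '3.0' as '3'; leave everything else unchanged."""
    text = value.strip()
    if not text:
        return value
    try:
        number = float(text)
    except ValueError:
        return value
    return str(int(number)) if number.is_integer() else value


def clean_row(row: dict) -> dict:
    """Apply both fixes to a single record keyed by Darwin Core term name."""
    cleaned = dict(row)
    if "eventDate" in cleaned:
        cleaned["eventDate"] = strip_trailing_z(cleaned["eventDate"])
    if "individualCount" in cleaned:
        cleaned["individualCount"] = float_to_int_text(cleaned["individualCount"])
    return cleaned
```

Values that do not match the expected pattern (for example a genuine UTC timestamp ending in "Z", or a count such as "2.5") are deliberately passed through untouched so they can be reviewed by hand.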
Error Log:
INFO [2024-09-25 08:00:37,649+0000] [main] au.org.ala.pipelines.util.VersionInfo: git.remote.origin.url=https://github.com/gbif/pipelines
INFO [2024-09-25 08:00:38,776+0000] [main] au.org.ala.pipelines.beam.ALADwcaToVerbatimPipeline: Adding step 1: Options
INFO [2024-09-25 08:00:38,776+0000] [main] au.org.ala.pipelines.beam.ALADwcaToVerbatimPipeline: Non-HDFS Input path: /data/biocache-load/dr710
25-Sep 08:00:38 [LA-PIPELINES] [dr710] [ERROR] Unexpected error during DWCA-AVRO conversion dr710 step
25-Sep 08:00:38 [LA-PIPELINES] [dr710] [ERROR] Error 1 occurred on 1
Issues Encountered:
- Identified multiple issues with the DwCA file, including:
  - Unidentified columns.
  - Duplicate columns that were empty.
- Reworked the DwCA to create a new TSV file.
- Created the DwCA locally and loaded the data onto collectory-test again.
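One way to spot the duplicate and empty columns before rebuilding the TSV is sketched below. It is only an illustration under stated assumptions: the function names (parse_tsv, audit_columns, drop_columns) are hypothetical, the file is assumed to be plain tab-separated text with a header row and no quoted fields, and it does not reflect the exact steps used for dr710.

```python
from collections import Counter


def parse_tsv(text: str):
    """Split tab-separated text into a header list and a list of row lists."""
    lines = [line.rstrip("\r") for line in text.split("\n")]
    lines = [line for line in lines if line]
    header = lines[0].split("\t")
    rows = [line.split("\t") for line in lines[1:]]
    return header, rows


def audit_columns(header, rows):
    """Return (duplicated header names, indexes of columns with no values)."""
    duplicates = sorted(name for name, count in Counter(header).items() if count > 1)
    empty = [
        i
        for i in range(len(header))
        # A column counts as empty if every row is short or blank at that index.
        if all(i >= len(row) or not row[i].strip() for row in rows)
    ]
    return duplicates, empty


def drop_columns(header, rows, drop):
    """Return a new header and rows without the column indexes in `drop`."""
    dropped = set(drop)
    keep = [i for i in range(len(header)) if i not in dropped]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] if i < len(row) else "" for i in keep] for row in rows]
    return new_header, new_rows
```

Calling audit_columns first and reviewing its output before calling drop_columns keeps the removal step explicit, which matters when a duplicated header has one populated copy and one empty copy.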
Successfully loaded the data onto Databox and production environments.
Loaded Data for Review:
Test: Collections Test - DR710
Production: Collections Production - DR710
Prod UUID count logs:
24/09/26 08:20:48 INFO SparkContext: Successfully stopped SparkContext
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: Checking the percentage change in new UUIDs:
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: newUuids: 0.0, preservedUuids: 1121933.0, orphanedUniqueKeys: 0.0
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: Percentage UUID change: 0, allowed percentage: 50, override percentage check: false
- Status: Awaiting indexing