Project overview, roadmap and initial result reports

Primary language: Jupyter Notebook. License: Apache License 2.0.

UW-Macrostrat CriticalMAAS

Project overviews, roadmaps and milestone reports for Macrostrat's contribution to the DARPA/USGS CriticalMAAS program.

Key links

  • dev2.macrostrat.org: Development instance of Macrostrat housing new CriticalMAAS capabilities
  • v2.macrostrat.org: Staging instance of 'core' Macrostrat services using the streamlined CriticalMAAS codebase and infrastructure
  • macrostrat.org: Macrostrat's v1 production website

Codebase

This is an index of the software repositories that have been contributed to in the Macrostrat / UW–Madison CriticalMAAS effort, as of Milestone 4 (March 2024).

Macrostrat core system

Macrostrat's core system consists of a database of geological maps, stratigraphic columns, and the data dictionaries and lexicons that describe them. The core codebase contains the capabilities required to run a functional instance of Macrostrat, including its ingestion pipelines, APIs, and web interfaces. This system is the primary deliverable of the Macrostrat CriticalMAAS effort; by the end of the project, it will be well-documented and deployable by USGS staff and other users with a macrostrat up command.

All of Macrostrat's core repositories have been released under the Apache 2.0 license.

  • UW-Macrostrat/macrostrat: Macrostrat's core system, including database definitions, control scripts, and ingestion pipelines.

Key dependencies

  • UW-Macrostrat/macrostrat-api: Macrostrat's current (v1-2) API, which provides access to Macrostrat's web services for current functionality
  • UW-Macrostrat/tileserver: Server for vector and raster tiles to GIS software and TA3 performers
  • UW-Macrostrat/api-v3: Macrostrat's v3 API, which will be the primary API for new capabilities, including CriticalMAAS
  • UW-Macrostrat/python-libraries: Python libraries used throughout Macrostrat server and control applications, including tools for database access and data processing.
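
Vector and raster tile servers are commonly addressed with XYZ ("slippy map") tile coordinates. As a rough sketch, assuming the tileserver follows that convention (the source does not describe its URL scheme, and the URL template and layer name below are purely hypothetical), the tile containing a point can be computed as:

```python
import math

# Hypothetical URL template for illustration only; not the tileserver's real route.
TILE_URL_TEMPLATE = "https://tiles.example.org/{layer}/{z}/{x}/{y}.mvt"


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) XYZ tile indices containing a WGS84 point.

    Uses the standard Web Mercator tiling formula; indices are clamped
    to the valid range for the zoom level.
    """
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or near-polar latitudes stay inside the grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


def tile_url(layer: str, lon: float, lat: float, zoom: int) -> str:
    """Fill the (hypothetical) URL template for the tile containing a point."""
    x, y = lonlat_to_tile(lon, lat, zoom)
    return TILE_URL_TEMPLATE.format(layer=layer, z=zoom, x=x, y=y)
```

GIS clients normally do this arithmetic themselves; the sketch is only meant to show what a tile request identifies.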

Web interface

Macrostrat's web interface is the primary way of exposing Macrostrat's capabilities to users. Through CriticalMAAS, we are adding new capabilities including access to individual maps, a staged ingestion interface, and new human interfaces for managing links between geological units and the scientific literature.

Geologic metadata curation

We are working towards building better rock-record descriptions from the geological literature, both by discovering concepts linked to known geological units (relationship extraction) and by finding new units (named entity recognition) in papers and reports indexed by xDD. Several machine-learning approaches are being deployed towards this knowledge-graph curation problem:

  • UW-Macrostrat/macrostrat-xdd: System overview and documentation.
  • UW-Macrostrat/unsupervised-kg: Unsupervised knowledge graph construction to discover new entities and relationships from geological literature (Devesh Sarda; UW–Madison computer science).
  • UW-Macrostrat/llm-kg-generator: LLM-assisted graph generation to extract geological facts, operating over the scientific literature to characterize batches of geologic names (Bill Xia; UW–Madison computer science).

Tools initiated outside of CriticalMAAS are also being adapted to support these metadata curation workflows:

  • UW-xDD/text2graph_llm: A tool to transform textual data into structured graph representations, using LLMs to identify and extract relationships between locations and geological entities from text.
  • UW-Madison-DSI/ask-xDD: A chat interface and API endpoint for accessing academic information via Retrieval-Augmented Generation (RAG). The prototype currently covers topics such as geoscience, climate change, and COVID-19.

An orchestration layer (UW-Macrostrat/macrostrat-xdd; placeholder for future development) will run these models in bulk against Macrostrat's existing lexicon of geologic names; results will be standardized, linked to data dictionaries, and stored in Macrostrat's database. A feedback interface (currently in development; see early prototype) will allow users to correct extracted descriptions and to establish de novo relationships. After initial deployment, this pipeline will be extended with "named entity recognition" capabilities to identify and describe rock units not known to Macrostrat.
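
To make the linking step concrete, here is a minimal sketch of matching extracted names against a lexicon. The `LEXICON` entries, their identifiers, and the `normalize` and `link_extraction` helpers are hypothetical illustrations, not Macrostrat's schema or the pipeline's actual code. Names that do not match are kept and flagged, since they are candidates for user review or for later recognition of new units.

```python
import re

# Hypothetical lexicon mapping normalized names to invented identifiers.
LEXICON = {
    "morrison formation": 1,
    "dakota sandstone": 2,
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def link_extraction(name: str, lexicon: dict[str, int] = LEXICON) -> dict:
    """Attach a lexicon identifier to an extracted name when one matches exactly."""
    key = normalize(name)
    lexicon_id = lexicon.get(key)
    return {
        "raw_name": name,
        "normalized": key,
        "lexicon_id": lexicon_id,
        "status": "linked" if lexicon_id is not None else "unmatched",
    }
```

Exact matching after normalization is the simplest possible choice; a production system would likely need fuzzy matching and rank-aware handling of names.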

Geologic map editing

Candidate geologic maps staged by TA1 will require validation before they are ingested into Macrostrat and used by TA3. Some feature extractions may require manual editing before acceptance; this is a labor-intensive process using current GIS workflows. We are developing a map editing workflow based on iterative topology to greatly speed the production of complete, topologically correct geologic maps.
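
One ingredient of topological correctness is that linework closes: a contact whose endpoint touches no other line end cannot bound a polygon. The sketch below is a generic illustration of that check, not the project's editing algorithm. The function name and the rounding-based snapping are assumptions, and it presumes lines have already been split at their intersections.

```python
from collections import Counter

Point = tuple[float, float]
Line = list[Point]


def find_dangling_endpoints(lines: list[Line], precision: int = 6) -> list[Point]:
    """Return line endpoints that are not shared with any other line end.

    Coordinates are rounded to `precision` decimals so that nearly
    coincident endpoints count as the same node.
    """
    def snap(point: Point) -> Point:
        return (round(point[0], precision), round(point[1], precision))

    counts: Counter = Counter()
    for line in lines:
        if len(line) < 2:
            continue  # a single vertex is not a line
        counts[snap(line[0])] += 1
        counts[snap(line[-1])] += 1
    # A node touched by exactly one line end is a dangle.
    return sorted(point for point, n in counts.items() if n == 1)
```

For example, four segments forming a closed square produce no dangles, while a spur drawn outward from one corner is reported at its free end.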

Editing will be possible via:

  • Web-based editing tools in the Mapboard platform
  • Standard GIS platforms such as QGIS, via direct connection to PostGIS or a WFS-T service (under evaluation)
  • Mapboard GIS, an iPad app for geologic mapping with a focus on rapid, intuitive drawing.

Note: Natural drawing capabilities of Mapboard GIS are not available under an open-source license. These can be omitted entirely if desired, but they provide a significant improvement in the speed and capability of map editing.

Program coordination

Macrostrat has contributed to the development of shared infrastructure for the CriticalMAAS program, including data formats, schemas, and shared libraries.

  • DARPA-CriticalMAAS/ta1-geopackage: a GeoPackage-based data format for validating and storing TA1 output
  • DARPA-CriticalMAAS/schemas: A repository for schemas and data formats for TA1-3 integrations (started by UW–Macrostrat and subsequently contributed to by all TA teams)
  • UW-xDD/document-store: A supplemental store for public and user-provided PDFs that provides full-text access and integrates with xDD APIs. Note: This repository is being integrated into the CDR codebase.
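
Because a GeoPackage is an SQLite database whose layers are registered in the standard `gpkg_contents` table, basic validation can be sketched with the Python standard library alone. The helper names and the required-layer list below are hypothetical; the actual layer names and validation rules are defined in DARPA-CriticalMAAS/ta1-geopackage.

```python
import sqlite3


def list_geopackage_layers(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return (table_name, data_type) for each layer registered in gpkg_contents."""
    rows = conn.execute(
        "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
    ).fetchall()
    return [(name, data_type) for name, data_type in rows]


def missing_layers(conn: sqlite3.Connection, required: list[str]) -> list[str]:
    """Return the required layer names that are absent from the GeoPackage."""
    present = {name for name, _ in list_geopackage_layers(conn)}
    return sorted(set(required) - present)
```

A file on disk would be opened with `sqlite3.connect("path/to/file.gpkg")` and passed to either function.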

External integrations

Macrostrat is integrated with systems that provide additional functionality relevant to CriticalMAAS. Major adjustments to these systems are out of scope for the CriticalMAAS project, but their integrations with Macrostrat can provide useful capabilities to the program.

  • Corelle: Paleogeographic rotation system compatible with GPlates
  • xDD: A system for extracting geologic metadata from the scientific literature
  • COSMOS: A system for extracting structural elements (figures, tables, etc.) from papers
  • Weaver: Ingestion, curation, performant filtering, and visualization of massive geological point datasets (for TA2 data integration)

Documentation

Documentation is associated with individual project repositories. Additionally, some broad documentation is being coordinated across the project towards building a cohesive, deployable system.

Macrostrat's documentation website provides a high-level overview of the Macrostrat system and its capabilities and links to more detailed documentation for individual components. Going forward, this system-level documentation will be centralized in the UW-Macrostrat/docs repository. A brief overview of Macrostrat documentation resources is provided below:

Core system setup

Developer-focused documentation on running Macrostrat's core system can be found in the UW-Macrostrat/macrostrat repository.

API usage
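
Macrostrat's APIs are served over HTTP. As a minimal sketch, assuming the v2 API root is `https://macrostrat.org/api/v2/` and that a `units` route accepts `strat_name` and `format` parameters (both are assumptions here; consult the API's own documentation for current routes and parameters), a request URL can be built with the standard library:

```python
from urllib.parse import urlencode, urljoin

# Assumed API root; verify against the deployed instance you are targeting.
API_ROOT = "https://macrostrat.org/api/v2/"


def build_api_url(route: str, **params: str) -> str:
    """Join a route onto the API root and append sorted, URL-encoded parameters."""
    url = urljoin(API_ROOT, route.lstrip("/"))
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url


# Illustrative request; the route and parameter names are assumptions.
example_url = build_api_url("units", strat_name="Morrison", format="json")
```

The resulting URL can be fetched with `urllib.request.urlopen` or any HTTP client. Sorting the parameters keeps URLs stable, which helps with caching and testing.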

Map interface examples

The v2.macrostrat.org/dev and v2.macrostrat.org/map/dev websites contain examples of Macrostrat's map services for raster, vector, and point-based data.

Document extraction system

TODO

  • Build a documentation website (preliminary website is up)
  • Document new Macrostrat APIs using the OpenAPI specification in order to conform to common standards.
  • Establish documentation websites for shared web components and Python libraries
  • Create and document a process for setting up an empty Macrostrat database, populated with data dictionaries

Project documents

Compiled documents are stored in an S3 bucket and linked below. To download all documents, run make in the root directory of this repository.

Project Milestones

Phase 1

| # | Milestone Exit Criteria | Product | Target Date |
| --- | --- | --- | --- |
| 0 | Specifications for TA 1-3 | Specification Plans | 09/2023 |
| 1 | Detailed research plan for Phase 1 | Milestone Report | 10/2023 |
| 2 | Initial code/documentation release | Milestone Report/Code | 12/2023 |
| 3 | Report detailing progress of research and technology and gaps | Milestone Report | 02/2024 |
| 4 | Code/documentation/data release and Milestone Report | Milestone Report/Code | 03/2024 |
| 5 | Report detailing progress, capabilities, gaps, and final integration plans | Milestone Report | 04/2024 |
| 6 | Report with challenge evaluation results, working code and documentation and end-to-end integration | Milestone Report/Code | 07/2024 |

Appendix: Out-of-scope codebases

Archived repositories

These repositories are no longer under active development; they have been integrated into Macrostrat's core system at UW-Macrostrat/macrostrat.

Private repositories

Document extraction handling

These document handling repositories are maintained as part of the xDD system, and will be called on as necessary for the CriticalMAAS project.