/nsaph-core-platform

Core Data Platform providing generic infrastructure for data hosting and processing

Primary LanguagePythonOtherNOASSERTION

NSAPH Data Platform

Overview

The data platform provides generic infrastructure for NSAPH Data Platform It depends on nsaph_util package, but it augments it with APIs and command line utilities dependent on the infrastructure and the environment. For instance, its components assume presence of PostgreSQL DBMS (version 13 or later) and CWL runtime environment.

The package is under intensive development, therefore its development branches contain some obsolete modules and utilities. The project structure can also be in flux.

Tool Examples

Examples of tools included in this package are:

Project Structure

The package is under intensive development, the project structure is in flux

Top level directories are:

- doc
- resources
- src

Doc directory contains documentation.

Resource directory contains resources that must be loaded in the data platform for its normal functioning. For example, they contain mappings between US states, counties, fips and zip codes. See details in Resources section.

Src directory contains software source code. See details in Software Sources section.

Software Sources

The directories under sources are:

- airflow
- commonwl
- html
- plpgsql
- python
- r
- superset
- yml

They are described in more details in the corresponding sections. Here is a brief overview:

  • airflow contains code and configuration for Airflow. Most of the content is deprecated as it is now transferred to the deployment package or the specific pipelines. However, this directory is intended to contain Airflow plugins that are generic for all NSAPH pipelines
  • commonwl contains reusable workflows, packaged as tools that can and should be used by all NSAPH pipelines. Examples of such tools are: introspection of CSV files, indexing tables, linking tables with GIS information for easy mapping, creation of a Superset datasource.
  • html is a deprecated directory for HTML documents
  • plpgsql contains PostgreSQL procedures and functions implemented in PL/pgSQL language
  • python contains Python code. See more details.
  • r contains R utilities. Probably will be deprecated
  • superset contains definitions of reusable Superset datasets and dashboards
  • yml contains various YAML files used by the platform.

Python packages

NSAPH Package

This is the main package containing the majority of the code. Modules and subpackages included in nsaph package are described below.

Subpackage for Data Modelling
  • nsaph.data_model

Implements version 2 of the data modelling toolkit.

Version 1 was focused on loading into the database already processed data saved as flat files. It inferred data model from the data files structure and accompanied README files. The inferred data model is converted to database schema by generating appropriate DDL.

Version 2 focuses on generating code required to do the actual processing. The main concept is a knowledge domain, or just a domain. Domain model is define in a YAML file as described in the documentation. The main module that processes the YAML definition of the domain is domain.py. Another module, inserter handles parallel insertion of the data into domain tables.

Auxiliary modules perform various maintenance tasks. Module index2 builds indices for a given tables or for all tables within a domain. Module utils provides convenience function wrappers and defines class DataReader that abstracts reading CSV and FST files. In other words, DataReader provides uniform interface to reading columnar files in two (potentially more) different formats.

Module Database Connection Wrapper
  • nsaph.db

Module db is a PostgreSQL connection wrapper. It reads connection parameters from an ini file and connects to the database. It can transparently connect over ssh tunnel when required.

Loader Subpackage
  • nsaph.loader

A set of utilities to manipulate data.

Module data_loader Implements parallel loading data into a PostgreSQL database. It is also responsible for loading DDL and creation of view, both virtual and materialized.

Module index_builder is a utility to build indices and monitor the build progress.

Subpackage to describe and implement user requests [Incomplete]
  • nsaph.requests

Package nsaph.requests contains some code that is intended to be used for fulfilling user requests. Its development is currently put on hold.

Module hdf5_export exports result of SQL query as an HDF5 file. The structure of the HDF5 is described by a YAML request definition.

Module query generates SQL query from a YAML request definition.

Subpackage with miscellaneous utilities
  • nsaph.util

Package nsaph.util contains:

  • Support for packaging resources in two modules resources and pg_json_dump. The latter module imports and exports PostgreSQL (pg) tables as JSONLines format.
  • Module net contains one method resolving host to localhost. This method is required by Airflow.
  • Module executors implements a
    ThreadPoolExecutor with a bounded queue. It is used to prevent out of memory (OOM) errors when processing huge files (to prevent loading the whole file into memory before dispatching it for processing).

YAML files

The majority of files are data model definitions. For now, they are included in nsaph package because they are used by different utilities and thus, expected to be stored in a specific location.

Beside data model files, there are YAML files for:

  • Conda environments, required for NSAPH pipelines. Unless we will be able to create a single environment that accomodate all pipelines we will probably deprecate them and move into corresponding pipeline repositories.
  • Sample user requests for future downstream pipelines that create user workspaces from the database. File example_request.yml is used by sample request handler

Resources

Resources are organized in the following way:

- ${database schema}/
    - ddl file for ${resource1}
    - content of ${resource1} in JSON Lines format (*.json.gz)
    - ddl file for ${resource2}
    - content of ${resource2} in JSON Lines format (*.json.gz)

Resources can be packaged when a wheel is built. Support for packaging resources during development and after a package is deployed is provided by resources module.

Another module, pg_json_dump, provides support for packaging tables as resources in JSONLines format. This format is used natively by some DBMSs.