This package provides generic infrastructure for the NSAPH Data Platform. It depends on the nsaph_util package but augments it with APIs and command-line utilities that depend on the infrastructure and the environment. For instance, its components assume the presence of a PostgreSQL DBMS (version 13 or later) and a CWL runtime environment.
The package is under intensive development; as a result, its development branches contain some obsolete modules and utilities, and the project structure can also be in flux.
Examples of tools included in this package are:
- Universal Data Loader
- A utility to monitor the progress of long-running database processes, such as indexing
- A utility to infer database schema and generate DDL from a CSV file
- A utility to link a table to GIS from a CSV file
- A wrapper around database connection to PostgreSQL
- A utility to import/export JSONLines files into/from PostgreSQL
- An Executor with a bounded queue
Top level directories are:
- doc
- resources
- src
The doc directory contains documentation.
The resources directory contains resources that must be loaded into the data platform for its normal functioning. For example, it contains mappings between US states, counties, FIPS codes, and ZIP codes. See details in the Resources section.
The src directory contains the software source code. See details in the Software Sources section.
The directories under src are:
- airflow
- commonwl
- html
- plpgsql
- python
- r
- superset
- yml
They are described in more detail in the corresponding sections. Here is a brief overview:
- airflow contains code and configuration for Airflow. Most of its content is deprecated, as it has been moved to the deployment package or to specific pipelines. However, this directory is intended to contain Airflow plugins that are generic across all NSAPH pipelines
- commonwl contains reusable workflows, packaged as tools that can and should be used by all NSAPH pipelines. Examples of such tools are: introspection of CSV files, indexing of tables, linking of tables with GIS information for easy mapping, and creation of a Superset datasource.
- html is a deprecated directory for HTML documents
- plpgsql contains PostgreSQL procedures and functions implemented in PL/pgSQL language
- python contains Python code, described in more detail below.
- r contains R utilities. These will probably be deprecated
- superset contains definitions of reusable Superset datasets and dashboards
- yml contains various YAML files used by the platform.
This is the main package, containing the majority of the code. The modules and subpackages included in the nsaph package are described below.
nsaph.data_model
Implements version 2 of the data modelling toolkit.
Version 1 focused on loading already processed data, saved as flat files, into the database. It inferred the data model from the structure of the data files and the accompanying README files. The inferred data model was converted to a database schema by generating the appropriate DDL.
Version 2 focuses on generating the code required to do the actual processing. The main concept is a knowledge domain, or simply a domain. A domain model is defined in a YAML file, as described in the documentation. The main module that processes the YAML definition of a domain is domain.py. Another module, inserter, handles parallel insertion of data into domain tables. A hypothetical example of what such a definition might look like is sketched below.
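The exact structure of a domain definition is given in the platform documentation; as a rough, hypothetical illustration (all names below are invented), a minimal domain file might take a shape like this:

```python
# Hypothetical, minimal domain definition. The actual schema of domain
# YAML files is described in the platform documentation; the domain,
# table, and column names here are invented for illustration only.
import yaml  # PyYAML

domain_yaml = """
climate:                      # domain name
  schema: climate             # database schema the domain maps to
  tables:
    temperature:
      columns:
        - zip:  {type: varchar}
        - date: {type: date}
        - tmax: {type: float}
      primary_key: [zip, date]
"""

domain = yaml.safe_load(domain_yaml)
print(domain["climate"]["tables"]["temperature"]["primary_key"])  # ['zip', 'date']
```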
Auxiliary modules perform various maintenance tasks. Module index2 builds indices for a given table or for all tables within a domain. Module utils provides convenience function wrappers and defines the class DataReader, which abstracts reading CSV and FST files. In other words, DataReader provides a uniform interface for reading columnar files in two (potentially more) different formats.
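To illustrate the uniform-interface idea, here is a conceptual sketch; the real DataReader class in nsaph.data_model.utils differs in names and features:

```python
# A conceptual sketch of the idea behind DataReader: one entry point that
# yields rows from either CSV or FST files, dispatching on the extension.
# This is not the actual DataReader API.
import csv
from pathlib import Path

def read_rows(path: str):
    """Yield rows from a columnar file, dispatching on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        with open(path, newline="") as f:
            yield from csv.DictReader(f)
    elif suffix == ".fst":
        # FST is an R columnar format; reading it from Python requires an
        # extra dependency, elided here.
        raise NotImplementedError("FST support needs an FST reader library")
    else:
        raise ValueError(f"Unsupported file format: {suffix}")
```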
nsaph.db
Module db is a PostgreSQL connection wrapper. It reads connection parameters from an ini file and connects to the database. It can transparently connect over an SSH tunnel when required.
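Conceptually, the wrapper boils down to something like the following sketch; the section and option names are assumptions, and the actual module additionally handles SSH tunneling:

```python
# The gist of a connection wrapper: read parameters from an ini file and
# open a psycopg2 connection. Section and option names are assumptions;
# the actual db module may differ.
import configparser
import psycopg2

def connect(ini_path: str, section: str = "database"):
    parser = configparser.ConfigParser()
    parser.read(ini_path)
    params = dict(parser[section])  # e.g. host, port, dbname, user, password
    return psycopg2.connect(**params)

# connection = connect("database.ini")
# with connection.cursor() as cursor:
#     cursor.execute("SELECT version()")
#     print(cursor.fetchone())
```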
nsaph.loader
A set of utilities to manipulate data.
Module data_loader implements parallel loading of data into a PostgreSQL database. It is also responsible for loading DDL and for the creation of views, both virtual and materialized.
Module index_builder is a utility to build indices and monitor the build progress.
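For example, PostgreSQL (version 12 and later) exposes index build progress in the pg_stat_progress_create_index view; a monitor along the lines of index_builder could poll it as in this sketch (not the actual implementation):

```python
# Polling index build progress via PostgreSQL's pg_stat_progress_create_index
# view (available since PostgreSQL 12). A sketch of the monitoring idea only.
import time
import psycopg2

def watch_index_builds(conn, interval: float = 5.0):
    conn.autocommit = True  # let every poll see current values
    with conn.cursor() as cursor:
        while True:
            cursor.execute("""
                SELECT pid, phase, blocks_done, blocks_total
                FROM pg_stat_progress_create_index
            """)
            rows = cursor.fetchall()
            if not rows:
                break  # no index builds are currently running
            for pid, phase, done, total in rows:
                pct = 100.0 * done / total if total else 0.0
                print(f"pid={pid} phase={phase}: {pct:.1f}%")
            time.sleep(interval)

# watch_index_builds(psycopg2.connect(...))
```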
nsaph.requests
Package nsaph.requests contains code intended to be used for fulfilling user requests. Its development is currently on hold.
Module hdf5_export exports the result of an SQL query as an HDF5 file. The structure of the HDF5 file is described by a YAML request definition.
Module query generates an SQL query from a YAML request definition.
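Stripped of the YAML-driven configuration, the core of such an export could look like the following sketch (assuming h5py and numpy, and a query returning numeric columns; the real hdf5_export module is more elaborate):

```python
# Exporting an SQL query result to HDF5: a conceptual sketch only. Assumes
# the query returns numeric columns; requires h5py and numpy.
import h5py
import numpy as np

def export_query(conn, sql: str, out_path: str, dataset: str = "result"):
    with conn.cursor() as cursor:
        cursor.execute(sql)
        columns = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
    with h5py.File(out_path, "w") as f:
        ds = f.create_dataset(dataset, data=np.array(rows, dtype="float64"))
        ds.attrs["columns"] = columns  # keep column names with the data
```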
nsaph.util
Package nsaph.util contains:
- Support for packaging resources, provided by two modules: resources and pg_json_dump. The latter imports and exports PostgreSQL (pg) tables in JSONLines format.
- Module net, containing a single method that resolves a host name to localhost. This method is required by Airflow.
- Module executors, implementing a ThreadPoolExecutor with a bounded queue. It is used to prevent out-of-memory (OOM) errors when processing huge files, by preventing a whole file from being loaded into memory before it is dispatched for processing.
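The bounded-queue executor is a standard pattern: submit() blocks once a fixed number of tasks are pending, so a producer reading a huge file cannot outrun the workers. A minimal sketch of the idea (not the actual executors module):

```python
# A ThreadPoolExecutor whose submit() blocks when too many tasks are
# pending, bounding memory use. A sketch of the pattern, not the actual
# nsaph.util.executors implementation.
import threading
from concurrent.futures import ThreadPoolExecutor

class BoundedExecutor(ThreadPoolExecutor):
    def __init__(self, max_workers: int, max_queued: int):
        super().__init__(max_workers=max_workers)
        # Total permits = tasks running plus tasks allowed to wait in queue
        self._permits = threading.BoundedSemaphore(max_workers + max_queued)

    def submit(self, fn, *args, **kwargs):
        self._permits.acquire()  # blocks the producer when the queue is full
        try:
            future = super().submit(fn, *args, **kwargs)
        except Exception:
            self._permits.release()
            raise
        future.add_done_callback(lambda _: self._permits.release())
        return future
```

With such an executor, a loader can submit one record (or batch) at a time: the producer simply blocks when max_queued tasks are already waiting, keeping memory usage flat.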
The majority of the files are data model definitions. For now, they are included in the nsaph package because they are used by different utilities and are thus expected to be stored in a specific location.
Besides the data model files, there are YAML files for:
- Conda environments required for NSAPH pipelines. Unless we are able to create a single environment that accommodates all pipelines, these will probably be deprecated and moved into the corresponding pipeline repositories.
- Sample user requests for future downstream pipelines that create user workspaces from the database. The file example_request.yml is used by the sample request handler.
Resources are organized in the following way:
- ${database schema}/
  - DDL file for ${resource1}
  - content of ${resource1} in JSON Lines format (*.json.gz)
  - DDL file for ${resource2}
  - content of ${resource2} in JSON Lines format (*.json.gz)
Resources can be packaged when a wheel is built. Support for packaging resources, both during development and after a package is deployed, is provided by the resources module.
Another module, pg_json_dump, provides support for packaging tables as resources in JSONLines format. This format is used natively by some DBMSs.
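Conceptually, dumping a table to gzipped JSONLines amounts to the following sketch (using psycopg2; the actual pg_json_dump module may differ in details, and the table name in the usage comment is only an example):

```python
# Dumping a PostgreSQL table as gzipped JSONLines, one JSON object per row.
# A sketch of the idea; not the actual pg_json_dump implementation.
import gzip
import json
import psycopg2
from psycopg2 import sql

def dump_table(conn, table: str, out_path: str):
    with conn.cursor() as cursor:
        # sql.Identifier quotes the table name safely
        cursor.execute(sql.SQL("SELECT * FROM {}").format(sql.Identifier(table)))
        columns = [d[0] for d in cursor.description]
        with gzip.open(out_path, "wt", encoding="utf-8") as out:
            for row in cursor:
                # default=str renders dates, decimals, etc. as strings
                out.write(json.dumps(dict(zip(columns, row)), default=str))
                out.write("\n")

# dump_table(psycopg2.connect(...), "us_states", "us_states.json.gz")
```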