
Data introspection, indexer, and semantic analyzer


Origins


Origins supports extracting structural and descriptive metadata from data resources such as relational databases, document stores, web services, and structured text files.

Origins provides a uniform programmatic interface for accessing metadata from various resources. It removes the need to know how PostgreSQL stores its schema information, how REDCap data dictionaries are structured, or how to get all the fields and their occurrences across documents in a MongoDB collection.

Quick Usage

Import origins and connect to a resource. This example uses a SQLite database that comes with the repository.

>>> import origins
>>> db = origins.connect('sqlite', path='./tests/data/chinook.sqlite')
>>> db.tables
(Table('Album'),
 Table('Artist'),
 Table('Customer'),
 Table('Employee'),
 Table('Genre'),
 Table('Invoice'),
 Table('InvoiceLine'),
 Table('MediaType'),
 Table('Playlist'),
 Table('PlaylistTrack'),
 Table('Track'))

>>> db.tables['Employee'].columns['Title'].props
{'default_value': None,
 'index': 3,
 'name': 'Title',
 'nullable': True,
 'primary_key': 0,
 'type': 'NVARCHAR(30)'}

For a more thorough example, walk through the Origins Introduction example.
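To give a feel for the backend-specific knowledge that the uniform interface hides, here is a minimal sketch of column introspection written directly against SQLite with Python's standard sqlite3 module. It is not how Origins is implemented. The function name column_props and the small in-memory Employee table are assumptions made for illustration (the table is a stand-in, not the Chinook database), and the property names simply mirror the output shown above.

```python
import sqlite3


def column_props(conn, table):
    """Return {column name: props} for one table via SQLite's table_info pragma.

    Hypothetical helper for illustration; not part of the Origins API.
    """
    # PRAGMA arguments cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    props = {}
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    for cid, name, col_type, notnull, default, pk in rows:
        props[name] = {
            "index": cid,
            "name": name,
            "type": col_type,
            "nullable": not notnull,
            "default_value": default,
            "primary_key": pk,
        }
    return props


# A small in-memory stand-in table (assumed schema, not the real Chinook one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "EmployeeId INTEGER PRIMARY KEY, "
    "LastName NVARCHAR(20) NOT NULL, "
    "Title NVARCHAR(30))"
)
employee = column_props(conn, "Employee")
```

Other database systems expose the same information through different catalogs or commands, which is the detail a backend client is meant to absorb.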

Goals

Given an element and its resource, extract enough metadata to support accessing the underlying data.

For physical resources, this provides a foundation for extraction of additional information such as statistics about the data itself and arbitrary queries. For logical resources, the metadata could be used for constructing the data model itself.

Some elements are related either explicitly, for example through a referential constraint, or semantically, for example as synonyms in an ontology. These are generally referred to as relationships.

Outcomes/Use Cases

  • General information purposes
    • Registry of resources for a project, team, organization, etc.
    • Cross-organization collaboration
    • Annotate with high-level transformations required to move data out of systems
  • Data provenance
    • Includes lineage of data
    • Invalidation of references
      • If A references B and B changes, the reference to B may no longer be valid
  • Definition of logical resources for ETL workflows
    • Shopping cart of elements across resources that may be required or useful for a project
  • "Common Data Elements"
    • Logical resource containing the canonical elements that other physical elements map to across resources
  • Normalized data access layer
    • Metadata can be used to generate queries, statements, expressions, web requests, etc. for fetching the underlying data
    • As with any programmatic access layer (e.g. an ORM), the data model needs to be expressed somewhere that can be mapped to the underlying system's interface, i.e. a language (SQL), API or protocol

Resource Backends

A backend is composed of a client and optionally a set of classes for each structural component in the metadata, such as database, table, and column for relational databases.

The client does all the heavy lifting of connecting to the resource and extracting the metadata. If the structural classes are available (all built-in backends have these), the metadata can be accessed and traversed using a simple hierarchy/graph API (see the Quick Usage example above).

Built-in Backends

Backends are grouped by type. The first section lists the backend name and any dependencies that must be installed for the backend.

One or more options can be passed to the backend. The Hierarchy section of each group lists the path from the origin to the elements. For example, db.tables accesses the table nodes, and db.tables['Employee'].columns accesses all the columns on the Employee table.
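The traversal style described here (a collection that prints like a tuple but can also be indexed by name) can be sketched in a few lines. This is only an illustration of the idea, not the Origins implementation. The classes Node, Container, Table, and Column below are hypothetical stand-ins defined for this example.

```python
class Node:
    """Hypothetical structural element with a name and a props dict."""

    def __init__(self, name, **props):
        self.name = name
        self.props = props

    def __repr__(self):
        return f"{type(self).__name__}({self.name!r})"


class Container(tuple):
    """Tuple of nodes that additionally supports lookup by node name."""

    def __getitem__(self, key):
        if isinstance(key, str):
            for node in self:
                if node.name == key:
                    return node
            raise KeyError(key)
        # Integer and slice access fall back to normal tuple behaviour.
        return super().__getitem__(key)


class Column(Node):
    pass


class Table(Node):
    def __init__(self, name, columns=()):
        super().__init__(name)
        self.columns = Container(columns)


tables = Container([
    Table("Employee", [Column("Title", type="NVARCHAR(30)")]),
    Table("Album"),
])
title = tables["Employee"].columns["Title"]
```

Each level of a hierarchy then becomes one more Container attribute on the parent node, so a path such as database, tables, columns reads the same way for every backend.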

Relational Databases

Options

  • database - name of the database
  • host - host of the server
  • port - port of the server
  • user - user for authentication
  • password - password for authentication

Hierarchy

  • database
  • schemas (PostgreSQL only)*
  • tables
  • columns

Note: In addition to the PostgreSQL backend supporting schemas, it also provides direct access to the tables under the public schema via the tables property.

Document Stores

Options

  • database - name of the database
  • host - host of the server
  • port - port of the server
  • user - user for authentication
  • password - password for authentication

Hierarchy

  • database
  • collections
  • fields*

Note: the fields of nested documents are not indexed; however, this could be implemented as an option if a use case presents itself.
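As a rough illustration of what collecting fields and their occurrences across documents involves, the sketch below counts top-level keys over a list of plain dictionaries. It does not connect to a document store and is not the Origins implementation. The function name field_occurrences and the sample documents are assumptions for this example. Nested documents are deliberately not descended into, matching the note above.

```python
from collections import Counter


def field_occurrences(documents):
    """Count how many documents contain each top-level field.

    Hypothetical helper for illustration; nested fields are ignored.
    """
    counts = Counter()
    for doc in documents:
        # Only top-level keys are counted; nested documents count as one field.
        counts.update(doc.keys())
    return counts


# Sample documents standing in for the contents of a collection.
docs = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}},
    {"_id": 2, "name": "Grace"},
    {"_id": 3, "email": "x@example.org"},
]
occurrences = field_occurrences(docs)
```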

Files

  • delimited - General backend for accessing fixed width delimited files.
  • csv - Alias for the , delimiter
  • tab - Alias for the \t delimiter
  • datadict - General backend for data dictionary-style delimited files.

Options

  • path - Path to the file
  • delimiter - The delimiter between fields, defaults to comma
  • header - A list/tuple of column names. If not specified, the header will be detected if it exists, otherwise the column names will be the indices.
  • sniff - The number of bytes to read from the start of the file when detecting the header.
  • dialect - csv.Dialect instance. This will be detected if not specified.

Hierarchy

  • file
  • columns (delimited) | fields (datadict)
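The header, sniff, and dialect options above map naturally onto Python's standard csv.Sniffer, which can guess both the dialect and whether a header row is present from a sample of the file. The sketch below shows that technique on in-memory text. Whether the delimited backend uses csv.Sniffer internally is an assumption. The function name detect_columns is hypothetical, and the sample size here is counted in characters rather than bytes.

```python
import csv
import io


def detect_columns(text, sniff=1024):
    """Guess the delimiter and column names from the start of a delimited file.

    Hypothetical helper for illustration; falls back to column indices when
    no header row is detected.
    """
    sample = text[:sniff]
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)          # guess delimiter, quoting, etc.
    has_header = sniffer.has_header(sample)  # heuristic comparison of row 1 to the rest
    first_row = next(csv.reader(io.StringIO(text), dialect))
    columns = first_row if has_header else list(range(len(first_row)))
    return dialect.delimiter, columns


with_header = detect_columns("id,name,age\n1,Ada,36\n2,Bob,45\n3,Eve,29\n")
without_header = detect_columns("1,Ada,36\n2,Bob,45\n3,Eve,29\n")
```

Header detection of this kind is heuristic, which is a good reason to pass header or dialect explicitly when the file layout is known.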

Excel

Options

  • path - Path to the file
  • headers - If True, the first row on each sheet will be assumed to be the header. If False, the column indices will be used. If a list/tuple, the columns will apply to the first sheet. If a dict, keys are the sheet names and the values are a list/tuple of column names for the sheet.

Hierarchy

  • workbook
  • sheets
  • columns

Note: Sheets are assumed to be fixed width based on the first row.
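The four accepted forms of the headers option can be made concrete with a small sketch that resolves column names per sheet. It works on plain Python data rather than a real workbook and is not the Origins implementation. The function name resolve_headers is hypothetical, and falling back to column indices for sheets that a list or dict does not cover is an assumption, since the text above does not say what happens in that case.

```python
def resolve_headers(sheets, headers=True):
    """Resolve column names for each sheet from a `headers` option.

    `sheets` maps sheet name -> list of rows, in workbook order.
    Hypothetical helper for illustration only.
    """
    resolved = {}
    for position, (name, rows) in enumerate(sheets.items()):
        # Width is taken from the first row, as in the note above.
        width = len(rows[0]) if rows else 0
        indices = list(range(width))
        if headers is True:
            resolved[name] = list(rows[0]) if rows else []
        elif headers is False:
            resolved[name] = indices
        elif isinstance(headers, dict):
            # Assumption: sheets missing from the dict fall back to indices.
            resolved[name] = list(headers.get(name, indices))
        else:
            # A list/tuple applies to the first sheet only.
            # Assumption: the remaining sheets fall back to indices.
            resolved[name] = list(headers) if position == 0 else indices
    return resolved


workbook = {
    "People": [["name", "age"], ["Ada", 36]],
    "Places": [["city", "country", "population"], ["London", "UK", 9000000]],
}
from_first_row = resolve_headers(workbook, headers=True)
from_indices = resolve_headers(workbook, headers=False)
from_list = resolve_headers(workbook, headers=["full_name", "years"])
from_dict = resolve_headers(workbook, headers={"Places": ["c", "k", "p"]})
```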

File System

Options

  • path - Path to directory

Hierarchy

  • directory
  • files

Variant Call Format (VCF) Files

Options

  • path - Path to VCF file

Hierarchy

  • file
  • field

REDCap (via MySQL)

  • redcap-mysql - depends on MySQL backend

Options

  • project - name of the project to access
  • database - name of the database (defaults to 'redcap')
  • host - host of the server
  • port - port of the server
  • user - user for authentication
  • password - password for authentication

Hierarchy

  • project
  • forms
  • fields

REDCap (via API)

Options

  • url - REDCap API URL
  • token - REDCap API token for the project
  • name - Name of the project being accessed (this is merely an identifier). Note: this is required because PyCap does not currently export the name of the project itself through its APIs.

Hierarchy

  • project
  • forms
  • fields

REDCap (via CSV data dictionary)

Options

  • path - Path to the REDCap data dictionary CSV file

Hierarchy

  • project
  • forms
  • fields

Harvest API

Options

  • url - Harvest API URL
  • token - Harvest API token if authentication is required

Hierarchy

  • application
  • categories
  • concepts
  • fields

Data Access Layer (DAL)

(Work in Progress)

View the example using SQLite

Resource Exporter

The resource exporter returns a JSON-compatible format representing the nodes and relationships from the "connect" API above. This format is intended to be consumed by the Origins graph API, but it can also serve as a general-purpose format for other consumers.

See an example usage

Deprecated APIs

  • Support for exporting to Neo4j has been removed from Origins in favor of the APIs available in graphlib, which Origins currently depends on.