Origins supports extracting structural and descriptive metadata from data resources such as relational databases, document stores, web services, and structured text files.
Origins provides a uniform programmatic interface for accessing metadata from various resources. It removes the need to know how PostgreSQL stores its schema information, how REDCap data dictionaries are structured, or how to get all the fields and their occurrences across documents in a MongoDB collection.
Import origins and connect to a resource. This example uses a SQLite database that comes with the repository.
```python
>>> import origins
>>> db = origins.connect('sqlite', path='./tests/data/chinook.sqlite')
>>> db.tables
(Table('Album'),
 Table('Artist'),
 Table('Customer'),
 Table('Employee'),
 Table('Genre'),
 Table('Invoice'),
 Table('InvoiceLine'),
 Table('MediaType'),
 Table('Playlist'),
 Table('PlaylistTrack'),
 Table('Track'))
>>> db.tables['Employee'].columns['Title'].props
{'default_value': None,
 'index': 3,
 'name': 'Title',
 'nullable': True,
 'primary_key': 0,
 'type': 'NVARCHAR(30)'}
```
For a more thorough example, walk through the Origins Introduction example.
Given an element and its resource, extract enough metadata to support accessing the underlying data.
For physical resources, this provides a foundation for extraction of additional information such as statistics about the data itself and arbitrary queries. For logical resources, the metadata could be used for constructing the data model itself.
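As an illustration of the first point, once the metadata yields access to the underlying values, simple statistics can be derived from them. The helper below is a hypothetical sketch, not part of the Origins API:

```python
def column_stats(values):
    """Compute simple descriptive statistics for a column's values.

    Hypothetical sketch of the kind of statistics a physical backend
    could derive once the metadata provides access to the data itself.
    """
    non_null = [v for v in values if v is not None]
    return {
        'count': len(values),
        'nulls': len(values) - len(non_null),
        'distinct': len(set(non_null)),
    }

# Example: values fetched for the Employee.Title column shown above
stats = column_stats(['Manager', 'Clerk', 'Manager', None])
```

A backend could attach such a summary to each column node alongside its structural props.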
Some elements are explicitly or semantically related to one another, for example through a referential constraint or as synonyms in an ontology. These are generally referred to as relationships.
- General information purposes
- Registry of resources for a project, team, organization, etc.
- Cross-organization collaboration
- Annotate with high-level transformations required to move data out of systems
- Data provenance
- Includes lineage of data
- Invalidation of references
- If A references B and B changes, the reference to B may no longer be valid
- Definition of logical resources for ETL workflows
- Shopping cart of elements across resources that may be required or useful for a project
- "Common Data Elements"
- Logical resource containing the canonical elements that other physical elements map to across resources
- Normalized data access layer
- Metadata can be used to generate queries, statements, expressions, web requests, etc. for fetching the underlying data
- As with any programmatic access layer (e.g. an ORM), the data model needs to be expressed somewhere that it can be mapped to the underlying system's interface, i.e. a language (SQL), an API, or a protocol
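As a concrete sketch of that last point, the table and column names from the metadata are enough to assemble a SQL statement. The helper below is hypothetical, not part of Origins:

```python
def select_statement(table, columns, limit=None):
    """Build a SELECT statement from table/column metadata.

    Hypothetical sketch: Origins exposes the metadata; generating
    statements like this is one way a consumer could use it.
    """
    stmt = 'SELECT {} FROM "{}"'.format(
        ', '.join('"{}"'.format(c) for c in columns), table)
    if limit is not None:
        stmt += ' LIMIT {:d}'.format(limit)
    return stmt

# Column names taken from the Employee table in the quick usage example
query = select_statement('Employee', ['Title', 'BirthDate'], limit=10)
```

The same metadata could just as well drive a MongoDB query or an HTTP request for a web-service backend.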
A backend is composed of a client and, optionally, a set of classes for each structural component in the metadata, such as database, table, and column for relational databases.
The client does all the heavy lifting of connecting to the resource and extracting the metadata. If the structural classes are available (all built-in backends have these), the metadata can be accessed and traversed using a simple hierarchy/graph API (see the Quick Usage example above).
Backends are grouped by type. The first section lists the backend name and any dependencies that must be installed for the backend.
One or more options can be passed to the backend. Hierarchy lists the path from the origin to the elements; for example, `db.tables` will access the table nodes and `db.tables['Employee'].columns` will access all the columns on the `Employee` table.
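The name-or-index lookup used in expressions like `db.tables['Employee']` can be sketched in plain Python; the `NodeSet` class below is a simplified stand-in, not the actual Origins implementation:

```python
class NodeSet(tuple):
    """Simplified stand-in for an Origins node collection.

    Behaves like a tuple but also supports lookup by node name,
    mirroring expressions such as db.tables['Employee'].
    """
    def __getitem__(self, key):
        if isinstance(key, str):
            for node in self:
                if node['name'] == key:
                    return node
            raise KeyError(key)
        return tuple.__getitem__(self, key)

tables = NodeSet([{'name': 'Album'}, {'name': 'Employee'}])
tables['Employee']   # lookup by name
tables[0]            # lookup by index
```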
sqlite
postgresql
- requires psycopg2
mysql
- requires PyMySQL or MySQL-python
oracle
- requires cx_Oracle
Options
database
- name of the database
host
- host of the server
port
- port of the server
user
- user for authentication
password
- password for authentication
Hierarchy
database
schemas (PostgreSQL only)
tables
columns
Note: In addition to supporting schemas, the PostgreSQL backend also provides direct access to the tables under the `public` schema via the `tables` property.
mongodb
Options
database
- name of the database
host
- host of the server
port
- port of the server
user
- user for authentication
password
- password for authentication
Hierarchy
database
collections
fields
Note: The fields of nested documents are not indexed; however, this could be implemented as an option if a use case presents itself.
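The field indexing described here amounts to counting which top-level keys occur across a collection's documents. The function below is a simplified illustration, not the backend's actual code:

```python
from collections import Counter

def field_occurrences(documents):
    """Count how many documents each top-level field appears in.

    Nested document fields are ignored, matching the note above.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.keys())
    return dict(counts)

# Hypothetical documents from a MongoDB collection
docs = [{'_id': 1, 'name': 'a'}, {'_id': 2, 'name': 'b', 'tags': []}]
occurrences = field_occurrences(docs)
```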
delimited
- General backend for accessing fixed-width delimited files
csv
- Alias for the `,` delimiter
tab
- Alias for the `\t` delimiter
datadict
- General backend for data dictionary-style delimited files
Options
path
- Path to the file
delimiter
- The delimiter between fields; defaults to comma
header
- A list/tuple of column names. If not specified, the header will be detected if it exists; otherwise the column names will be the indices.
sniff
- The number of bytes of the file to use when detecting the header
dialect
- A csv.Dialect instance. This will be detected if not specified.
Hierarchy
file
columns (delimited) | fields (datadict)
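The detection behind the sniff and dialect options can be illustrated with the standard library's csv.Sniffer; this shows the mechanism, not the backend's exact code:

```python
import csv
import io

# A small sample such as the first `sniff` bytes of a file
sample = 'name,age\nalice,34\nbob,29\n'

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # detect the delimiter/dialect
has_header = sniffer.has_header(sample)  # detect whether a header row exists

reader = csv.reader(io.StringIO(sample), dialect)
header = next(reader) if has_header else None
```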
excel
Options
path
- Path to the file
headers
- If True, the first row on each sheet will be assumed to be the header. If False, the column indices will be used. If a list/tuple, the column names will apply to the first sheet. If a dict, the keys are sheet names and the values are a list/tuple of column names for the sheet.
Hierarchy
workbook
sheets
columns
Note: Sheets are assumed to be fixed width based on the first row.
directory
Options
path
- Path to directory
Hierarchy
directory
files
vcf
Options
path
- Path to the VCF file
Hierarchy
file
fields
redcap-mysql
- depends on MySQL backend
Options
project
- name of the project to access
database
- name of the database (defaults to 'redcap')
host
- host of the server
port
- port of the server
user
- user for authentication
password
- password for authentication
Hierarchy
project
forms
fields
redcap-api
- requires PyCap
Options
url
- REDCap API URL
token
- REDCap API token for the project
name
- Name of the project being accessed (this is merely an identifier). Note: this is required since PyCap does not currently expose the name of the project through its API.
Hierarchy
project
forms
fields
redcap-csv
Options
path
- Path to the REDCap data dictionary CSV file
Hierarchy
project
forms
fields
harvest
Options
url
- Harvest API URL
token
- Harvest API token, if authentication is required
Hierarchy
application
categories
concepts
fields
(Work in Progress)
View the example using SQLite
The resource exporter returns a JSON-compatible format representing the nodes and relationships from the "connect" API above. This format is intended to be consumed by the Origins graph API; however, it can also serve as a general-purpose format for other consumers.
See an example usage
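A minimal sketch of what such a nodes-and-relationships payload could look like; the key names below are illustrative, not the exporter's actual schema:

```python
import json

# Hypothetical payload shape; the real exporter's key names may differ.
export = {
    'nodes': [
        {'id': 'chinook/Employee', 'label': 'Table',
         'props': {'name': 'Employee'}},
        {'id': 'chinook/Employee/Title', 'label': 'Column',
         'props': {'name': 'Title'}},
    ],
    'relationships': [
        {'start': 'chinook/Employee', 'end': 'chinook/Employee/Title',
         'type': 'CONTAINS'},
    ],
}

payload = json.dumps(export)  # JSON-compatible by construction
```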
- Support for exporting to Neo4j has been removed from Origins in favor of the APIs available in graphlib, which Origins currently depends on.