Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Hashmap Data Cataloger

About

The Hashmap Data Cataloger is a utility that can be used to catalog (read) data assets such as Databases, Schemas, and Tables from a given source system and map (write) them into a given destination system.

Using Hashmap Data Cataloger

hashmap-data-cataloger (hdc) can be invoked from the command line interface (next section) or as a library of APIs.

Setup

Package Installation

This tool is available on PyPI and can be installed as:

pip install hashmap-data-cataloger

This will install hashmap-data-cataloger and all of its dependencies.

Connection configuration file setup

The hdc tool is a configuration driven application that depends on 3 types of configurations encoded as YAML.

Application Configuration

The hdc tool uses this YAML file to define the supported sources, destinations, and corresponding mappers, so that it can configure itself for the 'map' or 'catalog' functions. The layout of this file looks like this.

The default version comes with pre-configured sources, destinations, and mappers that can be used as is when invoking 'hdc' from the CLI or through an API call (see examples below). The user only needs to update the connection profile for each source/destination individually under the section 'connection_profiles'. Each profile name being updated should match one of the profile names in the profile.yml file.

You can override the default version file from CLI by using the '-c' option followed by the path of the custom YAML file. However, it must conform to the format linked above.

To create a default YAML configuration file do the following:

  1. Using any text editor create a file like this and save as 'app_config.yml'
  2. Create a hidden directory in the User's root with the name '.hdc'
  3. Move the 'app_config.yml' into the hidden directory created above.

Connection Profile Configuration

The hdc tool uses this YAML file to provide the connection details for source and destination databases. The elements required in the YAML file and their layout look like this. Presently, the connections are secured via user credentials.

You cannot override this file from the CLI, so it must be made available beforehand as follows:

  1. Using any text editor create a file like this and save as 'profile.yml'
  2. Create a hidden directory in the User's root with the name '.hdc'
  3. Move the 'profile.yml' into the hidden directory created above.

Log Settings Configuration

The hdc tool uses this YAML file to configure the log settings (Python's logging). The elements required in the YAML file and their layout look like this.

You can override this file from CLI using the '-l' option followed by the path of the custom YAML file. However, it must conform to the format linked above.
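For orientation only: the exact layout hdc expects is the one linked above and is not reproduced here. YAML log settings for Python's logging commonly mirror the standard `logging.config.dictConfig` schema, so the sketch below shows that schema as a plain dictionary. Every key value, the formatter name 'plain', and the logger name 'hdc_example' are illustrative assumptions, not hdc's actual settings.

```python
import logging
import logging.config

# Assumption: an illustrative dictConfig-style layout, NOT hdc's real file.
# The same structure written as YAML would load into an equivalent dict.
LOG_SETTINGS = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "plain",
            "level": "INFO",
        },
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}


def configure_logging(settings=LOG_SETTINGS):
    """Apply the settings and return a logger (the name is hypothetical)."""
    logging.config.dictConfig(settings)
    return logging.getLogger("hdc_example")
```

Calling `configure_logging().info("ready")` would then write one formatted line to the console.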

To create a default YAML configuration file do the following:

  1. Using any text editor create a file like this and save as 'log_settings.yml'
  2. Create a hidden directory in the User's root with the name '.hdc'
  3. Move the 'log_settings.yml' into the hidden directory created above.
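The three sets of steps above follow the same pattern, so they can be scripted. The following is a minimal sketch, assuming that "the User's root" means the user's home directory; the helper name `install_config` and its `home` parameter are hypothetical and are not part of hdc.

```python
import shutil
from pathlib import Path

# File names taken from the setup steps in this README.
CONFIG_NAMES = ("app_config.yml", "profile.yml", "log_settings.yml")


def install_config(source_file, home=None):
    """Move one config file into <home>/.hdc and return its new path.

    Assumption: the hidden '.hdc' directory lives in the user's home
    directory. 'home' can be overridden, which is handy for testing.
    """
    source_file = Path(source_file)
    if source_file.name not in CONFIG_NAMES:
        raise ValueError(f"unexpected config file name: {source_file.name}")

    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".hdc"
    # Step 2: create the hidden directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)

    # Step 3: move the file into the hidden directory.
    target = target_dir / source_file.name
    shutil.move(str(source_file), str(target))
    return target
```

For example, `install_config("profile.yml")` would move a 'profile.yml' from the current directory into the '.hdc' directory.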

Running from CLI

Once the package is installed along with its dependencies, invoke it from the command line as:

usage: hdc [-h] -r {catalog,map} -s SOURCE [-d DESTINATION] [-c APP_CONFIG] [-l LOG_SETTINGS] 

optional arguments:
 -h, --help            show this help message and exit
 -r {catalog,map}, --run {catalog,map}
                       One of 'catalog' or 'map'
 -s SOURCE, --source SOURCE
                       Name of any one of sources configured in hdc.yml
 -d DESTINATION, --destination DESTINATION
                       Name of any one of destinations configured in hdc.yml
 -c APP_CONFIG, --app_config APP_CONFIG
                       Path to application config (YAML) file if other than default
 -l LOG_SETTINGS, --log_settings LOG_SETTINGS
                       Path to log settings (YAML) file if other than default

For example:
  python3 -m hdc -r catalog -s oracle  
  python3 -m hdc -r map -s oracle -d snowflake         
  python3 -m hdc -r map -s netezza -d snowflake
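
When hdc has to be launched from another Python script, the same command lines can be assembled programmatically. This is a sketch only: the flags come from the usage text above, while the helper names `build_hdc_command` and `run_hdc` are hypothetical. What exit code hdc returns on failure is not documented here, so the code simply passes it through.

```python
import subprocess
import sys


def build_hdc_command(run, source, destination=None,
                      app_config=None, log_settings=None):
    """Build the argument list for 'python -m hdc' from the documented flags."""
    if run not in ("catalog", "map"):
        raise ValueError("run must be 'catalog' or 'map'")

    cmd = [sys.executable, "-m", "hdc", "-r", run, "-s", source]
    if destination is not None:
        cmd += ["-d", destination]
    if app_config is not None:
        cmd += ["-c", str(app_config)]      # custom application config
    if log_settings is not None:
        cmd += ["-l", str(log_settings)]    # custom log settings
    return cmd


def run_hdc(run, source, **options):
    """Run hdc as a child process and return its exit code unchanged."""
    completed = subprocess.run(build_hdc_command(run, source, **options))
    return completed.returncode
```

For instance, `run_hdc("map", "oracle", destination="snowflake")` corresponds to the second example above.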

Using as API

Other applications could import hdc as a library and make use of the cataloging or mapping functions as explained below.

  1. AssetMapper - Provides a method 'map_assets()' to kick off the crawling, mapping, and writing of data assets from a given source system to a target system, based on the connection profile parameters for each. An AssetMapper object can be created in the following manner:
   asset_mapper = AssetMapper(source = '', destination = '')
   result: bool = asset_mapper.map_assets()

'source' is the str name of any one of the sources configured in the default app_config.yml.

'destination' is the str name of any one of the destinations configured in the default app_config.yml.

  2. Cataloger - Provides a method 'obtain_catalog()' to kick off a crawler process against a given source system and pull the data asset information according to the connection profile parameters. A Cataloger object can be created in the following manner:
   cataloger = Cataloger(source = '' )
   result: pandas.DataFrame = cataloger.obtain_catalog()

'source' is the str name of any one of the sources configured in the default app_config.yml.
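
The two classes can also be combined, for example to map a source only when cataloging finds something. The sketch below is one possible way to do that. It relies only on the constructor arguments and return types described above. Because the import path for AssetMapper and Cataloger is not shown in this README, the classes are passed in as arguments rather than imported; the function name `catalog_then_map` is hypothetical.

```python
def catalog_then_map(cataloger_cls, mapper_cls, source, destination):
    """Catalog 'source' and, if any assets are found, map them to 'destination'.

    cataloger_cls and mapper_cls are expected to behave like hdc's
    Cataloger and AssetMapper as described in this README.
    Returns True only when mapping ran and reported success.
    """
    # obtain_catalog() is documented to return a pandas.DataFrame;
    # len() gives the number of rows (cataloged assets).
    catalog = cataloger_cls(source=source).obtain_catalog()
    if len(catalog) == 0:
        return False  # nothing to map

    # map_assets() is documented to return a bool.
    mapper = mapper_cls(source=source, destination=destination)
    return bool(mapper.map_assets())
```

With the real classes imported from the installed package, a call would look like `catalog_then_map(Cataloger, AssetMapper, "oracle", "snowflake")`.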

Future Roadmap

Selective crawling

At present, the hdc tool crawls through the entire hierarchy of a given database (all schemas, and all tables under each schema). This could be fine-tuned to crawl only selected schemas under a given database.

Using external Key Store

For enhanced user authentication

Allow configuration of external key stores for storing the user authentication details required to connect with source or destination systems. The application shall be able to interact with the external key store based on the configuration provided.

This is to provide a stronger security option instead of directly configuring the user credentials in the profile.yml file.

Notes to developers

OOP Design

UML Class Diagram

Extending capability

Adding new crawler

TBD

Adding new mapper

TBD

Adding new creator

TBD