metadata-driven

A package that can be called with metadata in JSON format.

Imagine running a Databricks job from DataFactory. The Databricks job is just a pure function that accepts the details of the job from DataFactory (stored in git, of course).

The complexity of the processing can be expanded upon, but the package enforces a consistent and, hopefully, well-tested way of reading, transforming, joining, aggregating, and writing data.

So, for example:

[
        {
                "input": ["a"],
                "output": {"path": "b"},
                "transformations": []
        },
        {
                "input": ["b", "c"],
                "output": {"path": "d"},
                "transformations": []
        },
        {
                "input": ["b", "c"],
                "output": {"path": "e"},
                "transformations": []
        }
]

This metadata could be passed to the package, which would generate dataset "b" from "a", and then datasets "d" and "e" from "b" and "c".
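
As a rough illustration, here is a minimal sketch of how one of these entries might be processed with PySpark. Note that process_job, apply_transformation, the Parquet format, and the metadata.json filename are all assumptions made for the example, not this package's actual API:

import json

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()


def apply_transformation(df: DataFrame, spec: dict) -> DataFrame:
    # Placeholder: a real implementation would dispatch on the spec
    # (e.g. filter, join, aggregate) described in the metadata.
    return df


def process_job(job: dict) -> None:
    # Read every input dataset; Parquet is assumed here for simplicity.
    frames = [spark.read.parquet(path) for path in job["input"]]

    # Combine the inputs (a plain union here; real joins would be
    # driven by the transformations listed in the metadata).
    df = frames[0]
    for other in frames[1:]:
        df = df.unionByName(other)

    for spec in job.get("transformations", []):
        df = apply_transformation(df, spec)

    # Write the result to the configured output path.
    df.write.mode("overwrite").parquet(job["output"]["path"])


with open("metadata.json") as f:
    for job in json.load(f):
        process_job(job)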

Development

  1. Clone the repository.

  2. Next, if you haven't done so already, create a virtual environment by running the commands below, or create your own virtualenv using your preferred tool.

     pip install pipenv
     pipenv install -d
     pipenv shell
    
  3. Install the project in editable/develop mode:

     make dev
    
  4. Run the Python code:

     python -m metadata_driven.main mnt/demo/*.csv
    
  5. Now that the project is installed, check whether the tests run successfully:

     make lint
     make test
    
  6. By now, some cache folders may have been generated. Whenever you want to clean up your project, run:

     make clean
    
  7. Make sure that you create a feature branch if you are about to make changes to this repository:

     git checkout -b feature/my-feature
    
  8. After implementing your feature, run the tests again:

     make lint && make test
    

TODO

  1. Install jsonschema for proper template validation.
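
As a rough idea of what that validation could look like with jsonschema, here is a hedged sketch; the schema below is illustrative only, not the package's actual template:

from jsonschema import validate

# Illustrative schema for a single job entry; the real template may differ.
JOB_SCHEMA = {
    "type": "object",
    "required": ["input", "output", "transformations"],
    "properties": {
        "input": {"type": "array", "items": {"type": "string"}},
        "output": {
            "type": "object",
            "required": ["path"],
            "properties": {"path": {"type": "string"}},
        },
        "transformations": {"type": "array"},
    },
}

# Raises jsonschema.ValidationError if the metadata does not match the schema.
validate(
    instance={"input": ["a"], "output": {"path": "b"}, "transformations": []},
    schema=JOB_SCHEMA,
)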