A package that can be called with metadata in JSON format.
Imagine running a Databricks job from DataFactory. The Databricks job is just a pure function that accepts the details of the job from DataFactory (stored in git, of course).
The complexity of the processing can be expanded upon, but the package enforces a consistent and, hopefully, well-tested way of reading, transforming, joining, aggregating, and writing data.
So, for example:
```json
[
    {
        "input": ["a"],
        "output": {"path": "b"},
        "transformations": []
    },
    {
        "input": ["b", "c"],
        "output": {"path": "d"},
        "transformations": []
    },
    {
        "input": ["b", "c"],
        "output": {"path": "e"},
        "transformations": []
    }
]
```
This metadata could be passed to the package, which would generate datasets "d" and "e", using "a", "b", and "c".
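The package's internals are out of scope here, but a rough sketch of the idea might look as follows. The function names, the metadata.json file name, and the use of PySpark with Parquet sources are illustrative assumptions, not the package's actual API:

```python
# Hypothetical sketch of a driver that consumes the metadata above.
# Function names, the metadata.json file name, PySpark, and the Parquet
# format are assumptions for illustration, not the package's actual API.
import json

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()


def load_metadata(path: str) -> list:
    """Read the list of job definitions from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


def apply_transformations(inputs: dict, transformations: list) -> DataFrame:
    """Combine and reshape the inputs as directed by the metadata.

    With an empty transformation list (as in the example above), the
    first input is passed through unchanged.
    """
    df = next(iter(inputs.values()))
    for transformation in transformations:
        ...  # dispatch on transformation type: filter, join, aggregate, ...
    return df


def run(jobs: list) -> None:
    """Read each job's inputs, transform them, and write the output."""
    for job in jobs:
        # Job 1 writes "b", which jobs 2 and 3 then read back as input.
        inputs = {path: spark.read.parquet(path) for path in job['input']}
        result = apply_transformations(inputs, job['transformations'])
        result.write.mode('overwrite').parquet(job['output']['path'])


if __name__ == '__main__':
    run(load_metadata('metadata.json'))
```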
- Clone the repository.
- Next, if you haven't done so already, create a virtual environment by running the following commands, or create your own virtualenv using your preferred tool:
pip install pipenv
pipenv install -d
pipenv shell
- Install the project in editable/develop mode:
make dev
- Run the Python code:
python -m metadata_driven.main mnt/demo/*.csv
- Now that the project is installed, we want to see whether the tests run successfully:
make lint
make test
- Now, some caching folders may have been generated. Whenever you want to clean up your project, run:
make clean
-
Make sure that you create a feature branch if you are about to make changs to this repository:
git checkout -b feature/my-feature
- After implementing your feature, run the tests again:
make lint && make test
- Install jsonschema for proper template validation.
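As a minimal sketch of what such validation could look like, assuming a hand-written schema inferred from the example metadata above (the schema and the metadata.json file name are placeholders, not necessarily what the package ships with):

```python
# Illustrative validation of the metadata template with jsonschema.
# The schema below is inferred from the example metadata; the package's
# actual schema may be stricter.
import json

from jsonschema import validate

SCHEMA = {
    'type': 'array',
    'items': {
        'type': 'object',
        'required': ['input', 'output', 'transformations'],
        'properties': {
            'input': {'type': 'array', 'items': {'type': 'string'}},
            'output': {
                'type': 'object',
                'required': ['path'],
                'properties': {'path': {'type': 'string'}},
            },
            'transformations': {'type': 'array'},
        },
    },
}

with open('metadata.json') as handle:
    # Raises jsonschema.ValidationError if the metadata does not match.
    validate(instance=json.load(handle), schema=SCHEMA)
```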