A package that can be called with metadata in JSON format.
Imagine running a Databricks job from DataFactory. The Databricks job is just a pure function that accepts the details of the job from DataFactory (stored in git, of course).
The complexity of the processing can be expanded upon, but the package enforces a consistent and, hopefully, well-tested way of reading, transforming, joining, aggregating, and writing data.
So, for example:
```json
[
    {
        "input": ["a"],
        "output": {"path": "b"},
        "transformations": []
    },
    {
        "input": ["b", "c"],
        "output": {"path": "d"},
        "transformations": []
    },
    {
        "input": ["b", "c"],
        "output": {"path": "e"},
        "transformations": []
    }
]
```
This metadata could be passed to the package, which would generate datasets "d" and "e", using "a", "b", and "c".
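The package's internals are out of scope here, but a rough sketch of the idea might look as follows. The function names, the metadata.json file name, and the use of PySpark with Parquet sources are illustrative assumptions, not the package's actual API:

```python
# Hypothetical sketch of a driver that consumes the metadata above.
# Function names, the metadata.json file name, PySpark, and the Parquet
# format are assumptions for illustration, not the package's actual API.
import json

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()


def load_metadata(path: str) -> list:
    """Read the list of job definitions from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


def apply_transformations(inputs: dict, transformations: list) -> DataFrame:
    """Combine and reshape the inputs as directed by the metadata.

    With an empty transformation list (as in the example above), the
    first input is passed through unchanged.
    """
    df = next(iter(inputs.values()))
    for transformation in transformations:
        ...  # dispatch on transformation type: filter, join, aggregate, ...
    return df


def run(jobs: list) -> None:
    """Read each job's inputs, transform them, and write the output."""
    for job in jobs:
        # Job 1 writes "b", which jobs 2 and 3 then read back as input.
        inputs = {path: spark.read.parquet(path) for path in job['input']}
        result = apply_transformations(inputs, job['transformations'])
        result.write.mode('overwrite').parquet(job['output']['path'])


if __name__ == '__main__':
    run(load_metadata('metadata.json'))
```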
- Clone the repository.
- Next, if you haven't done so already, create a virtual environment by running the following commands, or create your own virtualenv using your preferred tool:
pip install pipenv
pipenv install -d
pipenv shell
- Install the project in editable/develop mode:
make dev
- Run the Python code:
python -m metadata_driven.main mnt/demo/*.csv
- Now that the project is installed, we want to see whether the tests run successfully:
make lint
make test
- Now, some caching folders may have been generated. Whenever you want to clean up your project, run:
make clean
-
Make sure that you create a feature branch if you are about to make changs to this repository:
git checkout -b feature/my-feature
- After implementing your feature, run the tests again:
make lint && make test
- Install jsonschema for proper template validation.
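As a minimal sketch of what such validation could look like, assuming a hand-written schema inferred from the example metadata above (the schema and the metadata.json file name are placeholders, not necessarily what the package ships with):

```python
# Illustrative validation of the metadata template with jsonschema.
# The schema below is inferred from the example metadata; the package's
# actual schema may be stricter.
import json

from jsonschema import validate

SCHEMA = {
    'type': 'array',
    'items': {
        'type': 'object',
        'required': ['input', 'output', 'transformations'],
        'properties': {
            'input': {'type': 'array', 'items': {'type': 'string'}},
            'output': {
                'type': 'object',
                'required': ['path'],
                'properties': {'path': {'type': 'string'}},
            },
            'transformations': {'type': 'array'},
        },
    },
}

with open('metadata.json') as handle:
    # Raises jsonschema.ValidationError if the metadata does not match.
    validate(instance=json.load(handle), schema=SCHEMA)
```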