Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Koheesio


Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

The framework is versatile, aiming to support multiple implementations and working seamlessly with various data processing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the underlying technology or data scale.

Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.

Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable data pipelines.

What sets Koheesio apart from other libraries?

Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.

Koheesio aims to provide a rich set of features, including readers, writers, and transformations, for any type of data processing. Koheesio is not in competition with other libraries; it aims to offer wide-ranging support and to be useful in a multitude of scenarios. Our preference is for integration, not competition.

We invite contributions from all, promoting collaboration and innovation in the data engineering community.

Koheesio Core Components

Here are the key components included in Koheesio:

  • Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.

    ┌─────────┐        ┌──────────────────┐        ┌──────────┐
    │ Input 1 │───────▶│                  ├───────▶│ Output 1 │
    └─────────┘        │                  │        └──────────┘
                       │                  │
    ┌─────────┐        │                  │        ┌──────────┐
    │ Input 2 │───────▶│       Step       │───────▶│ Output 2 │
    └─────────┘        │                  │        └──────────┘
                       │                  │
    ┌─────────┐        │                  │        ┌──────────┐
    │ Input 3 │───────▶│                  ├───────▶│ Output 3 │
    └─────────┘        └──────────────────┘        └──────────┘
    
  • Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.

  • Logger: This is a class for logging messages at different levels.
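The three components above can be illustrated with a minimal, library-independent sketch. It deliberately does not use Koheesio's own classes: `Context`, `GreetStep`, and the `execute` method shown here are hypothetical stand-ins built on the Python standard library, intended only to show how a step takes inputs, reads shared configuration, logs its progress, and produces outputs.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass, field
from typing import Any

# Logger: emits messages at different levels (debug, info, warning, ...).
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline_sketch")


@dataclass
class Context:
    """Hypothetical stand-in for a shared configuration object."""

    values: dict[str, Any] = field(default_factory=dict)

    def get(self, key: str, default: Any = None) -> Any:
        # Look up a shared variable, falling back to a default.
        return self.values.get(key, default)


@dataclass
class GreetStep:
    """Hypothetical step: named inputs go in, named outputs come out."""

    name: str  # input supplied directly by the caller
    greeting: str  # input that is typically filled from the Context

    def execute(self) -> dict[str, Any]:
        logger.info("Running %s", type(self).__name__)
        message = f"{self.greeting}, {self.name}!"
        # Outputs are returned as a mapping of output name to value.
        return {"message": message, "length": len(message)}


# The Context carries configuration that several steps could share.
context = Context({"greeting": "Hello"})
step = GreetStep(name="Koheesio", greeting=context.get("greeting", "Hi"))
output = step.execute()
print(output["message"])
```

In Koheesio itself, typing and validation of inputs are handled through Pydantic rather than plain dataclasses; consult the project documentation for the actual class names and signatures.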

Installation

You can install Koheesio using pip, Hatch, or Poetry.

Using Pip

To install Koheesio using pip, run the following command in your terminal:

pip install koheesio

Using Hatch

If you're using Hatch for package management, you can add Koheesio to your project by adding koheesio to the dependencies list in your pyproject.toml:

[project]
dependencies = [
    "koheesio==<version>",
]

Using Poetry

If you're using poetry for package management, you can add Koheesio to your project with the following command:

poetry add koheesio

or add the following line to your pyproject.toml (under [tool.poetry.dependencies]), making sure to replace ... with the version you want to have installed:

koheesio = {version = "..."}
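Whichever installer you use, it can be handy to confirm that the package actually landed in the active environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is an illustrative choice, not part of Koheesio.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


koheesio_version = installed_version("koheesio")
if koheesio_version is None:
    print("koheesio is not installed in this environment")
else:
    print(f"koheesio {koheesio_version} is installed")
```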

Features

Koheesio also provides some additional features that can be useful in certain scenarios. These include:

  • Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations module.
    • Installable through the se extra.
    • Spark Expectations (SE) provides data quality checks for Spark DataFrames. For more information, refer to the Spark Expectations docs.
  • Box: Available through the koheesio.steps.integration.box module.
    • Installable through the box extra.
    • Box is a cloud content management and file sharing service for businesses.
  • SFTP: Available through the koheesio.steps.integration.spark.sftp module.
    • Installable through the sftp extra.
    • SFTP is a network protocol used for secure file transfer over a secure shell.

Note:
Some of the steps require extra dependencies. See the Features section above for additional info.
With Poetry, extras can be added by including extras=['name_of_the_extra'] in the pyproject.toml entry shown in the Using Poetry section.
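To see which optional integrations are usable in a given environment, one option is to try importing the module paths listed in the Features section. This is a rough sketch, not an official Koheesio utility: `EXTRA_MODULES` and `can_import` are hypothetical names, and the check assumes that a missing extra surfaces as an `ImportError` when the module is imported. Note that importing a module runs its top-level code.

```python
from importlib import import_module

# Module paths as listed in the Features section, keyed by extra name.
EXTRA_MODULES = {
    "se": "koheesio.steps.integration.spark.dq.spark_expectations",
    "box": "koheesio.steps.integration.box",
    "sftp": "koheesio.steps.integration.spark.sftp",
}


def can_import(module_path: str) -> bool:
    """Return True if the module imports cleanly, False on ImportError."""
    try:
        import_module(module_path)
    except ImportError:
        # Covers ModuleNotFoundError too, e.g. a missing third-party package.
        return False
    return True


status = {extra: can_import(path) for extra, path in EXTRA_MODULES.items()}
for extra, available in status.items():
    label = "available" if available else "not available"
    print(f"{extra}: {label}")
```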

Contributing

How to Contribute

We welcome contributions to our project! Here's a brief overview of our development process:

  • Code Standards: We use pylint, black, and mypy to maintain code standards. Please ensure your code passes these checks by running make check. No errors or warnings should be reported by the linter before you submit a pull request.

  • Testing: We use pytest for testing. Run the tests with make test and ensure all tests pass before submitting a pull request.

  • Release Process: We aim for frequent releases. Typically when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.

For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.

Additional Resources