Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
The framework aims to support multiple implementations and to work with a range of data processing libraries and frameworks, so that pipelines can be built independently of the underlying technology or data scale.
Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.
Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it a good fit for developers and organizations seeking to build robust and adaptable data pipelines.
Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.
Koheesio aims to provide a rich set of features, including readers, writers, and transformations, for any type of data processing. It is not in competition with other libraries: its aim is to offer wide-ranging support and to be useful in a multitude of scenarios. Our preference is for integration, not competition.
We invite contributions from all, promoting collaboration and innovation in the data engineering community.
Here are the key components included in Koheesio:
- Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.

    ┌─────────┐        ┌──────────────────┐        ┌──────────┐
    │ Input 1 │───────▶│                  ├───────▶│ Output 1 │
    └─────────┘        │                  │        └──────────┘
                       │                  │
    ┌─────────┐        │                  │        ┌──────────┐
    │ Input 2 │───────▶│       Step       │───────▶│ Output 2 │
    └─────────┘        │                  │        └──────────┘
                       │                  │
    ┌─────────┐        │                  │        ┌──────────┐
    │ Input 3 │───────▶│                  ├───────▶│ Output 3 │
    └─────────┘        └──────────────────┘        └──────────┘

- Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
- Logger: This is a class for logging messages at different levels.
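To make the three components concrete, here is a minimal, library-independent sketch of how a step, a shared context, and a logger can fit together. It does not use Koheesio's API: the names SketchStep, AddStep, ScaleStep, and scale_factor are hypothetical, and standard-library dataclasses stand in for the Pydantic models Koheesio itself uses, purely to keep the example dependency-free.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline_sketch")


@dataclass
class SketchStep:
    """Hypothetical base class: one unit of work that fills an output dict."""

    output: Dict[str, Any] = field(default_factory=dict, init=False)

    def execute(self) -> None:
        raise NotImplementedError

    def run(self) -> Dict[str, Any]:
        # Log at INFO level, run the step's logic, then hand back its outputs.
        logger.info("Running %s", type(self).__name__)
        self.execute()
        return self.output


@dataclass
class AddStep(SketchStep):
    """Takes two numeric inputs and produces their sum as one output."""

    a: float = 0.0
    b: float = 0.0

    def execute(self) -> None:
        self.output["total"] = self.a + self.b


@dataclass
class ScaleStep(SketchStep):
    """Adapts its behavior to a shared context (here, a plain dict)."""

    value: float = 0.0
    context: Dict[str, Any] = field(default_factory=dict)

    def execute(self) -> None:
        factor = self.context.get("scale_factor", 1)
        self.output["scaled"] = self.value * factor


# A shared context carries configuration across steps.
context = {"scale_factor": 10}

# Chain two steps: the output of the first becomes an input of the second.
total = AddStep(a=1.0, b=2.0).run()["total"]
scaled = ScaleStep(value=total, context=context).run()["scaled"]
print(scaled)
```

The point of the design is that each step declares its inputs as fields and writes its results to a well-defined output, so steps can be tested in isolation and then composed.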
You can install Koheesio using pip, Hatch, or Poetry.
To install Koheesio using pip, run the following command in your terminal:
pip install koheesio
If you're using Hatch for package management, you can add Koheesio to your project by adding koheesio to the dependencies in your pyproject.toml.
[project]
dependencies = [
    "koheesio==<version>",
]
If you're using poetry for package management, you can add Koheesio to your project with the following command:
poetry add koheesio
or add the following line to your pyproject.toml (under [tool.poetry.dependencies]), making sure to replace ... with the version you want to have installed:
koheesio = {version = "..."}
Koheesio also provides some additional features that can be useful in certain scenarios. These include:
- Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations module.
  - Installable through the se extra.
  - SE provides Data Quality checks for Spark DataFrames. For more information, refer to the Spark Expectations docs.
- Box: Available through the koheesio.steps.integration.box module.
  - Installable through the box extra.
  - Box is a cloud content management and file sharing service for businesses.
- SFTP: Available through the koheesio.steps.integration.spark.sftp module.
  - Installable through the sftp extra.
  - SFTP is a network protocol used for secure file transfer over a secure shell.
Note:
Some of the steps require extra dependencies. See the Features section for additional info.
Extras can be installed by adding extras = ["name_of_the_extra"] to the Poetry TOML entry mentioned above.
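One way to check at runtime whether an extra's dependencies are present is simply to attempt the import. The sketch below assumes that each integration module imports its third-party dependency at import time, which this README does not state. The helper name extra_is_usable is hypothetical and not part of Koheesio; the module paths and extra names are the ones listed above.

```python
import importlib

# Extra name -> module path, as listed in the feature overview above.
EXTRA_MODULES = {
    "se": "koheesio.steps.integration.spark.dq.spark_expectations",
    "box": "koheesio.steps.integration.box",
    "sftp": "koheesio.steps.integration.spark.sftp",
}


def extra_is_usable(module_name: str) -> bool:
    """Return True if the module (and whatever it imports) loads cleanly."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Raised when the module, a parent package, or a dependency is missing.
        return False
    return True


if __name__ == "__main__":
    for extra, module_name in EXTRA_MODULES.items():
        status = "available" if extra_is_usable(module_name) else "missing"
        print(f"{extra}: {status}")
```

If an extra is reported as missing, installing it with pip's standard extras syntax, for example pip install "koheesio[box]", should pull in the required packages.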
We welcome contributions to our project! Here's a brief overview of our development process:
- Code Standards: We use pylint, black, and mypy to maintain code standards. Please ensure your code passes these checks by running make check. No errors or warnings should be reported by the linter before you submit a pull request.
- Testing: We use pytest for testing. Run the tests with make test and ensure all tests pass before submitting a pull request.
- Release Process: We aim for frequent releases. Typically, when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.
For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.