Data Consumer Pipeline: Data pipeline for ingesting data in near real time

Project Summary

Pipeline for consuming and processing streaming data from Pub/Sub, integrating with Dataflow for real-time processing.

Development Stack

Python, FastAPI

Cloud Stack (GCP)

  • Pub/Sub: Messaging service provided by GCP for sending and receiving messages between the FastAPI application and the Dataflow pipeline.
  • Dataflow: Serverless data processing service provided by GCP for executing the ETL process.
  • BigQuery: Fully managed, serverless data warehouse provided by GCP for storing and analyzing large datasets.
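
As a rough illustration of the FastAPI/Pub/Sub integration, the snippet below sketches how an endpoint might forward incoming payloads to a Pub/Sub topic. The project ID, topic name and endpoint path are placeholders, not the project's actual configuration.

```python
# Illustrative only: a FastAPI endpoint forwarding incoming events to Pub/Sub.
# Project ID, topic name and payload shape are assumptions for this sketch.
import json

from fastapi import FastAPI
from google.cloud import pubsub_v1

app = FastAPI()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "data-consumer-events")  # hypothetical names


@app.post("/events")
async def publish_event(event: dict) -> dict:
    # Pub/Sub messages are bytes, so serialize the JSON body before publishing.
    future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
    return {"message_id": future.result()}
```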

Continuous Integration and Continuous Deployment (CI/CD, DevOps)

Contributing

See the following docs:

Project Highlights:

  • Hexagonal Architecture: Adoption of Hexagonal Architecture to decouple the core logic from external dependencies, so that any data source can be replaced seamlessly if it becomes unavailable. This is facilitated by adapters, which act as intermediaries between the core application and the external services (see the port/adapter sketch after this list).

  • Comprehensive Testing: Development of tests to ensure the quality and robustness of the code at the various stages of the ETL process (see the unit-test example after this list).

  • Configuration Management: Use of a configuration module to manage project_id and other environment variables, providing flexibility and ease of adjustment (see the configuration sketch after this list).

  • Continuous Integration and Continuous Deployment: Use of CI/CD pipelines to automate the build, test and deployment processes, ensuring that the application is always up-to-date and ready for use.

  • Code Quality: Use of code quality tools such as Black for formatting, pylint for linting and Codecov for coverage reporting, keeping the codebase clean, consistent and easy to read.

  • Documentation: Creation of detailed documentation to facilitate the understanding and use of the application, including installation instructions, usage examples and troubleshooting guides.
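
To make the hexagonal idea concrete, here is a minimal port/adapter sketch in the spirit of the architecture described above; the class and method names are hypothetical and do not reflect the project's actual modules.

```python
# Hypothetical port/adapter sketch; names are illustrative, not the project's own.
import json
from abc import ABC, abstractmethod
from typing import Iterable


class MessageSourcePort(ABC):
    """Port: the core ETL logic depends only on this interface."""

    @abstractmethod
    def fetch_messages(self) -> Iterable[dict]:
        ...


class PubSubAdapter(MessageSourcePort):
    """Adapter: hides the concrete Pub/Sub client behind the port."""

    def __init__(self, subscriber, subscription_path: str):
        self._subscriber = subscriber
        self._subscription_path = subscription_path

    def fetch_messages(self) -> Iterable[dict]:
        # Pull a small batch and hand plain dicts to the core logic,
        # so the core never imports Pub/Sub types directly.
        response = self._subscriber.pull(
            request={"subscription": self._subscription_path, "max_messages": 10}
        )
        return [json.loads(msg.message.data) for msg in response.received_messages]
```

Swapping the data source then only means writing another adapter that satisfies MessageSourcePort; the core pipeline code stays untouched.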
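
As an example of the kind of test this enables, a transform step can be exercised in isolation with pytest; the to_bq_row function below is inlined so the example is self-contained and is not the project's actual transform.

```python
# Hypothetical pytest example for a transform step; the transform is inlined
# here so the test is self-contained.
import json


def to_bq_row(record: dict) -> dict:
    # Example transform: flatten a raw API record into a BigQuery-friendly row.
    return {"id": record["id"], "payload": json.dumps(record)}


def test_to_bq_row_keeps_id_and_serializes_payload():
    record = {"id": "abc-123", "value": 42}
    row = to_bq_row(record)
    assert row["id"] == "abc-123"
    assert json.loads(row["payload"]) == record
```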
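
A configuration module of that kind could look roughly like this; the environment variable names (PROJECT_ID, PUBSUB_TOPIC, BQ_DATASET) and defaults are assumptions for the sketch.

```python
# Illustrative configuration module; environment variable names and defaults
# are assumptions, not the project's real settings.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Values are read from the environment when this module is imported.
    project_id: str = os.getenv("PROJECT_ID", "my-gcp-project")
    pubsub_topic: str = os.getenv("PUBSUB_TOPIC", "data-consumer-events")
    bq_dataset: str = os.getenv("BQ_DATASET", "raw_events")


settings = Settings()  # single shared, immutable settings object
```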

Data Pipeline Process:

  1. Data Extraction: The extraction step makes requests to the source API and obtains the data in JSON format. The requests are issued by parallel workers on Cloud Dataflow to speed up the process.
  2. Data Transformation: The transformation step converts the extracted records to the BigQuery schema, again running on parallel Cloud Dataflow workers.
  3. Data Loading: The loading step writes the transformed records into BigQuery, also in parallel across Cloud Dataflow workers (see the pipeline sketch after this list).
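
For orientation, here is a condensed Apache Beam sketch of those three steps as they might run on Dataflow; the API URL, page range, schema and table name are placeholders rather than the project's real values.

```python
# Illustrative Apache Beam pipeline for the extract/transform/load steps above.
# The API URL, page range, schema and table name are placeholders.
import json

import apache_beam as beam
import requests
from apache_beam.options.pipeline_options import PipelineOptions


def fetch_page(page: int) -> list:
    # Extraction: each Dataflow worker fetches one page of JSON results from the API.
    response = requests.get("https://api.example.com/data", params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()


def to_bq_row(record: dict) -> dict:
    # Transformation: map the raw JSON record onto the BigQuery schema.
    return {"id": record["id"], "payload": json.dumps(record)}


def run() -> None:
    options = PipelineOptions(runner="DataflowRunner", project="my-gcp-project", region="us-central1")
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Pages" >> beam.Create(range(1, 101))      # fan extraction out across workers
            | "Extract" >> beam.FlatMap(fetch_page)
            | "Transform" >> beam.Map(to_bq_row)
            | "Load" >> beam.io.WriteToBigQuery(         # parallel load into BigQuery
                "my-gcp-project:raw_events.api_data",
                schema="id:STRING,payload:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```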