/aftafa

yet another lightweight ELT project in Python

Primary language: Python. License: MIT.

🏺 aftafa data pipeline

Work-in-progress lightweight Python ELT library for the e-commerce domain (OZON, Wildberries, Yandex Market, etc.).

⚠️ This project is under heavy development and is not yet a usable library: many features are still missing, and its structure will certainly be modified, refactored, and documented.

Overview

This module can be helpful even if you are not trying to build a full ETL pipeline and only want to fetch data from a marketplace API via a convenient client. Keep in mind that the API methods provided in this module (not to be confused with HTTP methods) can only fetch data; they cannot change it in the ways documented by the source vendor (e.g. for marketplaces: changing prices, adding new products, refreshing dropshipping stocks, etc.). Inspired by dlt, meltano, cloudquery, benthos (bento), file.d.

Usage

Architecture overview.

foo@bar:~$ git clone https://github.com/makualiyev/aftafa.git
foo@bar:~$ cd aftafa
foo@bar:~$ python3 -m venv venv
foo@bar:~$ source venv/bin/activate
foo@bar:~$ python3 -m pip install -e .
foo@bar:~$ python3 -m pip install -r requirements.dev.txt
foo@bar:~$ python3 -m aftafa --version
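The last command above prints the package version. Since the TODO list mentions replacing click with argparse, here is a minimal sketch of how a `--version` flag could be wired with the standard library alone. The `build_parser` and `main` names and the `0.0.0` version string are hypothetical placeholders, not the project's actual code.

```python
import argparse
from typing import Optional, Sequence

# Hypothetical placeholder: the real version string is not stated in this README.
__version__ = "0.0.0"


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser that only knows about --version."""
    parser = argparse.ArgumentParser(
        prog="aftafa",
        description="Lightweight ELT pipeline (sketch).",
    )
    # The built-in "version" action prints the string and exits with status 0.
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {__version__}",
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments; argv=None makes argparse read sys.argv[1:]."""
    build_parser().parse_args(argv)
    return 0
```

If something like this lived in a package's `__main__.py` and ended with `raise SystemExit(main())`, running `python3 -m <package> --version` would print the program name and version, with no third-party CLI dependency.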

You can check a working example of a pipeline moving data from email to a raw file here.

Development stack / ideas

  • Current:
    • OS: Windows 10 / Ubuntu 22
    • Text Editor, IDE: Visual Studio Code / Vim
    • Linters, type checkers, etc.: not decided yet (mypy, pylint)
    • Database: PostgreSQL 14 / PostgreSQL 15, duckdb
    • Python libraries: SQLAlchemy, pydantic, pandas, xlwings
    • Development design: I assume it can be called TDD
  • Plans:
  • Ideas:
    • Dev concerns
      • too many dependencies cause bloat; the venv folder weighs 375.7 MB

TODO list

  • remove unused dependencies
    • remove xlwings
    • replace click with argparse
    • gspread maybe?
    • remove jupyter + ipython
  • configuration file parser
  • implement the DataSource source class; change the naive extract method to return a generator (yield records one at a time -> more efficient)
    • JSONDataSource -> JSONDataDestination is not working correctly, check it with supply_orders JSON file from OZON.
  • implement logging and replace all print statements with logging calls
  • add poetry to the project
  • add schema resolver in utils. It'll help us with parsing the Swagger/OpenAPI documentation
  • add test suite --- python -m unittest test\client\ozon\test_ozon_supplier.py
  • OData client -> write a parser for $metadata XML file for ec
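
The TODO item about having the extract method return generators can be illustrated with a minimal sketch. The class layout, constructor argument, and method signatures below are assumptions made for illustration; they reuse the `DataSource` and `JSONDataSource` names from the list but are not the project's actual implementation.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, Iterator, Union


class DataSource(ABC):
    """Hypothetical base class: a source yields records one at a time."""

    @abstractmethod
    def extract(self) -> Iterator[Dict[str, Any]]:
        """Yield records lazily instead of returning a full list."""


class JSONDataSource(DataSource):
    """Hypothetical source that reads records from a JSON file."""

    def __init__(self, path: Union[str, Path]) -> None:
        self.path = Path(path)

    def extract(self) -> Iterator[Dict[str, Any]]:
        # Note: json.load still reads the whole file into memory.
        # The generator only lets downstream steps consume records
        # one by one without building further intermediate lists.
        with self.path.open(encoding="utf-8") as fh:
            payload = json.load(fh)
        # Treat a top-level object as a single record.
        records = payload if isinstance(payload, list) else [payload]
        for record in records:
            yield record
```

A destination could then iterate over `JSONDataSource("orders.json").extract()` (the file name is a placeholder) and write each record as it arrives. For genuinely large files, a streaming parser or a line-delimited JSON format would be needed to avoid loading everything at once.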