/yaetos_jobs

Examples of data pipelines using yaetos

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

yaetos_jobs

This repository contains data pipelines built with the yaetos data framework (github.com/arthurprevot/yaetos). The code for these pipelines is in the "jobs" folder. Most pipelines are set up with small sample inputs, so they should work out of the box.

Generative AI:

  • Data pipeline to pull information out of ChatGPT programmatically, to feed into datasets.
  • Data pipeline to fine-tune a "small" open-source LLM called ALBERT for classification, and to run inference. The model is small enough to run on a laptop in minutes for the test case (no GPU needed).
  • Data pipeline to feed documents (PDF, text) to the privateGPT vector database, to add knowledge to a local LLM.

Scientific (Climate data, image processing):

  • Data pipeline to process carbon emissions data from Climate TRACE (https://climatetrace.org/), with a sample dashboard available here.
  • Data pipeline to process images (satellite, medical, etc.) to find contours (at scale, using Spark).
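
To give a rough idea of what "finding contours" means, here is a minimal sketch in plain Python. It is not the code from this repository (which runs on Spark and may use a different algorithm). It assumes the image has already been thresholded into a binary mask, and the function name `contour_pixels` is hypothetical. A foreground pixel is treated as part of a contour when one of its four neighbours is background or lies outside the image.

```python
def contour_pixels(mask):
    """Return (row, col) positions of foreground pixels on the border of a shape.

    `mask` is a list of equal-length rows holding 0 (background) or 1 (foreground).
    This is a simplified border detection, assumed here for illustration only.
    """
    rows, cols = len(mask), len(mask[0])
    contour = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue  # background pixels are never part of a contour
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or not mask[nr][nc]:
                    contour.append((r, c))
                    break  # one background neighbour is enough
    return contour


# A 3x3 square inside a 5x5 image: every square pixel except the centre is on the border.
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
border = contour_pixels(square)
```

Because each image is processed independently, a function of this shape could in principle be applied to many images in parallel, which is the kind of workload Spark is suited to.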

Sales/Marketing:

  • Data pipeline to pull employee contact information out of Apollo.io for a set of companies.
  • Data pipeline to pull information on GitHub contributors using the GitHub API.
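
As an illustration of the GitHub case, the sketch below lists the contributors of a repository through the public GitHub REST API endpoint `GET /repos/{owner}/{repo}/contributors`. It is not the pipeline from the "jobs" folder. The helper names (`contributors_url`, `parse_contributors`, `fetch_contributors`) are hypothetical, and only the first page of results is fetched.

```python
import json
import urllib.request
from urllib.parse import quote

API_ROOT = "https://api.github.com"


def contributors_url(owner, repo, per_page=100, page=1):
    """Build the REST URL that lists contributors for one repository."""
    return (
        f"{API_ROOT}/repos/{quote(owner)}/{quote(repo)}/contributors"
        f"?per_page={per_page}&page={page}"
    )


def parse_contributors(payload):
    """Keep only the login and the contribution count of each contributor."""
    return [
        {"login": item.get("login"), "contributions": item.get("contributions")}
        for item in payload
    ]


def fetch_contributors(owner, repo):
    """Download and parse the first page of contributors (requires network access)."""
    request = urllib.request.Request(
        contributors_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_contributors(json.load(response))
```

Calling `fetch_contributors("arthurprevot", "yaetos")` would return a list of dictionaries suitable for loading into a table. Unauthenticated requests are rate limited, so a real pipeline would normally send an access token.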

Other:

  • Data pipeline to showcase Yaetos core functionality, using public Wikipedia data.

There is lots of room for improvement, and contributions are welcome. Feel free to reach out at arthur@yaetos.com.