
LLM-Engineering

Dependencies

  • Python 3.11
  • Poetry 1.8.3
  • Docker 26.0.0

Install

poetry install --without aws
poetry self add 'poethepoet[poetry_plugin]'
pre-commit install

We run all our scripts through Poe the Poet. Once it is installed as a Poetry plugin, no further setup is required.

Configure sensitive information

After you have installed all the dependencies, you must create a .env file with sensitive credentials to run the project.

First, copy our example by running the following:

cp .env.example .env # The file has to be at the root of your repository!

Now, let's understand how to fill in all the variables inside the .env file to get you started.

OpenAI

To authenticate to OpenAI, you must fill in the OPENAI_API_KEY env var with an authentication token.

→ Check out this tutorial to learn how to provide one from OpenAI.

HuggingFace

To authenticate to HuggingFace, you must fill in the HUGGINGFACE_ACCESS_TOKEN env var with an authentication token.

→ Check out this tutorial to learn how to provide one from HuggingFace.
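Once both tokens are in hand, the relevant lines of your .env should look similar to this (the values below are placeholders, not real keys):

OPENAI_API_KEY = "sk-..."
HUGGINGFACE_ACCESS_TOKEN = "hf_..."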

LinkedIn Crawling [Optional]

This step is optional; you can complete the project without it.

If you want to enable LinkedIn crawling, however, you have to fill in your username and password:

LINKEDIN_USERNAME = "str"
LINKEDIN_PASSWORD = "str"

For this to work, you also have to:

  • disable 2FA
  • disable suspicious activity

We also recommend that you:

  • create a dummy profile for crawling
  • crawl only your data

Important

Find more configuration options in the settings.py file. Every variable from the Settings class can be configured through the .env file.
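To illustrate the mechanism, here is a minimal sketch of such a class, assuming pydantic-settings (not the repository's exact code):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Read overrides from the .env file at the repository root.
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    OPENAI_API_KEY: str | None = None
    HUGGINGFACE_ACCESS_TOKEN: str | None = None

settings = Settings()  # every field can now be overridden through .env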

Run Locally

Local Infrastructure

Warning

You need Docker installed.

Start:

poetry poe local-infrastructure-up

Stop:

poetry poe local-infrastructure-down

Warning

When running on macOS, export the following environment variable before starting the server: export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES. Otherwise, the connection between the local server and the pipeline will break. 🔗 More details in this issue.

ZenML is now accessible at:

Web UI: localhost:8237

Default credentials:

  • username: default
  • password: (leave empty)

→🔗 More on ZenML

Qdrant is now accessible at:

REST API: localhost:6333
Web UI: localhost:6333/dashboard
gRPC API: localhost:6334

→🔗 More on Qdrant
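To quickly verify that the container is reachable, here is a minimal sketch using the qdrant-client Python package (an assumption; any HTTP client against the REST API works just as well):

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # empty collection list on a fresh instance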

MongoDB is now accessible at:

database URI: mongodb://decodingml:decodingml@127.0.0.1:27017
database name: twin
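A similar sanity check for MongoDB, sketched with pymongo (an assumption; any MongoDB client works):

from pymongo import MongoClient

client = MongoClient("mongodb://decodingml:decodingml@127.0.0.1:27017")
client.admin.command("ping")  # raises if the server is unreachable
print(client["twin"].list_collection_names())  # inspect the "twin" database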

AWS Infrastructure

We will fill in this section in the future. For now, it is covered only in Chapter 11 of the book.

Run Pipelines

All the pipelines will be orchestrated behind the scenes by ZenML.

To see the pipelines running and their results:

  • go to your ZenML dashboard
  • go to the Pipelines section
  • click on a specific pipeline (e.g., feature_engineering)
  • click on a specific run (e.g., feature_engineering_run_2024_06_20_18_40_24)
  • click on a specific step or artifact to find more details about the run
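Alternatively, you can inspect runs programmatically; here is a minimal sketch using the ZenML Python client (assuming the zenml package pulled in by Poetry):

from zenml.client import Client

client = Client()
for run in client.list_pipeline_runs(size=5):
    print(run.name, run.status)  # e.g., feature_engineering_run_..., completed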

But first, let's understand how we can run all our ML pipelines ↓

Data pipelines

Run the data collection ETL:

poetry poe run-digital-data-etl

Warning

You must have Chrome installed on your system for the LinkedIn and Medium crawlers to work (they use Selenium under the hood). Based on your Chrome version, the matching Chromedriver will be installed automatically to enable Selenium support. If you don't want to install Chrome, you can run everything using our Docker image instead. For example, poetry poe run-docker-end-to-end-data-pipeline runs all the pipelines combined; the command can be tweaked to support any other pipeline.

Important

To add additional links to collect from, go to configs_digital_data_etl_[your_name].yaml and add them to the links field. You can also create a completely new file and specify it at runtime, like this: python -m llm_engineering.interfaces.orchestrator.run --run-etl --etl-config-filename configs_digital_data_etl_[your_name].yaml
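For reference, a hypothetical ETL config could look like the sketch below; the links field is the one mentioned above, while the other field names are assumptions, so mirror the existing config files:

parameters:
  user_full_name: Your Name  # assumed field; check an existing config
  links:  # the URLs to crawl
    - https://medium.com/@your-handle/an-article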

Run the feature engineering pipeline:

poetry poe run-feature-engineering-pipeline

Run the dataset generation pipeline:

poetry poe run-generate-instruct-datasets-pipeline

Run all of the above combined into a single pipeline:

poetry poe run-end-to-end-data-pipeline

Utility pipelines

Export ZenML artifacts to JSON:

poetry poe run-export-artifact-to-json-pipeline

Training pipelines

poetry poe run-training-pipeline

Linting & Formatting (QA)

Check and fix your linting issues:

poetry poe lint-check
poetry poe lint-fix

Check and fix your formatting issues:

poetry poe format-check
poetry poe format-fix