- Python 3.11
- Poetry 1.8.3
- Docker 26.0.0
poetry install --without aws
poetry self add 'poethepoet[poetry_plugin]'
pre-commit install
We run all the scripts using Poe the Poet. All you need to do is install it as a Poetry plugin, as shown above.
After you have installed all the dependencies, you must create a .env file containing the credentials needed to run the project.
First, copy our example by running the following:
cp .env.example .env # The file has to be at the root of your repository!
Now, let's understand how to fill in all the variables inside the .env file to get you started.
To authenticate to OpenAI, you must set the OPENAI_API_KEY environment variable to an authentication token.
→ Check out this tutorial to learn how to obtain one from OpenAI.
To authenticate to HuggingFace, you must set the HUGGINGFACE_ACCESS_TOKEN environment variable to an authentication token.
→ Check out this tutorial to learn how to obtain one from HuggingFace.
This step is optional; you can finish the project without it.
If you want to enable LinkedIn crawling, fill in your username and password:
LINKEDIN_USERNAME = "str"
LINKEDIN_PASSWORD = "str"
For this to work, you also have to:
- disable 2FA
- disable suspicious activity
We also recommend that you:
- create a dummy profile for crawling
- crawl only your data
Important
Find more configuration options in the settings.py file. Every variable from the Settings class can be configured through the .env file.
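Before moving on, it can help to confirm that the .env file actually contains the required variables. The following is a minimal sketch, not part of the project: `parse_env_text`, `missing_keys`, and `REQUIRED_KEYS` are hypothetical helpers, and the parser only understands plain `KEY=VALUE` lines (no inline comments or multi-line values). The Settings class in settings.py remains the authority on what is read and how.

```python
from pathlib import Path

# Variables the text above describes as required; the LinkedIn ones are optional.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_ACCESS_TOKEN")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # expected at the root of the repository
    if not env_path.exists():
        print("No .env file found; run: cp .env.example .env")
    else:
        missing = missing_keys(parse_env_text(env_path.read_text()))
        print("Missing variables:", ", ".join(missing) if missing else "none")
```

Run it from the repository root; it only reads the file and never prints the secret values themselves.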
Warning
You need Docker installed.
Start:
poetry poe local-infrastructure-up
Stop:
poetry poe local-infrastructure-down
Warning
When running on macOS, export the following environment variable before starting the server:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
Otherwise, the connection between the local server and pipeline will break. 🔗 More details in this issue.
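If you launch the server from a Python script rather than a shell, the same flag can be passed through the child process environment. This is a minimal sketch under that assumption; `env_for_local_server` is a hypothetical helper, not something the project provides.

```python
import os
import platform

FORK_SAFETY_VAR = "OBJC_DISABLE_INITIALIZE_FORK_SAFETY"


def env_for_local_server(system: str, base_env: dict[str, str]) -> dict[str, str]:
    """Return a copy of base_env, adding the fork-safety flag on macOS only."""
    env = dict(base_env)  # copy so the caller's mapping is left untouched
    if system == "Darwin":  # platform.system() reports "Darwin" on macOS
        env[FORK_SAFETY_VAR] = "YES"
    return env


if __name__ == "__main__":
    env = env_for_local_server(platform.system(), dict(os.environ))
    print(FORK_SAFETY_VAR, "=", env.get(FORK_SAFETY_VAR, "<not set>"))
```

The returned mapping can be handed to `subprocess.run(..., env=env)` when starting the infrastructure command shown above.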
Web UI: localhost:8237
Default credentials:
- username: default
- password:
→🔗 More on ZenML
REST API: localhost:6333
Web UI: localhost:6333/dashboard
GRPC API: localhost:6334
→🔗 More on Qdrant
database URI: mongodb://decodingml:decodingml@127.0.0.1:27017
database name: twin
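To check that the local infrastructure is up, you can probe the ports listed above. The sketch below uses only the standard library and is an illustration, not a project script: `is_port_open`, `mongo_host_port`, and the `SERVICES` mapping are hypothetical names, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket
from urllib.parse import urlsplit

# Endpoints copied from the lists above.
MONGO_URI = "mongodb://decodingml:decodingml@127.0.0.1:27017"
SERVICES = {
    "ZenML web UI": ("localhost", 8237),
    "Qdrant REST API": ("localhost", 6333),
    "Qdrant gRPC API": ("localhost", 6334),
}


def mongo_host_port(uri: str) -> tuple[str, int]:
    """Extract the host and port from a MongoDB connection URI."""
    parts = urlsplit(uri)
    return parts.hostname, parts.port


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    targets = dict(SERVICES)
    targets["MongoDB"] = mongo_host_port(MONGO_URI)
    for name, (host, port) in targets.items():
        status = "up" if is_port_open(host, port) else "not reachable"
        print(f"{name} ({host}:{port}): {status}")
```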
We will fill in this section in the future. For now, it is covered only in Chapter 11 of the book.
All the pipelines will be orchestrated behind the scenes by ZenML.
To see the pipelines running and their results:
- go to your ZenML dashboard
- go to the Pipelines section
- click on a specific pipeline (e.g., feature_engineering)
- click on a specific run (e.g., feature_engineering_run_2024_06_20_18_40_24)
- click on a specific step or artifact to find more details about the run
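Run names like the one in the example appear to combine the pipeline name with a timestamp. Assuming they follow the pattern `<pipeline>_run_<YYYY_MM_DD_HH_MM_SS>` (inferred only from that single example, not from ZenML documentation), a small hypothetical helper can split them when you want to sort or filter runs:

```python
from datetime import datetime


def split_run_name(run_name: str) -> tuple[str, datetime]:
    """Split '<pipeline>_run_<YYYY_MM_DD_HH_MM_SS>' into its two parts."""
    # rpartition splits on the last "_run_", so pipeline names may contain "_".
    pipeline, separator, stamp = run_name.rpartition("_run_")
    if not separator:
        raise ValueError(f"unexpected run name format: {run_name!r}")
    return pipeline, datetime.strptime(stamp, "%Y_%m_%d_%H_%M_%S")


if __name__ == "__main__":
    name, started = split_run_name("feature_engineering_run_2024_06_20_18_40_24")
    print(name, started.isoformat())
```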
But first, let's understand how we can run all our ML pipelines ↓
Run the data collection ETL:
poetry poe run-digital-data-etl
Warning
You must have Chrome installed on your system for the LinkedIn and Medium crawlers to work, as they use Selenium under the hood. Based on your Chrome version, the matching Chromedriver is installed automatically to enable Selenium support. If you don't want to install Chrome, you can run everything using our Docker image. For example, to run all the pipelines combined, run poetry poe run-docker-end-to-end-data-pipeline. The command can be tweaked to support any other pipeline.
Important
To add more links to collect from, go to configs_digital_data_etl_[your_name].yaml and add them to the links field. You can also create a completely new file and specify it at run time, like this:
python -m llm_engineering.interfaces.orchestrator.run --run-etl --etl-config-filename configs_digital_data_etl_[your_name].yaml
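If you switch between several config files, assembling the command in code avoids retyping it. This is a minimal sketch: `build_etl_command` is a hypothetical helper, and `configs_digital_data_etl_jane_doe.yaml` is a made-up filename standing in for your own. Only the module path and flags come from the command shown above.

```python
import shlex


def build_etl_command(config_filename: str) -> list[str]:
    """Build the argument list for running the ETL with a given config file."""
    return [
        "python",
        "-m",
        "llm_engineering.interfaces.orchestrator.run",
        "--run-etl",
        "--etl-config-filename",
        config_filename,
    ]


if __name__ == "__main__":
    # Hypothetical config filename; replace it with your own file.
    command = build_etl_command("configs_digital_data_etl_jane_doe.yaml")
    print(shlex.join(command))  # shows the command without executing it
```

Passing the list to `subprocess.run(command, check=True)` from the repository root would execute it; printing it first is a cheap way to verify the arguments.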
Run the feature engineering pipeline:
poetry poe run-feature-engineering-pipeline
Run the dataset generation pipeline:
poetry poe run-generate-instruct-datasets-pipeline
Run all of the above combined into a single pipeline:
poetry poe run-end-to-end-data-pipeline
Export ZenML artifacts to JSON:
poetry poe run-export-artifact-to-json-pipeline
Run the training pipeline:
poetry poe run-training-pipeline
Check and fix your linting issues:
poetry poe lint-check
poetry poe lint-fix
Check and fix your formatting issues:
poetry poe format-check
poetry poe format-fix
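To run both read-only checks in one go, for example before committing, the two commands can be wrapped in a short script. This is a sketch, not a project utility: `poe_command` and `run_checks` are hypothetical helpers, and the sketch assumes each Poe task exits with a non-zero status when it finds problems.

```python
import subprocess

# The two read-only checks listed above; the *-fix tasks modify files, so they are left out.
CHECK_TASKS = ("lint-check", "format-check")


def poe_command(task: str) -> list[str]:
    """Build the argument list for a Poe the Poet task run through Poetry."""
    return ["poetry", "poe", task]


def run_checks(tasks=CHECK_TASKS, runner=subprocess.run) -> dict[str, bool]:
    """Run each task and map its name to True when the exit code is 0."""
    results: dict[str, bool] = {}
    for task in tasks:
        completed = runner(poe_command(task), check=False)
        results[task] = completed.returncode == 0
    return results
```

Calling `run_checks()` from the repository root returns a mapping such as `{"lint-check": True, "format-check": False}`. The `runner` parameter exists so the function can be exercised without Poetry installed.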