astronomer/astronomer-cosmos

[Bug] cosmos 1.4.1 does not create virtualenv when using `ExecutionMode.VIRTUALENV`

marco9663 opened this issue · 1 comments

Context

I am developing on google cloud composer version composer-2.7.1-airflow-2.6.3 which has dbt-core==1.5.4 installed. Below is a part of the code.

with DAG(
    dag_id="transformation",
    start_date=datetime(2023, 9, 18),
    schedule=dag_config.schedule_interval,
    catchup=False,
    max_active_runs=1,
    doc_md=doc_md,
) as dag:
    transform = DbtTaskGroup(
        group_id="transformations",
        project_config=dbt_project_config,
        profile_config=dbt_profile_config,
        render_config=RenderConfig(
            load_method=LoadMode.DBT_LS_FILE,
            dbt_ls_path=Path(
                DAGS_CONFIG_PATH
                / f"dbt_ls_{dag_config.name}__{dag_entry.name}.txt"
            ),
        ),
        execution_config=ExecutionConfig(
            execution_mode=ExecutionMode.VIRTUALENV,
        ),
        operator_args={
            "py_system_site_packages": True,
            "py_requirements": ["dbt-snowflake~=1.5.5", "dbt-core~=1.5.11"],
            "install_deps": True,
        },
    )

It was supposed to create a virtualenv, install dbt-core version 1.5.11, and use the dbt binary of that version in subprocess.

Previously, using ExecutionMode.VIRTUALENV in Cosmos 1.3.2 worked fine. However, after the introduction of the dbt runner invocation #850, it appears Cosmos now uses the default installed dbt version and does not create a virtual environment. Below is part of the airflow log:

...
[2024-05-27, 14:26:39 UTC] {local.py:185} INFO - dbtRunner is available. Using dbtRunner for invoking dbt.
...
[2024-05-27, 14:26:42 UTC] {eventmgr.py:62} INFO - 14:26:42  Running with dbt=1.5.4
[2024-05-27, 14:26:42 UTC] {logging_mixin.py:150} INFO - 14:26:42  Running with dbt=1.5.4
...
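
One way to confirm from a task log which dbt version actually ran is to search for dbt's startup line. The sketch below is only an illustration and is not part of Cosmos: the function name `dbt_versions_in_log` and the `sample_log` string (built from the two log lines above, with the colour escape code restored) are assumptions made for this example.

```python
import re

# Matches ANSI colour escape sequences such as "\x1b[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Matches the version that dbt prints when it starts.
DBT_VERSION = re.compile(r"Running with dbt=(\d+\.\d+\.\d+)")


def dbt_versions_in_log(log_text: str) -> set[str]:
    """Return every dbt version reported in an Airflow task log."""
    cleaned = ANSI_ESCAPE.sub("", log_text)
    return set(DBT_VERSION.findall(cleaned))


# Hypothetical sample, copied from the log excerpt in this issue.
sample_log = (
    "[2024-05-27, 14:26:42 UTC] {eventmgr.py:62} INFO - "
    "\x1b[0m14:26:42  Running with dbt=1.5.4\n"
    "[2024-05-27, 14:26:42 UTC] {logging_mixin.py:150} INFO - "
    "14:26:42  Running with dbt=1.5.4\n"
)

if __name__ == "__main__":
    print(dbt_versions_in_log(sample_log))
```

If the set contains only the version preinstalled on the worker and none matching the `py_requirements` pin, the task did not run dbt from a virtual environment.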

PR #850 also confirms that virtual environment support with the dbt runner is not yet implemented, stating:

"It may be possible to have this work for virtualenv and would be better suited for a follow-up PR."

I tried to set the invocation mode explicitly to InvocationMode.SUBPROCESS but it is only configurable for ExecutionMode.LOCAL.

if self.invocation_mode and self.execution_mode != ExecutionMode.LOCAL:
    raise CosmosValueError("ExecutionConfig.invocation_mode is only configurable for ExecutionMode.LOCAL.")
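
The effect of that check can be reproduced in isolation. The sketch below does not import Cosmos: `ExecutionMode`, `InvocationMode`, `CosmosValueError`, and `ExecutionConfigSketch` are stand-ins written for this example, and it assumes the check runs when the config object is constructed. The helper `try_config` is likewise hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExecutionMode(Enum):  # stand-in, not the Cosmos enum
    LOCAL = "local"
    VIRTUALENV = "virtualenv"


class InvocationMode(Enum):  # stand-in, not the Cosmos enum
    SUBPROCESS = "subprocess"
    DBT_RUNNER = "dbt_runner"


class CosmosValueError(ValueError):  # stand-in exception
    pass


@dataclass
class ExecutionConfigSketch:
    execution_mode: ExecutionMode = ExecutionMode.LOCAL
    invocation_mode: Optional[InvocationMode] = None

    def __post_init__(self) -> None:
        # Same condition as the check quoted above.
        if self.invocation_mode and self.execution_mode != ExecutionMode.LOCAL:
            raise CosmosValueError(
                "ExecutionConfig.invocation_mode is only configurable for ExecutionMode.LOCAL."
            )


def try_config(execution_mode, invocation_mode) -> str:
    """Build a config and report whether the check accepted it."""
    try:
        ExecutionConfigSketch(execution_mode, invocation_mode)
        return "ok"
    except CosmosValueError as exc:
        return f"rejected: {exc}"


if __name__ == "__main__":
    print(try_config(ExecutionMode.LOCAL, InvocationMode.SUBPROCESS))
    print(try_config(ExecutionMode.VIRTUALENV, InvocationMode.SUBPROCESS))
```

Under these assumptions, any explicit invocation mode combined with a non-local execution mode is rejected, which is why forcing subprocess invocation is not available as a workaround here.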

Hey there, @marco9663! I'm here to help you with any issues you're facing. Whether it's bug fixes, answering questions, or becoming a contributor, I've got your back. Let's work together to solve this!

The issue you're encountering is due to the changes introduced in PR #850, which added support for InvocationMode.DBT_RUNNER but did not fully implement virtual environment support for this mode [1].
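
The difference matters because dbtRunner invokes dbt inside the current Python process, so it can only use the packages installed for that interpreter, while a subprocess can be pointed at a binary inside a virtual environment. The sketch below illustrates this with the standard library only, as a rough analogy rather than Cosmos code: it creates an empty virtual environment (no pip, no dbt) and compares `sys.prefix` in-process with the value seen by the environment's own interpreter. The names `venv_python`, `prefixes`, and the directory name `dbt_venv` are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(venv_dir: Path) -> Path:
    """Locate the interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def prefixes() -> tuple[str, str]:
    """Return (in-process prefix, prefix seen by the venv interpreter)."""
    with tempfile.TemporaryDirectory() as tmp:
        venv_dir = Path(tmp) / "dbt_venv"
        # with_pip=False keeps the example fast and offline.
        venv.create(venv_dir, with_pip=False)
        result = subprocess.run(
            [str(venv_python(venv_dir)), "-c", "import sys; print(sys.prefix)"],
            capture_output=True,
            text=True,
            check=True,
        )
        return sys.prefix, result.stdout.strip()


if __name__ == "__main__":
    in_process, in_venv = prefixes()
    print("in-process:", in_process)
    print("subprocess:", in_venv)
```

Code that runs in-process keeps the first prefix no matter which virtual environment exists on disk; only a subprocess that launches the environment's interpreter or binaries sees the second.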

The intended way to have Cosmos create a virtual environment and use the specified dbt version with ExecutionMode.VIRTUALENV is to set the execution mode on ExecutionConfig and pass the Python requirements through operator_args. Here is an example configuration:

from cosmos.config import ExecutionConfig
from cosmos.constants import ExecutionMode

# Specify the dbt version in py_requirements
py_requirements = ["dbt-snowflake~=1.5.5", "dbt-core~=1.5.11"]

# Create the ExecutionConfig with ExecutionMode.VIRTUALENV,
# setting the dbt project path in the constructor if needed
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.VIRTUALENV,
    dbt_project_path="/path/to/dbt/project",
)

# py_requirements is an operator argument, not an ExecutionConfig field
operator_args = {"py_requirements": py_requirements}

Additionally, ensure you are using the appropriate operator for executing dbt commands within a virtual environment. For example, you can use DbtBuildVirtualenvOperator:

from cosmos.operators.virtualenv import DbtBuildVirtualenvOperator

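dbt_build_task = DbtBuildVirtualenvOperator(
    task_id="dbt_build",
    py_requirements=["dbt-snowflake~=1.5.5", "dbt-core~=1.5.11"],
    # Operators take project_dir rather than dbt_project_path
    project_dir="/path/to/dbt/project",
    # A ProfileConfig instance, such as the one in the DAG example
    profile_config=profile_config,
)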

Here is a more comprehensive example of a DAG configuration that uses Cosmos to render a dbt project with a virtual environment:

import os
from datetime import datetime
from pathlib import Path

from cosmos import DbtDag, ExecutionConfig, ExecutionMode, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

DEFAULT_DBT_ROOT_PATH = Path(__file__).parent / "dbt"
DBT_ROOT_PATH = Path(os.getenv("DBT_ROOT_PATH", DEFAULT_DBT_ROOT_PATH))
PROJECT_NAME = "jaffle_shop"

profile_config = ProfileConfig(
    profile_name="default",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="airflow_db",
        profile_args={"schema": "public"},
    ),
)

example_virtualenv = DbtDag(
    project_config=ProjectConfig(
        DBT_ROOT_PATH / PROJECT_NAME,
    ),
    profile_config=profile_config,
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.VIRTUALENV,
    ),
    operator_args={
        "py_system_site_packages": False,
        "py_requirements": ["dbt-snowflake~=1.5.5", "dbt-core~=1.5.11"],
        "install_deps": True,
        "emit_datasets": False,
    },
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    dag_id="example_virtualenv",
    default_args={"retries": 2},
)

This configuration is intended to make Cosmos create a virtual environment, install the specified dbt version, and execute the dbt commands within that environment [2][3][4][5].
