tensorflow/tfx

DataFlow Job in TFX pipeline fails after running for an hour

sumansaurav-talentica opened this issue · 6 comments

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local (Linux/MacOS/Windows),
    Interactive Notebook, Google Cloud, etc.): Vertex AI
  • TensorFlow version: 2.13.1
  • TFX version: 1.14.0
  • Python version: 3.10.12
  • Python dependencies (from pip freeze output):

Describe the current behavior

The pipeline fails at its first step, where it imports data from BigQuery using a Dataflow job.

Describe the expected behavior

The data should be imported successfully, as it was previously.

Standalone code to reproduce the issue

BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--region=' + GOOGLE_CLOUD_REGION,

    # Temporary overrides of defaults.
    '--disk_size_gb=200',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
]

Other info / logs
Logs attached
downloaded-logs-20240108-182510.csv

@sumansaurav-talentica,

This is a known issue (#6386). There are two current workarounds:

  • Open a shell in your container with docker run --rm -it --entrypoint=/bin/bash YOUR_CONTAINER_IMAGE and check whether the python3-venv package is installed, or
  • Add ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1 to the TFX Docker image before building the container, and use that container for Dataflow jobs.

Thank you!
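
Both checks in the workaround above can be sketched in Python and run inside the container. This is only a rough sketch under stated assumptions: the helper names are hypothetical, the test for `ensurepip` is a heuristic stand-in for "python3-venv is installed" on Debian-based images, and comparing the variable to "1" simply mirrors the value quoted in this thread rather than any documented parsing rule.

```python
import importlib.util
import os


def venv_support_available() -> bool:
    """Heuristic: True if both `venv` and `ensurepip` are importable.

    Assumption: on Debian/Ubuntu images, a missing python3-venv package
    typically shows up as a missing `ensurepip` module.
    """
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("venv", "ensurepip")
    )


def default_environment_requested(environ=os.environ) -> bool:
    """True if RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT is set to "1".

    Assumption: "1" is the value suggested in this thread; other values
    are treated as unset here.
    """
    return environ.get("RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT") == "1"


if __name__ == "__main__":
    print("venv support available:", venv_support_available())
    print("default environment requested:", default_environment_requested())
```

If the first check reports False and the second also reports False, the image probably needs one of the two workarounds applied.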

Can you please suggest the code and steps for adding "ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1" to the TFX Docker image before building the container?

This is the code where I create the runner:

BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--region=' + GOOGLE_CLOUD_REGION,
    '--disk_size_gb=200',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
]

PIPELINE_DEFINITION_FILE = 'test_pipeline.json'

runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),
    output_filename=PIPELINE_DEFINITION_FILE)
_ = runner.run(
    _create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        query=QUERY,
        module_file=os.path.join(MODULE_ROOT, _trainer_module_file),
        endpoint_name=ENDPOINT_NAME,
        project_id=GOOGLE_CLOUD_PROJECT,
        region=GOOGLE_CLOUD_REGION,
        beam_pipeline_args=BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS))

Thanks for the solution, @singhniraj08. It worked, and I am posting it here for reference.
Since I was creating the TFX pipeline on Colab and running it on Vertex AI, this is the code I ran:

!gcloud artifacts repositories create REPO-NAME --repository-format=docker --location=REGION --async

!gcloud auth configure-docker REGION-docker.pkg.dev

dockerfile_content = """
FROM tensorflow/tfx:1.14.0

ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
"""

with open("Dockerfile", "w") as dockerfile:
    dockerfile.write(dockerfile_content)

!gcloud builds submit --tag REGION-docker.pkg.dev/PROJECT-ID/REPO-NAME/dataflow/DOCKERNAME:TAG
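
The Dockerfile contents and the image tag used in the build step above can also be generated by small helpers, which keeps the REGION, PROJECT-ID, and REPO-NAME placeholders in one place. This is a minimal sketch, assuming the Artifact Registry naming pattern shown in the command above; the function names and parameters are hypothetical and not part of TFX or gcloud.

```python
def dataflow_dockerfile(base_image: str = "tensorflow/tfx:1.14.0") -> str:
    """Return Dockerfile text that sets the workaround variable on top of base_image."""
    return (
        f"FROM {base_image}\n"
        "\n"
        "ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1\n"
    )


def artifact_registry_image(
    region: str, project_id: str, repo: str, image: str, tag: str
) -> str:
    """Compose an image reference of the form REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG."""
    return f"{region}-docker.pkg.dev/{project_id}/{repo}/{image}:{tag}"


if __name__ == "__main__":
    # Write the Dockerfile into the current directory, as in the thread.
    with open("Dockerfile", "w") as dockerfile:
        dockerfile.write(dataflow_dockerfile())
    # Placeholder values; substitute your own project settings.
    print(artifact_registry_image("REGION", "PROJECT-ID", "REPO-NAME", "dataflow/DOCKERNAME", "TAG"))
```

The string returned by `artifact_registry_image` is what would be passed to `--tag` in the build command and later to `--sdk_container_image`.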

Finally, I passed the new custom Docker image to Dataflow through beam_pipeline_args:

BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--region=' + GOOGLE_CLOUD_REGION,
    '--disk_size_gb=200',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--sdk_container_image=us-central1-docker.pkg.dev/calm-snowfall-385011/chicago-taxi/dataflow/tfx114:1.0',
]
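
The same argument list can be built by a small function so that the custom image is added only when one is supplied. This is a sketch rather than anything TFX provides: `build_dataflow_args` and its parameters are hypothetical names, and the fixed disk size, machine type, and experiment flag are copied from the list above.

```python
from typing import List, Optional


def build_dataflow_args(
    project: str,
    region: str,
    bucket: str,
    sdk_container_image: Optional[str] = None,
) -> List[str]:
    """Return Beam pipeline args for Dataflow, optionally with a custom SDK image."""
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        # Built with string formatting so the gs:// URI is the same on any OS.
        f"--temp_location=gs://{bucket}/tmp",
        f"--region={region}",
        "--disk_size_gb=200",
        "--machine_type=e2-standard-8",
        "--experiments=use_runner_v2",
    ]
    if sdk_container_image:
        args.append(f"--sdk_container_image={sdk_container_image}")
    return args


if __name__ == "__main__":
    # Placeholder values; substitute your own project settings.
    for arg in build_dataflow_args(
        "PROJECT-ID", "REGION", "BUCKET",
        sdk_container_image="REGION-docker.pkg.dev/PROJECT-ID/REPO-NAME/dataflow/DOCKERNAME:TAG",
    ):
        print(arg)
```

The returned list can be passed as `beam_pipeline_args` in the same way as the hand-written list.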

@sumansaurav-talentica,

We have a similar issue open to track this problem. The long-term solution is to add the environment variable to the TFX base image so that this does not recur. That change is blocked by another issue, #6468. Once that issue is fixed, we will add the environment variable to the TFX base image. Please close this issue and follow the similar issue for updates.
Thank you!

Thanks for the support.
